On the retrieval of binary-coded mass spectra


ON THE RETRIEVAL

OF BINARY-CODED MASS SPECTRA

THESIS

submitted for the degree of Doctor in the Technical Sciences at the Delft University of Technology (Technische Hogeschool Delft), by authority of the Rector Magnificus Prof. Dr. Ir. F. J. Kievits, to be defended before a committee appointed by the Board of Deans on Wednesday 24 October 1979 at 13.30 hours

by

GEERT VAN MARLEN

chemical engineer, born in Rotterdam


DELFTSCHE UITGEVERS MAATSCHAPPIJ B.V. DELFT 1979


This thesis has been approved by the promotor and the co-promotor


To my parents

To Yvonne


Drawings: Mr. F. Bolman

Photography: Mr. C. Wamaar


CONTENTS

1 INTRODUCTION

2 INFORMATION THEORY APPLIED TO SELECTION OF PEAKS FOR RETRIEVAL OF MASS SPECTRA

3 CALCULATION OF THE INFORMATION CONTENT OF RETRIEVAL PROCEDURES, APPLIED TO MASS SPECTRAL DATA BASES

4 SEARCH STRATEGY AND DATA COMPRESSION FOR A RETRIEVAL SYSTEM WITH BINARY-CODED MASS SPECTRA

5 INFLUENCE OF ERRORS AND MATCHING CRITERIA UPON THE RETRIEVAL OF BINARY-CODED LOW RESOLUTION MASS SPECTRA

SUMMARY

SAMENVATTING

CURRICULUM VITAE


1 INTRODUCTION

Scope

During the last decade the computer has become an important tool for the acquisition and the processing of data in the analytical laboratory, especially for analytical techniques such as infrared and mass spectrometry.

As a result large volumes of computer or manually processed reference data have become available. Commercially available libraries have grown to a current size of, for instance, 100,000 spectra for infrared spectrometry and 40,000 for mass spectrometry. These libraries even outgrow the likewise increasing storage capacity and processing speed of today's laboratory computer systems. The use of data compression techniques has therefore become inevitable.

The high speed at which spectral data are generated and the availability of large volumes of reference data have led to the development of a large number of automated identification procedures. Use has been made of various techniques such as factor analysis (1), pattern recognition (2), learning machines (3), artificial intelligence (4), and, most important of all, file searching or retrieval techniques (5).

In order to select the best possible identification procedure for a distinct application, all procedures need to be evaluated, or rather should have been evaluated prior to their implementation (6). For this evaluation a number of criteria have been suggested and used, for instance recall/reliability figures (6,7), matching histograms (8), percentage of correctly identified compounds or recognition performance (9-11) and the information content (12,13).

The concepts of information theory have been introduced in analytical chemistry by Kaiser (13). Several authors have applied information theory concepts to the evaluation, comparison and optimization of analytical techniques such as thin-layer chromatography (14), gas chromatography (15,16), infrared spectrometry (17,18) and mass spectrometry (12,19-22).

In this chapter some considerations for the design of a retrieval system and the basic information theory concepts applied to mass spectral retrieval procedures are discussed. Also a brief introduction to mass spectrometry is presented.

Mass spectrometry

Low resolution mass spectrometry is a particularly suitable technique for the identification or classification of (almost) pure organic compounds. The identification of the components in mixtures can in many instances be achieved by combining mass spectrometry with a separation technique such as gas chromatography.

Briefly, a mass spectrum is created as follows (23):

A vaporized sample is fragmented into a set of (positively) charged ions by high energy electrons. The ions are accelerated in an electric field. Ions with different m/e value (mass-charge ratio) can be separated by various techniques, for instance by magnetic deflection. After deflection, the ions with a specific m/e value are captured and measured by an ion collector. Varying the magnetic field changes the degree of deflection and thus the focussing of other ions on the collector. Scanning the ion intensity as a function of the magnetic field generates a raw mass spectrum. After establishing the relationship between the m/e value and the magnetic field, and the calculation and the correction of the relative ion intensities, a list or plot of m/e values and corresponding intensities is obtained. This list is called a mass spectrum.

When entered in an interpretation or file searching procedure, a number of possibilities for the identity of the recorded compound, or at least some structural characteristics of that compound, can be obtained.


Retrieval systems

General

Retrieval of mass and infrared spectral data is widely used for the identification of organic compounds. The most important retrieval systems are the MSSS system described by Heller (24-26), STIRS (27) and PBM (7,28) by the group of McLafferty, the BIG-6 system by Knock et al. (29), the Biemann-MIT system (30), and a few systems using binary-coded spectra (11,12,19,31). Some of these systems, for instance MSSS, PBM and STIRS, are even accessible through telecommunication networks.

For the selection and/or the design of a retrieval system, the performance of the system is an important parameter. Other aspects to be considered are the type of application (routine or research), the reference file, the quality of the recorded spectra, the available hardware, the desired turn-around time and the permitted cost of the retrieval. Some of these aspects might be equally or even more important than the system's performance. If retrieval of spectra of mixtures is involved, a 'forward' search system with a direct comparison of unknown and reference spectrum usually will not lead to an identification. Then the use of a 'reverse' search technique, in which the degree to which the reference compound is contained in the unknown spectrum is calculated, will be advisable (28,32). If the turn-around time of the system is not important, a large set of unknown spectra can be processed simultaneously in a batch job to save cpu requirements and consequently to reduce the cost of the retrieval.

In general the design of a retrieval system involves techniques for data compression, presearch and calculation of the degree of matching. A short review of these techniques is given for some of the retrieval systems.

Data reduction

In order to compress the storage of large files of reference spectra and to speed up the comparison of unknown and reference spectra, two types of data reduction techniques can be distinguished:

-the reduction of the number of peaks per spectrum.

-the reduction of the number of intensity levels.

The reduction of the number of peaks per spectrum can be achieved by the selection of a small number, between 5 and 10, of the most abundant m/e values in the spectrum, together with their intensities, or in sequence of decreasing intensity (29,33). This can also be achieved by the selection of the one or two most intense peaks in each window of 14 m/e values of the spectrum (30).
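The two peak-reduction schemes described above can be sketched as follows. This is a minimal illustration, not the implementation of the cited systems: the function names, the dictionary representation (m/e mapped to intensity), the example spectrum and the alignment of the 14-mass windows (starting at m/e 1) are all assumptions.

```python
def top_n_peaks(spectrum, n=8):
    """Keep the n most abundant peaks; spectrum maps m/e -> intensity."""
    ranked = sorted(spectrum.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:n])


def top_peaks_per_window(spectrum, window=14, per_window=2):
    """Keep the per_window most intense peaks in each block of `window` m/e values.

    Windows are assumed to cover m/e 1-14, 15-28, 29-42, and so on.
    """
    windows = {}
    for mz, intensity in spectrum.items():
        windows.setdefault((mz - 1) // window, []).append((mz, intensity))
    kept = {}
    for peaks in windows.values():
        peaks.sort(key=lambda item: item[1], reverse=True)
        kept.update(peaks[:per_window])
    return kept


# Hypothetical spectrum, for illustration only.
spectrum = {15: 12.0, 27: 30.0, 28: 10.0, 29: 45.0, 41: 60.0,
            43: 100.0, 57: 80.0, 71: 20.0, 85: 5.0}

print(top_n_peaks(spectrum, n=3))
print(sorted(top_peaks_per_window(spectrum)))
```

In this example the window 15-28 holds three peaks, so the weakest of them (m/e 28) is dropped by the window scheme.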

The number of intensity levels can be reduced for instance by binary-coding. Binary-coded masses are obtained by intensity thresholding, i.e. the decision about the presence of a peak above a given intensity threshold, coded as '1', or the absence, coded as '0' (11,12,19,31). A logic combination of peaks using a clustering technique can even lead to a further compression of the binary-coded spectrum (31). A similar result is obtained by the application of a feature selection procedure using information theory concepts (21,22).

Presearch

Even with the data reduction methods described, a comparison of the unknown spectrum with all spectra in a large reference file would be inefficient and time-consuming, since only a limited number of possible identities is required as a result. When before or during the comparison it appears that a certain reference spectrum is not likely to belong to the desired set of solutions, the matching for this reference spectrum can be omitted or terminated in order to save time (37). A number of characteristics extracted from the unknown and reference spectrum can be used to perform a presearch, for instance:

-the molecular weights have to be identical (34).

-the number of peaks in the (coded) spectra should not differ too much (37,30).

-the base peaks must be identical, or the base peak of one spectrum should be a peak with a minimum (scaled) intensity in the other spectrum (30).

The implementation of a presearch method is highly dependent on the data reduction technique used, the data base organization (sequential, indexed sequential or random), the available hardware, etc. The use of a presearch criterion in order to decrease the number of spectra to be compared and thus to increase the search speed, is effective and justified only for retrieval with large reference files.

Matching

In the ideal situation, i.e. when the spectra are (almost) uniquely and highly reproducibly coded and when all possible compounds are contained in the data base, special techniques can be implemented to obtain a high speed retrieval system, such as dictionaries (35) and inverted files (36). In practice however, this ideal situation of 'perfect' matching is not attained. Small variations in experimental conditions cause differences in the coded spectra and obviously not all possible reference compounds can be compiled. A desirable characteristic of a retrieval procedure is to obtain at least some structural characteristics of the unknown compound. In general a retrieval system will provide a (small) number of reference spectra resembling the unknown spectrum. This resemblance is represented by a parameter called the degree of matching or the similarity. The degree of matching can for instance be defined as:

-the 'city-block' distance, i.e. the sum of absolute differences in the intensities (9) or intensity sequences (29) of the two coded spectra.

-the 'Euclidean' distance, i.e. the sum of the squares of differences in the intensities or intensity sequences.
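The two definitions above can be sketched in a few lines. The example below is only an illustration with hypothetical data and function names; it also shows that for vectors of 0 and 1 both measures coincide with the number of differing positions, which can be counted with an 'exclusive or' on packed bit patterns.

```python
def city_block(a, b):
    """Sum of absolute intensity differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def sum_of_squares(a, b):
    """Sum of squared intensity differences, as defined in the text."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pack(bits):
    """Pack a list of 0/1 values into one integer, bit k for position k."""
    value = 0
    for k, bit in enumerate(bits):
        if bit:
            value |= 1 << k
    return value


def bit_mismatches(a_packed, b_packed):
    """Number of positions in which two packed binary codes differ."""
    return bin(a_packed ^ b_packed).count("1")


# Hypothetical binary-coded spectra of eight masses each.
unknown = [1, 0, 1, 1, 0, 0, 1, 0]
reference = [1, 1, 1, 0, 0, 0, 1, 0]

print(city_block(unknown, reference))
print(sum_of_squares(unknown, reference))
print(bit_mismatches(pack(unknown), pack(reference)))
```

Packing the code into machine words is a design choice that lets a single 'exclusive or' compare many masses at once; the word size and bit order used here are arbitrary.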

For binary-coded spectra the city-block and the Euclidean distance are equal to the number of bit mismatches between the two spectra and can be calculated with an 'exclusive or' operator (11,12). To emphasize the peaks coded present in both spectra, a linear combination of an 'exclusive or' and a 'logical and' operator can be adopted (10). A discrimination between masses and types of matching (a peak present or absent in both spectra, or present in either one spectrum)


Information theory

Information theory provides the mathematical and statistical means to measure the amount of information transmitted in a system, and consequently in analytical systems such as retrieval procedures. In the qualitative spectral analysis of chemical compounds the information content is used to quantify the degree to which the uncertainty about the identity of the compound is changed by the analysis. The information content I of a retrieval procedure using a certain spectral code is related to the number of spectra $N_0$ which can be distinguished, by the equation (39):

$$ I = \log_2 N_0 \tag{1} $$

With a set of, for instance, 1 million possible compounds, a spectral code is required which contains at least 20 bits of information to enable compound identification.
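The figure of 20 bits follows directly from equation 1, since a code must carry at least $\log_2 N_0$ bits and only whole bits can be stored. A small check (the function name is illustrative):

```python
import math


def bits_required(n_compounds):
    """Smallest whole number of bits that can distinguish n_compounds codes."""
    return math.ceil(math.log2(n_compounds))


print(math.log2(1_000_000))      # slightly below 20
print(bits_required(1_000_000))
```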

When different compounds result in an identical code, the identity of the compounds analysed is not fully established, and the information content of the retrieval procedure is given by the following equation:

$$ I = \log_2 \frac{N_0}{N_1} \tag{2} $$

with $N_0$ and $N_1$ being the number of possible identities before and after the analysis. Averaging I for all compounds analysed would give the information content of the identification procedure. In practice however, equation 2 can hardly be used to quantify the information content of the procedure: first of all the number of possible compounds $N_0$ is generally not known, and secondly this approach only gives the information content after all (or at least a large number of) compounds have been analysed.

The same equations can also be used when the code is divided into a set of coded spectral features. Measuring for instance a binary-coded feature (two states 0 and 1) the average information content becomes:

$$ I = \left( N_{10} \log_2 \frac{N_0}{N_{10}} + N_{11} \log_2 \frac{N_0}{N_{11}} \right) / N_0 \tag{3} $$

After the introduction of the probabilities $p_0$ for state 0 (the number of spectra $N_{10}$ coded as 0 divided by $N_0$) and $p_1$ for state 1, equation 3 reduces to:

$$ I = - p_1 \log_2 p_1 - p_0 \log_2 p_0 \tag{4} $$

These probabilities and consequently the information content can be estimated from the frequencies observed for a limited set of coded spectra. When the features are mutually independent, the information content of the identification procedure is assessed by adding the information contents of every feature in the code (9,12). With dependent features the information content of the identification procedure can be estimated by making a proper correction for the correlations between the features (15,17,18,21,22).
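As an illustration of equation 4 and of the summation over independent features, the sketch below estimates the state probabilities from observed frequencies. The four coded spectra and the function names are hypothetical, chosen only to keep the arithmetic visible.

```python
import math


def binary_information(p1):
    """Equation 4: information in bits of a feature that is '1' with probability p1."""
    if p1 <= 0.0 or p1 >= 1.0:
        return 0.0  # a feature that never varies carries no information
    p0 = 1.0 - p1
    return -p1 * math.log2(p1) - p0 * math.log2(p0)


def information_per_feature(coded_spectra):
    """Estimate p1 for every feature from its observed frequency."""
    n_spectra = len(coded_spectra)
    n_features = len(coded_spectra[0])
    result = []
    for j in range(n_features):
        p1 = sum(spectrum[j] for spectrum in coded_spectra) / n_spectra
        result.append(binary_information(p1))
    return result


# Hypothetical set of four binary-coded spectra with three features.
coded_spectra = [[1, 0, 1], [1, 1, 0], [0, 1, 0], [1, 0, 0]]

per_feature = information_per_feature(coded_spectra)
print([round(value, 3) for value in per_feature])
print(round(sum(per_feature), 3))  # valid as a total only for independent features
```

For this toy set the simple sum is larger than $\log_2 4 = 2$ bits, the most that four spectra can carry, which shows why a correction for dependent features is needed.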

Synopsis

Chapters 2 and 3 of this thesis describe the development of a model for the calculation of the information content of a retrieval procedure.

In chapter 2 the information theory and statistical principles used are introduced and the model is applied to a binary-coded mass spectral reference file (21).

Chapter 3 discusses more extensively the approximations and the feature selection procedure used in the calculations. The influence of coding errors and an extension of the model for the calculation of the information content of retrieval procedures with non-binary-coded features are also discussed. In addition the results are given for the calculations of the information content of several retrieval procedures with binary-coded and non-binary-coded mass spectra (22).

Chapter 4 describes a retrieval system with binary-coded mass spectra, emphasizing design features such as data compression, file organization, search strategy and speed considerations (37).

Chapter 5 gives an evaluation of the results obtained with the retrieval system. The influence of errors in the spectra, and of the matching algorithm implemented, upon the performance of the system is studied (11).


References

1 R.W.Rozett and E.McLaughlin Petersen, Anal.Chem.,48(1976)817.
2 N.M.Frew, L.E.Wangen and T.L.Isenhour, Pattern Recognition,3(1971)281.
3 P.Kent and J.Gäumann, Helv.Chim.Acta,58(1975)787.
4 J.Lederberg, Biochemical Applications of Mass Spectrometry, G.R.Waller,Ed.,Wiley,New York,1972,Ch.7.
5 R.G.Ridley, Biochemical Applications of Mass Spectrometry, G.R.Waller,Ed.,Wiley,New York,1972,Ch.6.
6 F.W.McLafferty, Anal.Chem.,49(1977)1441.
7 G.M.Pesyna, R.Venkataraghavan, H.E.Dayringer and F.W.McLafferty, Anal.Chem.,48(1976)1362.
8 S.L.Grotch, Anal.Chem.,46(1974)526.
9 K.Varmuza, Fresenius Z.Anal.Chem.,282(1976)129.
10 S.L.Grotch, Anal.Chem.,43(1971)1362.
11 G.van Marlen, A.Dijkstra and H.A.van 't Klooster, Anal.Chem.,51(1979)420; this Thesis Ch.5.
12 S.L.Grotch, Anal.Chem.,42(1970)1214.
13 H.Kaiser, Anal.Chem.,42(1970)24A.
14 D.L.Massart, J.Chromatogr.,79(1973)157.
15 P.F.Dupuis and A.Dijkstra, Anal.Chem.,47(1975)379.
16 A.Eskes, P.F.Dupuis, A.Dijkstra, H.de Clerc and D.L.Massart, Anal.Chem.,47(1975)2168.
17 P.F.Dupuis and A.Dijkstra, Fresenius Z.Anal.Chem.,290(1978)357.
18 P.F.Dupuis, A.Dijkstra and J.H.van der Maas, Fresenius Z.Anal.Chem.,291(1978)27.
19 F.Erni, Beitrag zur Computerunterstützten Strukturaufklärung, Thesis Nr.4296, Eidgenössischen Technischen Hochschule, Zürich,1972.
20 F.Erni and J.T.Clerc, Helv.Chim.Acta,55(1972)489.
21 G.van Marlen and A.Dijkstra, Anal.Chem.,48(1976)595; this Thesis Ch.2.
22 G.van Marlen, A.Dijkstra and H.A.van 't Klooster, Anal.Chim.Acta,112(1979) in print; this Thesis Ch.3.
23 F.W.McLafferty, Interpretation of Mass Spectra, 2nd Ed., W.A.Benjamin Inc., Reading, Massachusetts,1973,Ch.1.
24 S.R.Heller, Anal.Chem.,44(1972)1951.
25 S.R.Heller, H.M.Fales and G.W.A.Milne, Org.Mass Spectr.,7(1973)107.
26 R.S.Heller, G.W.A.Milne, R.J.Feldmann and S.R.Heller, J.Chem.Inform.Comput.Sci.,16(1976)176.
27 K.S.Kwok, R.Venkataraghavan and F.W.McLafferty, J.Am.Chem.Soc.,95(1973)4185.
28 F.W.McLafferty, R.H.Hertel and R.D.Villwock, Org.Mass Spectr.,9(1974)690.
29 B.A.Knock, I.C.Smith, D.E.Wright, R.G.Ridley and W.Kelly, Anal.Chem.,42(1970)1516.
30 H.S.Hertz, R.A.Hites and K.Biemann, Anal.Chem.,43(1971)681.
31 L.E.Wangen, W.S.Woodward and T.L.Isenhour, Anal.Chem.,43(1971)1605.
32 F.P.Abramson, Anal.Chem.,47(1975)45.
33 L.R.Crawford and J.D.Morrison, Anal.Chem.,40(1968)1464.
34 S.L.Grotch, Anal.Chem.,45(1973)2.
35 R.G.Dromey, Anal.Chem.,51(1979)229.
36 R.G.Dromey, Anal.Chem.,49(1977)1982.
37 G.van Marlen and J.H.Van den Hende, Anal.Chim.Acta,112(1979)143; this Thesis Ch.4.
38 S.L.Grotch, Anal.Chem.,47(1975)1285.
39 C.E.Shannon and W.Weaver, The Mathematical Theory of Communication, The University of Illinois Press, Urbana,


2 INFORMATION THEORY APPLIED TO SELECTION OF PEAKS FOR RETRIEVAL OF MASS SPECTRA*

By using Shannon's formula, amounts of information have been calculated for identification of binary coded low resolution mass spectra by retrieval. When a threshold of 1% of the base peak is used for the decision about the presence or absence of a peak, these binary coded mass spectra yield an amount of information of approximately 40 bits. It is found that, for a library of ca. 10 000 mass spectra, a set of 120 preselected mass values in the range 1-300 contains the total information; i.e., the nonselected masses do not supply any additional information.

The principles of information theory can be used to assess the amount of information obtained from the measurement of physical quantities. Using the amount of information and the correlation between these physical quantities, a set of characteristics can be selected which yields a maximum amount of information. A few years ago, Grotch (1) introduced the concept of information as defined by Shannon (2) in mass spectrometry. Grotch indicated that a mass spectrum yields an enormous amount of information, the exact amount depending on the number of peaks and the intensity levels that can be distinguished measuring these peaks. It was shown that, for binary coded spectra (peaks either absent or present), the number of bits obtained amounts roughly to 150 depending on the threshold level taken for the decision about the presence or absence of a peak.

* A reprint of G. van Marlen and A. Dijkstra, Anal. Chem., 48(1976)595.


Erni (3) also calculated the information for a set of binary coded mass spectra. In a qualitative way, the correlations between the various masses were taken into account in a procedure for selecting the most suitable masses for retrieval purposes.

In this paper, the results of some calculations of the information obtained from a retrieval procedure with mass spectra are presented. Using the information as a criterion and taking into account the correlations between the peak occurrences, an optimal set of masses for retrieval has been selected. This study runs parallel to the calculation of information and the selection of gas chromatographic columns given in a recent paper by Dupuis and Dijkstra (4).

The procedures used for calculating the amount of information and for selecting an optimal set of gas chromatographic columns can, in principle, be used for calculating the information obtained from retrieval of mass spectra. However, calculations are less straightforward due to the binary coding of the spectra.

The efficiencies of mass spectra coded in several ways might be measured in terms of information obtained. As such, the amount of information might serve as an alternative to the matching histograms developed by Grotch (5). For a more detailed review of literature about retrieval of mass spectra, the reader is referred to (5).

AMOUNT OF INFORMATION

The amount of information from measuring a physical quantity (in this paper, the measurement of the intensity of a peak in a mass spectrum) is equal to the uncertainty with respect to the magnitude of this physical quantity before the experiment minus the uncertainty with respect to this magnitude remaining after the measurement is performed. Neglecting the uncertainty remaining after the experiment, so in the absence of experimental errors, the amount of information I according to Shannon (2) equals

$$ I = -\sum_{i=1}^{n} p_i \,\mathrm{ld}\, p_i \tag{1} $$

where $p_i$ is the probability of measuring the intensity level i, and n is the number of intensity levels that can be distinguished.

If only two intensity levels are distinguished, Equation 1 reduces to

$$ I_H(1) = -p \,\mathrm{ld}\, p - (1 - p)\,\mathrm{ld}\,(1 - p) \tag{2} $$

where p is the probability of the presence of a certain peak in a spectrum and ld stands for $\log_2$. The notation $I_H(1)$ is used in order to indicate that the information calculated by using Equation 2 equals the information obtained from the measurement of one peak. The probability p in this case can be derived from a two-step histogram representing the number of times that a certain peak is present or absent in a set of mass spectra. This two-step histogram has only two intensity levels, viz. '0' and '1', and the class width $\Delta x$ in this histogram equals 1.

Replacing the discontinuous distribution represented in the histogram by a continuous Gaussian (normal) distribution, Equation 2 can be converted via integration to:

$$ I_G(1) = \tfrac{1}{2}\,\mathrm{ld}\,(2\pi e\,\sigma^2) \tag{3} $$

where $\sigma$ is the standard deviation of the normal distribution. The index G indicates that the information is calculated with the assumption of the Gaussian distribution.

Replacing the discontinuous distribution by other than normal continuous distributions affects Equation 3 only with respect to the value of the constant $2\pi e$.
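To get a feeling for how closely the Gaussian expression follows Equation 2, both can be evaluated side by side. The sketch below assumes that Equation 3 has the standard form $\tfrac{1}{2}\,\mathrm{ld}\,(2\pi e\,\sigma^2)$ with class width 1, and that the variance of a binary coded peak is $p(1-p)$; both are assumptions of this illustration, as are the function names.

```python
import math


def info_exact(p):
    """Equation 2: information of one binary coded peak, in bits (0 < p < 1)."""
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def info_gauss(p):
    """Assumed form of Equation 3 with variance p(1 - p) and class width 1."""
    variance = p * (1.0 - p)
    return 0.5 * math.log2(2.0 * math.pi * math.e * variance)


for p in (0.15, 0.30, 0.50, 0.70, 0.85):
    difference = info_exact(p) - info_gauss(p)
    print(f"p={p:.2f}  exact={info_exact(p):.3f}  "
          f"gauss={info_gauss(p):.3f}  diff={difference:+.3f}")
```

Under these assumptions the two values stay within a few hundredths of a bit of each other for p between 0.15 and 0.85, and drift apart quickly outside that range.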

If the information obtained from n binary coded masses is to be calculated, Equation 2 must be replaced by

$$ I_H(1,2,\ldots,n) = -\sum_{1}\sum_{2}\cdots\sum_{n} p_{1,2,\ldots,n}\,\mathrm{ld}\, p_{1,2,\ldots,n} \tag{4} $$

where $I_H(1,2,\ldots,n)$ is the amount of information obtained from the 1st, 2nd, ... nth peaks. The sums are to be taken over the two possible values of the indexes 1, 2, ... n, i.e. the probabilities of the peaks 1, 2, ... n being absent or present.


A total number of $2^n$ values of $p_{1,2,\ldots,n}$ expressing the probabilities of the various possible combinations of masses must be estimated in order to be able to calculate $I_H$. In practice, this is impossible if only a limited set of spectra is available for the calculations and if a large number of masses is considered.
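The difficulty can be made concrete with a direct estimate of Equation 4 from pattern frequencies. The sketch below is an illustration with hypothetical data and names: it counts the observed combinations of the selected masses and uses their relative frequencies as probabilities.

```python
import math
from collections import Counter


def joint_information(coded_spectra, masses):
    """Equation 4 with probabilities estimated from observed pattern frequencies."""
    patterns = Counter(tuple(spectrum[m] for m in masses) for spectrum in coded_spectra)
    total = len(coded_spectra)
    return -sum((count / total) * math.log2(count / total)
                for count in patterns.values())


# Hypothetical set of four binary coded spectra with three masses.
coded_spectra = [[1, 0, 1], [1, 1, 0], [0, 1, 0], [1, 0, 0]]

print(joint_information(coded_spectra, (0,)))
print(joint_information(coded_spectra, (0, 1, 2)))
```

An estimate of this kind can never exceed ld m for m spectra, so with a library of about 10 000 spectra it saturates near 13 bits however many masses are included; this is one way to see why an approximation is needed for large n.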

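In general, an approximate value of the information for n masses can be obtained by using the n dimensional equivalent of Equation 3:

$$ I_G(1,2,\ldots,n) = \tfrac{1}{2}\,\mathrm{ld}\,\left[(2\pi e)^n\,|\mathrm{COV}|\right] \tag{5} $$

where $I_G(1,2,\ldots,n)$ is the amount of information for the 1st, 2nd, ... nth peak and $|\mathrm{COV}|$ is the determinant of the covariance matrix (COV) defined as:

$$ (\mathrm{COV}) = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1n} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2n} \\ \vdots & \vdots & & \vdots \\ \sigma_{n1} & \sigma_{n2} & \cdots & \sigma_{nn} \end{pmatrix} \tag{6} $$

$\sigma_{ii} = \sigma_i^2$ represents the variance of the distribution of the intensities for peak i and $\sigma_{ij}$ the covariance of the intensities for peaks i and j. Estimations $s_{ii}$ and $s_{ij}$ of the variances $\sigma_{ii}$ and covariances $\sigma_{ij}$ can be obtained by making use of the equations

$$ s_{ii} = s_i^2 = \frac{\sum_{k=1}^{m} (y_{ki} - \bar{y}_i)^2}{m - 1} $$

$$ s_{ij} = \frac{\sum_{k=1}^{m} (y_{ki} - \bar{y}_i)(y_{kj} - \bar{y}_j)}{m - 1} $$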


where m is the number of spectra, $y_{ki}$ and $y_{kj}$ are the intensities (either 0 or 1) of the ith and jth peak in the kth spectrum, and $\bar{y}_i$ and $\bar{y}_j$ are the average intensities of the ith and jth peak. Introducing these values in Equation 5 is only allowed when the two-step histogram closely resembles the normal distribution, an assumption that hardly can be made. However, it can be calculated that the errors introduced by making this assumption do not exceed a value of 0.05 bit per peak provided that the values of p are between 0.15 and 0.85. In order to reduce these errors Equation 3 has been replaced by

$$ I_G'(1) = \tfrac{1}{2}\,\mathrm{ld}\,(2\pi e\,f^2\sigma^2) \tag{7} $$

The correction factor f can be found from the value of $I_H(1)$, which can be calculated without error by using Equation 2, and the value of $I_G(1)$ as calculated from the estimated variances using Equation 3, by equating $I_H(1)$ and $I_G'(1)$ ($|\mathrm{ld}\, f|$ is smaller than 0.05, provided 0.15 < p < 0.85). Then Equation 5 can be used for calculating the information obtained from n peaks provided that $\sigma_{ii}$ is replaced by $f_i^2\sigma_{ii}$ and $\sigma_{ij}$ by $f_i f_j \sigma_{ij}$. Thus the covariance determinant of Equation 6 is multiplied by $f_1^2 f_2^2 \cdots f_n^2$ and the correction in terms of information equals $\mathrm{ld}\, f_1 f_2 \cdots f_n$. This procedure applied to an artificial set of 128 peaks known to yield 7 bits of information leads to a result with an error of less than 0.4 bit. Equation 5 is only valid when experimental errors are absent.

The selection of an optimal set of masses for retrieval purposes requires the estimation of the amount of information obtained with various combinations of peaks. The mass to be selected as the first is the one that yields the highest amount of information. The next masses are added using the criterion:

[selection criterion; expression not legible in the source] = maximum (8)


Table I. The Amount of Information for m/e 1-300 (bits)

m/e  Inf.    m/e  Inf.    m/e  Inf.    m/e  Inf.
  1  0.10     76  0.83    151  0.53    226  0.26
  2  0.07     77  1.00    152  0.54    227  0.26
  3  0.00     78  0.93    153  0.53    228  0.25
  4  0.00     79  0.97    154  0.46    229  0.26
  5  0.00     80  0.74    155  0.51    230  0.21
  6  0.00     81  0.91    156  0.36    231  0.24
  7  0.00     82  0.88    157  0.44    232  0.19
  8  0.01     83  0.92    158  0.37    233  0.20
  9  0.01     84  0.85    159  0.43    234  0.18
 10  0.03     85  0.90    160  0.37    235  0.18
 11  0.04     86  0.73    161  0.45    236  0.18
 12  0.24     87  0.76    162  0.38    237  0.22
 13  0.27     88  0.56    163  0.48    238  0.22
 14  0.60     89  0.81    164  0.38    239  0.30
 15  0.89     90  0.58    165  0.51    240  0.22
 16  0.31     91  0.96    166  0.41    241  0.25
 17  0.44     92  0.81    167  0.45    242  0.21
 18  0.69     93  0.83    168  0.41    243  0.23
 19  0.30     94  0.71    169  0.47    244  0.19
 20  0.07     95  0.83    170  0.32    245  0.20
 21  0.02     96  0.74    171  0.39    246  0.19
 22  0.03     97  0.85    172  0.33    247  0.19
 23  0.03     98  0.77    173  0.37    248  0.15
 24  0.15     99  0.75    174  0.33    249  0.15
 25  0.37    100  0.55    175  0.39    250  0.16
 26  0.95    101  0.69    176  0.37    251  0.18
 27  0.97    102  0.69    177  0.38    252  0.20
 28  0.98    103  0.80    178  0.38    253  0.25
 29  1.00    104  0.72    179  0.38    254  0.19
 30  0.71    105  0.86    180  0.31    255  0.23
 31  0.86    106  0.70    181  0.38    256  0.20
 32  0.64    107  0.74    182  0.32    257  0.20
 33  0.30    108  0.62    183  0.37    258  0.18
 34  0.17    109  0.74    184  0.28    259  0.19
 35  0.36    110  0.67    185  0.33    260  0.16
 36  0.46    111  0.74    186  0.29    261  0.14
 37  0.77    112  0.68    187  0.36    262  0.15
 38  0.95    113  0.67    188  0.29    263  0.14
 39  0.83    114  0.47    189  0.43    264  0.15
 40  1.00    115  0.82    190  0.32    265  0.17
 41  0.90    116  0.67    191  0.36    266  0.15
 42  0.99    117  0.75    192  0.28    267  0.19
 43  0.95    118  0.63    193  0.32    268  0.17
 44  0.99    119  0.76    194  0.27    269  0.21
 45  0.94    120  0.60    195  0.31    270  0.18
 46  0.54    121  0.68    196  0.27    271  0.16
 47  0.55    122  0.56    197  0.33    272  0.15
 48  0.34    123  0.63    198  0.25    273  0.16
 49  0.62    124  0.54    199  0.30    274  0.14
 50  0.98    125  0.63    200  0.30    275  0.13
 51  1.00    126  0.62    201  0.31    276  0.11
 52  0.92    127  0.73    202  0.34    277  0.10
 53  1.00    128  0.67    203  0.34    278  0.11
 54  0.91    129  0.67    204  0.29    279  0.14
 55  0.99    130  0.54    205  0.32    280  0.14
 56  0.99    131  0.65    206  0.26    281  0.18
 57  1.00    132  0.53    207  0.29    282  0.14
 58  0.94    133  0.63    208  0.26    283  0.17
 59  0.81    134  0.53    209  0.26    284  0.14
 60  0.66    135  0.60    210  0.22    285  0.14
 61  0.75    136  0.49    211  0.31    286  0.13
 62  0.81    137  0.55    212  0.22    287  0.12
 63  0.97    138  0.47    213  0.30    288  0.13
 64  0.79    139  0.64    214  0.22    289  0.12
 65  0.98    140  0.51    215  0.31    290  0.11
 66  0.83    141  0.63    216  0.23    291  0.09
 67  0.95    142  0.47    217  0.26    292  0.09
 68  0.87    143  0.54    218  0.22    293  0.09
 69  1.00    144  0.46    219  0.23    294  0.11
 70  0.94    145  0.55    220  0.20    295  0.14
 71  0.95    146  0.42    221  0.22    296  0.13
 72  0.76    147  0.54    222  0.22    297  0.14
 73  0.82    148  0.44    223  0.21    298  0.13
 74  0.86    149  0.50    224  0.24    299  0.14
 75  0.89    150  0.46    225  0.28    300  0.13


In this equation, k refers to the kth mass to be selected and $|\rho_{ki}|$ is the absolute value of the correlation coefficient defined as:

$$ \rho_{ki} = \frac{\sigma_{ki}}{\sqrt{\sigma_{kk}\,\sigma_{ii}}} \tag{9} $$

The use of another selection criterion can affect the sequences of the masses. However, the sequence of peaks as found by application of Equation 8 is optimized by making use of a procedure described in Ref. (4). This procedure essentially yields a set of peaks for which the value of the covariance determinant is maximal. The introduction of a selection criterion leads to a near optimal set and avoids the calculation of the covariance determinant for all possible combinations of peaks.
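The covariance-determinant calculation with per-peak correction factors can be sketched in plain Python. The sketch rests on assumed forms that are not spelled out here: Equation 5 is taken as $\tfrac{1}{2}\,\mathrm{ld}\,[(2\pi e)^n |\mathrm{COV}|]$, and the correction for peak i is taken as $\mathrm{ld}\, f_i = I_H(i) - I_G(i)$. All function names and the toy data are hypothetical, and the selection criterion of Equation 8 is not implemented.

```python
import math

TWO_PI_E = 2.0 * math.pi * math.e


def binary_entropy(p):
    """Exact information of one binary coded peak, in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def covariance_matrix(spectra):
    """Sample means and covariances (divisor m - 1) of the coded peaks."""
    m, n = len(spectra), len(spectra[0])
    means = [sum(s[i] for s in spectra) / m for i in range(n)]
    cov = [[sum((s[i] - means[i]) * (s[j] - means[j]) for s in spectra) / (m - 1)
            for j in range(n)] for i in range(n)]
    return means, cov


def log2_determinant(matrix):
    """ld of |det| by Gaussian elimination; -inf when the matrix is singular."""
    a = [row[:] for row in matrix]
    n = len(a)
    total = 0.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return float("-inf")
        a[col], a[pivot] = a[pivot], a[col]
        total += math.log2(abs(a[col][col]))
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return total


def corrected_information(spectra):
    """Assumed Equation 5 plus the sum of ld f_i over all peaks."""
    means, cov = covariance_matrix(spectra)
    n = len(cov)
    gaussian = 0.5 * (n * math.log2(TWO_PI_E) + log2_determinant(cov))
    correction = sum(binary_entropy(p) - 0.5 * math.log2(TWO_PI_E * cov[i][i])
                     for i, p in enumerate(means))
    return gaussian + correction


# Artificial set: all 8 patterns of 3 uncorrelated peaks, each with p = 0.5.
spectra = [[(k >> b) & 1 for b in range(3)] for k in range(8)]
print(round(corrected_information(spectra), 6))
```

For uncorrelated peaks the covariance matrix is diagonal and the corrected value reduces to the sum of the single-peak informations, which gives a quick sanity check on any implementation. A set that contains two identical peaks yields a singular matrix, reported here as minus infinity.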

RESULTS

For the calculations of the amount of information and determination of the optimal sequence of masses a set of approximately 10 000 spectra of the Mass Spectra Data Centre in Aldermaston was used. The spectra were binary coded using a threshold level of 1% of the intensity of the base peak.
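The binary coding step can be sketched as below. This is an illustration only: the function name, the dictionary representation and the example spectrum are hypothetical, and a peak is taken as present when its intensity is strictly above the threshold, which is an assumption about the boundary case.

```python
def binary_code(spectrum, n_masses=300, threshold=0.01):
    """Code m/e 1..n_masses as 1 when the peak exceeds threshold * base peak.

    spectrum maps m/e -> intensity; masses that are not listed count as absent.
    """
    cutoff = threshold * max(spectrum.values())
    return [1 if spectrum.get(mz, 0.0) > cutoff else 0
            for mz in range(1, n_masses + 1)]


# Hypothetical spectrum with base peak at m/e 43.
spectrum = {43: 100.0, 57: 80.0, 71: 0.5, 85: 1.5}
code = binary_code(spectrum)
print(sum(code), "peaks coded present out of", len(code))
```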

After calculation of the information of the separate masses, it was noticed that masses higher than 300 hardly yield any information. The amounts of information for the first 300 masses as calculated with Equation 2 are given in Table I. Subsequently the variances, covariances, correlation coefficients and correction factors were calculated.

The optimal sequence found is presented in Table II. Only 120 masses are given. With this set, a total amount of information of approximately 40 bits is obtained, without taking into account experimental errors. Addition of more masses to this set does not add any information. The influence of the correlations between the peak occurrences on the amount of information is shown in Figure 1. Figure 1 also shows the amount of information for the same sequence but neglecting the correlation.


Figure 1. Information vs. number of peaks, without correlation and with correlation. Horizontal axis: number of peaks (10 to 120).

Table III. Number of Different Coded Mass Spectra vs. the Number of Selected Masses (a)

Masses           15    30    45    60    75    90   105   120
Diff. spectra  2940  7233  8283  8668  8851  8957  9055  9123

(a) Total number of spectra: 9628. Approximately 1700 pairs of identical compounds.
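A count of the kind reported in Table III can be obtained by reducing every coded spectrum to the selected masses and counting the distinct patterns. The sketch below uses hypothetical data and names; the order in which masses are selected is taken as given.

```python
def count_distinct(coded_spectra, selected):
    """Number of different codes when only the selected masses are kept."""
    return len({tuple(spectrum[i] for i in selected) for spectrum in coded_spectra})


# Hypothetical set of four binary coded spectra with three masses.
coded_spectra = [[1, 0, 1], [1, 1, 0], [0, 1, 0], [1, 0, 0]]

for n_selected in (1, 2, 3):
    print(n_selected, count_distinct(coded_spectra, range(n_selected)))
```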

Table II. Optimal sequence of the 120 selected masses (table contents not legible in the source).

An amount of information of 120 bits would be obtained if every mass were to yield the maximum amount of 1 bit. In fact, this is the amount of bits required for the storage of one spectrum consisting of 120 binary coded masses. Apparently the binary coded spectra are highly redundant. Each spectrum of the binary coded set should be unique. The amount of 40 bits of information indicates that $2^{40} \approx 10^{12}$ chemical compounds can be distinguished by the set of masses containing these 40 bits of information. This statement, however, applies to the average: one can expect that not all spectra of the library are unique. Table III gives a survey of the number of spectra that are unique as a function of the number of selected masses.
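The kind of survey reported in Table III can be sketched as follows. This is only an illustration under assumptions: the toy library, the selection order `ranking` and the function name are hypothetical and are not the data or program used in the study.

```python
def count_distinct_codes(spectra, selected_masses):
    """Count the distinct binary codes that remain when only the
    selected masses are kept from each binary-coded spectrum."""
    codes = {tuple(spectrum[m] for m in selected_masses) for spectrum in spectra}
    return len(codes)


# Hypothetical toy library: each spectrum maps a mass to 0 (absent) or 1 (present).
toy_library = [
    {41: 1, 43: 1, 57: 0},
    {41: 1, 43: 1, 57: 1},
    {41: 1, 43: 0, 57: 1},
    {41: 1, 43: 1, 57: 1},  # same code as the second spectrum
]
ranking = [41, 43, 57]  # hypothetical selection order of the masses

# Number of distinct codes when the first 1, 2 and 3 masses are used.
distinct_per_prefix = [
    count_distinct_codes(toy_library, ranking[:k]) for k in range(1, len(ranking) + 1)
]
```

As in Table III, the count of distinct codes can only grow (or stay equal) as more masses are added, and it stays below the library size when identical codes occur.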

All calculations were performed on the IBM 360/65 computer of the Delft University of Technology.

DISCUSSION

It is obvious that the amounts of information as calculated by the procedure described in this paper need to be corrected for the uncertainty remaining after the experiment. This uncertainty is due to experimental errors and errors made when coding the several mass peaks. A rough value for the uncertainty can be estimated from the amounts of information compiled in Table I. Since the amount of information obtained from the measurement of one mass peak can never be negative, the correction for the uncertainty remaining will never exceed the lowest amount of information in Table I. Apart from some masses in the region 1-30, which are considered to be unreliable, the minimum amount obtained is 0.09 bit (masses 291, 292, 293). Assuming that these masses yield no information at all, the uncertainty remaining does not exceed 0.09 bit. This corresponds to an error of about 1% in the assignment of a '0' or '1' to the intensity of a peak. This implies that the amount of information must be decreased by 0.09 bit per mass and that a maximum of 30 bits will be obtained for about 100 selected masses.
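The link between an uncertainty of 0.09 bit and an error of about 1% can be checked numerically. The sketch below assumes that the remaining uncertainty is the binary Shannon entropy of the error probability; that reading, and the function names, are assumptions made for illustration.

```python
from math import log2


def binary_entropy(p):
    """Shannon entropy (in bits) of a binary event with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)


def error_rate_for_uncertainty(bits, tol=1e-9):
    """Find by bisection the error probability in [0, 0.5] whose
    binary entropy equals the given uncertainty (in bits)."""
    lo, hi = 0.0, 0.5  # the entropy increases monotonically on this interval
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binary_entropy(mid) < bits:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


error_rate = error_rate_for_uncertainty(0.09)  # slightly above 1%
```

Under this assumption an error probability of exactly 1% corresponds to about 0.08 bit, so 0.09 bit is indeed an error of roughly 1%.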

From the results compiled in Table III, it appears that not all binary coded spectra are unique. If the spectra are binary coded with the first 120 peaks selected, it appears that the 9628 compounds yield only 9123 different codes. 505 spectra appear to be identical to other spectra of the library. Of these 505 spectra, 163 can be ascribed to identical compounds and 133 are isomers. 209 compounds show coded spectra that are identical with the coded spectra of entirely different compounds. A considerable number of these 209 compounds are of relatively high and low molecular weight. This can be ascribed to the fact that masses over 300 have not been considered in the selection procedure (because of the small amounts of information) and only a few masses below 30 have been selected (also because of the small amounts of information). It is easily understood that the number of different spectra decreases as the number of peaks used for coding is decreased.

The significance of the peaks selected is not expected to show any relation to the peaks that play an important role in the interpretation (in contrast to retrieval) of mass spectra. The selection procedure makes use of statistical data and peaks that are considered to be significant in some cases may not be selected because, on the average, they are less significant. Peaks yielding large amounts of information, e.g., 29, 51, 53, fall far down on the list of contributors because of the correlations with the peaks selected in an earlier stage, whereas in interpretation simultaneously occurring peaks are often extremely helpful.

The amounts of information are influenced by the choice of the threshold level. According to Grotch (1), a threshold level of 1% is to be considered as an optimum if the threshold is chosen to be the same for every peak. Undoubtedly, thresholds in most cases can be adjusted for every peak to yield an amount of information of 1 bit. This of course might lead to a smaller set of peaks yielding the required amount of information.

the amounts of information and the selected masses given in this paper. If a retrieval procedure for 1000 compounds is to be designed, 10 bits of information are required. This information can be obtained by measuring the first 20 peaks of Table II, provided that these 1000 compounds show the same distribution as the set of compounds used in this study. If the set of mass spectra shows a different distribution of peaks, a different set of masses must be selected. An entirely different distribution for instance was found for 200 binary coded spectra of alkanes, yielding a maximum amount of information of 9 bits for 25 selected masses.

Whether the set of mass spectra used is a representative set is difficult to judge. Although the library is composed of spectra from several classes of chemical compounds, the relative size of each class (the mix) may not be equal to the relative size of the entire population of chemical compounds accessible for mass spectrometry. At present, there seems to be no way of solving this problem and it has to be assumed that the library used does appear to represent a reasonable cross-section of the entire population.

It cannot be expected that retrieval experiments will yield perfect matches. On the basis of the amount of information (39 bits for about 100 peaks) and the influence of errors (max. 9 bits), an average match of about 75% can be expected. This figure can only be verified by a large number of retrievals. Some preliminary retrievals indicate a better agreement (furfural 96%, acetophenone 93%, 2-decanone 93%, d-limonene 96%, ethylcaprylate 92%, terpinolene 96%). Because of the occurrence of identical codes for different compounds, about 5% of the retrievals will not lead to an unambiguous identification.

It can be concluded that application of information theory offers a method for compact coding of mass spectra to be used for high speed retrieval.


ACKNOWLEDGMENT

The authors are indebted to R. P. W. Duin, P. F. Dupuis, and J. H. Kelderman for their helpful discussions.

LITERATURE CITED

(1) S. L. Grotch, Anal. Chem., 42, 1214 (1970).

(2) C. E. Shannon and W. Weaver, "The Mathematical Theory of Communication", The University of Illinois Press, Urbana, Ill., 1949.

(3) F. Erni, Beitrag zur Computerunterstützten Strukturaufklärung, Thesis Nr. 4296, Eidgenössischen Technischen Hochschule, Zürich, 1972.

(4) P. F. Dupuis and A. Dijkstra, Anal. Chem., 47, 379 (1975).

(5) S. L. Grotch, Anal. Chem., 46, 526 (1974).

RECEIVED for review April 14, 1975. Accepted October 30,


EXPERIMENTAL

The six mass spectral reference files used in this study are listed in Table 1. The spectra were binary-coded for a mass range of m/z 1-300, by using an intensity threshold of T% of the intensity of the base peak, with T varying from 0.1 to 20%. For some reference files, a 2-bit code was generated with three intensity levels, viz. 1, 5 and 20%, thus specifying peak intensities as: no peak, small peak, medium-size peak and large peak, respectively.
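The two coding schemes described above can be sketched as follows. The function names, the toy spectrum and the convention that an intensity equal to the threshold counts as "present" are assumptions; the original FORTRAN IV programs are not reproduced here.

```python
def binary_code(intensities, threshold_pct=1.0):
    """Code each m/z value as 1 when its intensity reaches threshold_pct
    percent of the base peak, otherwise as 0."""
    base_peak = max(intensities.values())
    cutoff = threshold_pct / 100.0 * base_peak
    return {mz: int(value >= cutoff) for mz, value in intensities.items()}


def two_bit_code(intensities, levels_pct=(1.0, 5.0, 20.0)):
    """Code each m/z value as 0..3 (no peak, small, medium-size, large)
    by counting how many of the three intensity levels are reached."""
    base_peak = max(intensities.values())
    return {
        mz: sum(value >= level / 100.0 * base_peak for level in levels_pct)
        for mz, value in intensities.items()
    }


# Hypothetical spectrum: m/z -> intensity in arbitrary units (base peak at m/z 43).
toy_spectrum = {27: 0.5, 41: 30.0, 43: 1000.0, 57: 120.0}
binary = binary_code(toy_spectrum, threshold_pct=1.0)
two_bit = two_bit_code(toy_spectrum)
```

With the base peak at 1000 units, the 1% threshold lies at 10 units, so only m/z 27 is coded as absent; in the 2-bit code m/z 41 becomes a small peak, m/z 57 a medium-size peak and m/z 43 a large peak.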

For 1650 chemical compounds, duplicates were extracted from reference file F to enable an investigation of the effects of errors.

For the development and testing of the algorithms, a PDP 11/45 minicomputer was used whereas the final computer programs were run on the IBM 370/158 computer system of the Delft University of Technology. All programs were written in FORTRAN IV.

INFORMATION CONTENT

Discontinuous distribution

Retrieval from a reference file containing N spectra each with n coded features (in this case m/z values) yields an amount of information called the "information content" of the retrieval procedure [9]. The information content for feature j, $I_H(j)$ in bits, without taking experimental and coding errors into account, is given by Shannon's equation [10]

$$I_H(j) = -\sum_{i=1}^{m} P_j(i)\,\log_2 P_j(i) \tag{1}$$

where m represents the number of discrete values i for feature j with corresponding probabilities $P_j(i)$.

For binary-coded intensities, only two values for feature j are distinguished: either below or above a given threshold, coded as "0" or "1", respectively.

TABLE 1

Mass spectral reference files used

Ref.  Code-name  Number of spectra  Origin
A     ms9628     9628               MSDC file, release 1971 (a)
B     msalkane   195                Alkane spectra from A
C     ms15796    15796              MSDC file, release 1973 (a)
D     ms02408    2408               Hydrocarbon spectra from C
E     ms11346    11346              EPA/NIH file, release 1975 (b)
F     ms22349    22349              Mix of C and E

(a) Mass Spectrometry Data Centre, Aldermaston, Gt. Britain.

(b) EPA/NIH Mass Spectral Data Base, 1975 edn., National Technical Information Service, Dept. of Commerce, 5285 Port Road, Springfield, Virginia 22151, U.S.A.


3 CALCULATION OF THE INFORMATION CONTENT OF RETRIEVAL PROCEDURES, APPLIED TO MASS SPECTRAL DATA BASES

SUMMARY

A procedure has been developed for estimating the information content of retrieval systems with binary-coded mass spectra, as well as mass spectra coded by other methods, from the statistical properties of a reference file. For a reference file, binary-coded with a threshold of 1% of the intensity of the base peak, this results typically in an estimated information content of about 50 bits for 200 selected m/z values. It is shown that, because of errors occurring in the binary-coded spectra, the actual information content is only about 12 bits. This explains the poor performance observed for retrieval systems with binary-coded mass spectra.

In recent years, information theory has been applied in different fields of analytical chemistry. The information content has been introduced as an optimization criterion in thin-layer chromatography [1], gas chromatography [2, 3], infrared [4, 5] and mass spectrometry [6-9]. It has been shown that a mass spectrum even with binary-coded intensities still provides a large amount of information [6]. The information content is diminished considerably by correlations between spectral features [9]. Similar observations have been made for binary-coded infrared spectra [4, 5].

The influence of errors and matching criteria on the retrieval of binary-coded mass spectra has been discussed [9]. It was concluded that the performance of the retrieval primarily depends on the extent of errors occurring in the coded spectra and is hardly affected by the matching criterion used.

In this paper, a method is described for calculating the information content for binary-coded mass spectra as well as spectra coded by other means. In addition, a new algorithm for feature selection is presented. Finally, an approach is outlined for prediction of the performance of a forward search system with binary-coded spectra, when data bases which differ with respect to the number and the nature of the compounds involved are considered.

Since $\sum_{i=1}^{m} P_j(i) = 1$, it is obvious that in this case $P_j(0) = 1 - P_j(1)$. Equation (1) then reduces to

$$I_H(j) = -P_j \log_2 P_j - (1 - P_j)\,\log_2 (1 - P_j) \tag{2}$$

with $P_j = P_j(1)$. As an illustration for a large reference file, the probabilities and information contents for a number of m/z values, calculated with eqn. (2), are compiled in Table 2.
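Equations (1) and (2) translate directly into a few lines of code. The sketch below recomputes three of the $I_H$ values of Table 2 (threshold level 1%) from the tabulated probabilities; the function names are chosen for illustration only.

```python
from math import log2


def shannon_information(probabilities):
    """Eqn. (1): information content in bits of one feature, given the
    probabilities of its discrete values (zero probabilities are skipped)."""
    return -sum(p * log2(p) for p in probabilities if p > 0)


def binary_information(p):
    """Eqn. (2): special case of eqn. (1) for a binary-coded feature that
    is coded present with probability p."""
    return shannon_information([p, 1.0 - p])


# Probabilities of occurrence taken from Table 2 (threshold level 1%).
table2_probabilities = {27: 0.44, 28: 0.41, 55: 0.58}
recomputed = {mz: round(binary_information(p), 2) for mz, p in table2_probabilities.items()}
```

The recomputed values agree with the $I_H$ entries listed in Table 2 for these m/z values, and a probability of 0.5 gives the maximum of 1 bit.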

If the spectral features are considered to be independent, the total information content can be calculated with

$$I_H(1, 2 \ldots n) = \sum_{j=1}^{n} I_H(j) = -\sum_{j=1}^{n} \sum_{i=1}^{m} P_j(i)\,\log_2 P_j(i) \tag{3}$$

If there is a dependence, eqn. (3) has to be replaced by

$$I_H(1, 2 \ldots n) = -\sum_{i_1=1}^{m} \sum_{i_2=1}^{m} \cdots \sum_{i_n=1}^{m} P(i_1, i_2 \ldots i_n)\,\log_2 P(i_1, i_2 \ldots i_n) \tag{4}$$

The total number of probabilities to be estimated amounts to $m^n$. For a small number of features n and a relatively large number of spectra, eqn. (4) can be used to calculate the information content. In order to predict the information content of retrieval procedures for very large files from the statistical properties of small files, reliable estimates of p must be available. However, in practice the number of spectral features is large (for mass spectra a few hundred m/z values) and therefore it will be impossible to obtain an adequate estimate of

TABLE 2

Influence of the probability of occurrence p and the mismatch probability p_d on the information content I_H (Shannon, eqn. 2) and on the information content corrected for errors, I_H (Shannon, eqn. 14), respectively, for some binary-coded m/z values from reference file F (in bits)

       Threshold level 1%                    Threshold level 2%
m/z    p     I_H   p_d   I_H (eqn. 14)       p     I_H   p_d   I_H (eqn. 14)
27     0.44  0.99  0.24  0.20                0.40  0.97  0.22  0.22
28     0.41  0.98  0.26  0.16                0.37  0.95  0.25  0.16
29     0.42  0.98  0.19  0.28                0.38  0.96  0.18  0.28
42     0.52  1.0   0.12  0.47                0.45  0.99  0.12  0.46
44     0.47  1.0   0.17  0.34                0.39  0.96  0.16  0.34
51     0.52  1.0   0.14  0.41                0.44  0.99  0.10  0.50
53     0.52  1.0   0.15  0.39                0.42  0.98  0.12  0.45
55     0.58  0.98  0.11  0.48                0.51  1.0   0.10  0.53
56     0.43  0.98  0.10  0.12                0.36  0.94  0.10  0.49
57     0.51  1.0   0.12  0.47                0.43  0.98  0.10  0.52
65     0.43  0.98  0.11  0.49                0.35  0.93  0.09  0.51
69     0.50  1.0   0.09  0.56                0.44  0.99  0.08  0.58
77     0.56  0.99  0.09  0.55                0.49  1.00  0.09  0.17
75     0.43  0.98  0.10  0.52                0.35  0.93  0.08  0.53
91     0.44  0.99  0.10  0.52                0.38  0.96  0.07  0.58

the p values even with the sizes of the spectral data bases which are presently available. Application of eqn. (4) then leads to a maximum information content of $\log_2 N$ bits.
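The $\log_2 N$ ceiling can be made concrete with a small sketch of eqn. (4), in which the joint probabilities are estimated as relative frequencies of complete codes in a file. The toy files and the function name are hypothetical.

```python
from collections import Counter
from math import log2


def joint_information(coded_spectra):
    """Eqn. (4) with the joint probabilities estimated as the relative
    frequencies of complete codes in the file (in bits)."""
    n_spectra = len(coded_spectra)
    counts = Counter(tuple(code) for code in coded_spectra)
    return -sum((c / n_spectra) * log2(c / n_spectra) for c in counts.values())


# Hypothetical file with N = 4 spectra that all have different codes.
all_distinct = [(0, 0, 1), (0, 1, 1), (1, 0, 0), (1, 1, 0)]
# Hypothetical file with N = 4 spectra and only two different codes.
with_duplicates = [(0, 0, 1), (0, 0, 1), (1, 1, 0), (1, 1, 0)]

info_distinct = joint_information(all_distinct)       # equals log2(4) = 2 bits
info_duplicates = joint_information(with_duplicates)  # equals log2(2) = 1 bit
```

When every code in the file is different, each estimated probability is 1/N and the result is exactly $\log_2 N$, however many features are coded; this is why a file of limited size cannot support the estimate for a large number of features.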

Continuous distributions and correlations

When the discontinuous distributions of the probabilities $P_j$ in eqn. (1) can be approximated by a continuous normal distribution, the sum in eqn. (1) is replaced by the integral

$$I_G(j) = -\int_{-\infty}^{\infty} p_j(x)\,\log_2 p_j(x)\,dx \tag{5}$$

with $p_j(x)$ as the Gaussian distribution function for feature j with value x measured in histogram units and $I_G(j)$ the information content for the integral form. After integration eqn. (5) becomes

$$I_G(j) = \tfrac{1}{2} \log_2 \left(2\pi e\,\sigma_j^2\right) \tag{6}$$

where $\sigma_j^2$ is the variance of the normally distributed feature j. Thus the information content is a logarithmic function of this variance. The n-dimensional equivalent becomes

$$I_G(1, 2 \ldots n) = \tfrac{1}{2} \log_2 \left[(2\pi e)^n\,|COV|\right] \tag{7}$$

where $I_G$ is the total information content for n features and $|COV|$ the determinant of the covariance matrix, filled with the variances $\sigma_j^2$ and the covariances $\sigma_{ij}$ [9]. Combination of eqns. (6) and (7) finally results in

$$I_G(1, 2 \ldots n) = \sum_{j=1}^{n} I_G(j) + \tfrac{1}{2} \log_2 |COR| \tag{8}$$

with $|COR|$ as the correlation determinant defined as

$$|COR| = \begin{vmatrix} 1 & \rho_{12} & \rho_{13} & \cdots & \rho_{1n} \\ \rho_{12} & 1 & \rho_{23} & \cdots & \rho_{2n} \\ \vdots & \vdots & & \ddots & \vdots \\ \rho_{1n} & \rho_{2n} & \rho_{3n} & \cdots & 1 \end{vmatrix}$$

The correlation coefficient $\rho_{ij}$ between the features i and j can be calculated with the estimates of the variances $\sigma_i^2$ and $\sigma_j^2$ and the covariance $\sigma_{ij}$ [9] from the equation $\rho_{ij} = \sigma_{ij} / (\sigma_i^2 \sigma_j^2)^{1/2}$.

The second term in eqn. (8) is a correction of the information content caused by the interdependence of the features involved.

Binary-coded spectra and feature selection

Although binary-coded features can hardly be considered normally distributed, one can make an estimate of variances, covariances and correlation coefficients with

$$\sigma_j^2 = N p_j (1 - p_j) / (N - 1) \tag{9}$$

$$\rho_{ij} = (p_{ij} - p_i p_j) / \left[p_i p_j (1 - p_i)(1 - p_j)\right]^{1/2} \tag{10}$$

with $p_i$ and $p_j$ being the probabilities of coding feature i or j, respectively, present ("1") and $p_{ij}$ the probability of coding both features present. Application of eqns. (2), (6) and (9) yields the information contents $I_H$ and $I_G$ as a function of $p_j$ for one binary feature j. The results are presented in Fig. 1. The approximation of $I_H$ by $I_G$ for values of p between 0.15 and 0.85 results in a maximum error of 0.05 bit. Instead of making a correction in eqn. (8) for the differences between $I_H$ and $I_G$ [4, 9], the total information content is calculated from

$$I_G(1, 2 \ldots n) = \sum_{j=1}^{n} I_H(j) + \tfrac{1}{2} \log_2 |COR| \tag{11}$$

The effect of the correction for correlated features (the second term in eqn. 11) is best illustrated in Table 3. In this Table the information contents calculated with eqns. (3), (4) and (11) are given for a set of m/z values with low correlation derived from reference file A [9]. From these numbers, it appears that eqn. (11) gives a reasonable estimate of the total information content.
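The quoted maximum error of 0.05 bit for the approximation of the Shannon information content by the Gaussian one can be verified numerically. The sketch below assumes a large file, so that the factor N/(N - 1) in eqn. (9) is taken as 1; the function names are illustrative.

```python
from math import e, log2, pi


def shannon_binary(p):
    """Eqn. (2): Shannon information content of one binary feature."""
    return -p * log2(p) - (1 - p) * log2(1 - p)


def gaussian_binary(p):
    """Eqns. (6) and (9): Gaussian information content with the variance
    p(1 - p), i.e. with N/(N - 1) approximated by 1 (assumption)."""
    return 0.5 * log2(2 * pi * e * p * (1 - p))


# Largest absolute difference on a grid of p values from 0.15 to 0.85.
max_error = max(
    abs(gaussian_binary(i / 100) - shannon_binary(i / 100)) for i in range(15, 86)
)
```

On this grid the largest deviation is just under 0.05 bit and occurs at the ends of the interval; at p = 0.5 the Gaussian form overestimates by slightly less than 0.05 bit.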

However, as shown in Table 4 for an artificial set of two highly correlated features, calculation of the information content with eqn. (11) will sometimes result in an overcorrection for the correlations between the features. In addition, the cumulative effect of the correlations will make it impossible to calculate the total information content for all the features concerned. To avoid this problem, features that tend to show the effect of overcorrection must be deleted during the calculation. This is achieved by applying a feature selection

Fig. 1. The information contents $I_H$ (eqn. 2) and $I_G$ (eqns. 6, 9) as a function of the probability p for a single binary-coded feature.

TABLE 3

Information contents for a set of n poorly correlated m/z values derived from reference file A (in bits)

n                      1     2     3     4     5     6     7     8     9     10
m/z value              77    69    27    50    45    40    57    75    81    44
Sum of I_H(j) (a)      1.0   2.0   3.0   4.0   4.9   5.9   6.9   7.8   8.7   9.7
I_H(1, 2 ... n) (b)    1.0   2.0   3.0   3.8   4.6   5.4   6.2   6.8   7.4   8.1
I_G(1, 2 ... n) (c)    1.0   2.0   3.0   3.8   4.7   5.5   6.2   7.0   7.7   8.4

(a) Sum of individual information contents, eqn. (3).
(b) Shannon information content, eqn. (4).
(c) Information content corrected for correlation, eqn. (11).

TABLE 4

Influence of a high correlation coefficient $\rho$ on the information content for a few sets of two artificial features 1 and 2 (in bits)

I_H(1) (a)           1.0       1.0       1.0    0.9       0.9       0.9
I_H(2) (a)           1.0       1.0       1.0    0.8       0.8       0.8
rho_12               1.0       0.9       0.8    1.0       0.9       0.8
Sum of I_H(j) (b)    2.0       2.0       2.0    1.7       1.7       1.7
I_G(1, 2) (c)        -inf (d)  0.8 (e)   1.3    -inf (d)  0.5 (e)   1.0

(a) Shannon information content, eqn. (2).
(b) Sum of individual information contents, eqn. (3).
(c) Information content corrected for the correlation between 1 and 2, eqn. (11).
(d) Information content not defined (zero logarithm).
(e) Information content overcorrected for correlations.

algorithm, resulting in an information content $I_G(1, 2 \ldots l)$ for l selected features with l less than or equal to n. A detailed description of the algorithm is given in the Appendix. These calculations have been carried out as a function of the intensity threshold T for a number of binary-coded mass spectral reference files. The information content $I_G$ and the number of features selected are presented in Table 5.
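The overcorrection illustrated in Table 4 follows directly from eqn. (11) for two features, where the correlation determinant reduces to $1 - \rho_{12}^2$. A minimal sketch that recomputes the last row of Table 4 (the function name is illustrative):

```python
from math import log2


def corrected_information_two_features(i_h1, i_h2, rho):
    """Eqn. (11) for n = 2: the sum of the individual information contents
    plus half the logarithm of the determinant 1 - rho**2."""
    determinant = 1.0 - rho ** 2
    if determinant <= 0.0:
        return float("-inf")  # zero logarithm: information content not defined
    return i_h1 + i_h2 + 0.5 * log2(determinant)


# The six columns of Table 4: (I_H(1), I_H(2), rho_12).
table4_columns = [
    (1.0, 1.0, 1.0), (1.0, 1.0, 0.9), (1.0, 1.0, 0.8),
    (0.9, 0.8, 1.0), (0.9, 0.8, 0.9), (0.9, 0.8, 0.8),
]
table4_last_row = [
    round(corrected_information_two_features(a, b, r), 1) for a, b, r in table4_columns
]
```

For a correlation coefficient of 0.9 the two features together are credited with less information than the first feature alone, which is the overcorrection that the feature selection algorithm is designed to avoid.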

TABLE 5

Information content $I_G(1, 2 \ldots l)$ and the number of selected m/z values l for the binary-coded reference files at 6 different threshold levels from eqns. (10) and (11) (in bits)

                 0.1%       1%         2%         5%         10%        20%
Reference file   I_G  l     I_G  l     I_G  l     I_G  l     I_G  l     I_G  l
msalkane (B)     -    -     11   34    -    -     -    -     -    -     -    -
ms9628 (A)       -    -     42   140   -    -     -    -     -    -     -    -
ms02408 (D)      29   95    22   80    19   77    16   77    12   76    -    -
ms15796 (C)      47   157   46   156   41   159   32   159   24   164   17   171
ms11346 (E)      55   194   55   196   51   198   40   197   29   188   20   197
ms22349 (F)      -    -     52   174   47   176   -    -     -    -     -    -

correlations between the features, another estimate of these correlations may be obtained by applying the 2-dimensional expressions of eqns. (4) and (11) for the features 1 and 2:

$$I_H(1, 2) = -\sum_{i_1=1}^{m} \sum_{i_2=1}^{m} p(i_1, i_2)\,\log_2 p(i_1, i_2) \tag{12}$$

$$I_G(1, 2) = I_H(1) + I_H(2) + \tfrac{1}{2} \log_2 \left(1 - \rho_{12}^2\right) \tag{13}$$

$I_H(1, 2)$ can easily be calculated with a large set of spectra, since only four p values have to be estimated for binary-coded features. Equating $I_H(1, 2)$ and $I_G(1, 2)$ then leads to a new estimate for the correlation coefficient $\rho_{12}$; this may then be used in eqn. (11), thus circumventing overcorrection. The same feature selection procedure mentioned above has been applied to obtain the information content $I_G$ for the binary-coded reference files. Table 6 gives a brief example of the results.
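Equating eqns. (12) and (13) gives $\rho_{12}^2 = 1 - 2^{\,2[I_H(1,2) - I_H(1) - I_H(2)]}$. The sketch below applies this to the four joint probabilities of two binary features. The function names and the argument order are assumptions, and only the magnitude of the coefficient is returned, since the sign cannot be recovered from eqn. (13).

```python
from math import log2, sqrt


def entropy(probabilities):
    """Shannon information content in bits (zero probabilities are skipped)."""
    return -sum(p * log2(p) for p in probabilities if p > 0)


def effective_correlation(p00, p01, p10, p11):
    """Magnitude of rho_12 obtained by equating eqns. (12) and (13).
    pXY is the joint probability that feature 1 is coded X and feature 2 is coded Y."""
    i_joint = entropy([p00, p01, p10, p11])   # eqn. (12)
    i_1 = entropy([p10 + p11, p00 + p01])     # eqn. (2) for feature 1
    i_2 = entropy([p01 + p11, p00 + p10])     # eqn. (2) for feature 2
    rho_squared = 1.0 - 2.0 ** (2.0 * (i_joint - i_1 - i_2))
    return sqrt(max(rho_squared, 0.0))        # guard against rounding below zero


rho_identical = effective_correlation(0.5, 0.0, 0.0, 0.5)        # two identical features
rho_independent = effective_correlation(0.25, 0.25, 0.25, 0.25)  # two independent features
```

For two identical features with p = 0.5 the new estimate is about 0.87 instead of 1.0, so eqn. (13) returns 1 bit for the pair rather than an undefined value, which is the behaviour the text describes as circumventing overcorrection.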

This alternative manner of quantifying correlations between the features can also be used to assess the information content for non-binary-coded features, provided that the number of spectra is large enough to estimate all p values used in eqn. (12). For some reference files, the features have been converted to a 2-bit code; the information contents and the numbers of features selected are given in Table 7.

TABLE 6

Information content $I_G$ (eqns. 11 and 13) and the number of selected m/z values, l, for the binary-coded reference files (threshold level 1%) (in bits)

Reference file   I_G(1, 2 ... l)   l
msalkane (B)     13                48
ms02408 (D)      23                93
ms9628 (A)       49                214
ms15796 (C)      60                257
ms11346 (E)      78                282
ms22349 (F)      68                276

TABLE 7

Information content $I_G$ and the number of selected m/z values l for 3 reference files, converted into a 2-bit code (eqns. 11, 12, 13) (in bits)

Reference file   I_G(1, 2 ... l)   l
ms02408 (D)      40                103
ms15796 (C)      110               288
ms11346 (E)      137               284

Influence of errors

For actual retrieval of spectra, the information content must be corrected for the uncertainty remaining after coding the spectra. The uncertainty in the coded spectra is caused by deviations in experimental, recording and coding conditions. Hence, for binary-coded features, eqn. (2) must be replaced by:

$$I_H = -p \log_2 p - (1 - p) \log_2 (1 - p) + p\left[p_{1/0} \log_2 p_{1/0} + (1 - p_{1/0}) \log_2 (1 - p_{1/0})\right] + (1 - p)\left[p_{0/1} \log_2 p_{0/1} + (1 - p_{0/1}) \log_2 (1 - p_{0/1})\right] \tag{14}$$

with $p_{1/0}$ and $p_{0/1}$ being the probabilities of coding a "0" or a "1", when the spectrum of the reference compound contains a "1" or a "0", respectively, for that feature. From the probability of finding a mismatch between two spectra of the same compound (the "mismatch probability" $p_d$) the "error probabilities" $p_{1/0}$ and $p_{0/1}$ can be calculated from [11]

$$p_{1/0} = p_d / (2p) \quad \text{and} \quad p_{0/1} = p_d / \left(2(1 - p)\right)$$

The mismatch probabilities are estimated from a set of 1650 pairs of spectra of the same compound, extracted from the reference file F. The large influence of the mismatch and error probabilities on the information content actually obtained in a retrieval process is shown for some m/z values in Table 2. The mismatch probability ranges up to 0.26, with the maximum values in the low m/z area.
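Equation (14) can be checked against Table 2. The sketch below recomputes the error-corrected information content for m/z 27 and m/z 69 at the 1% threshold from the tabulated p and $p_d$ values; the function names are illustrative.

```python
from math import log2


def binary_entropy(p):
    """Shannon entropy in bits of a binary event with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)


def information_with_errors(p, p_d):
    """Eqn. (14): information content of one binary feature with probability
    of occurrence p, corrected for the mismatch probability p_d."""
    p_1_0 = p_d / (2 * p)          # a reference "1" is coded as "0"
    p_0_1 = p_d / (2 * (1 - p))    # a reference "0" is coded as "1"
    return binary_entropy(p) - p * binary_entropy(p_1_0) - (1 - p) * binary_entropy(p_0_1)


# (p, p_d) pairs taken from Table 2, threshold level 1%.
corrected_mz27 = round(information_with_errors(0.44, 0.24), 2)
corrected_mz69 = round(information_with_errors(0.50, 0.09), 2)
```

Both results reproduce the tabulated values (0.20 and 0.56 bit), and they show how a mismatch probability of 0.24 removes about four fifths of the information of a nearly ideal feature.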

With the assumption that the error probabilities for the different features are independent, the correction of the information content can also be done by replacing $I_H(j)$ from eqn. (2) by the error-corrected $I_H(j)$ from eqn. (14) in eqn. (11), yielding an error-corrected $I_G$. In Table 8 the uncorrected and the error-corrected information contents $I_G$, calculated for reference file F, binary-coded with two threshold levels, are given as an example.

Inspection of the results indicates that the information content drops dramatically from 52 to 11 bits and from 47 to 12 bits, respectively. This is much more than previously predicted [9]. Repeating the calculation for reference file A gave the same effect. This result explains the generally poor performance of a retrieval system based on binary-coded mass spectra [11].

TABLE 8

Information contents $I_G$ without correction for errors (eqns. 10, 11) and with correction for errors (eqns. 10, 11, 14), in bits, as a function of the number of selected m/z values l for reference file F, binary-coded at two threshold levels

l                                     10    20    30    40    50    60    70    80    90    Max. (a)
Threshold level 1%
  I_G (eqns. 10, 11)                  8.7   14.9  20.3  25.0  29.1  32.7  36.1  39.1  41.8  51.7
  I_G with errors (eqns. 10, 11, 14)  4.3   6.8   8.5   9.7   10.5  11.0  11.2  11.3  11.3
Threshold level 2%
  I_G (eqns. 10, 11)                  8.4   14.5  19.5  23.7  27.4  30.8  33.8  36.5  38.9  47.4
  I_G with errors (eqns. 10, 11, 14)  4.    7.    8.    10.   11.   12.   12.   12.   12.   12.

(a) The maximum numbers of selected m/z values are 174, 77, 176 and 89, respectively.

retrieval procedures involves estimates of variances and correlation coefficients for a limited number of spectra in order to predict implicitly the probabilities for all possible codes. For features with extremely high correlations, the first model would lead to pessimistic estimates of the resulting information content. Use of a feature selection procedure is inevitable.

The second method for the calculation of the information content includes a different measure of the correlations (eqn. 13) which results in a contribution of almost all coded features to the information content. For this reason, the second method is expected to yield more reliable estimates of the information content. This method is also feasible for spectra coded by non-binary techniques, although it should be refined in order to deal with the influence of experimental and coding errors.

In the model discussed, the reference files are considered as selected samples of complete populations of mass spectra. Consequently, in the design or evaluation of retrieval methods the reference file to be used for the calculation of the information content should be as representative of a certain population of compounds as possible. For example, the results for alkane spectra (reference file B), even without a correction of the information content for errors in the spectra, show clearly that binary coding does not yield enough information and is not recommended for retrieval purposes.

For retrieval from reference files containing spectra of a wider variety of compounds (such as C, E and F), binary coding is more promising. With a threshold of 1% of the intensity of the base peak, an information content of about 50 bits for 200 selected m/z values is typical. This value decreases rapidly above threshold values of 2%, mainly because of the diminishing number of peaks per spectrum.


However, most of the information is lost because of the errors that occur in the binary-coded spectra. The performance of a retrieval system based on the binary-coded reference file A confirms this conclusion [11]. The prediction that a threshold of 2% would yield a better retrieval performance than 1% is confirmed by the preliminary results obtained with reference file F.

The authors gratefully acknowledge valuable discussions with Louis van Norel, Peter Cley and Jan Van den Hende.

APPENDIX: FEATURE SELECTION ALGORITHM

To calculate the information content $I_G$ for a near-optimal subset of l out of n features, eqn. (11) is converted to

$$I_G(1, 2 \ldots n) = \tfrac{1}{2} \log_2 |S| \tag{a1}$$

with the elements of determinant S defined as $s_{ij} = 2^{I_H(i)} \cdot 2^{I_H(j)} \cdot \rho_{ij}$. This equation introduces a scaling which avoids computational problems such as arithmetical under- and over-flow. The diagonal elements of S range in value from 1 to 4 and represent the exponential form of the information content for every feature ($I_G(i) = \tfrac{1}{2} \log_2 s_{ii}$). The non-diagonal elements vary from -4 to +4, since the correlation coefficient $\rho$ ranges from -1 to +1, and represent the "covariances" between the features. The information content is calculated after the covariance elements of S have been zeroed by using the Gaussian elimination method [12] with

$$I_G(1, 2 \ldots n) = \tfrac{1}{2} \log_2 \prod_{i=1}^{n} s'_{ii} = \sum_{i=1}^{n} \tfrac{1}{2} \log_2 s'_{ii} \tag{a2}$$

where $s'_{ii}$ is the "variance" element after correction for the correlations between the features.

Because of the effect of overcorrection for correlations between the features, only l features contribute to the calculated information content $I_G$. These features are selected together with their contribution to $I_G$ by the following procedure:

1. select the feature i with the highest value of $s_{ii}$;

2. correct the variances of the remaining features k for the correlation between i and k with

$$s'_{kk} = s_{kk} - s_{ik}^2 / s_{ii} \tag{a3}$$

3. find the next feature n with the highest corrected value of $s'_{nn}$;

4. correct the covariances of the remaining features k for the correlation between feature k and all features j previously selected, excluding feature n, with


The iterative procedure is terminated when all features have been selected or when the highest corrected value of $s'_{nn}$ in step 3 becomes less than 1; all remaining features k will then have $s'_{kk}$ less than 1 and $\tfrac{1}{2} \log_2 s'_{kk}$ becomes negative.
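The selection loop can be sketched as a greedy, pivoted Gaussian elimination on the scaled matrix S. This is an illustration under assumptions rather than the procedure of the Appendix in full: after each selection it applies the standard elimination update to all remaining elements at once, instead of deferring the covariance corrections as the text describes, and the function name and inputs are hypothetical.

```python
from math import log2


def select_features(info, rho):
    """Greedy feature selection on S with s_ij = 2**I_H(i) * 2**I_H(j) * rho_ij.
    `info` holds I_H(j) per feature, `rho` the full correlation matrix.
    Returns the selected feature indices and the accumulated I_G in bits."""
    n = len(info)
    s = [[2 ** info[i] * 2 ** info[j] * rho[i][j] for j in range(n)] for i in range(n)]
    remaining = list(range(n))
    selected, total = [], 0.0
    while remaining:
        best = max(remaining, key=lambda k: s[k][k])  # highest corrected "variance"
        if s[best][best] < 1.0:
            break  # every remaining contribution would be negative
        total += 0.5 * log2(s[best][best])
        selected.append(best)
        remaining.remove(best)
        for k in remaining:  # standard elimination step (assumption, see lead-in)
            factor = s[k][best] / s[best][best]
            for m in remaining:
                s[k][m] -= factor * s[best][m]
    return selected, total


# The two artificial features of Table 4 with I_H = 1.0 each.
chosen_09, info_09 = select_features([1.0, 1.0], [[1.0, 0.9], [0.9, 1.0]])
chosen_08, info_08 = select_features([1.0, 1.0], [[1.0, 0.8], [0.8, 1.0]])
```

With a correlation coefficient of 0.9 the corrected variance of the second feature drops to 0.76, so it is rejected and the result stays at 1 bit; with 0.8 it is 1.44 and the second feature still adds about 0.26 bit.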

The procedure described has been chosen to avoid the correction of all covariances in S; the elements need to be corrected only for the correlations with the features already selected. In programming this procedure the indices i, j, k and n are replaced by elements of a pointer array, in which the sequence of the features is stored, and which enables an indirect reference to the rows and columns of S without time-consuming rearrangements. The square matrix S is symmetric and only the upper triangle of the matrix need be stored, giving a considerable reduction in storage requirements. However, if possible the linear storage of S should be avoided as otherwise all indices in the eqns. (a3) and (a4) have to be computed with a special function. The calculation of the information content $I_G$ for the mass spectral reference files in this study, with 300 binary-coded m/z values, requires the storage of 300 X 300 elements of the matrix S. The execution time depends linearly on the number of coded features multiplied by the number of features selected. On the IBM 370/158 computer system this amounts to approximately 40 s for 100 features selected and 300 m/z values coded.
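The "special function" needed for linear storage of the upper triangle can be illustrated as follows. The row-major layout and the zero-based indices are assumptions made for this sketch; the text does not specify how the original program would have laid out the triangle.

```python
def triangle_index(i, j, n):
    """Position of element (i, j) of a symmetric n x n matrix when only the
    upper triangle is stored row by row in a linear array (zero-based)."""
    if i > j:
        i, j = j, i  # symmetry: s_ij equals s_ji
    return i * n - i * (i - 1) // 2 + (j - i)


n_features = 300
full_storage = n_features * n_features                  # 90000 elements
triangle_storage = n_features * (n_features + 1) // 2   # 45150 elements
last_position = triangle_index(n_features - 1, n_features - 1, n_features)
```

The triangle halves the storage, but every access to S then costs an index computation of this kind, which is the overhead the text advises avoiding when memory permits.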

REFERENCES

1 D. L. Massart, J. Chromatogr., 79 (1973) 157.

2 P. F. Dupuis and A. Dijkstra, Anal. Chem., 47 (1975) 379.

3 A. Eskes, P. F. Dupuis, A. Dijkstra, H. Declerc and D. L. Massart, Anal. Chem., 47 (1975) 2168.

4 P. F. Dupuis and A. Dijkstra, Fresenius' Z. Anal. Chem., 290 (1978) 357.

5 P. F. Dupuis, A. Dijkstra and J. H. van der Maas, Fresenius' Z. Anal. Chem., 291 (1978) 27.

6 S. L. Grotch, Anal. Chem., 42 (1970) 1214.

7 F. Erni, thesis no. 4296, Eidgen. Techn. Hochschule, Zurich (1972).

8 F. Erni and J. T. Clerc, Helv. Chim. Acta, 55 (1972) 489.

9 G. van Marlen and A. Dijkstra, Anal. Chem., 48 (1976) 595.

10 C. E. Shannon and W. Weaver, The Mathematical Theory of Communication, University of Illinois Press, Urbana, Ill., 1949.

11 G. van Marlen, A. Dijkstra and H. A. van't Klooster, Anal. Chem., 51 (1979) 420.

12 R. W. Hamming, Introduction to Applied Numerical Analysis, McGraw-Hill, New York, 1971, Ch. 5.


4 SEARCH STRATEGY AND DATA COMPRESSION FOR A RETRIEVAL SYSTEM WITH BINARY-CODED MASS SPECTRA

SUMMARY

A retrieval system for binary-coded mass spectra is described. The data base used contains 9628 low-resolution mass spectra from the Aldermaston Mass Spectra Data Collection. These spectra are reduced to 106 preselected binary-coded m/z values each. Storage of the compound names and formulae is minimized by using a special set of characters and file organization. The search strategy permits fast generation of the N-nearest neighbours. Depending on the number of best matches generated, an average search requires access to only 24-33% of the spectra contained in the data base. Because of its limited storage requirements, this search system can be used even on microcomputers.

The minicomputer plays an increasingly important role in the functioning of modern laboratories. As a result the available mass spectral data bases have grown to such an extent that their sheer size is becoming a handicap to their in-house applicability for routine mass-spectral retrieval systems. The storage of compound names and empirical formulae for a data base of 10 000 spectra would require at least 1 million bytes and the spectral information a commensurate amount or even more. The use of data compression techniques has therefore become inevitable.

This paper describes the organization of a mass-spectral retrieval system based on the optimized use of storage combined with a feature selection technique. In addition, the problem of reducing search time, also affected by the size of the data base, is addressed.

EXPERIMENTAL

Data base

A library of 9628 low-resolution mass spectra, originating from the Mass Spectra Data Collection [1], was used as a reference file for the retrieval system. These spectra were reduced by binary coding of the intensity values with an intensity threshold of 1% of the base peak. Further reduction was obtained by selecting 106 binary-coded peak positions. The selection, based on the information content of a peak position corrected for the correlation between peak positions, has been described previously [2]. With this method only those peak positions significant for the entire reference file were coded, requiring a storage capacity of 106 bits per spectrum.
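The binary coding step can be illustrated with a minimal Python sketch. It assumes a spectrum is given as a mapping from m/z value to intensity, and that a peak counts as "present" when its intensity is at least 1% of the base peak (whether the threshold is inclusive is an assumption). The function name `encode_spectrum` and the short list of selected positions are hypothetical; the actual 106 positions follow from the selection method of ref. [2] and are not reproduced here.

```python
from typing import Dict, Sequence


def encode_spectrum(peaks: Dict[int, float],
                    selected_mz: Sequence[int],
                    threshold: float = 0.01) -> int:
    """Return an integer whose bit i is set when selected_mz[i] carries a
    peak of at least `threshold` times the base-peak intensity."""
    if not peaks:
        return 0
    base = max(peaks.values())          # base peak = most intense peak
    code = 0
    for i, mz in enumerate(selected_mz):
        intensity = peaks.get(mz, 0.0)  # absent positions count as zero
        if intensity > 0 and intensity >= threshold * base:
            code |= 1 << i
    return code


# Illustrative data only: five positions instead of the 106 used in the paper.
selected = [41, 43, 57, 77, 91]
spectrum = {41: 30.0, 43: 100.0, 57: 0.5, 91: 2.0}
code = encode_spectrum(spectrum, selected)
```

With these made-up values the positions 41, 43 and 91 are coded "present", while 57 falls below the threshold and 77 is absent from the spectrum.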

File organization

The data base consists of three files with random access organization. The first file contains, beside the binary-coded spectrum, a unique identification number ID and a presearch parameter, the distance dR. This distance parameter is defined as the number of peak positions coded "present" in the spectrum or the number of "bit mismatches" between the spectrum and the "empty" spectrum with no peak positions coded present. The file is pre-arranged in order of increasing values of the distances dR. All reference spectra with the same dR are combined into a "cluster" of contiguous records in the file, each record containing up to 24 spectra. Figure 1 shows a frequency plot of the number of spectra for all clusters in the file versus the dR value.
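The layout of the first file can be sketched as follows. The sketch assumes each spectrum is an (ID, bit code) pair with the code held as a Python integer, and that a record never mixes spectra from different clusters, which is one reading of "contiguous records" above. The names `distance_to_empty` and `build_records` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

SPECTRA_PER_RECORD = 24  # record capacity stated in the text

Spectrum = Tuple[int, int]  # (identification number, binary code)


def distance_to_empty(code: int) -> int:
    """dR: number of positions coded 'present', i.e. the number of bit
    mismatches with the all-zero 'empty' spectrum."""
    return bin(code).count("1")


def build_records(spectra: Iterable[Spectrum]) -> List[Tuple[int, List[Spectrum]]]:
    """Group spectra into clusters of equal dR, order the clusters by
    increasing dR, and cut each cluster into records of up to 24 spectra."""
    clusters: Dict[int, List[Spectrum]] = defaultdict(list)
    for spectrum_id, code in spectra:
        clusters[distance_to_empty(code)].append((spectrum_id, code))
    records = []
    for d in sorted(clusters):
        members = clusters[d]
        for start in range(0, len(members), SPECTRA_PER_RECORD):
            records.append((d, members[start:start + SPECTRA_PER_RECORD]))
    return records


records = build_records([(1, 0b111), (2, 0b001), (3, 0b110), (4, 0b010)])
```

A general property of such bit counts, which this passage does not itself spell out, is that two spectra whose dR values differ by k must differ in at least k peak positions. This is what makes dR usable as a presearch key.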

The second file is used to store pointers for each spectrum to the empirical formula and name of the compound, stored in the third file. This method of indirect reference was chosen to eliminate any duplication of empirical formulae and compound names in the data base, thereby reducing the total storage requirements. As a result, only 3300 different empirical formulae and 8100 different compound names are stored. The relationship between these files is illustrated in Fig. 2. The storage requirements are summarized in Table 1.

Fig. 1. Distribution of reference spectra clusters (number of spectra per cluster versus dR, 0 to 110).

Fig. 2. File organization (MS-data, pointers, formulae and names).

TABLE 1

Storage requirements for 9628 binary-coded mass spectra

File  Contents                     No. of entries             Record length (byte)  Storage (Kbyte)
1     Binary data                  462                        480                   223
2     Pointers                     9628                       4                     39
3     Compound names and formulae  8106 names, 3279 formulae  16                    216

Keywords

To avoid lengthy and variable storage of compound names a special 8-bit "character" set was generated. This character set contains, in addition to the normal numerical and alphanumerical ASCII characters (a total of 64), a set of 160 "keywords" representing character combinations which occur frequently such as ACETYL, METHYL, PHENYL, etc. The keywords generated and their frequencies of occurrence in the data base are given in Table 2. With this compression, most of the compound names occupy less than 12 bytes of storage. The names requiring more than 12 characters are split up into chemically significant segments, separately stored in the file, and shared by all compound names. These segments are referred to within the 12-byte name-space by means of a 1-byte indicator followed by a 2-byte record number. The segments can also contain one or more references to other segments. The process of reconstructing a compound name is therefore recursive.
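The recursive reconstruction of a compound name described under Keywords can be sketched in Python. The source gives only the counts (64 plain characters, 160 keywords) and the shape of a segment reference (a 1-byte indicator plus a 2-byte record number), so the code values, the indicator value 0xFF, the big-endian byte order and the depth limit below are all assumptions made for illustration. `decode_name` is a hypothetical name.

```python
from typing import Dict

# Hypothetical code assignment; the real table is not given in the text.
PLAIN = " 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ-,()"   # subset of the 64 characters
KEYWORDS = {64: "ACETYL", 65: "METHYL", 66: "PHENYL"}  # 3 of the 160 keywords
SEGMENT_REF = 0xFF                                     # assumed indicator byte
MAX_DEPTH = 16                                         # guard against reference loops


def decode_name(data: bytes, segments: Dict[int, bytes], depth: int = 0) -> str:
    """Expand plain characters, keywords and (possibly nested) segment
    references into the full compound name."""
    if depth > MAX_DEPTH:
        raise ValueError("segment references nested too deeply")
    parts = []
    i = 0
    while i < len(data):
        code = data[i]
        if code == SEGMENT_REF:
            if i + 2 >= len(data):
                raise ValueError("truncated segment reference")
            record = int.from_bytes(data[i + 1:i + 3], "big")
            parts.append(decode_name(segments[record], segments, depth + 1))
            i += 3
        elif code in KEYWORDS:
            parts.append(KEYWORDS[code])
            i += 1
        elif code < len(PLAIN):
            parts.append(PLAIN[code])
            i += 1
        else:
            raise ValueError(f"unknown code {code}")
    return "".join(parts)


# Made-up segments: record 2 refers to record 1, so decoding recurses twice.
segments = {1: bytes([65, 66]), 2: bytes([64, SEGMENT_REF, 0, 1])}
name = bytes([PLAIN.index("4"), PLAIN.index("-"), SEGMENT_REF, 0, 2])
decoded = decode_name(name, segments)
```

The example name occupies 5 bytes and expands to a 20-character string. The string itself is not a real compound name, only a demonstration of the expansion.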

Configuration

The retrieval system is based on a PDP11/45 minicomputer, which is used for various laboratory applications, under RSX-11D in a multi-user environment. The data base is stored on an RK05 disk with a capacity of 2.4 Mbytes and an average transfer time to memory of 2 ms per spectrum. According to Table 1, the storage of the data base requires about 20% of the capacity of a disk cartridge. The acquisition of the g.c.-m.s. or m.s. data is carried out on a PDP11/45 preprocessor.
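The 20% figure can be checked against the Table 1 values with a few lines of Python. The conversion 1 Mbyte = 1000 Kbyte is an assumption (the result is also close to 20% with 1024-based units), and the variable names are illustrative.

```python
# Storage per file in Kbyte, as listed in Table 1.
storage_kbyte = {"binary data": 223, "pointers": 39, "names and formulae": 216}

total_kbyte = sum(storage_kbyte.values())
disk_kbyte = 2.4 * 1000            # RK05 capacity, assuming 1 Mbyte = 1000 Kbyte
fraction = total_kbyte / disk_kbyte

# Per-spectrum space in file 1: a 480-byte record holds up to 24 spectra.
bytes_per_slot = 480 // 24
bytes_for_code = (106 + 7) // 8    # 106 bits rounded up to whole bytes

print(f"total {total_kbyte} Kbyte = {fraction:.1%} of the disk")
print(f"{bytes_per_slot} bytes per spectrum, {bytes_for_code} of them for the bit code")
```

The three files total 478 Kbyte. Each spectrum occupies a 20-byte slot, of which 14 bytes hold the 106-bit code; the remainder is presumably taken by the ID and dR fields, although the text does not give that breakdown.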

Search program

The retrieval program is written in PDP11 FORTRAN IV-PLUS and requires a storage capacity of 5.5K words exclusive of system functions. The general