(RSC Analytical Spectroscopy Monographs) M. J. Adams - Chemometrics in Analytical Spectroscopy - Royal Society of Chemistry (1995)
The series will also cover related topics such as sample preparation;
detection systems; chemical sensing; environmental monitoring; and
process analytical chemistry.
RSC Analytical Spectroscopy Monographs
Series Editor: Neil Barnett, Deakin University, Victoria, Australia.
The series aims to provide a tutorial approach to the use of spectrometric and
spectroscopic measurement techniques in analytical science, providing guidance
and advice to individuals on a day-to-day basis during the course of their work
with the emphasis on important practical aspects of the subject.
A standing order plan is available for this series. A standing order will bring
delivery of each new volume immediately upon publication. For further
information, please write to:
Mike J. Adams
School of Applied Sciences, University of Wolverhampton,
Wolverhampton, UK
A catalogue record for this book is available from the British Library.
ISBN 0-85404-555-4
The term chemometrics was proposed more than 20 years ago to describe the
techniques and operations associated with the mathematical manipulation and
interpretation of chemical data. It is within the past 10 years, however, that
chemometrics has come to the fore, and become generally recognized as a
subject to be studied and researched by all chemists employing numerical data.
This is particularly true in analytical science. In a modern instrumentation
laboratory, the analytical chemist may be faced with a seemingly overwhelming
amount of numerical and graphical data. The identification, classification and
interpretation of these data can be a limiting factor in the efficient and effective
operation of the laboratory. Increasingly, sophisticated analytical instru-
mentation is also being employed out of the laboratory, for direct on-line or
in-line process monitoring. This trend places severe demands on data manipu-
lation, and can benefit from computerized decision making.
Chemometrics is complementary to laboratory automation. Just as auto-
mation is largely concerned with the tools with which to handle the mechanics
and chemistry of laboratory manipulations and processes, so chemometrics
seeks to apply mathematical and statistical operations to aid data handling.
This book aims to provide students and practising spectroscopists with an
introduction and guide to the application of selected chemometric techniques
used in processing and interpreting analytical data. Chapter 1 covers the basic
elements of univariate and multivariate data analysis, with particular emphasis
on the normal distribution. The acquisition of digital data and signal enhance-
ment by filtering and smoothing are discussed in Chapter 2. These processes are
fundamental to data analysis but are often neglected in chemometrics research
texts. Having acquired data, it is often necessary to process them prior to
analysis. Feature selection and extraction are reviewed in Chapter 3; the main
emphasis is on deriving information from data by forming linear combinations
of measured variables, particularly principal components. Pattern recognition
comprises a wide variety of chemometric and multivariate statistical techniques
and the most common algorithms are described in Chapters 4 and 5. In Chapter
4, exploratory data analysis by clustering is discussed, whilst Chapter 5 is
concerned with classification and discriminant analysis. Multivariate cali-
bration techniques have become increasingly popular and Chapter 6 provides a
summary and examples of the more common algorithms in use. Finally, an
Appendix is included which aims to serve as an introduction or refresher in
matrix algebra.
A conscious decision has been made not to provide computer programs of
the algorithms discussed. In recent years, the range and quality of software
available commercially for desktop, personal computers has improved dramati-
cally. Statistical software packages with excellent graphic display facilities are
available from many sources. In addition, modern mathematical software tools
allow the user to develop and experiment with algorithms without the problems
associated with developing machine-specific input/output routines or high
resolution graphic interfaces.
The text is not intended to be an exhaustive review of chemometrics in
spectroscopic analysis. It aims to provide the reader with sufficient detail of
fundamental techniques to encourage further study and exploration, and aid in
dispelling the 'black-box' attitude to much of the software currently employed
in instrumental analysis.
CHAPTER 1
Descriptive Statistics
1 Introduction
The mathematical manipulation of experimental data is a basic operation
associated with all modern instrumental analytical techniques. Computeri-
zation is ubiquitous and the range of computer software available to spectro-
scopists can appear overwhelming. Whether the final result is the determination
of the composition of a sample or the qualitative identification of some species
present, it is necessary for analysts to appreciate how their data are obtained
and how they can be subsequently modified and transformed to generate the
required information. A good starting point in this understanding is the study
of the elements of statistics pertaining to measurement and errors.1-3 Whilst
there is no shortage of excellent books on statistics and their applications in
spectroscopic analysis, no apology is necessary here for the basics to be
reviewed.
Even in those cases where an analysis is qualitative, quantitative measures
are employed in the processes associated with signal acquisition, data extrac-
tion, and data processing. The comparison of, say, a sample's infrared spectrum
with a set of standard spectra contained in a pre-recorded database involves
some quantitative measure of similarity in order to find and identify the best
match. Differences in spectrometer performance, sample preparation methods,
and the variability in sample composition due to impurities will all serve to
make an exact match extremely unlikely. In quantitative analysis the variability
in results may be even more evident. Within-laboratory tests amongst staff and
inter-laboratory round-robin exercises often demonstrate the far from perfect
nature of practical quantitative analysis. These experiments serve to confirm
the need for analysts to appreciate the source of observed differences and to
understand how such errors can be treated to obtain meaningful conclusions
from the analysis.
Quantitative analytical measurements are always subject to some degree of
1 C. Chatfield, 'Statistics for Technology', Chapman and Hall, London, UK, 1976.
2 P.R. Bevington, 'Data Reduction and Error Analysis for the Physical Sciences', McGraw-Hill, New York, USA, 1969.
3 J.C. Miller and J.N. Miller, 'Statistics for Analytical Chemistry', Ellis Horwood, Chichester, UK, 1993.
error. No matter how much care is taken, or how stringent the precautions
followed to minimize the effects of gross errors from sample contamination or
systematic errors from poor instrument calibration, random errors will always
exist. In practice this means that although a quantitative measure of any
variable, be it mass, concentration, absorbance value, etc., may be assumed to
approximate the unknown true value, it is unlikely to be exactly equal to it.
Repeated measurement of the same variable on similar samples will not only
provide discrepancies between the observed results and the true value, but there
will be differences between the measurements themselves. This variability can
be ascribed to the presence of random errors associated with the measurement
process, e.g. instrument generated noise, as well as the natural, random vari-
ation in any sample's characteristics and composition. As more samples are
analysed or more measurements are repeated then a pattern to the inherent
scatter of the data will emerge. Some values will be observed to be too high and
some too low compared with the correct result, if this is known. In the absence
of any bias or systematic error the results will be distributed evenly about the
true value. If the analytical process and repeating measurement exercise could
be undertaken indefinitely, then the true underlying distribution of the data
about the correct or expected value would be obtained. In practice, of course,
this complete exercise is not possible. It is necessary to hypothesize about the
scatter of observed results and assume the presence of some underlying
predictable and well characterized parent distribution. The most common
assumption is that the data are distributed normally.
2 Normal Distribution
The majority of statistical tests, and those most widely employed in analytical
science, assume that observed data follow a normal distribution. The normal,
sometimes referred to as Gaussian, distribution function is the most important
distribution for continuous data because of its wide range of practical applica-
tion. Most measurements of physical characteristics, with their associated
random errors and natural variations, can be approximated by the normal
distribution. The well known shape of this function is illustrated in Figure 1. As
shown, it is referred to as the normal probability curve.1 The mathematical
model describing the normal distribution function with a single measured
variable, x, is given by Equation (1):

f(x) = [1/(σ√(2π))] exp[−(x − μ)²/(2σ²)]     (1)

The height of the curve at some value of x is denoted by f(x), while μ and σ are
characteristic parameters of the function. The curve is symmetric about μ, the
mean or average value, and the spread about this value is given by the variance,
σ², or standard deviation, σ. It is common for the curve to be standardized so
that the area enclosed is equal to unity, in which case f(x) provides the
probability of observing a value within a specified range of x values.
Figure 1 Standardized normal probability curve and characteristic parameters, the mean
and standard deviation
If sets or groups of data of equal size are taken from the parent population
then the mean of each group will vary from group to group and these mean
values form the sampling distribution of x̄. As an example, if the analytical
results provided in Table 1 are divided into five groups, each of eight results,
then the group mean values are 11.05, 11.41, 10.85, 10.85, and 11.04 mg kg-1.
The mean of these values is still 11.04, but the standard deviation of the group
means is 0.23 mg kg-1 compared with 0.78 mg kg-1 for the original 40
observations. The group means are less widely scattered about the mean than the
original data (Figure 2). The standard deviation of group mean values is
referred to as the standard error of the sample mean, σx̄, and is calculated from
Equation (6):

σx̄ = σx/√n     (6)

where σx is the standard deviation of the parent population and n is the number
of observations in each group. It is evident from Equation (6) that the more
observations taken, the smaller the standard error of the mean and the
more accurate the value of the mean. This distribution of sampled mean values
provides the basis for an important concept in statistics. If random samples of
group size n are taken from a normal distribution then the distribution of the
sample means will also be normal. Furthermore, and this is not intuitively
obvious, even if the parent distribution is not normal, provided large sample
sizes (n > 30) are taken, the sampling distribution of the group means will
still be approximately normal.
Figure 2 Group means for the data from Table 1 have a lower standard deviation
(s = 0.23) than the original data (s = 0.78)
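The effect summarized by Equation (6) is easy to check numerically. The Python sketch below draws 40 synthetic observations with the mean and standard deviation quoted in the text (the individual values are illustrative, not those of Table 1), splits them into five groups of eight as in the worked example, and compares the scatter of the group means with s/√n:

```python
import math
import random
import statistics

random.seed(1)
# 40 illustrative observations mimicking the chapter's sodium data
# (mean 11.04, s 0.78 mg/kg); these are NOT the values of Table 1
data = [random.gauss(11.04, 0.78) for _ in range(40)]

# five groups of eight, as in the worked example
groups = [data[i:i + 8] for i in range(0, 40, 8)]
group_means = [statistics.mean(g) for g in groups]

s = statistics.stdev(data)
predicted_se = s / math.sqrt(8)              # Equation (6): s / sqrt(n)
observed_se = statistics.stdev(group_means)  # scatter of the group means
print(round(predicted_se, 2), round(observed_se, 2))
```

With only five group means the observed standard error is itself a noisy estimate, but it sits close to the s/√n prediction, just as the 0.23 vs. 0.78 mg kg-1 comparison in the text illustrates.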
Significance Tests
Having introduced the normal distribution and discussed its basic properties,
we can move on to the common statistical tests for comparing sets of data.
These methods and the calculations performed are referred to as significance
tests. An important feature and use of the normal distribution function is that it
enables areas under the curve, within any specified range, to be accurately
calculated. The function in Equation (1) is integrated numerically and the
results presented in statistical tables as areas under the normal curve. From
these tables, approximately 68% of observations can be expected to lie in the
region bounded by one standard deviation each side of the mean, 95% within
μ ± 2σ, and more than 99% within μ ± 3σ.
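These coverage figures can be reproduced from the error function, which gives the area under the standardized normal curve directly. A short Python sketch of Equation (1) and the one-, two-, and three-sigma areas:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Equation (1): the normal distribution function."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# area within k standard deviations of the mean, via the error function
coverage = {k: math.erf(k / math.sqrt(2)) for k in (1, 2, 3)}
for k, area in coverage.items():
    print(f"within {k} sigma: {100 * area:.1f}%")   # 68.3%, 95.4%, 99.7%
```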
We can return to the data presented in Table 1 for the analysis of the mineral
water. If the parent population parameters, σ and μ₀, are known to be
0.82 mg kg-1 and 10.8 mg kg-1 respectively, then can we answer the question
of whether the analytical results given in Table 1 are likely to have come from a
water sample with a mean sodium level similar to that providing the parent
data? In statistical terminology, we wish to test the null hypothesis that the
means of the sample and the suggested parent population are similar. This is
generally written as
H₀: x̄ = μ₀
i.e. there is no difference between x̄ and μ₀ other than that due to random
variation. The lower the probability that the difference occurs by chance, the
less likely it is that the null hypothesis is true. In order for us to make the
decision whether to accept or reject the null hypothesis, we must declare a value
for the chance of making the wrong decision. If we assume there is less than a 1
in 20 chance of the difference being due to random factors, the difference is
significant at the 5% level (usually written as α = 5%). We are willing to accept
a 5% risk of rejecting the conclusion that the observations are from the same
source as the parent data if they are in fact similar.
The test statistic for such an analysis is denoted by z and is given by

z = (x̄ − μ₀)/(σ/√n)     (8)

x̄ is 11.04 mg kg-1, as determined above, and substituting the values for μ₀
and σ into Equation (8) gives

z = (11.04 − 10.8)/(0.82/√40) = 1.85
The extreme regions of the normal curve containing 5% of the area are
illustrated in Figure 3 and the values can be obtained from statistical tables.
The selected portion of the curve, dictated by our limit of significance, is
referred to as the critical region. If the value of the test statistic falls within this
area then the hypothesis is rejected and there is no evidence to suggest that the
samples come from the parent source. From statistical tables, 2.5% of the area
lies below −1.96σ and 2.5% above +1.96σ. The calculated value for z of 1.85
does not exceed the tabulated z-value of 1.96 and the conclusion is that the
mean sodium concentrations of the analysed samples and the known parent
sample are not significantly different.
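The arithmetic of this z-test is compact enough to express in a few lines. A Python sketch using the figures quoted above:

```python
import math

# parent parameters and sample statistics from the worked example
x_bar, mu0, sigma, n = 11.04, 10.8, 0.82, 40

z = (x_bar - mu0) / (sigma / math.sqrt(n))   # Equation (8)
print(round(z, 2))     # 1.85
print(abs(z) < 1.96)   # True: z does not reach the 5% critical region
```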
In the above example it was assumed that the mean value and standard
deviation of the sodium concentration in the parent sample were known. In
practice this is rarely possible as all the mineral water from the source would
not have been analysed and the best that can be achieved is to obtain recorded
estimates of μ and σ from repetitive sampling. Both the recorded mean value
and the standard deviation will undoubtedly vary and there will be a degree of
uncertainty in the precise shape of the parent normal distribution curve. This
uncertainty, arising from the use of sampled data, can be compensated for by
using a probability distribution with a wider spread than the normal curve. The
most common such distribution used in practice is Student's t-distribution. The
t-distribution curve is of a similar form to the normal function. As the number
of samples selected and analysed increases, the two functions become
increasingly similar. Using the t-distribution, the well known t-test can be
performed to establish the likelihood that a given sample is a member of a
specified parent population.
Figure 3 Areas under the normal curve and z values for some critical regions
The test statistic, analogous to Equation (8) but using the sample standard
deviation, is

t = (x̄ − μ₀)/(s/√n)     (10)

where x̄ and s are our calculated estimates of the sample mean and standard
deviation, respectively. From standard tables, for 39 degrees of freedom, n − 1,
and with a 5% level of significance the value of t is given as 1.68. From
Equation (10), t = 4.38, which exceeds the tabulated value of t and thus lies in
the critical region of the t-curve. Our conclusion is that the samples are unlikely
to arise from a source with a mean sodium level of 10.5 mg kg-' or less, leaving
the alternative hypothesis that the sodium concentration of the parent source is
greater than this.
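The same calculation in Python, using the sample statistics quoted above and the hypothesized parent mean of 10.5 mg kg-1:

```python
import math

x_bar, mu0, s, n = 11.04, 10.5, 0.78, 40

t = (x_bar - mu0) / (s / math.sqrt(n))   # Equation (10)
print(round(t, 2))   # 4.38
print(t > 1.68)      # True: in the one-tailed critical region (39 df, 5%)
```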
The t-test can also be employed in comparing statistics from two different
samples or analytical methods rather than comparing, as above, one sample
against a parent population. The calculation is only a little more elaborate,
involving the standard deviations of both data sets. Suppose the results
from the analysis of a second day's batch of 40 samples of water give a mean
value of 10.9 mg kg-1 and a standard deviation of 0.83 mg kg-1. Are the mean
sodium levels from this set and the data in Table 1 similar, and could the
samples come from the same parent population?
For this example the t-test takes the form

t = (x̄₁ − x̄₂)/(sp√(1/n₁ + 1/n₂))     (11)

sp² = [(n₁ − 1)s₁² + (n₂ − 1)s₂²]/(n₁ + n₂ − 2)     (12)

where s₁ and s₂ are the standard deviations of the two sets of data and sp is
the pooled estimate of the common standard deviation.
Substituting the experimental values in Equations (11) and (12) provides a
t-value of 0.78. Accepting once again a 5% level of significance, the tabulated
value of t for 38 degrees of freedom and a = 0.025 is 2.02. (Since the mean of
one set of data could be significantly higher or lower than the other, an a value
of 2.5% is chosen to give a combined 5% critical region, a so-called two-tailed
application.) As the calculated t-value is less than the tabulated value then there
is no evidence to suggest that the samples came from populations having
different means. Hence, we accept that the samples are similar.
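A sketch of the two-sample calculation, assuming the pooled-variance form of the t statistic (which reproduces the t-value of 0.78 quoted above):

```python
import math

x1, s1, n1 = 11.04, 0.78, 40   # first batch (Table 1)
x2, s2, n2 = 10.90, 0.83, 40   # second day's batch

# pooled standard deviation, Equations (11) and (12); equal variances assumed
sp = math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))
t = (x1 - x2) / (sp * math.sqrt(1 / n1 + 1 / n2))
print(round(t, 2))   # 0.78
```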
The t-test is widely used in analytical laboratories for comparing samples and
methods of analysis. Its application, however, relies on three basic
assumptions. Firstly, it is assumed that the samples analysed are selected at random.
This condition is met in most cases by careful design of the sampling procedure.
The second assumption is that the parent populations from which the samples
are taken are normally distributed. Fortunately, departure from normality
rarely causes serious problems providing sufficient samples are analysed.
Finally, the third assumption is that the population variances are equal. If this
last criterion is not valid then errors may arise in applying the t-test and this
assumption should be checked before other tests are applied. The equality of
variances can be examined by application of the F-test.
The F-test is based on the F-probability distribution curve and is used to test
the equality of variances obtained by statistical sampling. The distribution
describes the probabilities of obtaining specified ratios of sample variance from
the same parent population. Starting with a normal distribution with variance
σ², if two random samples of sizes n₁ and n₂ are taken from this population and
the sample variances, s₁² and s₂², are calculated, then the quotient s₁²/s₂²
will be close to unity if the sample sizes are large. By taking repeated pairs
of samples and plotting the ratio, F = s₁²/s₂², the F-distribution curve is
obtained.
In comparing sample variances, the ratio s₁²/s₂² for the two sets of data is
computed and the probability assessed, from F-tables, of obtaining by chance
that specific value of F from two samples arising from a single normal popu-
lation. If it is unlikely that this ratio could be obtained by chance, then this is
taken as indicating that the samples arise from different parent populations
with different variances.
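The ratio itself is trivial to compute. As an illustration (the book's own worked figures for this step are not reproduced here), the standard deviations of the two mineral-water batches met earlier, 0.78 and 0.83 mg kg-1, give:

```python
s1, s2 = 0.83, 0.78   # the two batch standard deviations (mg/kg)

# conventionally the larger variance is placed in the numerator, so F >= 1
F = max(s1, s2) ** 2 / min(s1, s2) ** 2
print(round(F, 2))   # 1.13
```

A ratio this close to unity would be compared against the tabulated F critical value for the appropriate degrees of freedom before drawing any conclusion.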
A simple application of the F-test can be illustrated by examining the mineral
water data in the previous examples for equality of variance.
The F-ratio is given by F = s₁²/s₂², with the larger variance placed in the
numerator.
Analysis of Variance
The tests and examples discussed above have concentrated on the statistics
associated with a single variable and comparing two samples. When more
samples are involved a new set of techniques is used, the principal methods
being concerned with the analysis of variance. Analysis of variance plays a
major role in statistical data analysis and many texts are devoted to the
subject.4-8 Here, we will only discuss the topic briefly and illustrate its use in a
simple example.
Consider an agricultural trial site sampled to provide six soil samples which
are subsequently analysed colorimetrically for phosphate concentration. The
task is to decide whether the phosphate content is the same in each sample.
4 H.L. Youmans, 'Statistics for Chemists', J. Wiley, New York, USA, 1973.
5 G.E.P. Box, W.G. Hunter, and J.S. Hunter, 'Statistics for Experimenters', J. Wiley, New York, USA, 1978.
6 D.L. Massart, A. Dijkstra, and L. Kaufman, 'Evaluation and Optimisation of Laboratory Methods and Analytical Procedures', Elsevier, London, UK, 1978.
7 L. Davies, 'Efficiency in Research, Development and Production: The Statistical Design and Analysis of Chemical Experiments', The Royal Society of Chemistry, Cambridge, UK, 1993.
8 M.J. Adams, in 'Practical Guide to Chemometrics', ed. S.J. Haswell, Marcel Dekker, New York,
USA, 1992, p. 181.
Table 3 Concentration of phosphate (mg kg-'), determined colorimetrically, in
five sub-samples of soils from six field sites
Table 4 Commonly used table layout for the analysis of variance (ANOVA) and
calculation of the F-value statistic
A common problem with this type of data analysis is the need to separate the
within-sample variance, i.e. the variation due to sample inhomogeneity and
analytical errors, from the variance which exists due to differences between the
phosphate content in the samples. The experimental procedure is likely to
proceed by dividing each sample into sub-samples and determining the phos-
phate concentration of each sub-sample. This process of analytical replication
serves to provide a means of assessing the within-sample variations due to
experimental error. If this is observed to be large compared with the variance
between the samples it will obviously be difficult to detect the differences
between the six samples. To reduce the chance of introducing a systematic error
or bias in the analysis, the sub-samples are randomized. In practice, this means
that the sub-samples from all six samples are analysed in a random order and
the experimental errors are confounded over all replicates. The analytical data
using this experimental scheme is shown in Table 3. The similarity of the six soil
samples is then assessed by the statistical techniques referred to as one-way
analysis of variance. Such a statistical analysis of the data is most easily
performed using an ANOVA (ANalysis Of VAriance) table as illustrated in
Table 4.
The total variation in the data can be partitioned between the variation
amongst the sub-samples and the variation within the sub-samples. The compu-
tation proceeds by determining the sum of squares for each source of variation
and then the variances.
The total variance for all replicates of all samples analysed is given, from
Equation (3), by

s² = Σi Σj (xij − x̄)²/(N − 1)     (15)

where xij is the ith replicate of the jth sample and x̄ is the grand mean of all
the results. The total number of analyses is denoted by N, which is equal to the
number of replicates per sample, n, multiplied by the number of samples, m. The
numerator in Equation (15) is the sum of squares for the total variation, SST,
and can be rearranged to simplify calculations:

SST = Σi Σj xij² − (Σi Σj xij)²/N
For the soil phosphate data, the completed ANOVA table is shown in
Table 5.
Once the F-test value has been calculated it can be compared with standard
tabulated values, using some pre-specified level of significance to check whether
it lies in the critical region. If it does not, then there is no evidence to suggest
that the samples arise from different sources and the hypothesis that all the
values are similar can be accepted. From statistical tables, F(0.01; 5, 24) = 3.90, and
since the experimental value of 1.69 does not exceed this then the result is not
significant at the 1% level and we can accept the hypothesis that there is no
difference between the six sets of sub-samples.
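The partitioning of the sums of squares can be sketched in a few lines of Python. Table 3's values are not reproduced here, so the data below are hypothetical stand-ins, but the layout (six samples, five replicates) and hence the degrees of freedom, 5 and 24, match the worked example:

```python
import random
import statistics

random.seed(0)
# six hypothetical soil samples, five replicate phosphate results each
# (illustrative values only; Table 3's data are not reproduced here)
samples = [[random.gauss(30.0, 2.0) for _ in range(5)] for _ in range(6)]

N = sum(len(s) for s in samples)                      # 30 determinations
grand_mean = statistics.mean([x for s in samples for x in s])

# partition the total sum of squares between and within the samples
ss_between = sum(len(s) * (statistics.mean(s) - grand_mean) ** 2 for s in samples)
ss_within = sum((x - statistics.mean(s)) ** 2 for s in samples for x in s)

df_between = len(samples) - 1   # 5
df_within = N - len(samples)    # 24
F = (ss_between / df_between) / (ss_within / df_within)
print(df_between, df_within, round(F, 2))
```

The two sums of squares add up to the total sum of squares in Equation (15)'s numerator, which is the partitioning the ANOVA table makes explicit.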
The simple one-way analysis of variance discussed above can indicate the
relative magnitude of differences in variance but provides no information as to
which of the samples differ.
Outliers
The suspected presence of rogue values or outliers in a data set always causes
problems for the analyst. Not only must we be able to detect them, but some
systematic and reliable procedure for reducing their effect or eliminating them
may need to be implemented. Methods for detecting outliers depend on the
nature of the data as well as the data analysis being performed. For the present,
two commonly employed methods will be discussed briefly.
The first method is Dixon's Q-test.9 The data points are ranked and the
difference between a suspected outlier and the observation closest to it is
compared to the total range of measurements. This ratio is the Q-value. As with
the t-test, if the computed Q-value is greater than tabulated critical values for
some pre-selected level of significance, then the suspect data value can be
identified as an outlier and may be rejected.
Use of this test can be illustrated with reference to the data in Table 6, which
shows ten replicate measures of the molar absorptivity of nitrobenzene at
252 nm, its wavelength of maximum absorbance. Can the value of
ε = 1056 mol-1 m2 be classed as an outlier? As defined above,

Q = (suspect value − nearest value)/(largest value − smallest value)
For a sample size of 10, and with a 5% level of significance, the critical value
of Q, from tables, is 0.464. The calculated Q-value exceeds this critical value,
and therefore this point may be rejected from subsequent analysis. If necessary,
the remaining data can be examined for further suspected outliers.
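Both outlier checks are easily coded. The replicate values below are hypothetical stand-ins for Table 6 (only the suspect value, 1056, is taken from the text), so the numbers illustrate the procedure rather than reproduce the book's worked figures:

```python
import statistics

def dixon_q(values):
    """Q for the most extreme point: gap to its nearest neighbour / total range."""
    v = sorted(values)
    rng = v[-1] - v[0]
    return max((v[1] - v[0]) / rng, (v[-1] - v[-2]) / rng)

# hypothetical replicate absorptivities; 1056 plays the suspect high value
data = [1000, 1002, 1003, 1005, 1006, 1007, 1008, 1009, 1011, 1056]

q = dixon_q(data)
print(round(q, 3), q > 0.464)   # 0.804 True: reject at the 5% level (n = 10)

# residual check: compare the largest residual with 4x the residual st. dev.
mean = statistics.mean(data)
residuals = [x - mean for x in data]
limit = 4 * statistics.stdev(residuals)
print(max(abs(r) for r in residuals) > limit)   # False: survives this rule
```

Note that, as in the text, the two criteria can disagree: the Q-test rejects this point while the 4-sigma residual rule does not.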
A second method involves the examination of residuals.10 A residual is
defined as the difference between an observed value and some expected,
predicted or modelled value. If the suspect datum has a residual greater than,
say, 4 times the residual standard deviation computed from all data, then it may
be rejected. For the data in Table 6, the expected value is the mean of the ten
results and the residuals are the differences between each value and this mean.
The standard deviation of these residuals is 14.00 and the residual for the
suspected outlier, 49, is certainly less than 4 times this value; hence, this
value would not be rejected by the residual criterion.
9 S.J. Haswell, in 'Practical Guide to Chemometrics', ed. S.J. Haswell, Marcel Dekker, New York,
USA, 1992, p. 5.
10 R.G. Brereton, 'Chemometrics', Ellis Horwood, Chichester, UK, 1990.
Table 6 Molar absorptivity values for nitrobenzene measured at 252 nm
3 Lorentzian Distribution
Our discussions so far have been limited to assuming a normal, Gaussian
distribution to describe the spread of observed data. Before proceeding to
extend this analysis to multivariate measurements, it is worthwhile pointing out
that other continuous distributions are important in spectroscopy. One distri-
bution which is similar, but unrelated, to the Gaussian function is the Lorentz-
ian distribution. Sometimes called the Cauchy function, the Lorentzian distri-
bution is appropriate when describing resonance behaviour, and it is commonly
encountered in emission and absorption spectroscopies. This distribution for a
single variable, x, is defined by

f(x) = (1/π)(w½/2)/[(x − μ)² + (w½/2)²]

where μ is the mean and w½ the
half-width. The standard deviation is not defined for the Lorentzian
distribution because of its slowly decreasing behaviour at large deviations from
the mean. Instead, the spread is denoted by w½, defined as the full-width at half
maximum height. Figure 4 illustrates the comparison between the normal and
Lorentzian shapes. We shall meet the Lorentzian function again in subsequent
chapters.
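The comparison of Figure 4 is easy to reproduce numerically; the sketch below (function and parameter names are our own) evaluates both line shapes and shows the Lorentzian's much slower 1/x² tail decay:

```python
import math

def lorentzian(x, mu=0.0, w=1.0):
    """Cauchy/Lorentzian line shape with full-width at half maximum w."""
    hw = w / 2.0   # half-width at half maximum
    return (1.0 / math.pi) * hw / ((x - mu) ** 2 + hw ** 2)

def gaussian(x, mu=0.0, sigma=1.0):
    """Equation (1): the normal (Gaussian) line shape."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# far from the peak the Lorentzian decays much more slowly than the Gaussian
for x in (0.0, 2.0, 5.0):
    print(x, round(gaussian(x), 6), round(lorentzian(x), 6))
```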
4 Multivariate Data
To this point, the data analysis procedures discussed have been concerned with
a single measured variable. Although the determination of a single analyte
constitutes an important part of analytical science, there is increasing emphasis
being placed on multi-component analysis and using multiple measures in data
analysis. The problems associated with manipulating and investigating multiple
measurements on one or many samples constitutes that branch of applied
statistics known as multivariate analysis, and this forms a major subject in
chemometrics.11-13
Consideration of the results from a simple multi-element analysis will serve
to illustrate terms and parameters associated with the techniques used. This
example will also introduce some features of matrix operators basic to handling
multivariate data.14 In the scientific literature, matrix representation of multi-
11 B.F.J. Manly, 'Multivariate Statistical Analysis: A Primer', Chapman and Hall, London, UK,
1991.
12 A.A. Afifi and V. Clark, 'Computer-Aided Multivariate Analysis', Lifetime Learning, California, USA, 1984.
13 B. Flury and H. Riedwyl, 'Multivariate Statistics, A Practical Approach', Chapman and Hall,
London, UK, 1988.
14 M.J.R. Healy, 'Matrices for Statistics', Oxford University Press, Oxford, UK, 1986.
Table 7 Results from the analysis of mineral water samples by atomic absorption
spectrometry. Expressed as a data matrix, each column represents a variate and
each row a sample or object
Sample    Na (mg kg-1)
1         10.8
2         7.1
3         14.1
4         17.0
5         5.7
6         11.3
Mean      11.0
Variance  17.8
variate statistics is common. For those readers unfamiliar with the basic matrix
operations, or those who wish to refresh their memory, the Appendix provides a
summary and overview of elementary and common matrix operations.
The data shown in Table 7 comprise a portion of a multi-element analysis of
mineral water samples. The data from such an analysis can conveniently be
arranged in an n by m array, where n is the number of objects, or samples, and m
is the number of variables measured. This array is referred to as the data matrix
and the purpose of using matrix notation is to allow us to handle arrays of data
as single entities rather than having to specify each element in the array every
time we perform an operation on the data set. Our data matrix can be denoted
by the single symbol X and each element by xij, with the subscripts i and j
indicating the number of the row and column respectively. A matrix with only
one row is termed a row vector, e.g., r , and with only one column, a column
vector, e.g., c.
Each measure of an analysed variable, or variate, may be considered
independent. By summing elements of each column vector the mean and
standard deviation for each variate can be calculated (Table 7). Although these
operations reduce the size of the data set to a smaller set of descriptive statistics,
much relevant information can be lost. When performing any multivariate data
analysis it is important that the variates are not considered in isolation but are
combined to provide as complete a description of the total system as possible.
Interaction between variables can be as important as the individual mean values
and the distributions of the individual variates. Variables which exhibit no
interaction are said to be statistically independent, as a change in the value in
one variable cannot be predicted by a change in another measured variable. In
many cases in analytical science the variates are not statistically independent,
and some measure of their interaction is required in order to interpret the data
and characterize the samples. The degree or extent of this interaction between
variables can be estimated by calculating their covariances, the subject of the
next section.
s² = Σ_{i=1}^{n} (x_i − x̄)²/(n − 1)   (21)

s² = x^T·x/(n − 1)   (22)

with x^T denoting the transpose of the (mean-centred) column vector x to form a row vector
(see Appendix). The numerator in Equations (21) and (22) is the corrected sum
of squares of the data (corrected by subtracting the mean value and referred to
as mean centring). To calculate covariance, the analogous quantity is the
corrected sum of products, SP, which is defined by

SP_jk = Σ_{i=1}^{n} (x_ij − x̄_j)(x_ik − x̄_k)   (23)

where x_ij is the ith measure of variable j, i.e. the value of variable j for
object i, x_ik is the ith measure of variable k, and SP_jk is the corrected sum
of products between variables j and k. Note that in the special case where
j = k, Equation (23) gives the sum of squares as used in Equation (3).
Sums of squares and products are basic to many statistical techniques and
Equation (23) can be simply expressed, using the matrix form, as

SP = X_d^T·X_d   (24)

where X_d represents the data matrix after subtracting the column, i.e. variate,
means. The calculation of variance is completed by dividing by (n − 1) and
covariance is similarly obtained by dividing each element of the matrix SP by
(n − 1).
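In matrix notation the whole computation takes only a few lines. A minimal numpy sketch, using a small hypothetical 4 × 2 data matrix standing in for the Table 7 values (which are not reproduced here), might look like:

```python
import numpy as np

# Hypothetical data matrix X: 4 samples (rows) by 2 variates (columns),
# standing in for the mineral-water data of Table 7.
X = np.array([[10.0, 1.2],
              [12.0, 1.5],
              [11.0, 1.3],
              [13.0, 1.6]])

n = X.shape[0]
Xd = X - X.mean(axis=0)   # mean-centre each column (X_d in Equation 24)
SP = Xd.T @ Xd            # corrected sums of squares and products
cov = SP / (n - 1)        # variance-covariance matrix

# The result matches numpy's built-in estimator, which also divides by n - 1.
assert np.allclose(cov, np.cov(X, rowvar=False))
```

The diagonal of `cov` holds the variances of the individual variates and the off-diagonal elements the covariances, exactly as described for Table 9.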
The steps involved in the algebraic calculation of the covariance between
sodium and potassium concentrations from Table 7 are shown in Table 8. The
complete variance-covariance matrix for our data is given in Table 9.
For the data the variance-covariance matrix, COV, is square, the number of
rows and number of columns are the same, and the matrix is symmetric. For a
symmetric matrix, x_ij = x_ji, and some pairs of entries are duplicated. The
covariance between, say, sodium and potassium is identical to that between
potassium and sodium. The variance-covariance matrix is said to have diagonal
symmetry, with the diagonal elements being the variances of the individual
variates.
Table 8 Calculation of covariance between sodium and potassium concentrations
In Figure 5(a) a scatter plot of the concentration of sodium vs. the concen-
tration of potassium, from Table 7, is illustrated. It can be clearly seen that the
two variates have a high interdependence, compared with magnesium vs.
potassium concentration, Figure 5(b). Just as the absolute value of variance is
influenced by the units of measurement, so covariance is similarly affected. To
estimate the degree of interrelation between variables, free from the effects of
measurement units, the correlation coefficient can be employed. The linear
correlation coefficient, r_jk, between two variables j and k is defined by

r_jk = cov_jk/(s_j·s_k)   (25)

As the value for covariance can equal but never exceed the product of the
standard deviations, values for r range from −1 to +1. The complete corre-
lation matrix for the elemental data is presented in Table 10.
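The same idea in code: a short numpy sketch, again on hypothetical data, computing r_jk = cov_jk/(s_j·s_k) and checking the result against numpy's built-in estimator:

```python
import numpy as np

# Hypothetical bivariate data (sodium- and potassium-like columns).
X = np.array([[10.0, 1.2],
              [12.0, 1.5],
              [11.0, 1.3],
              [13.0, 1.6]])

cov = np.cov(X, rowvar=False)   # variance-covariance matrix
s = np.sqrt(np.diag(cov))       # standard deviations of each variate
r = cov / np.outer(s, s)        # r_jk = cov_jk / (s_j * s_k), Equation (25)

assert np.allclose(r, np.corrcoef(X, rowvar=False))
assert np.all(np.abs(r) <= 1 + 1e-12)   # r is bounded by +/-1
```

Dividing by the standard deviations removes the measurement units, which is exactly why the correlation matrix, unlike the covariance matrix, is comparable across analytes.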
Table 10 Correlation matrix for the analytes in Table 7. The matrix is
symmetric about the diagonal and values lie in the range −1 to +1
Figure 6 Scatter plots for bivariate data with various values of correlation
coefficient, r. Least-squares best-fit lines are also shown. Note that
correlation is only a measure of linear dependence between variates
from A. This is certainly not the case for samples from source B, and the graph
suggests that a higher order, possibly quadratic, model would be better. For samples from
source C, a potential outlier has reduced an otherwise excellent linear corre-
lation, whereas for source D there is no evidence of any relationship between
chromium and nickel but an outlier has given rise to a high correlation
coefficient. To repeat the earlier warning, always visually examine the data
before proceeding with any manipulation.
Figure 7 Scatter plots of the concentrations of chromium (mg kg-1) vs. nickel
(mg kg-1) from four waste water sources (A, B, C, and D), from Table 2
Multivariate Normal
In much the same way as the more common univariate statistics assume a
normal distribution of the variable under study, so the most widely used
multivariate models are based on the assumption of a multivariate normal
distribution for each population sampled. The multivariate normal distribution
is a generalization of its univariate counterpart and its equation in matrix
notation is
f = 1/[(2π)^{m/2}·|COV_x|^{1/2}] · exp[−½(x − μ)^T·COV_x^{−1}·(x − μ)]   (26)
The representation of this equation for anything greater than two variates is
difficult to visualize, but the bivariate form (m = 2) serves to illustrate the
general case. The exponential term in Equation (26) is of the form xTAx and is
known as a quadratic form of a matrix product (Appendix A). Although the
mathematical details associated with the quadratic form are not important for
us here, one important property is that they have a well known geometric
interpretation. All quadratic forms that occur in chemometrics and statistical
data analysis expand to produce a quadratic surface that is a closed ellipse.
Just as the univariate normal distribution appears bell-shaped, so the bivariate
normal distribution is elliptical.
For two variates, x1 and x2, the mean vector and variance-covariance matrix
are defined in the manner discussed above,

μ = [μ1, μ2]^T,   COV_x = [σ1², σ12; σ12, σ2²]

where μ1 and μ2 are the means of x1 and x2 respectively, σ1² and σ2² are their
variances, and σ12 = σ21 is the covariance between x1 and x2. Figure 8
illustrates some bivariate normal distributions, and the contour plots show the
lines of equal probability about the bivariate mean, i.e. lines that connect points
having equal probability of occurring. The contour diagrams of Figure 8 may
be compared to the correlation plots presented previously. As the covariance,
σ12, increases in a positive manner from zero, so the association between the
variates increases and the spread is stretched, because the variables serve to act
together. If the covariance is negative then the distribution moves in the other
direction.
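Equation (26) is straightforward to evaluate directly. The sketch below implements the density with plain numpy (the mean vector and covariance values are invented for illustration) and checks the value at the mean, where the quadratic form vanishes and only the normalizing constant remains:

```python
import numpy as np

def mvn_pdf(x, mu, cov):
    """Multivariate normal density of Equation (26), numpy only."""
    m = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.inv(cov) @ diff   # the quadratic form x^T A x
    norm = (2 * np.pi) ** (m / 2) * np.sqrt(np.linalg.det(cov))
    return np.exp(-0.5 * quad) / norm

# Hypothetical bivariate case: positive covariance stretches the ellipse.
mu = np.array([0.0, 0.0])
cov = np.array([[1.0, 0.6],
                [0.6, 1.0]])

# At the mean the exponential term is 1, so the density equals the
# normalizing constant alone.
peak = mvn_pdf(mu, mu, cov)
assert np.isclose(peak, 1.0 / (2 * np.pi * np.sqrt(np.linalg.det(cov))))
```

Evaluating `mvn_pdf` over a grid and drawing contours of equal density reproduces the elliptical probability contours of Figure 8.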
5 Displaying Data
As our discussions of population distributions and basic statistics have pro-
gressed, the use of graphical methods to display data can be seen to play an
important role in both univariate and multivariate analysis. Suitable data plots
can be used to display and describe both raw data, i.e. original measures, and
transformed or manipulated data. Graphs can aid in data analysis and inter-
pretation, and can serve to summarize final results." The use of diagrams may
help to reveal patterns in the data which may not be obvious from tabulated
results. With most computer-based data analysis packages the graphics support
15 J.M. Thompson, in 'Methods for Environmental Data Analysis', ed. C.N. Hewitt, Elsevier
Applied Science, London, UK, 1992, p. 213.
Figure 8 Bivariate normal distributions as probability contour plots for data
having different covariance relationships
can provide a valuable interface between the user and the experimental data.
The construction and use of graphical techniques to display univariate and
bivariate data are well known. The common calibration graph or analytical
working curve, relating, for example, measured absorbance to sample con-
centration, is ubiquitous in analytical science. No spectroscopist would
welcome the sole use of tabulated spectra without some graphical display of the
spectral pattern. The display of data obtained from more than two variables,
however, is less common and a number of ingenious techniques and methods
have been proposed and utilized to aid in the visualization of such multivariate
data sets. With three variables a three-dimensional model of the data can be
constructed and several graphical computer packages are available to assist in
the design of three-dimensional plots.16 In practice, the number of variables
examined may well be in excess of two or three and less familiar and less direct
techniques are required to display the data. Such techniques are generally
referred to as mapping methods as they attempt to represent a many-
dimensional data set in fewer dimensions.

Figure 9 Three-dimensional plot of zinc, tin, and iron data from Table 11

Table 11 (samples vs. variables, % by weight: tin, zinc, iron, and nickel)
records the analysis of nine alloys for four elements. The concentration of three analytes,
zinc, tin, and iron, are displayed. It is immediately apparent from the illustra-
tion that the samples fall into one of two groups, with one sample lying between
the groups. This pattern in the data is more readily seen in the graphical display
than from the tabulated data.
This style of representation is limited to three variables and even then the
diagrams can become confusing, particularly if there are a lot of points to plot.
One method for graphically representing multivariate data ascribes each vari-
able to some characteristic of a cartoon face. These Chernoff faces have been
used extensively in the social sciences and adaptations have appeared in the
analytical chemistry literature. Figure 10 illustrates the use of Chernoff faces to
represent the data from Table 11. The size of the forehead is proportional to tin
concentration, the lower face to zinc level, eyebrows to nickel, and mouth shape
to iron concentration. As with the three-dimensional scatter plot, two groups
can be seen, samples 1, 2, 3, and 9, and samples 4, 5, 6, and 8, with sample 7
displaying characteristics from both groups.
Star-plots present an alternative means of displaying the same data (Figure
11), with each ray size proportional to individual analyte concentrations.
A serious drawback with multidimensional representation is that visually
some characteristics are perceived as being of greater importance than others
and it is necessary to consider carefully the assignment of the variable to the
graph structure. In scatter plots, the relationships between the horizontal
co-ordinates can be more obvious than those for the higher-dimensional data
on a vertical axis. It is usually the case, therefore, that as well as any strictly
analytical reason for reducing the dimensionality of data, such simplification
can aid in presenting multidimensional data sets. Thus, principal components
and principal co-ordinates analysis are frequently encountered as graphical
aids as well as for their importance in numerically extracting information from
data. It is important to realize, however, that reduction of dimensionality can
lead to loss of information. Two-dimensional representation of multivariate
data can hide structure as well as aid in the identification of patterns.
16 A.F. Carley and P.H. Morgan, 'Computational Methods in the Chemical Sciences', Ellis Horwood, Chichester, UK, 1989.
17 W.H. Press, B.P. Flannery, S.A. Teukolsky, and W.T. Vetterling, 'Numerical Recipes', Cambridge University Press, Cambridge, UK, 1987.
18 J. Zupan, 'Algorithms for Chemists', Wiley, New York, USA, 1989.
19 J.C. Davis, 'Statistics and Data Analysis in Geology', J. Wiley and Sons, New York, USA, 1973.
CHAPTER 2
Acquisition and Enhancement of Data
1 Introduction
In the modern spectrochemical laboratory, even the most basic of instruments
is likely to be microprocessor controlled, with the signal output digitized. Given
this situation, it is necessary for analysts to appreciate the basic concepts
associated with computerized data acquisition and signal conversion to the
digital domain. After all, digitization of the analytical signal may represent one
of the first stages in the data acquisition and manipulation process. If this is
incorrectly carried out then subsequent processing may not be worthwhile. The
situation is analogous to that of analytical sampling. If a sample is not
representative of the parent material, then no matter how good the chemistry or
the analysis, the results may be meaningless or misleading.
The detectors and sensors commonly used in spectrometers are analogue
devices; the signal output represents some physical parameter, e.g. light inten-
sity, as a continuous function of time. In order to process such data in the
computer, the continuous, or analogue, signal must be digitized to provide a
series of numeric values equivalent to and representative of the original signal.
An important parameter to be selected is how fast, or at what rate, the input
signal should be digitized. One answer to the problem of selecting an appro-
priate sampling rate would be to digitize the signal at as high a rate as possible.
With modern high-speed, analogue-to-digital converters, however, this would
produce so much data that the storage capacity of the computer would soon be
exceeded. Instead, it is preferred that the number of values recorded is limited.
The analogue signal is digitally and discretely sampled, and the rate of sampling
determines the accuracy of the digital representation as a time discrete function.
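The hazard of choosing too low a sampling rate, aliasing, can be demonstrated in a few lines. In this sketch the frequencies and duration are arbitrary choices: a cosine above the Nyquist limit is indistinguishable, at the sampled points, from a much lower-frequency cosine.

```python
import numpy as np

fs = 10.0                      # sampling frequency, Hz
t = np.arange(0, 2, 1 / fs)    # 2 s of discretely sampled time points

# A 9 Hz cosine lies above the Nyquist limit (fs/2 = 5 Hz); sampled at
# 10 Hz it is aliased down and its samples coincide with a 1 Hz cosine.
x_high = np.cos(2 * np.pi * 9.0 * t)
x_alias = np.cos(2 * np.pi * 1.0 * t)

assert np.allclose(x_high, x_alias)
```

Once digitized, no processing step can distinguish the two signals, which is why the sampling rate must be fixed with the highest frequency of interest in mind.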
2 Sampling Theory
Figure 1 illustrates a data path in a typical ratio-recording, dispersive infrared
spectrometer.1 The digitization of the analogue signal produced by the detector
1 M.A. Ford, in 'Computer Methods in UV, Visible and IR Spectroscopy', ed. W.O. George and H.A. Willis, The Royal Society of Chemistry, Cambridge, UK, 1990, p. 1.
2 A.V. Oppenheim and A.S. Willsky, 'Signals and Systems', Prentice-Hall, New Jersey, USA, 1983.
Figure 2 A schematic of the digital sampling process: (a) A signal, x_t, is
multiplied by a train of pulses, p_t, producing the signal x_tp; (b) The
analytical signal, x_t; (c) The carrier signal, p_t; (d) The resultant sampled
signal is a train of pulses with amplitudes limited by x_t
(Reproduced by permission from ref. 2)
Figure 3 Sampling in the frequency domain: (a) Modulated signal, x_t, has
frequency spectrum x_f; (b) Harmonics of the carrier signal; (c) Spectrum of
modulated signal is a repetitive pattern of x_f, and x_f can be completely
recovered by low-pass filtering using, for example, a box filter with cut-off
frequency f_c; (d) Too low a sampling frequency produces aliasing, overlapping
of frequency patterns
(Reproduced by permission from ref. 2)
3 Signal-to-Noise Ratio
The spectral information used in an analysis is encoded as an electrical signal
from the spectrometer. In addition to desirable analytical information, such
signals contain an undesirable component termed noise which can interfere
with the accurate extraction and interpretation of the required analytical data.
There are numerous sources of noise that arise from instrumentation, but
briefly the noise will comprise flicker noise, interference noise, and white noise.
These classes of noise signals are characterized by their frequency distribution.
Flicker noise is characterized by a frequency power spectrum that is more
pronounced at low frequencies than at high frequencies. This is minimized in
instrumentation by modulating the carrier signal and using a.c. detection and
a.c. signal processing, e.g. lock-in amplifiers. Interference from power supplies
may also add noise to the signal. Such noise is usually confined to specific
frequencies about 50 Hz, or 60 Hz, and their harmonics. By employing modu-
lation frequencies well away from the power line frequency, interference noise
can be reduced, and minimized further by using highly selective, narrow-
bandpass electronic filters. White noise is more difficult to eliminate since it is
random in nature, occurring at all frequencies in the spectrum. It is a funda-
mental characteristic of all electronic instruments. In recording a spectrum,
complete freedom from noise is an ideal that can never be realized in practice.
The noise associated with a recorded signal has a profound effect in an analysis
and one figure of merit used to describe the quality of a measurement is the
signal-to-noise ratio, S/N, which is defined as

S/N = mean signal magnitude / rms noise   (2)

The rms (root mean square) noise is the square root of the average squared
deviation of the signal, x_i, from the mean value, i.e.

rms noise = √[Σ(x̄ − x_i)²/n]   (3)

This equation should be recognized as equating rms noise with the standard
deviation of the noise signal, σ. S/N can, therefore, be defined as x̄/σ.
In spectrometric analysis S/N is usually measured in one of two ways. The
first technique is repeatedly to sample and measure the analytical signal and
determine the mean and standard deviation using Equation (3). Where a chart
recorder output is available, then a second method may be used. Assuming the
noise is random and normally distributed about the mean, it is likely that 99%
of the random deviations in the recorded signal will lie within ±2.5σ of the
mean value. By measuring the peak-to-peak deviation of the signal and dividing
by 5, an estimate of the rms noise is obtained. The use of this method is
illustrated in Figure 4. Whichever method is used, the signal should be sampled
for sufficient time to allow a reliable estimate of the standard deviation to be
made. When measuring S/N it is usually assumed that the noise is independent
of signal magnitude for small signals close to the baseline or background signal.
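Both ways of estimating S/N can be sketched on simulated data; the mean level, noise level, and number of points below are arbitrary assumptions, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated baseline-region trace: mean level 10 with white noise, sigma = 1.
signal = 10.0 + rng.normal(0.0, 1.0, 100)

# Method 1: repeated sampling -- mean divided by standard deviation.
snr_std = signal.mean() / signal.std(ddof=1)

# Method 2: chart-recorder style -- peak-to-peak excursion divided by 5,
# since about 99% of a normal distribution lies within +/-2.5 sigma.
rms_est = (signal.max() - signal.min()) / 5.0
snr_pp = signal.mean() / rms_est

# Both estimates should land near the true value of 10/1 = 10.
assert 7 < snr_std < 13
assert 5 < snr_pp < 16
```

The peak-to-peak method is convenient but rough: with very long records the observed extremes wander beyond ±2.5σ, so the estimate degrades as more points are inspected.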
Noise, as well as affecting the appearance of a spectrum, influences the
sensitivity of an analytical technique and for quantitative analysis the S/N ratio
Figure 4 Amplified trace of an analytical signal recorded with amplitude close
to the background level, showing the mean signal amplitude, S, and the standard
deviation, s. The peak-to-peak noise is 5s
4 Detection Limits
The concept of an analytical detection limit implies that we can make a
qualitative decision regarding the presence or absence of analyte in a sample. In
arriving at such a decision there are two basic types of error that can arise
(Table 1). The Type I error leads to the conclusion that the analyte is present in
a sample when it is known not to be, and the Type II error is made if we
conclude that the analyte is absent, when in fact it is present. The definition of a
detection limit should address both types of error.3
Table 1 The Type I and Type II errors that can be made in accepting or rejecting
a statistical hypothesis (columns: hypothesis is correct, hypothesis is incorrect)
3 J.C. Miller and J.N. Miller, 'Statistics for Analytical Chemistry', Ellis Horwood, Chichester, UK, 1993.
Figure 5 (a) The normal distribution with the 5% critical region highlighted.
Two normally distributed signals with equal variances overlapping, with the
mean of one located at the 5% point of the other (b), the decision limit;
overlapping at their 5% points with means separated by 3.3σ (c), the detection
limit; and their means separated by 10σ (d), the determination limit
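A sketch of the corresponding calculation from replicate blank measurements; the data are simulated, and the 1.65σ, 3.3σ, and 10σ multipliers follow the separations sketched in Figure 5 rather than any values worked in the text.

```python
import numpy as np

rng = np.random.default_rng(1)

# Twenty hypothetical blank measurements; their spread estimates the
# background noise, sigma.
blanks = rng.normal(0.05, 0.01, 20)
sigma = blanks.std(ddof=1)

# Multipliers follow Figure 5: ~1.65*sigma separation for the decision
# limit, 3.3*sigma for the detection limit, 10*sigma for determination.
decision = blanks.mean() + 1.65 * sigma
detection = blanks.mean() + 3.3 * sigma
determination = blanks.mean() + 10.0 * sigma

assert decision < detection < determination
```

The three limits trade off the two error types: the decision limit controls the Type I error only, while the larger separations also keep the Type II error small.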
5 Reducing Noise
If we assume that the analytical conditions have been optimized, say to produce
maximum signal intensity, then any increase in signal-to-noise ratio will be
achieved by reducing the noise level. Various strategies are widely employed to
reduce noise, including signal averaging, smoothing, and filtering. It is common
in modern spectrometers for several methods to be used on the same analytical
data at different stages in the data processing scheme (Figure 1).
Signal Averaging
The process of signal averaging is conducted by repetitively scanning and
co-adding individual spectra. Assuming the noise is randomly distributed, then
the analytical signals which are coherent in time are enhanced, since the signal
grows linearly with the number of scans, N,
signal magnitude ∝ N,
i.e. signal magnitude = k1·N
To consider the effect of signal averaging on the noise level we must refer to
the propagation of errors. The variance associated with the sum of independent
errors is equal to the sum of their variances, i.e.

σ_total² = σ1² + σ2² + … + σN² = N·σ²

so the accumulated rms noise grows only as √N·σ. Therefore,

S/N ∝ k1·N/(√N·σ) = √N·(k1/σ)

and signal averaging improves the signal-to-noise ratio by a factor of √N.
Figure 6 An infrared spectrum and the results of co-adding 4, 9, and 16 scans
from the same region
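The √N improvement is easy to verify by simulation; the band shape, noise level, and scan counts below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

true_signal = np.sin(np.linspace(0, np.pi, 200))   # hypothetical band shape

def scan():
    # One noisy acquisition: coherent signal plus white noise, sigma = 0.5.
    return true_signal + rng.normal(0.0, 0.5, true_signal.size)

results = {}
for n_scans in (1, 4, 16, 64):
    averaged = np.mean([scan() for _ in range(n_scans)], axis=0)
    results[n_scans] = (averaged - true_signal).std()   # residual noise rms

# Co-adding N scans leaves the coherent signal unchanged but cuts the
# residual noise by roughly sqrt(N): 64 scans should give ~1/8 the noise.
assert results[64] < results[1] / 3
```

Plotting `results` against 1/√N gives a near-straight line, mirroring the progressive clean-up seen in Figure 6.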
Signal Smoothing
A wide variety of mathematical manipulation schemes are available to smooth
spectral data, and in this section we shall concentrate on smoothing techniques
that serve to average a section of the data. They are all simple to implement on
personal computers. This ease of use has led to their widespread application,
but their selection and tuning is somewhat empirical and depends on the
application in hand.
One simple smoothing procedure is boxcar averaging. Boxcar averaging
proceeds by dividing the spectral data into a series of discrete, equally spaced,
bands and replacing each band by a centroid average value. Figure 7 illustrates
the results using the technique for different widths of the filter window or band.
The greater the number of points averaged, the greater the degree of smoothing,
but there is also a corresponding increase in distortion of the signal and
subsequent loss of spectral resolution. The technique is derived from the use of
electronic boxcar integrator units. It is less widely used in modern spectrometry
than the methods of moving average and polynomial smoothing.
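A minimal sketch of boxcar averaging; the function name and the toy spectrum are invented for illustration.

```python
import numpy as np

def boxcar(x, width):
    """Boxcar averaging: divide the spectrum into non-overlapping bands of
    `width` points and replace each band by its mean value."""
    n_bands = len(x) // width
    trimmed = np.asarray(x[:n_bands * width], dtype=float)
    return trimmed.reshape(n_bands, width).mean(axis=1)

spectrum = np.array([1.0, 3.0, 2.0, 4.0, 6.0, 5.0])
smoothed = boxcar(spectrum, 3)   # two bands: means of [1,3,2] and [4,6,5]
assert np.allclose(smoothed, [2.0, 5.0])
```

Note that the output has only len(x)/width points, which is exactly the loss of spectral resolution described above.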
As with boxcar averaging, the moving average method replaces a group of
values by their mean value. The difference in the techniques is that with the
moving average successive bands overlap. Consider the spectrum illustrated in
Figure 8, which is comprised of transmission values, denoted x_i. By averaging
the first five values, i = 1 ... 5, a mean transmission value is produced which
provides the value for the third data point, x'_3, in the smoothed spectrum. The
procedure continues by incrementing i and averaging the next five values to find
x'_4 from the original data x_2, x_3, x_4, x_5, and x_6. The degree of smoothing achieved is
controlled by the number of points averaged, i.e. the width of the smoothing
window. Distortion of the data is usually less apparent with the moving average
method than with boxcar averaging.
The mathematical process of implementing the moving average technique is
Figure 7 An infrared spectrum and the results of applying a 5-point boxcar
average, a 7-point average, and a 9-point average
Figure 8 Smoothing with a 5-point moving average. Each new point in the
smoothed spectrum is formed by averaging a span of 5 points from the original data
Figure 9 Convolution of a spectrum with a filter is achieved by pulling the
filter function across the spectrum
termed convolution. The resultant spectrum, x' (as a vector), is said to be the
result of convolution of the original spectrum vector, x, with a filter function, w,
i.e.

x' = x * w

For the simple five-point moving average, w = [1, 1, 1, 1, 1]. The mechanism
and application of the convolution process can be visualized graphically as
illustrated in Figure 9.
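The five-point moving average expressed as a convolution, in a short numpy sketch (names and data invented). A useful property to note: on data that lie on a straight line, the interior points are reproduced exactly, so distortion only appears where the underlying shape is curved.

```python
import numpy as np

def moving_average(x, width=5):
    """Smooth x by convolution with a normalized rectangular window:
    x' = x * w, with w = [1, 1, 1, 1, 1]/5 for the five-point case."""
    w = np.ones(width) / width
    return np.convolve(x, w, mode='valid')   # interior points only

# A straight-line 'spectrum' is preserved exactly at the interior points.
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0])
assert np.allclose(moving_average(x), [6.0, 8.0, 10.0])
```

`mode='valid'` returns only the points where the window fully overlaps the data, matching the fact that the first smoothed value is x'_3, not x'_1.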
In 1964 Savitzky and Golay described a technique for smoothing spectral
data using convolution filter vectors derived from the coefficients of least-
squares-fit polynomial functions.4 This paper, with subsequent arithmetic
corrections,5 has become a classic in analytical signal processing and least-squares
polynomial smoothing is probably the technique in widest use in spectral data
processing and manipulation. To appreciate its derivation and application we
should extend our discussion of the moving average filter.
The simple moving average technique can be represented mathematically by

x'_i = Σ_{j=−n}^{n} ω_j·x_{i+j} / Σ_{j=−n}^{n} ω_j

where x_i and x'_i are elements of the original and smoothed data vectors
respectively, and the values ω_j are the weighting factors in the smoothing
window. For a simple moving average function, ω_j = 1 for all j and the width of
the smoothing function is defined by (2n + 1) points.
4 A. Savitzky and M.J.E. Golay, Anal. Chem., 1964, 36, 1627.
5 J. Steinier, Y. Termonia, and J. Deltour, Anal. Chem., 1972, 44, 1906.
The process of polynomial smoothing extends the principle of the moving
average by modifying the weight vector, ω, such that the elements of ω describe
a convex polynomial. The central value in each window, therefore, contributes
more to the averaging process than values at the extremes of the window.
Consider five data points forming a part of a spectrum described by the data
set x recorded at equal wavelength intervals. Polynomial smoothing seeks to
replace the value of the point x_i by a value calculated from the least-squares
polynomial fitted to x_{i−2}, x_{i−1}, x_i, x_{i+1}, and x_{i+2} recorded at
wavelengths denoted by λ_{i−2}, λ_{i−1}, λ_i, λ_{i+1}, and λ_{i+2}.
For a quadratic curve fitted to the data, the model can be expressed as

x' = a0 + a1·λ + a2·λ²   (13)

where x' is the fitted model data and a0, a1, and a2 are the coefficients or
weights to be determined.
Using the method of least squares, the aim is to minimize the error, E, given
by the square of the difference between the model function, Equation (13), and
the observed data, for all data values fitted, i.e.

E = Σ_i (a0 + a1·λ_i + a2·λ_i² − x_i)²   (16)

and, by simple differential calculus, this error function is a minimum when its
first derivative is zero.
Differentiating Equation (16) with respect to a0, a1, and a2 respectively
provides a set of so-called normal equations.
Because the λ values are equally spaced, Δλ = λ_i − λ_{i−1} is constant and only
relative λ values are required for the model.
Savitzky and Golay published the coefficients for a range of least-squares fit
curves with up to 25-point wide smoothing windows for each.4 Corrections to
the original tables have been published by Steinier et al.5
Table 2 presents the weighting coefficients for performing 5-, 9-, 13-, and
17-point quadratic smoothing and the results of applying these functions to the
infrared spectral data are illustrated in Figure 10.
When choosing to perform a Savitzky-Golay smoothing operation on spec-
tral data, it is necessary to select the filtering function (quadratic, quartic, etc.),
the width of the smoothing function (the number of points in the smoothing
window), and the number of times the filter is to be applied successively to the
Table 2 Savitzky-Golay coefficients, or weightings, for 5-, 9-, 13-, and 17-point
quadratic smoothing of continuous spectral data
Points 17 13 9 5
data. Although the final choice is largely empirical, the quadratic function is the
most commonly used, with the window width selected according to the scan-
ning conditions. A review and account of selecting a suitable procedure has
been presented by Enke and Nieman.6
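A sketch of the procedure using the standard published 5-point quadratic weights, (−3, 12, 17, 12, −3) with normalizing factor 35 (the sum of the weights). A convenient check on any quadratic Savitzky-Golay filter is that it reproduces exactly quadratic data at the interior points, since the least-squares fit is then exact.

```python
import numpy as np

# 5-point quadratic Savitzky-Golay weights, as tabulated by Savitzky and
# Golay; the normalizing factor is the sum of the weights, 35.
w = np.array([-3.0, 12.0, 17.0, 12.0, -3.0])

def savgol5(x):
    """Apply the 5-point quadratic least-squares smoothing filter by
    convolution (interior points only; the kernel is symmetric)."""
    return np.convolve(x, w, mode='valid') / w.sum()

# Data lying exactly on a parabola pass through the filter unchanged.
i = np.arange(11, dtype=float)
parabola = 2.0 + 0.5 * i - 0.3 * i ** 2
assert np.allclose(savgol5(parabola), parabola[2:-2])
```

Unlike the plain moving average, the polynomial weights preserve curvature, which is why band heights are distorted far less for the same window width.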
F(ω) = ∫ f(t)·e^{−iωt} dt,   f(t) = (1/2π)·∫ F(ω)·e^{iωt} dω   (23)

The two functions f(t) and F(ω) are said to comprise Fourier transform pairs.
As discussed previously with regard to sampling theory, real analytical
signals are band-limited. The Fourier equations therefore should be modified
for practical use as we cannot sample an infinite number of data points. With
this practical constraint, the discrete forward complex transform is given by

F_k = Σ_{n=0}^{N−1} x_n·e^{−2πikn/N},   k = 0 … N−1

Figure 11 Some well characterized Fourier pairs: the white spectrum and the
impulse function (a), the boxcar and sinc functions (b), the triangular and
sinc² functions (c), and the Gaussian pair (d)
(Reproduced by permission from ref. 7)
interest since these shapes describe typical spectral profiles. The Fourier trans-
form of a Gaussian signal is another Gaussian form, and for a Lorentzian signal
the transform takes the shape of an exponentially decaying oscillator.
One of the earliest applications of the Fourier transform in spectroscopy was
in filtering and noise reduction. This technique is still extensively employed.
Figure 12 presents the Fourier transform of an infrared spectrum, before and
after applying the 13-point quadratic Savitzky-Golay function. The effect of
smoothing can clearly be seen as reducing the high-frequency fluctuations,
hopefully due to noise, by the polynomial function serving as a low-pass filter.
Convolution provides an important technique for smoothing and processing
spectral data, and can be undertaken in the frequency domain by simple
multiplication. Thus smoothing can be accomplished in the frequency domain,
following Fourier transformation of the data, by multiplying the Fourier
transform by a rectangular or other truncating function. The low-frequency
7 R. Bracewell, 'The Fourier Transform and Its Application', McGraw-Hill, New York, USA,
1965.
Figure 12 A spectrum (a) and its Fourier transform before (b) and after
applying a 13-point quadratic smoothing filter (c)
Figure 13 A spectrum and its Fourier transform (a). The transform and its
inverse retaining (b) 40, (c) 20, and (d) 6 of the Fourier coefficients
[axes: interferogram (no. of points) and wavelength (arbitrary scale)]
of EW_f. Assuming the spectrum was acquired in a single scan taking 10 s and
it comprises 256 discrete points, then the sampling interval, Δt, is given by

Δt = 10/256 ≈ 0.039 s

The IR spectrum was synthesized from two Lorentzian bands, the sharpest
having a half-width of 1.17 s. Therefore EW_t = 1.838 s and EW_f = 0.554 Hz.
The complex interferogram of 256 points is composed of 128 real values and
128 imaginary values spanning the range 0–12.75 Hz. According to the EW
criterion, a suitable cut-off frequency is 0.554 Hz and the number of significant
points, N, to be retained may be calculated from

N ≈ EW_f/Δf = 0.554/0.1 ≈ 6
Thus, points 7 to 128 are zeroed in both the real and imaginary arrays before
performing the inverse transform, Figure 13(d). Obviously, to use the tech-
nique, it is necessary to estimate the half-width of the narrowest band present.
Where possible this is usually done using some sharp isolated band in the
spectrum.
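The truncation procedure can be sketched with numpy's FFT routines. The synthetic band, noise level, and cut-off index below are arbitrary choices for illustration, not the book's 256-point example.

```python
import numpy as np

rng = np.random.default_rng(3)

# A hypothetical smooth band plus white noise, 256 points.
t = np.linspace(0.0, 1.0, 256)
clean = np.exp(-((t - 0.5) / 0.05) ** 2)
noisy = clean + rng.normal(0.0, 0.05, t.size)

# Transform, keep only the first 20 (low-frequency) coefficients, zero the
# rest, and transform back: low-pass filtering in the frequency domain.
coeffs = np.fft.rfft(noisy)
coeffs[20:] = 0.0
smoothed = np.fft.irfft(coeffs, n=t.size)

# The truncated reconstruction lies closer to the clean band than the raw
# noisy data, because most of the zeroed coefficients carried only noise.
assert np.mean((smoothed - clean) ** 2) < np.mean((noisy - clean) ** 2)
```

Choosing the cut-off index too low would begin to clip genuine band information, which is exactly the trade-off the EW criterion is designed to manage.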
All the smoothing functions discussed in previous sections can be displayed
and compared in the frequency domain, and in addition new filters can be
designed. Bromba and Ziegler have made an extensive study of such 'designer'
filters.9,10 The Savitzky-Golay filter acts as a low-pass filter that is optimal for
polynomial shaped signals. Of course, in spectrometry Gaussian or Lorentzian
band shapes are the usual form and the polynomial is only an approximation to
a section of the spectrum defined by the width of the filter window. There is no
reason why filters other than the polynomial should not be employed for
smoothing spectral data. Use of the Savitzky-Golay procedure is as much
traditional as representing any theoretical optimum.
Bromba and Ziegler have defined a general filter with weighting elements
defined by the form
Assuming all noise is removed then the result is the true spectrum. Conversely,
from Equation (33), if the smoothed spectrum is subtracted from the
original, raw data, then a noise spectrum is obtained. The distribution of this
noise as a function of wavelength may provide information regarding the
source of the noise in spectrometers. The procedure is analogous to the analysis
of residuals in regression analysis and modelling.
Figure 14 The frequency response of filters of Bromba and Ziegler for α values
of 0.5, 1.0, and 2.0
6 Interpolation
Not all analytical data can be recorded on a continuous basis; discrete measure-
ments often have to be made and they may not be at regular time or space
intervals. To predict intermediate values for a smooth graphic display, or to
perform many mathematical manipulations, e.g. Savitzky-Golay smoothing, it
is necessary to evaluate regularly spaced intermediate values. Such values are
obtained by interpolation.
Obviously, if the true underlying mathematical relationship between the
independent and dependent variables is known then any value can be computed
exactly. Unfortunately, this information is rarely available and any required
interpolated data must be estimated.
The data in Table 3, shown in Figure 15, consist of magnesium concentra-
tions as determined from river water samples collected at various distances
from the stream mouth. Because of the problems of accessibility to sampling
sites, the samples were collected at irregular intervals along the stream channel
and the distances between samples were calculated from aerial photographs. To
produce regularly spaced data, all methods for interpolation assume that no
discontinuity exists in the recorded data. It is also assumed that any inter-
mediate, estimated value is dependent on neighbouring recorded values. The
simplest interpolation technique is linear interpolation. With reference to Figure
15, if y1 and y2 are observed values at points x1 and x2, then the value of y'
situated at x' between x1 and x2 can be calculated from

y' = y1 + (y2 - y1)(x' - x1)/(x2 - x1)
48 Chapter 2
Table 3 Concentration of magnesium (mg kg⁻¹) from a stream sampled at
different locations along its course. Distances (m) are from stream mouth to
sample locations
This, of course, is the model of linear interpolation and for x' = 2500 m,
y' = 8.74 mg kg⁻¹.
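This calculation is easily sketched in code. The routine below is a minimal illustration of the linear model; since the tabulated magnesium values are not reproduced in this copy, the numbers in the usage line are illustrative rather than the book's data.

```python
def linear_interpolate(x1, y1, x2, y2, xq):
    """Linearly interpolate between (x1, y1) and (x2, y2) at the point xq."""
    return y1 + (y2 - y1) * (xq - x1) / (x2 - x1)

# Illustrative values only: observations at 2000 m and 3000 m
print(linear_interpolate(2000.0, 8.5, 3000.0, 9.0, 2500.0))  # 8.75
```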
To take account of more measured data, higher order polynomials can be
employed. A quadratic model will fit three pairs of points,
with the quadratic term being zero when x' = x1 or x' = x2. When x' = x3,
substitution and rearrangement of Equation (37) allows the coefficient a1 to be
calculated,
and
Substituting for a1 and x' = 2500 m into Equation (37), the estimated value
of y' is 9.05 mg kg⁻¹ Mg.
The technique can be extended further. With four pairs of observations, a
cubic equation can be generated to pass through each point,
and substituting into Equation (40), for x' = 2500 m, then y' = 8.93 mg kg⁻¹
Mg.
As the number of observed points to be connected increases, so too does
the degree of the polynomial required if we are to guarantee passing through
each point. The general technique is referred to as providing divided difference
polynomials. The coefficients a2, a3, a4, etc. may be generated algorithmically
by the 'Newton forward formula', and many examples of the algorithms are
available.
To fit a curve to n data points a polynomial of degree (n - 1) is required, and
with a large data set the number of coefficients to be calculated is correspond-
ingly large. Thus 100 data points could be interpolated using a 99-degree
polynomial. Polynomials of such a high degree, however, are unstable. They
can fluctuate wildly with the high-degree terms forcing an exact fit to the data.
Low-degree polynomials are much easier to work with analytically and they are
widely used for curve fitting, modelling, and producing graphic output. To fit
small polynomials to an extensive set of data it is necessary to abandon the idea
of trying to force a single polynomial through all the points. Instead different
polynomials are used to connect different segments of points, piecing each
section smoothly together. One technique exploiting this principle is spline
interpolation, and its use is analogous to using a mechanical flexicurve to draw
manually a smooth curve through fixed points.
The shape described by a spline between two adjacent points, or knots, is a
cubic, third-degree polynomial. For the six pairs of data points representing
our magnesium study, we would consider the curve connecting the data to
comprise five cubic polynomials. Each of these takes the form
To compute the spline, we must calculate values for the 20 coefficients, four
for each polynomial segment. Therefore we require 20 simultaneous equations,
dictated by the following physical constraints imposed on the curve.
Since the curve must touch each point then
The spline must curve smoothly about each point with no sharp bends or
kinks, so the slope of each segment where they connect must be similar. To
achieve this the first derivatives of the spline polynomials must be equal at the
measured points.
We can also demand that the second derivatives of each segment will be
similar at the knots.
11 A.F. Carley and P.H. Morgan, 'Computational Methods in the Chemical Sciences', Ellis
Horwood, Chichester, UK, 1989.
12 P. Gans, 'Data Fitting in the Chemical Sciences', J. Wiley and Sons, Chichester, UK, 1992.
Acquisition and Enhancement of Data 51
Finally, we can specify that at the extreme ends of the curve the second
derivatives are zero:
then if the values of p1, . . ., p5 were known, all the coefficients, a, b, c, d, could be
computed from the following four equations,
If each spline segment is scaled on the x-axis between the limits [0,1], using
the term t = (x - xi)/(xi+1 - xi), then the curve can be expressed as
To calculate the values of pi, we impose the constraint that the first deriva-
tives of the spline segments are equal at their endpoints. The resulting equations
are
or in matrix form,
where
wi = 6[(yi+1 - yi)/(xi+1 - xi) - (yi - yi-1)/(xi - xi-1)]   (52)
Figure 16 The result of applying a cubic spline interpolation model to the stream magnesium
data (distance in m on the x-axis)
and this, with values for p1 = 0 and p2 = -6.314, is substituted into Equation
(49),
The resultant cubic spline curve for the complete range of the magnesium
data is illustrated in Figure 16.
Spline curve fitting has many important applications in analytical science,
not only in interpolation but also in differentiation and calibration. The
technique is particularly useful when no analytical model of the data is
available.
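The tridiagonal construction described above can be condensed into a short routine. The following is a sketch of a natural cubic spline, with the end second derivatives set to zero as in the text; the function name is my own and the notation p for the knot second derivatives follows the discussion above.

```python
import numpy as np

def natural_cubic_spline(x, y, xq):
    """Evaluate a natural cubic spline through knots (x, y) at points xq.

    Solves the tridiagonal system for the second derivatives p_i at the
    interior knots (p = 0 at both ends), then evaluates the cubic segment
    containing each query point.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    h = np.diff(x)                       # segment widths
    # Right-hand side: 6 * (difference of adjacent chord slopes)
    w = 6.0 * (np.diff(y[1:]) / h[1:] - np.diff(y[:-1]) / h[:-1])
    # Tridiagonal system for the interior second derivatives
    A = np.zeros((n - 2, n - 2))
    for i in range(n - 2):
        A[i, i] = 2.0 * (h[i] + h[i + 1])
        if i > 0:
            A[i, i - 1] = h[i]
        if i < n - 3:
            A[i, i + 1] = h[i + 1]
    p = np.zeros(n)
    p[1:-1] = np.linalg.solve(A, w)      # natural ends: p[0] = p[-1] = 0
    # Evaluate the appropriate cubic segment at each query point
    xq = np.atleast_1d(np.asarray(xq, float))
    out = np.empty_like(xq)
    for k, xv in enumerate(xq):
        i = min(max(np.searchsorted(x, xv) - 1, 0), n - 2)
        hi = h[i]
        t1, t2 = x[i + 1] - xv, xv - x[i]
        out[k] = (p[i] * t1 ** 3 + p[i + 1] * t2 ** 3) / (6 * hi) \
               + (y[i] / hi - p[i] * hi / 6) * t1 \
               + (y[i + 1] / hi - p[i + 1] * hi / 6) * t2
    return out
```

For six knots, as in the magnesium data, this reproduces the five-segment construction described above; for collinear knots all p_i are zero and the spline collapses to straight-line interpolation, a useful sanity check.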
Having acquired our chemical data, it is now necessary to analyse the results
and extract the required relevant information. This will obviously depend on
the aims of the analysis, but further preprocessing and manipulation of the data
may be needed. This is considered in the next chapter.
CHAPTER 3
Feature Selection and Extraction
1 Introduction
Previous chapters have largely been concerned with processes related to acquir-
ing our analytical data in a digital form suitable for further manipulation and
analysis. This data analysis may include calibration, modelling, and pattern
recognition. Many of these procedures are based on multivariate numerical
data processing and before the methods can be successfully applied it is usual to
perform some pre-processing on the data. There are three main aims of this
pre-processing stage in data analysis,
(a) to reduce the amount of data and eliminate data that are irrelevant to the
study being undertaken,
(b) to preserve or enhance sufficient information within the data in order to
achieve the desired goal,
(c) to extract the information in, or transform the data to, a form suitable for
further analysis.
One of the most common forms of pre-processing spectral data is normali-
zation. At its simplest this may involve no more than scaling each spectrum in a
collection so that the most intense band in each spectrum is some constant
value. Alternatively, spectra could be normalized to constant area under the
curve of the absorption or emission profile. A more sophisticated procedure
involves constructing a covariance matrix between variates and extracting the
eigenvectors and eigenvalues. Eigen analysis yields a set of new variables which
are linear combinations of the original variables. This can often lead to
representing the original information in fewer new variables, thus reducing the
dimensionality of the data and aiding subsequent analysis.
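The two simple normalization schemes just described can be sketched as follows. This is a minimal illustration; the area version uses a trapezoidal estimate of the area at unit point spacing, an assumption of the sketch rather than anything specified in the text.

```python
import numpy as np

def normalize_max(spectrum):
    """Scale a spectrum so that its most intense band equals 1."""
    s = np.asarray(spectrum, float)
    return s / s.max()

def normalize_area(spectrum):
    """Scale a spectrum to unit area under the curve.

    The area is estimated by the trapezoid rule, assuming unit spacing
    between adjacent points.
    """
    s = np.asarray(spectrum, float)
    area = 0.5 * (s[:-1] + s[1:]).sum()
    return s / area
```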
The success of pattern recognition techniques can frequently be enhanced or
simplified by suitable prior treatment of the analytical data, and feature selec-
tion and feature extraction are important stages in chemometrics. Feature
selection refers to identifying and selecting those features present in the analy-
tical data which are believed to be important to the success of calibration or
pattern recognition. Techniques commonly used include differentiation, inte-
gration, and peak identification. Feature extraction, on the other hand, changes
the dimensionality of the data and generally refers to processes combining or
transforming original variables to provide new and better variables. Methods
widely used include Fourier transformation and principal components analysis.
In this chapter the popular techniques pertinent to feature selection and
extraction are introduced and developed. Their application is illustrated with
reference to spectrochemical analysis.
2 Differentiation
Derivative spectroscopy provides a means for presenting spectral data in a
potentially more useful form than the zero'th order, normal data. The tech-
nique has been used for many years in many branches of analytical spectro-
scopy. Derivative spectra are usually obtained by differentiating the recorded
signal with respect to wavelength as the spectrum is scanned. Whereas early
applications mainly relied on hard-wired units for electronic differentiation,
modern derivative spectroscopy is normally accomplished computationally
using mathematical functions. First-, second-, and higher-order derivatives can
easily be generated.
Analytical applications of derivative spectroscopy are numerous and gen-
erally owe their popularity to the apparent higher resolution of the differential
Figure 1 A pair of overlapping Gaussian peaks (a), and the first- (b), second- (c), and
third-order (d) derivative spectra
Figure 2 Quantitative analysis with first derivative spectra. Peak heights are displayed in
relative absorbance units
data compared with the original spectrum. The effect can be illustrated with
reference to the example shown in Figure 1. The zero'th-, first-, and second-
order derivatives of a spectrum, comprised of the sum of two overlapping
Gaussian peaks, are presented. The presence of a smaller analyte peak can be
much more evident in the derivative spectra. In addition, for determining the
intensity of the smaller peak in the presence of the large neighbouring peak,
derivative spectra can be more useful and may be subject to less error. This is
illustrated in Figure 2, in which the zero'th and first derivative spectra are
shown for an analyte band with and without the presence of an overlapping
band. If we assign unit peak height to the analyte in the normal, zero'th-order,
spectrum, then for the same band with the interfering band present, a peak
height of 1.55 units is recorded. Using a tangent baseline in order to attempt to
correct for the overlap fails as there is no unique or easily identified tangent,
and a not unreasonable value of 1.2 units for the peak height could be recorded,
a 20% error. The situation is improved considerably if the first derivative
spectrum is analysed. A value of one is assigned to the peak-to-peak distance of
the lone analyte spectrum. In the presence of the overlapping band a similar
measure for the analyte is now 1.04, a 4% error.
This example, however, oversimplifies the case of using derivative spectro-
scopy as it gives no indication of the effects of noise on the results. Derivative
spectra tend to emphasize changes in slope that are difficult to detect in the
zero'th-order spectrum. Unfortunately, as we have seen in previous chapters,
noise is often comprised of high-frequency components and thus may be greatly
amplified by differentiation. It is the presence of noise which generally limits the
use of derivative spectroscopy to UV-visible spectrometry and other techniques
in which a high signal-to-noise ratio may be obtained for a spectrum.
Various mathematical procedures may be employed to differentiate spectral
data. We will assume that such data are recorded at evenly spaced intervals
along the wavelength, λ, or other x-axis. If this is not the case, the data may be
interpolated to provide this. The simplest method to produce the first-deriva-
tive spectrum is by difference,
or,
Equations (4) and (5) are similar to Equations (2) and (3). The difference is in
the use of additional terms using extra points from the data in order to provide
a better approximation.
The relative merits of these different methods can be compared by differen-
tiating test data; the estimates obtained for four data sets are:

                        Data set
                   1           2           3           4
by Equation (1)    3.1 or 2.7  3.3 or 2.6  3.4 or 2.55 3.45 or 2.525
by Equation (2)    2.9         2.95        2.98        2.9875
by Equation (4)    2.980       2.990       2.995       2.9975
by Equation (3)    0.4         0.7         0.85        0.925
by Equation (5)    1.0857      1.0429      1.0214      1.0107
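The difference formulas themselves are elided in this copy, but their convergence behaviour is easy to demonstrate. As a hedged sketch (the test function x³ and the step sizes are my own, not the book's test data), a one-sided difference closes on the true slope linearly in the step size, while the symmetric central difference closes quadratically:

```python
def forward_diff(f, x, h):
    """One-sided (forward) difference estimate of f'(x); error falls as h."""
    return (f(x + h) - f(x)) / h

def central_diff(f, x, h):
    """Symmetric (central) difference estimate of f'(x); error falls as h**2."""
    return (f(x + h) - f(x - h)) / (2 * h)

f = lambda x: x ** 3      # test function with true slope f'(1) = 3
for h in (0.4, 0.2, 0.1):
    # the central estimate approaches 3 much faster than the forward one
    print(h, forward_diff(f, 1.0, h), central_diff(f, 1.0, h))
```

The pattern mirrors the worked values in the table above, where the higher-order formulas converge on the true derivative as the sampling interval shrinks.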
R. Bracewell, 'The Fourier Transform and its Applications', McGraw-Hill, New York, USA,
1965.
Figure 3 Differentiation of spectra via the Fourier transform: (a) the original spectrum;
(b) its Fourier transform; (c) the differential filter applied to the transform; (d)
the resulting first derivative spectrum from the inverse transform
noise ratio of the differential spectrum is severely degraded. This problem can
be partly alleviated by combining a smoothing function along with the differen-
tial filter. In Figure 4(a), the differential transform is truncated and applied to
low frequencies only. High frequencies are eliminated by the zero weighting of
the function. The result of multiplying our transformed data by this new filter
function is shown in Figure 4(b) and the resultant first-derivative spectrum in
Figure 4(c). The effect of the extra smoothing function is evident if Figure 4(c)
and Figure 3(d) are compared.
For many applications the digitization of a full spectrum provides far more
data than is warranted by the spectrum's information content. An infrared
spectrum, for example, is characterized as much by regions of no absorption as
regions containing absorption bands, and most IR spectra can be reduced to a
list of some 20-50 peaks. This represents such a dramatic decrease in dimen-
sionality of the spectral data that it is not surprising that peak tables are
commonly employed to describe spectra. The determination of spectral peak
positions from digital data is relatively straightforward and the facility is
offered on many commercial spectrometers. Probably the most common tech-
niques for finding peak positions involve analysis of derivative data.
In Figure 5 a single Lorentzian function is illustrated along with its first,
Figure 4 Combining smoothing and differentiating in the frequency domain: (a) the
truncation filter to remove high frequency, noise signals and provide the first
derivative; (b) the transform of the spectrum from Figure 3(a) after application
of the filter; (c) the resulting first derivative spectrum from the inverse transform
second, third, and fourth derivatives with respect to energy. At peak positions
the following conditions exist,
where y' is the first derivative, y'' the second, and so on.
Thus, the presence and location of a peak in a spectrum can be ascertained
from a suitable subset of the rules expressed mathematically in Equation (6):4
Rule 1, a peak centre has been located if the first derivative value is zero and
the second derivative value is negative, i.e. y' = 0 and y'' < 0.
Figure 5 A Lorentzian band (a), and its first (b), second (c), third (d), and fourth (e)
derivatives
Rule 2, a peak centre has been located if the third derivative is zero and the
fourth derivative is positive, i.e. y''' = 0 and y'''' > 0.
Figure 6 Results of a peak picking algorithm. At x = 80, the first derivative spectrum
crosses zero and the second derivative is negative. A 9-point cubic least-squares
fit is applied about this point to derive the coefficients of the cubic model. The
peak position (dy/dx = 0) is calculated as occurring at x = 80.3
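Rule 1 can be sketched as a zero-crossing search on numerical derivatives. This is an illustration only: it uses simple linear interpolation of the crossing rather than the book's 9-point cubic least-squares refinement, and the Lorentzian test band is my own.

```python
import numpy as np

def find_peaks(x, y):
    """Locate peak centres by Rule 1: the first derivative crosses zero
    (positive to negative) where the second derivative is negative."""
    d1 = np.gradient(y, x)      # numerical first derivative
    d2 = np.gradient(d1, x)     # numerical second derivative
    peaks = []
    for i in range(len(x) - 1):
        if d1[i] > 0 and d1[i + 1] <= 0 and d2[i] < 0:
            # linear interpolation of the zero crossing of d1
            t = d1[i] / (d1[i] - d1[i + 1])
            peaks.append(x[i] + t * (x[i + 1] - x[i]))
    return peaks

# Illustrative Lorentzian band centred at x = 80
x = np.linspace(0.0, 160.0, 801)
y = 1.0 / (1.0 + ((x - 80.0) / 10.0) ** 2)
print(find_peaks(x, y))   # one peak located near x = 80
```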
3 Integration
Mathematically, integration is complementary to differentiation and comput-
ing the integral of a function is a fundamental operation in data processing. It
occurs frequently in analytical science in terms of determining the area under a
curve, e.g. the integrated absorbance of a transient signal from a graphite
furnace atomic absorption spectrometer. Many classic algorithms exist for
approximating the area under a curve. We will briefly examine the more
common with reference to the absorption profile illustrated in Figure 7. This
envelope was generated from the model y = 0.1x³ - 1.1x² + 3x + 0.2. Its
integral, between the limits x = 0 and x = 6, can be computed directly. The area
under the curve is 8.400.
One of the simplest integration techniques to implement on a computer is the
method of summing rectangles that each fit a portion of the curve, Figure 8(a).
For N + 1 points in the interval x1, x2, . . ., xN+1 we have N rectangles of width
(xi+1 - xi) and height, hi, given by the value of the curve at the mid-point
between xi and xi+1. The approximate area under the curve, A, between x1 and
xN+1 is therefore given by the sum of the terms hi(xi+1 - xi) over the N rectangles,
Figure 7 The model absorption profile (absorbance against time/s) from a graphite
furnace AAS study
Figure 8 The area under the AAS profile using (a) rectangular and (b) trapezoidal
integration
As N gets larger, the width of each rectangle becomes smaller and the answer
is more accurate:
for N = 5, A = 8.544
N = 10, A = 8.436
N = 15, A = 8.388
A second method of approximating the integral is to divide the area under
the curve into trapezoids, Figure 8(b). The area of each trapezoid is given by
one-half the product of the width (xi+1 - xi) and the sum of the two sides, hi
and hi+1. The area under the curve can be calculated from the sum of the
terms (xi+1 - xi)(hi + hi+1)/2 over the N trapezoids,
For our absorption peak, the trapezoid method using different widths pro-
duces the following estimates for the integral:
for N = 5, A = 8.112
N = 10, A = 8.328
N = 15, A = 8.368
A third approach, Simpson's rule, fits a parabola through successive triplets
of points; because the model profile is itself a cubic, this rule reproduces the
exact area at every resolution:
for N = 5, A = 8.400
N = 10, A = 8.400
N = 15, A = 8.400
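These results can be reproduced directly from the model profile. The sketch below implements the mid-point rectangle and trapezoid sums; the exact-at-every-resolution behaviour of the third set of results quoted above is the signature of Simpson's rule, which is exact for cubic integrands, so a composite Simpson implementation (even number of intervals required) is included on that assumption.

```python
def f(x):
    """The model absorption profile y = 0.1x**3 - 1.1x**2 + 3x + 0.2."""
    return 0.1 * x ** 3 - 1.1 * x ** 2 + 3 * x + 0.2

def midpoint(f, a, b, n):
    """Sum n rectangles whose heights are taken at interval mid-points."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

def trapezoid(f, a, b, n):
    """Sum n trapezoids joining successive points on the curve."""
    h = (b - a) / n
    return h * (0.5 * f(a) + sum(f(a + i * h) for i in range(1, n)) + 0.5 * f(b))

def simpson(f, a, b, n):
    """Composite Simpson's rule (n even); exact for cubics like f above."""
    h = (b - a) / n
    s = f(a) + f(b)
    s += 4 * sum(f(a + i * h) for i in range(1, n, 2))
    s += 2 * sum(f(a + i * h) for i in range(2, n, 2))
    return s * h / 3

print(round(midpoint(f, 0, 6, 5), 3))    # 8.544
print(round(trapezoid(f, 0, 6, 5), 3))   # 8.112
print(round(simpson(f, 0, 6, 6), 3))     # 8.4
```

The mid-point and trapezoid values at N = 5 match the figures quoted in the text.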
4 Combining Variables
Many analytical measures cannot be represented as a time-series in the form of
a spectrum, but are comprised of discrete measurements, e.g. compositional or
trace analysis. Data reduction can still play an important role in such cases. The
interpretation of many multivariate problems can be simplified by considering
not only the original variables but also linear combinations of them. That is, a
new set of variables can be constructed each of which contains a sum of the
original variables each suitably weighted. These linear combinations can be
derived on an ad hoc basis or more formally using established mathematical
techniques. Whatever the method used, however, the aim is to reduce the
number of variables considered in subsequent analysis and obtain an improved
representation of the original data. The number of variables measured is not
reduced.
An important and commonly used procedure which generally satisfies these
5 A.F. Carley and P.H. Morgan, 'Computational Methods in the Chemical Sciences', Ellis
Horwood, Chichester, UK, 1989.
criteria is principal components analysis. Before this specific topic is examined
it is worthwhile discussing some of the more general features associated with
linear combinations of variables.
A new variable, X3 = aX1 + bX2, can be defined (Equation 10) and its value
for the 17 samples calculated. The values of a and b could be chosen arbitrarily
such that, for example, a = b. This variable would then describe a new axis at an
angle of 45° to the axes of Figure 9. The sample points can be projected on to
this as illustrated in Figure 10.
As for the actual values of the coefficients a and b, the simplest case is
described by a = b = 1, but any value will provide the same angle of projection
and the same form of the distribution of data on this new line. In practice, it is
usual to specify a particular linear combination, referred to as the normalized
linear combination, defined by a² + b² = 1.
6 W. Niedermeier, in 'Applied Atomic Spectroscopy', ed. E.L. Grove, Plenum Press, New York,
USA, 1978, p. 219.
Table 3 Heart-tissue trace metal data
Samples: 1 AO; 2 MPA; 3 RSCV; 4 TV; 5 MV; 6 PV; 7 AV; 8 RA; 9 LAA;
10 RV; 11 LV; 12 LV-PM; 13 IVS; 14 CR; 15 SN; 16 AVN + B; 17 LBB
AO Aorta; MPA Main pulmonary artery; RSCV Right superior vena cava; TV Tricuspid valve; MV Mitral valve; PV Pulmonary valve; AV Aortic
valve; RA Right atrium; LAA Left atrial appendage; RV Right ventricle; LV Left ventricle; LV-PM Left ventricle, muscle; IVS Interventricular septum;
CR Crista supraventricularis; SN Sinus node; AVN + B Atrioventricular node; LBB Left bundle branch.
Table 4 The matrix of correlations between the analytes determined from heart-tissue data
Cu Mn Mo Zn Cr Ni Cs Ba Sr Cd Al Sn Pb
Figure 9 Chromium and nickel concentration scatter plot from heart-tissue data (axes in
mg kg⁻¹). The distribution of concentration values for each element is shown as
a bar graph on their respective axes
Figure 10 A 45° line on the Cr-Ni data plot (axes in mg kg⁻¹) with the individual sample
points projected on to this line
example, this implies a = b = 1/√2. The variance of X3, derived from substitut-
ing a and b into Equation (10) for the concentration of chromium and nickel for
each of the 17 samples, is 5.22, compared with 3.07 and 2.43 for X1 and
X2 respectively. Thus X3 does indeed contain more potential information than
either X1 or X2.
This reorganization or partitioning of variance associated with individual
variates can be formally addressed as follows.
For any linear combination of variables defining a new variable X given by
Equation (10), the coefficients can be expressed as

a = cos α
b = sin α

where α is the angle between the projection of the new axis and the original
ordinate axis. If α = 45°, then a = b = 1/√2, the normalized coefficients as
derived from Equation (11). This trigonometric relationship is often employed
in determining different linear combinations of variables and is used in many
principal component algorithms.
Values of α, or a and b, employed in practice depend on the aims of the data
analysis. Different linear combinations of the same variables will produce new
variables with different attributes which may be of interest in studying different
problems.' The linear combination which produces the greatest separation
between two groups of data samples is appropriate in supervised pattern
7 B. Flury and H. Riedwyl, 'Multivariate Statistics: A Practical Approach', Chapman and Hall,
London, UK, 1984.
recognition. This forms the basis of linear discriminant analysis, a topic that
will be discussed in Chapter 5. Considering our samples or objects as a single
group or cluster, we may wish to determine the minimum number of normalized
linear combinations having the greatest proportion of the total variance, in
order to reduce the dimensionality of the problem. This is the task of principal
components analysis and is treated in the next section.
As well as this matrix form, this structure can also be represented diagram-
matically as shown in Figure 11. The variance of the chromium data is
represented by a line along the X1 axis with a length equal to the variance of X1.
8 J.C. Davis, 'Statistics and Data Analysis in Geology', J. Wiley and Sons, New York, USA, 1973.
Figure 11 A bivariate variance-covariance matrix may be displayed graphically (axes
scaled in variance units, s²)
9 M.J.R. Healy, 'Matrices for Statistics', Oxford University Press, Oxford, UK, 1986.
If x is not 0 then the determinant of the coefficient matrix must be zero, i.e.
For our experimental data with X1 and X2 representing chromium and nickel
concentrations, and A = Cov, then
which simplifies to
and the elements of x are the eigenvectors associated with the first eigenvalue,
e1. For our 2 × 2, Ni-Cr variance-covariance data, substitution into (26) leads
to
with v11 and v12 as the elements of the eigenvector associated with the first
eigenvalue, and v21 and v22 defining the slope of the second eigenvector.
Solving these equations gives
which defines the slope of the major axis of the ellipse (Figure 12), and
Thus, the elements of the eigenvectors become the required coefficients for
the original variables, and are referred to as loadings. The individual elements
of the new variables (PC1 and PC2) are derived from X1 and X2 and are termed
the scores.10,11 The principal component scores for the chromium and nickel
data are given in Table 5.
The total variance of the original nickel and chromium data is
3.07 + 2.43 = 5.5 with X1 contributing 56% of the variance and X2 contribut-
10 B.F.J. Manly, 'Multivariate Statistical Methods: A Primer', Chapman and Hall, London, UK,
1986.
11 R.E. Aries, D.P. Lidiard, and R.A. Spragg, Chem. Br., 1991,27,821.
ing the remaining 44%. The calculated eigenvalues are the lengths of the two
principal axes and represent the variance associated with each new variable,
PC1 and PC2. The first principal component, therefore, contains 5.24/5.50 or
more than 95% of the total variance, and the second principal component less
than 5%, 0.26/5.50. If it were necessary to reduce the display of our original
bivariate data to a single dimension using only one variable, say chromium
concentration, then a loss of 44% of the total variance would ensue. Using the
first principal component, however, and optimally combining the two vari-
ables, only 5% of the total variance would be missing.
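This 2 × 2 eigen analysis can be checked numerically. Note one assumption in the sketch: the covariance value of 2.47 is not quoted directly in the text; it is inferred here from the variance of the normalized sum X3 (5.22 = (3.07 + 2.43 + 2 cov)/2), so treat it as a reconstruction.

```python
import numpy as np

# Variances quoted in the text for Cr (X1) and Ni (X2); the covariance
# 2.47 is NOT quoted directly: it is inferred from the variance of the
# normalized sum X3, i.e. 5.22 = (3.07 + 2.43 + 2*cov)/2.
cov_matrix = np.array([[3.07, 2.47],
                       [2.47, 2.43]])

evals = np.linalg.eigh(cov_matrix)[0][::-1]    # eigenvalues, largest first
print(np.round(evals, 2))                      # [5.24 0.26]
print(np.round(100 * evals / evals.sum(), 1))  # % variance per component
```

The eigenvalues 5.24 and 0.26 agree with the principal-axis variances discussed above, with PC1 carrying more than 95% of the total variance of 5.50.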
We are now in a position to return to the complete set of trace element data in
Table 3 and apply principal components analysis to the full data matrix. The
techniques described and used in the above example to extract and determine
the eigenvalues and eigenvectors for two variables can be extended to the more
general, multivariate case but the procedure becomes increasingly difficult and
arithmetically tedious with large matrices. Instead, the eigenvalues are usually
found by matrix manipulation and iterative approximation methods using
appropriate computer software. Before such an analysis is undertaken, the
question of whether to transform the original data should be considered.
Examination of Table 3 indicates that the variates considered have widely
differing means and standard deviations. Rather than standardizing the data,
since they are all recorded in the same units, one other useful transformation is
to take logarithms of the values. The result of this transformation is to scale all
the data to a more similar range and reduce the relative effects of the more
concentrated metals. Having performed the log-transformation on our data,
the results of performing PCA on all 13 variates for the 17 samples are as given in
Table 6.
According to the eigenvalue results presented in Table 6(a), and displayed in
the scree plot of Figure 13, over 84% of the total variance in the original data
Figure 13 An eigenvalue scree plot for the heart-tissue trace metal data (factor number
on the abscissa)
Table 6 Results of principal components analysis on the logarithms of the trace metals concentration data

(a) Eigenvalues
             PC1   PC2   PC3   PC4   PC5   PC6   PC7   PC8   PC9   PC10  PC11  PC12  PC13
Variance     7.49  3.47  0.99  0.38  0.23  0.15  0.09  0.08  0.05  0.02  0.01  0.001 0.000
% Variance   57.7  26.7  7.7   2.9   1.8   1.2   0.7   0.6   0.4   0.1   0.1   0.0   0.0
Cumulative % 57.7  84.4  92.1  95.0  96.8  98.0  98.7  99.3  99.7  99.8  99.9  100   100

(b) Eigenvectors
          PC1    PC2    PC3    PC4    PC5    PC6    PC7    PC8    PC9    PC10   PC11   PC12   PC13
Log Cu   -0.17   0.47   0.08   0.08   0.07   0.16  -0.07  -0.17   0.35   0.47  -0.43  -0.22   0.35
Log Mn    0.20   0.39  -0.05   0.57  -0.30   0.12   0.14   0.27  -0.16  -0.54  -0.17   0.001  0.13
Log Mo    0.63   0.49  -0.04  -0.33   0.26  -0.52   0.36  -0.009  0.03  -0.30   0.20   0.02   0.26
Log Zn    0.10   0.01   0.95   0.10   0.12  -0.11  -0.1    0.10  -0.08  -0.03   0.08  -0.05  -0.006
Log Cr    0.33   0.04  -0.19  -0.03   0.74   0.26  -0.30   0.20  -0.23  -0.08  -0.006 -0.21   0.01
Log Ni    0.29   0.27  -0.05  -0.24  -0.30  -0.26  -0.50  -0.26  -0.45   0.28  -0.06  -0.002 -0.08
Log Cs    0.36  -0.04  -0.07   0.15  -0.20   0.002 -0.05  -0.12   0.35   0.11   0.58  -0.55   0.10
Log Ba    0.35  -0.14  -0.03   0.17   0.04  -0.05  -0.16   0.08   0.19   0.19   0.14   0.68   0.48
Log Sr    0.34  -0.11  -0.01   0.33   0.27  -0.31   0.24  -0.50   0.19   0.10  -0.32   0.04  -0.36
Log Cd    0.33   0.19   0.17  -0.36  -0.07   0.59   0.41  -0.38  -0.17   0.005 -0.02   0.07   0.16
Log Al    0.35  -0.03  -0.05   0.06  -0.18  -0.11   0.44   0.63  -0.21   0.38  -0.14  -0.09  -0.11
Log Sn   -0.09   0.51   0.01   0.14   0.08   0.26   0.38   0.05   0.08   0.22   0.42   0.34  -0.53
Log Pb    0.34   0.07   0.04  -0.41  -0.15   0.06  -0.22   0.26   0.55  -0.31  -0.26   0.07  -0.29
Figure 14 Scatter plot of the 17 heart-tissue samples on the standardized first two prin-
cipal components from the trace metal data
can be accounted for by the first two principal components. The transformation
of the 13 original variables to two new linear combinations represents consider-
able reduction of the data presented whilst retaining much of the original
information. A scatter plot of the first two principal components scores is
shown in Figure 14, and patterns among the samples, reflecting the distribution
of the trace metals in the data, are evident. Three tissues, the pulmonary valve,
aortic valve, and the right superior vena cava, constitute unique groups of one
tissue each, well distinguished from the rest. The aorta, main pulmonary artery,
mitral, and tricuspid valves constitute a cluster of four tissue types. Finally,
there is a group of ten tissues derived from the myocardium. A more detailed
analysis and discussion of this data set is presented by Niedermeier.6
As well as being used with discrete analytical data, such as the trace metal
concentrations discussed above, principal components analysis has been exten-
sively employed on digitized spectral profiles.12 A simple example will illustrate
the basis of these applications. Infrared spectra of 21 samples of acrylic, PVC,
styrene, and nylon polymers, as thin films, were recorded in the range
4000-400 cm⁻¹. Each spectrum was normalized on the most intense absorption
band to remove film thickness effects, and reduced to 216 discrete values by
signal averaging. The resulting 21 x 216 data matrix was subject to principal
components analysis. The resulting eigenvalues are illustrated in the scree plot
of Figure 15, and the first three principal components account for more than
91% of the total variance in the original spectra. A scatter plot of the polymer
data loaded on to these three components is shown in Figure 16. It is evident
from this plot that these three components are sufficient to provide effective
Figure 15 A scree plot for the eigenvalues derived from the IR spectra of 21 polymers
(factor number on the abscissa)
A Nylon; B Styrene; C PVC; D Acrylic
clustering of the samples with clear separation between the groups and types of
polymer. The first component, PC1, forms an axis which would allow the
partitioning between acrylic polymer and other samples. PC2 provides for two
partitions; both the nylon and PVC polymers are separated from the styrene
and acrylic polymers. PC3 allows the separation between styrenes and others.
Examination of the principal component loadings, the eigenvectors, as func-
tions of wavelength, i.e. spectra of loadings, highlights the weights given to each
spectral point in each of the original spectra, Figure 17. It can be seen from
these 'spectra' that where a partition between sample types is formed, the
majority of absorption bands in the corresponding spectra receive strong
positive or negative weighting. PC2 produces two partitions and, Figure 17(b),
the bands in nylon spectra receive positive weightings and bands in PVC
spectra, Figure 17(c), have negative weightings.
The power of principal components analysis is in providing a mathematical
transformation of our analytical data to a form with reduced dimensionality.
From the results, the similarity and difference between objects and samples can
often be better assessed and this makes the technique of prime importance in
chemometrics. Having introduced the methodology and basics here, future
chapters will consider the use of the technique as a data preprocessing tool.
5 Factor Analysis
The extraction of the eigenvectors from a symmetric data matrix forms the basis
and starting point of many multivariate chemometric procedures. The way in
which the data are preprocessed and scaled, and how the resulting vectors are
treated, has produced a wide range of related and similar techniques. By far the
most common is principal components analysis. As we have seen, PCA pro-
vides n eigenvectors derived from an n × n dispersion matrix of variances and
covariances, or correlations. If the data are standardized prior to eigenvector
analysis, then the variance-covariance matrix becomes the correlation matrix
[see Equation (25) in Chapter 1, with s1 = s2]. Another technique, strongly
related to PCA, is factor analysis.13
related to PCA, is factor analysis.13
Factor analysis is the name given to eigen analysis of a data matrix with the
intended aim of reducing the data set of n variables to a specified number, p, of
fewer linear combination variables, or factors, with which to describe the data.
Thus, p is selected to be less than n and, hopefully, the new data matrix will be
more amenable to interpretation. The final interpretation of the meaning and
significance of these new factors lies with the user and the context of the
problem.
A full description and derivation of the many factor analysis methods
reported in the analytical literature is beyond the scope of this book. We will
limit ourselves here to the general and underlying features associated with the
technique. A more detailed account is provided by, for example, Hopke14,15 and
others.16-19
13 D. Child, 'The Essentials of Factor Analysis', 2nd Edn, Cassell Educational, London, UK, 1990.
14 P.K. Hopke, in 'Methods of Environmental Data Analysis', ed. C.N. Hewitt, Elsevier, Essex, UK, 1992.
15 P.K. Hopke, Chemometr. Intell. Lab. Systems, 1989, 6, 7.
16 E. Malinowski, 'Factor Analysis in Chemistry', J. Wiley and Sons, New York, USA, 1991.
17 T.P.E. Auf der Heyde, J. Chem. Ed., 1983, 7, 149.
18 G.L. Ritter, S.R. Lowry, T.L. Isenhour, and C.L. Wilkins, Anal. Chem., 1976, 48, 591.
19 E. Malinowski and M. McCue, Anal. Chem., 1977, 49, 284.
[Figure 17: the PC2 loadings plotted as 'spectra', together with the spectra of the acrylic, nylon, PVC, and polystyrene samples]

[Figure 18: the mass spectral data (m/z 29-86) for the sample mixtures and their mean]
The first aim is to determine how many distinct components are needed to
account for the observed spectra, i.e. how many components are in the
mixtures. We can then attempt to identify the nature or source of each
extracted component. These are the aims of factor analysis.
Before we can compute the eigenvectors associated with our data matrix, we
need to select appropriate, if any, preprocessing methods for the data, and the
form of the dispersion matrix. Specifically, we can choose to generate a
covariance matrix or a correlation matrix from the data. Each of these could be
derived from the original, origin-centred data or from transformed, mean-
centred data. In addition, we should bear in mind the aim of the analysis and
decide whether the variables for numerical analysis are the m/z values or the
composition of the sample mixtures themselves. Thus we have eight options in
forming the transformed, symmetric matrix for extracting eigenvectors. We can
form a 5 x 5 covariance, or correlation, matrix on the origin- or mean-centred
compositional values. Alternatively, a 17 x 17 covariance, or correlation,
matrix can be formed from origin- or mean-centred m/z values.
Each of these transformations can be expressed in matrix form as a transform
of the data matrix X to a new matrix Y followed by calculating the appropriate
dispersion matrix, C (the variance-covariance, or correlation matrix). The
relevant equations are
84 Chapter 3
and
a, = 1 and b, = 0 (35)
a, = 1 and b, = - Zj (36)
Table 8 The variance-covariance matrix, eigenvalues, and eigenvectors computed from the MS data

Covariance matrix
A B C D E
A 726.62
B 726.04 734.27
C 683.14 713.48 808.63
D 594.23 655.34 890.97 1157.49
E 421.65 485.71 754.65 1065.58 1025.79
Eigenvalues
Factor 1 2 3 4 5
Eigenvalue 3744.08 703.56 2.84 1.25 1.07
Cumulative % 84.08 99.88 99.95 99.98 100.00
contribution
Eigenvectors
      F(I)    F(II)   F(III)  F(IV)   F(V)
A     0.368   0.557  -0.424   0.607  -0.075
B     0.389   0.486   0.401  -0.462  -0.488
C     0.461   0.121  -0.193  -0.434   0.740
D     0.533  -0.359   0.605  -0.446   0.146
E     0.464  -0.557  -0.505  -0.176  -0.434
The MS data can be processed in the same way as described previously with
principal components analysis. In the earlier example the dispersion matrix
was formed between the measured trace metal variables, and the
technique is sometimes referred to as R-mode analysis. For the current MS
data, processing by R-mode analysis would involve the data being scaled along
each m / z column and information about relative peak sizes in any single
spectrum would be destroyed. In Q-mode analysis, any scaling is performed
within a spectrum and the mass fragmentation pattern for each sample is
preserved.
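The distinction between forming the dispersion matrix over variables (R-mode) and over samples (Q-mode), and between covariance and correlation forms, can be sketched as follows. The 5 x 17 data matrix here is a random stand-in for the five mixture spectra, not the book's data:

```python
import numpy as np

# Stand-in data matrix: 5 mixture samples (rows) x 17 m/z variables (columns).
rng = np.random.default_rng(1)
X = rng.random((5, 17))

def dispersion(M, mean_centre=True, correlation=False):
    """Form a dispersion matrix over the columns of M after the transform
    y_ij = a_j x_ij + b_j (b_j = 0 origin-centred, b_j = -mean_j mean-centred)."""
    Y = M - M.mean(axis=0) if mean_centre else M
    C = Y.T @ Y / (Y.shape[0] - 1)
    if correlation:
        s = np.sqrt(np.diag(C))
        C = C / np.outer(s, s)      # rescale covariances to correlations
    return C

C_R = dispersion(X)     # R-mode: 17 x 17 matrix between m/z variables
C_Q = dispersion(X.T)   # Q-mode: 5 x 5 matrix between sample spectra
print(C_R.shape, C_Q.shape)
```

Choosing R-mode or Q-mode, covariance or correlation, and origin- or mean-centring gives the eight options described in the text.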
The variance-covariance matrix of the mass spectra data is presented in
Table 8, along with the results of computing the eigenvectors and eigenvalues
from this matrix. In factor analysis we assume that any relationships between
our samples from within the original data set can be represented by p mutually
uncorrelated underlying factors. The value of p is usually selected to be much
less than the number of original variables. These p underlying factors, or new
variables, are referred to as common factors and may be amenable to physical
interpretation. The remaining variance not accounted for by the p factors will
be contained in a unique factor and may be attributable, for example, to noise in
the system.
The first requirement is to determine the appropriate value of p, the number
of factors necessary to describe the original data adequately. If p cannot be
specified then the partition of total variance between common and unique
factors cannot be determined. For our simple example with the mass spectra
data it appears obvious that p = 2, i.e. there are two common factors which we
may interpret as being due to two components in the mixtures. The eigenvalues
drop markedly from the second to the third value, as can be seen from Table 8
and Figure 19. The first two factors account for more than 99% of the total
variance. The choice is not always so clear, however, and in the chemometrics
literature a number of more objective functions have been described to select
appropriate values of p.14

Figure 19 The scree plot of the eigenvalues extracted from the MS data
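The cumulative percentage contributions quoted in Table 8 follow directly from the eigenvalues; as a quick numerical check:

```python
eigenvalues = [3744.08, 703.56, 2.84, 1.25, 1.07]   # from Table 8
total = sum(eigenvalues)

running, cumulative = 0.0, []
for ev in eigenvalues:
    running += ev                                    # running sum of variance
    cumulative.append(round(100 * running / total, 2))
print(cumulative)  # → [84.08, 99.88, 99.95, 99.98, 100.0]
```

The abrupt flattening after the second value is what the scree plot of Figure 19 shows graphically.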
The eigenvectors in Table 8 have been normalized, i.e. each vector has unit
length. To convert them to factors, each eigenvector is multiplied by the
square root of its corresponding eigenvalue. The elements in each of the
factors are the factor loadings, and the complete factor loading matrix for
our MS data is given in Table 9. This conversion has
not changed the orientation of the factor axes from the original eigenvectors
but has simply changed their absolute magnitude. The lengths of each vector
are now equal to the square root of the eigenvalues, i.e. the factors represent
the standard deviations.

Table 9 The factor loading matrix from a Q-mode analysis of the MS data
From Table 8, the first factor accounts for 3744.08/4452.80 = 84% of the
variance in the data. Of this, 22.52²/3744.08 = 13.5% is derived from object or
sample A, 15.2% from B, 21.3% from C, 28.5% from D, and 21.5% from E.
The total variance associated with object A is accounted for by the five factors.
Taking the square of each element in the factor matrix (remember, these are
standard deviations) and summing for each object provides the amount of
variance contributed by each object.
For sample A,

h² = (22.52² + 14.77²)/726.62 = 0.998 (43)

where 22.52 and 14.77 are the loadings of sample A on the first two factors.
The values computed from Equation (43) represent the fraction of each object's
variance explained by the two factors. They are referred to as the communality
values, denoted h2, and they depend on the number of factors used. As the
number of factors retained increases, then the communalities tend to unity. The
remaining (1 - h2) fraction of the variance for each sample is considered as
being associated with its unique variance and is attributable to noise.
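The scaling of eigenvectors to loadings and the resulting communalities can be verified numerically from the values quoted in Table 8; this NumPy sketch is illustrative, not the book's calculation:

```python
import numpy as np

eigvals = np.array([3744.08, 703.56, 2.84, 1.25, 1.07])    # Table 8
eigvecs = np.array([                                       # rows A-E, cols F(I)-F(V)
    [0.368,  0.557, -0.424,  0.607, -0.075],
    [0.389,  0.486,  0.401, -0.462, -0.488],
    [0.461,  0.121, -0.193, -0.434,  0.740],
    [0.533, -0.359,  0.605, -0.446,  0.146],
    [0.464, -0.557, -0.505, -0.176, -0.434],
])

loadings = eigvecs * np.sqrt(eigvals)         # scale each column by sqrt(eigenvalue)
var_per_object = (loadings ** 2).sum(axis=1)  # recovers the covariance diagonal
h2 = (loadings[:, :2] ** 2).sum(axis=1) / var_per_object  # communalities, p = 2
print(np.round(h2, 3))                        # all close to 1
```

With two factors retained, each sample's communality is close to unity, confirming that the remaining three factors carry essentially only noise.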
Returning to our mass spectra, having calculated the eigenvalues, eigen-
vectors, and factor loadings, we must decide how many factors need be retained
in our model. In the absence of noise in the measurements, the eigenvalues
above the number necessary to describe the data are zero. In practice, of course,
noise will always be present. However, as we can see with our mass spectra data
a large relative decrease in the magnitude of the eigenvalues occurs after two
values, so we can assume here that p = 2. Hopke14 provides an account of
several objective functions to assess the correct number of factors.
Having reduced the dimensionality of the data by electing to retain two
factors, we can proceed with our analysis and attempt to interpret them.
Examination of the columns of loadings for the first two factors in the factor
matrix, Table 9, shows that some values are negative. The physical significance
of these loadings or coefficients is not immediately apparent. The loadings for
these two factors are illustrated graphically in Figure 20(a). The location of the
orthogonal vectors in the two-factor space has been constrained by the three
remaining but unused factors. If these three factors are not to be used then we
can rotate the first two factors in the sample space and possibly find a better
position for them; a position which will provide for a more meaningful inter-
pretation of the data. Of the several factor rotation schemes routinely used in
factor analysis, that referred to as the varimax technique is most commonly
used. Varimax rotation moves each factor axis to a new, but still orthogonal,
position so that the loading projections are near either the origin or the
extremities of these new axes.8 The rotation is rigid, to retain orthogonality
between factors, and is undertaken using an iterative algorithm.
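One common iterative formulation of the varimax criterion can be sketched as follows; this is one of several equivalent algorithms in circulation, offered as an illustration rather than the book's procedure:

```python
import numpy as np

def varimax(L, max_iter=100, tol=1e-8):
    """Rigid orthogonal rotation of a loading matrix L (objects x factors)
    maximizing the varimax criterion; SVD-based iteration."""
    n, p = L.shape
    R = np.eye(p)
    d = 0.0
    for _ in range(max_iter):
        LR = L @ R
        # gradient of the varimax criterion with respect to the rotation
        G = L.T @ (LR ** 3 - LR @ np.diag((LR ** 2).sum(axis=0)) / n)
        u, s, vt = np.linalg.svd(G)
        R = u @ vt                   # nearest orthogonal matrix
        d_new = s.sum()
        if d_new < d * (1 + tol):    # stop when the criterion stalls
            break
        d = d_new
    return L @ R, R

# Rotate a small, arbitrary loading matrix; because the rotation is rigid,
# each row's sum of squared loadings (its communality) is unchanged.
rng = np.random.default_rng(3)
L = rng.normal(size=(7, 2))
L_rot, R = varimax(L)
print(np.allclose((L_rot ** 2).sum(axis=1), (L ** 2).sum(axis=1)))  # → True
```

The invariance of the communalities under rotation is the numerical expression of the rigidity described above.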
Using the varimax rotation method, the rotated factor loadings for the first
two factors from the mass spectra data are given in Table 10 and are plotted in
Figure 20(b). The relative position of the objects to each other has remained
unchanged, but all loadings are now positive. In fact, all loadings are present in
the first quadrant of the diagram and in an order we can recognize as corres-
ponding to the mixtures' compositional analysis. Sample A is predominantly
cyclohexane (90%) and E hexane (90%). Examination of Figure 20(b) indicates
how we could identify the nature of the two components if they were unknown,
as would be the case with a real set of samples of course. Presumably, if the
mass spectra of the two pure components were present in, or added to, the
original data matrix, then the loadings associated with these samples would
align themselves closely with the pure factors. Such, in fact, is the case. Table 11
provides the normalized mass spectra of cyclohexane and hexane, and Table 12
gives the varimax rotated factor loadings for the now 7 x 7 variance-
covariance matrix from the data now containing the mixtures and the pure
components. The loadings of the first two factors for the seven samples are
illustrated in Figure 21. As expected the single-components spectra are closely
aligned with the axes of the derived and rotated factors.
Figure 20 The original factor loadings obtained from the MS data (a) and the rotated factor loadings, following varimax rotation, with only two factors retained in the model (b)

Figure 21 The rotated factor loadings for the MS data including the pure component spectra, X1 (cyclohexane) and X2 (hexane)
Table 10 The rotated factor loading matrix, retaining only the first two factors and using varimax

Table 12 The rotated factor loadings for the complete MS data set, including the two pure components, cyclohexane (X1) and hexane (X2)
Varimax rotation is a commonly used and widely available factor rotation
technique, but other methods have been proposed for interpreting factors from
analytical chemistry data. We could rotate the axes in order that they align
directly with factors from expected components. These axes, referred to as test
vectors, would be physically significant in terms of interpretation and the
rotation procedure is referred to as target transformation. Target
transformation factor analysis has proved to be a valuable technique in
chemometrics.21 The number of components in mixture spectra can be identified,
and the rotated factor loadings can be interpreted in terms of test data
relating to standard, known spectra.
In this chapter we have been able to discuss only some of the more common
and basic methods of feature selection and extraction. This area is a major
subject of active research in chemometrics. The effectiveness of subsequent data
processing and interpretation is largely governed by how well our analytical
data have been summarized by these methods. The interested reader is encour-
aged to study the many specialist texts and journals available to appreciate the
wide breadth of study associated with this subject.
21 P.K. Hopke, D.J. Alpert, and B.A. Roscoe, Comput. Chem., 1983, 7, 149.
CHAPTER 4
Pattern Recognition I -
Unsupervised Analysis
1 Introduction
It is an inherent human trait that, presented with a collection of objects, we will
attempt to classify them and organize them into groups according to some
observed or perceived similarity. Whether it is with childhood toys and sorting
blocks by shape or into coloured sets, or with hobbies devoted to collecting, we
obtain satisfaction from classifying things. This characteristic is no less evident
in science. Indeed, without the ability to classify and group objects, data, and
ideas, the vast wealth of scientific information would be little more than a single
random set of data and be of little practical value or use. There are simply too
many objects or events encountered in our daily routine to be able to consider
each as an individual and discrete entity.
Instead, it is common to refer an observation or measure to some previously
catalogued, similar example. The organization in the Periodic Table, for
example, allows us to study group chemistry with deviations from general
behaviour for any element to be recorded as required. In a similar manner,
much organic chemistry can be catalogued in terms of the chemistry of generic
functional groups. In infrared spectroscopy, the concept of correlation between
spectra and molecular structures is exploited to provide the basis for spectral
interpretation; in general each functional group exhibits well defined regions of
absorption.
Although the human brain is excellent at recognizing and classifying patterns
and shapes, it performs less well if an object is represented by a numerical list of
attributes, and much analytical data is acquired and presented in such a form.
Consider the data shown in Table 1, obtained from an analysis of a series of
alloys. This is only a relatively small data set but it may not be immediately
apparent that these samples can be organized into well defined groups defining
the type or class of alloy according to their composition. The data from Table 1
are expressed diagrammatically in Figure 1. Although we may guess that there
are two similar groups based on the Ni, Cr, and Mn content, the picture suffers
from the presence of extraneous data. The situation would be more complex
Table 1 Concentration, expressed as mg kg-1, of trace metals in six alloy samples
Figure 1 The trace metal data from Table 1 plotted to illustrate the presence of two
groups
still if more objects were analysed or more variables were measured. As modern
analytical techniques are able to generate large quantities of qualitative and
quantitative data, it is necessary to seek and apply formal methods which can
serve to highlight similarities and differences between samples. The general
problem is one of classification and the contents of this chapter are concerned
with addressing the following, broadly stated task. Given a number of objects
or samples, each described by a set of measured values, we are to derive a
formal mathematical scheme for grouping the objects into classes such that
objects within a class are similar, and different from those in other classes. The
number of classes and the class characteristics are not known a priori but are to
be determined from the analysis.1
It is the last statement in the challenge facing us that distinguishes this
unsupervised approach from the supervised pattern recognition methods
discussed in Chapter 5.
1 B. Everitt, 'Cluster Analysis', 2nd Edn, Heinemann Educational, London, UK, 1980.
Figure 2 What constitutes a cluster and its boundary will depend on interpretation as well
as the clustering algorithm employed
2 Choice of Variable
In essence, what all clustering algorithms aim to achieve is to group together
similar, neighbouring points into clusters in the n-dimensional space defined by
the n-variate measures on the objects. As with supervised pattern recognition
(see Chapter 5), and other chemometric techniques, the selection of variables
and their pre-processing can greatly influence the outcome. It is worth
repeating that cluster analysis is an exploratory, investigative technique and a
data set should be examined using several different methods in order to obtain a
more complete picture of the information contained in the data.
The initial choice of the measurements made and used to describe each object
constitute the frame of reference within which the clusters are to be established.
This choice will reflect an analyst's judgement of relevance for the purpose of
classification, based usually on prior experience. In most cases the number of
variables is determined empirically and often tends to exceed the minimum
required to achieve successful classification. Although this situation may guar-
antee satisfactory classification, the use of an excessive number of variables can
severely affect computation time and a method's efficiency. Applying some
preprocessing transformation to the data is often worthwhile. Standardization
of the raw data can be undertaken, and is particularly valuable when different
types of variable are measured. But it should be borne in mind that standardi-
zation can have the effect of reducing or eliminating the very differences
between objects which are required for classification. Another technique worth
considering is to perform a principal components analysis on the original data,
to produce a set of new, statistically independent variables. Cluster analysis can
then be performed on the first few principal components describing the major-
ity of the samples' variance.
Finally, having performed a cluster analysis, statistical tests can be employed
to assess the contribution of each variable to the clustering process. Variables
found to contribute little may be omitted and the cluster analysis repeated.
Similarity Measures
Similarity or association coefficients have long been associated with cluster
analysis, and it is perhaps not surprising that the most commonly used is the
correlation coefficient. Other similarity measures are rarely employed. Most are
poorly defined and not amenable to mathematical analysis, and none have
received much attention in the analytical literature. The calculation of
correlation coefficients is described in Chapter 1, and Table 2(a) provides the
full symmetric matrix of these coefficients of similarity for the alloy data from Table
1. With such a small data set, a cluster analysis can be performed manually to
illustrate the stages involved in the process. The first step is to find the mutually
largest correlation in the matrix to form centres for the clusters. The highest
correlation in each column of Table 2(a) is shown in boldface type. Objects A
and D form mutual highly correlated pairs, as do objects B and C. Note that
although object E is most highly correlated with D, they are not considered as
forming a pair as D most resembles A rather than E. Similarly, object F is not
paired with B, as B is more similar to C.
The resemblance between the mutual pairs is indicated in the diagram shown
in Figure 3, which links A to D and B to C by a horizontal line drawn from the
vertical axis at points representing their respective correlation coefficients.
At the next stage, objects A and D, and B and C, are considered to comprise
new, distinct objects with associative properties and are similar to the other
objects according to their average individual values. Table 2(b) shows the newly
calculated correlation matrix. Clusters AD and BC have a correlation coeffi-
cient calculated from the sum of the correlations of A to B, D to B, A to C, and
D to C, divided by four. The correlation between AD and E is the average of the
original A to E and D to E correlations. The clustering procedure is now
repeated, and object E joins cluster AD and object F joins BC, Figure 3(b). The
process is continued until all clusters are joined and the final similarity matrix is
produced as in Table 2(c) with the resultant diagram, a dendrogram, shown in
Figure 3(c). That two groups, ADE and BCF, may be present in the original
data is demonstrated.
From this extremely simple example, the basic steps involved in cluster
analysis and the value of the technique in classification are evident. The final
dendrogram, Figure 3(c), clearly illustrates the similarity between the different
samples. The original raw tabulated data have been reduced to a pictorial form
which simplifies and demonstrates the structure within the data. It is pertinent
to ask, however, what information has been lost in producing the diagram and
Table 3 The matrix of apparent correlations between the six alloy samples, derived from the dendrogram of Figure 3 and Table 2(b) and (c)
Figure 4 True vs. apparent correlations indicating the distortion introduced by averaging correlation values to produce the dendrogram
to what extent does the graph accurately represent our original data. From the
dendrogram and Table 2(b) the apparent correlation between sample B and
sample F is 0.91, rather than the true value of 0.94 from the calculated similarity
matrix. This error arose owing to the averaging process in treating the BC pair
as a single entity, and the degree of distortion increases as successive levels of
clusters are averaged together. Table 3 is the matrix of apparent correlations
between objects as obtained from the dendrogram. These apparent correlations
are sometimes referred to as cophenetic values, and if these are plotted against
actual correlations, Figure 4, then a visual impression is obtained of the
distortion in the dendrogram. A numerical measure of the similarity between
the values can be calculated by computing the linear correlation between the
two sets. If there is no distortion, then the plot would form a straight line and
the correlation would be 1. In our example this correlation, r = 0.99. Although
such a high value for r may indicate a strong linear relationship, as Figure 4
shows there is considerable difference between the real and apparent corre-
lations.
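The distortion measure described above is simply the linear correlation between the true and cophenetic values. The value sets below are hypothetical stand-ins, not the values of Table 3:

```python
import numpy as np

# Hypothetical pairs of true and dendrogram-derived (cophenetic) correlations.
true_r      = np.array([0.99, 0.94, 0.90, 0.88, 0.85, 0.82])
cophenetic_r = np.array([0.99, 0.91, 0.91, 0.86, 0.86, 0.86])

# Pearson correlation between the two sets: r = 1 means no distortion.
r = np.corrcoef(true_r, cophenetic_r)[0, 1]
print(round(r, 2))
```

As in the example in the text, r can be high even when individual cophenetic values deviate noticeably from the true correlations, so the scatter plot itself should always be inspected.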
Distance Measures
The correlation coefficient is too limiting in its definition to be of value in many
applications of cluster analysis. It is a measure only of colinearity between
variates and takes no account of non-linear relationships or the absolute
magnitude of variates. Instead, distance measures which can be defined
mathematically are more commonly encountered in cluster analysis. Of course,
it is always possible at the end of a clustering process to substitute distance with
reverse similarity; the greater the distance between objects the less their simi-
larity.
An object is characterized by a set of measures, and it may be represented as
a point in multidimensional space defined by the axes, each of which corre-
sponds to a variate. In Figure 5, a data matrix X defines measures of two
variables on two objects A and B. Object A is characterized by the pattern
vector, a = (x11, x12), and B by the pattern vector, b = (x21, x22). Using a distance
measure, objects or points closest together are assigned to the same cluster.
Numerous distance metrics have been proposed and applied in the scientific
literature.
For a function to be useful as a distance metric between objects then the
following basic rules apply (for objects A and B):
(a) dAB ≥ 0, the distance between all pairs of measurements for objects A and
B must be non-negative;
(b) dAB = dBA, the distance measure is symmetrical and can only be zero
when A = B;
(c) dAC + dBC ≥ dAB, for all points C. This statement corresponds to the
familiar triangular inequality of Euclidean geometry.
The most commonly referenced distance metric is the Euclidean distance,
defined by

dAB = [Σj (x1j - x2j)²]^1/2 (2)

where xij is the value of the j'th variable measured on the i'th object. This
equation can be expressed in vector notation as

dAB = [(a - b)T (a - b)]^1/2 (3)

A more general form is the Minkowski metric,

dAB = [Σj |x1j - x2j|^m]^1/m (4)

When m = 1, Equation (4) defines the city-block metric, and if m = 2 then the
Euclidean distance is defined. Figure 5 illustrates these measures on two-
dimensional data.
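The city-block and Euclidean metrics, as special cases of the general Minkowski form, can be sketched in a few lines (an illustrative example with arbitrary points):

```python
import numpy as np

def minkowski(a, b, m):
    """General Minkowski metric: m = 1 is city-block, m = 2 is Euclidean."""
    return float((np.abs(a - b) ** m).sum() ** (1.0 / m))

a = np.array([2.0, 1.0])
b = np.array([1.0, 2.0])
print(minkowski(a, b, 1))  # city-block: |1| + |-1| = 2.0
print(minkowski(a, b, 2))  # Euclidean: sqrt(2) ≈ 1.414
```

The city-block value is always at least as large as the Euclidean value for the same pair of points, which is the geometric difference Figure 5 illustrates.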
If the variables have been measured in different units, then it may be
necessary to scale the data to make the values comparable.4-7 An equivalent
procedure is to compute a weighted Euclidean distance,

dAB = [Σj wj (x1j - x2j)²]^1/2 (5)
4 J. Hartigan, 'Clustering Algorithms', J. Wiley and Sons, New York, USA, 1975.
5 A.A. Afifi and V. Clark, 'Computer Aided Multivariate Analysis', Lifetime Learning, California, USA, 1984.
6 'Classification and Clustering', ed. J. Van Ryzin, Academic Press, New York, USA, 1971.
7 D.J. Hand, 'Discrimination and Classification', J. Wiley and Sons, Chichester, UK, 1981.
where the weights wj form a vector corresponding to the variables in the data
matrix. Weighting variables is
largely subjective and may be based on a priori knowledge regarding the data,
such as measurement error or equivalent variance of variables. If, for example,
weights are chosen to be inversely proportional to measurement variance, then
variates with greater precision are weighted more heavily. However, such
variates may contribute little to an effective clustering process.
One weighted distance measure which does occur frequently in the scientific
literature is the Mahalanobis distance,4

dAB = [(a - b)T Cov^-1 (a - b)]^1/2 (6)

where Cov is the full variance-covariance matrix for the original data. The
Mahalanobis distance is invariant under any linear transformation of the
original variables. If several variables are highly correlated, this type of weight-
ing scheme down-weights their individual contributions. It should be used with
care, however. In cluster analysis, use of the Mahalanobis distance may
produce even worse results than equating the variance of each variable and may
serve only to decrease the clarity of the clusters.
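A sketch of the Mahalanobis calculation; with an identity covariance matrix it reduces to the ordinary Euclidean distance, while a strongly correlated covariance matrix reweights the same separation:

```python
import numpy as np

def mahalanobis(a, b, cov):
    """Weighted distance using the inverse variance-covariance matrix."""
    d = a - b
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

a, b = np.array([2.0, 1.0]), np.array([1.0, 2.0])
print(mahalanobis(a, b, np.eye(2)))       # identity: equals Euclidean, sqrt(2)
cov = np.array([[1.0, 0.9], [0.9, 1.0]])  # two highly correlated variables
print(mahalanobis(a, b, cov))             # same points, rescaled distance
```

Note how the separation (1, -1), which runs against the direction of correlation, is stretched by the inverse covariance weighting; this sensitivity to the covariance structure is why the measure must be used with care in cluster analysis.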
Before proceeding with a more detailed examination of clustering techniques,
we can now compare correlation and distance metrics as suitable measures of
similarity for cluster analysis. A simple example serves to illustrate the main
points. In Table 4, three objects (A, B, and C) are characterized by five variates.
The correlation matrix and Euclidean distance matrix are given in Tables 5 and
6, respectively.

Table 5 The correlation matrix (a) of the data from Table 4, and clustering of objects with highest mutual correlation (b)
Figure 6 Dendrograms for the three-object data set from Table 4, clustered according to correlation (a) and distance (b)
Figure 7 The three-object data from Table 4, plotted against variable number

The distance metric in this case provides a suitable clustering measure. On the other hand, if
x1, x2, ..., x5 denoted, say, wavelengths and the response values a measure of
absorption or emission at these wavelengths, then a different explanation may
be sought. It is clear from Figure 7 that if the data represent spectra, then A and
C are similar, differing only in scale or concentration, whereas spectrum B has a
different profile. Hence, correlation provides a suitable measure of similarity.
If, as spectra, the data had been normalized to the most intense response, then
A and C would have been closer and the distance metric more meaningful.
In summary, therefore, the first stage in cluster analysis is to compute the
matrix of selected distance measures between objects. As the entire clustering
process may depend on the choice of distance it is recommended that results
using different functions are compared.
4 Clustering Techniques
By grouping similar objects, clusters are themselves representative of those
objects and form a distinct group according to some empirical rule. It is implicit
in producing clusters that such a group can be represented further by a typical
element of the cluster. This single element may be a genuine member of the
cluster or a hypothetical point, for example an average of the contents' char-
acteristics in multidimensional space. One common method of identifying a
cluster's typical element is to substitute the mean values for the variates
describing the objects in the cluster. The between-cluster distance can then be
defined as the Euclidean distance, or other metric, between these means. Other
measures not using the group means are available. The nearest-neighbour
distance defines the distance between two closest members from different
groups. The furthest-neighbour distance on the other hand is that between the
most remote pair of objects in two groups. A further inter-group measure is
obtained by taking the average of all the inter-element measures between
elements in different groups. As well as defining the inter-group separation
between clusters, each of these measures provides the basis for a clustering
technique, defining the method by which clusters are constructed or divided.
In relatively simple cases, in which only two or three variables are measured
for each sample, the data can usually be examined visually and any clustering
identified by eye. As the number of variates increases, however, this is rarely
possible and many scatter plots, between all possible pairs of variates, would
need to be produced in order to identify major clusters, and even then clusters
could be missed. To address this problem, many numerical clustering tech-
niques have been developed, and the techniques themselves have been
classified. For our purposes the methods considered belong to one of the
following types.
(a) Hierarchical techniques in which the elements or objects are clustered to
form new representative objects, with the process being repeated at different
levels to produce a tree structure, the dendrogram.
(b) Methods employing optimization of the partitioning between clusters
using some type of iterative algorithm, until some predefined minimum change
in the groups is produced.
(c) Fuzzy cluster analysis in which objects are assigned a membership
function indicating their degree of belonging to a particular group or cluster.
In order to demonstrate the calculations and results associated with the
different methods, the small set of bivariate data in Table 7 will be used. These
data comprise 12 objects in two-dimensional space, Figure 8, and the positions
Table 7 The bivariate test data: (a) coordinates of the 12 objects; (b) the between-object distance matrix

(a)  A  B  C  D  E  F  G  H  I  J  K  L
x1   2  6  7  8  1  3  2  7  6  7  6  2
x2   1  1  1  1  2  2  3  3  4  4  5  6

(b) [the 12 x 12 between-object distance matrix]
of the points are representative of different shaped clusters: the single point (L),
the extended group (B,C,D), the symmetrical group (A,E,F,G), and the
asymmetrical cluster (H,I,J,K).1
Hierarchical Techniques
When employing hierarchical clustering techniques, the original data are separ-
ated into a few general classes, each of which is further subdivided into still
smaller groups until finally the individual objects themselves remain. Such
methods may be agglomerative or divisive. By agglomerative clustering, small
groups, starting with individual samples, are fused to produce larger groups as
in the examples studied previously. In contrast, divisive clustering starts with a
single cluster, containing all samples, which is successively divided into smaller
partitions. Hierarchical techniques are very popular, not least because their
application leads to the production of a dendrogram which can provide a
two-dimensional pictorial representation of the clustering process and the
results. Agglomerative hierarchical clustering is very common and we will
proceed with details of its application.
Agglomerative methods begin with the computation of a similarity or dis-
tance matrix between the objects, and result in a dendrogram illustrating the
successive fusion of objects and groups until the stage is reached when all objects
are fused into one large set. Agglomerative methods are the most common
hierarchical schemes found in scientific literature. The entire process involved
in undertaking agglomerative clustering using distance measures can be sum-
marized by a four-step algorithm.
106 Chapter 4
Step 1. Calculation of the between-object distance matrix.
Step 2. Find the smallest elements in the distance matrix and join the
corresponding objects into a single cluster.
Step 3. Calculate a new distance matrix, taking into account that clusters
produced in the second step will have formed new objects and taken
the place of original data points.
Step 4. Return to Step 2 or stop if the final two clusters have been fused into
the final, single cluster.
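The four steps above can be sketched in Python. This is an illustrative implementation, not the book's own code, using single-linkage (nearest-neighbour) distances on the Table 7 data:

```python
# Illustrative sketch of the four-step agglomerative algorithm using
# single-linkage distances on the Table 7 test data.
import math

x1 = [2, 6, 7, 8, 1, 3, 2, 7, 6, 7, 6, 2]
x2 = [1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 6]
labels = list("ABCDEFGHIJKL")

points = {lab: (a, b) for lab, a, b in zip(labels, x1, x2)}
clusters = [(lab,) for lab in labels]

def dist(c1, c2):
    """Step 1/3: single-linkage distance, the smallest point-to-point distance."""
    return min(math.dist(points[i], points[j]) for i in c1 for j in c2)

merges = []
while len(clusters) > 1:
    # Step 2: find and fuse the closest pair of clusters.
    pair = min(((a, b) for a in clusters for b in clusters if a < b),
               key=lambda p: dist(*p))
    merges.append((pair, dist(*pair)))
    clusters.remove(pair[0]); clusters.remove(pair[1])
    clusters.append(pair[0] + pair[1])
    # Steps 3 and 4: the fused cluster replaces its members; repeat until
    # a single cluster remains.

print(merges[0])  # the first fusion is B with C, at distance 1
```

As in the text, the first fusion joins objects B and C, whose between-object distance is the smallest in the matrix.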
The wide range of agglomerative methods available differ principally in the
implementation of Step 3 and the calculation of the distance between two
clusters. The different between-group distance measures can be defined in terms
of the general formula

d_k(ij) = α_i d_ki + α_j d_kj + β d_ij + γ|d_ki - d_kj|    (8)

where d_ij is the distance between objects i and j, and d_k(ij) is the distance between
group k and the new group (ij) formed by the fusion of groups i and j. The values
of the coefficients α_i, α_j, β, and γ are chosen to select the specific between-group
metric to be used. Table 8 lists the more common metrics and the corresponding
values for α_i, α_j, β, and γ.
The use of Equation (8) makes it a simple matter for standard computer
software packages to offer a choice of distance measures to be investigated by
selecting the appropriate values of the coefficients.
Table 8 Coefficient values for the between-group distance metrics of Equation (8)
(n_i denotes the number of objects in group i; N = n_k + n_i + n_j)

Metric                   α_i              α_j              β                       γ
Nearest neighbour        1/2              1/2              0                       -1/2
(single linkage)
Furthest neighbour       1/2              1/2              0                       1/2
(complete linkage)
Centroid                 n_i/(n_i+n_j)    n_j/(n_i+n_j)    -n_i n_j/(n_i+n_j)^2    0
Median                   1/2              1/2              -1/4                    0
Group average            n_i/(n_i+n_j)    n_j/(n_i+n_j)    0                       0
Ward's method            (n_k+n_i)/N      (n_k+n_j)/N      -n_k/N                  0
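In software, Equation (8) is indeed driven by such a coefficient table. The sketch below (illustrative, not from the text) shows that the single- and complete-linkage coefficients reduce the update to the minimum and maximum of d_ki and d_kj:

```python
# Illustrative sketch of the general between-group update of Equation (8).
def lance_williams(d_ki, d_kj, d_ij, ai, aj, beta, gamma):
    """Distance from cluster k to the new cluster (ij)."""
    return ai * d_ki + aj * d_kj + beta * d_ij + gamma * abs(d_ki - d_kj)

# Nearest neighbour (single linkage): ai = aj = 1/2, beta = 0, gamma = -1/2
single = lance_williams(3.0, 5.0, 2.0, 0.5, 0.5, 0.0, -0.5)
# Furthest neighbour (complete linkage): gamma = +1/2
complete = lance_williams(3.0, 5.0, 2.0, 0.5, 0.5, 0.0, +0.5)

print(single, complete)  # 3.0 5.0, i.e. min and max of d_ki and d_kj
```

Selecting a metric is then simply a matter of supplying the appropriate row of coefficients, exactly as noted in the text.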
Figure 9 Dendrogram of the data from Table 7(a) using the nearest-neighbour
algorithm (cluster axis: F, G, E, A, L, C, D, B, I, K, J, H)
From the 12 x 12 distance matrix, Table 7(b), objects B and C form a new,
combined object and the distance from BC to each original object is calculated
according to Equation (9). Thus, for A to BC,

d_A(BC) = min(d_AB, d_AC)

i.e. the distance between a cluster and an object is the smallest of the distances
between the elements in the cluster and the object.
The distance between the new object BC and each remaining original object
is calculated, and the procedure repeated with the resulting 11 x 11 distance
matrix until a single cluster containing all objects is produced. The resulting
dendrogram is illustrated in Figure 9.
The dendrogram for the furthest-neighbour, or complete linkage, technique is
produced in a similar manner. In this case, Equation (8) becomes

d_k(ij) = (1/2)d_ki + (1/2)d_kj + (1/2)|d_ki - d_kj|

and this implies that

d_k(ij) = max(d_ki, d_kj)

i.e. the distance between a cluster and an object is the maximum of the distances
between cluster elements and the object.
For example, for group BC to object D, the B to D distance is 2 units and the
C to D distance is 1 unit. From Equation (13), therefore, d_D(BC) = 2.
The process is repeated between BC and other objects, and the iteration
starts again with the new distance matrix until a single cluster is produced. The
dendrogram from applying Ward's method is illustrated in Figure 11.
Figure 10 Dendrogram of the data from Table 7(a) using the furthest-neighbour
algorithm
Pattern Recognition I - Unsupervised Analysis
Figure 11 Dendrogram of the data from Table 7(a) using Ward's method
The different methods available from applying Equation (8) with the co-
efficients from Table 8 each produce their own style of dendrogram with their
own merits and disadvantages. Which technique or method is best is largely
governed by experience and empirical tests. The construction of the dendrol
gram invariably induces considerable distortion as discussed, and dher, non-
hierarchical, methods are generally favoured when large data sets are to be
analysed.
K-Means Clustering
A widely used non-hierarchical method is the K-means algorithm, which partitions
the objects into a preselected number, K, of clusters by minimizing the partition
error

E = Σ_i d²(x_i, B_L(i))

where L(i) is the cluster containing the i'th object and B_L(i) is the mean vector,
or centre, of that cluster. Thus E represents the sum of the squares of the
distances between each object and its cluster centre.
The algorithm proceeds by moving an object from one cluster to another in
order to reduce E,and ends when no movement can reduce E. The steps involved
are:
Step 1: Given K clusters and their initial contents, calculate the cluster
means, B_Lj, and the initial partition error, E.
Step 2: For the first object, compute the increase in error, ΔE, obtained by
transferring the object from its current cluster, L(1), to every other
cluster L, 2 ≤ L ≤ K.
If this value for ΔE is negative, i.e. the move would reduce the
partition error, transfer the object and adjust the cluster means,
taking account of their new populations.
Step 3: Repeat Step 2 for every object.
Step 4: If no object has been moved then stop, else return to Step 2.
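The relocation idea behind these steps can be sketched as follows. This is an illustrative simplification, not the book's code: the exact ΔE bookkeeping is not reproduced, and an object is simply moved whenever another cluster centre is nearer, which likewise reduces E.

```python
# Illustrative sketch of iterative relocation: move an object to the
# cluster whose centre is nearest, recompute the means, and repeat
# until a full pass produces no moves.
def kmeans_relocate(points, assign, K, max_passes=100):
    """points: list of (x1, x2); assign: initial cluster label 1..K per point."""
    def centres():
        out = {}
        for k in range(1, K + 1):
            members = [p for p, a in zip(points, assign) if a == k]
            if members:
                out[k] = tuple(sum(c) / len(members) for c in zip(*members))
        return out

    for _ in range(max_passes):
        moved = False
        for i, p in enumerate(points):
            c = centres()
            best = min(c, key=lambda k: (p[0] - c[k][0]) ** 2
                                        + (p[1] - c[k][1]) ** 2)
            if best != assign[i]:
                assign[i] = best
                moved = True
        if not moved:
            break
    return assign

points = [(2, 1), (6, 1), (7, 1), (8, 1), (1, 2), (3, 2),
          (2, 3), (7, 3), (6, 4), (7, 4), (6, 5), (2, 6)]
final = kmeans_relocate(points, [1, 2, 2, 3, 1, 1, 1, 3, 3, 4, 4, 2], K=4)
print(final)  # [1, 3, 3, 3, 1, 1, 1, 4, 4, 4, 4, 2]
```

Applied to the Table 7 data with the initial partition from Equation (19), this converges to the same final partition reported in the text: {A, E, F, G}, {L}, {B, C, D}, and {H, I, J, K}.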
Applying the algorithm manually to our test data will illustrate its operation.
Using the data from Table 7, it is necessary first to specify the number of
clusters into which the objects are to be partitioned. We will use K = 4. Before
the algorithm is implemented we also need to assign each object to an initial
cluster. A number of methods are available, and that used here is to assign
object i to cluster L(i) according to

L(i) = INT[(K - 1)(Σ_j x_ij - MIN Σ_j x_ij)/(MAX Σ_j x_ij - MIN Σ_j x_ij)] + 1    (19)

where Σ_j x_ij is the sum of all the variables for each object, and MIN and MAX
denote the minimum and maximum sum values.
For the test data,

Objects
          A  B  C  D  E  F  G  H  I  J  K  L
x1        2  6  7  8  1  3  2  7  6  7  6  2
x2        1  1  1  1  2  2  3  3  4  4  5  6
Σ_j x_ij  3  7  8  9  3  5  5 10 10 11 11  8

MAX Σ_j x_ij = 11    MIN Σ_j x_ij = 3

For object A,

L(A) = INT[(4 - 1)(3 - 3)/(11 - 3)] + 1 = 1

and similarly for the remaining objects:

i =    A  B  C  D  E  F  G  H  I  J  K  L
L(i) = 1  2  2  3  1  1  1  3  3  4  4  2
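The assignment rule of Equation (19) can be checked directly; the short sketch below reproduces the L(i) row above:

```python
# Illustrative check of the initial-assignment rule of Equation (19)
# applied to the Table 7 test data.
x1 = [2, 6, 7, 8, 1, 3, 2, 7, 6, 7, 6, 2]
x2 = [1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 6]
K = 4

sums = [a + b for a, b in zip(x1, x2)]          # sum of variables per object
lo, hi = min(sums), max(sums)                   # MIN and MAX sum values
L = [int((K - 1) * (s - lo) / (hi - lo)) + 1 for s in sums]
print(L)  # [1, 2, 2, 3, 1, 1, 1, 3, 3, 4, 4, 2]
```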
Similarly for each of the remaining three clusters. The initial clusters and
their mean values are therefore,

Cluster  Contents  Cluster means
                   x1     x2
1        A E F G   2.00   2.00
2        B C L     5.00   2.67
3        D H I     7.00   2.67
4        J K       6.50   4.50
and to Cluster 3,
and to Cluster 4,
The ΔE values are all positive and each proposed change would serve to
increase the partition error, so object A is not moved from Cluster 1. This result
can be appreciated by reference to Figure 12(a). Object A is closest to the centre
of Cluster 1 and nothing would be gained by assigning it to another cluster.
The algorithm continues by checking each object and calculating AE for each
object with each cluster. For our purpose, visual examination of Figure 12(a)
indicates that no change would be expected for object B, but for object C a
move is likely as it is closer to the centre of Cluster 3 than Cluster 2.
Moving object C, the third object, to Cluster 1,
and
and to Cluster 3,
and
and to Cluster 4,
So, moving object C from Cluster 2 to Cluster 3 decreases E by 14.88, and the
new value of E is (42.35 - 14.88) = 27.47. The partition is therefore changed.
With new clusters and contents we must calculate their new mean values:
The new partition, after the first pass through the algorithm, is illustrated in
Figure 12(b). On the second run through the algorithm object B will transfer to
Cluster 3; it is nearer its mean than Cluster 2.
On the second pass, therefore, the cluster populations and their centres are,
Figure 12(c),
On the fourth pass, object H moves from Cluster 3 to Cluster 4, Figure 12(e),
Cluster  Contents (4th change)  Cluster means
                                x1     x2
1        A E F G                2.00   2.00
2        L                      2.00   6.00
3        B C D                  7.00   1.00
4        H I J K                6.50   4.00
The process is repeated once more but this time no movement of any object
between clusters gives a better solution in terms of reducing the value of E. So
Figure 12(e) represents the best result.
Our initial assumption when applying the K-means algorithm was that four
clusters were known to exist. Visual examination of the data suggests that this
assumption is reasonable in this case, but other values could be acceptable
depending on the model investigated. For K = 2 and K = 3, the K-means
algorithm produces the results illustrated in Figure 13. Although statistical tests
have been proposed in order to select the best number of partitions, cluster
analysis is not generally considered a statistical technique, and the choice of
criteria for best results is at the discretion of the user.
Figure 13 The K-means algorithm applied to the test data from Table 7(a) assuming
there are two clusters (a), and three clusters (b)
Fuzzy Clustering
The principal aim of performing a cluster analysis is to permit the identification
of similar samples according to their measured properties. Hierarchical tech-
niques, as we have seen, achieve this by linking objects according to some
formal rule set. The K-means method on the other hand seeks to partition the
pattern space containing the objects into an optimal, predefined number of
discrete clusters, with each object assigned wholly to a single cluster. This
requirement is relaxed when applying fuzzy clustering, and objects are
recognized as belonging, to a lesser or greater degree, to every cluster. The
data used to demonstrate the method are given in Table 9.

Table 9 Bivariate data (x1 and x2) measured on 15 objects, A . . . O
The degree or extent to which an object, i, belongs to a specific cluster, k, is
referred to as that object's membership function, denoted μ_ki. Thus, visual
inspection of Figure 14 would suggest that for two clusters objects E and K
would be close to the cluster centres, i.e. μ_1E ≈ 1 and μ_2K ≈ 1, and that object H
would belong equally to both clusters, i.e. μ_1H = 0.5 and μ_2H = 0.5. This is
precisely the result obtained with fuzzy clustering.
As with K-means clustering, the fuzzy k-means technique is iterative and
seeks to minimize the within-cluster sum of squares. Our data matrix is defined
by the elements x_ij and we seek K clusters, not by hard partitioning of the
variable space, but by fuzzy partitions, each of which has a cluster centre or
prototype value, B_kj (1 ≤ k ≤ K).
The algorithm starts with a pre-selected number of clusters, K. In addition,
an initial fuzzy partition of the objects is supplied such that there are no empty
clusters and the membership functions for an object with respect to each cluster
sum to unity,

Σ_k μ_ki = 1    (32)

New fuzzy partitions are then defined by a new set of membership functions
given by,

μ_ki = (1/d²_ki) / Σ_l (1/d²_li)

i.e. the ratio of the inverse squared distance of object i from the k'th cluster
centre to the sum of the inverse squared distances of object i to all cluster
centres.
From this new partitioning, new cluster centres are calculated by applying
Equation (33), and the process repeats until the total change in values of the
membership functions is less than some preselected value, or a set number of
iterations has been achieved.
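The iteration can be sketched as follows. This is an illustrative fuzzy k-means with squared memberships in the centre calculation of Equation (33); since the Table 9 values are not reproduced in this extract, the Table 7 data stand in for them:

```python
# Illustrative sketch of the fuzzy k-means iteration: centres are
# membership-weighted means, and new memberships are proportional to
# inverse squared distances from the centres.
points = [(2, 1), (6, 1), (7, 1), (8, 1), (1, 2), (3, 2),
          (2, 3), (7, 3), (6, 4), (7, 4), (6, 5), (2, 6)]
K = 2
# Initial fuzzy partition: near-crisp memberships that sum to unity.
mu = [[0.9, 0.1] if p[0] < 5 else [0.1, 0.9] for p in points]

for _ in range(50):
    # Cluster centres as mu^2-weighted means (Equation 33).
    centres = []
    for k in range(K):
        w = [m[k] ** 2 for m in mu]
        centres.append(tuple(sum(wi * p[j] for wi, p in zip(w, points)) / sum(w)
                             for j in range(2)))
    # New memberships from inverse squared distances to the centres.
    for i, p in enumerate(points):
        inv = [1.0 / max(1e-12, (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2)
               for c in centres]
        mu[i] = [v / sum(inv) for v in inv]

print([round(m[0], 2) for m in mu])
```

By construction every row of memberships sums to unity, satisfying Equation (32), and objects near a cluster centre acquire memberships close to 1 for that cluster.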
Application of the algorithm can be demonstrated using the data from Table
9.
With K = 2, our first step is to assign membership functions for each object
and each cluster. This process can be done in a random fashion, bearing in mind
the constraint imposed by Equation (32), or using prior knowledge, e.g. the
output from crisp clustering methods. With the results from the K-means
algorithm, Figure 15(b), the membership functions can be assigned as shown in
Table 10. Objects A . . . H belong predominantly to Cluster 1 and objects
I . . . O to Cluster 2.
Using this initial fuzzy partition, the initial cluster centres can be calculated
according to Equation (33).
And we can proceed to calculate new membership functions for each object
about these centres.
The squared Euclidean distance between object A and the centre of Cluster 1
is, from Equation (1),
Figure 16 Results of applying the fuzzy k-means clustering algorithm to the data from
Table 9. The values in parentheses indicate the membership function for each
object relative to group A
The sum (μ_1A + μ_2A) is unity, which satisfies Equation (32), and the mem-
bership functions for the other objects can be calculated in a similar manner.
The process is repeated and, after five iterations, the total change in the squared
μ_ki values falls below the preselected threshold and the membership functions
are considered stable,
Table 11. This result, Figure 16, accurately reflects the symmetric distribution
of the data.
The same algorithm that provides the membership functions for the test data
can be used to generate values for interpolated and extrapolated data, and a
three-dimensional surface plot produced, Figure 16(b). Going one stage
further, we can combine μ_1i and μ_2i to provide the complete membership
surface, according to the rule
Figure 17 The two cluster surface plot of data from Table 9 using the fuzzy clustering
algorithm
Cluster analysis is justifiably a popular and common technique for explora-
tory data analysis. Most commercial multivariate statistical software packages
offer several algorithms, along with a wide range of graphical display facilities
to aid the user in identifying patterns in data. Having indicated that some
pattern and structure may be present in our data, it is often necessary to
examine the relative importance of the variables and determine how the clusters
may be defined and separated. This is the primary function of supervised
pattern recognition and is examined in Chapter 5.
CHAPTER 5
Pattern Recognition II - Supervised Learning
1 Introduction
Generally, the term pattern recognition tends to refer to the ability to assign an
object to one of several possible categories, according to the values of some
measured parameters. In statistics and chemometrics, however, the term is
often used in two specific areas. In Chapter 4, unsupervised pattern recognition,
or cluster analysis, was introduced as an exploratory method for data analysis.
Given a collection of objects, each of which is described by a set of measures
defining its pattern vector, cluster analysis seeks to provide evidence of natural
groupings or clusters of the objects in order to allow the presence of patterns in
the data to be identified. The number of clusters, their populations, and their
interpretation are somewhat subjectively assigned and are not known before
the analysis is conducted. Supervised pattern recognition, the subject of this
chapter, is very different, and is often referred to in the literature as classifi-
cation or discriminant analysis. With supervised pattern recognition, the
number of parent groups is known in advance and representative samples of
each group are available. With this information, the problem facing the analyst
is to assign an unclassified object to one of the parent groups. A simple example
will serve to make this distinction between unsupervised and supervised pattern
recognition clearer.
Suppose we have determined the elemental composition of a large number of
mineral samples, and wish to know whether these samples can be organized
into groups according to similarity of composition. As demonstrated in
Chapter 4, cluster analysis can be applied and a wide variety of methods are
available to explore possible structures and similarities in the analytical data.
The result of cluster analysis may be that the samples can be clearly distin-
guished, by some combination of analyte concentrations, into two groups, and
we may wish to use this information to identify and categorize future samples as
belonging to one of the two groups. This latter process is classification, and the
means of deriving the classification rules from previously classified samples
is referred to as discrimination. It is a pre-requisite for undertaking this
124 Chapter 5
supervised pattern recognition that a suitable collection of pre-assigned objects,
the training set, is available in order to determine the discriminating rule or
discriminant function.1-3
The precise nature and form of the classifying function used in a pattern
recognition exercise is largely dependent on the analytical data. If the parent
population distribution of each group is known to follow the normal curve,
then parametric methods such as statistical discriminant analysis can be usefully
employed. Discriminant analysis is one of the most powerful and commonly
used pattern recognition techniques and algorithms are generally available with
all commercial statistical software packages. If, on the other hand, the distri-
bution of the data is unknown, or known not to be normal, then non-parametric
methods come to the fore. One of the most widely used non-parametric algo-
rithms is that of K-nearest neighbour~.~ Finally, in recent years, considerable
interest has been shown in the use of artificial neural networks for supervised
pattern recognition and many examples have been reported in the analytical
chemistry literature.' In this chapter each of these techniques is examined along
with its application to analytical data.
2 Discriminant Functions
The most popular and widely used parametric method for pattern recognition is
discriminant analysis. The background to the development and use of this
technique will be illustrated using a simple bivariate example.
In monitoring a chemical process, it was found that the quality of the final
product can be assessed from spectral data using a simple two-wavelength
photometer. Table 1 shows absorbance data recorded at these two wavelengths
(400 and 560 nm) from samples of 'good' and 'bad' products, labelled Group A
and Group B respectively. On the basis of the data presented, we wish to derive
a rule to predict which group future samples can be assigned to, using the two
wavelength measures.
Examining the analytical data, the first step is to determine their descriptive
statistics, i.e. the mean and standard deviation for each variable in Group A
and Group B. It is evident from Table 1 that at both wavelengths Group A
exhibits higher mean absorbance than samples from Group B. In addition, the
standard deviation of data from each variable in both groups is similar. If we
consider just one variable, the absorbance at 400 nm, then a first attempt at
classification would assign the samples to groups according to this absorbance
value. Figure 1 illustrates the predicted effect of such a scheme. The mean
1 B.K. Lavine, in 'Practical Guide to Chemometrics', ed. S.J. Haswell, Marcel Dekker, New York, USA, 1992.
2 M. James, 'Classification Algorithms', Collins, London, UK, 1985.
3 B.F.J. Manly, 'Multivariate Statistical Methods: A Primer', Chapman and Hall, London, UK, 1991.
4 A.A. Afifi and V. Clark, 'Computer-Aided Multivariate Analysis', Lifetime Learning, California, USA, 1984.
5 J. Zupan and J. Gasteiger, 'Neural Networks for Chemists', VCH, Weinheim, Germany, 1993.
Pattern Recognition II - Supervised Learning
Table 1 Absorbance measurements on two classes of material at 400 and 560 nm
Figure 1 Distributions of the absorbance at 400 nm for Group B and Group A,
with the mean discriminant line
values and distribution of the sample absorbances at 400 nm are taken from
Table 1, and it is clear that the use of this single variable alone is insufficient to
separate the two groups. With the single variable, however, a decision or
discriminant function can be proposed.
For equal variances of absorbance data in Groups A and B, the discriminant
rule is given by

assign sample to Group A if |x - x̄_A| < |x - x̄_B|    (1)

i.e. a sample is assigned to the group with the nearest mean value.
Having obtained such a classification rule, it is necessary to test the rule and
indicate how good it is. There are several testing methods in common use.
Procedures include the use of a set of independent samples or objects not
included in the training set, the use of the training set itself, and the leave-one-
out method. The use of a new, independent set of samples not used in deriving
the classification rule may appear the obvious best choice, but it is often not
practical. Given a finite size of a data set, such as in Table 1, it would be
necessary to split the data into two sets, one for training and one for validation.
The problem is deciding which objects should be in which set, and deciding on
the size of the sets. Obviously, the more samples used to train and develop the
classification rule, the more robust and better the rule is likely to be. Similarly,
however, the larger the validation set, the more confidence we can have in the
rule's ability to discriminate objects correctly.
The most common method employed to get around this problem is to use all
the available data for training the classifier and subsequently test each object as
if it were an unknown, unclassified sample. The inherent problem with using the
training set as the validation set is that the total classification error, the error
rate, will be biased low. This is not surprising as the classification rule would
have been developed using this same data. New, independent samples may lie
outside the boundaries defined by the training set and we do not know how the
rule will behave in such cases. This bias decreases as the number of samples
analysed increases. For large data sets, say when the number of objects exceeds
10 times the number of variables, the measured apparent error can be con-
sidered a good approximation of the true error.
If the independent sample set method is considered to be too wasteful of
data, which may be expensive to obtain, and the use of the training set for
validation is considered insufficiently rigorous, then the leave-one-out method
can be employed. By this method all samples but one are used to derive the
classification rule, and the sample left out is used to test the rule. The process is
repeated with each sample in turn being omitted from the training set and used
for validation. The major disadvantage of this method is that there are as many
rules derived as there are samples in the data set and this can be compu-
tationally demanding. In addition, the error rate obtained refers to the average
performance of all the classifiers and not to any particular rule which may
subsequently be applied to new, unknown samples.
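The leave-one-out procedure can be sketched for a simple univariate nearest-mean classifier. The data values below are invented for illustration and are not the Table 1 measurements:

```python
# Illustrative sketch of leave-one-out validation: each sample in turn
# is held out, the rule is derived from the rest, and the held-out
# sample is classified.
def nearest_mean_rule(train):
    """train: list of (x, label); returns a nearest-mean classifier."""
    means = {}
    for lab in set(l for _, l in train):
        vals = [x for x, l in train if l == lab]
        means[lab] = sum(vals) / len(vals)
    return lambda x: min(means, key=lambda lab: abs(x - means[lab]))

def leave_one_out_error(data):
    errors = 0
    for i, (x, lab) in enumerate(data):
        rule = nearest_mean_rule(data[:i] + data[i + 1:])
        if rule(x) != lab:
            errors += 1
    return errors / len(data)

data = [(0.9, 'A'), (1.1, 'A'), (1.0, 'A'),
        (0.2, 'B'), (0.3, 'B'), (0.1, 'B')]
print(leave_one_out_error(data))  # 0.0 for these well-separated groups
```

Note that, as the text observes, n rules are derived for n samples, and the reported error rate is an average over all of them rather than a property of any single rule.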
The results of classification techniques examined in this chapter will be
assessed by their apparent error rates using all available data for both training
and validation, in line with most commercial software.
Table 2 Use of the contingency table, or confusion matrix, of classification
results (a). E_ij is the number of objects from group i classified as j, M_ia is
the number of objects actually in group i, and M_ic is the number classified
in group i. The results using the single absorbance at 400 nm, (b), and at
560 nm, (c)

(a)                Actual membership
                     A       B
Predicted     A    E_AA    E_BA    M_Ac
membership    B    E_AB    E_BB    M_Bc
                   M_Aa    M_Ba

(b)                Actual membership
                     A       B
Predicted     A     10       1     11
membership    B      1       9     10
                    11      10

(c)                Actual membership
                     A       B
Predicted     A      9       1     10
membership    B      2       9     11
                    11      10
The rules expressed by Equations (1) and (2) ensure that the probability of
error in misclassifying samples is equal for both groups. In those cases for
which the absorbance lies on the discriminant line, samples are assigned
randomly to Group A or B. Applying this classification rule to our data results
in a total error rate of 9%; two samples are misclassified. To detail how the
classifier makes errors, the results can be displayed in the form of a contingency
table, referred to as a confusion matrix, of actual group against classified group,
Table 2. A similar result is obtained if the single variable of absorbance at
560 nm is considered alone; three samples are misclassified.
In Figure 2, the distribution of each variable for each group is plotted along
with a bivariate scatter plot of the data and it is clear that the two groups form
distinct clusters. However, it is equally evident that it is necessary for both
variables to be considered in order to achieve a clear separation. The problem
facing us is to determine the best line between the data clusters, the discriminant
function, and this can be achieved by consideration of probability and Bayes'
theorem.
Bayes' Theorem
The Bayes' rule simply states that 'a sample or object should be assigned to that
group having the highest conditional probability' and application of this rule to
parametric classification schemes provides optimum discriminating capability.
An explanation of the term 'conditional probability' is perhaps in order here,
Figure 2 The data from Table 1 as a scatter plot and, along each axis, the univariate
distributions. Two distinct groups are evident from the data
with reference to a simple example. In spinning a fair coin, the chance of tossing
the coin and getting heads is 50%, i.e.

P(heads) = 0.5
P(G(A)) and P(G(B)) are the a priori probabilities, i.e. the probabilities of a
sample belonging to A or B in the absence of any analytical data.
P(x|G(A)) is a conditional probability expressing the chance of a pattern vector
x arising from a member of Group A, and this can be estimated by sampling the
population of Group A. A similar equation can be arranged for P(G(B)|x), and
substitution of Equation (8) into Equation (7) gives

assign the sample to Group A if P(x|G(A))·P(G(A)) > P(x|G(B))·P(G(B))    (9)

The denominator term of Equation (8) is common to P(G(A)|x) and P(G(B)|x)
and hence cancels from each side of the inequality.
Although P(x|G(A)) can be estimated by analysing large numbers of samples,
similarly for P(x|G(B)), the procedure is time consuming and requires large
numbers of analyses. Fortunately, if the variables contributing to the pattern
vector are assumed to possess a multivariate normal distribution, then these
conditional probability values can be calculated from

P(x|G(A)) = (2π)^-1 |COV_A|^-1/2 exp[-(1/2)(x - x̄_A)' COV_A^-1 (x - x̄_A)]    (10)

which describes the multidimensional normal distribution for two variables (see
Chapter 1). P(x|G(A)) can, therefore, be estimated from the vector of Group A
mean values, x̄_A, and the group covariance matrix, COV_A.
Substituting Equation (10), and the equivalent for P(x|G(B)), in Equation (9),
taking logarithms and rearranging leads to the rule

assign the sample to Group A if d_A(x) < d_B(x)    (11)

Calculation of the left-hand side of this equation results in a value for each
object which is a function of x, the pattern vector, and which is referred to as
the discriminant score.
The discriminant function, d_A(x), is defined by

d_A(x) = ln|COV_A| + (x - x̄_A)' COV_A^-1 (x - x̄_A)    (12)
The second term in the right-hand side of Equation (12) defining the discrimi-
nant function is the quadratic form of a matrix expansion. Its relevance to our
discussions here can be seen with reference to Figure 3 which illustrates the
division of the sample space for two groups using a simple quadratic function.
This Bayes' classifier is able to separate groups with very differently shaped
distributions, i.e. with differing covariance matrices, and it is commonly refer-
red to as the quadratic discriminant function.
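The quadratic discriminant calculation for two variables can be sketched as follows. The group statistics below are invented for illustration (the Table 1 values are not reproduced in this extract), and the score is taken to combine ln|COV| with the quadratic form, as described in the text:

```python
# Illustrative sketch of a quadratic discriminant score for bivariate
# data: d(x) = ln|COV| + (x - mean)' COV^-1 (x - mean), with the sample
# assigned to the group giving the smaller score.
import math

def quad_score(x, mean, cov):
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    inv = [[cov[1][1] / det, -cov[0][1] / det],
           [-cov[1][0] / det, cov[0][0] / det]]
    dx = [x[0] - mean[0], x[1] - mean[1]]
    quad = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
            + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return math.log(det) + quad

# Invented group statistics, with oppositely signed covariances as in
# the text's description of the two groups.
mean_A, cov_A = (0.7, 0.8), [[0.01, 0.005], [0.005, 0.01]]
mean_B, cov_B = (0.4, 0.5), [[0.01, -0.008], [-0.008, 0.01]]

x = (0.68, 0.79)
group = 'A' if quad_score(x, mean_A, cov_A) < quad_score(x, mean_B, cov_B) else 'B'
print(group)  # 'A'
```

Because each group keeps its own covariance matrix, the resulting decision boundary is a quadratic curve, matching the division of the sample space shown in Figure 3.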
The use of Equation (15) can be illustrated by application to the data from
Table 1.
Figure 3 Contour plots of two groups of bivariate normal data and the quadratic division
of the sample space
From Table 1, the vector of variable means for each group is,
The discriminant functions, dA(x) and dB(x), for each sample in the training
set of Table 1 can now be calculated.
Thus for the first sample,
The calculated value for d_A(x) is less than that of d_B(x), so this object is
assigned to Group A. The calculation can be repeated for each sample in the
training set of Table 1, and the results are provided in Table 3. All 21 samples
have been classified correctly as to their parent group. The quadratic dis-
criminating function between the two groups can be derived from Equation (14)
by solving the quadratic equations for x. The result is illustrated in Figure 4 and
the success of this line in classifying the training set is apparent.
Table 3 Confusion matrix for the quadratic discriminant function

                   Actual group
                     A       B
Predicted     A     11       0     11
membership    B      0      10     10
                    11      10
Figure 4 Scatter plot of the data from Table I and the calculated quadratic discriminant
function
Figure 5 Contour plots of two groups of bivariate data with each group having identical
variance-covariance matrices. Such groups are linearly separable
With the assumption of equal covariance matrices, the rule defined by
Equation (11) becomes
and
Equations (25) are linear with respect to x and this classification technique is
referred to as linear discriminant analysis, with the discriminant function
obtained by least squares analysis, analogous to multiple regression analysis.
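The equal-covariance (linear) case can be sketched similarly. The statistics are again invented, and the score form f(x) = x̄'C⁻¹x − ½x̄'C⁻¹x̄ is the standard linear discriminant score under a pooled covariance matrix, assigning the sample to the group with the larger score as in the text:

```python
# Illustrative sketch of a linear discriminant score using a single
# pooled covariance matrix shared by both groups.
def linear_score(x, mean, cov):
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    inv = [[cov[1][1] / det, -cov[0][1] / det],
           [-cov[1][0] / det, cov[0][0] / det]]
    cm = [inv[0][0] * mean[0] + inv[0][1] * mean[1],
          inv[1][0] * mean[0] + inv[1][1] * mean[1]]  # C^-1 * mean
    return (cm[0] * x[0] + cm[1] * x[1]
            - 0.5 * (cm[0] * mean[0] + cm[1] * mean[1]))

pooled = [[0.01, 0.0], [0.0, 0.01]]            # invented pooled covariance
mean_A, mean_B = (0.7, 0.8), (0.4, 0.5)        # invented group means

x = (0.68, 0.79)
group = 'A' if linear_score(x, mean_A, pooled) > linear_score(x, mean_B, pooled) else 'B'
print(group)  # 'A'
```

Since both scores share the same quadratic term in x, their difference is linear in x, which is why the decision boundary found by setting f_A(x) = f_B(x) is a straight line.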
Turning to our spectroscopic data of Table 1, we can evaluate the perform-
ance of this linear discriminant analyser.
For the whole set of data and combining all samples from both groups,
Since the value for fA(x) exceeds that for fB(x), from Equation (26) the first
sample is assigned to Group A. The remaining samples can be analysed in a
similar manner and the results are shown in Table 4. One sample, from Group
A, is misclassified. The decision line can be found by solving for x when
f_A(x) = f_B(x). This line is shown in Figure 6 and the misclassified sample can be
clearly identified.
Figure 6 Scatter plot of the data from Table I and the calculated linear discriminant
function
Table 4 Discriminant scores using the linear discriminant function as classifier
(a), and the resulting confusion matrix (b)

                   Actual membership
                     A       B
Predicted     A     10       0     10
membership    B      1      10     11
                    11      10
As this linear classifier has performed less well than the quadratic classifier, it
is worth examining further the underlying assumptions that are made in
applying the linear model. The major assumption made is that the two groups
of data arise from normal parent populations having similar covariance
matrices. Visual examination of Figure 2 indicates that this assumption may
not be valid for these absorbance data. The data from samples forming Group
A display an apparent positive correlation (r = 0.54) between x1 and x2,
whereas there is negative correlation (r = -0.85) between the absorbance
values at the two wavelengths for those samples in Group B. For a more
quantitative measure and assessment of the similarity of the two variance-
covariance matrices we require some multivariate version of the simple F-test.
Such a test may be derived as follows.6
For k groups of data characterized by j = 1 . . . m variables, we may compute
k variance-covariance matrices, and for two groups A and B we wish to test the
hypothesis

H0: COV_A = COV_B

If the data arise from a single parent population, then a pooled variance-
covariance matrix may be calculated from
6 J.C. Davis, 'Statistics and Data Analysis in Geology', J. Wiley & Sons, New York, USA, 1973.
which expresses the difference between the logarithm of the determinant of the
pooled variance-covariance matrix and the average of the logarithms of the
determinants of the group variance-covariance matrices. The more similar
the group matrices, the smaller the value of M.
Finally, a test statistic based on the χ² distribution is generated from the
product of M and a correction factor, close to unity, that depends on the
numbers of groups, variables, and samples in each group. Thus,

χ² = 0.885 × 42.1 = 37.3
3 Nearest Neighbours
The discriminant analysis techniques discussed above rely for their effective use
on a priori knowledge of the underlying parent distribution function of the
variates. In analytical chemistry, the assumption of multivariate normal distri-
bution may not be valid. A wide variety of techniques for pattern recognition
not requiring any assumption regarding the distribution of the data have been
proposed and employed in analytical spectroscopy. These methods are referred
to as non-parametric methods. Most of these schemes are based on attempts to
estimate P(x|G_i), and include histogram techniques, kernel estimates and expansion
methods. One of the most common techniques is that of K-nearest neighbours.
The basic idea underlying nearest-neighbour methods is conceptually very
simple, and in practice it is mathematically simple to implement. The general
method is based on applying the so-called K-nearest neighbour classification
rule, usually referred to as K-NN. The distance between the pattern vector of
Figure 7 Radius, r, of the circle about an unclassified object containing three nearest
neighbours, two from Group A and one from Group B. The unknown sample is
assigned to Group A
the unclassified sample and every classified sample from the training set is
calculated, and the majority of smallest distances, i.e. the nearest neighbours,
determines to which group the unknown is to be assigned. The most common
distance metric used is the Euclidean distance between two pattern vectors.
For objects 1 and 2 characterized by multivariate pattern vectors x_1 and x_2,
the Euclidean distance is defined by

d_12 = [Σ_j (x_1j - x_2j)²]^1/2

The conditional probability P(x|G_i) can be estimated locally from

P(x|G_i) = k_i / (n_i V_K,x)    (40)

where k_i is the number of nearest neighbours in group i and V_K,x is the volume
of space which contains the K nearest neighbours.
Using Equation (40) in the Bayes' rule gives
assign to group i if
ki 1 kj 1
P(~3 - -> P(G,)- - for all j # i
ni V K , ~ nj VK,.
Since the volume term is constant to both sides of the equation, the rule
simplifies to,
assign to group i if
The results for the training set can be summarized as,

Assigned group: A B A A A A A A A A A B B B B B B B B B B

                        Actual membership
                         A      B
Predicted      A        10      0      10
membership     B         1     10      11
                        11     10
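The K-NN rule described above can be sketched in a few lines. The two-wavelength pattern vectors below are hypothetical; with equal group sizes and priors, the rule reduces to a simple majority vote among the K nearest neighbours:

```python
import math

def knn_classify(unknown, training, k=3):
    """Assign `unknown` to the group holding the majority among its
    K nearest neighbours, using the Euclidean distance metric."""
    # distance from the unknown to every classified training object
    dists = []
    for x, group in training:
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, unknown)))
        dists.append((d, group))
    dists.sort(key=lambda t: t[0])          # smallest distances first
    nearest = [g for _, g in dists[:k]]     # groups of the K neighbours
    return max(set(nearest), key=nearest.count)

# Hypothetical two-wavelength pattern vectors for groups A and B
training = [((0.2, 0.9), 'A'), ((0.3, 0.8), 'A'), ((0.25, 0.85), 'A'),
            ((0.8, 0.2), 'B'), ((0.7, 0.3), 'B'), ((0.75, 0.25), 'B')]
print(knn_classify((0.3, 0.7), training, k=3))  # -> A
```

Because no parent distribution is assumed, only the distances and the vote enter the decision.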
When, say, infrared or mass spectra can be reduced to binary strings indicat-
ing the presence or absence of peaks or other features, the Hamming distance
metric is simple to implement. In such cases it gives a count of the bits that
differ between the two binary patterns, which is equivalent to performing the exclusive-OR function
between the vectors. The Hamming distance is a popular choice in spectral
Chapter 5
Figure 8 Binary representation of spectral data (1 = peak, 0 = no peak). The sample has the smallest number of XOR bits set with reference spectrum R4 and this, therefore, is the best match
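The XOR matching of Figure 8 can be sketched as follows; the sample pattern and reference R1 are taken from the figure, while the remaining reference spectra are hypothetical:

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits: the population count of the XOR
    of two binary spectral patterns."""
    return bin(a ^ b).count('1')

sample = 0b10101011
# R1 is from Figure 8; R2-R4 are hypothetical reference spectra
refs = {'R1': 0b11110000, 'R2': 0b00001111,
        'R3': 0b10100000, 'R4': 0b10101010}

# best match = reference with the fewest XOR bits set
best = min(refs, key=lambda r: hamming(sample, refs[r]))
print(best, hamming(sample, refs[best]))  # -> R4 1
```

Since the Hamming distance is an integer comparison on bit strings, matching against a large spectral library is very fast.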
4 The Perceptron
As an approximation to the Bayes' rule, the linear discriminant function
provides the basis for the most common of the statistical classification schemes,
but there has been much work devoted to the development of simpler linear
classification rules. One such method which has featured extensively in spectro-
scopic pattern recognition studies is the perceptron algorithm.
The perceptron is a simple linear classifier that requires no assumptions to be
made regarding the parent distribution of the analytical data. For pattern
vectors that are linearly separable, a perceptron will find a hyperplane (in two
dimensions this is a line) that completely separates the groups. The algorithm is
iterative and starts by placing a line at random in the sample space and
examining which side of the line each object in the training set falls. If an object
is on the wrong side of the line then the position of the line is changed to
attempt to correct the mistake. The next object is examined and the process
repeats until a line position is found that correctly partitions the sample space
for all objects. The method makes no claims regarding its ability to classify
objects not included in the training set, and if the groups in the training set are
not linearly separable then the algorithm may not settle to a final stable result.
The perceptron is a learning algorithm and can be considered as a simple
model of a biological neuron. It is worth examining here not only as a classifier
in its own right, but also as providing the basic features of modern artificial
neural networks.
The operation of a perceptron unit is illustrated schematically in Figure 9.
The function of the unit is to modify its input signals and produce a binary
output, 1 or 0, dependent on the sum of these inputs. Mathematically, the
perceptron performs a weighted sum of its inputs, compares this with some
threshold value and the output is turned on (output = 1) if this value is
exceeded, else it remains off (output = 0).
For m inputs,

total input, I = Σ(i=1 to m) wi·xi = w·x          (45)

xT = (x1 . . . xm) represents an object's pattern vector, and wT = (w1 . . . wm) is
the vector of weights which serve to modify the relative importance of each

Figure 9 The simple perceptron unit. Inputs are weighted and summed and the output is '1' or '0' depending on whether or not it exceeds a defined threshold value
element of x. These weights are varied as the model learns to distinguish
between the groups assigned in the training set.
The sum of the inputs, I, is compared with a threshold value, θ, and if I > θ a
value of 1 is output, otherwise 0 is output, Figure 9. This comparison can be
achieved by subtracting θ from I and comparing the result with zero, i.e. by
adding −θ as an offset to I. The summation and comparison operations can,
therefore, be combined by modifying Equation (45),

total input, I = Σ(i=1 to m+1) wi·xi = w·x

where now w = (w1 . . . wm+1), with wm+1 being referred to as the unit's bias, and
x = (x1 . . . xm+1) with xm+1 = 1.
The resulting output, y, is given by
The training of the perceptron as a linear classifier then involves the following
steps,
(a) randomly assign the initial elements of the weight vector, w,
(b) present an input pattern vector from the training set,
(c) calculate the output value according to Equation (47),
(d) alter the weight vector to discourage incorrect decisions and reduce the
classification error,
(e) present the next object's pattern vector and repeat from step (c).
This process is repeated until all objects are correctly classified.
Figure 10(a) illustrates a bivariate data set comprising two groups, each of
two objects. These four objects are defined by their pattern vectors, including
xm+1, as

Figure 10 A simple two-group, bivariate data set (a), and iterative discriminant analysis using the simple perceptron (b)
For our first object, A1, the product of x and w is positive and the output is 1,
which is a correct result.
For sample A2, the output is also positive and no change in the weight vector
is required. For sample B1, however, an output of 1 is incorrect; B1 is not in the
same group as A1 and A2, and we need to modify the weight vector. The
following weight vector adapting rule is simple to implement,'
R. Beale and T. Jackson, 'Neural Computing: An Introduction', Adam Hilger, Bristol, UK,
1991.
(a) if the result is correct, then w(new) = w(old),
(b) if y = 0 but should be y = 1, then w(new) = w(old) + x,
(c) if y = 1 but should be y = 0, then w(new) = w(old) − x. (53)
Our perceptron has failed on sample B1: the output is 1 but should be 0.
Therefore, from Equation (53c),
This is illustrated in Figure 10(b) and serves to provide the correct classi-
fication of the four objects.
The calculations involved in implementing this perceptron algorithm are
simple but tedious to perform manually. Using a simple computer program to
analyse the two-wavelength spectral data from Table 1, a satisfactory partition
line is eventually obtained; the result is illustrated in Figure 11. The
perceptron has achieved a separation of the two groups and every sample has
been assigned to its correct parent group.
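The training steps (a) to (e), together with the adapting rule of Equation (53), can be sketched as follows; the two-wavelength data below are hypothetical but linearly separable:

```python
import random

def train_perceptron(data, max_epochs=100):
    """Perceptron learning rule of Equation (53): add x when the output
    should be 1, subtract x when it should be 0. The weight vector
    includes the bias, with a fixed extra input x_{m+1} = 1."""
    random.seed(1)
    n = len(data[0][0])
    w = [random.uniform(-1, 1) for _ in range(n + 1)]   # (a) random start
    for _ in range(max_epochs):
        errors = 0
        for x, target in data:
            xa = list(x) + [1.0]                        # append bias input
            y = 1 if sum(wi * xi for wi, xi in zip(w, xa)) > 0 else 0
            if y != target:                             # (d) adapt the weights
                sign = 1 if target == 1 else -1
                w = [wi + sign * xi for wi, xi in zip(w, xa)]
                errors += 1
        if errors == 0:                                 # all objects correct
            break
    return w

# Hypothetical linearly separable data: group A -> 1, group B -> 0
data = [((0.2, 0.9), 1), ((0.3, 0.8), 1), ((0.8, 0.2), 0), ((0.7, 0.3), 0)]
w = train_perceptron(data)
outputs = [1 if sum(wi * xi for wi, xi in zip(w, list(x) + [1.0])) > 0 else 0
           for x, _ in data]
print(outputs)   # matches the target labels 1, 1, 0, 0
```

For linearly separable groups the loop is guaranteed to terminate; for inseparable groups it simply runs out of epochs, echoing the limitation noted above.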
Several variations of this simple perceptron algorithm can be found in the
literature, with most differences relating to the rules used for adapting the
weight vector. A detailed account can be found in Beale and Jackson, as well as
a proof of the perceptron's ability to produce a satisfactory solution, if such a
solution is possible.'
The major limitation of the simple perceptron model is that it fails drastically
on linearly inseparable pattern recognition problems. For a solution to these
cases we must investigate the properties and abilities of multilayer perceptrons
and artificial neural networks.
Figure 11 Partition of the data from Table 1 by a linear function derived from a simple perceptron unit
Figure 12 A simple two-group, bivariate data set that is not linearly separable by a single function. The lines shown are the linear classifiers from the two units in the first layer of the multilayer system shown in Figure 13
which serve to define the lines shown in Figure 12. The weight vector associated
with the third, output, perceptron is designed to provide the final classification
from the output values of perceptrons 1 and 2,
We can calculate the output from each perceptron for each sample presented
to the input of the system, beginning with object A1.

Figure 14 Some commonly used threshold functions for neural networks: the Heaviside function (a), the linear function (b), and the sigmoidal function (c)
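The three threshold functions of Figure 14 can be sketched directly; the argument scales used below are illustrative:

```python
import math

def heaviside(x, theta=0.0):
    """Hard-limiting step: output 1 if the input exceeds the threshold."""
    return 1.0 if x > theta else 0.0

def linear(x, lo=-1.0, hi=1.0):
    """Linear between the limits, clipped to -1 and +1 outside them."""
    return max(lo, min(hi, x))

def sigmoid(x):
    """Smooth, differentiable squashing function, the usual choice
    when errors are back-propagated through the network."""
    return 1.0 / (1.0 + math.exp(-x))

print(heaviside(0.5), linear(0.5), round(sigmoid(0.5), 3))  # -> 1.0 0.5 0.622
```

The sigmoid is preferred for training by gradient methods because, unlike the Heaviside step, it has a well-defined derivative everywhere.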
Figure 15 The general scheme for a fully connected two-layer neural network with four
inputs
Ij = Σi wij·Oi,   Oj = f(Ij)

where Oj is the output from neuron j and Ij is the summed input to neuron j from
other neurons, Oi, modified according to the weight of the connection, wij,
between the i-th and j-th neurons, Figure 16.
The final output from the network for our input pattern is compared with the
known, correct result and a measure of the error is computed. In order to
reduce this error, the weight vectors between neurons are adjusted by using the
generalized delta rule and back-propagating the error from one layer to the
previous layer.
The total error, E, is given by the difference between the correct or target
output, t, and the actual measured output, o, i.e.

E = ½ Σj (tj − oj)²

and the critical parameter that is passed back through the layers of the network
is the error term, δj, associated with each neuron.
For output units the observed results can be compared directly with the target
result, and, for sigmoidal units,

δj = (tj − oj)·oj·(1 − oj)
Figure 18 A neural network, comprising an input layer (I), a hidden layer (H), and an output layer (O), capable of correctly classifying the analytical data from Table 1. The required weighting coefficients are shown on each connection and the bias values for a sigmoidal threshold function are shown above each neuron
or thousands of neurons will be difficult, if not impossible, to analyse in terms
of its internal behaviour. The performance of a neural network is usually
judged by results, often with little attention paid to statistical tests or the
stability of the system.
As demonstrated previously, a single-layer perceptron can serve as a linear
classifier by fitting a line or plane between the classes of objects, but it fails with
non-linear problems. The two-layer device, however, is capable of combining
the linear decision planes to solve such problems as that illustrated in Figure 12.
Increasing the number of perceptrons or neuron units in the hidden layer
increases proportionally the number of linear edges to the pattern shape
capable of being classified. If a third layer of neurons is added then even more
complex shapes may be identified. Arbitrarily complex shapes can be defined by
a three-layer network and such a system is capable of separating any class of
patterns. This general principle is illustrated in Figure 17.6
For our two-wavelength spectral data, a two-layer network is adequate to
achieve the desired separation. A suitable neural network, with the weight
vectors, is illustrated in Figure 18.
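The way a second layer combines linear decision boundaries to solve a linearly inseparable problem, as in Figures 12 and 13, can be illustrated with hand-chosen weights for the classic exclusive-OR problem; the weights below are illustrative and are not those of Figure 18:

```python
def step(x, theta):
    """Heaviside threshold unit: 1 if the weighted sum exceeds theta."""
    return 1 if x > theta else 0

def two_layer(x1, x2):
    # First layer: two perceptron units, each a linear classifier
    h1 = step(x1 + x2, 0.5)        # fires when either input is on
    h2 = step(x1 + x2, 1.5)        # fires only when both inputs are on
    # Output unit combines the two linear boundaries into one region
    return step(h1 - 2 * h2, 0.5)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), two_layer(x1, x2))  # XOR: 0, 1, 1, 0
```

No single line can separate the XOR groups, but the two first-layer units each cut the plane once and the output unit intersects the resulting half-planes, exactly the mechanism described above.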
CHAPTER 6
Calibration and Regression Analysis
1 Introduction
Calibration is one of the most important tasks in quantitative spectrochemical
analysis. The subject continues to be extensively examined and discussed in the
chemometrics literature as ever more complex chemical systems are studied.
The computational procedures discussed in this chapter are concerned with
describing quantitative relationships between two or more variables. In par-
ticular we are interested in studying how a measured dependent, or response,
variable varies as a function of a single so-called independent variable. The
class of techniques studied is referred to as regression analysis.
The principal aim in undertaking regression analysis is to develop a suitable
mathematical model for descriptive or predictive purposes. The model can be
used to confirm some idea or theory regarding the relationship between vari-
ables or it can be used to predict some general, continuous response function
from discrete and possibly relatively few measurements.
The single most common application of regression analysis in analytical
laboratories is undoubtedly curve-fitting and the construction of calibration
lines from data obtained from instrumental methods of analysis. Such graphs,
for example absorbance or emission intensity as a function of sample con-
centration, are commonly assumed to be linear, although non-linear functions
can also be used. The fitting of some 'best' straight line to analytical data
provides us with the opportunity to examine the fundamental principles of
regression analysis and the criteria for measuring 'goodness of fit'.
Not all relationships can be adequately described using the simple linear
model, however, and more complex functions, such as quadratic and higher-
order polynomial equations, may be required to fit the experimental data.
Finally, more than one variable may be measured. For example, multiwave-
length calibration procedures are finding increasing applications in analytical
spectrometry and multivariate regression analysis forms the basis for many
chemometric methods reported in the literature.
2 Linear Regression
It frequently occurs in analytical spectrometry that some characteristic, y, of a
sample is to be determined as a function of some other quantity, x, and it is
necessary to determine the relationship or function between x and y, which may
be expressed as y =f(x). An example would be the calibration of an atomic
absorption spectrometer for a specific element prior to the determination of the
concentration of that element in a series of samples.
A series of n absorbance measurements is made, yi, one for each of a suitable
range of known concentrations, xi. The n pairs of measurements (xi, yi) can be
plotted as a scatter diagram to provide a visual representation of the relation-
ship between x and y.
In the determination of chromium and nickel in machine oil by atomic
absorption spectrometry the calibration data presented in Table 1 were
obtained. These experimental data are shown graphically in Figure 1.
At low concentrations of analyte and working at low absorbance values, a
linear relationship is to be expected between absorbance and concentration, as
predicted by Beer's Law. Visual inspection of Figure l(a) for the chromium
data confirms the correctness of this linear function and, in this case, it is a
simple matter to draw by hand a satisfactory straight line through the data and
use the plot for subsequent analyses. The equation of the line can be estimated
directly from this plot. In this case there is little apparent experimental
uncertainty. In many cases, however, the situation is not so clear-cut. Figure
l(b) illustrates the scatter plot of the nickel data. It is not possible here to draw
a straight line passing through all points even though a linear relationship
(a)
Chromium concn. (mg kg-1): 0 1 2 3 4 5 (x)
Absorbance: 0.01 0.11 0.21 0.29 0.38 0.52 (y)

(b)
Nickel concn. (mg kg-1): 0 1 2 3 4 5 (x)
Absorbance: 0.02 0.12 0.14 0.32 0.38 0.49 (y)

For nickel: x̄ = 2.50 and ȳ = 0.245

                                                                 sum
(xi − x̄):          −2.50  −1.50  −0.50   0.50   1.50   2.50
(yi − ȳ):          −0.225 −0.125 −0.105  0.075  0.135  0.245
(xi − x̄)(yi − ȳ):  0.562  0.187  0.052  0.037  0.202  0.612    1.655
(xi − x̄)²:          6.25   2.25   0.25   0.25   2.25   6.25    17.50
Calibration and Regression Analysis

Figure 1 Absorbance as a function of concentration (mg kg-1) for (a) Cr and (b) Ni
The total error is the sum of the squared deviations. For some model defined
by coefficients a and b, this error will be a minimum and this minimum point
can be determined using partial differential calculus.
From Equations (1) and (2) we can substitute our model equation into the
definition of error,
where â and b̂ are the least squares estimates of the intercept, a, and slope, b.
Expanding and rearranging Equations (4) provides the two simultaneous
equations,

n·â + b̂·Σxi = Σyi
â·Σxi + b̂·Σxi² = Σ(xiyi)

and

b̂ = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²,   â = ȳ − b̂·x̄

where x̄ and ȳ represent the mean values of x and y.¹
For the experimental data for Ni, calculation of â and b̂ is trivial (â = 0.0075
and b̂ = 0.095) and the fitted line passes through the centroid given by x̄, ȳ,
Figure 1(b).
Once values for â and b̂ are derived, it is possible to deduce the concentration
of subsequently analysed samples by recording their absorbances and substitut-
ing the values in Equation (1). It should be noted, however, that because the
model is derived for concentration data in the range defined by xi it is important
that subsequent predictions are also based on measurements in this range. The
model should be used for interpolation only and not extrapolation.
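The least squares estimates for the nickel calibration data can be reproduced in a few lines, rounding the slope to three decimal places as in the text:

```python
x = [0, 1, 2, 3, 4, 5]                     # Ni concn., mg kg-1 (Table 1)
y = [0.02, 0.12, 0.14, 0.32, 0.38, 0.49]   # absorbance

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
# slope from the sums of squares and cross-products
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
    / sum((xi - xbar) ** 2 for xi in x)
b = round(b, 3)            # b-hat = 0.095
a = ybar - b * xbar        # a-hat = y-bar - b-hat * x-bar
print(round(a, 4), b)      # -> 0.0075 0.095

# interpolation: concentration of a sample with measured absorbance 0.25
print(round((0.25 - a) / b, 2))   # -> 2.55
```

The fitted line passes through the centroid (x̄, ȳ) by construction, and the prediction stays inside the calibrated concentration range, i.e. interpolation only.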
The variance associated with these deviations will be given by this sum of
squares divided by the number of degrees of freedom,
1 C. Chatfield, 'Statistics for Technology', Chapman and Hall, London, UK, 1975.
and
How well the estimated straight line fits the experimental data can be assessed
by determining the coefficient of determination and the correlation coefficient.
The total variation associated with the y values, SST, is given by the sum of
the squared deviations of the observed y values from the mean y value,
This total variation comprises two components, that due to the residual or
deviation sum of squares, SSD, and that from the sum of squares due to
regression, SSR:
SSD is a measure of the failure of the regressed line to fit the data points, and
SSR provides a measure of the variation of the regression line about the mean
values.
The ratio of SSR to SST indicates how well the model straight line fits the
experimental data. It is referred to as the coefficient of determination and its
value varies between zero and one. From Equation (13), if SSD = 0 (the fitted
line passes through each datum point) the total variation in y is explained by the
regression line, SST = SSR, and the ratio is one. On the other hand, if the
regressed line fails completely to fit the data, SSR is zero and the total error is
dominated by the residuals, i.e. SST = SSD; then the ratio is zero and no linear
relationship is present in the data.
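The decomposition of SST into SSR and SSD, and the resulting coefficient of determination for the nickel calibration, can be sketched as:

```python
x = [0, 1, 2, 3, 4, 5]
y = [0.02, 0.12, 0.14, 0.32, 0.38, 0.49]   # Ni calibration (Table 1)

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
    / sum((xi - xbar) ** 2 for xi in x)
a = ybar - b * xbar
yhat = [a + b * xi for xi in x]                         # fitted values

SST = sum((yi - ybar) ** 2 for yi in y)                 # total variation
SSD = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))    # residual (deviation)
SSR = sum((yh - ybar) ** 2 for yh in yhat)              # due to regression
print(round(SSR / SST, 3))     # coefficient of determination -> 0.971
```

With the exact least squares estimates the identity SST = SSR + SSD holds to machine precision, so the ratio can equivalently be computed as 1 − SSD/SST.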
The coefficient of determination is denoted by r²,
where ȳu is the mean absorbance of the unknown sample from m measurements.
Thus, from a sample having a mean measured absorbance of 0.25 (from five
observations),
and
3 Polynomial Regression
Although the linear model is the model most commonly encountered in analy-
tical science, not all relationships between a pair of variables can be adequately
described by linear regression. A calibration curve does not have to approxi-
mate a straight line to be of practical value. The use of higher-order equations
to model the association between dependent and independent variables may be
more appropriate. The most popular approach for modelling non-linear data,
and including curvature in the graph, is to fit a power-series polynomial of the form

y = a + bx + cx² + dx³ + . . .
Figure 3 The linear (a), quadratic (b), and cubic (c) regression lines for the fluorescence data from Table 3
Figure 4 Residuals (yi − ŷi) as a function of concentration (x) for the best fit linear and quadratic models
where ε is a random error, assumed to be normally distributed, with a variance,
σ², independent of the value of x. If these assumptions are valid and Equation
(25) is a true model of the experimental data then the variance of E will be equal
to the variance about the regression line. If the model is incorrect, then the
variance around the regression will exceed the variance of E. These variances
can be estimated using ANOVA and the F-ratio calculated to compare the
variances and test the significance of the model.
The form of the ANOVA table for multiple regression is shown in Table 4.
The completed table for the linear model fitted to the fluorescence data is given
in Table 5. This analysis of variance serves to test whether a regression line is
helpful in predicting the values of intensity from concentration data. For the
linear model we wish to test whether the line of slope b adds a significant
contribution to the zero-order model. The null hypothesis being tested is,
H0: b = 0 (26)
residuals from the line, Σ(Ii − Îi)². As expected, this in fact is the case for the
linear model, F1,4 = 74.8, compared with F1,4 = 7.71 from tables for a 5% level
of significance. So the null hypothesis is rejected, the linear regression model is
significant, and the degree to which the regression equation fits the data can be
evaluated from the coefficient of determination, r², given by Equation (14).
A similar ANOVA table can be completed for the quadratic model, Table 6.
Does the addition of a quadratic term contribute significantly to the first-order,
linear model? The equation tested is now
Once again the high value of the F-ratio indicates the model is significant as a
predictor. This analysis can now be taken a step further since the sum of the
squares associated with the regression line can be attributed to two com-
ponents, the linear function and the quadratic function. This analysis is accom-
plished by the decomposition of the sum of squares, Table 7. The total sum of
squares values for the regression can be obtained from Table 6 and that due to
Orthogonal Polynomials
In the previous section, the fluorescence emission data were modelled using
linear, quadratic, and cubic equations and the quadratic form was determined
as providing the most appropriate model. Despite this, on moving to the higher,
cubic, polynomial the coefficient of the cubic term is not zero and the values for
the regression coefficients are considerably different from those obtained for
the quadratic equation. In general, the least squares polynomial fitting pro-
cedure will yield values for the coefficientswhich are dependent on the degree of
the polynomial model. This is one of the reasons why the use of polynomial
curve fitting often contributes little to understanding the causal relationship
between independent and dependent variables, despite the technique providing
a useful curve fitting procedure.
With the general polynomial equation discussed above, the value of the first
coefficient, a, represents the intercept of the line with the y-axis. The b co-
efficient is the slope of the line at this point, and subsequent coefficients are the
values of higher orders of curvature. A more physically significant model might
be achieved by modelling the experimental data with a special polynomial
equation; a model in which the coefficients are not dependent on the specific
order of equation used. One such series of equations having this property of
independence of coefficients is that referred to as orthogonal polynomials.
Bevington⁴ presents the general orthogonal polynomial between variables y
and x in the form
⁴ P. R. Bevington, 'Data Reduction and Error Analysis in the Physical Sciences', McGraw-Hill,
New York, USA, 1969.
addition of higher-order terms to the polynomial will not change the value of
the coefficients of lower-order terms. This extra constraint is used to evaluate
the parameters β, γ1, γ2, δ1, etc. The coefficient a represents the average y value,
b the average slope, c the average curvature, etc.
In general, the computation of orthogonal polynomials is laborious but
the arithmetic can be greatly simplified if the values of the independent variable,
x, are equally spaced and the dependent variable is homoscedastic.⁵ In this
case,
where
and
Orthogonal polynomials are particularly useful when the order of the equa-
tion is not known beforehand. The problem of finding the lowest-order poly-
nomial to represent the data adequately can be achieved by first fitting a
straight line, then a quadratic curve, then a cubic, and so on. At each stage it is
only necessary to determine one additional parameter and apply the F-test to
estimate the significance of each additional term.
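This fit-and-test procedure can be sketched with ordinary polynomial fits and an extra sum of squares F-test; since Table 3 is not reproduced here, the curved calibration data below are illustrative:

```python
import numpy as np

def extra_term_F(x, y, order):
    """F-to-enter for the term of degree `order`, tested against the
    polynomial of degree `order - 1` (extra sum of squares F-test)."""
    def ssd(deg):
        c = np.polyfit(x, y, deg)
        return float(np.sum((y - np.polyval(c, x)) ** 2))
    drop = ssd(order - 1) - ssd(order)   # sum of squares gained by new term
    df = len(x) - (order + 1)            # residual degrees of freedom
    return drop / (ssd(order) / df)

# Illustrative curved calibration data with equally spaced x values
x = np.array([0.0, 5.0, 10.0, 15.0, 20.0, 25.0, 30.0])
y = np.array([1.0, 96.0, 180.0, 250.0, 305.0, 345.0, 370.0])
for order in (1, 2, 3):
    print(order, round(extra_term_F(x, y, order), 1))
```

At each stage only the one new term is tested; when its F value falls below the tabulated critical value, the lower-order model is retained.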
For the fluorescence emission data,
Figure 5 Orthogonal linear, quadratic, and cubic models for the fluorescence intensity data from Table 3
the quadratic by
4 Multivariate Regression
To this point, the discussion of regression analysis and its applications has
been limited to modelling the association between a dependent variable and a
single independent variable. Chemometrics is more often concerned with multi-
variate measures. Thus it is necessary to extend our account of regression to
include cases in which several or many independent variables contribute to the
measured response. It is important to realize at the outset that the term
independent variables as used here does not imply statistical independence, as
the x variables may be highly correlated.
In the simplest example, the dependent response variable, y, may be a
function of two such independent variables, x1 and x2,

y = a + b1x1 + b2x2

Again a is the intercept on the ordinate y-axis, and b1 and b2 are the partial
regression coefficients. These coefficients denote the rate of change of the mean
of y as a function of x1, with x2 constant, and the rate of change of y as a
function of x2 with x1 constant.
Multivariate regression analysis plays an important role in modern process
control analysis, particularly for quantitative UV-visible absorption spec-
trometry and near-IR reflectance analysis. It is common practice with these
techniques to monitor absorbance, or reflectance, at several wavelengths and
relate these individual measures to the concentration of some analyte. The
results from a simple two-wavelength experiment serve to illustrate the details
of multivariate regression and its application to multivariate calibration pro-
cedures.
Figure 6 presents a UV spectrum of the amino acid tryptophan. For quanti-
tative analysis, measurements at a single wavelength, e.g. λ14, would be ade-
quate if no interfering species are present. In the presence of other absorbing
species, however, more measurements are needed. In Table 10 are presented the
concentrations and measured absorbance values at λ14 of seven standard
(Figure 6: UV spectra of tryptophan and tyrosine; wavelength in arbitrary units)

Figure 7 The least squares linear model of absorbance (A14) vs. concentration of tryptophan, data from Table 10
Concn. (mg kg-1)   Absorbances
 0     0.0356   0.0390
 5     0.3068   0.2110
10     0.3980   0.1860
15     0.3860   0.0450
20     0.6020   0.1580
25     0.6680   0.1070
29     0.8470   0.2010

x1 (7)    0.3440   0.2010
x2 (14)   0.3670   0.0500
x3 (27)   0.0810   0.2110

Actual:     7      14      27    mg kg-1
Predicted: 10.26   11.14   28.20 mg kg-1
If a second term, say the absorbance at λ21, is added to the model equation,
the predictive ability is improved considerably. Thus by including A21, the
least-squares model is
Actual: 7 14 27 mg kg-'
Predicted: 7.03 14.04 26.99 mg kg-'
This model as given by Equation (41) could be usefully employed for the
quantitative determination of tryptophan in the presence of tyrosine.
Of course, the reason for the improvement in the calibration model when the
second term is included is that A21 serves to compensate for the absorbance due
to the tyrosine, since λ21 lies in the spectral region of a tyrosine absorption band
with little interference from tryptophan, Figure 6. In general, the selection of
variables for multivariate regression analysis may not be so obvious.
(Figure: spectra of mixtures with [Tr] = 2, 8, and 16 mg kg-1; wavelength in arbitrary units)

Figure 10 The predicted tryptophan concentration from the univariate regression model, using A14, vs. the true, known concentration. Prediction lines for test samples X1 and X2 are illustrated also
Figure 11 Residuals as a function of concentration for the univariate regression model, using A14, from Table 11
with a coefficient of determination, r², of 0.970. The model vs. actual data and
the residuals plot are shown in Figures 12 and 13. X1 and X2 are evaluated as
11.19 and 25.24 mg kg-1 respectively.
Although the bivariate model performs considerably better than the uni-
variate model, as evidenced by the smaller residuals, the calibration might be
improved further by including more spectral data. The question arises as to
which data to include. In the limit of course, all data will be used and the model
takes the form
Figure 12 True and predicted concentrations using the bivariate model with A14 and A21

Figure 13 Residuals as a function of concentration for the bivariate regression model, using A14 and A21, from Table 11
Tr = a + b1A9 + b2A12 + b3A15 + . . . + b7A27 (47)
To determine the value of each coefficient by least squares requires the
solution of eight simultaneous equations. In matrix notation the normal
equations can be expressed as,
where
and,
J. C. Davis, 'Statistics and Data Analysis in Geology', J. Wiley and Sons, New York, USA, 1973.
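In matrix notation, the least squares coefficients follow from solving the normal equations, (XᵀX)b = Xᵀy; the absorbance matrix below is illustrative, not the data of Table 11:

```python
import numpy as np

# Illustrative absorbance measurements at two wavelengths (rows = standards)
A = np.array([[0.04, 0.10], [0.31, 0.21], [0.40, 0.19],
              [0.39, 0.05], [0.60, 0.16], [0.67, 0.11], [0.85, 0.20]])
y = np.array([0.0, 5.0, 10.0, 15.0, 20.0, 25.0, 29.0])   # concentrations

# design matrix: a leading column of ones carries the intercept a
X = np.column_stack([np.ones(len(y)), A])

# normal equations: (X^T X) b = X^T y
b = np.linalg.solve(X.T @ X, X.T @ y)
print(b)          # intercept a, then the partial regression coefficients
pred = X @ b      # fitted concentrations for the calibration standards
```

Solving the normal equations directly is adequate for a handful of wavelengths; for many, highly correlated variables a more stable decomposition (e.g. QR or SVD) is usually preferred.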
To be used in a predictive equation these coefficients must be 'unstandard-
ized', and, from Equation (30),
Hence
(Figure 14: true and predicted concentrations for the full regression model, with prediction lines for X1 and X2)

Figure 15 Residuals as a function of concentration for the full regression model using all variables from Table 11
The matrix of residuals (Tr − T̂r, A9 − Â9, A15 − Â15, etc.) is given in Table
13, and the corresponding correlation matrix between these residuals in Table
14. From Table 14 the variable having the largest absolute correlation with the
Tr residuals is A9. Therefore we select this as the second variable to be added to
the regression model.
Hence, at step 2,
Forward regression proceeds to step 3 using the same technique. The variables
A9 and A12 are regressed on to each of the variables not in the equation
⁶ A. A. Afifi and V. Clark, 'Computer-Aided Multivariate Analysis', Lifetime Learning, California,
USA, 1984.
Table 13 Matrix of residuals for each variable after removing the linear model
using A12

Tr − T̂r   A9 − Â9   A15 − Â15   A18 − Â18   A21 − Â21   A24 − Â24   A27 − Â27
and the unused variable with the highest partial correlation coefficient is
selected as the next to use. If we continue in this way then all variables will
eventually be added and no effective subset will have been generated, so a
stopping rule is employed. The most commonly used stopping rule in commer-
cial programs is based on the F-test of the hypothesis that the partial corre-
lation coefficient of the variable to be entered in the equation is equal to zero.
No more variables are added to the equation when the F-value is less than some
specified cut-off value, referred to as the minimum F-to-enter value.
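A minimal sketch of forward selection with a minimum F-to-enter stopping rule, using hypothetical data in which only two of five variables carry information:

```python
import numpy as np

def forward_select(X, y, f_to_enter=4.60):
    """Forward selection: at each step enter the candidate variable with
    the largest F-to-enter; stop when none exceeds the threshold."""
    n, p = X.shape
    selected = []
    def ssd(cols):
        M = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
        b, *_ = np.linalg.lstsq(M, y, rcond=None)
        return float(np.sum((y - M @ b) ** 2))
    while True:
        base = ssd(selected)
        best, best_F = None, f_to_enter
        for c in range(p):
            if c in selected:
                continue
            new = ssd(selected + [c])
            df = n - len(selected) - 2        # residual d.f. with c entered
            F = (base - new) / (new / df)     # is the new coefficient zero?
            if F > best_F:
                best, best_F = c, F
        if best is None:                      # stopping rule: no candidate
            return selected                   # exceeds the F-to-enter cut-off
        selected.append(best)

# Hypothetical data: y depends on variables 0 and 2 only
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=30)
print(forward_select(X, y))
```

Backward elimination is the mirror image: start with every variable in the model and repeatedly drop the one with the smallest F-to-remove until all remaining coefficients are significant.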
A completed forward regression analysis of the UV absorbance data is
presented in Table 15. Using a cut-off F-value of 4.60 (the tabulated F value at the
95% confidence limit), three variables are included in the final equation:
The predicted vs. actual data are illustrated in Figure 16 and the residuals
Calibration and Regression Analysis 185
Table 15 Forward regression analysis of the data from Table 11. After three
steps no remaining variable has an F-to-enter value exceeding the
declared minimum of 4.60, and the procedure stops
Figure 16 True and predicted concentrations using three variables (A9, A12, and A21) from Table 11
plotted in Figure 17. Calculated values for X1 and X2 are 11.30 and 25.72 mg
kg-1 respectively.
An alternative method is backward elimination. This technique
starts with a full equation containing every measured variate and successively
deletes one variable at each step. The variables are dropped from the equation
on the basis of testing the significance of the regression coefficients, i.e. for each
variable is the coefficient zero? The F-statistic is referred to as the computed
F-to-remove. The procedure is terminated when all variables remaining in the
model are considered significant.
Table 16 illustrates a worked example using the tryptophan data. Initially,
with all variables in the model, the variable with the smallest computed
F-to-remove value is eliminated at the first step. The procedure proceeds by
computing a new regression equation with the
step. The procedure proceeds by computing a new regression equation with the
remaining six variables and again examining the calculated F-to-remove values
for the next candidate for elimination. This process continues until no variable
can be removed since all F-to-remove values are greater than some specified
maximum value. This is the stopping rule; F-to-remove = 4 was employed here.
It so happens in this example that the results of performing backward
elimination regression are identical with those obtained from the forward
regression analysis. This may not be the case in general. In its favour, forward
regression generally involves a smaller amount of computation than backward
elimination, particularly when many variables are involved in the analysis.
However, should it occur that two or more variables combine together to be a
good predictor compared with single variables, then backward elimination will
often lead to a better equation.
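The forward-selection loop described above can be sketched in Python; the function names, the synthetic data used to exercise it, and the exact form of the partial F computation are illustrative assumptions, not the commercial program referred to in the text.

```python
import numpy as np

def f_to_enter(X, y, current, candidate):
    """Partial F-statistic for adding `candidate` to the variables in `current`."""
    n = len(y)

    def rss(cols):
        # Residual sum of squares for an intercept-plus-`cols` model.
        A = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
        resid = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
        return float(resid @ resid)

    rss0, rss1 = rss(current), rss(current + [candidate])
    df = n - len(current) - 2              # residual d.f. of the larger model
    return (rss0 - rss1) / (rss1 / df)

def forward_selection(X, y, f_min=4.60):
    """Add one variable per step until no F-to-enter exceeds `f_min`."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining:
        f_vals = {j: f_to_enter(X, y, selected, j) for j in remaining}
        best = max(f_vals, key=f_vals.get)
        if f_vals[best] < f_min:
            break                          # stopping rule: minimum F-to-enter
        selected.append(best)
        remaining.remove(best)
    return selected
```

The loop mirrors the behaviour summarized in Table 15: one variable enters per step, and the declared minimum F-to-enter (4.60 above) is the stopping rule. Backward elimination is the same skeleton run in reverse, deleting the variable with the smallest F-to-remove at each step.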
Finally, stepwise regression, a modified version of the forward selection technique, is often available with commercial programs. As with forward selection, variables are added to the model one at a time but, at each step, the variables already in the equation are re-examined and may be removed if their significance falls below the declared threshold.
Table 16 Backward regression analysis of the data from Table 11. After four steps, three variables remain in the regression equation; their F-to-remove values exceed the declared maximum value of 4.0
Figure 18 Scatter plot of absorbance data at three wavelengths, A12, A15, and A18, from Table 11. The high degree of collinearity, or correlation, between these data is evidenced by their lying on a plane and not being randomly distributed in the pattern space
Figure 19 The first principal component, PC1, from A12, A15, and A18 vs. tryptophan concentration
By principal components analysis two new variables can be defined containing over 99% of the variance of the three original variables. The first principal component alone accounts for over 90% of the total variance, and a plot of Tr against PC1 is shown in Figure 19.
The use and application of principal components in regression analysis has been extensively reported in the chemometrics literature. We can calculate the principal components from our data set, so providing us with a set of new, orthogonal variables. Each of these principal components will be a linear combination of, and contain information from, all the original variables. By selecting an appropriate subset of principal components, the regression model is reduced in size whilst retaining the relevant information from the original data. The PCR technique described here follows the methodology described by Martens and Naes and is applied to the data from Table 11. The original data are preprocessed by mean-centring. The variance-covariance dispersion matrix is then computed and, from this square, symmetric matrix, we calculate the normalized eigenvalues and eigenvectors. From each eigenvector the principal component scores are determined, and by conventional regression analysis the calibration model is developed. The stepwise procedure is illustrated in Table 17 and we will now follow the steps involved.
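The PCR steps just listed (mean-centring, eigenanalysis of the variance-covariance matrix, score calculation, and regression on the retained scores) might be sketched as follows; the function name and the example data are my own illustration, not the Martens and Naes implementation.

```python
import numpy as np

def pcr_fit(X, y, n_factors=1):
    """Principal components regression: mean-centre, eigenanalysis of the
    covariance matrix, compute scores, then regress y on the scores."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean                    # mean-centring
    cov = np.cov(Xc, rowvar=False)                     # variance-covariance matrix
    evals, evecs = np.linalg.eigh(cov)                 # eigenvalues in ascending order
    V = evecs[:, np.argsort(evals)[::-1][:n_factors]]  # largest factors first
    T = Xc @ V                                         # principal component scores
    b = np.linalg.lstsq(T, yc, rcond=None)[0]          # regression on the scores
    coefs = V @ b                                      # back in terms of the X variables
    return y_mean - x_mean @ coefs, coefs              # intercept, coefficients
```

Predictions are then intercept + X_new @ coefs; retaining one factor corresponds to the model of Figure 20 and two factors to that of Figure 21.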
[Table 17: stepwise PCR calculation on the data from Table 11 (mean-centred data, eigenvectors, principal component scores, and regression estimates)]
and the estimated regression coefficients for the one-factor model, from Equation (58), are
Table 18 Eigenvalues of the mean-centred original data (Table 11). Over 97% of the original variance can be accounted for in the first two principal components
Therefore,

Tr(single factor) = -8.99 + 7.82A9 + 12.99A12 + 14.37A15 + 10.34A18 + 11.78A21 + 0.24A24 + 1.06A27    (64)
Figure 20 True vs. predicted tryptophan concentration using only the first principal component in the regression model
Figure 21 True vs. predicted tryptophan concentration using the first two principal components in the regression model
The contribution of the first factor is then subtracted from X0 and y0, yielding the residual matrix X1 and vector y1. The process is repeated with the second scores determined from the second eigenvector, and a two-factor model developed. Figure 21 shows comparative results.
Figure 22 True vs. predicted tryptophan concentration using the first three principal components in the regression model
Table 19 Partial least squares regression on the data from Table 11. The steps involved are discussed in the text

          A9       A12      A15      A18      A21      A24      A27      Tr
Mean      0.529    0.466    0.565    0.506    0.204    0.068    0.061    15.00

X0 (mean-centred data)                                            y0        t1
  0.103  -0.174   -0.247   -0.070    0.092    0.001    0.018     -13      -0.296
  0.029  -0.191   -0.147   -0.038    0.054    0.048    0.011     -11      -0.235
  0.036  -0.166   -0.173   -0.005    0.075   -0.028   -0.009      -9      -0.227
  0.020  -0.134   -0.063    0.003    0.020    0.013   -0.043      -7      -0.129
  0.041  -0.115   -0.116   -0.026    0.018   -0.012   -0.036      -5      -0.159
 -0.256  -0.157   -0.138   -0.182   -0.048   -0.012    0.019      -3      -0.279
 -0.253  -0.088   -0.145   -0.241   -0.141   -0.049   -0.055      -1      -0.255
 -0.060  -0.022   -0.015   -0.050   -0.023   -0.005   -0.008       1      -0.046
 -0.025   0.085    0.020    0.018   -0.032    0.042    0.017       3       0.074
  0.025   0.100    0.089    0.007   -0.036    0.002    0.022       5       0.133
 -0.028   0.087    0.102    0.015   -0.061    0.035   -0.026       7       0.130
 -0.065   0.170    0.126    0.019   -0.082    0.009    0.039       9       0.201
  0.214   0.277    0.336    0.279    0.109    0.020    0.011      11       0.514
  0.225   0.327    0.374    0.267    0.057   -0.044    0.034      13       0.574

w1T       0.114    0.647    0.675    0.326   -0.069    0.002    0.039
p1T       0.292    0.599    0.652    0.430    0.043    0.009    0.046    q1 = 26.47
b         3.012   17.13    17.85     8.64    -1.83     0.05     1.03     a  = -8.73
Table 19 continued
Figure 23 True vs. predicted tryptophan concentration using a one-factor partial least squares regression model
where W is the matrix of loading weights, each column being a weight vector, and P is the matrix of loadings. With the single factor in the model the regression equation, using the coefficients b and intercept a from Table 19, is

Tr(single factor) = -8.73 + 3.01A9 + 17.13A12 + 17.85A15 + 8.64A18 - 1.83A21 + 0.05A24 + 1.03A27
The scatter plots are shown in Figure 24. The sums of squares of the residuals for the one-, two-, and three-factor models are 201, 11.53, and 10.64 respectively, and estimated tryptophan concentrations for the test solutions were obtained in the same way.

Figure 24 True vs. predicted tryptophan concentration using a two-factor partial least squares regression model
As with PCR, a regression model built from two orthogonal new variables
serves to provide good predictive ability.
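A single-factor PLS1 calculation of the kind laid out in Table 19 can be sketched as below; the notation (w, t, p, q, b, a) follows the table, but the code is an illustrative reconstruction of the steps, not the authors' program.

```python
import numpy as np

def pls1_one_factor(X, y):
    """One-factor PLS1 following the layout of Table 19."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    X0, y0 = X - x_mean, y - y_mean       # mean-centred data block and vector
    w = X0.T @ y0
    w /= np.linalg.norm(w)                # normalized loading weights, w1
    t = X0 @ w                            # scores, t1
    p = X0.T @ t / (t @ t)                # x-loadings, p1
    q = (y0 @ t) / (t @ t)                # y-loading, q1
    b = q * w                             # regression coefficients, b
    a = y_mean - x_mean @ b               # intercept, a
    return a, b, w, t, p, q
```

For a single factor pT·w = 1, so the coefficient vector is simply b = q·w; applied to the mean-centred data of Table 19 these formulas give back, to rounding, the tabulated w1, q1, b, and a entries. A second factor would be extracted by deflating X0 and y0 with t1 and repeating.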
Regression analysis is probably the most popular technique in statistics and data analysis, and commercial software packages will usually provide for multiple linear regression with residuals analysis and variable subset selection. The least squares method is susceptible to outliers, and graphical display of the data is recommended to aid their detection. In an attempt to overcome many of the problems associated with ordinary least squares regression, several other calibration and prediction models have been
developed and applied. As well as principal components regression and partial
least squares regression, ridge regression should be noted. Although PCR has been extensively applied in chemometrics, it is seldom recommended by statisticians. Ridge regression, on the other hand, is well known and often advocated amongst statisticians but has received little attention in chemometrics. The method artificially reduces the correlation amongst variates by modifying the correlation matrix in a well defined but empirical manner. Details of the method can be found in Afifi and Clark.6 To date there have been relatively few direct comparisons of the various multivariate regression techniques, although Frank and Friedman14 and Wold15 have published a theoretical, statistics-based comparison which is recommended to interested readers.
[Table of % transmission data: Sample, A1, A2, A3]
where m is the number of rows and n is the number of columns. Each individual element of the matrix is usually written as aij (i = 1 . . . m, j = 1 . . . n). If n = m then the matrix is square, and if aij = aji it is symmetric.
A matrix with all elements equal to zero except those on the main diagonal is
called a diagonal matrix. An important diagonal matrix commonly encountered
in matrix operations is the unit matrix, or identity matrix, denoted I, in which all
the diagonal elements have the value 1, Table 2.
In a similar fashion, the transpose of a row vector is a column vector, and vice
versa. Note that a symmetric matrix is equal to its transpose.
Matrix operations with scalar quantities are straightforward. To multiply the matrix A by the scalar number k implies multiplying each element of A by k.
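These definitions are easily checked numerically; a minimal sketch with arbitrary matrix values:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 4.0]])       # symmetric, since a_ij == a_ji
I = np.eye(2)                    # identity matrix: 1s on the main diagonal
k = 3.0

assert np.array_equal(A, A.T)    # a symmetric matrix equals its transpose
assert np.array_equal(A @ I, A)  # multiplying by I leaves A unchanged
assert np.array_equal(k * A, np.array([[3.0, 6.0],
                                       [6.0, 12.0]]))  # scalar multiplication is element-wise
```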
Sample   A1     A2
Absorbance
0.09 0.24
0.12 0.12
0.24 0.60
0.19 0.27
0.60 0.49
0.49 0.44
0.35 0.27
0.25 0.77
0.24 0.23
0.33 0.19
ci = log(100/ai)
where A is the matrix of absorbance values for the mixtures, E the matrix of
absorption coefficients, and C the matrix of concentrations. The right-hand
side of Equation (7) involves the multiplication of two matrices, and the
equation can be written as
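Assuming the matrix form A = CE (rows of C are samples, rows of E are the component absorption coefficients across the wavelengths), the concentrations in a mixture can be recovered from measured absorbances by least squares. The coefficient values below are invented for illustration; the 1.2 and 1.5 mg l-1 echo the two-component mixture discussed later in the text.

```python
import numpy as np

# Hypothetical absorption coefficients: one row per component,
# one column per wavelength (values invented for illustration).
E = np.array([[0.50, 0.10, 0.05],    # component 1, e.g. tryptophan
              [0.08, 0.40, 0.30]])   # component 2, e.g. tyrosine

c_true = np.array([1.2, 1.5])        # concentrations in the mixture, mg l-1
a = c_true @ E                       # Beer's law in matrix form: A = CE

# Recover the concentrations from the measured absorbances by least squares:
c_est, *_ = np.linalg.lstsq(E.T, a, rcond=None)
```

With more wavelengths than components the system is overdetermined, and least squares gives the best-fit concentrations.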
Appendix
Table 4 The diagonal matrix of weights for normalizing the absorbance data (a) and the normalized absorbance data matrix (b)

Sample   A1     A2     A3
Absorbance
and the result represents the sums of the products of the elements of x and y.
where θ is the angle between the lines connecting the two points defined by each vector and the origin, Figure 3. If xT·y = 0 then, from Equation (12), the two vectors are at right angles to each other and are said to be orthogonal.
Sums of squares and products are basic operations in statistics and chemometrics. For a data matrix represented by X, the matrix of sums of squares and products is simply XTX. This can be extended to produce a weighted sums of squares and products matrix,

C = XTWX

where W is a diagonal matrix, the diagonal elements of which are the weights for each sample.
These operations have been employed extensively throughout the text; see,
for example, the calculation of covariance and correlation about the mean and
the origin developed in Chapter 3.
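The weighted sums of squares and products matrix can be verified element by element; a small sketch with arbitrary data and weights:

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 1.0],
              [0.0, 4.0]])           # three samples, two variables
w = np.array([1.0, 2.0, 0.5])        # one weight per sample

W = np.diag(w)                       # diagonal matrix of weights
C = X.T @ W @ X                      # weighted sums of squares and products

# The same matrix accumulated sample by sample:
C_check = sum(wi * np.outer(xi, xi) for wi, xi in zip(w, X))
assert np.allclose(C, C_check)       # C is symmetric and matches the sum
```

Setting W = I recovers the ordinary sums of squares and products matrix XTX.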
Solving the two simultaneous equations by subtraction, the mixture is found to contain 1.2 mg l-1 of tryptophan and 1.5 mg l-1 of tyrosine.
and if A = I the quadratic form reduces to the simple sum of squares, xTx. Thus, the quadratic form generally expands to the quadratic equation describing an ellipse in two dimensions, or an ellipsoid (hyper-ellipsoid) in higher dimensions, as described in Chapter 1.
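The expansion of the quadratic form is easy to confirm numerically; with A = I it collapses to a plain sum of squares (the matrix values here are arbitrary):

```python
import numpy as np

A = np.array([[2.0, 0.5],
              [0.5, 1.0]])           # symmetric 2 x 2 matrix (arbitrary values)
x = np.array([1.0, 2.0])

q = x @ A @ x                        # the quadratic form xT A x
# Expanded term by term: a11*x1^2 + 2*a12*x1*x2 + a22*x2^2
expanded = 2.0 * 1.0**2 + 2.0 * 0.5 * 1.0 * 2.0 + 1.0 * 2.0**2
assert np.isclose(q, expanded)

# With A = I the form reduces to the sum of squares:
assert np.isclose(x @ np.eye(2) @ x, np.sum(x ** 2))
```

Holding q constant and letting x vary traces out the ellipse (or, in higher dimensions, the ellipsoid) described in the text.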
Subject Index

Calibration, 155
Cauchy function, 14
Central limit theorem, 6
Centroid clustering, 106
Characteristic roots, 73
Characteristic vectors, 73
Chernoff faces, 25
City-block distance, 100
Classification, 93, 123
Cluster analysis, 94
  hierarchical, 105
  k-means, 109
Clustering
  centroid, 106
  complete linkage, 106
  furthest neighbours, 103, 107
  fuzzy, 104, 105
  group average, 106
Clustering algorithms, 103
Co-adding spectra, 35
Coefficient of determination, 160
Coefficient of variation, 5
Common factors, 85
  communality, 87
Decision limit, 32, 33
Degrees of freedom, 8
Dendrogram, 97, 105
Detection limit, 32, 33
Determination limit, 32, 33
Differentiation, 55
  Savitsky-Golay, 57
Discriminant function, 124, 130
Discriminant score, 130
Discrimination, 123
Dispersion matrix, 82
Distance measures, 99
Dixon's Q-test, 13
Eigenanalysis, 54
Eigenvalue, 71
Eigenvector, 71
Equivalent width, 44
Error rate, of classification, 126
Errors, 1
  and regression analysis, 158
Euclidean distance, 99
  weighted, 100
F-Test, 9
Factor analysis, 79
  Q-mode, 84
  R-mode, 85
  target transform, 91
Feature extraction, 54
Feature selection, 54
Filtering, 41
Flicker noise, 31
Fourier integral, 41
Fourier pairs, 42
Fourier transform, 28
Furthest neighbours clustering, 103, 107
Fuzzy clustering, 104, 115
Gaussian distribution, 2
Generalized delta rule, 150
Goodness of fit, 159
Group average clustering, 106
Half-width, 15
Hamming distance, 140
Heaviside function, 144
Hidden layers, in ANN, 151
Homoscedastic data, 159
Identity matrix, 206
Integration, 62
  Simpson's method, 64
Interference noise, 31
Interpolation, 47
  linear, 47
  polynomial, 48
  spline, 50
Mapping displays, 23
Matrix, confusion, 127
  determinant, 212
  dispersion, 82
  identity, 206
  inverse, 210
  quadratic form, 212
  singular, 211
  square, 204
  symmetric, 204
Matrix multiplication, 207
Mean centring, 17
Mean value, 2
Membership function, 117
Minkowski metrics, 99
Moving average, 36
Multiple correlation, 183
Multiple regression, backward elimination, 182
Multiple regression, forward selection, 182
Multivariate regression, 171
Nearest neighbours, classification, 138
  clustering, 103, 107
Neural networks, 147
Noise, 31
Normal distribution, 2
  multivariate, 21
Normal equations, 39
Normalized linear combinations, 65
Null hypothesis, 6
Numerical taxonomy, 94
Nyquist theory, 29