5.21. Chemometric Methods Applied To Analytical Data
situation where unknown samples are compared with suitable reference materials, either by direct comparison or indirect estimation, e.g. using a chemometric model.

Quantitative data analysis, on the other hand, mainly consists of calibration, followed by direct application to new and unknown samples. Calibration consists of predicting the mathematical relationship between the property to be evaluated (e.g. concentration) and the variables measured.

1-2. GOOD CHEMOMETRIC PRACTICE
The following notation will be used in the chapter :
X, Y   data sets
X   independent variable
Y   dependent variable
X, Y   matrices
x, y   vectors
x, y   scalar values
i, j   indices, points
xi   ith value of vector x
xi,j   ith and jth value of matrix X
XT   transpose of matrix X
X-1   inverse (if it exists) of matrix X
X̄   mean centre of matrix X
X̂   estimate of matrix X
|X|   determinant of (square) matrix X
||x||   norm of vector x
b   regression equation coefficient
e   residuals of X
f   residuals of Y

1-2-1. Figures of merit for regression
In quantitative analysis, building a regression model involves fitting a mathematical relationship to the corresponding independent data (X) and dependent data (Y). The independent data may represent a collection of signals, i.e. responses from a number of calibration samples, while the dependent data may correspond to the values of an attribute, i.e. the property of interest in the calibration samples. It is advisable to test the regression model with internal and external test sets. The internal test set consists of samples that are used to build the model (or achieve calibration) by applying resampling within the calibration data and samples that are initially left out of the calibration in order to validate the model. Use of the internal test set is part of model optimisation and model selection. The external independent test set represents data that normally is available after the model has been fixed, thus the external test set challenges the model and tests its robustness for the analysis of future data.

1-2-1-1. Root mean square error of prediction
The link between X and Y is explored through a common set of samples (calibration set) from which both x and y-values have been collected and are clearly known. For a second set of samples (validation set) the predicted y-values are then compared to the reference y-values, resulting in a prediction residual that can be used to compute a validation residual variance, i.e. a measure of the uncertainty of future predictions, which is referred to as root mean square error of prediction (RMSEP). This value estimates the uncertainty that can be expected when predicting y-values for new samples. Since no assumptions concerning statistical error distribution are made during modelling, prediction error cannot be used to report a valuable statistical interval for the predicted values. Nevertheless, RMSEP is a good error estimate in cases where both calibration and validation sample sets are representative of future samples.

A confidence interval for predicted y-values would be ± n × RMSEP, with n fixed by the operator. A common choice is n = 2. This choice should be dependent on the requirements of the specific analytical method.

Chemometric models can end up with better precision than the reference methods used to acquire calibration and testing data. This is typically observed for water content determinations by NIR and PLS where semi-micro determination of water by titration (2.5.12) is the reference method.

1-2-1-2. Standard error of calibration and coefficient of determination
Figures of merit can be calculated to help assess how well the calibration fits the data. Two examples of such statistical expressions are the standard error of calibration (SEC) and the coefficient of determination (R2).

SEC has the same units as the dependent variables and reflects the degree of modelling error, but cannot be used to estimate future prediction errors. It is an indication of whether the calculation using the calibration equation will be sufficiently accurate for its intended purpose. In practice SEC has to be compared with the error of the reference method (SEL, Standard Error of Laboratory, see Glossary). Usually SEC is larger than SEL, in particular if modelling does not account for all interferences in the samples or if other physical phenomena are present.

The coefficient of determination (R2) is a dimensionless measure of how well the calibration fits the data. R2 can have values between 0 and 1. A value close to 0 indicates that the calibration fails to relate the data to the reference values and as the coefficient of determination increases, the X-data becomes an increasingly more accurate predictor of the reference values. Where there is more than 1 independent variable, adjusted R2 should be used rather than R2, since the number of independent variables in the model inflates the latter even if the fraction of variance explained by the model is not increased.
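As an illustration only, the figures of merit defined above can be computed in a few lines. The sketch below (Python with NumPy) assumes that reference values y_ref, fitted or predicted values, and the number of fitted model parameters n_param are already available as arrays and an integer; the function and variable names are arbitrary, and the divisor used for SEC is one common convention.

    import numpy as np

    def rmsep(y_ref, y_pred):
        # root mean square error of prediction over a validation or test set
        residuals = y_ref - y_pred
        return np.sqrt(np.mean(residuals ** 2))

    def sec(y_ref, y_fit, n_param):
        # standard error of calibration: calibration residuals corrected for
        # the number of fitted model parameters (one common convention)
        residuals = y_ref - y_fit
        return np.sqrt(np.sum(residuals ** 2) / (len(y_ref) - n_param))

    def r_squared(y_ref, y_pred):
        # coefficient of determination
        ss_res = np.sum((y_ref - y_pred) ** 2)
        ss_tot = np.sum((y_ref - np.mean(y_ref)) ** 2)
        return 1.0 - ss_res / ss_tot

    # approximate confidence interval of +/- 2 x RMSEP for a new prediction y_new:
    # error = rmsep(y_ref, y_pred)
    # interval = (y_new - 2 * error, y_new + 2 * error)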
1-2-2. Implementation steps
The implementation of chemometric methods varies case by case depending on the specific requirements of the system to be analysed. The following generic approach can be followed when analysing non-designed data sets:
– in formulating the study problem, define the precise objective of data collection and the expected analysis results;
– investigate the origin and availability of the data. The data set should cover the variation of the explored variable(s) or attribute(s);
– if the available data does not cover the expected variation, prepare and measure samples that fill the gap;
– variable selection : sometimes selecting the right variables can give more robustness and also enhance model accuracy;
– raw data may have to be transformed and mathematical pre-treatments performed;
– elaborate the model through calibration and validation;
– challenge the model and check its performance on new samples or data;
– validate the method according to current pharmaceutical usage and requirements.

1-2-3. Data considerations
1-2-3-1. Sample quality
Careful sample selection increases the likelihood of extracting useful information from the analytical data. Whenever it is possible to actively adjust selected variables or parameters according to an experimental design, the quality of the results is increased. Experimental design (also referred to as design of experiments, DoE) can be used to introduce systematic and controlled changes between samples, not only for analytes, but also for interferences. When modelling, common considerations include the determination of which variables are necessary to adequately describe the samples, which samples are similar to each other and whether the data set contains related sub-groups.

1-2-3-2. Data tables, geometrical representations
Sample responses result in a group of numerical values relating to signal intensities (X-data), i.e. the independent variables. However, it should be recognised that these variables are not necessarily linearly independent (i.e. orthogonal) according to mathematical definitions. These values are best represented in data tables and by convention each sample is associated with a specific row of data. A collection of such rows constitutes a matrix, where the columns are the variables. Samples can then be associated with certain features reflecting their characteristics, i.e. the value of a physical or chemical property or attribute and these data are usually referred to as the Y-data, i.e. the dependent variables. It is possible to add this column of values to the sample response matrix, thereby combining both the response and the attribute of each sample.

When n objects are described by m variables the data table corresponds to an n×m matrix. Each of the m variables represents a vector containing n data values corresponding to the objects. Each object therefore appears as a point in an m dimensional space described by its m coordinate values (1 value for each variable in the m axes).

1-2-3-3. First assessment of data
Before performing multivariate data analysis, the quality of the sample response can be optionally assessed using statistical tools. Graphical tools are recommended for the 1st visual assessment of the data, e.g. histograms and/or boxplots for variables for evaluation of the data distribution, and scatter plots for detection of correlations. Descriptive statistics are useful for obtaining a rapid evaluation of each variable, taken separately, before starting multivariate analysis. For example, mean, standard deviation, variance, median, minimum, maximum and lower/upper quartile can be used to assess the data and detect out-of-range values and outliers, abnormal spread or asymmetry. These statistics reveal anomalies in a data table and indicate whether a transformation might be useful or not. Two-way statistics, e.g. correlation, show how variations in 2 variables may correlate in a data table. Verification of these statistics is also useful when reducing the size of the data table, as they help in avoiding redundancies.

1-2-3-4. Outliers
An outlier is a sample that is not well described by the model. Outliers can be X or Y in origin. They reflect unexpected interference in the original data or measurement error. The predicted data that is very different from the expected value calls into question the suitability of the modelling procedure and the range spanned by the original data. In prediction mode, outliers can be caused by changes in the interaction between the instrument and samples or if samples are outside the model's scope. If this new source of variability is confirmed and is relevant, the corresponding data constitutes a valuable source of information. Investigation is recommended to decide whether the existing calibration requires strengthening (updating) or whether the outliers should be ignored as uncritical or unrelated to the process (i.e. operator error).

In the case of classification an outlier test should be performed on each class separately.

1-2-3-5. Data error
Types of data error include random error in the reference values of the attributes, random error in the collected response data and systematic error in the relationship between the two. Sources of calibration error are problem specific, for example, reference method errors and errors due to either sample non-homogeneity or the presence of non-representative samples in the calibration set. Model selection during calibration usually accounts for only a fraction of the variance or error attributable to the modelled analytical technique. However, it is difficult to assess if this error is more significant than the reference method error or vice versa.

1-2-3-6. Pre-processing and variable selection
The raw data may not be optimal for analysis and are generally pre-processed before performing chemometric calculations to improve the extraction of physical and chemical information. Interferences, for example background effects, baseline shifts and measurements in different conditions, can impede the extraction of information when using multivariate methods. It is therefore important to minimise the noise introduced by such effects by carrying out pre-processing operations.

A wide range of transformations (scaling, smoothing, normalisation, derivatives, etc.) can be applied to X-data as well as Y-data for pre-processing prior to multivariate data analysis in order to enhance the modelling. The main purpose of these transformations is focussing the data analysis on the pertinent variability within the data set. For example, pre-processing may involve mean centering of variables so that the mean does not influence the model and thus reduce the model rank.

The selection of the pre-processing is mostly driven by parameters such as type of data, instrument or sample, the purpose of the model and user experience. Pre-processing methods can be combined, for example standard normal variate (SNV) with 1st derivative, as an empirical choice.

1-2-4. Maintenance of chemometric models
Chemometric methods should be reassessed regularly to demonstrate a consistent level of acceptable performance. In addition to this periodical task, an assessment should be carried out for critical parameters when changes are made to application conditions of the chemometric model (process, sample sources, measurement conditions, analytical equipment, software, etc.).

The aim of maintaining chemometric models up-to-date is to provide applications that are reliable over a longer period of use. The extent of the validation required, including the choice of the necessary parameters, should be based on risk analysis, taking into account the analytical method used and the chemometric model.

1-3. ASSESSMENT AND VALIDATION OF CHEMOMETRIC METHODS
1-3-1. Introduction
Current use of the term ‘validation’ refers to the regulatory context as applied to analytical methods, but the term is also used to characterise a typical computation step in chemometrics. Assessment of a chemometric model consists of evaluating the performance of the selected model in order to design the best model possible with a given set of data and prerequisites. Provided sufficient data are available, a distribution into 3 subsets should always be considered : 1) a learning set to elaborate models, 2) a validation set to select the best model, i.e. the model that enables the best predictions to be made, 3) an independent test set to estimate objectively the performance of the selected final model. Introducing a 3rd set for objective model performance evaluation is necessary to estimate the model error, among other performance indicators. An outline is given below on how to perform a typical assessment of a chemometric model, starting with the initial validation, followed by an independent test validation and finally association/correlation with regulatory requirements.
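For illustration, the pre-processing operations of section 1-2-3-6 and the distribution into learning, validation and test subsets described in section 1-3-1 can be sketched as follows (Python with NumPy). The array names, the SNV/mean-centring combination and the 60/20/20 split are arbitrary choices made for this sketch, not requirements of this chapter.

    import numpy as np

    def snv(X):
        # standard normal variate: centre and scale each sample (row) individually
        mean = X.mean(axis=1, keepdims=True)
        std = X.std(axis=1, ddof=1, keepdims=True)
        return (X - mean) / std

    def mean_centre(X):
        # subtract the column (variable) means, as described for mean centring
        return X - X.mean(axis=0)

    def split_three_ways(n_samples, f_val=0.2, f_test=0.2, seed=0):
        # random distribution of sample indices into learning, validation and test subsets
        rng = np.random.default_rng(seed)
        idx = rng.permutation(n_samples)
        n_test = int(round(f_test * n_samples))
        n_val = int(round(f_val * n_samples))
        return idx[n_test + n_val:], idx[n_test:n_test + n_val], idx[:n_test]

    # X = ... (n samples x m variables)
    # X_pre = mean_centre(snv(X))
    # learn, val, test = split_three_ways(X.shape[0])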
1-3-2. Assessment of chemometric models
1-3-2-1. Validation during modelling
Typically, algorithms are iterative and perform self-optimisation during modelling through an on-going evaluation of performance criteria and figures of merit. This step is called validation. The performance criteria are specific to the chemometric technique used and to the nature of the analytical data, as well as the purpose of the overall method which includes both the analytical side and the chemometric model. The objective of the validation is to evaluate the model and provide help to select the best performing model. Selected samples are either specifically assigned for this purpose or are selected dynamically through reselection/re-use of data from a previous data set (sometimes called resampling – for clarification, see Glossary). A typical example of data reselection is cross-validation with specific segments, for example ‘leave-one-out’ cross-validation when samples are only a few, or ‘leave-subset-out’ cross-validation (Figure 5.21.-1). Another type of resampling is bootstrapping.

1-3-2-2. Assessment of the model
Once the model matches the optimisation requirements, fitness for purpose is assessed. Independent samples not used for modelling or model optimisation are introduced at this stage as an independent test-set in order to evaluate the performance of the model. Ideally, when sufficient data are available, the sample set can be split into 3 subsets comprising 1 learning set for model computation, 1 validation set for optimisation of the model, and 1 test set for evaluation of the prediction ability, i.e. whether the model is fit for purpose. The 3 subsets are treated independently and their separation should be performed in such a way that model computation is not biased. The aim is to obtain a representative distribution of the samples within the 3 subsets with regard to their properties and expected values.

1-3-2-3. Size and partitioning of data sets
The size of the data set needed for building the calibration is dependent on the number of analytes and interfering properties that need to be handled in the model. The size of the learning data set for calibration usually needs to be larger when the interfering variations are acquired randomly than when all major interferences are known and they can be varied according to a statistical experimental design. The lowest possible number of samples needed to cover the calibration range can be estimated from the corresponding design. The size of the independent test set should be in the order of 20-40 per cent of the samples used for the calibration model. However, when representative samples are abundant, the larger the test data set (above 40 per cent), the more reliably the prediction performance can be estimated. It is common practice to mix learning and model validation sets and as a result, the definitive assessment of the model relies on the separate test set.

1-3-3. Validation according to the regulatory framework
Validation principles and considerations are described in established international guidelines and apply to the validation of analytical methods. However, due to the special nature of data treatment and evaluation, as carried out in most chemometric methods, additional aspects have to be taken into account when validating analytical procedures. In this context, validation comprises both the assessment of the analytical method performance and the evaluation of the model. In some special cases, it might only be necessary to validate the chemometric model (see section 1-2-4.).

1-3-3-1. Qualitative models
For validation of qualitative models, the most critical parameters are specificity and robustness. When not applicable, scientific justification is required.

Specificity
During validation it has to be shown that the model possesses sufficient discriminatory capability. Therefore, a suitable set of materials that pose a risk of mix-up must be defined and justified. If, in addition to chemical identification, other parameters (such as polymorphism, particle size, moisture content, etc.) are relevant, a justification for these parameters should also be included. The selection of materials to be included when validating specificity should be based on logistic considerations (e.g. materials handled close to the process under review, especially those with similar appearance), chemical considerations (e.g. materials with similar structure) and also physical considerations where relevant (e.g. materials with different physical properties). After definition of this set of materials, the discriminatory ability of the chemometric method to reject them must be proven. Therefore, for each material a representative set of samples covering typical variance within the material has to be analysed and evaluated. If the specificity of the chemometric model is insufficient, the parameters of the model should be optimised accordingly and the method revalidated.

Whenever new elements that may potentially affect identification are introduced, e.g. new materials that are handled at the same site and represent a risk of mix-up, a revalidation of specificity should be carried out. This revalidation can be limited to the new element and does not necessarily need to encompass the complete set of elements, whose constituents may not all be affected by the change. If properties of materials change over time (e.g. batches of materials with lower or higher particle size, lower or higher moisture content etc.) and these changes become relevant, they should also be included as part of the validation. This can be achieved for example, by an amendment to the validation protocol and does not necessarily require a complete revalidation of the chemometric model.

To assess specificity, the number of false-positive and false-negative errors can be evaluated by classification of the test set.

Robustness
For validation of robustness, a comprehensive set of critical parameters (e.g. process parameters such as temperature, humidity, instrumental performance of the analytical equipment) should be considered. The reliability of the analytical method should be challenged by variation of these parameters. It can be advantageous to use experimental design (DoE) to evaluate the method.
Figure 5.21.-1. – Cross-validation with leave subset of 3 out applied to linear regression. Regression model data = ●. Subset used
for test = ○. The errors of fit (interrupted lines) are collected to form the cumulated cross-validation error.
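The ‘leave-subset-out’ cross-validation illustrated in Figure 5.21.-1 can be sketched as follows (Python with NumPy). The fit and predict arguments stand for whatever regression model the user has chosen and are assumptions of this sketch; the squared errors of fit collected over all left-out segments form the cumulated cross-validation error.

    import numpy as np

    def cross_validation_error(X, y, fit, predict, n_segments=3, seed=0):
        # leave-subset-out cross-validation: each segment is left out once,
        # the model is rebuilt on the remaining samples and the prediction
        # errors on the left-out segment are accumulated
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(y))
        segments = np.array_split(idx, n_segments)
        squared_errors = []
        for segment in segments:
            train = np.setdiff1d(idx, segment)
            model = fit(X[train], y[train])
            residuals = y[segment] - predict(model, X[segment])
            squared_errors.extend(residuals ** 2)
        return np.sqrt(np.mean(squared_errors))   # cumulated cross-validation error

    # example with an ordinary least squares fit:
    # fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
    # predict = lambda b, X: X @ b
    # rmsecv = cross_validation_error(X, y, fit, predict)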
To assess robustness the number of correct classifications, correct rejections, false-positive and false-negative errors can be evaluated by classification of samples under robustness conditions.

1-3-3-2. Quantitative models
The following parameters should be addressed unless otherwise justified : specificity, linearity, range, accuracy, precision and robustness.

Specificity
It is important to detect that the sample that is quantified is not an outlier with respect to the calibration space. This can be done using the correlation coefficient between the sample and the calibration mean, as well as Hotelling T2 among others.

Linearity
Linearity should be validated by correlating results from the chemometric model with those from an analytical reference method. It should cover the entire range of the method and should involve a specifically selected set of samples that is not part of the calibration set. For orientation purposes, a ‘leave-subset-out’ cross-validation based on the calibration set may be sufficient, but should not replace assessment using an independent test set. Linearity can be evaluated through the correlation coefficient, slope and intercept.

Range
The range of analyte reference values defines the range of the chemometric model, and its lower limit determines the limits of detection and quantification of the analytical method. Controls must be in place to ensure that results outside this range are recognised as such and identified. Within the range of the model, acceptance criteria for accuracy and precision have to be fulfilled.

Accuracy
The accuracy of the chemometric model can be determined by comparison of analytical results obtained from the chemometric model with those obtained using a reference method. The evaluation of accuracy should be carried out over the defined range of the chemometric model using an independent test set. It may also be helpful to assess the accuracy of the model using a ‘leave-subset-out’ cross-validation, although this should not replace assessment using an independent test set.

Precision
The precision of the analytical method should be validated by assessing the standard deviation of the measurements performed through the chemometric model. Precision covers repeatability (replicate measurements of the same sample by the same person on the same day) and intermediate precision (replicate measurements of the same sample by another person on different days). Precision should be assessed at different analyte values covering the range of the chemometric model, or at least at a target value.

Robustness
For validation of robustness, the same principles as described for qualitative methods apply. Extra care should be taken to investigate the effects of any parameters relevant for robustness on the accuracy and precision of the chemometric model. It can be an advantage to evaluate these parameters using experimental design.

The chemometric model can also be investigated using challenge samples, which may be samples with analyte concentrations outside the range of the method or samples of different identity. During the validation, it must be shown that these samples are clearly recognised as outliers.

2. CHEMOMETRIC TECHNIQUES
A non-exhaustive selection of chemometric methods is discussed below. A map of the selected methods is given in Figure 5.21.-2.

2-1. PRINCIPAL COMPONENTS ANALYSIS
2-1-1. Introduction
The complexity of large data sets or tables makes human interpretation difficult without supplementary methods to aid in the process. Principal components analysis (PCA) is a projection method used to visualise the main variation in the data. PCA can show in what respect 1 sample differs from another, which variables contribute most to this difference and whether these variables contribute in the same way and are correlated or are independent of each other. It also reveals sample set patterns or groupings within the data set. In addition, PCA can be used to estimate the amount of useful information contained in the data table, as opposed to noise or meaningless variations.

2-1-2. Principle
PCA is a linear data projection method that compresses data by decomposing it to so-called latent variables. The procedure yields columns of orthogonal vectors (scores), and rows of orthonormal vectors (loadings). The principal components (PCs), or latent variables, are a linear combination of the original variable axes. Individual latent variables can be interpreted via their connection to the original variables. In essence, the same data is shown but in a new coordinate system. The relationships between samples are revealed by their projections (scores) on the PCs. Similar samples group together in respect to PCs. The distance between samples is a measure of similarity/dissimilarity.
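The decomposition just described can be sketched in a few lines. The minimal example below (Python with NumPy) computes the scores, loadings and explained variance of a mean-centred data matrix via the singular value decomposition; the choice of the SVD is an implementation detail of this sketch, and other algorithms such as NIPALS lead to the same scores and loadings.

    import numpy as np

    def pca(X, n_components):
        # centre the data so that the mean does not influence the model
        Xc = X - X.mean(axis=0)
        # singular value decomposition: Xc = U * diag(s) * Vt
        U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
        scores = U[:, :n_components] * s[:n_components]   # projections of the samples
        loadings = Vt[:n_components]                       # orthonormal loadings
        explained = (s ** 2) / np.sum(s ** 2)              # fraction of variance per component
        residuals = Xc - scores @ loadings                 # unexplained part of the data
        return scores, loadings, explained[:n_components], residuals

    # scores, loadings, explained, E = pca(X, n_components=2)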
The original data table is transformed into a new, rearranged matrix whose structure reveals the relationships between rows and columns that may be hidden in the original matrix (Figure 5.21.-3). The new structure constitutes the explained part of the original data. The procedure models the original data down to a residual error, which is considered the unexplained part of the data and is minimised during the decomposition step.

The underlying idea is to replace a complex data table with a simpler counterpart version having fewer dimensions, but still fitting the original data closely enough to be considered a good approximation (Figure 5.21.-4). Extraction of information from a data table consists of exploring variations between samples, i.e. finding out what makes a sample different from or similar to another. Two samples can be described as similar if they have similar values for most variables. From a geometric perspective, the combination of measurements for 1 sample defines a point in a multidimensional space with as many dimensions as there are variables. In the case of close coordinates the 2 points are located in the same area or volume. With PCA, the number of dimensions can be reduced while keeping similar samples close to each other and dissimilar samples further apart in the same way as in the multidimensional space, but compressed into an alternate lower dimensional coordinate system.

The principle of PCA is to find the directions in the data space that describe the largest variation of the data set, i.e. where the data points are furthest apart. Each direction is a linear combination of the initial variables that contribute most to the actual variation between samples. By construction, principal components (PCs) are orthogonal to each other and are also ranked so that each carries more information than any of those that follow. Priority is therefore given to the interpretation of these PCs, starting with the 1st, which incorporates the greatest variation and thereby constitutes an alternative less complex system that is more suitable for interpreting the data structure. Normally, only the 1st PCs contain pertinent information, with later PCs being more likely to describe noise. In practice, a specific criterion is used to ensure that noise is not mistaken for information and this criterion should be used in conjunction with a method such as cross-validation or evaluation of loadings in order to determine the number of PCs to be used for the analysis. The relationships between samples can then be subsequently viewed in 1 or a series of score plots. Residuals Ê keep the variation that is not included in the model, as a measure of how well samples or variables fit that model. If all PCs were retained, there would be no approximation at all and the gain in simplicity would consist only of ordering the variation of the PCs themselves by size. Deciding on the number of components to retain in a PCA model is a compromise between simplicity, robustness and goodness of fit/performance.

2-1-3. Assessment of model
Total explained variance R2 is a measure of how much of the original variation in the data is described by the model. It expresses the proportion of structure found in the data by the model. Total residual and explained variances show how well the model fits the data. Models with small total residual variance (close to 0 per cent) or large total explained variance (close to 100 per cent) can explain most of the variation in the data. With simple models consisting of only a few components, residual variance falls to 0 ; otherwise, it usually means that the data contains a large amount of noise. Alternatively, it can also mean that the data structure is too complex to be explained using only a few components. Variables with small residual variance and large explained variance for a particular component are well defined by the model. Variables with large residual variance for all or the 1st components have a small or moderate relationship with other variables. If some variables have much larger residual variance than others for all or the 1st components, they may be excluded in a new calculation and this may produce a model that is more suitable for its purpose. Independent test set variance is determined by testing the model using data that was not used in the actual building of the model itself.

2-1-4. Critical aspects
PCA catches the main variation within a data set. Thus comparatively smaller variations may not be distinguished.

2-1-5. Potential use
PCA is an unsupervised method, making it a useful tool for exploratory data analysis. It can be used for visualisation, data compression, checking groups and trends in the data, detecting outliers, etc.

For exploratory data analysis, PCA modelling can be applied to the entire data table once. However, for a more detailed overview of where a new variation occurs, evolving factor analysis (EFA) can be used and, in this case, PCA is applied in an expanding or fixed window, where it is possible to identify, for example, the manifestation of a new component from a series of consecutive samples.

PCA also forms the basis for classification techniques such as SIMCA and regression methods such as PCR. The property of PCA to capture the largest variations in the 1st principal components allows subsequent regression to be based on fewer latent variables. Examples of utilising components as independent data in regression are PCR, MCR, and ANN.

PCA is used in multivariate statistical process control (MSPC) to combine all available data into a single trace and to apply a signature for each unit operation or even an entire manufacturing process based on, for example, Hotelling T2 statistics, PCA model residuals or individual scores. In addition to univariate control charts, 1 significant advantage with PCA is that it can be used to detect multivariate outliers, i.e. process conditions or process output that has a different correlation structure than the one present in the previously modelled data.
Figure 5.21.-3. – Geometrical representation of 3 different X-data sets. On the left, objects are plotted in the multivariate space,
and the following examples reveal a hidden structure, i.e. a plane and a line respectively
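As mentioned under 2-1-5, Hotelling T2 statistics and model residuals derived from a PCA model can be used to flag multivariate outliers, for example in MSPC. A minimal sketch is given below (Python with NumPy), re-using the scores and loadings of the pca function sketched under 2-1-2; the choice of control limits is left to the user and is not addressed here.

    import numpy as np

    def hotelling_t2(scores):
        # T2 of each sample: sum of squared scores scaled by the score variances
        variances = scores.var(axis=0, ddof=1)
        return np.sum(scores ** 2 / variances, axis=1)

    def q_residuals(X, loadings):
        # squared residual (Q statistic) of each sample after projection on the model
        Xc = X - X.mean(axis=0)
        E = Xc - (Xc @ loadings.T) @ loadings
        return np.sum(E ** 2, axis=1)

    # t2 = hotelling_t2(scores)
    # q = q_residuals(X, loadings)
    # samples with T2 or Q far above the values seen in the modelled data have a
    # different correlation structure and are flagged as multivariate outliers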
method), the expectation-maximisation algorithm (for ‘model-based’ methods) and DBSCAN for ‘density-based’ algorithms and, also, the ‘grid-based’ methods which are exemplified by the statistical information grid (STING) algorithm.

Minimum spanning tree clustering, such as Kruskal’s algorithm, is similar to the graph theory algorithm as all the data points are first of all connected by drawing a line between the closest points. When all data points are linked, the lines of largest length are broken, leaving clusters of closely connected points. For nearest neighbour clustering, an iterative procedure is used to assign a data point to a cluster when the distance between this point and its immediate neighbour (that belongs to a cluster) is below a pre-defined threshold value. The K-means algorithm is one of the most popular and as with partition algorithms, the number of clusters must be chosen a priori, together with the initial position of the cluster centres. A squared error criterion measures the sum of the squared distance between each object and the centroid of its corresponding cluster. The K-means algorithm starts with a random initial partition and progresses by reassigning objects to clusters until the desired criteria reach a minimum. Some variants of the K-means algorithm allow the splitting or merging of clusters in order to find the optimum number of clusters, even when starting from an arbitrary initial clustering. Model-based clustering attempts to find the best fit for the data using a preconceived model. An example of this is the EM or expectation-maximisation algorithm, which assigns each object to a particular cluster according to the probability of membership for that object. In the EM algorithm, the probability function is a multivariate Gaussian distribution that is iteratively adjusted to the data by use of maximum-likelihood estimation. The EM algorithm is considered as an extension of the K-means algorithm since the residual sum of squares used for K-means convergence is similar to the maximum-likelihood criterion.

Density-based (DB) clustering, such as the DBSCAN algorithm, assimilates clusters to regions of high density separated by regions of low or no density. The neighbourhood of each object is examined to determine the number of other objects that fit within a specified radius and a cluster is defined when a sufficient number of objects inhabit this neighbourhood.

Grid-based algorithms, such as STING, divide the data space into a finite number of cells. The distribution of objects within each cell is then computed in terms of mean, variance, minimum, maximum and type of distribution. There are several levels of cells, providing different levels of resolution and each cell of a particular level corresponds to the union of 4 child cells from the lower level.

2-4-3. Critical aspects
Algorithms are sensitive to the starting conditions used to initialise the clustering of data. For example, K-means needs a pre-set number of clusters and the resultant partitioning will vary according to the chosen number of clusters. The metrics used in distance calculation will also influence data clustering. For Euclidean distances, the K-means algorithm will define spherical clusters whereas they could be ellipsoidal when using Mahalanobis distances. The cluster shape can be modified by data pre-treatments prior to cluster analysis. DB algorithms can deal with arbitrarily shaped clusters, but their weakness is their limitation in handling high-dimensional data, where objects are sparsely distributed among dimensions.

When an object is considered to belong to a cluster with a certain probability, algorithms such as density based clustering, allow a soft or fuzzy clustering. In this case, the border region of 2 adjacent clusters can house some objects belonging to both clusters.

2-4-4. Potential use
Clustering is an exploratory method of analysis that helps in the understanding of data structure by grouping objects that share the same characteristics and in addition, hierarchical clustering allows for classification within data objects. Clustering is used in a vast variety of fields, in particular for information retrieval from large databases. For the latter, the term ‘data mining’ is frequently used, where the objective is to extract hidden and unexploited information from a large volume of raw data in search of associations, trends and relationships between variables.

2-5. MULTIVARIATE CURVE RESOLUTION
2-5-1. Introduction
Multivariate curve resolution (MCR) is related to principal components analysis (PCA) but, where PCA looks for directions that represent maximum variance and are mutually orthogonal, MCR strives to find contribution profiles (i.e. MCR scores) and pure component profiles (i.e. MCR loadings). MCR is also known as self-modelling curve resolution (SMCR) or end-member extraction. When optimising MCR parameters the alternating least squares (ALS) algorithm is commonly used.

2-5-2. Principle
MCR-ALS estimates the contribution profiles C and the pure component profiles S from the data matrix X, i.e. X = C∙ST + E just as in classical least squares (CLS). The difference between CLS and ALS is that ALS is an iterative procedure that can incorporate information that is known about the physicochemical system studied and use this information to constrain the components/factors. For example, neither contribution nor absorbance can be negative by definition. This fact can be used to extract pure component profiles and contributions from a well-behaved data set. There are also other types of constraints that may be used, such as equality, unimodality, closure and mass balance.

It is often possible to obtain an accurate estimation of the pure component spectra or the contribution profiles and these estimates can then be used as initial values in the constrained ALS optimisation. New estimates of the profile matrix S and of the contribution profile C are obtained during each iteration. In addition, the physical and chemical knowledge of the system can be used to verify the result, and the resolved pure component contribution profiles should be explainable using existing knowledge. If the MCR results do not match the known system information, then other constraints may be needed.

2-5-3. Critical aspects
Selection of the correct number of components for the ALS calculations is important for a robust solution and a good estimate can be obtained using for example, evolving factor analysis (EFA) or fixed-size moving window EFA. Furthermore, the constraints can be set as either ‘hard’ or ‘soft’, where hard constraints are strictly enforced while soft constraints leave room for deviations from the restricted value. Generally, due to inherent ambiguities in the solution obtained, the MCR scores will need to be translated into, for example, the concentration of the active pharmaceutical ingredient, using a simple linear regression step. This means that the actual content must be known for at least 1 sample. When variations of 2 or more chemical entities are in some way correlated, rank deficiency occurs, for example 1 entity is formed while the other is consumed, or 2 entities are consumed at the same rate to yield a third. As a result, the variation of the individual substance is essentially masked and in such cases, simultaneous analysis of data from independent experiments using varied conditions or combined measurements from 2 measurement techniques generally results in better strategies than analysing the experiments separately one by one.
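A minimal sketch of the alternating least squares step described in section 2-5-2, with non-negativity as the only constraint, is given below (Python with NumPy). C_init stands for an initial estimate of the contribution profiles, for example obtained from EFA as mentioned in section 2-5-3; a convergence test and additional constraints such as closure are left out of this sketch for brevity.

    import numpy as np

    def mcr_als(X, C_init, n_iter=50):
        # alternating least squares for X = C . ST + E, with non-negative
        # contribution profiles C and pure component profiles S
        C = C_init.copy()
        for _ in range(n_iter):
            # least squares estimate of the profiles for the current contributions
            St = np.linalg.lstsq(C, X, rcond=None)[0]
            St = np.clip(St, 0.0, None)      # non-negativity constraint on S
            # least squares estimate of the contributions for the current profiles
            C = np.linalg.lstsq(St.T, X.T, rcond=None)[0].T
            C = np.clip(C, 0.0, None)        # non-negativity constraint on C
        E = X - C @ St                       # residual matrix
        return C, St, E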
2-5-4. Potential use
MCR can be applied when the analytical method produces multivariate data for which the response is either linear or linearisable. This has the advantage that only 1 standard is needed per analyte, which is particularly beneficial when the measurements are at least partly selective between analytes. When linearity and selectivity is an issue, more standards per analyte may be required for calibration. When there is no pure analytical response for an analyte, it is also possible to estimate starting vectors by applying PCA to analyte mixtures together with varimax rotation of the PCA coordinate system. ALS implementations of MCR may also allow analyte profiles that are freely varied by the algorithm, which can then be used to model a profile that is difficult to estimate separately, for example a baseline.

2-6. MULTIPLE LINEAR REGRESSION
2-6-1. Introduction
Multiple Linear Regression (MLR) is a classical multivariate method that uses a combined set of x-vectors (X-data matrix) in linear combinations that are fitted as closely as possible to the corresponding single y-vector. MLR extends linear regression to more than 1 selected variable in order to perform a calibration using least squares fit.

2-6-2. Principle
In MLR, a direct least squares regression is performed between the X- and the Y-data. For the sake of simplicity, the regression of only 1 column vector y will be addressed here, but the method can be readily extended to a Y-matrix, as is common when MLR is applied to data from experimental design (DoE), with multiple responses. In this case, single independent MLR models for each y-variable can be applied to the same X-matrix.

The following MLR model equation is an extension of the normal univariate straight line equation ; it may also contain cross and square terms :

y = b0 + b1x1 + b2x2 + … + bmxm + f

This can be compressed into the convenient matrix form :

y = Xb + f

The objective is to find the vector of regression coefficients b that best minimises the error term f. This is where the least squares criterion is applied to the squared error terms, i.e. to find b-values so that y-residuals f are minimised. MLR estimates the model coefficients using the following equation :

b = (XTX)-1XTy

This operation involves the matrix inversion of the variance-covariance matrix (XTX)-1. If any of the X-variables show any collinearity with each other, i.e. if the variables are not linearly independent, then the MLR solution will not be robust or a solution may not even be possible.

2-6-3. Critical aspects
MLR requires independent variables in order to adequately explain the data set, but as pharmaceutical samples are comprised of a complex matrix in which components interact to various degrees, the selection of appropriate variables is not straightforward. For example, in ultra-violet spectroscopy, observed absorbance values are linked because they may describe related behaviours in the spectroscopic data set. When observing the spectra of mixtures, collinearity is commonly found among the wavelengths, and consequently, MLR will struggle to perform a usable linear calibration.

The ability to vary the x-variables independently of each other is a crucial requirement when using variables as predictors with this method. This is why in DoE the initial design matrix is generated in such a way as to establish this independence (i.e. orthogonality) from the start. MLR has the following constraints and characteristics :
– the number of X-variables must be smaller than the number of samples (n>m), otherwise the matrix cannot be inverted ;
– in case of collinearity among X-variables, the b-coefficients are not reliable and the model may be unstable ;
– MLR tends to over-fit.

To avoid overfitting MLR is often used with variable selection. The selection of the optimal number of X-variables can be based on their residual variance, but also on the prediction error.

2-6-4. Potential use
MLR is typically suited to simple matrices/data sets, where there is a high degree of specificity and full rank. As matrices become more complex, more suitable methods such as PLS may be required to provide more accurate and/or robust calibration. In these cases, MLR may be used as a screening technique prior to the application of more advanced calibration methodologies.

2-7. PRINCIPAL COMPONENTS REGRESSION
2-7-1. Introduction
Principal components regression (PCR) is an expansion of principal components analysis (PCA) for use in quantitative applications. It is a two-step procedure whereby the calibration matrix X is first of all transformed by PCA into the scores and loadings matrices T̂ and P̂T respectively. In the following step, the score matrix for the principal components is used as the input for an MLR model to establish the relationship between the X- and the Y-data.

2-7-2. Principle
As in PCA, the calibration matrix is decomposed into scores and loadings matrices in such a way as to minimise the residual matrix that ideally consists only of random errors, i.e. noise. For quantitative calibration, an additional matrix Y with the reference analytical data of the calibration samples is necessary. As the concentration information is contained in the orthogonal score vectors of the T̂-matrix it can be optimally correlated by multiple linear regression using the actual concentrations in the Y-matrix via the Q̂-matrix (Figure 5.21.-8), while minimising the entries in the residual matrix F̂.

2-7-3. Critical aspects
A crucial point in the development of a model is the selection of the optimal number of principal components. In this respect, the plot of the number of principal components versus the residual Y-variance is an extremely useful diagnostic tool when defining the optimal number of PCs, i.e. when the minimum of the residual Y-variance observed during model assessment has been reached. In most cases, additional PCs beyond this point do not improve the prediction performance but the calibration model falls into overfitting.

Despite its value as an important tool when dealing with collinear X-data, the weakness of PCR lies in its independent decomposition of the X and Y matrices. This approach may take into account variations of the X-data that are not necessarily relevant for an optimal regression with the Y-data. Also, Y-correlated information may even get lost in higher order principal components that are neglected in the above-mentioned selection process of the optimal number of PCs. A stepwise principal component selection (e.g. selection of PC2 instead of PC1) may be useful to improve the performance of the calibration model.

2-7-4. Potential use
PCR is a multivariate technique with many diagnostic tools for the optimisation of the quantitative calibration models and the detection of erroneous measurements. In spectroscopy for example, PCR provides stable solutions when dealing with the calibration data of either complete spectra or large spectral regions. However, it generally requires more principal components than PLS and in view of the limitations and disadvantages discussed above, PLS regression has become the preferred alternative for quantitative modelling of spectroscopic data.
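The least squares solution b = (XTX)-1XTy given in section 2-6-2 above can be written directly as a short sketch (Python with NumPy); in practice a numerically safer least squares routine is preferred over an explicit matrix inversion, as also shown below. The function names are arbitrary.

    import numpy as np

    def mlr_fit(X, y):
        # direct solution of the normal equations b = (XT X)^-1 XT y;
        # fails or becomes unstable when the X-variables are collinear
        return np.linalg.inv(X.T @ X) @ X.T @ y

    def mlr_fit_lstsq(X, y):
        # equivalent, numerically more robust least squares solution
        return np.linalg.lstsq(X, y, rcond=None)[0]

    # b = mlr_fit_lstsq(X, y)
    # y_pred = X @ b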
X = original data matrix of n rows and m columns
T̂ = scores matrix with n rows and p columns
P̂T = loadings matrix with p rows and m columns
Ê = residual matrix (same size as X-matrix)
Y = property matrix of n rows and j columns
Q̂T = correlation matrix of p rows and j columns
F̂ = residual matrix (same size as Y-matrix)
B̂ = matrix of regression coefficients
m = number of data points (variables)
n = number of measurements (samples)
j = number of property values per sample
p = number of principal components (factors)
x = data of unknown sample
ŷ = predicted property values of unknown sample
t̂ = score values for unknown sample
Figure 5.21.-8. – Decomposition of the matrices for principal components regression (PCR)
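Following the decomposition of Figure 5.21.-8, a principal components regression can be sketched as PCA followed by MLR on the scores (Python with NumPy). This is a minimal illustration only; the number of components p would in practice be chosen from the residual Y-variance as described in section 2-7-3.

    import numpy as np

    def pcr_fit(X, y, p):
        # step 1: PCA of the mean-centred calibration matrix
        x_mean = X.mean(axis=0)
        U, s, Vt = np.linalg.svd(X - x_mean, full_matrices=False)
        scores = U[:, :p] * s[:p]              # score matrix
        loadings = Vt[:p]                      # loadings (transposed)
        # step 2: MLR of the property values on the orthogonal scores
        q = np.linalg.lstsq(scores, y - y.mean(), rcond=None)[0]
        return x_mean, y.mean(), loadings, q

    def pcr_predict(model, X_new):
        x_mean, y_mean, loadings, q = model
        t_new = (X_new - x_mean) @ loadings.T  # score values of the unknown samples
        return y_mean + t_new @ q              # predicted property values

    # model = pcr_fit(X, y, p=3)
    # y_pred = pcr_predict(model, X_new)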
2-8. PARTIAL LEAST SQUARES REGRESSION
2-8-1. Introduction
Partial least squares regression (PLSR, generally known as PLS and alternatively named projection on latent structures) has developed into the most popular algorithm for multivariate regression.

PLS relates 2 data sets (X and Y) irrespective of collinearity. PLS finds latent variables from X and Y data blocks simultaneously, while maximising the covariance structure between these blocks. In a simple approximation PLS can be viewed as 2 simultaneous PCA analyses applied to the X and Y-data in such a way that the structure of the Y-data is used for the search of the principal components in the X-data. The amount of variance modelled, i.e. the explained part of the data, is maximised for each component. The non-explained part of the data set is made up of residuals, which function as a measure of the modelling quality.

2-8-2. Principle
The major difference between PCR and PLS regression is that the latter is based on the simultaneous decomposition of the X and Y-matrices for the derivation of the components (preferably denoted as PLS factors, factors, or latent-variables). Consequently, for the important factors, the information that describes a maximum variation in X, while correlating as much as possible with Y, is collected. This is precisely the information that is most relevant for the prediction of the Y-values of unknown samples. In practice PLS can be applied to either 1 Y-variable only (PLS1), or to the simultaneous calibration of several Y-variables (PLS2 model).

As the detailed PLS algorithms are beyond the scope of this chapter, a simplified overview is instead given (Figure 5.21.-9). Arrows have been included between the Û and T̂ scores matrices in order to symbolise the interaction of their elements in the process of this iteration. While the Y-matrix is decomposed into the loadings and scores matrices Q̂ and Û respectively, the decomposition of the X-matrix produces not only the loadings and scores matrices P̂ and T̂, but also a loading weights matrix Ŵ, which represents the relationship between the X and Y-data.

To connect the Y-matrix with the X-matrix decomposition for the first estimation of the score values, the Y-data are used as a guide for the decomposition of the X-matrix. By interchanging the score values of the Û and T̂ matrices, an interdependent modelling of the X and Y data is achieved, thereby reducing the influence of large X-variations that do not correlate with Y. Furthermore, simpler calibration models with fewer PLS-factors can also be developed where, as is the case for PCR, residual variances are used during validation to determine the optimal number of factors that model useful information and consequently, avoid overfitting.
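A minimal sketch of a PLS1 calibration (single y-variable) using a classical NIPALS-type sequence of loading weights, scores and loadings is given below (Python with NumPy). It is one of several equivalent formulations and is included only to make the decomposition tangible; it is not the detailed algorithm referred to above.

    import numpy as np

    def pls1_fit(X, y, n_factors):
        # PLS1: deflate X and y factor by factor, storing loading weights W,
        # X-loadings P and y-loadings q
        Xc, yc = X - X.mean(axis=0), y - y.mean()
        W, P, q = [], [], []
        for _ in range(n_factors):
            w = Xc.T @ yc
            w = w / np.linalg.norm(w)          # loading weights
            t = Xc @ w                         # X-scores
            p = Xc.T @ t / (t @ t)             # X-loadings
            c = (yc @ t) / (t @ t)             # y-loading
            Xc = Xc - np.outer(t, p)           # deflation of X
            yc = yc - c * t                    # deflation of y
            W.append(w); P.append(p); q.append(c)
        W, P, q = np.array(W).T, np.array(P).T, np.array(q)
        B = W @ np.linalg.inv(P.T @ W) @ q     # regression vector in the original X-space
        return X.mean(axis=0), y.mean(), B

    def pls1_predict(model, X_new):
        x_mean, y_mean, B = model
        return y_mean + (X_new - x_mean) @ B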
Figure 5.21.-11. – The object space, where separation of the 2 classes is not possible, is mapped into a feature space where
separation is possible
Figure 5.21.-12. – In the feature space, the separation of classes 1 and 2 was achieved with toleration of certain misclassified
samples
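The non-linear mapping and the tolerated misclassifications illustrated in Figures 5.21.-11 and 5.21.-12 correspond, in practice, to the kernel and the cost parameter of a support vector machine implementation. A minimal sketch for a binary classification, using the scikit-learn library as an assumption of this example rather than a requirement of the chapter, could be:

    import numpy as np
    from sklearn.svm import SVC

    # X: (n samples x m variables), classes: array of labels 1 or 2;
    # the radial basis function kernel performs the implicit mapping into the
    # feature space, C controls how many misclassified samples are tolerated
    classifier = SVC(kernel="rbf", C=1.0, gamma="scale")
    # classifier.fit(X, classes)
    # predicted = classifier.predict(X_new)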
2-9-4. Potential use
SVMs are mainly used for binary supervised classification. They can be generalised to multiclass classification or extended to regression problems, though these applications are not considered within the scope of this chapter. Objects that are difficult to classify rather than those that are clearly distinct drive the optimisation process in SVMs. SVMs can be used for the separation of classes of objects, but not for the identification of these objects. They operate well on large data sets that are obtained, for example, by NIR spectroscopy, magnetic resonance, chemical imaging or process data mining, where PCA and related methods fail. Their strength mainly lies in separation of samples featuring highly correlated signals, i.e. polymorphs, excipients, tracing of adulterated substances, counterfeits etc.

2-10. ARTIFICIAL NEURAL NETWORKS
2-10-1. Introduction
Artificial neural networks (ANNs) are general computational tools, whose initial development was inspired by the need for further understanding of biological neural networks and which have since been widely used in various areas that require data processing with computers or machines. The methods for building ANN models and their subsequent applications can be dramatically different depending on the architecture of the neural networks themselves. In the field of chemometrics, ANNs are generally used for multivariate calibration and unsupervised classification, which is achieved by using multi-layer feed-forward (MLFF) neural networks, or self-organising maps (SOM) respectively. As a multivariate calibration tool, ANNs are more generally associated with the mapping of non-linear relationships.

2-10-2. Principle
2-10-2-1. General
The basic data processing element in an artificial neural network is the artificial neuron, which can be understood as a mathematical function that uses the sum of a weighted vector and a bias as the input. The vector is the ‘input’ of the neuron and is obtained either directly from a sample in the data set or calculated from previous neurons. The user chooses the form of the function (called transfer function). The weights and bias are the coefficients of the ANN model and are determined through a learning process using known examples. An ANN often contains many neurons arranged in layers, where the neurons in each layer are arranged in parallel. They are connected to neurons in the preceding layer from which they receive inputs and also to neurons in the following layer where the outputs are sent (Figure 5.21.-13). The output of 1 neuron is therefore used as the input for neurons in the following layer. The input layer is a special layer that receives data directly from the user and sends this information directly to the next layer without applying a transfer function. The output layer is similar in that its output is also directly used as the model output without any additional processing. The unlimited possibilities when connecting different numbers and layers of neurons are often referred to as the ANN architecture, and provide the potential for ANNs to meet any complicated data modelling requirements.

Figure 5.21.-13. – Typical arrangements of neuron layers and their inter-connections
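The weighted sum, bias and transfer function described above can be sketched for a small feed-forward network with one hidden layer (Python with NumPy). In this illustrative sketch the weights are assumed to be given; in practice they would be determined by a training algorithm such as back-propagation (see section 2-10-2-2), not set by hand.

    import numpy as np

    def neuron_layer(inputs, weights, bias, transfer=np.tanh):
        # each neuron computes the transfer function of a weighted sum plus a bias
        return transfer(inputs @ weights + bias)

    def mlff_forward(x, w_hidden, b_hidden, w_out, b_out):
        # input layer passes the data on unchanged; one hidden layer with a
        # hyperbolic tangent transfer function; linear output layer
        hidden = neuron_layer(x, w_hidden, b_hidden, transfer=np.tanh)
        return neuron_layer(hidden, w_out, b_out, transfer=lambda z: z)

    # x: vector of m input variables
    # w_hidden: (m x h) weights, b_hidden: (h,) biases, w_out: (h x 1), b_out: (1,)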
2-10-2-2. Multi-layer feed-forward artificial neural network also make the interpretation of the coefficients more difficult.
A multi-layer feed-forward network (MLFF ANN) contains However, when linear modelling methods are not flexible
an input layer, an output layer and 1 or more layers of neurons enough to provide the required prediction or classification
in-between called hidden layers. Even though there is no limit accuracy, ANNs may be a good alternative.
on how many hidden layers may be included, an MLFF ANN
with only 1 hidden layer is sufficiently capable of handling 3. GLOSSARY
most multivariate calibration tasks in chemometrics. In an β-distribution : continuous probability distribution with
MLFF ANN, each neuron is fully connected to all the neurons density function
in the neighbouring layers. A hyperbolic tangent sigmoid
transfer function is usually used in MLFF ANN, but other
transfer functions, including linear functions, can also be used.
The initial weights and biases can be set as small random where u > 0, v > 0 are shape parameters (degrees of freedom)
numbers, but can also be initialised using other algorithms. and B is the beta function,
The most popular training algorithm for determining the final
weights and biases is the back-propagation (BP) algorithm or
its related variants. In the BP algorithm, the prediction error,
calculated as the difference between the ANN output and the
actual value, is propagated backward to calculate the changes
needed to adjust the weights and biases in order to minimise The γ-quantile of the Β(u,v)-distribution is denoted by βu,v ;γ
the prediction error. and it is the value q such that the value of the distribution
function F is γ :
An MLFF ANN must be optimised in order to achieve
acceptable performance. This often involves a number of
considerations including the number of layers, the number
of neurons in each layer, transfer functions for each layer or
neuron, initialisation of weights, learning rate, etc.
2-10-2-3. Self-organising map
The aim of the self-organising map (SOM) is to create a map where observations that are close to each other have more similar properties than more distant observations. The neurons in the output layer are usually arranged in a two-dimensional map, where each neuron can be represented as a square or a hexagon. SOMs are trained using competitive learning that is different from the above described method using BP. The final trained SOM is represented as a two-dimensional map of properties.
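As an illustration, the following minimal Python/NumPy sketch implements the competitive learning rule of a small SOM on hypothetical data; the map size, number of iterations, learning rate and neighbourhood width are arbitrary choices for the example.

    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical data: 100 samples described by 6 variables.
    X = rng.normal(size=(100, 6))

    rows, cols = 8, 8                               # two-dimensional map of neurons
    W = rng.normal(size=(rows, cols, X.shape[1]))   # one weight vector per neuron

    # Grid coordinates of each neuron, used to measure distance on the map.
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                indexing="ij"), axis=-1)

    n_iter, lr0, sigma0 = 2000, 0.5, 3.0
    for t in range(n_iter):
        x = X[rng.integers(len(X))]                 # pick one observation
        # Competitive step: the best-matching unit (BMU) is the neuron whose
        # weight vector is closest to the observation.
        d = np.linalg.norm(W - x, axis=-1)
        bmu = np.unravel_index(np.argmin(d), d.shape)
        # Learning rate and neighbourhood radius decay during training.
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)
        sigma = sigma0 * (1.0 - frac) + 0.5
        # Neurons close to the BMU on the map are pulled towards the observation.
        map_dist2 = np.sum((grid - np.array(bmu)) ** 2, axis=-1)
        h = np.exp(-map_dist2 / (2.0 * sigma ** 2))
        W += lr * h[..., None] * (x - W)

    # After training, each sample can be placed on the map via its BMU, so that
    # similar samples end up in neighbouring cells of the two-dimensional map.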
2-10-3. Critical aspects
The 2 most common pitfalls of using ANNs are over-training and under-training. Over-training means that an ANN model can predict the training set very well but ultimately fails to make good predictions for new samples. Under-training means that the ANN training ended too soon and therefore the resultant ANN model underperforms when making predictions. Both of these pitfalls should be avoided when using ANNs for calibration. A representative data set with a proper size, i.e. more observations or samples than variables, is required before a good ANN model can be trained. Generally, since the models are non-linear, more observations are needed than for a comparable data set subjected to linear modelling. As for other multivariate calibration methods, the input may need pre-processing to balance the relative influence of variables. One advantage of pre-processing is the reduction in the number of degrees of freedom of input to the ANN, for example by compression of the X-data to scores by PCA and then using the resulting scores for the observations as input.
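As an illustration, the following Python/NumPy sketch shows this pre-processing step: the X-data are mean centred and compressed to a few PCA scores, which would then serve as the ANN input; the data and the number of components retained are hypothetical.

    import numpy as np

    rng = np.random.default_rng(2)

    # Hypothetical spectra-like data: 40 samples, 200 correlated X-variables.
    X = rng.normal(size=(40, 200))

    # Mean-centre the X-data (see 'Centring' in the Glossary).
    Xc = X - X.mean(axis=0)

    # PCA by singular value decomposition: the scores T = U * S summarise
    # the X-data in a small number of components.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    n_components = 5
    T = U[:, :n_components] * S[:n_components]

    # T (40 x 5) rather than X (40 x 200) would then be used as the ANN input,
    # greatly reducing the number of weights to be estimated.
    print(T.shape)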
2-10-4. Potential use
The advantage of MLFF ANN in multivariate calibration lies in its ability to model non-linear relationships. Since the neurons are fully connected, all the interactions between variables are automatically considered. It has been proven that an MLFF ANN with sufficient hidden neurons can map any complicated relationship between the inputs and outputs.
SOMs can be used to visualise high-dimensional data while preserving the topology in the original data. They are based on unsupervised learning, and are mainly useful as tools to explore features in data sets where no prior knowledge of the patterns and relationships of the samples exists.
The ANNs often have a large number of coefficients (weights and biases) that give the ANN the potential to model any complicated relationships in the data set but, as a result, can also make the interpretation of the coefficients more difficult. However, when linear modelling methods are not flexible enough to provide the required prediction or classification accuracy, ANNs may be a good alternative.
3. GLOSSARY
β-distribution : continuous probability distribution with density function
f(x) = x^(u-1) (1 - x)^(v-1) / B(u,v), 0 ≤ x ≤ 1,
where u > 0, v > 0 are shape parameters (degrees of freedom) and B is the beta function,
B(u,v) = Γ(u)Γ(v) / Γ(u+v).
The γ-quantile of the Β(u,v)-distribution is denoted by βu,v;γ and it is the value q such that the value of the distribution function F is γ :
F(q) = γ.
Bootstrapping : a number of sample sets of size n that is produced from an original sample set of the same size n by means of a random selection of samples with replacement.
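As an illustration, the following Python/NumPy sketch draws bootstrap sample sets of size n with replacement from a hypothetical original sample set and uses them to estimate the variability of a statistic; the data and number of bootstrap sets are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(3)

    n = 30
    X = rng.normal(size=(n, 4))            # hypothetical original sample set

    # Each bootstrap set has the same size n and is drawn with replacement,
    # so some samples appear several times and others not at all.
    n_boot = 200
    boot_indices = rng.integers(0, n, size=(n_boot, n))
    boot_sets = X[boot_indices]            # shape (200, 30, 4)

    # Example use: spread of the mean of the first variable over the bootstrap
    # sets gives an estimate of the standard error of that mean.
    boot_means = boot_sets[:, :, 0].mean(axis=1)
    print(boot_means.std())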
Centring : a data set is mean centred by calculating the mean value of each variable and subtracting the variable mean values from each column of variables, in order to make the comparison and interpretation of the data easier.
Collinear/non-collinear : a family of vectors is collinear if at least 1 of the vectors can be represented as a linear combination of the other vectors. Hence a family of vectors is non-collinear if none of the vectors can be represented as a linear combination of the others.
Component (or factor, latent variable) : in chemometrics : underlying, non-observed, non-measured, hypothetical variable that contributes to the variance of a collection of measured variables. The variables are linear combinations of the factors and these factors are assumed to be uncorrelated with each other.
Data mining : process of exploration, extraction and modelling of large collections of data in order to discover a priori unknown relationships or patterns.
Dependent variable : also a response, regressand : a variable that is related by a formal (explicit) or empirical mathematical relationship to 1 or more other variables (typically the Y-data).
Empirical model : a data-driven model established without assuming an explicit mathematical relationship, or without a description of the behaviour of a system based on accepted laws of physics.
Exploratory data analysis : the process for uncovering unexpected or latent patterns in order to build future hypotheses.
Factor : see component.
Hotelling T2 statistics : multivariate version of the t-statistic. In general, this statistic can be used to test if the mean vector of a multivariate data set has a certain value or to compare the means of the variables. The T2 statistic is also used for detection of outliers in multivariate data sets. A multivariate statistical test using the Hotelling T2 statistic can be done. A confidence ellipse can be included in score plots to reveal points outside the ellipse as potential outliers.
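As an illustration, the following Python/NumPy sketch computes a T2 value for each sample of a hypothetical data set, here simply as the squared Mahalanobis distance to the mean (the scaling constants used in formal significance tests are omitted); large values flag potential outliers.

    import numpy as np

    rng = np.random.default_rng(5)

    # Hypothetical multivariate data: 50 samples, 4 variables.
    X = rng.normal(size=(50, 4))
    X[0] += 6.0                                   # one artificial outlier

    mean = X.mean(axis=0)
    S = np.cov(X, rowvar=False)                   # sample covariance matrix
    S_inv = np.linalg.inv(S)

    # T2 for each sample: squared Mahalanobis distance to the mean vector.
    diff = X - mean
    t2 = np.einsum("ij,jk,ik->i", diff, S_inv, diff)

    # Samples with unusually large T2 values are potential outliers.
    print(np.argsort(t2)[-3:])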
Independent variable : input variable on which other variables are dependent through a mathematical relationship (typically the X-data).
Indirect prediction : process for estimating the value of a response on the basis of a multivariate model and observed data.
Interference : effect of substances, physical phenomena or instrument artefacts, separate from the target analyte, that can be measured by the chosen analytical method. Then there is a risk of confusion between the analyte and interference if the interference is not varied independently or at least randomly in relation to the analyte.
Latent variable : see component.
Leave-one-out : in a ‘leave-one-out’ procedure only 1 sample at a time is removed from the data set in order to create a new data set.
Leave-subset-out : in a ‘leave-subset-out’ procedure a subset of samples is removed from the data set in order to create a new data set.
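As an illustration, the following Python/NumPy sketch applies a ‘leave-one-out’ procedure to a hypothetical data set, refitting a simple least-squares model each time a sample is removed and accumulating the predicted residuals; a ‘leave-subset-out’ procedure would remove a block of samples per iteration instead.

    import numpy as np

    rng = np.random.default_rng(4)

    n = 25
    X = rng.normal(size=(n, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

    # Leave-one-out: each sample in turn is removed, the model is refitted on
    # the remaining n - 1 samples and used to predict the removed sample.
    press = 0.0
    for i in range(n):
        keep = np.delete(np.arange(n), i)
        # Ordinary least-squares fit on the reduced data set (with intercept).
        A = np.column_stack([np.ones(len(keep)), X[keep]])
        b, *_ = np.linalg.lstsq(A, y[keep], rcond=None)
        y_pred = np.concatenate([[1.0], X[i]]) @ b
        press += (y[i] - y_pred) ** 2

    rmsecv = np.sqrt(press / n)
    print(rmsecv)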
Leverage : a measure of how extreme a data point or a variable is compared to the majority. Points or variables with a high leverage are likely to have a large influence on the model.
Loadings : loadings are estimated when information carried by several variables is focussed onto a few components. Each variable has a loading alongside each model component. The loadings show how well a variable is taken into account by the model components.
Orthogonal : 2 vectors are orthogonal if their scalar product is 0.
Orthonormal vectors : orthogonal and normalised (unit-length) vectors.
Outlier : for a numerical data set, it relates to a value statistically different from the rest. Also refers to the sample associated with that value. Specific statistical testing for outliers may be used.
Overfitting : for a model, overfitting is a tendency to describe too much of the variation in the data, so that in addition to the consistent underlying structure, some noise or non-informative variation is also taken into account and unreliable predictions will be obtained.
Property : see variable.
Resampling : the process of impartial rearrangement and sub-sampling of the original data set. This occurs during optimisation/validation procedures that repeatedly calculate a property and the error associated with it. Typical examples are cross-validation and bootstrapping, which create successive evaluation data sets by repeated sub-sampling.
Reselection : reuse of samples (see resampling).
Residuals : a measure of the variation that is not taken into account by the model or a deviation between predicted and reference values.
Root mean square error of prediction : a function of the predictive residual sum of squares to estimate the accuracy :
RMSEP = sqrt( Σi=1..n (ŷi - yi)² / n )
where ŷi is the predicted response for the ith sample of the test data set and yi the observed response of the ith sample, and n is the number of samples.
Sample : object, observation, or individual from which data values are collected.
Sample attribute : qualitative or quantitative property of the sample.
Sample selection : the process of drawing a subset or a collection from a population in order to estimate the properties of the population.
Scores or factor score coefficients : coordinates of the samples in the new coordinate system defined by the principal components. Scores represent how samples are related to each other, given the measurement variables.
Score (normalised) : jth score value ti,j of the ith sample divided by the norm of the scores matrix :
where p is the number of parameters in the model.
Standard error of calibration : a function of the predictive residual sum of squares to estimate the accuracy considering the number of parameters :
SEC = sqrt( Σi=1..n (ŷi - yi)² / (n - p) )
where n is the number of samples of the learning set, p the number of parameters in the model to be estimated by using the sample data, ŷi the ith fitted value in the calibration, and yi the ith reference value. In multiple regression with m variables p = m + 1 (1 coefficient for each of the m variables and 1 intercept).
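As an illustration, the following Python/NumPy sketch computes RMSEP on a hypothetical test set and SEC on a hypothetical calibration set according to the two definitions above; all numerical values are invented for the example.

    import numpy as np

    # Hypothetical reference values and model predictions for a test set (RMSEP)
    # and for the calibration set (SEC); p parameters were estimated.
    y_test     = np.array([1.2, 0.8, 1.5, 1.1, 0.9, 1.3])
    y_test_hat = np.array([1.15, 0.85, 1.42, 1.18, 0.95, 1.27])
    y_cal      = np.array([1.0, 1.4, 0.7, 1.2, 1.6, 0.9, 1.1, 1.3])
    y_cal_hat  = np.array([1.02, 1.35, 0.74, 1.18, 1.55, 0.93, 1.12, 1.26])
    p = 3                                   # e.g. 2 variables + 1 intercept

    # RMSEP: root mean square error over the n test samples.
    rmsep = np.sqrt(np.mean((y_test_hat - y_test) ** 2))

    # SEC: same residual sum of squares but divided by n - p to account for
    # the parameters estimated from the calibration data.
    n_cal = len(y_cal)
    sec = np.sqrt(np.sum((y_cal_hat - y_cal) ** 2) / (n_cal - p))

    print(rmsep, sec)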
Standard error of laboratory : refers to the intermediate precision or reproducibility, whichever is applicable.
Supervised : refers to modelling data labelled by classes or values.
Unsupervised (non-supervised) : refers to exploring data without prior assumptions.
Underfitting : the reverse of overfitting.
Variable : property of a sample that can be assessed (attribute, descriptor, feature, property, characteristics).
Varimax rotation : orthogonal analytical rotation of factors that maximises the variance of squared factor loadings, thereby increasing the large factor loadings and large eigenvalues and decreasing the small ones in each factor.
4. ABBREVIATIONS
ALS alternating least squares
ANN artificial neural network
BP back-propagation
CLS classical least squares
DB density-based
DBSCAN density-based spatial clustering of applications with noise
DoE design of experiments
EM expectation maximisation
LDA linear discriminant analysis
MCR multivariate curve resolution
MLFF multi-layer feed-forward
MLR multiple linear regression
MSPC multivariate statistical process control
NIR near infrared
PAT process analytical technology
PCR principal components regression
PLS-DA partial least squares discriminant analysis
RMSEP root mean square error of prediction
SIMCA soft independent modelling of class analogy
SNV standard normal variate