
Chemometrics and Intelligent Laboratory Systems 190 (2019) 10–21


Comparison of multi-response prediction methods


Raju Rimal a,*, Trygve Almøy a, Solve Sæbø b
a Faculty of Chemistry and Bioinformatics, Norwegian University of Life Sciences, Ås, Norway
b Norwegian University of Life Sciences, Ås, Norway

ARTICLE INFO

Keywords: Model-comparison, Multi-response, Simrel

ABSTRACT

While data science is battling to extract information from the enormous explosion of data, many estimators and algorithms are being developed for better prediction. Researchers and data scientists often introduce new methods and evaluate them based on various aspects of data. However, studies on the impact of data properties on models with multiple response variables are limited. This study compares some newly developed (envelope) and well-established (PLS, PCR) prediction methods based on real data and on simulated data specifically designed by varying properties such as multicollinearity, the correlation between multiple responses and the position of the relevant principal components of the predictors. This study aims to give some insight into these methods and help researchers to understand and use them in further studies.

* Corresponding author. E-mail addresses: [email protected] (R. Rimal), [email protected] (T. Almøy), [email protected] (S. Sæbø).
https://doi.org/10.1016/j.chemolab.2019.05.004
Received 19 March 2019; Received in revised form 30 April 2019; Accepted 9 May 2019; Available online 15 May 2019.
0169-7439/© 2019 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

1. Introduction

Prediction has been an essential component of modern data science, whether in the discipline of statistical analysis or machine learning. Modern technology has facilitated a massive explosion of data; however, such data often contain irrelevant information that consequently makes prediction difficult. Researchers are devising new methods and algorithms in order to extract information and create robust predictive models. Such models mostly contain predictor variables that are directly or indirectly correlated with other predictor variables. In addition, studies often consist of many response variables correlated with each other. These interlinked relationships influence any study, whether it is predictive modelling or inference.

Modern inter-disciplinary research fields such as chemometrics, econometrics and bioinformatics handle multi-response models extensively. This paper attempts to compare some multivariate prediction methods based on their prediction performance on linear model data with specific properties. The properties include the correlation between response variables, the correlation between predictor variables, the number of predictor variables and the position of the relevant predictor components. These properties are discussed further in the Experimental Design section. Among others, Sæbø et al. [26] and Almøy [2] have conducted a similar comparison in the single-response setting. In addition, Rimal et al. [25] have also conducted a basic comparison of some prediction methods and their interaction with the data properties of a multi-response model. The main aim of this paper is to present a comprehensive comparison of contemporary prediction methods such as simultaneous envelope estimation (Senv) [8] and envelope estimation in the predictor space (Xenv) [7] with customary prediction methods such as Principal Component Regression (PCR) and Partial Least Squares Regression (PLS), using simulated datasets with controlled properties. In the case of PLS, we have used PLS1, which fits each response separately, and PLS2, which fits all the responses together. The experimental design and the methods under comparison are discussed below, followed by a brief discussion of the strategy behind the data simulation.

2. Simulation model

Consider a model where the response vector (y) with m elements and the predictor vector (x) with p elements follow a multivariate normal distribution as follows,

$$\begin{pmatrix} y \\ x \end{pmatrix} \sim N\left(\begin{pmatrix} \mu_y \\ \mu_x \end{pmatrix}, \begin{pmatrix} \Sigma_{yy} & \Sigma_{yx} \\ \Sigma_{xy} & \Sigma_{xx} \end{pmatrix}\right) \tag{1}$$

where Σxx and Σyy are the variance-covariance matrices of x and y, respectively, Σxy is the covariance between x and y, and μx and μy are the mean vectors of x and y, respectively. A linear model based on (1) is

$$y = \mu_y + \beta^t (x - \mu_x) + \varepsilon \tag{2}$$

where β^t is the m × p matrix of regression coefficients and ε is an error term such that ε ∼ N(0, Σ_{y|x}). Here β^t = Σ_{yx}Σ_{xx}^{-1} and Σ_{y|x} = Σ_{yy} − Σ_{yx}Σ_{xx}^{-1}Σ_{xy}.
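To make the setup concrete, the following minimal R sketch simulates data from the joint normal model (1) and computes the true coefficient matrix of (2). The specific covariance values are arbitrary choices for illustration, not those used in the study.

```r
# A minimal sketch of model (1)-(2): simulate (y, x) jointly normal and
# compute the true coefficients beta^t = Sigma_yx %*% solve(Sigma_xx).
# The covariance values below are illustrative assumptions only.
library(MASS)  # mvrnorm()

m <- 2; p <- 3
Sxx <- diag(c(1.0, 0.7, 0.5))                    # predictor covariance
Syy <- diag(m)                                   # response covariance
Syx <- matrix(c(0.5, 0.3, 0.0,
                0.0, 0.2, 0.4), nrow = m, byrow = TRUE)
Sigma <- rbind(cbind(Syy, Syx),                  # joint covariance of (y, x)
               cbind(t(Syx), Sxx))

set.seed(1)
dat <- mvrnorm(n = 100, mu = rep(0, m + p), Sigma = Sigma)
y <- dat[, 1:m]
x <- dat[, (m + 1):(m + p)]

beta_t <- Syx %*% solve(Sxx)                     # true m x p coefficients in (2)
```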
In a model like (2), we assume that the variation in the response y is partly explained by the predictor x. However, in many situations, only a subspace of the predictor space is relevant for the variation in the response y. This space can be referred to as the relevant space of x, and the rest as the irrelevant space. In a similar way, for a certain model, we can assume that a subspace in the response space exists and contains the information that the relevant space in the predictor can explain (Fig. 1). Cook et al. [7] and Cook and Zhang [8] have referred to the relevant space as the material space and the irrelevant space as the immaterial space.

Fig. 1. Relevant space in a regression model.

With an orthogonal transformation of y and x to latent variables w and z, respectively, by w = Qy and z = Rx, where Q and R are orthogonal rotation matrices, an equivalent model to (1) in terms of the latent variables can be written as

$$\begin{pmatrix} w \\ z \end{pmatrix} \sim N\left(\begin{pmatrix} \mu_w \\ \mu_z \end{pmatrix}, \begin{pmatrix} \Sigma_{ww} & \Sigma_{wz} \\ \Sigma_{zw} & \Sigma_{zz} \end{pmatrix}\right) \tag{3}$$

where Σww and Σzz are the variance-covariance matrices of w and z, respectively, Σzw is the covariance between z and w, and μw and μz are the mean vectors of w and z, respectively.

Here, the elements of w and z are the principal components of the responses and the predictors, which will be referred to as "response components" and "predictor components", respectively. The column vectors of the respective rotation matrices Q and R are the eigenvectors corresponding to these principal components. We can write a linear model based on (3) as

$$w = \mu_w + \alpha^t (z - \mu_z) + \tau \tag{4}$$

where α^t is the m × p matrix of regression coefficients and τ is an error term such that τ ∼ N(0, Σ_{w|z}).

Following the concept of relevant space, a subset of the predictor components can be imagined to span the predictor space. These components can be regarded as relevant predictor components. Naes and Martens [22] introduced the concept of relevant components, which was explored further by Helland [11], Næs and Helland [21], Helland and Almøy [13] and Helland [12]. The corresponding eigenvectors were referred to as relevant eigenvectors. A similar logic was introduced by Cook et al. [7] and later by Cook et al. [5] as an envelope, which is the space spanned by the relevant eigenvectors [4, pp. 101].

In addition, various simulation studies have been performed with models based on the concept of a relevant subspace. A simulation study by Almøy [2] used a single-response simulation model based on reduced regression and compared some contemporary multivariate estimators. In recent years Helland et al. [15], Sæbø et al. [26], Helland et al. [14] and Rimal et al. [25] implemented simulation examples similar to those we are discussing in this study. This paper, however, presents an elaborate comparison of prediction using multi-response simulated linear model data. The properties of the simulated data are varied through different levels of simulation parameters based on an experimental design. Rimal et al. [25] provide a detailed discussion of the simulation model that we have adopted here. The following section presents the estimators being compared in more detail.

3. Prediction methods

Partial least squares regression (PLS) and principal component regression (PCR) have been used in many disciplines such as chemometrics, econometrics, bioinformatics and machine learning, where wide predictor matrices, i.e. p (number of predictors) > n (number of observations), are common. These methods are popular in multivariate analysis, especially for exploratory studies and predictions. In recent years, the concept of an envelope, introduced by Cook et al. [6] and based on reduction of the regression model, was implemented for the development of different estimators. This study compares these prediction methods based on their prediction performance on data simulated with different controlled properties.

Principal Components Regression (PCR): Principal components are linear combinations of the predictor variables such that the transformation makes the new variables uncorrelated. In addition, the variation of the original dataset captured by the new variables is sorted in descending order. In other words, each successive component captures the maximum variation left by the preceding components in the predictor variables [18]. Principal components regression uses these principal components as new predictors to explain the variation in the response.

Partial Least Squares (PLS): Two variants of PLS, PLS1 and PLS2, are used for comparison. The first considers individual response variables separately, i.e. each response is predicted with a single-response model, while the latter considers all response variables together. In PLS regression, the components are determined so as to maximize the covariance between the response and the predictors [10]. Among others, there are three main PLS algorithms, NIPALS, SIMPLS and the Kernel Algorithm, all of which remove the extracted information through deflation and make the resulting new variables orthogonal. The algorithms differ in the deflation strategy and the computation of the various weight vectors [1]; here we have used the kernel version of PLS. The R-package pls [20] is used for both the PCR and PLS methods.

Envelopes: The envelope, introduced by Cook et al. [6], was first used to define the response envelope [7] as the smallest subspace in the response space that is a reducing subspace of Σ_{y|x} and such that the span of the regression coefficients lies in that space. Since a multivariate linear regression model contains relevant (material) and irrelevant (immaterial) variation in both the response and the predictor, the relevant part provides information, while the irrelevant part increases the estimative variation. The concept of the envelope uses the relevant part for estimation while excluding the irrelevant part, consequently increasing the efficiency of the model [9].

The concept was later extended to the predictor space, where the predictor envelope was defined [5]. Further, Cook and Zhang [8] used envelopes for the joint reduction of the responses and predictors and argued that this produces efficiency gains greater than those derived by using individual envelopes for either the responses or the predictors separately. All the variants of envelope estimation are based on maximum likelihood estimation. Here we have used the predictor envelope (Xenv) and the simultaneous envelope (Senv) for the comparison. The R-package Renvlp [19] is used for both the Xenv and Senv methods.
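As an indication of how the non-envelope fits look in practice, the sketch below uses the pls package for PCR, PLS2 and PLS1; the data frame `dat` with matrix columns X and Y is an assumed placeholder, not an object from the paper.

```r
# A sketch of fitting PCR, PLS2 and PLS1 with the pls package, assuming a
# data frame `dat` with an n x p matrix column X and an n x 4 matrix
# column Y (e.g. dat <- data.frame(Y = I(Y), X = I(X))).
library(pls)

pcr_fit  <- pcr(Y ~ X, ncomp = 10, data = dat)              # PCR on all responses
pls2_fit <- plsr(Y ~ X, ncomp = 10, data = dat,
                 method = "kernelpls")                      # PLS2: joint fit
pls1_fit <- lapply(1:4, function(j)                         # PLS1: one single-response
  plsr(Y[, j] ~ X, ncomp = 10, data = dat,                  # model per response
       method = "kernelpls"))
```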

3.1. Modification in envelope estimation

Since the envelope estimators (Xenv and Senv) are based on maximum likelihood estimation (MLE), they fail to estimate in the case of wide
matrices, i.e. p > n. To incorporate these methods in our comparison, we have used the principal components (z) of the predictor variables (x) as predictors, using the number of components required to capture 97.5% of the variation in x for the designs where p > n. The new set of variables z was used for the envelope estimation. The regression coefficients (α̂) corresponding to these new variables z were transformed back to obtain coefficients for each predictor variable,

$$\hat{\beta} = e_k \hat{\alpha}_k$$

where e_k is the matrix of eigenvectors with the first k components. Only the simultaneous envelope allows specifying the dimension of the response envelope, and all the simulation is based on a single latent dimension of the response, so it is fixed at two in the simulation study. In the case of Senv, when the envelope dimension for the response is the same as the number of responses, it degenerates to the Xenv method, and if the envelope dimension for the predictor is the same as the number of predictors, it degenerates to standard multivariate linear regression [19].
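A sketch of this adaptation is given below, assuming the xenv(X, Y, u) interface of the Renvlp package; the helper name and the handling of the 97.5% threshold are ours, and centering/intercepts are omitted for brevity.

```r
# A sketch of the p > n adaptation described above, assuming Renvlp's
# xenv(X, Y, u) interface: reduce the predictors to principal components
# covering 97.5% of their variation, fit the predictor envelope on the
# scores z, then back-transform as beta-hat = e_k %*% alpha-hat_k.
library(Renvlp)

fit_xenv_wide <- function(X, Y, u, threshold = 0.975) {
  pc   <- prcomp(X)                                 # principal components of X
  expl <- cumsum(pc$sdev^2) / sum(pc$sdev^2)
  k    <- which(expl >= threshold)[1]               # components covering 97.5%
  Z    <- pc$x[, 1:k, drop = FALSE]                 # scores used as predictors
  fit  <- xenv(Z, Y, u = u)                         # envelope estimation on z
  beta <- pc$rotation[, 1:k, drop = FALSE] %*% fit$beta  # e_k %*% alpha-hat_k
  list(beta = beta, ncomp = k, envelope = fit)
}
```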
4. Experimental design

This study compares the prediction methods based on their prediction ability. Data with specific properties are simulated, some of which are easier to predict than others. These data are simulated using the R-package simrel, which is discussed in Sæbø et al. [26] and Rimal et al. [25]. Here we have used four different factors to vary the properties of the data: a) the number of predictors (p), b) the multicollinearity in the predictor variables (gamma), c) the correlation in the response variables (eta) and d) the position of the predictor components relevant for the response (relpos). Using two levels of p, gamma and relpos and four levels of eta, 32 sets of distinct properties are designed for the simulation.

Number of predictors: To observe the performance of the methods on tall and wide predictor matrices, 20 and 250 predictor variables are simulated with the number of observations fixed at 100. The parameter p controls these properties in the simrel function.

Multicollinearity in predictor variables: Highly collinear predictors can be explained completely by a few components. The parameter gamma (γ) in simrel controls the decline in the eigenvalues of the predictor variables as in (5),

$$\lambda_i = e^{-\gamma(i-1)}, \quad \gamma > 0 \text{ and } i = 1, 2, \ldots, p \tag{5}$$

Here, λ_i, i = 1, 2, …, p are the eigenvalues of the predictor variables. We have used 0.2 and 0.9 as the two levels of gamma. The higher the value of gamma, the higher the multicollinearity, and vice versa. In our simulations, the higher and lower gamma values corresponded to maximum correlations between the predictors of 0.990 and 0.709, respectively, in the case of p = 20 variables. In the case of p = 250, the corresponding values for the maximum correlation were 0.998 and 0.923.

Correlation in response variables: Correlation among response variables has been explored to a lesser extent. Here we have tried to explore that aspect with four levels of correlation in the response variables. We have used the eta (η) parameter of simrel to control the decline in the eigenvalues corresponding to the response variables as in (6),

$$\kappa_j = e^{-\eta(j-1)}, \quad \eta > 0 \text{ and } j = 1, 2, \ldots, m \tag{6}$$

Here, κ_j, j = 1, 2, …, m are the eigenvalues of the response variables and m is the number of response variables. We have used 0, 0.4, 0.8 and 1.2 as the levels of eta. The larger the value of eta, the larger the correlation between the response variables, and vice versa. In our simulation, the levels of eta from small to large correspond to maximum correlations of 0, 0.442, 0.729 and 0.878 between the response variables, respectively.

Position of predictor components relevant to the response: The principal components of the predictors are ordered. The first principal component captures most of the variation in the predictors. The second captures most of the remainder left by the first principal component, and so on. In highly collinear predictors, the variation captured by the first few components is relatively high. However, if those components are not relevant for the response, prediction becomes difficult [13]. Here, two levels of the positions of these relevant components are used: 1, 2, 3, 4 and 5, 6, 7, 8.

Moreover, a complete factorial design from the levels of the above parameters gave us 32 designs. Each design is associated with a dataset having unique properties. Fig. 2 shows all the designs. For each design and prediction method, 50 datasets were simulated as replicates. In total, there were 5 × 32 × 50, i.e. 8000, simulated datasets.

Common parameters: Each dataset was simulated with n = 100 observations and m = 4 response variables. Furthermore, the coefficient of determination corresponding to each response component in all the designs is set to 0.8. The informative and uninformative latent components are generated according to (3). Since Σww and Σzz are diagonal matrices, the components are independent within w and z, but the dependence between the latent spaces of x and y is secured through the non-zero elements of Σwz, with positions defined by the relpos and ypos parameters. The latent components are subsequently rotated to obtain the population covariance structure of the response and predictor variables. In addition, we have assumed that there is only one informative response component. Hence, the informative response component after the orthogonal rotation, together with three uninformative response components, generates the four response variables. This spreads out the information across all simulated response variables. For further details on the simulation tool, see Ref. [25].

An example of simulation parameters for the first design is as follows:
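For illustration, a call of the following form could generate such a design, assuming the multivariate interface of simrel [25,26]; the value of q (number of relevant predictor variables) and the exact ypos argument are assumptions, not taken from the paper.

```r
# A sketch of a simrel call that could generate Design 1 data
# (p = 20, gamma = 0.2, eta = 0, relpos = 1, 2, 3, 4); q and ypos
# are illustrative assumptions.
library(simrel)

sim_obj <- simrel(
  n      = 100,                   # number of training observations
  p      = 20,                    # number of predictor variables
  q      = 10,                    # number of relevant predictors (assumed)
  m      = 4,                     # number of response variables
  relpos = list(c(1, 2, 3, 4)),   # positions of relevant predictor components
  ypos   = list(1:4),             # informative response component mixed with the rest
  gamma  = 0.2,                   # eigenvalue decay of predictors, eq. (5)
  eta    = 0,                     # eigenvalue decay of responses, eq. (6)
  R2     = 0.8,                   # coefficient of determination
  type   = "multivariate"
)
X <- sim_obj$X; Y <- sim_obj$Y    # simulated predictors and responses
```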


Fig. 2. Experimental Design of simulation parameters. Each point represents a unique data property.

Fig. 3. (Left) Covariance structure of the latent components. (Right) Covariance structure of the predictors and responses.

The covariance structure of the data simulated with this design, shown in Fig. 3, reveals that the predictor components at positions 1, 2, 3 and 4 are relevant for the first response component. After the rotation with an orthogonal rotation matrix, all predictor variables are somewhat relevant for all response variables, satisfying the other desired properties such as multicollinearity and the coefficient of determination. For the same design, Fig. 4 (top left) shows that the predictor components 1, 2, 3 and 4 are relevant for the first response component. All other predictor components are irrelevant and all other response components are uninformative. However, due to the orthogonal rotation of the informative response component together with the uninformative response components, all response variables in the population have similar covariance with the relevant predictor components (Fig. 4 (top right)). The sample covariances of the predictor components and the predictor variables with the response variables are shown in Fig. 4 (bottom left) and (bottom right), respectively.

Fig. 4. Expected scaled absolute covariance between predictor components and response components (top left). Expected scaled absolute covariance between predictor components and response variables (top right). Sample scaled absolute covariance between predictor components and response variables (bottom left). Sample scaled absolute covariance between predictor variables and response variables (bottom right). The bar graph in the background represents the eigenvalues corresponding to each component in the population (top plots) and in the sample (bottom plots). One can compare the top-right plot (true covariance in the population) with the bottom-left (covariance in the simulated data), which shows a similar pattern for the different components.

A similar description can be made for all 32 designs, where each of the designs holds the properties of the data it simulates. These data are used by the prediction methods discussed in the previous section. Each prediction method is given independently simulated datasets in order to give the methods an equal opportunity to capture the dynamics in the data.

5. Basis of comparison

This study focuses mainly on the prediction performance of the methods, with an emphasis specifically on the interaction between the properties of the data controlled by the simulation parameters and the prediction methods. The prediction performance is measured based on the following:

a) the average prediction error that a method can give using an arbitrary number of components, and
b) the average number of components used by the method to give the minimum prediction error.

Let us define

$$\mathrm{PE}_{ijkl} = \mathrm{E}\left[\frac{1}{\sigma^2_{y_{ij}|x}}\left(\beta_{ij} - \hat{\beta}_{ijkl}\right)^t \left(\Sigma_{xx}\right)_i \left(\beta_{ij} - \hat{\beta}_{ijkl}\right) + 1\right] \tag{7}$$

as the prediction error of response j = 1, …, 4 for a given design i = 1, 2, …, 32 and method k = 1 (PCR), …, 5 (Senv) using l = 0, …, 10 components. Here, (Σxx)_i is the true covariance matrix of the predictors, unique for a particular design i, and σ²_{y_ij|x} for response j = 1, …, m is the true model error. The prediction error is scaled by the true model error to remove the effect of differing residual variances. Since both the expectation and the variance of β̂ are unknown, the prediction error is estimated using data from 50 replications as follows,

$$\widehat{\mathrm{PE}}_{ijkl} = \frac{1}{50}\sum_{r=1}^{50}\left[\frac{1}{\sigma^2_{y_{ij}|x}}\left(\beta_{ij} - \hat{\beta}_{ijklr}\right)^t \left(\Sigma_{xx}\right)_i \left(\beta_{ij} - \hat{\beta}_{ijklr}\right) + 1\right] \tag{8}$$

where $\widehat{\mathrm{PE}}_{ijkl}$ is the estimated prediction error averaged over the r = 50 replicates.

The following section focuses on the data used for the estimation of these prediction errors, corresponding to the two measures a) and b) above.

6. Data preparation

A dataset for estimating (7) is obtained from the simulation, which contains a) five factors corresponding to the simulation parameters, b) the prediction methods, c) the number of components, d) the replications and e) the prediction errors for the four responses. The prediction error is computed using predictor components ranging from 0 to 10 for each of the 50 replicates as

$$\left(\widehat{\mathrm{PE}}_{\circ}\right)_{ijklr} = \frac{1}{\sigma^2_{y_{ij}|x}}\left[\left(\beta_{ij} - \hat{\beta}_{ijklr}\right)^t \left(\Sigma_{xx}\right)_i \left(\beta_{ij} - \hat{\beta}_{ijklr}\right) + 1\right]$$

Thus there are 32 (designs) × 5 (methods) × 11 (numbers of components) × 50 (replications), i.e. 88000, observations corresponding to the response variables Y1 to Y4.

Since our discussion focuses on the average minimum prediction error that a method can obtain and the average number of components it uses to get the minimum prediction error in each replicate, the dataset discussed above is summarized by constructing the following two smaller datasets. Let us call them the Error Dataset and the Component Dataset.

Error Dataset: For each prediction method, design and response, an average prediction error is computed over all replicates for each component. Next, the component that gives the minimum of this average prediction error is selected, i.e.,

$$l_{\circ} = \operatorname*{argmin}_{l}\left[\frac{1}{50}\sum_{r=1}^{50}\left(\mathrm{PE}_{\circ}\right)_{ijklr}\right] \tag{9}$$

Using the component l∘, a dataset of (PE∘)_{ijkl∘r} is used as the Error Dataset. Let u_{(8000×4)} = (u_j) for j = 1, …, 4 be the outcome variables measuring the prediction error corresponding to response number j in the context of this dataset.

Component Dataset: The number of components that gives the minimum prediction error in each replication is referred to as the Component Dataset, i.e.,

$$l_{\circ} = \operatorname*{argmin}_{l}\left(\mathrm{PE}_{\circ}\right)_{ijklr} \tag{10}$$

Here l∘ is the number of components that gives the minimum prediction error (PE∘)_{ijklr} for design i, response j, method k and replicate r. Let v_{(8000×4)} = (v_j) for j = 1, …, 4 be the outcome variables measuring the number of components used for the minimum prediction error corresponding to response j in the context of this dataset.
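As a sketch of how (8) can be computed, the following function estimates the scaled prediction error for one design, method, response and number of components; the argument names are ours, and the true β, (Σxx)_i and σ²_{y|x} are available from the simulation design.

```r
# A sketch of the estimated prediction error in eq. (8): average the scaled
# error over the replicated coefficient estimates for one response.
# Argument names are illustrative; the true quantities come from the design.
estimate_pe <- function(beta_true, beta_hat_reps, Sigma_xx, sigma2) {
  pe <- vapply(beta_hat_reps, function(beta_hat) {
    d <- beta_true - beta_hat                      # estimation error (p-vector)
    drop(t(d) %*% Sigma_xx %*% d) / sigma2 + 1     # scaled by true model error
  }, numeric(1))
  mean(pe)                                         # average over the 50 replicates
}
```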


Fig. 5. Score densities corresponding to the first principal component of the error dataset (u), subdivided by methods, gamma and eta, and grouped by relpos.

7. Exploration

This section explores the variation in the error dataset and the component dataset, for which we have used Principal Component Analysis (PCA). Let t_u and t_v be the principal component score sets corresponding to PCA run on the u and v matrices, respectively. The score densities in Fig. 5 correspond to the first principal component of u, i.e. the first column of t_u.

Since higher prediction errors correspond to high scores, the plot shows that the PCR, PLS1 and PLS2 methods are influenced by the two levels of the position of the relevant predictor components. When the relevant predictors are at positions 5, 6, 7, 8, the eigenvalues corresponding to them are relatively smaller. This also suggests that PCR, PLS1 and PLS2 depend greatly on the position of the relevant components, and that the variation of these components affects their prediction performance. However, the envelope methods appear to be less influenced by relpos in this regard.

In addition, the plot also shows that the effect of gamma, i.e. the level of multicollinearity, is smaller when the relevant predictors are at positions 1, 2, 3, 4. This indicates that the methods are somewhat robust in handling collinear predictors. Nevertheless, when the relevant predictors are at positions 5, 6, 7, 8, high multicollinearity results in a small variance of these relevant components and consequently yields poor prediction. This is in accordance with the findings of Helland and Almøy [13].

Furthermore, the density curves for PCR, PLS1 and PLS2 are similar for the different levels of eta, i.e. the factor controlling the correlation between responses. However, the envelope models show distinct interactions between the positions of the relevant components (relpos) and eta. Here, higher levels of eta have yielded higher scores and a clear separation between the two levels of relpos. In the case of high multicollinearity, the envelope methods have produced some large outliers, indicating that in some cases these methods can give an unexpected prediction.

In Fig. 6, the higher scores suggest that methods have used a larger number of components to give the minimum prediction error. The plot also shows that the relevant predictor components at positions 5, 6, 7, 8 give larger prediction errors than those at positions 1, 2, 3, 4. The pattern is more distinct in the large multicollinearity cases and for the PCR and PLS methods. Both envelope methods have shown equally enhanced performance at both levels of relpos and gamma. However, for data with low multicollinearity (γ = 0.2), the envelope methods have on average used fewer components than in the high multicollinearity cases to achieve the minimum prediction error.

Fig. 6. Score density corresponding to the first principal component of the component dataset (v), subdivided by methods, gamma and eta, and grouped by relpos.

8. Statistical analysis

This section models the error data and the component data as functions of the simulation parameters to better understand the connection between the data properties and the prediction methods, using multivariate analysis of variance (MANOVA).

Let us consider models with third-order interactions of the simulation parameters (p, gamma, eta and relpos) and Methods, as in (11) and (12), using datasets u and v, respectively. Let us refer to them as the error model and the component model.

Error Model:

$$u_{abcdef} = \mu_u + \left(p_a + \mathrm{gamma}_b + \mathrm{eta}_c + \mathrm{relpos}_d + \mathrm{Methods}_e\right)^3 + \left(\varepsilon_u\right)_{abcdef} \tag{11}$$

Component Model:

$$v_{abcdef} = \mu_v + \left(p_a + \mathrm{gamma}_b + \mathrm{eta}_c + \mathrm{relpos}_d + \mathrm{Methods}_e\right)^3 + \left(\varepsilon_v\right)_{abcdef} \tag{12}$$

where u_abcdef is a vector of prediction errors in the error model, v_abcdef is a vector of the number of components used by a method to obtain the minimum prediction error in the component model, and the exponent indicates that all interactions up to third order are included.

Although there are several test statistics for MANOVA, all are essentially equivalent for large samples [17]. Here we will use Pillai's trace statistic, which is defined as

$$\text{Pillai statistic} = \mathrm{tr}\left[(E + H)^{-1}H\right] = \sum_{i=1}^{m}\frac{\nu_i}{1 + \nu_i} \tag{13}$$

Here the matrix H holds the between sums of squares and sums of products for each of the predictors. The matrix E holds the within sums of squares and sums of products for each of the predictors. ν_i represents the eigenvalues corresponding to E^{-1}H [24].

For both models (11) and (12), Pillai's trace statistic is used for assessing the effect of each factor and returns an F-value for the strength of its significance. Fig. 7 plots the Pillai's trace statistics as bars with the corresponding F-values as text labels for both models.
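In R, models of the form (11) and (12) can be fitted and tested with the base manova() function, as sketched below; the data frame name and column names are assumptions.

```r
# A sketch of fitting the error model (11) with base R, assuming a data
# frame `err` with prediction-error columns Y1..Y4 and the factors p,
# gamma, eta, relpos and Method; (...)^3 expands all interactions up to
# third order, and test = "Pillai" gives the statistic in eq. (13).
err_model <- manova(
  cbind(Y1, Y2, Y3, Y4) ~ (p + gamma + eta + relpos + Method)^3,
  data = err
)
summary(err_model, test = "Pillai")   # Pillai's trace with approximate F-values
```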
Error Model: Fig. 7 (left) shows the Pillai's trace statistic for the factors of the error model. The main effect of Method, followed by relpos, eta and gamma, has the largest influence on the model. A highly significant two-factor interaction of Method with gamma, followed by those with relpos and eta, clearly shows that the methods perform differently for different levels of these data properties. The significant third-order interaction between Method, eta and gamma suggests that the performance of a method differs for a given level of multicollinearity and correlation between the responses. Since only some methods consider modelling the predictor and the response together, the prediction is affected by the level of correlation between the responses (eta) for a given method.

Component Model: Fig. 7 (right) shows the Pillai's trace statistic for the factors of the component model. As in the error model, the main effects of Method, relpos, gamma and eta have a significantly large effect on the number of components that a method has used to obtain the minimum prediction error. The two-factor interactions of Method with the simulation parameters are larger in this case. This shows that the Methods and these interactions have a larger effect on the number of components used than on the prediction error itself. In addition, a significant third-order interaction similar to that found in the error model is also observed in this model.

The following subsections continue to explore the effects of the different levels of the factors in the case of these interactions.

8.1. Effect analysis of error model

The large difference in the prediction error for the envelope models in Fig. 8 (left) is intensified when the position of the relevant predictors is at 5, 6, 7, 8. The results also show that the envelope methods are more sensitive to the levels of eta than the rest of the methods. In the case of PCR and PLS, the difference in the effect of the levels of eta is small.

In Fig. 8 (right), we can see that the multicollinearity (controlled by gamma) has affected all the methods. However, the envelope methods perform better under low multicollinearity than under high multicollinearity, while PCR, PLS1 and PLS2 are robust to high multicollinearity. Despite handling high multicollinearity, these methods have a higher prediction error in both cases of multicollinearity than the envelope methods.

8.2. Effect analysis of the component model

Unlike for the prediction errors, Fig. 9 (left) shows that the number of components used by the methods to obtain the minimum prediction error is less affected by the levels of eta. All methods appear to use on average more components when eta increases. The envelope methods are able to obtain the minimum prediction error using between 1 and 3 components for both cases of relpos. This value is much higher in the case of PCR, as its prediction is based only on the principal components of the predictor matrix. The number of components used by this method ranges from 3 to 5 when the relevant components are at positions 1, 2, 3, 4 and from 5 to 8 when the relevant components are at positions 5, 6, 7, 8.

When the relevant components are at positions 5, 6, 7, 8, the eigenvalues of the relevant predictors become smaller and the responses are relatively difficult to predict. This becomes more critical in the high multicollinearity cases. Fig. 9 (right) shows that the envelope methods are less influenced by the level of relpos and are particularly better at achieving the minimum prediction error using fewer components than the other methods.


Fig. 7. Pillai statistics and F-values for the MANOVA models. The bars represent the Pillai statistic and the text labels are the F-values for the corresponding factors.

Fig. 8. Effect plot of some interactions of the multivariate linear model of prediction error.


Fig. 9. Effect plot of some interactions of the multivariate linear model of the number of components to get minimum prediction error.

Fig. 10. (Left) The bars represent the eigenvalues corresponding to the Raman spectra. The points and lines are the covariances between the responses and the principal components of the Raman spectra. All values are normalized to a 0–1 scale. (Middle) Cumulative sum of the eigenvalues corresponding to the predictors. (Right) Cumulative sum of the eigenvalues corresponding to the responses. The top and bottom rows correspond to the test and training datasets, respectively.

9. Examples

In addition to the analysis with the simulated data, the following two examples explore the prediction performance of the methods using real datasets. Since both examples have wide predictor matrices, principal components explaining 97.5% of the variation in them are used for the envelope methods. The coefficients were transformed back after the estimation.

9.1. Raman spectra analysis of contents of polyunsaturated fatty acids (PUFA)

This dataset contains 44 training samples and 25 test samples of fatty acid information expressed as a) the percentage of total sample weight and b) the percentage of total fat content. The dataset is borrowed from Næs et al. [23], where more information can be found. The samples were analysed using Raman spectroscopy, from which 1096 wavelength variables were obtained as predictors. Raman spectroscopy provides detailed chemical information from minor components in food. The aim of this example is to compare how well the prediction methods that we have considered are able to predict the contents of PUFA using these Raman spectra.

Fig. 10 (left) shows that the first few predictor components are somewhat correlated with the response variables. In addition, most of the variation in the predictors is explained by less than five components (middle). Further, the response variables are highly correlated, suggesting that a single latent dimension explains most of the variation (right). We may therefore also believe that the relevant latent space in the response matrix is of dimension one. This resembles Design 19 (Fig. 2) from our simulation.

Using a range of components from 1 to 15, regression models were fitted using each of the methods. The fitted models were used to predict the test observations, and the root mean squared error of prediction (RMSEP) was calculated. Fig. 11 shows that PLS2 obtained a minimum


Fig. 11. Prediction error of the different prediction methods using different numbers of components.

Fig. 12. (Left) The bars represent the eigenvalues corresponding to the NIR spectra. The points and lines are the covariances between the responses and the principal components of the NIR spectra. All values are normalized to a 0–1 scale. (Middle) Cumulative sum of the eigenvalues corresponding to the predictors. (Right) Cumulative sum of the eigenvalues corresponding to the responses.

prediction error of 3.783 using 9 components in the case of the response %Pufa, while PLS1 obtained a minimum prediction error of 1.308 using 11 components in the case of the response PUFA%emul. However, the figure also shows that both envelope methods have reached almost the minimum prediction error with fewer components. This pattern is also visible in the simulation results (Fig. 9).
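A sketch of the test-set evaluation behind Fig. 11 is given below, assuming training and test data frames with matrix columns X (spectra) and Y (responses); the object names are placeholders.

```r
# A sketch of the RMSEP computation used in the examples, assuming data
# frames `train` and `test` with matrix columns X and Y.
library(pls)

fit   <- plsr(Y ~ X, ncomp = 15, data = train, method = "kernelpls")
rmsep <- RMSEP(fit, newdata = test)   # test prediction error for 0..15 components
plot(rmsep)                           # curves comparable to Fig. 11
```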
9.2. Example 2: NIR spectra of biscuit dough

The dataset consists of 700 wavelengths of NIR spectra (1100–2498 nm in steps of 2 nm) that were used as predictor variables. There are four response variables corresponding to the yield percentages of (a) fat, (b) sucrose, (c) flour and (d) water. The measurements were taken from 40 training observations of biscuit dough. A separate set of 32 samples, created and measured on different occasions, was used as test observations. The dataset is borrowed from Indahl [16], where further information can be obtained.

Fig. 12 (left) shows that the first predictor component has the largest variance and also has a large covariance with all response variables. The second component, however, has a larger variance (middle) than the succeeding components but has a small covariance with all the responses, which indicates that this component is less relevant for any of the responses. In addition, two response components explain most of the variation in the response variables (right). This structure is also somewhat similar to Design 19, although it is uncertain whether the dimension of the relevant space in the response matrix is larger than one.

Fig. 13 (corresponding to Fig. 11) shows the root mean squared error for both the test and train predictions of the biscuit dough data. Here, four different methods have the minimum test prediction error for the four responses. As the structure of the data is similar to that of the first example, the pattern in the prediction is also similar for all methods.

The prediction performance on the test data of the envelope methods appears to be more stable compared to the PCR and PLS methods. Furthermore, the envelope methods generally achieve good performance using fewer components, which is in accordance with Fig. 6.

10. Discussion and conclusion

Analysis using both simulated and real data has shown that the envelope methods are more stable, less influenced by relpos and gamma and, in general, performed better than the PCR and PLS methods. These methods are also found to be less dependent on the number of


Fig. 13. Prediction error of the different prediction methods using different numbers of components.

components.

Since the facets in Figs. 5 and 6 have their own scales, despite some large prediction errors seen at the right tail, the envelope methods still have a smaller prediction error and have used fewer components than the other methods.

The envelope methods may have the problem of being caught in a local optimum of the objective function. If these cases of sub-optimal convergence were identified and rerun to obtain better convergence, the envelope results might have become even better. Particularly in the case of the simultaneous envelope, since users can specify the dimension of the response envelope, the method can leverage the relevant space of the response, while PCR, PLS and Xenv are constrained to operate only on the predictor space.

Furthermore, we have fixed the coefficient of determination (R²) as a constant throughout all the designs. Initial simulations (not shown) indicated that low R² affects all methods in a similar manner and that the MANOVA is highly dominated by R². Keeping the value of R² fixed has allowed us to analyze the other factors properly.

Two clear comments can be made about the effect of the correlation of the responses on the prediction methods. The highly correlated responses have shown the highest prediction error in general, and the effect is most distinct for the envelope methods. Since the envelope methods identify the relevant space as the span of the relevant eigenvectors, they are able to obtain the minimum average prediction error using fewer components for all levels of eta.

To our knowledge, the effect of correlation in the response on the PCR and PLS methods has been explored only to a limited extent. In this regard, it is interesting to see that these methods have applied a large number of components and returned a larger prediction error than the envelope methods in the case of highly correlated responses. To fully understand the effect of eta, it is necessary to study the estimation performance of these methods with different numbers of components.

In addition, since using principal components or actual variables as predictors in the envelope methods has shown similar results, we have used principal components that explained 97.5% of the variation, as mentioned previously, for the envelope methods in the designs where p > n. Using 97.5% is slightly arbitrary here, but for the chosen simulation designs this proportion captured a fair amount of the variation in the predictor variables and also reduced the dimension significantly, while enabling us to use the envelope methods in all settings. The analyst should choose this number to balance the explained amount of variation against a number of components which is practical for model fitting using the envelope model. The methodology used to adapt envelopes to settings in which p > n is, in fact, the same as that used by PLS: reduce by principal components, run the method, and then back-transform to the original scale. The minor relative impact of p shown in Fig. 7 suggests that this adaptation method is useful.

The results from this study will help researchers to understand the performance of these methods on various linear model data and encourage them to use newly developed methods such as the envelopes. Since this study has focused entirely on prediction performance, further analysis of the estimative properties of these methods is required. A study of the estimation error and of the performance of the methods with a non-optimal number of components can give a deeper understanding of these methods.

A shiny application [3] is available at http://therimalaya.shinyapps.io/Comparison where all the results related to this study can be visualized. In addition, a GitHub repository at https://github.com/therimalaya/03-prediction-comparison can be used to reproduce this study.

Acknowledgment

We are grateful to Inge Helland for his inputs on this paper throughout the period. His guidance on the envelope models and his review of the paper helped us greatly. Our gratitude also goes to Kristian Lillan, Ulf Indahl, Tormod Næs, Ingrid Måge and the team for

providing the data for analysis. We are also thankful to the reviewers for their comments, which helped us to improve this paper.

References

[1] A. Alin, Comparison of PLS algorithms when number of objects is much larger than number of variables, Stat. Pap. 50 (4) (2009) 711–720. https://doi.org/10.1007/s00362-009-0251-7.
[2] T. Almøy, A simulation study on comparison of prediction methods when only a few components are relevant, Comput. Stat. Data Anal. 21 (1) (1996) 87–107.
[3] W. Chang, J. Cheng, J. Allaire, Y. Xie, J. McPherson, Shiny: Web Application Framework for R. R Package Version 1.2.0, 2018. https://CRAN.R-project.org/package=shiny.
[4] R.D. Cook, An Introduction to Envelopes: Dimension Reduction for Efficient Estimation in Multivariate Statistics, first ed., John Wiley & Sons, Hoboken, NJ, 2018.
[5] R.D. Cook, I.S. Helland, Z. Su, Envelopes and partial least squares regression, J. R. Stat. Soc. Ser. B Stat. Methodol. 75 (5) (2013) 851–877.
[6] R.D. Cook, B. Li, F. Chiaromonte, Dimension reduction in regression without matrix inversion, Biometrika 94 (3) (2007) 569–584.
[7] R.D. Cook, B. Li, F. Chiaromonte, Envelope models for parsimonious and efficient multivariate linear regression, Stat. Sin. 20 (3) (2010) 927–1010.
[8] R.D. Cook, X. Zhang, Simultaneous envelopes for multivariate linear regression, Technometrics 57 (1) (2015) 11–25.
[9] R.D. Cook, X. Zhang, Algorithms for envelope estimation, J. Comput. Graph. Stat. 25 (1) (2016) 284–300.
[10] S. de Jong, SIMPLS: an alternative approach to partial least squares regression, Chemometr. Intell. Lab. Syst. 18 (3) (1993) 251–263.
[11] I.S. Helland, Partial least squares regression and statistical models, Scand. J. Stat. 17 (2) (1990) 97–114.
[12] I.S. Helland, Model reduction for prediction in regression models, Scand. J. Stat. 27 (1) (2000) 1–20.
[13] I.S. Helland, T. Almøy, Comparison of prediction methods when only a few components are relevant, J. Am. Stat. Assoc. 89 (426) (1994) 583–591.
[14] I.S. Helland, S. Sæbø, T. Almøy, R. Rimal, Model and estimators for partial least squares regression, J. Chemom. 32 (9) (2018), e3044.
[15] I.S. Helland, S. Sæbø, H.K. Tjelmeland, Near optimal prediction from relevant components, Scand. J. Stat. 39 (4) (2012) 695–713.
[16] U. Indahl, A twist to partial least squares regression, J. Chemom. 19 (1) (2005) 32–44.
[17] R. Johnson, D. Wichern, Applied Multivariate Statistical Analysis (Classic Version), Pearson Modern Classics for Advanced Statistics Series, Pearson Education Canada, 2018. https://books.google.no/books?id=QBqlswEACAAJ.
[18] I.T. Jolliffe, Principal Component Analysis, second ed., 2002.
[19] M. Lee, Z. Su, Renvlp: Computing Envelope Estimators. R Package Version 2.5, 2018. https://CRAN.R-project.org/package=Renvlp.
[20] B.-H. Mevik, R. Wehrens, K.H. Liland, pls: Partial Least Squares and Principal Component Regression. R Package Version 2.7-0, 2018. https://CRAN.R-project.org/package=pls.
[21] T. Næs, I.S. Helland, Relevant components in regression, Scand. J. Stat. 20 (3) (1993) 239–250.
[22] T. Naes, H. Martens, Comparison of prediction methods for multicollinear data, Commun. Stat. Simulat. Comput. 14 (3) (1985) 545–576.
[23] T. Næs, O. Tomic, N.K. Afseth, V. Segtnan, I. Måge, Multi-block regression based on combinations of orthogonalisation, PLS-regression and canonical correlation analysis, Chemometr. Intell. Lab. Syst. 124 (2013) 32–42.
[24] A.C. Rencher, Methods of Multivariate Analysis, vol. 492, John Wiley & Sons, 2003.
[25] R. Rimal, T. Almøy, S. Sæbø, A tool for simulating multi-response linear model data, Chemometr. Intell. Lab. Syst. 176 (2018) 1–10.
[26] S. Sæbø, T. Almøy, I.S. Helland, simrel - a versatile tool for linear model data simulation based on the concept of a relevant subspace and relevant predictors, Chemometr. Intell. Lab. Syst. 146 (2015) 128–135.
