Project 2021
Group members:
Kedir Adem Hussen (2055633)
Eduardo Luigi Miguel C. Lim (2055637)
Malai Nhim (2055640)
Lecturers:
Prof. Dr. Olivier Thas
Abstract
Over the past few years, prediction has become an increasingly important objective
within the field of statistics, now matching the importance placed on statistical infer-
ence. Predictive models are primarily assessed by the accuracy of their predictions.
Several methods have been proposed to estimate this prediction error. Despite this,
no consensus has been reached on a gold standard for prediction error estimation. This
paper compares three methods: the cross-validation and bootstrap methods, which are
resampling approaches, and the closely related test set approach, which uses an external
dataset. The test set approach was found to be the simplest but is rarely used in practice
due to issues with data availability. Cross-validation was found to suffer from issues in
implementation. Both cross-validation and bootstrap methods have proposed extensions
to their initial estimators to tackle underestimation. For classification problems in small
samples, cross-validation and the bootstrap performed equally well. Cross-validation may
be the more attractive solution due to the complexity of newer bootstrap methods.
Key Words: predictive models, prediction error estimation, test set approach, cross-validation, bootstrap
1 Introduction
The importance of prediction in modelling within the field of statistics has increased over
the past few years due in part to the advancement of technology and the rising popularity
of adjacent fields (e.g., machine learning and data science). Lacking statistical theory on
the properties of these predictions, assessment of the efficiency of these methods relies solely
on estimates of the prediction accuracy. Thus, several methods for estimating prediction
accuracy have been put forward, yet no clear consensus has been reached on a gold standard
for prediction accuracy estimation. Assessment also needs to be tempered by the type of
response being predicted, for example the difference between predicting the group membership
of a future observation and predicting the evolution of a particular characteristic of an
observation.
As the field of Statistics develops more novel methods for prediction, a critical review of
the methods of prediction assessment is a necessary consequence. This paper serves as an
(1) introduction to predictive modeling and its core elements and (2) an attempt to map
out current efforts to compare methods of prediction assessment. The paper is structured
as follows: The predictive modelling paradigm is briefly discussed and contrasted with
the inferential paradigm. A non-exhaustive list of methods of prediction assessment are
then discussed. Shortcomings and solutions used to resolve for these different estimation
methods are also presented. Finally, some papers comparing the different prediction error
estimation methods are reviewed to evaluate their performance on real and simulated data.
2 The Predictive Model
2.1 Historical Note
It is worth noting that the origin of predictive models is difficult to pinpoint, as several
methods with the same underlying principle emerged in parallel from different fields of
study (e.g., Computer Science, Physics, Signal Processing). Models constructed for
statistical inference are innately capable of prediction, which complicates the matter further.
This historical note on predictive modelling in statistics looks at the shift of focus from
statistical inference towards prediction due to several factors. Prior to its integration within
the field of statistics, the emergence of algorithms such as the neural net and decision trees
led various practitioners in different fields to pursue precise predictions based on these
methods (Breiman, 2001). This new-found flexibility allowed the exploration of real-life
data that were infeasible to fit with known parametric distributions.
Within the field of statistics, the development of inferential and predictive models began to
fork and evolve independently of each other. Efron & Hastie found that rapidly increasing
computing power resulted in renewed interest in the aforementioned algorithms; predictive
models began to be developed, divorced from the concept of classical statistical inference
(Efron & Hastie, 2021).
2.3 The Common Task Framework
In Donoho (2017), one of the most important pillars of the predictive modelling paradigm
was discussed. Briefly, the Common Task Framework (CTF) was devised as a means of
widespread collaboration and/or competition from members across industry and academia
as well as fair and unbiased judging of the capabilities of the models that were produced
by these entities. The main components of the CTF were a publicly available dataset, the
competitors, and a scoring referee independent of all the competitors. The progenitors of the
modern CTF originated in machine translation research, which at the time faced problems
with the objective evaluation of its theories and methods.
This framework stresses the empirical performance of such models with respect to some
scoring procedure. Through the CTF, model selection is facilitated across a broad class of
models. These scoring procedures are often concerned with prediction error which lends
even greater importance to an assessment of its estimation methods.
The expected test error is

E_{X∗}[Err(X∗, O)]

with the conditional test error

Err(X∗, O) = E_{Y∗|X∗}[(ŷ(X∗; O) − Y∗)² | X∗, O].
Efron (1983) classifies two broad statistical theories for assessing predictive capabilities:
covariance penalties and cross-validation. The former is a parametric solution to the
estimation problem based on the model, while the latter is a class of non-parametric solutions.
This paper focuses on the latter, illustrating common methods used to assess the predictive
model, such as cross-validation and the bootstrap, and a related method, the test set approach.
3.2 Cross-Validation
To solve the problem of not having external data for testing the model, k-fold cross-
validation (CV) uses part of the available data to fit the model and leaves the rest for
testing it (Hastie et al., 2009). The training data are partitioned into k non-overlapping
subsets, O = ∪_{j=1}^{k} O_j. Fold j is defined as the subset of observations excluded from
training the model and on which the model is then tested. The performance of the model
in predicting the observations in fold j is recorded as MSE_j. This procedure is repeated
for every fold j, and the final cross-validated estimator is obtained as the average of the
MSE_j over all folds. The k-fold CV estimator is given by:
CV(O) = (1/k) Σ_{j=1}^{k} MSE_j

where

MSE_j = (1/#O_j) Σ_{i∈O_j} (ŷ(x_i; O \ O_j) − y_i)²

and #O_j denotes the number of observations in fold j.
3.3 Bootstrap
The bootstrap is a general tool for assessing statistical accuracy which relies on random
sampling with replacement (Hastie et al., 2009).
Assume we have a model fit to a set of training data Z = (z_1, z_2, ..., z_N), where z_i =
(x_i, y_i). We draw B random datasets, each of size N, with replacement from the training
data; these are known as the bootstrap samples (replicates). The model is then fit to each
of the bootstrap replicates, and the behaviour of the fits is examined over the B replicates
(or bootstrap training data sets). For instance, let the parameter of interest S(Z) be the
mean squared error of a linear regression model. Within each replicate, S(Z∗) is computed,
and the B values S(Z∗) are used to assess the statistical accuracy of S(Z). The procedure is shown
in Figure 1.
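A minimal sketch of this scheme (our own illustration; S(Z) is taken to be the in-sample MSE of a least-squares line, as in the example above):

```python
import numpy as np

def regression_mse(x, y):
    """S(Z): in-sample mean squared error of a least-squares line fit."""
    slope, intercept = np.polyfit(x, y, 1)
    return np.mean((slope * x + intercept - y) ** 2)

def bootstrap_accuracy(x, y, stat, B=200, seed=0):
    """Draw B bootstrap replicates Z* (N rows sampled with replacement)
    and return the B values S(Z*), whose spread assesses the accuracy of S(Z)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    stats = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)   # one replicate: sample with replacement
        stats.append(stat(x[idx], y[idx]))
    return np.array(stats)

rng = np.random.default_rng(1)
x = rng.normal(size=80)
y = 1.5 * x + rng.normal(scale=0.4, size=80)
reps = bootstrap_accuracy(x, y, regression_mse)
print(regression_mse(x, y), reps.std())  # S(Z) and its bootstrap standard error
```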
The bootstrap procedure can also be used to estimate prediction error of a model. In
the simplest bootstrap procedure, each bootstrap replicate is considered as training data,
while the original data is considered as test data. The model will be fitted based on each
bootstrap replicate, and each fitted model will be applied to the original sample to provide
B estimates of prediction error. Then, the overall prediction error estimate is obtained by
averaging the B prediction error estimates. Let Q(y, fˆ) be a measure of error between the
response y and the prediction fˆ, then the simplest bootstrap prediction error estimate is
given by:
Êrr_boot = (1/(BN)) Σ_{b=1}^{B} Σ_{i=1}^{N} Q(y_i, f̂^{∗b}(x_i))

where f̂^{∗b}(x_i) is the predicted value at x_i from the model fitted to the b-th bootstrap
replicate.

Alternatively, each bootstrap sample can serve as both training and test data. In that
case, the prediction error can be calculated as follows:

Ê[err(x∗, F̂∗)] = E_{F̂∗}[Q(y∗_{ib}, f̂^{∗b}(x∗_i))] = (1/(BN)) Σ_{b=1}^{B} Σ_{i=1}^{N} Q(y∗_{ib}, f̂^{∗b}(x∗_i))

where f̂^{∗b}(x∗_i) is the predicted value at x∗_i based on the model estimated using the b-th
bootstrap sample, b = 1, 2, ..., B, and y∗_{ib} is the response value of the i-th observation
in the b-th bootstrap sample.
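The simplest estimator (train on each replicate, test on the original sample) can be sketched as follows (our own illustration, with squared-error loss as Q and a least-squares line as the model):

```python
import numpy as np

def err_boot(x, y, B=100, seed=0):
    """Simplest bootstrap prediction error estimate Err_boot.

    Each bootstrap replicate serves as training data; the ORIGINAL sample
    serves as test data. Squared-error loss plays the role of Q.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    total = 0.0
    for _ in range(B):
        idx = rng.integers(0, n, size=n)                     # replicate of size N
        slope, intercept = np.polyfit(x[idx], y[idx], 1)     # fit model f^{*b}
        total += np.mean((slope * x + intercept - y) ** 2)   # test on original data
    return total / B                                         # average over B replicates

rng = np.random.default_rng(2)
x = rng.normal(size=60)
y = x + rng.normal(scale=0.5, size=60)
print(round(err_boot(x, y), 3))
```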
However, the bootstrap methods mentioned above suffer from significant overlap between
the test and training data. This overlap results in overfit predictions (i.e., underestimation
of the prediction error). As a result, these methods were not found to perform well in
general (Efron & Tibshirani, 1994; Hastie et al., 2009).
To determine how much the true prediction error is underestimated, the difference
between the estimated prediction errors from the two methods above can be computed for
each bootstrap sample. This approach is known as the "optimism" or "refined" bootstrap
(Efron & Tibshirani, 1994). Similar to the previous two methods, the overall prediction
error estimate is computed by averaging these differences.
Another way to avoid the overlap is the leave-one-out bootstrap, which mimics
cross-validation: each observation i is predicted only by models fitted to bootstrap samples
that do not contain it. It is given as follows:

Êrr^{(1)} = (1/N) Σ_{i=1}^{N} (1/|C^{−i}|) Σ_{b∈C^{−i}} Q(y_i, f̂^{∗b}(x_i))

where C^{−i} is the set of indices of the bootstrap samples b that do not contain the i-th
observation, and |C^{−i}| is the number of such samples.
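This estimator can be sketched as follows (our own illustration, again with squared-error loss and a least-squares line; observations with an empty C^{−i} are simply skipped):

```python
import numpy as np

def err_loo_boot(x, y, B=200, seed=0):
    """Leave-one-out bootstrap Err^(1): observation i is predicted only by
    models fitted to bootstrap samples that do NOT contain i (the set C^{-i})."""
    rng = np.random.default_rng(seed)
    n = len(y)
    loss = np.zeros(n)
    count = np.zeros(n)
    for _ in range(B):
        idx = rng.integers(0, n, size=n)
        out = np.setdiff1d(np.arange(n), idx)  # observations absent from this replicate
        if out.size == 0:
            continue
        slope, intercept = np.polyfit(x[idx], y[idx], 1)
        loss[out] += (slope * x[out] + intercept - y[out]) ** 2
        count[out] += 1
    keep = count > 0  # average only over observations with non-empty C^{-i}
    return np.mean(loss[keep] / count[keep])

rng = np.random.default_rng(3)
x = rng.normal(size=60)
y = x + rng.normal(scale=0.5, size=60)
print(round(err_loo_boot(x, y), 3))
```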
The leave-one-out bootstrap, like the cross-validation method, suffers from training-
set-size bias. The bias can be introduced by the non-distinct observations in the bootstrap
samples that result from sampling with replacement. To alleviate this bias, Efron (1983)
proposed another estimator, known as the .632 estimator, which is defined as follows:

Êrr^{(.632)} = 0.368 · err̄ + 0.632 · Êrr^{(1)}

where err̄ is the training (apparent) error.
Efron & Tibshirani (1997) improved the .632 estimator by taking into account the
amount of overfitting. The newly designed estimator is given by:

Êrr^{(.632+)} = (1 − ŵ) · err̄ + ŵ · Êrr^{(1)}

where ŵ = 0.632/(1 − 0.368 R̂) ranges from 0.632 to 1 as the relative overfitting rate R̂
ranges from 0 to 1. Consequently, Êrr^{(.632+)} ranges from Êrr^{(.632)} to Êrr^{(1)}.
4 Discussion
4.1 Test Set Approach
Gütlein et al. (2013) compare two assessment methods, k-fold cross-validation and
external test set validation (the test set approach), using a large-scale experimental setup.
The experiment shows that, with cross-validation, the predictivity estimate deviates less
from the true performance on unknown compounds. Cross-validation underestimates true
predictability and is best suited to small datasets. When applying external test set
validation to small datasets, the tests suggest that splitting the external test set into strata
improves validation outcomes. However, if the distribution of unseen compounds differs
from the model-building and validation data, both methods fail to provide accurate
performance estimates. Nevertheless, this study suggests that cross-validation can be a useful
tool for assessing the predictability of a predictive model.
The test set approach is rarely used in practice since an external test set is rarely available.
Instead, the validation set approach is used, dividing the available dataset into two parts:
a training set and a validation test set. Apart from this, the validation set approach works
in the same fashion as the test set approach. Martens & Dardenne (1998) compare four
different assessment methods for multivariate predictive models, including the independent
validation set approach, (full) cross-validation, and two others, on both real and simulated
data. Multivariate calibration is used on each subject of the model. The study consists of
a Monte Carlo simulation within a large database built on real small data sets (40-120
objects). Martens & Dardenne (1998) conclude that the independent validation test set
was inefficient and unreliable, yielding too optimistic predictions of the future prediction
error. When part of the available objects was set aside as a separate validation set, the
true prediction error was substantially larger on average, and showed much more variability,
than when all of the available objects were used for calibration and full cross-validation
was used for estimation. Even worse, test set validation significantly underestimated this
true prediction error.
4.2 Cross-Validation
In situations involving high-dimensional data, cross-validation is not always used
correctly. Hastie et al. (2009) give an example in which cross-validation is applied after
selecting the predictors from the full set of samples. This gives the predictors in the final
model an unfair advantage: because they were chosen using all the samples, they have
already seen the samples left out for testing, so there is no completely independent test
set. To solve this problem, Hastie et al. (2009) describe the proper procedure: first divide
the samples into K cross-validation folds at random; then, for each fold k, choose the
predictors based on all the observations except those in fold k. The multivariate classifier
can then be built and used to predict the class labels for the samples in fold k. In this way,
the cross-validation estimate of the prediction error is the average of the error estimates
over all K folds.
This incorrect application can fail dramatically in a high-dimensional classification
situation (Hastie et al., 2009). Consider the following scenario: N = 20 samples in two
equal-sized classes, with p = 500 quantitative predictors independent of the class labels, so
that the true error rate of any classifier is 50%. A predictor chosen because it splits the full
data well will also split any 4/5 and 1/5 portions of the data well, so when 5-fold CV is
applied after this selection, the cross-validation error will be far below 50% and CV does
not provide a reliable error estimate.
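A small simulation in the spirit of this scenario (our own code; a one-dimensional nearest-centroid classifier and a correlation-based selection rule serve as simplified stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 500
X = rng.normal(size=(n, p))
y = np.repeat([0, 1], n // 2)      # labels independent of X: true error is 50%

def centroid_predict(x_tr, y_tr, x_te):
    """One-dimensional nearest-centroid classifier on a single feature."""
    m0, m1 = x_tr[y_tr == 0].mean(), x_tr[y_tr == 1].mean()
    return (np.abs(x_te - m1) < np.abs(x_te - m0)).astype(int)

def cv_error(select_inside):
    """5-fold CV error, selecting the best feature outside or inside the folds."""
    folds = np.array_split(rng.permutation(n), 5)
    if not select_inside:
        # WRONG: feature chosen using ALL samples, before cross-validation
        j = np.argmax(np.abs(np.corrcoef(X.T, y)[-1, :-1]))
    errs = []
    for te in folds:
        tr = np.setdiff1d(np.arange(n), te)
        if select_inside:
            # RIGHT: feature chosen from the training portion of each fold only
            j = np.argmax(np.abs(np.corrcoef(X[tr].T, y[tr])[-1, :-1]))
        pred = centroid_predict(X[tr, j], y[tr], X[te, j])
        errs.append(np.mean(pred != y[te]))
    return np.mean(errs)

wrong = np.mean([cv_error(select_inside=False) for _ in range(20)])
right = np.mean([cv_error(select_inside=True) for _ in range(20)])
print(wrong, right)  # the wrong way typically reports an error far below 50%
```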
According to Bates et al. (2021), in the special case of linear regression, the CV estimate
of error has a larger mean squared error (MSE) when evaluating the prediction error of the
final model than when estimating the average prediction error of models over numerous
unknown data sets. For confidence intervals of prediction error, the estimate of the
variance is too small and the intervals are too narrow; hence intervals based on CV can fail
badly, providing coverage far below the nominal level. To solve this problem, Bates et al.
(2021) created nested cross-validation (NCV), a modification of cross-validation that
provides coverage close to the nominal level, even in difficult cases where the standard
cross-validation intervals have mis-coverage rates two to three times the nominal rate.
4.3 Bootstrap
According to Efron & Tibshirani (1994) and Efron & Hastie (2021), the simple bootstrap
underestimates the true prediction error because each bootstrap sample has significant
overlap with the original data: approximately two-thirds of the original data points appear
in each bootstrap sample. The overlap problem also affects the modified (improved)
bootstrap described by Efron & Tibshirani (1994), in which each bootstrap sample serves as
both training and test data. To fix this problem, a method mimicking cross-validation
within the bootstrap was suggested: for each observation, only predictions from bootstrap
samples not containing that observation are kept (Efron & Tibshirani, 1994; Hastie et al.,
2009). This method, the leave-one-out bootstrap, solves the overfitting problem of the
earlier methods. Efron (1983) concluded that the Êrr^{(.632)} estimator outperformed all
competitors. However, the simulation study of Efron & Tibshirani (1997) revealed that
this estimator does not perform well in situations with a high amount of overfitting
(err̄ = 0). Thus, a new estimator, Êrr^{(.632+)}, was designed as a compromise between the
training error rate and the leave-one-out bootstrap estimate.
Molinaro et al. (2005) conducted a simulation study comparing resampling methods for
prediction error estimation in the presence of feature selection and discovered that, for
small samples, leave-one-out cross-validation (LOOCV), 10-fold CV, and the .632+
bootstrap have the smallest bias for diagonal discriminant analysis, nearest neighbors, and
classification trees. For linear discriminant analysis, the cross-validation methods (LOOCV
and 10-fold CV) and the .632+ bootstrap have the lowest mean squared error.
5 Concluding Remarks
In this paper, the increasing importance of prediction in statistics and the need for more
accurate assessment of predictive performance were discussed. Common methods of
prediction error estimation were presented and contrasted with one another.
Although the test set approach greatly simplifies the calculation of the test error and
has attractive properties, it encounters problems with data availability. When an external
data set is not available, the validation set approach is an alternative with an essentially
identical procedure. However, these methods lead to a loss of efficiency when estimating the
prediction error, and in small-sample settings cross-validation assesses predictive
performance better than the test set approach. Therefore, these approaches are rarely used and
resampling approaches are favored.
Resampling approaches can be seen as a solution to the problems of the test set
approach. Cross-validation is one of the most commonly used methods for assessing a
predictive model. Nevertheless, it can be applied incorrectly, leading to dependence between
the training data and the validation set; close attention to the procedure is advised when
applying this method. Note also that CV does not give an accurate estimate of the error
in all cases: in some special cases it has led to confidence intervals that are too narrow,
and a nested extension was needed to correct for this underestimation.
The bootstrap is broadly used in statistics. It relies on random sampling with
replacement and is also used to estimate the prediction error of a model. A simple bootstrap
method can underestimate the true prediction error due to the overlap of each bootstrap
sample with the original data. This is corrected by introducing cross-validation ideas into
the bootstrap approach. However, this correction does not work well with a high amount
of overfitting. In a small-sample simulation study, leave-one-out cross-validation and the
.632+ bootstrap showed similar performance. Nevertheless, due to the complexity of the
derivation of the later bootstrap methods, cross-validation provides a simpler, more
attractive approach for estimating the prediction error.
Although this paper focused on prediction error assessment, a number of open debates
remain in predictive modelling. One such debate concerns the appropriateness of the
"black box" approach with which most predictive models are constructed. Rodu & Baiocchi
(2021) discuss shortcomings of the CTF and the need to shift the focus from the specific
task around which an algorithm is constructed to the features of a specific research problem.
This shift may be a way to bridge gaps in understanding for people who are unfamiliar
with these algorithms but have problems that need to be tackled by them. Such
considerations are the next logical step as one moves from choosing a best model to deploying it
in real-world applications and sharing the results with non-technical audiences.
References
Bates, S., Hastie, T., & Tibshirani, R. (2021). Cross-validation: what does it estimate and
how well does it do it? arXiv preprint arXiv:2104.00673 .
Breiman, L. (2001). Statistical modeling: the two cultures. Statistical Science, 16 (3), 199–
231. Retrieved from http://dx.doi.org/10.1214/ss/1009213726 (With comments
and a rejoinder by the author) doi: 10.1214/ss/1009213726
Efron, B. (1983). Estimating the error rate of a prediction rule: Improvement on cross-
validation. Journal of the American Statistical Association, 78 (382), 316–331. doi:
10.1080/01621459.1983.10477973
Efron, B., & Hastie, T. (2021). Computer age statistical inference: Algorithms, evidence,
and data science. Cambridge University Press. doi: 10.1017/CBO9781316576533
Efron, B., & Tibshirani, R. (1997). Improvements on cross-validation: the 632+ bootstrap
method. Journal of the American Statistical Association, 92 (438), 548–560.
Efron, B., & Tibshirani, R. J. (1994). An introduction to the bootstrap. CRC press.
Gütlein, M., Helma, C., Karwath, A., & Kramer, S. (2013). A large-scale empirical evalua-
tion of cross-validation and external test set validation in (q) sar. Molecular Informatics,
32 (5-6), 516–528.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning:
Data mining, inference, and prediction. Springer.
Martens, H. A., & Dardenne, P. (1998). Validation and verification of regression in small
data sets. Chemometrics and intelligent laboratory systems, 44 (1-2), 99–121.
Molinaro, A. M., Simon, R., & Pfeiffer, R. M. (2005). Prediction error estimation: a
comparison of resampling methods. Bioinformatics, 21 (15), 3301–3307.
Rodu, J., & Baiocchi, M. (2021). When black box algorithms are (not) appropriate: a
principled prediction-problem ontology.