Project 2021


Evaluating Prediction Assessment:

An Introduction and Comparisons

Inference for Statistics and Data Science (4564)


2021-2022

2nd year Master of Statistics


Hasselt University

Group members:
Kedir Adem Hussen (2055633)
Eduardo Luigi Miguel C. Lim (2055637)
Malai Nhim (2055640)

Submission Date: 10 December, 2021

Lecturers:
Prof. Dr. Olivier Thas
Abstract
Over the past few years, prediction has become an increasingly important objective
within the field of statistics, now matching the importance placed on statistical inference.
Predictive models are primarily assessed by the accuracy of their predictions, and several
methods have been proposed to estimate this prediction error. Despite this, no consensus
has been reached on a gold standard for prediction error estimation. This paper compares
three methods: the cross-validation and bootstrap methods, which are resampling
approaches, and the closely related test set approach, which uses an external dataset. The
test set approach was found to be the simplest but is rarely used in practice due to issues
with data availability. Cross-validation was found to suffer from pitfalls in its
implementation. Both the cross-validation and bootstrap methods have proposed extensions
to their initial estimators to tackle underestimation. For classification problems in small
samples, cross-validation and bootstrap performed equally well; cross-validation may
nevertheless be the more attractive solution due to the complexity of the newer bootstrap
methods.

Key Words: predictive models, prediction error estimation, test set approach, cross-validation, bootstrap

1 Introduction
The importance of prediction in modelling within the field of statistics has increased over
the past few years due in part to the advancement of technology and the rising popularity
of adjacent fields (i.e., machine learning and data science). In the absence of statistical
theory on the properties of these predictions, assessment of the efficiency of these methods
relies solely on estimates of the prediction accuracy. Thus, several methods for estimating
prediction accuracy have been put forward, but no clear consensus has been reached on a
gold standard for prediction accuracy estimation. Assessment also needs to be tailored to
the response being predicted, for example the difference between predicting the
classification of a future observation into a group and predicting the evolution of a
particular characteristic of an observation.
As the field of Statistics develops more novel methods for prediction, a critical review of
the methods of prediction assessment is a necessary consequence. This paper serves as an
(1) introduction to predictive modeling and its core elements and (2) an attempt to map
out current efforts to compare methods of prediction assessment. The paper is structured
as follows: The predictive modelling paradigm is briefly discussed and contrasted with
the inferential paradigm. A non-exhaustive list of methods of prediction assessment is
then discussed. Shortcomings of these estimation methods, and solutions used to resolve
them, are also presented. Finally, some papers comparing the different prediction error
estimation methods are reviewed to evaluate their performance on real and simulated data.

2 The Predictive Model
2.1 Historical Note
It is worth noting that the origins of predictive models are difficult to pinpoint, as several
methods emerged in parallel from different fields of study (e.g., computer science, physics,
signal processing) with the same underlying principle. Models constructed for statistical
inference are innately capable of prediction, which complicates the matter further.
This historical note on predictive modelling in statistics looks at the shift from a focus on
statistical inference towards prediction due to several factors. Prior to its integration within
the field of statistics, the emergence of algorithms such as neural networks and decision trees
led various practitioners in different fields to pursue precise predictions based on these
methods (Breiman, 2001). This new-found flexibility allowed the exploration of real-life
data that were infeasible to fit with known parametric distributions.
Within the field of statistics, the development of inferential and predictive models began to
fork and evolve independently of each other. Efron & Hastie (2021) note that rapidly
increasing computing power resulted in renewed interest in the aforementioned algorithms;
predictive models began to be developed, divorced from the concept of classical statistical
inference.

2.2 Difference from Inference


Material differences between the inferential and predictive paradigm lie mainly in the objec-
tive of the respective models (Thas, 2021). Inferential models aim to uncover associations
between the response and a pool of regressors. Under the inferential paradigm, the
significance of and uncertainty about these associations are of key importance; these are
reflected in the standard tools of such models: hypothesis tests and confidence intervals.
The predictive paradigm's main objective, on the other hand, is to accurately predict
future outcomes using a set of predictors. Most predictive models operate as a "black
box": associations between the predictors and the outcome are not actually uncovered.
Predictive models are often not concerned with individual predictors and how they relate
to the outcome, but rather with an optimal selection of predictors with which prediction
is most accurate.
Predictive models often lack the optimality properties that were derived for specific
estimators of inferential models. Concepts such as unbiasedness and the bias-variance
trade-off, and their respective theories, are constrained to inferential models. As a direct
consequence of this, however, predictive models are often extremely flexible; no
distributional assumptions on the conditional response are necessary to build a good
predictive model. However, this also prompted the need for ways to compare and rank
different predictive models that do not rely on statistical theory.

2.3 The Common Task Framework
In Donoho (2017), one of the most important pillars of the predictive modelling paradigm
was discussed. Briefly, the Common Task Framework (CTF) was devised as a means of
widespread collaboration and/or competition from members across industry and academia
as well as fair and unbiased judging of the capabilities of the models that were produced
by these entities. The main components of the CTF were a publicly available dataset, the
competitors, and a scoring referee independent of all the competitors. Progenitors of the
modern CTF originated in machine translation research and the problems it then faced
with the objective evaluation of its theories and methods.
This framework stresses the empirical performance of such models with respect to some
scoring procedure. Through the CTF, model selection is facilitated across a broad class of
models. These scoring procedures are often concerned with prediction error which lends
even greater importance to an assessment of its estimation methods.

3 Predictive Model Assessment


Predictive models are assessed by the magnitude of agreement between their prediction
and the observed values (i.e., how small their prediction error is). As this true prediction
error is unobservable, estimators must be used to assess model performance. The in-sample
or apparent error, a natural estimator for the true prediction error, was found to be too
optimistic as it is minimized explicitly by the construction of the model and the same data is
used to construct and evaluate the model. The test error, also known as the generalization
error, is a measure of how well a model generalizes to new data. It is found to be the more
appropriate estimator for the true prediction error. The assessment of the generalization
error relates to the model’s predictive capabilities on independent test data and is extremely
important because it measures the predictive quality of the chosen model (Hastie et al.,
2009). Where O is the training dataset, X* is a future predictor value, and Y* is the
future outcome, the test or generalization error (also known as the out-of-sample error) is
defined by:

\[
\mathrm{Err}(O) = E_{Y^* X^*}\big\{(\hat{y}(X^*; O) - Y^*)^2 \mid O\big\} = E_{X^*}\{\mathrm{Err}(X^*, O)\},
\]

with the conditional test error

\[
\mathrm{Err}(X^*, O) = E_{Y^* \mid X^*}\big\{(\hat{y}(X^*; O) - Y^*)^2 \mid X^*, O\big\}.
\]
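
To make the distinction between the apparent error and the generalization error concrete, the following minimal sketch (in Python with numpy; not part of the original text) approximates Err(O) by Monte Carlo for an ordinary least squares fit on a simulated linear model. The data-generating model, sample sizes, and all names are illustrative assumptions.

# Illustrative sketch: approximating Err(O) by Monte Carlo for an OLS fit on a
# simulated linear model, and comparing it with the optimistic apparent error.
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 30, 5, 1.0
beta = rng.normal(size=p)

def simulate(n):
    X = rng.normal(size=(n, p))
    y = X @ beta + sigma * rng.normal(size=n)
    return X, y

# Training data O and the fitted prediction rule y_hat(.; O)
X, y = simulate(n)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Apparent (in-sample) error: the same data are used to fit and to evaluate
apparent = np.mean((X @ coef - y) ** 2)

# Err(O): average squared error over a large independent draw of (X*, Y*)
X_new, y_new = simulate(100_000)
err_O = np.mean((X_new @ coef - y_new) ** 2)

print(f"apparent error {apparent:.3f}  vs  generalization error {err_O:.3f}")
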

Efron (1983) classifies two broad statistical theories to assess predictive capabilities:
covariance penalties and cross-validation. The former is a parametric, model-based solution
to the estimation problem, while the latter is a class of non-parametric solutions.
This paper focuses on the latter, illustrating common methods used to assess the predictive
model such as cross-validation and bootstrap and a related method, the test set approach.

3.1 Test Set Approach


An alternative method to estimate the test error is the test set approach. Two data sets
are required for this approach: one for training the model and one for testing the model.
The model is trained only with the training data while the test error is estimated only from
the independent test data. This estimate is now a function of both the training and test
data. Suppose that the test data O_T, with m observations, are available. The test error is
estimated by:

\[
\mathrm{Err}_{\text{test}}(O, O_T) = \frac{1}{m}\sum_{i=1}^{m} \big(\hat{y}(x_i; O) - y_i\big)^2.
\]
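
A minimal sketch of this estimator follows (Python with numpy, assumed here since the paper includes no code); predict, X_test, and y_test are hypothetical names for the fitted prediction rule and the external test data.

# Sketch of Err_test(O, O_T): mean squared prediction error of the rule
# trained on O, evaluated on an independent test set O_T of size m.
import numpy as np

def test_set_error(predict, X_test, y_test):
    # predict: fitted prediction rule y_hat(.; O); (X_test, y_test): test data O_T
    residuals = y_test - predict(X_test)
    return float(np.mean(residuals ** 2))

# Example usage with an OLS coefficient vector 'coef' obtained from training data:
# err_test = test_set_error(lambda X: X @ coef, X_test, y_test)
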

3.2 Cross-Validation
To solve the problem of not having external data for testing the model, k-fold cross-
validation (CV) uses part of the available data to train the model and the remaining part
to test it (Hastie et al., 2009). The training data are partitioned into k non-overlapping
subsets, O = \cup_{j=1}^{k} O_j. Fold j is the subset O_j of observations excluded from
training; the model trained on the remaining data is then used to predict the observations
in fold j. This procedure is carried out for every fold j, and the final cross-validated
estimator is obtained as the average of MSE_j over all folds. The k-fold CV estimator is
given by:

\[
CV(O) = \frac{1}{k}\sum_{j=1}^{k} MSE_j,
\]

where

\[
MSE_j = \frac{1}{\#O_j} \sum_{i \in O_j} \big(\hat{y}(x_i; O \setminus O_j) - y_i\big)^2.
\]
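
As an illustration, the following sketch (Python with numpy) implements the k-fold CV estimator described above, with an ordinary least squares fit standing in for an arbitrary prediction rule; all names are illustrative.

# Sketch of the k-fold CV estimator with an OLS fit as the prediction rule.
import numpy as np

def kfold_cv_mse(X, y, k=5, seed=0):
    n = len(y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), k)
    mse = []
    for test_idx in folds:                                   # fold j = O_j
        train_idx = np.setdiff1d(np.arange(n), test_idx)     # training data O \ O_j
        coef, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        mse.append(np.mean((X[test_idx] @ coef - y[test_idx]) ** 2))  # MSE_j
    return float(np.mean(mse))                               # CV(O)
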

3.3 Bootstrap
The bootstrap is a general tool for assessing statistical accuracy which relies on random
sampling with replacement (Hastie et al., 2009).
Assume we have a model fit to a set of training data Z = (z_1, z_2, ..., z_N), where
z_i = (x_i, y_i). We draw B random datasets, each of size N, with replacement from the
training data; these are known as the bootstrap samples (replicates). The model is then fit
to each of the bootstrap replicates, and the behaviour of the fits is examined over the B
replicates (the bootstrap training data). For instance, let S(Z) be the mean squared error
of the linear regression model and the parameter of interest. Within each replicate, S(Z*)
is computed, and the B values of S(Z*) are used to assess the statistical accuracy of S(Z).
The procedure is shown in Figure 1.

Figure 1: Schematic of the bootstrap process (from Hastie et al., 2009)

The bootstrap procedure can also be used to estimate prediction error of a model. In
the simplest bootstrap procedure, each bootstrap replicate is considered as training data,
while the original data is considered as test data. The model will be fitted based on each
bootstrap replicate, and each fitted model will be applied to the original sample to provide
B estimates of prediction error. Then, the overall prediction error estimate is obtained by
averaging the B prediction error estimates. Let Q(y, f̂) be a measure of error between the
response y and the prediction f̂; the simplest bootstrap prediction error estimate is then
given by:

\[
\widehat{\mathrm{Err}}_{\mathrm{boot}} = \frac{1}{B}\frac{1}{N}\sum_{b=1}^{B}\sum_{i=1}^{N} Q\big(y_i, \hat{f}^{*b}(x_i)\big),
\]

where f̂^{*b}(x_i) is the predicted value at x_i from the model fitted to the b-th bootstrap
replicate, b = 1, 2, ..., B, and y_i is the response value of the i-th observation.


In regression analysis, Q(y, f̂) = (y − f̂)^2 is often chosen. In classification analysis, the
misclassification indicator function, Q(y, f̂) = I(y ≠ f̂), is used (Efron & Tibshirani, 1994).
This indicator takes the value 1 when the prediction misclassifies the observation and 0
otherwise.
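
The following sketch (Python with numpy; an ordinary least squares model and squared-error loss are assumed purely for illustration) implements this simplest bootstrap estimate.

# Sketch of the simplest bootstrap estimate Err_boot: fit on each replicate,
# evaluate squared-error loss Q on the original data, and average over B.
import numpy as np

def simple_bootstrap_error(X, y, B=200, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    errs = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)              # bootstrap replicate (sampled with replacement)
        coef, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        errs.append(np.mean((X @ coef - y) ** 2))     # Q(y_i, f_hat^{*b}(x_i)) on the original data
    return float(np.mean(errs))                       # Err_boot
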
Efron & Tibshirani (1994) suggested that, instead of the original dataset, each bootstrap
replicate could serve as both training and test data. The average of these B estimates then
gives the overall estimate of prediction error. The estimated prediction errors from this
method are on average lower than from the first method. The average estimated prediction
error can be calculated as follows:

\[
\hat{E}\big[\mathrm{err}(x^*, \hat{F}^*)\big] = E_{\hat{F}^*}\!\left[\frac{1}{N}\sum_{i=1}^{N} Q\big(y^*_{ib}, \hat{f}^{*b}(x^*_{ib})\big)\right] = \frac{1}{B}\frac{1}{N}\sum_{b=1}^{B}\sum_{i=1}^{N} Q\big(y^*_{ib}, \hat{f}^{*b}(x^*_{ib})\big),
\]

where f̂^{*b}(x^*_{ib}) is the predicted value at x^*_{ib} based on the model estimated from
the b-th bootstrap sample, b = 1, 2, ..., B, and y^*_{ib} is the response value of the i-th
observation in the b-th bootstrap sample.
However, the bootstrap methods mentioned above suffer from significant overlap between
the test and training data. This overlap results in overfitting (i.e., underestimation of the
prediction error). As a result, these methods were not found to be good in general (Efron
& Tibshirani, 1994; Hastie et al., 2009).
To determine how much the true prediction error is underestimated, the difference between
the prediction error estimates from the two methods can be computed for each bootstrap
sample. This quantity, the optimism, is estimated by averaging these differences over the
bootstrap samples; adding the estimated optimism to the apparent error gives the so-called
"optimism" or "refined" bootstrap estimate (Efron & Tibshirani, 1994). The estimated
optimism is given as follows:

\[
\omega(\hat{F}) = E_{\hat{F}}\big[\mathrm{err}(x^*, \hat{F}) - \mathrm{err}(x^*, \hat{F}^*)\big] = \frac{1}{B}\frac{1}{N}\left(\sum_{b=1}^{B}\sum_{i=1}^{N} Q\big(y_i, \hat{f}^{*b}(x_i)\big) - \sum_{b=1}^{B}\sum_{i=1}^{N} Q\big(y^*_{ib}, \hat{f}^{*b}(x^*_{ib})\big)\right)
\]
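
A minimal sketch of the optimism ("refined") bootstrap follows (Python with numpy, with an OLS model and squared-error loss assumed for illustration); the estimated optimism is added to the apparent error, as described above.

# Sketch of the optimism ("refined") bootstrap: average, over replicates, the gap
# between the error on the original data and the error on the replicate itself,
# then add that estimated optimism to the apparent error.
import numpy as np

def refined_bootstrap_error(X, y, B=200, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    coef_full, *_ = np.linalg.lstsq(X, y, rcond=None)
    apparent = np.mean((X @ coef_full - y) ** 2)            # apparent error err_bar
    optimism = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)                    # bootstrap replicate
        coef, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        err_orig = np.mean((X @ coef - y) ** 2)             # replicate model on the original data
        err_boot = np.mean((X[idx] @ coef - y[idx]) ** 2)   # replicate model on its own sample
        optimism.append(err_orig - err_boot)
    return float(apparent + np.mean(optimism))              # refined bootstrap estimate
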
Another method was proposed to address the overfitting issue in these two bootstrap
methods. In this method, a predicted value is computed for each observation using only
the bootstrap samples in which that observation does not appear. The leave-one-out
bootstrap estimate of prediction error is given as follows:

\[
\widehat{\mathrm{Err}}^{(1)} = \frac{1}{N}\sum_{i=1}^{N} \frac{1}{|C^{-i}|}\sum_{b \in C^{-i}} Q\big(y_i, \hat{f}^{*b}(x_i)\big),
\]

where C^{-i} is the set of indices of the bootstrap samples b that do not contain the i-th
observation, and |C^{-i}| is the number of such samples.
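
The following sketch (Python with numpy, again assuming an OLS model and squared-error loss) implements the leave-one-out bootstrap estimate; observations that happen to appear in every replicate are simply skipped.

# Sketch of the leave-one-out bootstrap estimate Err^(1): each observation is
# predicted only from replicates that do not contain it.
import numpy as np

def loo_bootstrap_error(X, y, B=200, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    samples = [rng.integers(0, n, size=n) for _ in range(B)]
    coefs = [np.linalg.lstsq(X[idx], y[idx], rcond=None)[0] for idx in samples]
    per_obs = []
    for i in range(n):
        C_i = [b for b in range(B) if i not in samples[b]]   # replicates not containing i
        if not C_i:
            continue                                         # skip if i appears in every replicate
        per_obs.append(np.mean([(X[i] @ coefs[b] - y[i]) ** 2 for b in C_i]))
    return float(np.mean(per_obs))
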
The leave-one-out bootstrap, like the cross-validation method, suffers from training-
set-size bias. This bias is introduced by the non-distinct observations in the bootstrap
samples that result from sampling with replacement. To alleviate this bias, Efron (1983)
proposed another estimator, known as the .632 estimator, which is defined as follows:

\[
\widehat{\mathrm{Err}}^{(.632)} = 0.368\,\overline{\mathrm{err}} + 0.632\,\widehat{\mathrm{Err}}^{(1)},
\qquad
\overline{\mathrm{err}} = \frac{1}{N}\sum_{i=1}^{N}\big(y_i - \hat{f}(x_i)\big)^2,
\]

where err̄ is the apparent (training) error computed on the full training data. This
estimator corrects the upward bias of Err^(1) by averaging it with the downwardly biased
apparent error.

Efron & Tibshirani (1997) improved the .632 estimator by taking into account the
amount of overfitting . The new designed estimator is given by:

ˆ (.632+) = (1 − ŵ)err
Err ¯ (1)
¯ + ˆ(w)Err
0.632
where ŵ = ranges from 0.632 to 1 as the relative overfitting rate R̂ range from 0
1−0.368R̂
ˆ (.632+) ranges from Err
to 1. Consequently, the Err ˆ (.632) to Err
ˆ (1) .
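
The sketch below (Python with numpy) computes the .632 and .632+ corrections from the apparent error and a leave-one-out bootstrap estimate. The relative overfitting rate R̂ is not defined in the text above; the formula used here, based on a "no-information" error rate estimated from all response/prediction pairs, follows Efron & Tibshirani (1997) for squared-error loss, and the function and argument names are illustrative.

# Sketch of the .632 and .632+ estimators for squared-error loss.
import numpy as np

def err_632_and_632_plus(y, pred_train, err1):
    # y: training responses; pred_train: fitted values f_hat(x_i) on the training data
    # err1: leave-one-out bootstrap estimate Err^(1)
    err_bar = np.mean((y - pred_train) ** 2)               # apparent (training) error
    err_632 = 0.368 * err_bar + 0.632 * err1               # .632 estimator
    # no-information error rate: average loss over all (y_i, y_hat_j) pairs
    gamma = np.mean((y[:, None] - pred_train[None, :]) ** 2)
    denom = gamma - err_bar
    R_hat = (err1 - err_bar) / denom if denom != 0 else 0.0
    R_hat = float(np.clip(R_hat, 0.0, 1.0))                # relative overfitting rate in [0, 1]
    w_hat = 0.632 / (1.0 - 0.368 * R_hat)                  # weight between 0.632 and 1
    err_632_plus = (1.0 - w_hat) * err_bar + w_hat * err1  # .632+ estimator
    return err_632, err_632_plus
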

4 Discussion
4.1 Test Set Approach
Gütlein et al. (2013) compare two assessment methods, k-fold cross-validation and external
test set validation (the test set approach), using a large-scale experimental setup. Their
experiments show that, when cross-validation is used, the deviation of the predictivity
estimate from the true performance on unknown compounds is less variable.
Cross-validation underestimates the true predictability and is best suited to small datasets.
When applying external test set validation to small datasets, their tests suggest that
splitting the external test set into strata improves validation outcomes. However, if the
distribution of unseen compounds differs from that of the model-building and validation
data, both methods fail to provide accurate performance estimates. Nevertheless, this
study suggests that cross-validation can be a useful tool for assessing the predictability of
a predictive model.
The test set approach is rarely used in practice since an external test set is seldom available.
Instead, the validation set approach is used, dividing the available dataset into two parts:
a training set and a validation set. Apart from this split, the validation set approach works
in the same fashion as the test set approach. Martens & Dardenne (1998) compare four
different assessment methods for multivariate predictive models, including the independent
validation set approach and (full) cross-validation, using both real and simulated data.
Multivariate calibration is applied in each case, and the study consists of a Monte Carlo
simulation within a large database built from real, small data sets (40–120 objects).
Martens & Dardenne (1998) conclude that the independent validation test set was
inefficient and unreliable, yielding too optimistic predictions of the future prediction error.
When part of the available objects was set aside as a separate validation set, the true
prediction error was substantially larger on average, and showed much more variability,
than when all of the available objects were used for calibration and full cross-validation
was used for estimation. Even worse, test set validation significantly underestimated the
true prediction error.

4.2 Cross-Validation
In situations involving high-dimensional data, cross-validation is not always used in a
correct way. Hastie et al. (2009) give an example in which cross-validation is applied only
after the predictors have been selected using all of the samples. The predictors in the final
model then have an unfair advantage: because they were chosen from all of the samples,
they have already "seen" the samples that are later left out for testing, so there is no
completely independent test set. To solve this problem, Hastie et al. (2009) propose the
proper procedure: first divide the sample into K cross-validation folds at random and, for
each fold k, choose the predictors based on all the observations except those in fold k. The
multivariate classifier can then be built on the same data and used to predict the class
labels for the samples in fold k. By doing this, the cross-validation estimate of the
prediction error is obtained by averaging the error estimates over all K folds.
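
The sketch below (Python with numpy) illustrates the correct procedure: the predictor screening is redone inside each fold, using only the observations outside that fold. The screening rule (absolute covariance with the class label), the nearest-centroid classifier, and all names are illustrative assumptions, not the classifier used by Hastie et al. (2009).

# Sketch of cross-validation with predictor selection done INSIDE each fold.
import numpy as np

def cv_error_with_selection(X, y, k=5, n_keep=10, seed=0):
    # X: (n, p) predictors; y: binary labels in {0, 1}
    n = len(y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), k)
    errors = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        Xtr, ytr = X[train_idx], y[train_idx]
        # screen predictors using the training portion of this fold ONLY
        score = np.abs(((Xtr - Xtr.mean(0)) * (ytr - ytr.mean())[:, None]).mean(0))
        keep = np.argsort(score)[-n_keep:]
        # nearest-centroid classification of the held-out fold on the kept predictors
        mu0 = Xtr[ytr == 0][:, keep].mean(axis=0)
        mu1 = Xtr[ytr == 1][:, keep].mean(axis=0)
        Xte = X[test_idx][:, keep]
        d = np.stack([np.linalg.norm(Xte - mu0, axis=1),
                      np.linalg.norm(Xte - mu1, axis=1)], axis=1)
        pred = np.argmin(d, axis=1)                  # class with the nearer centroid
        errors.append(np.mean(pred != y[test_idx]))
    return float(np.mean(errors))                    # CV estimate of the misclassification error

If the screening were instead done once on the full data before splitting into folds, the selected predictors would have already seen the held-out samples and the resulting error estimate would be biased downward, as described above.
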
Even when applied correctly, cross-validation is sometimes suspected of failing in
high-dimensional classification problems (Hastie et al., 2009). Consider the following
scenario: N = 20 samples in two equal-sized classes, with p = 500 quantitative predictors
that are independent of the class labels, so that the true error rate of any classifier is 50%.
A predictor that separates the classes well in the full data should also separate both the
4/5 training portion and the 1/5 test portion of each fold in 5-fold CV, which suggests that
the cross-validation error would be far below 50% and that CV cannot provide a reliable
error estimate. Hastie et al. (2009) show, however, that this argument is flawed: when the
predictor is selected anew within each fold, as in the correct procedure above, the CV
estimate is in fact close to the true 50% error rate.
According to Bates et al. (2021), in the special case of linear regression, the CV estimate
of error has a larger mean squared error (MSE) when used to estimate the prediction error
of the final model than when used to estimate the average prediction error of models
trained on numerous hypothetical data sets. For confidence intervals of prediction error,
the CV-based estimate of variance is too small and the intervals are too narrow; intervals
based on CV can therefore fail badly, providing coverage far below the nominal level. To
solve this problem, Bates et al. (2021) developed nested cross-validation (NCV), a
modification of cross-validation that provides coverage close to the nominal level, even in
difficult cases where the standard cross-validation intervals have mis-coverage rates two to
three times higher than the nominal rate.

4.3 Bootstrap
According to Efron & Tibshirani (1994) and Efron & Hastie (2021), simple bootstrap leads
to underestimate the true prediction error as each bootstrap sample has significant overlap
with the original data. Approximately about two-thirds of the original data points appear
in each bootstrap sample. The overlapping problem is also a case for the modified (im-
proved) bootstrap in which each bootstrap serves as both training and test data sets, which
is described by Efron & Tibshirani (1994). To fix this problem, a method by mimicking
cross-validation in a bootstrap approach was suggested in such a way that by keeping track
of predictions for each bootstrap from bootstrap samples not containing that observation

8
(Efron & Tibshirani, 1994; Trevor Hastie, 2009). This method, leave-one-out bootstrap,
solve the overfitting problem in these methods. Efron (1983) conclude that the Err ˆ 0.632
estimator outperformed all competitors. However, the simulation study by Efron & Tib-
shirani (1997) revealed that this estimator did not perform better in situations with a high
amount of overfitting, err
ˆ = 0. Thus, new estimator, Err ˆ (.632+) , has been designed in order
to compromise between the training error rate and the leave-out-bootstrap estimate.
Molinaro et al. (2005) conducted a simulation study comparing resampling methods for
prediction error estimation in the presence of feature selection and found that, for small
samples, leave-one-out cross-validation (LOOCV), 10-fold CV, and the .632+ bootstrap
have the smallest bias for diagonal discriminant analysis, nearest neighbour, and
classification trees. For linear discriminant analysis, the cross-validation methods (LOOCV
and 10-fold CV) and the .632+ bootstrap have the lowest mean squared error.

5 Concluding Remarks
In this paper, the increasing importance of prediction in statistics and the need for more
accurate assessment of predictive performance were discussed. Common methods of
prediction error estimation were presented and contrasted with one another.
Although the test set approach greatly simplifies the calculation of the test error and
has attractive properties, it encounters problems with regard to data availability. When an
external data set is not available, the validation set approach is an alternative that follows
an identical procedure. However, these methods lead to a loss of efficiency when estimating
the prediction error, and in small-sample settings cross-validation assesses the performance
of the model prediction better than the test set approach. These approaches are therefore
rarely used, and resampling approaches are favored.
Resampling approaches can be seen as a solution to the problems of the test set
approach. Cross-validation is one of the most commonly used methods for assessing a
predictive model. Nevertheless, it can be applied incorrectly, which makes the validation
folds dependent on the model-building steps; close attention to the procedure is therefore
advised when applying this method. It should also be noted that CV does not give an
accurate estimate of the error in all cases: in some special cases it has led to confidence
intervals that are too narrow, and a nested extension was needed to correct for this
underestimation.
The bootstrap is broadly used in the statistical field and relies on random sampling with
replacement. It is also used to estimate the prediction error of a model. The simple
bootstrap method can underestimate the true prediction error due to the overlap of the
original data with each bootstrap sample; this is corrected by introducing cross-validation
ideas into the bootstrap approach. However, this correction does not work well with a high
amount of overfitting. In a small-sample simulation study, leave-one-out cross-validation
and the .632+ bootstrap showed similar performance. Nevertheless, due to the complexity
of the derivation of the later bootstrap methods, cross-validation provides a simpler, more
attractive approach for estimating prediction error.
Although this paper focused on prediction error assessment, there are still a number
of hot-button topics regarding predictive modelling. One such topic is a debate on the
appropriateness of the “black box” approach with which most predictive models are con-
structed. Rodu & Baiocchi (2021) discuss shortcomings of the CTF and a need for a shift
from focusing on a specific task whereabout the algorithm is constructed to one where the
features of a specific research problem are the focus. This shift may be a way to bridge gaps
in understanding for people unfamiliar with these algorithms but have problems that need
to be tackled by them. Such considerations are the next logical step as one moves from
choosing a best model to deploying it for real world application and sharing the results to
non-technical audiences.

References
Bates, S., Hastie, T., & Tibshirani, R. (2021). Cross-validation: what does it estimate and
how well does it do it? arXiv preprint arXiv:2104.00673 .

Breiman, L. (2001). Statistical modeling: The two cultures (with comments and a rejoinder
by the author). Statistical Science, 16(3), 199–231. doi: 10.1214/ss/1009213726

Donoho, D. (2017). 50 years of data science. Journal of Computational and Graphical
Statistics, 26(4), 745–766. doi: 10.1080/10618600.2017.1384734

Efron, B. (1983). Estimating the error rate of a prediction rule: Improvement on cross-
validation. Journal of the American Statistical Association, 78 (382), 316–331. doi:
10.1080/01621459.1983.10477973

Efron, B., & Hastie, T. (2021). Computer age statistical inference: Algorithms, evidence,
and data science. Cambridge University Press. doi: 10.1017/CBO9781316576533

Efron, B., & Tibshirani, R. (1997). Improvements on cross-validation: the 632+ bootstrap
method. Journal of the American Statistical Association, 92 (438), 548–560.

Efron, B., & Tibshirani, R. J. (1994). An introduction to the bootstrap. CRC press.

Gütlein, M., Helma, C., Karwath, A., & Kramer, S. (2013). A large-scale empirical evalua-
tion of cross-validation and external test set validation in (Q)SAR. Molecular Informatics,
32(5-6), 516–528.

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning:
Data mining, inference, and prediction. Springer.

Martens, H. A., & Dardenne, P. (1998). Validation and verification of regression in small
data sets. Chemometrics and intelligent laboratory systems, 44 (1-2), 99–121.

Molinaro, A. M., Simon, R., & Pfeiffer, R. M. (2005). Prediction error estimation: a
comparison of resampling methods. Bioinformatics, 21 (15), 3301–3307.

Rodu, J., & Baiocchi, M. (2021). When black box algorithms are (not) appropriate: a
principled prediction-problem ontology.

Thas, O. (2021). Lecture notes on inference vs. prediction. Diepenbeek: Universiteit
Hasselt.

