Evaluating Learning Algorithms: A Classification Perspective (2011)
The field of machine learning has matured to the point where many sophisticated
learning approaches can be applied to practical applications. Thus it is of critical
importance that researchers have the proper tools to evaluate learning approaches
and understand the underlying issues.
This book examines various aspects of the evaluation process with an emphasis
on classification algorithms. The authors describe several techniques for classifier
performance assessment, error estimation and resampling, and obtaining statistical
significance, as well as selecting appropriate domains for evaluation. They also
present a unified evaluation framework and highlight how different components of
evaluation are both significantly interrelated and interdependent. The techniques
presented in the book are illustrated using R and WEKA, facilitating better practical
insight as well as implementation.
Aimed at researchers in the theory and applications of machine learning, this
book offers a solid basis for conducting performance evaluations of algorithms in
practical settings.
NATHALIE JAPKOWICZ
University of Ottawa
MOHAK SHAH
McGill University
Cambridge University Press
Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore,
São Paulo, Delhi, Dubai, Tokyo, Mexico City
A catalog record for this publication is available from the British Library.
Cambridge University Press has no responsibility for the persistence or accuracy of URLs for
external or third-party Internet Web sites referred to in this publication and does not guarantee that
any content on such Web sites is, or will remain, accurate or appropriate.
This book is dedicated to the memory of my father, Michel Japkowicz
(1935–2008), who was my greatest supporter all throughout my studies and
career, taking a great interest in any project of mine. He was aware of the
fact that this book was being written, encouraged me to write it, and would
be the proudest father on earth to see it in print today.
Nathalie
Mohak
Contents
Preface page xi
Acronyms xv
1 Introduction 1
1.1 The De Facto Culture 3
1.2 Motivations for This Book 6
1.3 The De Facto Approach 7
1.4 Broader Issues with Evaluation Approaches 12
1.5 What Can We Do? 16
1.6 Is Evaluation an End in Itself? 18
1.7 Purpose of the Book 19
1.8 Other Takes on Evaluation 20
1.9 Moving Beyond Classification 20
1.10 Thematic Organization 21
3 Performance Measures I 74
3.1 Overview of the Problem 75
3.2 An Ontology of Performance Measures 81
3.3 Illustrative Example 82
3.4 Performance Metrics with a Multiclass Focus 85
3.5 Performance Metrics with a Single-Class Focus 94
3.6 Illustration of the Confusion-Matrix-Only-Based Metrics Using WEKA 107
9 Conclusion 335
9.1 An Evaluation Framework Template 336
9.2 Concluding Remarks 349
9.3 Bibliographic Remarks 350
Bibliography 393
Index 403
Preface
This book was started at Monash University (Melbourne, Australia) and Laval
University (Quebec City, Canada) with the subsequent writing taking place at
the University of Ottawa (Ottawa, Canada) and McGill University (Montreal,
Canada). The main idea stemmed from the observation that while machine
learning as a field is maturing, the importance of evaluation has not received
due appreciation from the developers of learning systems. Although almost
all studies make a case for the evaluation of the algorithms they present, we
find that many (in fact a majority) demonstrate a limited understanding of
the issues involved in proper evaluation, despite the best intention of their
authors. We concede that optimal choices cannot always be made due to limiting
circumstances, and trade-offs are inevitable. However, the methods adopted in
many cases do not reflect attention to the details warranted by a proper evaluation
approach (of course there are exceptions and we do not mean to generalize this
observation).
Our aim here is not to present the readers with yet another recipe for evaluation
that can replace the current default approach. Rather, we try to develop an
understanding of and appreciation for the different concerns of importance in the
practical application and deployment of learning systems. Once these concerns
are well understood, the other pieces of the puzzle fall quickly in place since
the researcher is not left shooting in the dark. A proper evaluation procedure
consists of many components that should all be considered simultaneously so as
to correctly address their interdependence and relatedness. We feel that the best
(read most easily understood) manner to bring this holistic view of evaluation
to the fore is in the classification setting. Nonetheless, most of the observations
that we make with regard to the various evaluation components extend just as
well to other learning settings and paradigms since the underlying evaluation
principles and objectives are essentially the same.
Altogether, this book should be viewed not only as a tool designed to increase
our understanding of the evaluation process in a shared manner, but also as a first
(in both her fetal and infant states), which made it possible for Nathalie to
continue working on the project prior to and after her birth. Nathalie’s father,
Michel Japkowicz, and her mother, Suzanne Japkowicz, have also always been
an unconditional source of loving support and understanding. Without their
constant interest in her work, she would not be where she is today. Nathalie is
also grateful to her in-laws, Toba and Michael Ripsman, for being every bit as
supportive as her own parents during the project and beyond.
On the personal front, Mohak would like to acknowledge his mother Raxika
Shah and his sister Tamanna Shah for their unconditional love, support, and
encouragement. It is indeed the unsung support of family and friends that moti-
vates you and keeps you going, especially in difficult times. Mohak considers
himself exceptionally fortunate to have friends like Sushil Keswani and Ruma
Paruthi in his life. He is also grateful to Rajeet Nair, Sumit Bakshi, Arvind
Solanki, and Shweta (Dhamani) Keswani for their understanding, support, and
trust.
Finally, we heartily apologize to friends and colleagues whose names may
have been inadvertently missed in our acknowledgments.
Nathalie Japkowicz and Mohak Shah
Ottawa and Montreal
2010
Acronyms
Algorithms
1nn   1-nearest-neighbor
ada   AdaBoost using decision trees
c45   decision tree (c4.5)
nb    naive Bayes
nn    nearest neighbor
rf    random forest
rip   Ripper
scm   set covering machine
svm   support vector machine
Algorithms are set in small caps to distinguish them from acronyms.
1 These developments have resulted both from empirically studied behaviors and from exploiting the
theoretical frameworks developed in other fields, especially mathematics.
2 Although the worth of a study that results in marginal empirical improvements sometimes lies in the
more significant theoretical insights obtained.
insights obtained should be considered in a more general sense toward the study
of all learning paradigms and settings. Many of these approaches can indeed
be readily exported (with a few suitable modifications) to other scenarios such
as unsupervised learning, regression and so on. The issues we consider in the
book deal not only with evaluation measures, but also with the related and
important issues of obtaining (and understanding) the statistical significance
of the observed differences, efficiently computing the evaluation measures in
as unbiased a manner as possible, and dealing with the artifacts of the data
that affect these quantities. Our aim is to raise an awareness of the proper
way to conduct such evaluations and of how important they are to the prac-
tical utilization of the advances being made in the field. While developing an
understanding of the relevant evaluation strategies, some that are widely used
(although sometimes with little understanding) as well as some that are not cur-
rently too popular, we also try to address a number of practical criticisms and
philosophical concerns that have been raised with regard to their usage and effec-
tiveness and examine the solutions that have been proposed to deal with these
concerns.
Our aim is not to suggest a recipe for evaluation to replace the previous de
facto one, but to develop an understanding and appreciation of the evaluation
strategies, of their strengths, and the underlying caveats. Before we go further
and expand our discussion pertaining to the goals of this book by bringing forth
the issues with our current practices, we discuss the de facto culture that has
pervaded the machine learning community to date.
Indeed, statistical tests are needed and are even useful so as to obtain “confi-
dence” in the difference in performance observed over a given domain for two
or more algorithms. Generally the machine learning community has settled on
merely rejecting the null hypothesis that the apparent differences are caused
by chance effects when the t test is applied. In fact, the issue is a lot more
involved.
The point is that no single evaluation strategy consisting of a combination
of evaluation methods can be prescribed that is appropriate in all scenarios. A
de facto approach to evaluation (or, perhaps more appropriately, a panacea approach), even with minor variations for different cases, is hence neither appropriate nor advisable, nor indeed possible. Broader issues need to be taken into account.
Getting back to the issue of our general underappreciation of the importance
of evaluation, let us now briefly consider this question: Why and how has the
machine learning community allowed such a de facto or panacea culture to
take root? The answer to this question is multifold. Naturally we can invoke the
argument about the ease of comparing novel results with existing published ones
as a major advantage of sticking to a very simple comparison framework. The
reasons for doing so can generally be traced to two main sources: (i) the unavail-
ability of other researchers’ algorithm implementations, and (ii) the ease of
not having to replicate the simulations even when such implementations are
available. The first concern has actually encouraged various researchers to come
together in calling for the public availability of algorithmic implementations
under general public licenses (Sonnenburg et al., 2007). The second concern
should not be mistaken for laziness on the part of researchers. After all, there
can be no better reward than being able to demonstrate, fair and square (i.e., by letting the creators of the system themselves demonstrate its worth as best they can), the superiority of one's method over the existing state of the art.
Looking a little bit beyond the issues of availability and simplicity, we believe
that there are more complex considerations that underlie the establishment of
this culture. Indeed, the implicit adoption of the de facto approach can also
be linked to the desire of establishing an “acceptable” scientific practice in
the field as a way to validate an algorithm’s worth. Unfortunately, we chose to
achieve such acceptability by using a number of shortcuts. The problem with this
practice is that our comparisons of algorithms’ performance, although appearing
acceptable, are frequently invalid. Indeed, many times, validity is lost as a result
of the violation of the underlying assumptions and constraints of the methods
that we use. This can be called the “politically correct” way of doing evaluations.
Such considerations are generally, and understandably, never stated as they are
implicit.
Digging even deeper, we can discover some of the reasons for this standard
adoption. A big part of the problem is attributable to a lack of understand-
ing of the evaluation approaches, their underlying mode of application, and
the interpretation of their results. Although advances have been made in find-
ing novel evaluation approaches or their periodic refinements, these advances
have not propagated to the mainstream. The result has been the adoption of a
“standard” simple evaluation approach comprising various elements that are rel-
atively easily understood (even intuitive) and widely accepted. The downside of
this approach is that, even when alternative (and sometimes better-suited) evalu-
ation measures are utilized by researchers, their results are met with scepticism.
If we could instill a widespread understanding of the evaluation methodologies
in the community, it would be easier to not only better evaluate our classifiers
but also to better appreciate the results that were obtained. This can further result
in a positive-feedback loop from which we can obtain a better understanding of
various learning approaches along with their bottlenecks, leading in turn to bet-
ter learning algorithms. This, however, is not to say that the researchers adopting
alternative, relatively less-utilized elements of evaluation approaches are com-
pletely absolved of any responsibility. Instead, these researchers also have the
onus of making a convincing case as to why such a strategy is more suitable than
those in current and common use. Moreover, the audience – both the reviewers
and the readers – should be open to better modes of evaluation that can yield
a better understanding of the learning approaches applied in a given domain,
bringing into the light their respective strengths and limitations. To realize this
goal, it is indeed imperative that we develop a thorough understanding of such
evaluation approaches and promote this in the basic required machine learning
and data mining courses.
Note that the de facto method, even if suitable in many scenarios, is not a
panacea. Broader issues need to be taken into account. Such awareness can be
brought about only from a better understanding of the approaches themselves.
This is precisely what this book is aimed at. The main idea of the book is
not to prescribe specific recipes of evaluation strategies, but rather to educate
researchers and practitioners alike about the issues to keep in mind when adopt-
ing an evaluation approach, to enable them to objectively apply these approaches
in their respective settings.
While furthering the community’s understanding of the issues surround-
ing evaluation, we also seek to simplify the application of different evaluation
paradigms to various practical problems. In this vein, we provide simple and
intuitive implementations of all the methods presented in the book. We devel-
oped these by using WEKA and R, two freely available and highly versatile
platforms, in the hope of making the discussions in the book easily accessible
to and further usable by all.
Before we proceed any further, let us see, with the help of a concrete example,
what we mean by the de facto approach to evaluation and what types of issues
can arise as a result of its improper application.
1.3.1 An Illustration
Consider an experiment that consists of running a set of learning algorithms on
a number of domains to compare their generic performances. The algorithms
used for this purpose include naive bayes (nb), support vector machines (svms),
1-nearest neighbor (1nn), AdaBoost using decision trees (ada), Bagging (bag),
a C4.5 decision tree (c45), random forest (rf), and Ripper (rip).
Tables 1.2 and 1.3 illustrate the process just summarized with actual exper-
iments. In particular, Table 1.1 shows the name, dimensionality (#attr), size
(#ins), and number of classes (#cls) of each domain considered in the study.
Table 1.2 shows the results obtained by use of accuracy, 10-fold stratified cross-
validation, and t tests with 95% confidence, and averaging of the results obtained
by each classifier on all the domains. In Table 1.2, we also show the results of the
t test with each classifier pitted against nb. A "v" next to a result indicates that the significance test favors the concerned classifier over nb, a "*" indicates that it favors nb (i.e., nb wins), and no symbol signals a tie (no statistically significant difference). The results of the t test are summarized at the bottom
of the table. Table 1.3 shows the aggregated t-test results obtained by each
classifier against each other in terms of wins–ties–losses on each domain. Each
classifier was optimized prior to being tested by the running of pairwise t tests
on different parameterized versions of the same algorithm on all the domains.
For each classifier, the parameter settings that won the greatest number of these t tests were selected as the optimal ones.
As can be seen from these tables, results of this kind are difficult to interpret
because they vary too much across both domains and classifiers. For example,
the svm seems to be superior to all the other algorithms on the balance scale and
it apparently performs worst on breast cancer. Similarly, bagging is apparently
Table 1.2. Accuracy results of various classifiers on the datasets of Table 1.1
Dataset        nb       svm        1nn       ada        bag       c45        rf         rip
Glass          70.63    62.21      70.50     44.91 *    69.63     66.75      79.87      70.95
Hepatitis      83.21    80.63      80.63     82.54      84.50     83.79      84.58      78.00
Hypothyroid    98.22    93.58 *    91.52 *   93.21 *    99.55 v   99.58 v    99.39 v    99.42
Tic-tac-toe    69.62    99.90 v    81.63 v   72.54 v    92.07 v   85.07 v    93.94 v    97.39 v
Average        78.15    82.35      77.69     71.19      81.42     81.92      83.40      82.05 v
t test                  3/6/1      2/6/2     1/5/4      3/7/0     3/7/0      4/6/0      4/6/0
Notes: The final row gives the number of wins/ties/losses for each algorithm against the nb classifier. A "v" indicates that the significance test favors the corresponding classifier over nb, while a "∗" indicates that it favors nb. No symbol indicates that the results of the concerned classifier and nb were not found to be statistically significantly different.
Table 1.3. Aggregate number of wins/ties/losses of each algorithm against the others
over the datasets of Table 1.1
the second best learner on the hepatitis dataset and is average, at best, on breast
cancer. As a consequence, the aggregation of these results over domains is not
that meaningful either. Several other issues plague this evaluation approach in
the current settings. Let us look at some of the main ones.
Statistical Validity – II. In fact, the dearth of data is only one problem plaguing
the validity of the t test. Other issues are problematic as well, e.g., the inter-
dependence between the number of experiments and the significance level of a
statistical test. As suggested by Salzberg (1997), because of the large number of
experiments run, the significance level of 0.05 used in our t test is not stringent
enough: It is possible that, in certain cases, this result was obtained by chance.
This is amplified by the fact that the algorithms were tuned on the same datasets
as they were tested on and that these experiments were not independent. Such
problems can be addressed to a considerable extent by use of well-known pro-
cedures such as analysis of variance (ANOVA) or other procedures such as the
Friedman test (Demšar, 2006) or the Bonferroni adjustment (Salzberg, 1997).
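To make this concrete, the short R fragment below (run on hypothetical accuracy data of our own making, not on the experiments of Table 1.2) applies a Bonferroni adjustment to the p values of several pairwise t tests and runs a Friedman test over an accuracy matrix whose rows are domains and whose columns are classifiers.

set.seed(42)
# Hypothetical accuracies: 10 domains (rows) by 4 classifiers (columns)
acc <- matrix(runif(40, 0.70, 0.95), nrow = 10,
              dimnames = list(NULL, c("nb", "svm", "c45", "rf")))
# Pairwise t tests of each classifier against nb, followed by a Bonferroni adjustment
raw_p <- sapply(c("svm", "c45", "rf"),
                function(m) t.test(acc[, m], acc[, "nb"], paired = TRUE)$p.value)
p.adjust(raw_p, method = "bonferroni")
# Friedman test: do the classifiers' ranks differ significantly across domains?
friedman.test(acc)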
Aggregating the Results. The averaging of the results shown in Table 1.2 is
not meaningful either. Consider, for example, ada and 1nn on audiology and
breast cancer. Although, in audiology, ada’s performance with respect to 1nn’s
is dismal and rightly represented as such with a drop in performance of close
to 30%, ada’s very good performance in breast cancer compared with 1nn’s
is represented by only a 5% increase in performance. Averaging such results
(among others) does not weigh the extent of performance differences between
the classifiers. On the other hand, the win/tie/loss results give us quantitative
but not qualitative assessments: We know how many times each algorithm won
over, tied with, or lost against each other, but not by how much. Alternative
approaches as well as appropriate statistical tests have been proposed for this
purpose that may prove to be helpful.
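As a rough sketch of the two kinds of summaries being contrasted here, the R fragment below (again on invented fold-level accuracies, not the book's data) derives a win/tie/loss count from per-domain paired t tests and, alongside it, the plain averages that hide the magnitudes of the per-domain differences.

set.seed(1)
n_domains <- 5; n_folds <- 10
acc_a <- matrix(runif(n_domains * n_folds, 0.70, 0.90), nrow = n_domains)        # hypothetical
acc_b <- acc_a + matrix(rnorm(n_domains * n_folds, 0, 0.03), nrow = n_domains)   # hypothetical
wins <- ties <- losses <- 0
for (d in 1:n_domains) {
  p <- t.test(acc_a[d, ], acc_b[d, ], paired = TRUE)$p.value
  if (p >= 0.05) {
    ties <- ties + 1
  } else if (mean(acc_a[d, ]) > mean(acc_b[d, ])) {
    wins <- wins + 1
  } else {
    losses <- losses + 1
  }
}
c(wins = wins, ties = ties, losses = losses)   # counts how often one classifier wins, not by how much
c(avg_a = mean(acc_a), avg_b = mean(acc_b))    # averages weigh large and small differences indiscriminately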
available. This exploratory pursuit, though, is different from the issue of decid-
ing what classifier is best for a task or for a series of tasks (the primary purpose
of evaluation in learning experiments). Yet the two processes are usually merged
into a gray area with no attempt to truly subdivide them – and hence minimize
the biases in evaluation estimates. This is mainly due to the fact that machine
learning researchers are both the designers and the evaluators of their systems.
In other areas, as will be touched on later, these two processes are subdivided.
Internal versus External Validation. The last issue we wish to point out has
to do with the fact that, although we carefully use cross-validation to train and
test on different partitions of the data, at the end of the road, these partitions all
come from the same distribution because they belong to the same dataset. Yet
the designed classifiers are applied to different, even if related, data. This is an
issue discussed by Hand (2006) in the machine learning community; it was also
considered in the philosophy community by Forster (2000) and Busemeyer and
Wang (2000). We discuss some of the solutions that were proposed to deal with
this issue.
This short discussion illustrates some of the reasons why the de facto approach
to evaluation cannot be applied universally. As we can see, some of these
reasons pertain to the evaluation measures considered, others pertain to the
statistical guarantees believed to have been obtained, and still others pertain to
the overall evaluation framework. In all cases, the issues concern the evaluator’s
belief that the general worth of the algorithm of interest has been convincingly
and undeniably demonstrated in comparison with other competitive learning
approaches. In fact, critics argue, this belief is incorrect and the evaluator should
be more aware of the controversies surrounding the evaluation strategy that has
been utilized.
limitation, as it turns out, can have serious implications in that in most practical
scenarios there is almost always an unequal misclassification cost associated
with each class. This problem was recognized early. Kononenko and Bratko
(1991) proposed an information-based approach that takes this issue into con-
sideration, along with the questions of dealing with classifiers that issue different
kinds of answers (categorical, multiple, no answer or probabilistic) and com-
parisons on different domains. Although their method generated some interest
and a following in some communities, it did not receive large-scale acceptance.
Perhaps this is due to the fact that it relies on knowledge of the cost matrix and
prior class probabilities, which cannot generally be estimated accurately.
More successful has been the effort initiated by Provost et al. (1998), who
introduced ROC analysis to the machine learning community. ROC analysis
allows an evaluator to commit neither to a particular class prior distribution nor
to a particular cost matrix. Instead, it analyzes the classifier’s performance over
all the possible priors and costs. Its associated metric, the AUC, has started to be
used relatively widely, especially in cases of class imbalances. There have been
standard metrics adopted by other related domains as well to take into account
the class imbalances. For instance, the area of text categorization often uses
metrics such as precision, recall, and the F measure. In medical applications,
it is not uncommon to encounter results expressed in terms of sensitivity and
specificity, as well as in terms of positive predictive values (PPVs) and negative
predictive values (NPVs).
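As a small, self-contained illustration of the AUC (not code from the book), the base-R sketch below computes it through its rank-sum (Mann-Whitney) formulation from made-up scores and binary labels.

# Hypothetical classifier scores and true binary labels (1 = positive class)
scores <- c(0.9, 0.8, 0.75, 0.6, 0.55, 0.4, 0.3, 0.2)
labels <- c(1, 1, 0, 1, 0, 0, 1, 0)
auc <- function(scores, labels) {
  r <- rank(scores)                                 # ranks of all scores, ties averaged
  n_pos <- sum(labels == 1); n_neg <- sum(labels == 0)
  # Mann-Whitney U statistic for the positives, normalized to [0, 1]
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}
auc(scores, labels)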
As mentioned previously, although accuracy remains overused, many
researchers have taken note of the fact that other metrics are available and
easy to use. As a matter of fact, widely used machine learning libraries such as
WEKA (Witten and Frank, 2005a) contain implementations of many of these
metrics whose results can be obtained in an automatically generated summary
sheet. Drummond (2006) also raised an important issue concerning the fact that
machine learning researchers tend to ignore the kind of algorithmic properties
that are not easy to measure. For example, despite its noted interest, the com-
prehensibility of a classifier’s result cannot be measured very convincingly and
thus is often not considered in comparison studies. Other properties that are sim-
ilarly perhaps better formulated qualitatively than quantitatively are also usually
ignored. We discuss these and related issues in greater depth in Chapters 3, 4,
and 8.
given that the same value may take different meanings, depending on the domain.
Recognizing this problem, researchers sometimes use a win/tie/loss approach,
counting the number of times each classifier won over all the others, tied with
the best, or lost against one or more. This, however, requires that the perfor-
mance comparison be deterministically categorized as a win, tie, or loss without
the margins being taken into account. That is, any information pertaining to how
close classifiers were to winning or tying is essentially ignored. We discuss the
issue of aggregation in Chapter 8.
should, at the least, be aware of all the assumptions that are violated and the
possible consequences of this action.
Alternatives to cross-validation testing in such scenarios in which statistical
significance testing effects are kept in mind have been suggested. Two of the
main resampling statistics that have appeared to be useful, but have so far
eluded the community, are bootstrapping and randomization. Bootstrapping has
attracted a bit of interest in the field (e.g., Kohavi, 1995, and Margineantu and
Dietterich, 2000), but is not, by any means, widely used. Randomization, on the
other hand, has gone practically unnoticed except for rare sightings (e.g., Jensen
and Cohen, 2000).
Resampling tests appear to be strong alternatives to parametric tests for sta-
tistical significance too, in addition to facilitating error estimation. We believe
that the machine learning community should engage in more experimentation
with them to establish alternatives in case the assumptions, constraints, or both,
of the standard tests such as the t test are not satisfied, rendering them inappli-
cable. Error-estimation methods are the focus of our discussions in Chapter 5.
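To give a feel for both techniques (a sketch only, with hypothetical fold-level accuracies and arbitrary repetition counts), the base-R fragment below builds a bootstrap confidence interval for the mean accuracy difference between two classifiers and a simple randomization (sign-flipping) p value for the same difference.

set.seed(7)
acc_a <- c(0.81, 0.79, 0.84, 0.80, 0.78, 0.83, 0.82, 0.80, 0.79, 0.81)   # hypothetical
acc_b <- c(0.78, 0.80, 0.81, 0.77, 0.79, 0.80, 0.79, 0.78, 0.80, 0.77)   # hypothetical
d <- acc_a - acc_b
# Bootstrapping: resample the paired differences with replacement
boot_means <- replicate(10000, mean(sample(d, replace = TRUE)))
quantile(boot_means, c(0.025, 0.975))     # bootstrap interval for the mean difference
# Randomization: flip the sign of each difference at random under the null hypothesis
perm_means <- replicate(10000, mean(d * sample(c(-1, 1), length(d), replace = TRUE)))
mean(abs(perm_means) >= abs(mean(d)))     # two-sided randomization p value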
1.4.5 Datasets
The experimental framework used by the machine learning community often
consists of running large numbers of simulations on community-shared domains
such as those from the UCI Repository for Machine Learning. There are many
advantages to working in such settings. In particular, new algorithms can easily
be tested under real-world conditions; problems arising in real-world settings
can thus be promptly identified and focused on; and comparisons between
new and old algorithms are easy because researchers share the same datasets.
Unfortunately, along with these advantages also come a couple of disadvantages, such as the multiplicity effect and issues with community experiments. We
discuss these in detail in Chapter 7.
Exploratory pursuit versus Final Evaluation. The division between the two
kinds of research is problematic. Although it is clear that they should be sepa-
rated, some researchers (e.g., Drummond, 2006) recognize that they cannot be,
and that, as a result, we should not even get into issues of statistical validation
of our results and so on, because we are doing only exploratory research that
does not demand such formal testing. This is a valid point of view, although we
cannot help but hope that the results we obtain will be more definitive than those this point of view would settle for.
Better Data Exchanges Between Applied Fields and Machine Learning. The
view taken here is that we need to strongly encourage the collection of real
datasets to test our algorithms. A weakness in our present approach that needs
to be addressed is the reliance on old data. The old data have been used too
frequently and for far too long; hence results based on these may be untrust-
worthy. The UCI repository was a very good innovation, but despite being
a nonstatic collection with new datasets being collected over time, it does
not grow at a fast-enough pace, nor does it discriminate among the different
domains that are contributed. What the field lacks, mainly, are data focused on
particular topics. It would be useful to investigate the possibility of a data-for-
analysis exchange between practitioners in various fields and machine learning
researchers.
not suggesting that we all become statistical experts, deriving new methodologies
and mathematically proving their adequacies. Instead, we are proposing that we
develop a level of understanding and awareness in these issues similar to those
that exist in empirically stringent fields such as psychology or economics that
can enable us to craft experiments and validate models more rigorously.
and their wide use for various learning tasks and in part because of the ease
of illustrating various concepts in their context. However, we wish to make the
point that, despite its relative simplicity, the evaluation of classifiers is not an
easy endeavor, as evidenced by the sheer length of this book. Many techniques
needed to be described and illustrated within the context of classification alone.
Providing a comprehensive survey of evaluation techniques and methodologies,
along with their corresponding issues, would have indeed been prohibitive.
Further, it is extremely important to note that the insights obtained from the
discussion of various components of the evaluation framework over classifi-
cation algorithms are not limited to this paradigm but readily generalize to a
significantly wider extent. Indeed, although the performance metrics or mea-
sures differ from task to task (e.g., unsupervised learning is concerned with
cluster tightness and between-cluster distance; association rule mining looks at
the notions of support and confidence), error estimation and resampling, sta-
tistical testing and dataset selection can make use of the techniques that apply
to classifier evaluation. This book should thus be useful for researchers and
practitioners in any data analysis task because most of its material applies to the
different learning paradigms they may consider. Because performance metrics
are typically the best-known part of evaluation in every learning field, the book
will remain quite useful even if it does not delve into this particular topic for the
specific task considered.
various aspects of the classifier evaluation approach. All the methods discussed
in the book are also illustrated using the R and WEKA packages, at the end
of their respective chapters. In addition, the book includes three appendices.
Appendix A contains all the statistical tables necessary to interpret the results of
the statistical tests of Chapters 5 and 6. Appendix B lists details on some of the
data we used to illustrate our discussions, and, finally, Appendix C illustrates
the framework of Chapter 9 with two case studies.
2 Machine Learning and Statistics Overview
future phenomena. In this book, we focus on a very useful, although much more
specific, aspect of inductive inference: classification.
constraints, functions, and even models. Moreover, there have recently been
attempts to learn from examples with and without labels together, an approach
largely known as semisupervised learning.
In addition, the mode of providing the learning information also plays an
important role that gives rise to two main models of learning: active learn-
ing (a learner can interact with the environment and affect the data-generation
process) and passive learning (in which the algorithm is given a fixed set of
examples as a way of observing its environment, but lacks the ability to interact
with it). It should be noted that active learning is different from online learn-
ing in which a master algorithm uses the prediction of competing hypotheses
to predict the label of a new example and then learns from its actual label.
In this book, the focus is on passive learning algorithms; more specifically,
classification algorithms in this paradigm. We illustrate the classification prob-
lem with an example. We further characterize this problem concretely a bit
later.
A Classification Example
The aim of classification algorithm A is to obtain a mapping from examples x
to their respective labels y in the form of a classifier f that can also predict
the labels for future unseen examples. When y takes on only two values as
labels, the problem is referred to as a binary classification problem. Restricting our discussion to this problem makes it easier to explain the various evaluation concepts, and binary classification algorithms are also the ones most familiar to researchers, which helps put the discussion in perspective; this choice in no way limits the broader applicability of the book's message in more general contexts. Needless to say, many of the
proposed approaches extend to the multiclass case as well as to other learning
scenarios such as regression, and our focus on binary algorithms in no way
undermines the importance and contribution of these paradigms in our ability to
learn from data.
Consider the following toy example of concept learning aimed at providing
a flu diagnosis of patients with given symptoms. Assume that we are given the
database of Table 2.1 of imaginary patients. The aim of the learning algorithm
here is to find a function that is consistent with the preceding database and
can then also be used to provide a flu or no-flu diagnosis on future patients
based on their symptoms. Note that by consistent we mean that the function
should agree with the diagnosis, given the symptoms in the database provided.
As we will see later, this constraint generally needs to be relaxed to avoid
overspecializing a function that, even though it agrees with the diagnosis of every
patient in the database, might not be as effective in diagnosing future unseen
patients.
From the preceding data, the program could infer that anyone with a body
temperature above 38 ◦ C and sinus pain has the flu. Such a formula can then be
applied to any new patient whose symptoms are described according to the same
four parameters used to describe the patients from the database (i.e., temperature,
cough, sore throat, and sinus pain) and a diagnosis issued.1
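Such a formula is simply a function from the four symptom measurements to a diagnosis; a minimal sketch in R of the rule just stated (the argument names are our own choice) could look as follows.

# The inferred rule: temperature above 38 C together with sinus pain implies flu.
# Note that cough and sore_throat are ignored by this particular formula.
diagnose <- function(temperature, cough, sore_throat, sinus_pain) {
  if (temperature > 38 && sinus_pain) "flu" else "no flu"
}
diagnose(temperature = 39.1, cough = TRUE, sore_throat = FALSE, sinus_pain = TRUE)   # "flu"
diagnose(temperature = 37.2, cough = TRUE, sore_throat = TRUE, sinus_pain = FALSE)   # "no flu"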
Several observations are worth making at this point from this example: First,
many formulas could be inferred from the given database. For example, the
machine learning algorithm could have learned that anyone with a body temper-
ature of 38.5 ◦C or more and a cough, a sore throat, or sinus pain has the flu, or it
could have learned that anyone with sinus pain has the flu.
Second, as suggested by our example, there is no guarantee that any of the
formulas inferred by the machine learning algorithm is correct. Because the
formulas are inferred from the data, they can be as good as the data (in the best
case), but not better. If the data are misleading, the result of the learning system
will be too.
Third, what learning algorithms do is different from what a human being
would do. A real (human) doctor would start with a theoretical basis (a kind
of rule of thumb) learned from medical school that he or she would then refine
based on his or her subsequent observations. He or she would not, thankfully,
acquire all his or her knowledge based on only a very limited set of observations.
In terms of the components of learning described in the preceding formula-
tion, we can look at this illustration as follows. The first component corresponds
to the set of all the potential patients that could be represented by the four
parameters that we listed (temperature, cough, sore throat, and sinus pain). The
example database we use lists only six patients with varying symptoms, but
many more could have been (and typically are) presented.
The second component, in our example, refers to the diagnosis, flu and no-
flu, associated with the symptoms of each patient. A classification learning
algorithm’s task is to find a way to infer the diagnosis from the data. Naturally,
this can be done in various ways.
The third component corresponds to this choice of the classification learning
algorithm. Different algorithms tend to learn under different learning paradigms
to obtain an optimal classifier and have their own respective learning biases.
1 In fact, choosing such a formula would obviate the need to measure symptoms other than temperature
and sinus pain.
where the probability measure D(z) = D(x, y) is unknown. This risk is often
referred to as the true risk of the classifier f . For the zero–one loss, i.e.,
L(y, f(x)) = 1 when y ≠ f(x) and 0 otherwise, we can write the expected risk as

$R(f) \stackrel{\mathrm{def}}{=} \Pr_{(x,y)\sim D}\,(y \neq f(x)).$   (2.2)
Note that the classifier f in question is defined given a training set. This
makes the loss function a training-set-dependent quantity. This fact can have
important implications in studying the behavior of a learning algorithm as well
as in making inferences on the true risk of the classifier. We will see this in some
more concrete terms in Subsection 2.1.7.
Empirical Risk
It is generally not possible to estimate the true or expected risk of the classifier
without the knowledge of the true underlying distribution of the data and possibly
their labels. As a result, the expected risk is estimated by a measurable quantity known as the empirical risk. Hence the learner often computes the empirical risk $R_S(f)$ of any given classifier f = A(S) induced by the algorithm A on a training set S of size m according to

$R_S(f) \stackrel{\mathrm{def}}{=} \frac{1}{m}\sum_{i=1}^{m} L(y_i, f(x_i)),$   (2.3)
which is the risk of the classifier with respect to the training data. Here,
L(y, f (x)) is the specific loss function that outputs the loss of mislabeling
an example. Note that this function can be a binary function (outputting only 1
or 0) or a continuous function, depending on the class of problems.
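In code, the empirical risk under the zero-one loss is just the fraction of training examples the classifier mislabels; a minimal R sketch with hypothetical label vectors:

# Hypothetical true labels and classifier predictions on a training set of size m = 6
y    <- c(1, 0, 1, 1, 0, 1)
yhat <- c(1, 0, 0, 1, 0, 1)
empirical_risk <- function(y, yhat) mean(y != yhat)   # (1/m) * sum of zero-one losses
empirical_risk(y, yhat)                               # 1/6 for these vectors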
$R(A, S) \stackrel{\mathrm{def}}{=} R(f) - \inf_{f \in F} R(f),$

2 Infimum indicates the function f that minimizes R(f), but f need not be unique.
from the classifier space being explored. Given some training set S, the ERM
algorithm Aerm basically outputs the classifier that minimizes the empirical risk
on the training set. That is,
$A_{\mathrm{erm}}(S) \stackrel{\mathrm{def}}{=} \arg\min_{f \in F} R_S(f).$

The structural risk minimization (SRM) algorithm instead adds a complexity penalty to the empirical risk, that is,

$A_{\mathrm{srm}}(S) \stackrel{\mathrm{def}}{=} \hat{f} = \arg\min_{f \in F_d,\, d \in \mathbb{N}} \left[ R_S(f) + p(d, |S|) \right],$

where p(d, |S|) is a function that penalizes the algorithm for classifier spaces of increasing complexity.
Let cm be some complexity measure on classifier spaces. We would have a set of classifier spaces F = {F_1, . . . , F_k} such that the complexity of classifier space F_i, denoted cm_{F_i}, is greater than or equal to cm_{F_{i-1}}. Then the SRM algorithm would compute a set of classifiers minimizing the empirical risk over the classifier spaces F = {F_1, . . . , F_k} and then select the classifier
space that gives the best trade-off between its complexity and the minimum
empirical risk obtained over it on the training data.
Regularization
There are other approaches that extend the preceding algorithms, such as reg-
ularization, in which one tries to minimize the regularized empirical risk. This
is done by defining a regularizing term or a regularizer (typically a norm on
the classifier ||f ||) over the classifier space F such that the algorithm outputs a
classifier $\hat{f}$ minimizing the regularized empirical risk:

$\hat{f} = \arg\min_{f \in F}\left[ R_S(f) + \lambda \|f\| \right],$

where $\lambda > 0$ controls the trade-off between the empirical risk and the regularizer.
The regularization in this case can also be seen in terms of the complexity
of the classifier. Hence a regularized risk criterion restricts the classifier space
complexity. Here it is done by restricting the norm of the classifier. Other variants
of this approach also exist, such as normalized regularization. The details on these and other approaches can be found in Hastie et al. (2001), Herbrich (2002), and Bousquet et al. (2004b), among others.
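As a toy illustration of regularized empirical risk minimization (our own construction, not a method prescribed in the book), the sketch below fits a linear predictor under squared loss with an L2 penalty on the weight vector, using base R's optim; the simulated data and the value of lambda are arbitrary.

set.seed(3)
n <- 50
X <- cbind(1, matrix(rnorm(n * 3), n, 3))          # design matrix with an intercept column
y <- X %*% c(1, 2, -1, 0.5) + rnorm(n, sd = 0.3)   # simulated responses
lambda <- 0.1                                      # regularization strength (arbitrary choice)
regularized_risk <- function(w) mean((y - X %*% w)^2) + lambda * sum(w^2)
# Minimize the empirical risk plus the regularizer over the weight vector
fit <- optim(par = rep(0, ncol(X)), fn = regularized_risk, method = "BFGS")
fit$par                                            # the regularized solution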
where the indicator function I (a) represents the zero–one loss such that I (a) = 1
if predicate a is true and 0 otherwise.
Similarly, the empirical risk RS (f ) can be shown to be
$R_S(f) \stackrel{\mathrm{def}}{=} \frac{1}{m}\sum_{i=1}^{m} I(y_i \neq f(x_i)) \stackrel{\mathrm{def}}{=} E_{(x,y)\sim S}\, I(y \neq f(x)).$   (2.5)
Note that the loss function L(·, ·) in Equation (2.3) is replaced with the
indicator function in the case of Equation (2.5) for the classification problem.
Given this definition of risk, the aim of the classification algorithm is to find a
classifier f, given the training data, such that the risk of f is as close as possible to that of the optimal classifier f* ∈ F (the classifier minimizing the generalization error).
This problem of selecting the best classifier from the classifier space given the
training data is sometimes also referred to as model selection.
Recall our toy example from Subsection 2.1.1. A classifier based on the
criterion anyone with a cough or a sore throat and a temperature at or above
38.5 ◦ C has the flu would have an empirical risk of 1/6, whereas another based
on the criterion anyone with a cough or a sore throat has the flu would have
an empirical risk of 1/2 on the data of Table 2.1. Hence, given the two criteria,
ERM learning algorithms would select the former. How does model selection,
i.e., choice of a classifier based on the training data, affect the performance of
the classifier on future unseen data (also referred to as classifier’s generalization
performance)? This can be seen intuitively in the case of the regression
problem, as follows.
Figure 2.1. A simple example of overfitting and underfitting problems. Consider the prob-
lem of fitting a function to a set of data points in a two-dimensional plane, known as the
regression problem. Fitting a curve passing through every data point leads to overfitting. On
the other hand, the other extreme, approximation by a line, might be a misleading approxi-
mation and underfits if the data are sparse. The issue of overfitting is generally addressed
by use of approaches such as pruning or boosting. Approaches such as regularization and
Bayesian formulations have also shown promise in tackling this problem. In the case of
sparse datasets, we use some kind of smoothing or back-off strategy. A solution in between
the two extremes, e.g., the dash-dotted curve, might be desirable.
specific to generalize well in the future. Such a solution is hence not generally
preferred. Similar is the problem in which f underfits the data, that is, f is too
general. See Figure 2.1 for an example.
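The flavor of Figure 2.1 can be reproduced in a few lines of R (with simulated points of our own): fitting a straight line and a very high-degree polynomial to the same noisy sample shows the line underfitting while the polynomial hugs the training points.

set.seed(11)
x <- seq(0, 1, length.out = 12)
y <- sin(2 * pi * x) + rnorm(length(x), sd = 0.2)   # noisy points around a smooth curve
fit_line <- lm(y ~ x)                               # too simple: underfits
fit_poly <- lm(y ~ poly(x, 10))                     # too flexible: (nearly) interpolates the points
mean(residuals(fit_line)^2)                         # larger training error
mean(residuals(fit_poly)^2)                         # near-zero training error, poor generalization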
In our previous example, it is possible that the formula anyone with a cough
or a sore throat and a temperature at or above 38.5 ◦ C has the flu overfits the
data in that it closely represents the training set, but may not be representative
of the overall distribution. On the other hand, the formula anyone with a cough
or a sore throat has the flu underfits the data because it is not specific enough
and, again, not representative of the true distribution.
Generally speaking, we do not focus on the problem of model selection when
we refer to comparison or evaluation of classifiers in this book. We assume that
the learning algorithm A has access to some method that is believed to enable
it to choose the best classifier f from F based on the training sample S. The
method on which A relies is what is called the learning bias of an algorithm.
This is the preferential criterion used by the learning algorithm to choose one
classifier over others from one classifier space. For instance, an ERM algorithm
would select a classifier f ∈ F if it yields an empirical risk lower than that of any other classifier in F given a specific training sample S. Even though our primary focus in this book is on evaluation, we explore some concepts that also have implications for model selection.
Finally, for most algorithms, the classifier (or a subspace of the classifier space) is defined in terms of user-tunable parameters. Hence the
problem of model selection in this case also extends to finding the best values for
the user-tunable parameters. This problem is sometimes referred to separately as
that of parameter estimation. However, this is, in our opinion, a part of the model
selection problem because these parameters are either used to characterize or
constrain the classifier class (e.g., as in the case of regularization) or are part of
the model selection criterion (e.g., penalty on the misclassification). We use the
terms model selection and parameter estimation interchangeably.
Bias–Variance Decomposition
In this subsection, we describe the notion of bias and variance of arbitrary
loss functions along the lines of Domingos (2000). More precisely, given an
arbitrary loss function, we are concerned with its bias–variance decomposition.
As we will see, studying the bias–variance decomposition of a loss function
gives important insights into both the behavior of the learning algorithm and the
model selection dilemma. Moreover, as we will see in later chapters, this analysis
also has implications for the evaluation because the error-estimation methods
(e.g., resampling) can have significant effects on the bias–variance behavior of
the loss function, and hence different estimations can lead to widely varying
estimates of the empirical risk of the classifier.
We described the notion of the loss function in Subsection 2.1.2. Let L be an
arbitrary loss function that gives an estimate of the loss that a classifier incurs
as a result of disagreement between the assigned label f (x) and the true label y
of an example x. Then we define the notion of main prediction of the classifier
as follows.
Definition 2.1. Given an arbitrary loss function L and a collection of training
sets S, the main prediction is the one that minimizes the expectation E of this
loss function over the training sets S (denoted as ES ). That is,
$\bar{y} \stackrel{\mathrm{def}}{=} \arg\min_{y'} E_S[L(y', f(x))].$
That is, the main prediction of a classifier is the one that minimizes the
expected loss of the classifier on training sets S. In the binary classification
scenario, this will be the most frequently predicted label over the examples
in the training sets S ∈ S. That is, the main prediction characterizes the label
predicted by the classifier that minimizes the average loss relative to the true
labels over all the training sets. Hence this can be seen as the most likely
prediction of the algorithm A given training sets S.
Note here the importance of a collection of training sets S, recalling the
observation that we previously made in Subsection 2.1.2 on the nature of the
loss function. The fact that a classifier is defined given a training set establishes
a dependency of the loss function behavior on a specific training set. Averaging
over a number of training sets can alleviate this problem to a significant extent.
The size of the training set has an effect on the loss function behavior too.
Consequently, what we are interested in is averaging the loss function estimate
over several training sets of the same size. However, such an averaging is easier
said than done. In practice, we generally do not have access to large amounts
of data that can enable us to have several training sets. A solution to this
problem can come in the form of data resampling. We explore some of the
prominent techniques of data resampling in Chapter 5 and also study the related
issues.
Let us next define the optimal label $y^{\dagger}$ of an example x such that $y^{\dagger} = \arg\min_{y'} E_y[L(y, y')]$. That is, if an example x is sampled repeatedly, then the associated label y need not be the same because y is also sampled from a conditional distribution3 Y|X. The optimal prediction $y^{\dagger}$ denotes the label that is
closest to the sampled labels over these repeated samplings. This essentially sug-
gests that, because there is a nondeterministic association between the examples
and their respective labels (in an ideal world this should not be the case), the
sampled examples are essentially noisy, with their noise given by the difference
between the optimal label and the sampled label.
Hence the noise of the example can basically be seen as a measure of mis-
leading labels. The more noisy an example is, the higher the divergence, as
measured by the loss function, of its label from the optimal label.
An optimal model would hence be the one that has f (x) = y † for all x. This
is nothing but the Bayes classifier in the case of classification with zero–one loss
function. The associated risk is called the Bayes risk.
We can now define the bias of a learning algorithm A.
Definition 2.3. Given a classifier f , for any example z = (x, y), let y † be the
optimal label of x; then the bias of an algorithm A is defined as
$B_A = E_x[B_A(x)],$ where $B_A(x) = L(y^{\dagger}, \bar{y})$ is the loss of the main prediction $\bar{y}$ with respect to the optimal label $y^{\dagger}$.
3 Recall that the examples and the respective labels are sampled from a joint distribution X × Y.
Similarly, the variance of the algorithm A is defined as $V_A = E_x[V_A(x)]$, where $V_A(x) = E_S[L(\bar{y}, f(x))]$ measures the expected loss, over the training sets, of the classifier's predictions with respect to the main prediction.
Unlike the bias of the algorithm, the variance is not independent of the training
set, even though it is independent of the true label of each x. The variance of
the algorithm can be seen as a measure of the stability, or lack thereof, of the
algorithm from the average prediction in response to the variation in the training
sets. Hence, when averaged, this variance gives an estimate of how much the
algorithm diverges from its most probable estimate (the average prediction).
Finally, the bias and variance are nonnegative if the loss function is nonnegative.
With the preceding definitions in place, we can decompose an arbitrary loss
function into its bias and variance components as follows.
We have an example x with true label y and a learning algorithm predicting
f (x) given a training set S ∈ S. Then, for an arbitrary loss function L, the
expected loss over the training sets S and the true label y can be decomposed as

$E_{S,y}[L(y, f(x))] = \lambda_1 N(x) + B_A(x) + \lambda_2 V_A(x),$

where λ1 and λ2 are loss-function-dependent factors, N(x) denotes the noise of example x, and the other quantities are as previously defined.
Let us now look at the application of this decomposition on two specific loss
functions in the context of regression and classification respectively. We start
with the regression case where the loss function of choice is the squared loss.
The squared loss is defined as

$L(y, f(x)) = (y - f(x))^2,$

with both y and f(x) being real valued. It can be shown that, for squared loss, $y^{\dagger} = E_y[y]$ and $\bar{y} = E_S[f(x)]$, and further that λ1 = λ2 = 1. Hence, in the case of squared loss, the following decomposition holds:

$E_{S,y}[(y - f(x))^2] = N(x) + B_A(x) + V_A(x).$
Coming to the focus of the book, the binary classification scenario, we can define
the bias–variance decomposition for both the asymmetric loss (unequal loss in
4 This can be seen by analyzing the behavior of the bias and variance in Theorem 2.2 in an analogous
manner.
can further be aggravated if the data are sparse. This explains why too complex
a model can result in a very high error in the regression case (squared loss) when
the data are sparse.
This does not mean that the contribution of the variance term in the case of too simple a model, or of the bias term in the case of too complex a model, is zero. However, in these two cases the bias and the variance terms, respectively, dominate the error, rendering the contribution of the
other terms relatively negligible. A model that can both fit the training data well
and generalize successfully at the same time can essentially be obtained by a
nontrivial trade-off of the bias and the variance of the model. A model with a very
high bias can result in underfitting whereas a model with a very high variance
results in overfitting. The best model would depend on an optimal bias–variance
trade-off and the nature of the training data (mainly sparsity and size).
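For intuition, the fully synthetic R sketch below estimates the squared-loss bias and variance of two regression fits (a straight line and a degree-9 polynomial) at a single test point by averaging their predictions over many training sets drawn from the same distribution; all names and settings here are our own.

set.seed(5)
truth <- function(x) sin(2 * pi * x)
x_test <- 0.3; n_train <- 20; n_sets <- 200
pred_at_test <- function(degree) {
  replicate(n_sets, {
    x <- runif(n_train)
    y <- truth(x) + rnorm(n_train, sd = 0.3)
    fit <- lm(y ~ poly(x, degree))
    predict(fit, newdata = data.frame(x = x_test))
  })
}
for (deg in c(1, 9)) {
  p <- pred_at_test(deg)
  cat("degree", deg,
      ": bias^2 =", round((mean(p) - truth(x_test))^2, 4),   # squared bias at x_test
      " variance =", round(var(p), 4), "\n")                 # variance over training sets
}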
5 It should be noted that there have also been recent attempts at designing algorithms in which this
restriction on both the training and the test data coming from the same distribution is not imposed.
trading this accuracy off in favor of a better measure over classifier complex-
ity. The behavior of the error of the learning algorithm can be analyzed by
decomposing it into its components, mainly bias and variance. The study of
bias and variance trade-off also gives insight into the learning process and
preferences of a learning algorithm. However, explicit characterization of the
bias and variance decomposition behavior of the learning algorithm is difficult
owing to two main factors: the lack of knowledge of the actual data generat-
ing distribution and a limited availability of data. The first limitation is indeed
the final goal of the learning process itself and hence cannot be addressed
in this regard. The second limitation is important in the wake of the first. In the
absence of the knowledge of the actual data-generating distribution, the quan-
tities of interest need to be estimated empirically by use of the data at hand.
Naturally, the more data at hand, the closer the estimate will be to the actual
value. A smaller dataset size can significantly hamper reliable estimates. Lim-
ited data availability plays a very significant role both in the case of model
selection and in assessing the performance of the learning algorithm on test
data. However, this issue can, to some extent, be ameliorated by use of what
are known as data resampling techniques. We discuss various resampling tech-
niques, their use and implications, and their effect on the performance estimates
of the learning algorithm in detail in Chapter 5. Even though our main focus is
not on the effect of resampling on model selection, we briefly discuss the issue
of model selection, where pertinent, while discussing some of the resampling
approaches.
In this book, we assume that, given a choice of the classifier space, we have
at hand the means to discover a classifier that is best at describing the given
training data and a guarantee on future generalization over such data.6 Now
every learning algorithm basically explores a different classifier space. Consider
another problem then. What if the classifier space that we chose does not contain
such a classifier? That is, what if there is another classifier space that can better
explain the data at hand? Let us go a step further. Assume that we have k
candidate classifier spaces each available with an associated learning algorithm
that can discover the classifier in each case that best describes the data. How do
we choose the best classifier space from among all these? Implicitly, how do we
choose the best learning algorithm given our domain of learning? Looked at in
another way, how can we decide which of the k algorithms is the best on a set of
given tasks? This problem, known as the evaluation of the learning algorithm,
is the problem that we explore in this book.
6 We use the term guarantee loosely here. Indeed there are learning algorithms that can give a
theoretical guarantee over the performance of the classifier on future data. Such guarantees, known
as risk bounds or generalization error bounds, have even been used to guide the model selection
process. However, in the present case, we also refer to implicit guarantees obtained as a result of
optimizing the model selection criterion such as the ERM or SRM.
Table 2.2. Results of running three learning algorithms on the labor dataset from the
UCI Repository
An Example
Consider the results of Table 2.2 obtained by running three learning algorithms
on the labor dataset from the UCI Repository for Machine Learning. The learning
algorithms used were the C4.5 decision trees learner (c45), naive Bayes (nb), and
Ripper (rip). All the simulations were run by a 10-fold cross-validation over the
labor dataset and the cross-validation runs were repeated 10 times on different
permutations of the data.7 The dataset contained 57 examples. Accordingly,
the reported classifier errors pertain to test folds of six examples for the first seven folds in the table, and to test folds of five examples for the last three folds. The training and test folds within each run
of cross-validation and for each repetition were the same for all the algorithms.
All the experiments that involve running classifiers in this and the subsequent
chapters, unless otherwise stated, are done with the WEKA machine learning
toolkit (Witten and Frank, 2005a).
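A rough R analogue of this protocol is sketched below; it assumes the rpart package is available and that dat is a data frame holding the dataset with a factor column named Class (both assumptions of ours, since the labor data are not bundled with R), and, unlike the WEKA runs above, the folds here are not stratified.

library(rpart)   # assumed available; a decision-tree learner standing in for the WEKA classifiers
set.seed(123)
reps <- 10; k <- 10
errors <- matrix(NA, nrow = reps, ncol = k)
for (r in 1:reps) {
  folds <- sample(rep(1:k, length.out = nrow(dat)))   # a fresh permutation of the data per repetition
  for (f in 1:k) {
    train <- dat[folds != f, ]
    test  <- dat[folds == f, ]
    fit   <- rpart(Class ~ ., data = train)
    pred  <- predict(fit, newdata = test, type = "class")
    errors[r, f] <- mean(pred != test$Class)           # error rate on this test fold
  }
}
mean(errors)   # estimate pooled over the 10 x 10 = 100 test folds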
With these results in the background, let us move on to discussing the concept
of random variables.
and
p(x) > 0, ∀x.
The expected value of a random variable x denotes its central value and is
generally used as a summary value of the distribution of the random variable.
The expected value generally denotes the average value of the random variable.
For a discrete random variable x taking m possible values xi , i ∈ {1, . . . , m},
the expected value can be obtained as
$$E[x] = \sum_{i=1}^{m} x_i \Pr(x_i),$$
where Pr(·) denotes the probability distribution, with Pr(xi ) denoting the proba-
bility of x taking on the value xi . Similarly, in the case in which x is a continuous
random variable with p(x) as the associated probability density function, the
expected value is obtained as
$$E[x] = \int x\, p(x)\, dx.$$
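As a quick illustration of the discrete case, the weighted sum can be evaluated directly in R; the distribution below is a made-up toy example, not one from our experiments:

> xs <- c(0, 0.5, 1)      # values taken by a toy discrete variable
> ps <- c(0.5, 0.3, 0.2)  # their (hypothetical) probabilities; they sum to 1
> sum(xs * ps)            # E[x] = sum over i of x_i * Pr(x_i)
[1] 0.35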
In practice, the average of a sample of observed values can then be used to estimate the expected value of the random variable. Hence,
if Sx is the set of values taken by the variable x, then the sample mean can be
calculated as
$$\bar{x} = \frac{1}{|S_x|} \sum_{i=1}^{|S_x|} x_i,$$
where |Sx| denotes the size of the set Sx.
Although the expected value of a random variable summarizes its central value, it does not, by itself, give any indication of the shape of the underlying distribution. That is, two random variables with the same expected value
can have entirely different underlying distributions. We can obtain a better sense
of a distribution by considering the statistics of variance in conjunction with the
expected value of the variable.
The variance is a measure of the spread of the values of the random variable
around its central value. More precisely, the variance of a random variable (prob-
ability distribution or sample) measures the degree of the statistical dispersion
or the spread of values. The variance of a random variable is always nonnega-
tive. Hence, the larger the variance, the more scattered the values of the random
variable with respect to its central value. The variance of a random variable x is
calculated as
$$\mathrm{Var}(x) = \sigma^2(x) = E\big[(x - E[x])^2\big] = E[x^2] - (E[x])^2.$$
In the continuous case, this means
$$\sigma^2(x) = \int (x - E[x])^2\, p(x)\, dx,$$
where E[x] denotes the expected value of the continuous random variable x
and p(x) denotes the associated probability density function. Similarly, for the
discrete case,
$$\sigma^2(x) = \sum_{i=1}^{m} \Pr(x_i)(x_i - E[x])^2,$$
where, as before, Pr(·) denotes the probability distribution associated with the discrete random variable x.
Given a sample of the values taken by x, we can calculate the sample variance
by replacing the expected value of x with the sample mean:
$$\mathrm{Var}_S(x) = \sigma_S^2 = \frac{1}{|S_x| - 1} \sum_{i=1}^{|S_x|} (x_i - \bar{x})^2.$$
Note that the denominator of the preceding equation is |Sx | − 1 instead of |Sx |.8
The preceding estimator is known as the unbiased estimator of the variance of
a sample. For large |Sx |, the difference between |Sx | and |Sx | − 1 is rendered
insignificant. The advantage of |Sx| − 1 is that, in this case, it can be shown that the expected value of the sample variance, E[σS²], equals the true variance of the sampled random variable.
The variance of a random variable is an important statistical indicator of the
dispersion of the data. However, the unit of the variance measurement is not the
same as the mean, as is clear from our discussion to this point. In some scenarios,
it can be more helpful if a statistic is available that is comparable to the expected
value directly. The standard deviation of a random variable fills this gap. The
standard deviation of a random variable is simply the square root of the variance.
When estimated on a population or sample of values, it is known as the sample
standard deviation. It is generally denoted by σ (x). This also makes it clear that
using σ 2 (x) for variance denotes that the unit of the measured variance is the
square of the expected value statistic. σ(x) is calculated as
$$\sigma(x) = \sqrt{\mathrm{Var}(x)}.$$
Similarly, we can obtain the sample standard deviation by considering the square
root of the sample variance. One point should be noted. Even when an unbiased
estimator of the sample variance is used (with |Sx | − 1 in the denominator
instead of |Sx |), the resulting estimator is still not an unbiased estimator of
the sample standard deviation.9 Furthermore, it underestimates the true standard deviation. A biased estimator of the sample variance can also be used
without significant deterioration. An unbiased estimator of the sample standard
deviation is not known except when the variable obeys a normal distribution.
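These estimators are directly available in R; the following sketch, run on a small made-up sample, confirms that the built-in var() and sd() functions use the |Sx| − 1 denominator:

> x <- c(0.2, 0.1, 0.4, 0.3, 0)            # a hypothetical sample
> sum((x - mean(x))^2) / (length(x) - 1)   # unbiased sample variance
[1] 0.025
> var(x)                                   # identical: var() divides by n - 1
[1] 0.025
> sd(x)                                    # sample standard deviation = sqrt(var(x))
[1] 0.1581139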
Another significant use of the standard deviation will be seen in terms of
providing confidence to some statistical measurements. One of the main such
uses involves providing the confidence intervals or margin of error around a
measurement (mean) from samples. We will subsequently see an illustration.
Inductive inference does make the underlying assumption here that the data for both the training and the test sets come from the same distribution.
The examples are assumed to be sampled in an independently and identically
distributed (i.i.d.) manner. The most general assumption that can be made is
that the data (and possibly their labels) are assumed to be generated from some
arbitrary underlying distribution. That is, we have no knowledge of this true
distribution whatsoever. This is indeed a reasonable assumption as far as learning
is concerned because the main aim is to be able to model (or approximate) this
distribution (or the label-generation process) as closely as possible. As a result,
each example in the test set can be seen as being drawn independently from
some arbitrary but fixed data distribution. The performance of the classifier
applied to each example can be measured for the criterion of interest by use of
corresponding performance measures. The criterion of interest can be, say, how
accurately the classifier predicts the label of the example or how much the model
depicted by the classifier errs in modeling the example. As a result, we can, in
principle, also model these performance measures as random variables, again
from an unknown distribution possibly different from the one that generates
the data and the corresponding labels. This is one of the main strategies behind
various approaches to classifier assessment as well as to evaluation.
Example 2.1. Recall Table 2.2 that presented results of running 10 runs of
10-fold cross-validation on the labor dataset from the UCI Repository. Now a
classifier run on each test example (in each fold of each run) for the respective
learning algorithms gives an estimate of its empirical error by means of the
indicator loss function. The classifier experiences a unit loss if the predicted
label does not match the true label. The empirical risk of the classifier in each
fold can then be obtained by averaging this loss over the number of examples in
the corresponding fold. Table 2.3 gives this empirical risk for all the classifiers.
The entries of Table 2.3 correspond to the entries of Table 2.2 but are divided by
the number of examples in the respective folds. That is, the entries in the first
seven columns are all divided by six, whereas those in the last three columns are
each divided by five. Then we can model the empirical risk of these classifiers
as random variables with the estimates obtained over each test fold and in each
trial run as their observed values. Hence we have 100 observed values for each
of the three random variables used to model the empirical risk of the three
classifiers. Note that the random variable used for the purpose can have values
in the [0, 1] range, with 0 denoting no risk (all the examples classified correctly)
and 1 denoting the case in which all the examples are classified incorrectly.
Let us denote the empirical risk by RS (·). Then the variables RS (c45), RS (rip),
and RS (nb) denote the random variables representing the empirical risks for the
decision tree learner, Ripper, and the naive Bayes algorithm, respectively. We
can now calculate the sample means for the three cases from the population of
100 observations at hand.
Table 2.3. Empirical risks for the classifiers in the [0,1] range, from Table 2.2

Trial  Classifier  f1      f2      f3      f4      f5      f6      f7      f8   f9   f10
1      c4.5        0.5     0       0.3333  0       0.3333  0.3333  0.3333  0.2  0.2  0.2
       rip         0.3333  0.5     0       0       0.3333  0.5     0.1667  0    0.2  0
       nb          0.1667  0       0       0       0       0.1667  0       0.4  0    0
2      c4.5        0.3333  0.1667  0.1667  0.3333  0.3333  0       0       0    0.2  0.2
       rip         0.3333  0.1667  0       0       0.3333  0.1667  0       0.2  0.2  0
       nb          0.3333  0       0.1667  0.1667  0       0       0       0    0    0
3      c4.5        0       0       0.3333  0.3333  0.1667  0.3333  0.1667  0    0.2  0.2
       rip         0       0       0.1667  0.3333  0.1667  0.1667  0.1667  0.2  0.4  0.2
       nb          0       0       0       0       0       0.1667  0       0    0    0
4      c4.5        0.5     0.1667  0.3333  0.1667  0       0.1667  0.5     0.2  0.2  0.4
       rip         0.3333  0       0.1667  0.1667  0.1667  0       0.6667  0.2  0.2  0.2
       nb          0.3333  0       0       0.1667  0       0       0       0    0    0.2
5      c4.5        0       0.1667  0.3333  0.1667  0       0.1667  0       0.2  0.4  0.2
       rip         0.3333  0.1667  0.1667  0.1667  0       0.1667  0       0.4  0    0.2
       nb          0       0.1667  0       0       0.1667  0       0       0    0    0.4
6      c4.5        0.3333  0.1667  0.1667  0.1667  0.6667  0.1667  0.3333  0.2  0.2  0
       rip         0       0       0.3333  0       0.5     0       0       0    0.2  0
       nb          0       0       0.1667  0       0.1667  0.1667  0       0    0    0
7      c4.5        0.1667  0.3333  0.1667  0.3333  0.1667  0.1667  0.3333  0.2  0    0.2
       rip         0.1667  0.1667  0.1667  0.3333  0.1667  0.1667  0.3333  0.4  0    0.2
       nb          0.1667  0.1667  0.1667  0       0.1667  0.3333  0.3333  0    0.2  0
8      c4.5        0       0.1667  0.1667  0       0       0       0.1667  0    0.6  0.4
       rip         0.1667  0.1667  0.1667  0       0       0.1667  0       0    0.2  0.2
       nb          0       0       0       0       0       0       0       0    0.2  0.2
9      c4.5        0.1667  0.3333  0.1667  0.3333  0.3333  0.3333  0       0.6  0.2  0.2
       rip         0       0       0.1667  0       0       0.1667  0       0.4  0    0.2
       nb          0       0       0       0.1667  0       0       0       0    0    0.2
10     c4.5        0.5     0.1667  0.1667  0.5     0.5     0.3333  0       0.4  0.4  0
       rip         0.5     0.1667  0.1667  0.1667  0       0.3333  0.3333  0.2  0.2  0.2
       nb          0       0.1667  0       0       0       0       0.1667  0.2  0.2  0
The sample mean for these random variables would then indicate the overall
average value taken by them over the folds and runs of the experiment. We can
compute these means, indicated by an overbar, by averaging the values in
the respective cells of Table 2.3. The values can be recorded as vectors in R and
the mean can then be calculated as follows:
Listing 2.1: Sample R command to input the sample values and calculate the
mean.
> c45 = c(.5, 0, .3333, 0, .3333, .3333, .3333, .2, .2, .2,
          .3333, .1667, .1667, .3333, .3333, 0, 0, 0, .2, .2,
          0, 0, .3333, .3333, .1667, .3333, .1667, 0, .2, .2,
          .5, .1667, .3333, .1667, 0, .1667, .5, .2, .2, .4,
          0, .1667, .3333, .1667, 0, .1667, 0, .2, .4, .2,
          .3333, .1667, .1667, .1667, .6667, .1667, .3333, .2, .2, 0,
          .1667, .3333, .1667, .3333, .1667, .1667, .3333, .2, 0, .2,
          0, .1667, .1667, 0, 0, 0, .1667, 0, .6, .4,
          .1667, .3333, .1667, .3333, .3333, .3333, 0, .6, .2, .2,
          .5, .1667, .1667, .5, .5, .3333, 0, .4, .4, 0)
> mean(c45)
[1] 0.217668
> nb = c(.1667, 0, 0, 0, 0, .1667, 0, .4, 0, 0,
         .3333, 0, .1667, .1667, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, .1667, 0, 0, 0, 0,
         .3333, 0, 0, .1667, 0, 0, 0, 0, 0, .2,
         0, .1667, 0, 0, .1667, 0, 0, 0, 0, .4,
         0, 0, .1667, 0, .1667, .1667, 0, 0, 0, 0,
         .1667, .1667, .1667, 0, .1667, .3333, .3333, 0, .2, 0,
         0, 0, 0, 0, 0, 0, 0, 0, .2, .2,
         0, 0, 0, .1667, 0, 0, 0, 0, 0, .2,
         0, .1667, 0, 0, 0, 0, .1667, .2, .2, 0)
> mean(nb)
[1] 0.065338
Note here that c() is a function that combines its arguments to form a vector; mean() computes the sample mean of the vector passed to it.
52 Machine Learning and Statistics Overview
What we essentially just did was model the empirical risk as a continuous
random variable that can take values in the [0, 1] interval and obtain the statistic
of interest. An alternative is to model the risk as a binary variable that can take values in {0, 1}. Applying the classifier to each test example in each of
the test folds would then give an observation. These values can then be averaged
over to obtain the corresponding sample means. Hence, by adding the errors
made in each test fold and then further over all the trials, we would end up with a
population of size 10 × 57 = 570. The sample means can then be calculated as
$$\bar{R}(c45) = \bar{R}_S(c45) = \frac{14 + 10 + 10 + 15 + 9 + 14 + 12 + 8 + 15 + 17}{570} = 0.2175,$$
$$\bar{R}(rip) = \bar{R}_S(rip) = \frac{12 + 8 + 10 + 12 + 9 + 6 + 12 + 6 + 5 + 13}{570} = 0.1632,$$
$$\bar{R}(nb) = \bar{R}_S(nb) = \frac{4 + 4 + 1 + 4 + 4 + 3 + 9 + 2 + 2 + 4}{570} = 0.0649.$$
Note the marginal difference in the estimated statistic as a result of the round-off
error. Both the variations of modeling the performance as a random variable are
equivalent (although in the case of discrete modeling, it can be seen that, instead
of modeling the empirical risk, we are modeling the indicator loss function).
We will use the continuous random variables for modeling as in the preceding
first case because this allows us to model the empirical risk (rather than the loss
function). Given that these sample means represent the mean empirical risk of
each classifier, this suggests that nb classifies the domain better than rip, which
in turn classifies the domain better than c4.5. However, knowledge of the average performance alone, via the mean, is not enough to give us an idea of the classifiers' relative behavior. We are also interested in the
spread or deviation of the risk from this mean. The standard deviation, by virtue
of representation in the same units as the data, is easier to interpret. It basically
tells us whether the elements of our distribution have a tendency to be similar or
dissimilar to each other. For example, in a population made up of 22-year-old
ballerinas, the degree of joint flexibility is much more tightly distributed around
the mean than it is in the population made up of all the 22-year-old young
ladies (ballerinas and nonballerinas included). Thus the standard deviation, in
the first case, will be smaller than in the second. This is because ballerinas
have to be naturally flexible and must train to increase this natural flexibility
further, whereas in the general population, we will find a large mixture of young
ladies with various degrees of flexibility and different levels of training. Let
us then compute the standard deviations with respect to the mean empirical
risks, in each case using R [the built-in sd( ) function can be used for the
purpose]:
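Using the c45 and nb vectors entered in Listing 2.1 (a rip vector built analogously from Table 2.3 gives the value 0.1492936 used later in Listing 2.5), we would obtain

> sd(c45)
[1] 0.1603415
> sd(nb)
[1] 0.1062516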
2.2.2 Distributions
We introduced the notions of probability distributions and density functions in
the previous subsection. Let us now focus on some of the main distributions
that have both a significant impact on and implications for the approaches cur-
rently used in assessing and evaluating the learning algorithms. Although we can
model data by using a wide variety of distributions, we would like to focus on two
important distributions most relevant to the evaluation of learning algorithms:
the Normal or Gaussian distribution and the binomial distribution. Among these,
the normal distribution is the most widely used distribution for modeling classi-
fier performance for a variety of reasons, such as the analytical tractability of the
results under this assumption, asymptotic analysis capability owing to the central
limit theorem subsequently discussed, asymptotic ability to model a wide variety
of other distributions, and so on. As we will see later, many approaches impose
a normal distribution assumption on the performance measures (or some func-
tion of the performance measures). For instance, the standard t test assumes the normality of the sampling distribution of the sample mean.
Figure 2.2. Normal distributions centered around 0 with standard deviations (sd = √σ²) of 1 (the standard normal distribution), 0.5, and 2.
$$\Pr(x = k) = \binom{m}{k} p_s^k (1 - p_s)^{m-k},$$
which is nothing but the probability of exactly k successes in m trials.
Consider a classifier that maps each given example to one of a fixed number
of labels. Each example also has an associated true label. We can model the event
as a success when the label identified by the classifier matches the true label
of the example. Then the behavior of the classifier prediction over a number
of different examples in a test set can be modeled as a binomial distribution.
The expected value E[x] of a binomial distribution can be shown to be mps
and its variance to be mps (1 − ps ). In the extreme case of m = 1, the binomial
distribution becomes a Bernoulli distribution, which models the probability
of success in a single trial. On the other hand, as m → ∞, the binomial distribution approaches the Poisson distribution when the product mps remains fixed. The advantage of sometimes using a Poisson distribution to approximate
a binomial distribution can come from the reduced computational complexity.
However, we do not delve into these issues, as these are beyond the scope of
this book.
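To make this concrete, binomial quantities of this kind are directly available in R; the numbers below are hypothetical, merely illustrating a classifier with true accuracy ps = 0.7 evaluated on m = 100 test examples:

> dbinom(80, size = 100, prob = 0.7)                       # P(exactly 80 correct)
> pbinom(79, size = 100, prob = 0.7, lower.tail = FALSE)   # P(80 or more correct)
> 100 * 0.7          # expected number of successes, m * ps
> 100 * 0.7 * 0.3    # variance, m * ps * (1 - ps)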
Let us then see the effect of m on the binomial distribution. Figure 2.3 shows
three unbiased binomial distributions (an unbiased binomial distribution has
ps = 0.5) with increasing trial sizes. It can be seen that as the trial size increases, the binomial distribution approaches the normal distribution (for fixed ps).10

Figure 2.3. Binomial distributions with a probability of 0.5 (unbiased) and trial sizes m = 5, m = 10, and m = 100. As the trial size increases, it is clear that the binomial distribution approaches the normal distribution.
The probability of a success in a Bernoulli trial also affects the distribution.
Figure 2.4 shows two biased binomial distributions with success probabilities
of 0.3 and 0.8 over 10 trials. In these cases, the graph is asymmetrical.
Other Distributions
Many other distributions are widely used in practice for data modeling in various
fields. Some of the main ones are the Poisson distribution, used to model the
number of events occurring in a given time interval, the geometric distribution,
generally used to model the number of trials required before a first success
(or failure) is obtained, and the uniform distribution, generally used to model
random variables whose range of values is known and that can take any value
in this range with equal probability. These, however, are not directly relevant
to the subject of this book. Hence we do not devote space to discussing these
distributions in detail. Interested readers can find these details in any standard
statistics text.
10 Note that a continuity correction, such as one based on the de Moivre–Laplace theorem, is recommended in the case in which a normal approximation to the binomial is used for large m.
Figure 2.4. Binomial distributions with probabilities of success ps = 0.3 and ps = 0.8 with m = 10. For these biased distributions, the graph has become asymmetrical.
Sampling Distribution
Consider a sample obtained from sampling data according to some given dis-
tribution. We can then calculate various kinds of statistics on this sample (e.g.,
mean, variance, standard deviation, etc.). These are called sample statistics. The
sampling distribution denotes the probability distribution or a probability den-
sity function associated with the sample statistic under repeated sampling of
the population. This sampling distribution then depends on three factors, viz.,
the size of the sample, the statistic itself, and the underlying distribution of the
population. In other words, the sampling distribution of a statistic (for example,
the mean, the median, or any other description or summary of a dataset) is the
distribution of values obtained for those statistics over all possible samplings of
the same size from a given population.
For instance, suppose we obtain m samples of a random variable x, denoted {x1, x2, . . . , xm}, with known probability distribution, and calculate the sample mean $\bar{x} = \frac{1}{m}\sum_{i=1}^{m} x_i$. When using x to model the empirical risk, we can obtain
the average empirical risk on a dataset of m examples by testing the classifier on
each of these. Further, because this average empirical risk is a statistic, repeated
sampling of the m data points and calculating the empirical risk in each case will
enable us to obtain a sampling distribution of the empirical risk estimates. There
can be a sampling distribution associated with any sample statistic estimated
(although not always computable). For instance, in the example of Table 2.3, we
use a random variable to model the empirical risk of each classifier and calculate
the average (mean) empirical risk on each fold. Hence, the 10 runs with 10 folds
each give us 100 observed values of RS (·) for each classifier over which we
calculate the mean. This, for instance, results in the corresponding sample mean
of the empirical risk in the case of c4.5 to be R(c45) = 0.217668.
Because the populations under study are usually finite, the true sampling dis-
tribution is usually unknown (e.g., note that the trials in the case of Table 2.3 are
interdependent). Hence it is important to understand the effect of using a single
estimate of sampling distribution based on one sampling instead of repeated
samplings. Typically, the mean of the statistic is used as an approximation of the
actual mean obtained over multiple samplings. The central limit theorem plays
an important role in allowing for this approximation. Let us see the reasoning
behind this.
Denote a random variable x that is normally distributed with mean μ and variance σ² as x ∼ N(μ, σ²). Then the sampling distribution of the sample mean x̄ computed from m-sized samples is
$$\bar{x} \sim N\!\left(\mu, \frac{\sigma^2}{m}\right). \tag{2.9}$$
Moreover, if x is sampled from a finite-sized population of size N, then the sampling distribution of the sample mean becomes
$$\bar{x} \sim N\!\left(\mu, \frac{N - m}{N - 1} \times \frac{\sigma^2}{m}\right). \tag{2.10}$$
Note here the role of the central limit theorem just described. With increasing N, Approximation (2.10) approaches Approximation (2.9). Moreover, note that in both (2.9) and (2.10) the sampling distribution of the sample mean is centered on the actual mean μ.
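This behavior is easy to verify by simulation; the following sketch (with arbitrarily chosen μ = 0.2, σ = 0.15, and m = 100) draws repeated m-sized samples and checks that the standard deviation of the sample means approaches σ/√m:

> set.seed(1)
> mu <- 0.2; sigma <- 0.15; m <- 100
> xbars <- replicate(10000, mean(rnorm(m, mean = mu, sd = sigma)))
> mean(xbars)   # close to mu
> sd(xbars)     # close to sigma / sqrt(m) = 0.015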
Another application of the sampling distribution is to obtain the sampling
distribution of the difference between two means. We will see this in the case of hypothesis testing. Let x1 and x2 be two random variables that are normally distributed with means μ1 and μ2, respectively, and corresponding variances σ1² and σ2²; that is, x1 ∼ N(μ1, σ1²) and x2 ∼ N(μ2, σ2²). We are now interested in the sampling distribution of the sample mean of x1 over m1-sized samples, denoted x̄1, and that of x2 over m2-sized samples, denoted x̄2. It can then be shown that the difference of the sample means is distributed as
$$\bar{x}_1 - \bar{x}_2 \sim N\!\left(\mu_1 - \mu_2,\ \frac{\sigma_1^2}{m_1} + \frac{\sigma_2^2}{m_2}\right).$$
Finally, consider a variable x that is binomially distributed with parameter p, that is, x ∼ Bin(p); then it can be shown that the sample proportion p̂ also follows a binomial distribution parameterized by p, that is, p̂ ∼ Bin(p).11
11 Note that this notation differs from the previous one for binomial distribution for the sake of
simplicity since the number of trials is assumed to be fixed and uniform across trials.
$$\bar{x} = \frac{1}{|S_x|} \sum_{i=1}^{|S_x|} x_i,$$
with each xi denoting an observed value of x in the sample Sx and |Sx| denoting the size of the set Sx. Similarly, we can calculate the standard error (sample standard deviation), which, according to our assumption, will approximate σ/√|Sx| (see Subsection 2.2.2). Next we can standardize the statistic to obtain the following
random variable:
$$Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{|S_x|}}.$$
Now we wish to find, at probability 1 − α, the lower and upper bounds on the values of Z. That is, we wish to find ZP such that
$$\Pr(-Z_P \leq Z \leq Z_P) = 1 - \alpha.$$
Unfolding the definition of Z then gives the confidence limits around the sample mean:
$$CI_{lower} = \bar{x} - Z_P \frac{\sigma(x)}{\sqrt{|S_x|}},$$
$$CI_{upper} = \bar{x} + Z_P \frac{\sigma(x)}{\sqrt{|S_x|}}.$$
Note that this is essentially the two-sided confidence interval, and hence we have
considered a confidence parameter of α/2 to account for the upper and the lower
bounds each while considering the CDF. This will have important implications
in statistical hypothesis testing, as we will see later. The discussion up until here
on the manner of calculating the confidence intervals was aimed at elucidating
the process. However, tables with ZP values corresponding to the desired level
of significance are available (see Table A.1 in Appendix A). Hence, for desired
levels of confidence, these values can be readily used to give the confidence
intervals.
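In R, the same ZP values can be obtained from the standard normal quantile function instead of the table; for a two-sided interval at confidence level 1 − α:

> alpha <- 0.05
> qnorm(1 - alpha / 2)   # alpha/2 in each tail for a two-sided interval
[1] 1.959964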
Let us go back to our example from Table 2.3. We calculated the mean
empirical risk of the three classifiers on the labor dataset. Using the sample
standard deviation, we can then obtain the confidence intervals for the true risk.
The value of ZP corresponding to α = 0.05 (95% confidence level) is found to
be 1.96 from Table A.1 in Appendix A. Hence we can obtain, for c4.5,
$$CI_{lower}^{\bar{R}(c45)} = \bar{R}(c45) - Z_P \frac{\sigma(x)}{\sqrt{|S_x|}} = 0.217668 - 1.96 \times \frac{0.1603415}{\sqrt{100}} = 0.186241.$$
Similarly,
$$CI_{upper}^{\bar{R}(c45)} = \bar{R}(c45) + Z_P \frac{\sigma(x)}{\sqrt{|S_x|}} = 0.217668 + 1.96 \times \frac{0.1603415}{\sqrt{100}} = 0.249095.$$
The confidence limits for the other two classifiers can be obtained in an analogous
manner.
As we mentioned earlier, the confidence interval approach has also been
important in statistical hypothesis testing. One of the most immediate applica-
tions of this approach employing the assumption of normal distribution on the
statistic can be found in the commonly used significance test, the t test. The
confidence interval calculation is implicit in the t test, which can be used to verify whether the statistic differs from the one assumed by the null hypothesis. There are many variations of the t test. We demonstrate the so-called one-sample t test on the empirical risk of the three classifiers in our example of Table 2.3. Here the test is used to verify whether the sample mean differs from 0 in a statistically
significant manner. This can be done in R for the empirical risks of the three
classifiers, also giving the confidence interval estimates, as follows:
Listing 2.4: Sample R command for executing the t test and thereby obtaining the confidence interval for the means of the sample data.
> t.test(c45)

        One Sample t-test

data:  c45
t = 13.5753, df = 99, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 0.1858528 0.2494832
sample estimates:
mean of x
 0.217668

> t.test(jrip)

        One Sample t-test

data:  jrip
t = 10.9386, df = 99, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 0.1336829 0.1929291
sample estimates:
mean of x
 0.163306

> t.test(nb)

        One Sample t-test

data:  nb
t = 6.1494, df = 99, p-value = 1.648e-08
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 0.04425538 0.08642062
sample estimates:
mean of x
 0.065338
>
These confidence intervals can be plotted with the plotCI command available in the gplots package, as follows:
Listing 2.5: Sample R commands for plotting the confidence intervals of the c4.5, rip, and nb means when applied to the labor data.
> library(gplots)
> means <- c(.217668, .163306, .065338)
> stdevs <- c(.1603415, .1492936, .1062516)
> ns <- c(100, 100, 100)
> ciw <- qt(0.975, ns) * stdevs / sqrt(ns)   # 0.975 quantile for a two-sided 95% interval
> plotCI(x=means, uiw=ciw, col="black", labels=round(means, 3),
         xaxt="n", xlim=c(0, 5))
> axis(side=1, at=1:3, labels=c("c45", "jrip", "nb"), cex=0.7)
The result is shown in Figure 2.5, where the 95% confidence intervals for the
three classifiers are shown around the mean. The figure shows that the true means
of the error rates of these classifiers are probably quite distinct from one another,
given the little, if any, overlap displayed by the graphs (an effect enhanced by
the scale on the vertical axis too). In particular, the nb interval does not overlap
with either the c4.5’s or rip’s, and there is only a marginal overlap between the
intervals of c4.5 and rip. It is worth noting that, although our conclusions may
seem quite clear and straightforward, the situation is certainly not as simple as
it appears. For starters, recall that these results have a 95% confidence level,
indicating that there still is some likelihood of the true mean not falling in the
intervals obtained around the sample mean empirical risks. Moreover, the results
and subsequent interpretations obtained here rely on an important assumption
that the sampling distribution of the empirical risk can be approximated by a
normal distribution. Finally, there is another inherent assumption, which we did
not state explicitly earlier and indeed is more often than not taken for granted,
that of the i.i.d. nature of the estimates of risk in the sample. This final assumption
Figure 2.5. The confidence intervals for c4.5, nb, and rip.
is obviously violated, given that the repeated runs of the 10-fold cross-validation
are done by sampling from the same dataset. Indeed, these assumptions can have
important implications. The normality assumption in particular, when violated,
can yield inaccurate or even uninterpretable estimates of the confidence intervals,
as we will see later in the book (Chapter 8). We will also discuss the effects of
such repeated sampling from the same set of instances (Chapter 5).
Another issue has to do with the importance that confidence intervals have
started gaining in the past decade, vis-à-vis the hypothesis testing approach dis-
cussed in the next subsection. Indeed, in recent years, arguments in favor of using
confidence intervals over hypothesis testing have appeared (e.g., Armstrong,
2007). The reasoning behind the recommendation goes along the following two
lines:
• First, the meaning of the significance value used in hypothesis testing has
come under scrutiny and has become more and more contested. In contrast,
confidence intervals have a simple and straightforward interpretation.
• Second, significance values allow us to accept or reject a hypothesis (as we
will see in the next subsection), but cannot help us decide by how much
to accept or reject it. That is, the significance testing approach does not
allow quantification of the extent to which a hypothesis can be accepted
or rejected. Conversely, confidence intervals give us a means by which we
can consider such degrees of acceptance or rejection.
Even though there is some merit to the preceding arguments with regard to
the limitations of hypothesis testing, these shortcomings do not, by themselves,
make confidence intervals a de facto method of choice. The confidence interval approach also makes a normality assumption on the empirical error that is often
violated. As we will see in Chapter 8, this can result in unrealistic estimates.
Moreover, statistical hypothesis testing, when applied taking into account the
underlying assumptions and constraints, can indeed be helpful. To make this
discussion clearer, let us now discuss the basics of statistical hypothesis testing.
6. If the observed test statistic lies in the critical region (has extremely low
probability of being observed under the null-hypothesis assumption), reject
the null hypothesis H0 . However, note that, when this is not the case, one
would fail to reject the null hypothesis. This does not lead to the conclusion
that one can accept the null hypothesis.
Step 6 makes an important point. In the case in which the null hypothesis
cannot be rejected based on the observed value of the test statistic, the conclusion
that it must necessarily be accepted does not hold. Recall that the null hypothesis was assumed. Hence not being able to disprove it does not necessarily confirm it.
We are interested in comparing the performance of two learning algorithms.
In our case then the null hypothesis can, for instance, assume that the difference
between the empirical risks of the two classifiers is (statistically) insignificant.
That is, the two estimates come from the same population. Because the errors are
estimated on the same population, the only difference being the two classifiers,
this then translates to meaning that the two classifiers behave in a more or less
similar manner. Rejecting this hypothesis would then mean that the observed
difference between the classifiers’ performances is indeed statistically significant
(technically this would lead us to conclude that the difference is not statistically
insignificant; however, our null-hypothesis definition allows us to draw this
conclusion).
labor domain, c4.5 is once again not as accurate as nb. In such a case, we
hypothesize that the difference between the true risk of c4.5 [R(c45)] and that
of nb [R(nb)] has a mean of 0. Note that our assumption says that nb is never
worse than c4.5, and hence this difference is never considered to be less than
zero. Under these assumptions, we can obtain the observed difference in the
mean empirical risk of the two classifiers R(c45) – R(nb) and apply a one-
tailed test to verify if the observed difference is significantly greater than 0. On
the other hand, we may not have any a priori assumption over the classifiers’
performance difference, e.g., in the case of c4.5 and rip in the preceding example.
In such cases, we might be interested in knowing, for instance, only whether
the observed difference between their performance is indeed significant. That
is, this difference (irrespective of what classifier is better) would hold if their
true distribution were available. We leave the details of how such testing is
performed in practice and under what assumptions to a more elaborate treatment
in Chapter 6.
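As a preview, such tests could be run in R on the per-fold risk vectors of Listing 2.1; this is only a sketch, since the validity of the underlying assumptions for cross-validated estimates is taken up in Chapters 5 and 6:

> # One-tailed paired test: is R(c45) - R(nb) significantly greater than 0?
> t.test(c45, nb, paired = TRUE, alternative = "greater")
> # Two-tailed paired test for c4.5 versus rip, with no a priori direction
> # (rip denoting Ripper's 100 per-fold risks, entered as in Listing 2.1):
> # t.test(c45, rip, paired = TRUE, alternative = "two.sided")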
Of course, there are advantages and limitations to both the parametric and the
nonparametric approaches. The nonparametric tests, as a result of independence
from any modeling assumption, are quite useful in populations for which outliers
skew the distribution significantly (not to mention, in addition, the case in
which no distribution underlies the statistics). However, the consequence of
this advantage of model independence is the limited power (see the discussion
on the type II error in the following subsection) of these tests because limited
generalizations can be made over the behavior or comparison of the sample
statistic in the absence of any concrete behavior model. Parametric approaches, on the other hand, are useful when the distributional assumptions are (even approximately) met because, in that case, strong conclusions can be reached.
However, in the case in which the assumptions are violated, parametric tests can
be grossly misleading.
Let us now see how we can characterize the hypothesis tests themselves in
terms of their ability to reject the null hypothesis. This is generally quantified
by specifying two quantities of interests with regard to the hypothesis test, its
type I error and its type II error, the latter of which also affects the power of the
test, as we subsequently discuss.
Definition 2.5. A type I error (α) corresponds to the error of rejecting the
null hypothesis H0 when it is, in fact, true (false positive). A type II error (β)
corresponds to the error of failing to reject H0 when it is false (false negative).
Note that the type II error basically quantifies the extent to which a test
validates the null hypothesis when it in fact does not hold. Hence we can define the power of a test by taking into account the complement of the type II error, as follows:
Definition 2.6. The power of a test is the probability of rejecting H0 , given that
it is false:
Power = 1 − β.
The preceding two types of errors generally trade off against each other. That is, reducing the type I error makes the hypothesis test more conservative, in that it does not reject H0 too easily. As a result of this tendency, the type II error of the test, that of failing to reject H0 even when it does not hold, increases. This then gives us a test with low power. A low-power test may be insufficient, for instance, for finding a difference in classifier performance significant, even when it is so. The
parametric tests can have more power than their nonparametric counterparts
because they can characterize the sample statistic in a well-defined manner.
However, this is true only when the modeling assumption on the distribution of
the sample statistics holds. The α parameter is the confidence parameter in the
sense that it specifies how unlikely the result must be if one is to reject the null
hypothesis. A typical value of α is 0.05 or 5%. Reducing α then amounts to
making the test more rigorous. We will see in Chapter 5 how this α parameter
also affects the sample size requirement for the test in the context of analyzing
the holdout method of evaluation.
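R's built-in power.t.test() function makes this interplay between α, β, the sample size, and the effect being sought explicit; the numbers below are purely illustrative:

> # Power of a paired t test to detect a mean risk difference of 0.05,
> # with standard deviation 0.16, at alpha = 0.05, over n = 100 folds
> power.t.test(n = 100, delta = 0.05, sd = 0.16, sig.level = 0.05,
+              type = "paired", alternative = "two.sided")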
In addition to these characteristics arising from the inherent nature of the test, the power of tests can be increased in other ways too (although with corresponding costs), as follows:
• Increasing the size of the type I error: As just discussed, the first and simplest way to increase power or lower the type II error is to do so at the expense of the type I error. Although we usually set α to 0.05, if we
increased it to 0.10 or 0.20, then β would be decreased. An important
question, however, is whether we are ready to take a greater chance at a
type I error; i.e., whether we are ready to take the chance of claiming that
a result is significant when it is not, to increase our chance of finding a
significant result when one exists. Often this is not a good alternative, and
it would be preferable to increase power without having to increase α.
• Using a one-tailed rather than a two-tailed test: One-tailed tests are more powerful than two-tailed tests for a given α level. Indeed, running a two-tailed test for α2 = 0.05 is equivalent to running two one-tailed tests
for α1 = 0.025. Because α1 is very small in this case, its corresponding β1
is large, and thus the power is small. Hence, in moving from a two-tailed
α2 = 0.05 test to a one-tailed α1 = 0.05 test, we are doubling the value
of α1 and thus decreasing the value of β1 , in turn increasing the power
of the test. Of course, this can be done only if we know which of the
two distribution means considered in the test are expected to be higher. If
this information is not available, then a two-tailed test is necessary and no
power can be gained in this manner.
• Choosing a more sensitive evaluation measure: One way to separate two
samples, or increase power, is to increase the difference between the means
of the two samples. We can do this, when setting up our experiments, by
selecting an evaluation measure that emphasizes the effect we are testing.
For example, let us assume that we are using the balanced F measure (also
called the F1 measure) to compare the performance of two classifiers, but let
us say that precision is the aspect of the performance that we care about the
most. (Both of these performance measures are discussed in Chapter 3.) Let
us assume that, of the two classifiers tested, one was specifically designed
to improve on precision at the expense of recall (complementary measure
of precision used to compute the F measure, also discussed in Chapter 3).
If F1 is the measure used, then the gains in precision of the more precise classifier may be partly offset by its losses in recall, diluting the very difference we wish to detect.
Effect Size
The idea of separating the two samples in order to increase power comes from the
fact that power is inextricably linked to the amount of overlap that occurs between
two distributions. Let us see what we mean by this in the context of comparing
classifier performance. If we are interested in characterizing the difference in risk
of two classifiers, then, in the parametric case, our null hypothesis can assume
this difference to be distributed according to a normal distribution centered at
zero. Let us call this the standard distribution. We would then calculate the
parameters of the distribution of the statistics of interest from the data. Let
us denote this as the empirical distribution. Then, the farther the empirical
distribution is from the standard distribution, the more confidence we would
have in rejecting the null hypothesis. The strength with which we can reject the
null hypothesis is basically the effect size of the test. Figure 2.6 illustrates this
notion graphically. The distribution on the left is the standard distribution, and
the one on the right is the empirical distribution with the straight vertical line
denoting the threshold of the hypothesis test beyond which the null hypothesis
is rejected. The overlap to the right of the threshold basically signifies the
probability that we reject the null hypothesis even when it holds (type I error, α).
The tail of the standard distribution is nothing but α. Similarly, the overlap to
the left of the threshold denotes the probability that we do not reject the null
hypothesis even though it does not hold (type II error, β). It is clear that moving
the separating line to the left increases α while decreasing β. In fact, choosing
any significance level α implicitly defines a trade-off in terms of increasing β.
Note, however, that in the event of very large samples, these trade-offs can be
13 We assume here that we do not wish to disregard recall altogether because otherwise we would use
the precision measure directly.
deemed relatively unnecessary. Note here that we assumed in Figure 2.6 that the
standard distribution lies on the left of the empirical distribution. This need not
always be the case. If such a direction of the effect is known, we can use a one-
tailed test. However, if we do not know this directionality, a two-tailed test should
be employed. Finally, it can be seen that the two types of errors depend on the
overlap on either side of the test threshold between the two distributions. Hence,
if one were to reduce the standard deviations of the (one or) two distributions,
this would lead to a reduction in the corresponding overlaps, thereby reducing
the respective errors. This would in turn result in increased power.
Hence we can see that a parametric test comparing the two distributions relies on their overlap. Reporting the amount of overlap between the two distributions would thus be a good indicator of how powerful a test would be for the two populations. Quantifying this overlap would then be an indicator of the test's
effect size. Naturally, a large effect size indicates a small overlap whereas a small
effect size indicates a large overlap. In the literature, several measures have been
proposed to quantify the effect size of a test depending on the hypothesis test
settings. With regard to the more common case of comparing two means over
the difference of classifier performance, Cohen’s d statistic has been suggested
as a suitable measure of the effect size. It is discussed in Chapter 6.
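Anticipating that discussion, a minimal R sketch of the pooled-standard-deviation version of Cohen's d, applied here to the risk vectors of Listing 2.1, might look as follows (Chapter 6 discusses which variant of the statistic is appropriate):

> cohens.d <- function(a, b) {
+   pooled <- sqrt(((length(a) - 1) * var(a) + (length(b) - 1) * var(b)) /
+                  (length(a) + length(b) - 2))
+   (mean(a) - mean(b)) / pooled   # standardized mean difference
+ }
> cohens.d(c45, nb)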
2.3 Summary
This chapter was aimed at introducing the relevant concepts of machine learning
and statistics and placing them in the context of classifier evaluation. Even though
the book assumes familiarity on the part of the reader with the machine learning
basics, especially classification algorithms, we believe that the discussion in
this chapter will enable the reader to relate various notions not only to the
context of evaluation, but also with regard to the associated statistics perspective.
The statistical concepts surveyed in the second section of this chapter recur
frequently in the context of various aspects of evaluation. Alongside the different concepts introduced in this chapter and those that will be reviewed further, we pointed out two important, freely available software packages for use in
both machine learning research and its evaluation. The first one was the WEKA
machine learning toolkit, which implements a large number of machine learning
algorithms, as well as preprocessing and postprocessing techniques. The second
one was the R Project for Statistical Computing, which is a free implementation
of statistical routines [closely related to the (commercial) S Package]. This
package implements, among other things, a large number of statistical tests,
whose use is demonstrated more thoroughly in Chapter 6.
Chapters 3–7 focus on specific aspects of the evaluation procedure in the
context of comparing the performances of two or more learning algorithms. The
most fundamental aspect of evaluating the performances is deciding on how
to assess or measure these in the first place. Chapters 3 and 4 discuss various
alternatives for such performance measures.
The main components of this evaluation procedure are

• performance measures,
• error estimation,
• statistical significance testing, and
• test benchmark selection.
This chapter and the next review various prominent performance measures
and touch on the different issues one encounters when selecting these for eval-
uation studies. Different dichotomies of performance measures can be formed.
We decided to use a simple and intuitive, though a bit approximate, one based on
the so-called confusion matrix. This chapter focuses on the performance mea-
sures whose elements are based on the information contained in the confusion
matrix alone. In the next chapter, we extend our focus to take into account mea-
sures incorporating information about (prior) class distributions and classifier
uncertainties.
We also briefly discuss these aspects. Finally, we also see how the representation of
the various performance measures affects the type and amount of information
they convey as well as their interpretability and understandability. For instance,
we will see how graphical measures such as those used for scoring algorithms as
discussed in the next chapter result in the visualization of the algorithms’ perfor-
mance over different settings. The scalar measures, on the other hand, have the
advantage of being concise and allow for clear comparisons of different learn-
ing methods. However, because of their conciseness, they lack informativeness
because they summarize a great deal of information into a scalar metric. The
disadvantage of the graphical measures appears in the form of a lack of ease in
implementation and a possibly increased time complexity. In addition, results
expressed in this form may be more difficult to interpret than those reported in
a single measure.
• $\sum_{j=1}^{l} c_{ij}(f) = c_{i.}(f)$ denotes the total number of examples of class i in the test set.
• $\sum_{i=1}^{l} c_{ij}(f) = c_{.j}(f)$ denotes the total number of examples assigned to class j by classifier f.
• Each diagonal entry $c_{ii}(f)$ denotes the number of correctly classified examples of class i. Hence $\sum_{i=1}^{l} c_{ii}(f)$ denotes the total number of examples classified correctly by classifier f.
• All the off-diagonal entries denote misclassifications. Hence $\sum_{i,j:\,i \neq j} c_{ij}(f)$ denotes the total number of examples assigned to wrong classes by classifier f.
As can be seen in the preceding case, the entries of C deal with a deterministic
classification scenario for the symmetric loss case. That is, f deterministically
assigns a label to each instance with a unit probability instead of making a
probabilistic statement on its membership for different classes. Moreover, the
cost associated with classifying an instance x to class j, j ∈ {1, . . . , l}, is the same as classifying it to class k such that k ∈ {1, . . . , l}, k ≠ j. We will see a bit
later how we can incorporate these considerations into the resulting performance
measures.
Let us illustrate this with an example. Consider applying a naive Bayes (nb)
classifier to the breast cancer dataset (we discuss this and some other experiments
in Section 3.3). The domain refers to the application of the classifier to predict,
from a set of patients, whether a recurrence would occur. Hence the two class
labels refer to “positive” (recurrence occurred) and “negative” (recurrence did
not occur). On applying the nb classifier to the test set, we obtain the confusion
matrix of Table 3.3.
Let us now interpret the meaning of the values contained in the confusion
matrix. Relating the matrix of Table 3.3 to that of Table 3.2, we see that TP = 37,
TN = 168, FP = 33, and FN = 48. The confusion matrix shows that out of the
37 + 48 + 33 + 168 = 286 patients in our test set who previously had breast
cancer, TP + FN = 37 + 48 = 85 suffered a new episode of the disease whereas
FP + TN = 33 + 168 = 201 had remained disease free at the time the data were
collected. Our trained nb classifier (we do not worry for now about the issue
of how the algorithm was trained and tested so as to obtain the performance
estimates; we will come back to this in Chapter 5) predicts results that differ
from the truth in terms of both numbers of predictions and predicted class
distributions. It predicts that only TP + FP = 37 + 33 = 70 patients suffered a
new episode of the disease and that FN + TN = 48 + 168 = 216 did not suffer
from any new episode. It is thus clear that nb is a generally optimistic classifier
(for this domain, in which “optimistic” refers to fewer numbers of recurrences)
in the sense that it tends to predict a negative result with a higher frequency
relative to the positive prediction. The confusion matrix further breaks these
results up into their correctly predicted and incorrectly predicted components
for the two prediction classes. That is, it reports how many times the classifier
predicts a recurrence wrongly and how many times it predicts a nonrecurrence
wrongly. As can be seen, of the 70 recurrence cases predicted overall by nb,
37 were actual recurrence cases, whereas 33 were not. Similarly, out of the 216
cases for which nb predicted no recurrence, 168 cases were actual nonrecurrence
cases, whereas 48 were recurrence cases.1
1 Note, however, that the data were collected at a particular time. At that time, the people wrongly
diagnosed by nb, did not have the disease again. However, this does not imply that they did not
develop it at any time after the data were collected, in which case nb would not be that wrong.
We refrain from delving into this issue further for now. But this highlights an inherent limitation of our evaluation strategy and warns the reader that there may be more variables in play
when a learning algorithm is applied in practice. Such application-specific considerations should
always be kept in mind.
[Figure: An ontology of performance measures. All measures divide into those with a multiclass focus and those with a single-class focus; the subcategories include distance/error measures, information-theoretic measures, graphical measures, and summary statistics.]
2 The confusion matrix alone can also be used for classifiers other than deterministic ones (e.g., by
thresholding the decision function) or can, at least in theory, incorporate partial-loss information.
However, it is conventionally used for deterministic classifiers, which is why we focus on this
use here. Similarly, measures over deterministic classifiers can also take into account additional
information (e.g. skew). We indicate this possibility with a dashed line.
The results are reported with the following performance measures: accuracy
(Acc), the root-mean-square error (RMSE), the true-positive and false-positive
rates (TPR and FPR), precision (Prec), recall (Rec), the F Measure (F ), the area
under the ROC curve (AUC), and Kononenko and Bratko’s information score
(K & B). All the results were obtained with the WEKA machine learning toolkit
whose advanced options allow for a listing of these metrics. The reported results
are averaged over a 10-fold cross-validation run of the learning algorithm on
the dataset. However, for now, we do not focus on what each of these measures
means and how they were calculated. Our aim here is to emphasize that different
performance measures, as a result of assessing different aspects of algorithms’
performances, yield different comparative results. We do not delve into the
appropriateness of the 10-fold cross-validation method either. We discuss this
and other error-estimation methods in detail in Chapter 5.
Let us look into the results. If we were to rank the different algorithms based
on their performance on the dataset (rank 1 denoting the best classifier) we
would end up with different rankings depending on the performance measures
that we use. For example, consider the results obtained by accuracy and the
F measure on the breast cancer domain (Table 3.4). Accuracy ranks c45 as
the best classifier whereas the AUC ranks it as the worst, along with svm.
Similarly, the F measure ranks nb as the best classifier, whereas accuracy ranks
it somewhere in the middle. Across categories, AUC is in agreement with RMSE
when it comes to c45’s rank, but the two metrics disagree as to where to rank
boosting. When ranked according to the AUC, boosting comes first (tied with
nb) whereas it is ranked lower for Acc. Similar disagreements can also be
seen between K & B and AUC as, for example, with respect to their ranking
of svm.
Of course, this is a serious problem because a user is left with the questions of
which evaluation metric to use and what their results mean. However, it should
be understood that such disagreements do not suggest any inherent flaw in the
performance measures themselves. This, rather, highlights two main aspects of
such an evaluation undertaking: (i) different performance measures focus on
different aspects of the classifier’s performance on the data and assess these
specific aspects; (ii) learning algorithms vary in their performance on more than
one count. Taken together, these two points suggest something very important:
where I (a) is the indicator function that outputs 1 if the predicate a is true and
zero otherwise, f (xi ) is the label assigned to example xi by classifier f , yi is
the true label of example xi , and |T | is the size of the test set.
In terms of the entries of the confusion matrix, the empirical error rate of Equation (3.1) can be computed as3
$$R_T(f) = \frac{\sum_{i,j:\,i \neq j} c_{ij}(f)}{\sum_{i,j=1}^{l} c_{ij}(f)} = \frac{\sum_{i,j=1}^{l} c_{ij}(f) - \sum_{i=1}^{l} c_{ii}(f)}{\sum_{i,j=1}^{l} c_{ij}(f)}.$$
3 Note that while we use single summation signs for notational simplicity (e.g., $\sum_{i,j=1}^{l}$), this indicates iteration over both indices (i.e., $\sum_{i=1}^{l}\sum_{j=1}^{l}$), for this and subsequent uses.
Error rate, as also discussed in the last chapter, measures the fraction of the
instances from the test set that are misclassified by the learning algorithm. This
measurement further includes the instances from all classes.
A complement to the error-rate measurement naturally would measure the
fraction of correctly classified instances in the test set. This measure is referred
to as accuracy. Reversing the criterion in the indicator function of Equation (3.1)
leads to the accuracy measurement AccT (f ) of classifier f on test set T , i.e.,
$$Acc_T(f) = \frac{1}{|T|} \sum_{i=1}^{|T|} I\big(f(x_i) = y_i\big).$$
This can be computed in terms of the entries of the confusion matrix as
$$Acc_T(f) = \frac{\sum_{i=1}^{l} c_{ii}(f)}{\sum_{i,j=1}^{l} c_{ij}(f)}.$$
For the binary classification case, our representations of the confusion matrix
of Table 3.1 and subsequently Table 3.2 yield
$$Acc_T(f) = \frac{c_{11}(f) + c_{22}(f)}{c_{11}(f) + c_{12}(f) + c_{21}(f) + c_{22}(f)} = \frac{TP + TN}{P + N},$$
$$R_T(f) = 1 - Acc_T(f) = \frac{c_{12}(f) + c_{21}(f)}{c_{11}(f) + c_{12}(f) + c_{21}(f) + c_{22}(f)} = \frac{FN + FP}{P + N}.$$
Example 3.1. In the example of nb applied to the breast cancer domain, the
accuracy and error rates are calculated as
$$Acc_T(f) = \frac{37 + 168}{37 + 48 + 33 + 168} = 0.7168,$$
$$R_T(f) = 1 - 0.7168 = 0.2832.$$
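These values can be checked quickly in R from the confusion-matrix entries:

> TP <- 37; FN <- 48; FP <- 33; TN <- 168   # entries of Table 3.3
> (TP + TN) / (TP + FN + FP + TN)           # accuracy
[1] 0.7167832
> (FN + FP) / (TP + FN + FP + TN)           # error rate
[1] 0.2832168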
These results tell us that nb makes a correct prediction in 71.68% of the cases
or, equivalently, makes prediction errors in 28.32% of the cases. How should
such a result be interpreted? Well, if a physician is happy being right for approximately 7 out of 10 of his or her patients, and wrong in 3 out of 10 cases, then
he or she could use nb as a guide. What the physician does not know, however,
is whether he or she is overly pessimistic, overly optimistic, or a mixture of the
two, in terms of telling people who should have no fear of recurrence that they
will incur a new episode of cancer or in telling people who should worry about
it not to do so. This is the typical context in which accuracy and error rates can
result in misleading evaluation, as we will see in the next subsection.
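These computations are easy to verify in R. The short sketch below recomputes the accuracy and error rate of Example 3.1 from the nb confusion matrix; only the four cell counts are taken from the example, and the orientation of the matrix is immaterial here because only its diagonal and grand total are used.

# Confusion matrix of nb on the breast cancer domain (Example 3.1).
cm <- matrix(c(37, 48,
               33, 168), nrow = 2, byrow = TRUE)

acc <- sum(diag(cm)) / sum(cm)   # accuracy: trace over grand total
err <- 1 - acc                   # empirical error rate
round(c(accuracy = acc, error = err), 4)   # 0.7168 and 0.2832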
Example 3.2. Consider two classifiers represented by the two confusion matri-
ces of Tables 3.6 and 3.7. These two classifiers behave quite differently from
one another. The one symbolized by the confusion matrix of Table 3.6 does
not classify positive examples very well, getting only 200 out of 500 right. On
the other hand, it does not do a terrible job on the negative data, getting 400
out of 500 well classified. The classifier represented by the confusion matrix of
Table 3.7 does the exact opposite, classifying the positive class better than the
negative class, with 400 out of 500 versus 200 out of 500 correct classifications.
It is clear that these classifiers exhibit quite different strengths and weaknesses
and should not be used blindly on a dataset such as the medical domain we
previously used. Yet both classifiers exhibit the same accuracy of 70%.4
Now let us consider the issues of class imbalance and differing costs. Let us
assume an extreme case in which the positive class contains 50 examples and the
negative class contains 950 examples. In this case, a trivial classifier incapable of
discriminating between the positive and the negative class, but blindly choosing
to return a “negative” class label on all instances, would obtain a 95% accuracy.
This indeed is not representative of the classifier’s performance at all. Because
many classifiers, even nontrivial ones, take the prior class distributions into
consideration for the learning process, the preference for the more-dominant
class would prevail, resulting in their behaving the way our trivial classifier
does. Accuracy results may not convey meaningful information in such cases.
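The point is simple to demonstrate. The R lines below build the 50/950 test set just described and score the trivial all-negative classifier; the labels "p" and "n" are illustrative placeholders.

# A trivial classifier that always predicts the majority ("negative") class
# on a test set with 50 positive and 950 negative examples.
labels <- c(rep("p", 50), rep("n", 950))
pred   <- rep("n", length(labels))
mean(pred == labels)   # 0.95 accuracy, with no discrimination ability at all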
This also brings us to the question of how important it is to correctly classify
the examples from the two classes in relation to each other. That is, how much
cost do we incur by making the classifier more sensitive to the less-dominant
class while incurring misclassification of the more-dominant class? In fact, such
costs can move either way, depending on the importance of (mis)classifying
instances of either class. This is the issue of misclassification costs. However,
the issues of misclassification costs can also be closely related to, although
definitely not limited to, those of class distribution.
Consider the case in which the classes are fully balanced, as in the two previ-
ous confusion matrices. Assume, however, that we are dealing with a problem for
which the misclassification costs differ greatly. In the critical example of breast
cancer recurrence (the positive class) and nonrecurrence (the negative class),
it is clear, at least from the patient’s point of view, that false-positive errors
have a lower cost because these errors consist of diagnosing recurrence when
4 Of course, one can think of reversing the class labels in such cases. However, this may not exactly
invert the problem mapping. Also, this trick can become ineffective in the multiclass case.
the patient is, in fact, not likely to get the disease again; whereas false-negative
errors are very costly, because they correspond to the case in which recurrence
is not recognized by the system (and thus is not consequently treated the way
it should be). In such circumstances, it is clear that the system represented by
the preceding first confidence matrix is much less appropriate for this problem
because it issues 300 nonrecurrence diagnostics in cases in which the patient
will suffer a new bout of cancer, whereas the system represented by the second
confidence matrix makes fewer (100) mistakes (needless to say that, in practical
scenarios, even these many mistakes will be unacceptable).
Clearly, accuracy (or the error rate) does not convey the full picture and hence does not necessarily suffice, in its classical form, in domains with differing misclassification costs or wide class imbalance. Nonetheless, these measures remain simple and intuitive ways to assess classifier performance, especially when the user is aware of the two issues just discussed, and they are quite informative when generic classifiers are compared on overall performance. We will come back to the issue of dealing with misclassification costs (also referred to as asymmetric loss) a bit later in the chapter.
When assessing the performance of the classifier against the “true” labeling
of the dataset, it is implicitly assumed that the actual labels on the instances
are indeed unbiased and correct. That is, these labels do not occur by chance.
Although such a consideration is relatively less relevant when we assume a perfect process generating the true labels (e.g., when the true labeling can be definitively and unquestionably established), it is quite important when this is not the case, as in most practical scenarios. This happens, for instance, when the result from the learning algorithm is compared against some silver standard (e.g., labels generated from an approximate process or a human rater).
An arguable fix in such cases is the correction for chance, first proposed by
Cohen (1960) for the two-class scenario with two different processes generating
the labels. Let us look into some such chance-correction statistics.
Consider, for instance, a dataset with 75% positive examples and the rest negative. Clearly
this is a case of imbalanced data, and the high number of positive instances can
be an indicator of the bias of the label-generation process in labeling instances
as positive with a higher frequency. Hence, if we have a classifier that assigns a
positive label with half the frequency (an unbiased coin toss), then, even without
learning on the training data, we would expect its positive label assignment to
agree with the true labels in 0.5 × 0.75 = 0.375 of the cases. Conventional measures such as accuracy do not account for correct classifications that can result from mere chance agreement between the classifier and the label-generation process. This concern is all the more relevant
in applications such as medical image segmentation in which the true class labels
(i.e., the ground truth) are not known at all. Experts assign labels to various seg-
ments of images (typically pixels) against which the learning algorithm output
is evaluated. Not correcting for chance, then, ignores the bias inherent in both the manual labeling and the learned classifier, thereby mistaking accurate segmentation achieved merely by chance for an indication of the efficacy of the learning algorithm.
It has hence been argued that this chance concordance between the labels
assigned to the instances should be taken into account when the accuracy of
the classifier is assessed against the true label-generation process. Having their
roots in statistics, such measures have been used, although not widely, in machine learning and related applications. These measures are popularly known
as agreement statistics.5 We very briefly discuss some of the main variations with
regard to the binary classification scenario. Most of these agreement statistics
were originally proposed in an analogous case of measuring agreement on
class assignment (over two classes) to the samples in a population by two
raters (akin to label assignment to instances of the test set in our case by the
learning algorithm and the actual underlying distribution). These measures were
subsequently generalized to the multiclass, multirater scenario under different
settings. We provide pointers to these generalizations in Section 3.8.
As a result of not accounting for such chance agreements over the labels, it
has been argued that accuracy tends to provide an overly optimistic estimate of
correct classifications when the labels assigned by the classifier are compared
with the true labels of the examples. The agreement measures are offered as a
possible, although imperfect, fix. We will come to these imperfections shortly.
We discuss three main measures of agreement between the two label-generation
processes that aim to obtain a chance-corrected agreement: the S coefficient
(Bennett et al., 1954), Scott’s π (pi) statistic (Scott, 1955), and Cohen’s κ
(kappa) statistic. The three measures essentially differ in the manner in which
they account for chance agreements.
5 The statistics literature refers to these as intraclass correlation statistics or interrater agreement measures.
The S coefficient assumes that, by chance, each of the two labels is equally likely to be assigned (a chance agreement of 1/2), which yields

$$S = 2P_o - 1,$$

where $P_o$ denotes the observed agreement between the two label-generation processes.
The other two statistics, Scott's π and Cohen's κ, share a common formulation: both take the ratio of the difference between the observed and chance agreements to the maximum possible agreement achievable over and beyond chance. However, the two measures treat the chance agreement differently. Whereas Scott's π estimates the chance that a label (positive or negative) is assigned to a random instance irrespective of the label-assigning process, Cohen's κ computes this chance agreement with the two processes held fixed. Accordingly, Scott's π is defined as
$$\pi = \frac{P_o - P_e^S}{1 - P_e^S},$$
where the chance agreement over the labels, denoted as $P_e^S$, is computed from the pooled marginals of the two processes:

$$P_e^S = \left(\frac{Y_P + f_P}{2m}\right)^2 + \left(\frac{Y_N + f_N}{2m}\right)^2,$$

with $Y_P$ ($Y_N$) the number of instances labeled positive (negative) by the true label-generation process Y, $f_P$ ($f_N$) the corresponding counts for classifier f, and m the total number of instances. Cohen's κ takes the analogous form

$$\kappa = \frac{P_o - P_e^C}{1 - P_e^C},$$

where the chance agreement over the labels in the case of Cohen's κ, denoted as $P_e^C$, is defined as

$$P_e^C = P_P^Y P_P^f + P_N^Y P_N^f = \frac{Y_P}{m} \frac{f_P}{m} + \frac{Y_N}{m} \frac{f_N}{m}.$$
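As an illustration, the R sketch below computes all three statistics on a hypothetical binary agreement table between the true label-generation process Y (rows) and a classifier f (columns); the cell counts are invented for illustration only.

# Hypothetical agreement table: rows = process Y, columns = classifier f.
agree <- matrix(c(60, 15,
                  10, 15), nrow = 2, byrow = TRUE)

m  <- sum(agree)
po <- sum(diag(agree)) / m                    # observed agreement Po
YP <- sum(agree[1, ])                         # positives according to Y
fP <- sum(agree[, 1])                         # positives according to f

S      <- 2 * po - 1                          # S coefficient
pe_pi  <- ((YP + fP) / (2 * m))^2 +
          (((m - YP) + (m - fP)) / (2 * m))^2 # pooled chance agreement
pi_hat <- (po - pe_pi) / (1 - pe_pi)          # Scott's pi
pe_k   <- (YP / m) * (fP / m) +
          ((m - YP) / m) * ((m - fP) / m)     # per-process chance agreement
kappa  <- (po - pe_k) / (1 - pe_k)            # Cohen's kappa
round(c(S = S, pi = pi_hat, kappa = kappa), 3)   # 0.500, 0.373, 0.375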
As can be seen, unlike Cohen’s κ, Scott’s π is concerned with the overall
propensity of a random instance being assigned a positive or negative label and
hence marginalizes over the processes. Therefore, when assessing a classifier's chance-corrected accuracy against a "true" label-generating process (whether unknown or given by, say, an expert), Cohen's κ is the more relevant statistic in our setting, which is why it is the only agreement statistic included in WEKA.
We illustrate it here by the following example.
Example 3.3. Consider the confusion matrix of Table 3.9 representing the
output of a hypothetical classifier Hk on a three-class classification problem.
The rows represent the actual classes, and the columns represent the outputs of
the classifier. Now, just looking at the diagonal entries will give us an estimate of
the accuracy of the classifier Acc(Hk ). With regard to the preceding agreement
statistics framework, this is basically the observed agreement Po . Hence, Po =
(60 + 90 + 80)/400 = 0.575.
Let us see how we can generalize Cohen’s κ to this case. For this we need
to obtain a measure of chance agreement. Recall that we previously computed
the chance agreement in the case of Cohen’s κ as the sum of chance agreement
on the individual classes. Let us extend that analysis to the current case of
three classes. The chance that both the classifier and the actual label assignment
agree on the label of any given class is the product of their proportions of
the examples assigned to this class. In the case of Table 3.9, we see that the classifier Hk assigns a proportion 120/400 of the examples to class a (sum of the
first column). Similarly, the proportion of true labels of class a in the dataset
is also 120/400 (sum of first row). Hence, given a random example, both the
classifier and the true label of the example will come out to be a, with probability
(120/400) × (120/400). We can calculate the chance agreement probabilities for the remaining classes in the same manner; summing over all three classes yields the overall chance agreement $P_e^C = 0.33625$.
Hence we get Cohen's κ agreement statistic for classifier Hk with the previously calculated values as

$$\kappa(H_k) = \frac{P_o - P_e^C}{1 - P_e^C} = \frac{0.575 - 0.33625}{1 - 0.33625} = 0.3597.$$
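The same computation is easily expressed in R. Since this excerpt does not reproduce the off-diagonal entries of Table 3.9, the matrix below is a hypothetical one chosen to be consistent with Example 3.3: it has diagonal (60, 90, 80), a total of 400 examples, class-a marginals of 120/400, and chance agreement $P_e^C = 0.33625$.

# Cohen's kappa from a multiclass confusion matrix
# (rows: true classes, columns: predicted classes). Off-diagonal entries
# are hypothetical but consistent with the quantities used in Example 3.3.
cm <- matrix(c(60, 30, 30,
               40, 90, 20,
               20, 30, 80), nrow = 3, byrow = TRUE)

kappa_stat <- function(cm) {
  n  <- sum(cm)
  po <- sum(diag(cm)) / n                       # observed agreement
  pe <- sum(rowSums(cm) * colSums(cm)) / n^2    # chance agreement (Cohen)
  (po - pe) / (1 - pe)
}
kappa_stat(cm)   # 0.3597, as in Example 3.3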
Used in the proper context (generally the same that applies to reliable accuracy estimation), these agreement statistics thus provide an acceptable and reasonable performance assessment.
Let us now shift our attention to measures that are aimed at assessing the
effectiveness of the classifier on a single class of interest.
For a class i of interest, the true-positive rate measures the fraction of class-i instances that the classifier labels correctly:

$$TPR_i(f) = \frac{c_{ii}(f)}{\sum_{j=1}^{l} c_{ij}(f)} = \frac{c_{ii}(f)}{c_{i\cdot}(f)}.$$
In the binary case, the false-positive rate measures the fraction of negative instances that are erroneously labeled as positive:

$$FPR(f) = \frac{c_{12}(f)}{c_{11}(f) + c_{12}(f)} = \frac{FP}{FP + TN}.$$
True- and false-positive rates generally form a complement pair of reported
performance measures when the performance is measured over the positive class
in the binary classification scenario. Moreover, we can obtain the same measures
on the “negative” class (the class other than the “positive” class) in the form
of true-negative rate TNR(f ) and false-negative rate FNR(f ), respectively. Our
representations of Tables 3.1 and 3.2 yield
$$TNR(f) = \frac{c_{11}(f)}{c_{11}(f) + c_{12}(f)} = \frac{TN}{TN + FP},$$
$$FNR(f) = \frac{c_{21}(f)}{c_{21}(f) + c_{22}(f)} = \frac{FN}{FN + TP}.$$
In signal detection theory, the true-positive rate is also known as the hit
rate, whereas the false-positive rate is referred to as the false-alarm rate or the
fallout. Next we discuss another complement metric that generally accompanies
the true-positive rate.
Sensitivity asks: What proportion of the truly positive instances (e.g., actual disease cases) does the test successfully detect? The complement metric to this, in the case of the two-
class scenario, would focus on the proportion of negative instances (e.g., control
cases or healthy subjects) that are detected. This metric is called the specificity
of the learning algorithm. Hence specificity is the true-negative rate in the case
of the binary classification scenario. That is, sensitivity is generally considered
in terms of the positive class whereas the same quantity, when measured over
the negative class, is referred to as specificity.
Again, from the binary classification confusion matrix of Table 3.2, we can
define the two metrics as
$$\text{Sensitivity} = \frac{TP}{TP + FN},$$
$$\text{Specificity} = \frac{TN}{FP + TN}.$$
In essence, the tests, together, identify the proportion of the two classes
correctly classified. However, unlike accuracy, they do this separately in the
context of each individual class of instances. As a result of the class-wise
treatment, the measures reduce the dependency on uneven class distribution in
the test data. However, the cost of doing so appears in the form of a metric for
each single class. In the case of a multiclass classification problem, this would lead to as many metrics as there are classes, making the results difficult to interpret.
There are also other aspects that one might be interested in but that are missed
by this metric pair. One such aspect is the study of the proportion of examples
assigned to a certain class by the classifier that actually belong to this class. We
will study this and the associated complementary metric soon. But before that,
let us describe a metric that aims to combine the information of sensitivity and
specificity to yield a metric-pair that studies the class-wise performance in a
relative sense.
Likelihood Ratio
An important measure related to the sensitivity and specificity of the classifier,
known as the likelihood ratio, aims to combine these two notions to assess the
extent to which the classifier is effective in predicting the two classes. Even
though the measure combines sensitivity and specificity, there are two versions,
each making the assessment for an individual class. For the positive class,
$$LR_{+} = \frac{\text{Sensitivity}}{1 - \text{Specificity}},$$
whereas for the negative class,
$$LR_{-} = \frac{1 - \text{Sensitivity}}{\text{Specificity}}.$$
In our breast cancer domain, LR+ summarizes how many times more likely patients whose cancer did recur are to receive a positive prediction than patients without recurrence; LR− summarizes how many times less likely patients whose cancer did recur are to receive a negative prediction than patients without recurrence. In terms of probabilities, LR+ is the ratio of the probability of a positive
result in people who do encounter a recurrence to the probability of a positive
result in people who do not. Similarly, LR− is the ratio of the probability of a
negative result in people who do encounter a recurrence to the probability of a
negative result in people who do not.
A higher positive likelihood and a lower negative likelihood mean better per-
formance on positive and negative classes, respectively, so we want to maximize
LR+ and minimize LR− . A likelihood ratio higher than 1 indicates that the test
result is associated with the presence of the recurrence (in our example), whereas
a likelihood ratio lower than 1 indicates that the test result is associated with the
absence of this recurrence. The further likelihood ratios are from 1, the stronger
the evidence for the presence or absence of the recurrence. Likelihood ratios
reaching values higher than 10 and lower than 0.1 provide acceptably strong
evidence (Deeks and Altman, 2004).
When two algorithms A and B are compared, the relationships between the positive and the negative likelihood ratios of both classifiers can be interpreted in terms of comparative performance as follows, for $LR_{+} \geq 1$:6

- $LR_{+}^{A} > LR_{+}^{B}$ and $LR_{-}^{A} < LR_{-}^{B}$ imply that A is superior overall.
- $LR_{+}^{A} < LR_{+}^{B}$ and $LR_{-}^{A} < LR_{-}^{B}$ imply that A is superior for confirmation of negative examples.
- $LR_{+}^{A} > LR_{+}^{B}$ and $LR_{-}^{A} > LR_{-}^{B}$ imply that A is superior for confirmation of positive examples.
Example 3.4. Applying this evaluation method to the confusion matrix that was
obtained from applying nb to the breast cancer domain, we obtain the following
values for the likelihood ratios of a positive and a negative test, respectively:
$$LR_{+} = \frac{0.53}{1 - 0.78} = 2.41,$$
$$LR_{-} = \frac{1 - 0.53}{0.78} = 0.6.$$
This tells us that patients whose cancer recurred are 2.41 times more likely
to be predicted as positive by nb than patients whose cancer did not recur; and
that patients whose cancer did recur are 0.6 times less likely to be predicted as
negative by nb than patients whose cancer did not recur. This is not a bad result, but the classifier would be more impressive if its positive likelihood ratio were higher and its negative likelihood ratio smaller.
Following our previously discussed hypothetical example, we find that the classifier represented by Table 3.6, denoted as classifier HA, yields $LR_{+}^{H_A} = 0.4/(1 - 0.8) = 2$ and $LR_{-}^{H_A} = (1 - 0.4)/0.8 = 0.75$, whereas the classifier represented by Table 3.7, denoted as classifier HB, yields $LR_{+}^{H_B} = 0.8/(1 - 0.4) = 1.33$ and $LR_{-}^{H_B} = (1 - 0.8)/0.4 = 0.5$. We thus have $LR_{+}^{H_A} > LR_{+}^{H_B}$ and $LR_{-}^{H_A} > LR_{-}^{H_B}$, meaning that classifier HA is superior to classifier HB for confirmation of positive examples, but classifier HB is better for confirmation of negative examples.
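These quantities follow directly from the confusion-matrix counts. The small R sketch below, with a helper function lr() of our own devising, uses the cell counts of Tables 3.6 and 3.7 as given in the text.

# Likelihood ratios from the counts of a binary confusion matrix.
lr <- function(tp, fn, fp, tn) {
  sens <- tp / (tp + fn)                     # sensitivity (TPR)
  spec <- tn / (tn + fp)                     # specificity (TNR)
  c(LRplus = sens / (1 - spec), LRminus = (1 - sens) / spec)
}
lr(tp = 200, fn = 300, fp = 100, tn = 400)   # HA (Table 3.6): 2.00, 0.75
lr(tp = 400, fn = 100, fp = 300, tn = 200)   # HB (Table 3.7): 1.33, 0.50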
Note that, when interpreted in a probabilistic sense, the likelihood ratios used
together give the likelihood in the Bayesian sense, which, along with a prior
over the data, can then give a posterior on the instances’ class memberships.
The preceding discrete version has found wide use in clinical diagnostic test
assessment. In the Bayesian or probabilistic sense, however, the likelihood ratios
are used in the context of nested hypotheses, that is, hypotheses that belong
6 If an algorithm does not satisfy this condition, then “positive” and “negative” likelihood values
should be swapped.
to the same class of functions but vary in their respective complexities. This is basically model selection: choosing a more (or less) complex classifier depending on the respective likelihoods given the data at hand.
The positive predictive value, or precision, of class i measures the proportion of the instances assigned to class i that actually belong to it:

$$PPV_i(f) = \mathrm{Prec}_i(f) = \frac{c_{ii}(f)}{\sum_{j=1}^{l} c_{ji}(f)} = \frac{c_{ii}(f)}{c_{\cdot i}(f)}.$$
In the binary case, with the positive class as the class of interest, this yields the precision Prec(f) and its negative-class counterpart, the negative predictive value NPV(f):

$$\mathrm{Prec}(f) = PPV(f) = \frac{TP}{TP + FP},$$
$$NPV(f) = \frac{TN}{TN + FN}.$$
Geometric Mean
As discussed previously, the informativeness of accuracy acc(f ) generally
reduces with increasing class imbalances. It is hence desirable to look at the
classifier’s performance on instances of individual classes. One way of address-
ing this concern is the sensitivity metric. Hence, in the two-class scenario, we
use sensitivity and specificity metrics’ combination to report the performance
of the classifier on the two classes. The geometric mean proposes another view
of the problem. The original formulation was proposed by Kubat et al. (1998)
and is defined as
$$\mathrm{Gmean}_1(f) = \sqrt{TPR(f) \times TNR(f)}.$$
As can be seen, this measure takes into account the relative balance of the clas-
sifier’s performance on both the positive and the negative classes. Gmean1 (f )
becomes 1 only when TPR(f ) = TNR(f ) = 1. For all other combinations of
the classifier’s TPR(f ) and TNR(f ), the measures weigh the resulting statistic
by the relative balance between the two. In this sense, this measure is closer to
the multiclass focus category of the measures discussed in the previous section.
Another version of the measures, focusing on a single class of interest, can
similarly take the precision of the classifier Prec(f ) into account. In the two-class
scenario, with the class of interest being the “positive” class, this yields
$$\mathrm{Gmean}_2(f) = \sqrt{TPR(f) \times \mathrm{Prec}(f)}.$$
Hence the Gmean2 takes into account the proportion of the actual positive
examples labeled as positive by the classifier as well as the proportion of the
examples labeled by the classifier as positive that are indeed positive.
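For concreteness, the R lines below evaluate both geometric means for classifier HA of Table 3.6; the resulting values (about 0.57 and 0.52) are computed here rather than quoted from the text.

# Geometric means for classifier HA of Table 3.6
# (TP = 200, FN = 300, FP = 100, TN = 400).
tpr  <- 200 / (200 + 300)    # 0.4
tnr  <- 400 / (400 + 100)    # 0.8
prec <- 200 / (200 + 100)    # 0.67

gmean1 <- sqrt(tpr * tnr)    # balances the two classes
gmean2 <- sqrt(tpr * prec)   # focuses on the positive class
round(c(gmean1 = gmean1, gmean2 = gmean2), 3)   # 0.566, 0.516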
Table 3.10. The confusion matrix of hypothetical classifier HC.

HC      Pos        Neg
Yes     200        100
No      300          0
        P = 500    N = 100

For the classifiers of Tables 3.6 and 3.7, we obtain Prec(HA) = 200/(200 + 100) = 0.67 and Rec(HA) = 200/(200 + 300) = 0.4 in the first case and Prec(HB) = 400/(400 + 300) = 0.572 and Rec(HB) = 400/(400 + 100) = 0.8 in the second.
These results reflect the strength of the second classifier on the positive data
compared with that of the first classifier. However, the disadvantage in the case
of this metric pair is that precision and recall do not focus on the performance of
the classifier on any other class than the class of interest (the “positive” class). It
is possible, for example, for a classifier trained on our medical domain to have
respectable precision and recall values even if it does very poorly at recognizing
that a patient who did not suffer a recurrence of her cancer is indeed healthy. This
is disturbing because the same values of precision and recall can be obtained
no matter what proportion of patients labeled as healthy are actually healthy, as
in the example confusion matrix of Table 3.10. This is similar to the confusion
matrix in Table 3.6, except for the true-negative value, which was set to zero
(and the number of negative examples N , which was consequently decreased
to 100). In the confusion matrix of Table 3.10, the classifier HC obtains the same precision and recall values as HA of Table 3.6. Yet it is clear that HC presents a much more severe shortcoming than HA because it is incapable of classifying true-negative examples as negative (it can, however, wrongly classify positive examples as negative!). Such behavior, as mentioned before, is reflected by the accuracy metric, which assesses that classifier HC is accurate in only 33% of the cases whereas HA is accurate in 60% of the cases. Specificity catches this deficiency of precision and recall in a much more direct way, since it equals 0 for HC.
Hence, to summarize, the main advantage of single-class focus metrics lies in
their ability to treat performance of the classifier on an individual class basis and
thus account for shortcomings of multiclass measures, such as accuracy, with
regard to class imbalance problems, all the while studying the aspects of interest
within the individual class performance. However, this advantage comes at
certain costs. First, of course, is that no single metric is capable of encapsulating
all the aspects of interest, even with regard to the individual class (see for instance
the case of sensitivity and precision). Second, two (multiple) metrics need to be
reported to detail the classifier performance over two (multiple) classes, even
for a single aspect of interest. Increasing the number of metrics being reported
makes it increasingly difficult to interpret the results. Finally, we do not yet know how to combine such metric pairs into a single overall assessment; this is what the F measure, discussed next, attempts to do.
F Measure
The F measure attempts to address the inconvenience of reporting a pair of metrics by combining precision and recall into a single metric. More specifically, the F measure is a weighted harmonic mean
of precision and recall. For any α ∈ R, α > 0, a general formulation of the F
measure can be given as7
$$F_\alpha = \frac{(1 + \alpha)\,[\mathrm{Prec}(f) \times \mathrm{Rec}(f)]}{[\alpha \times \mathrm{Prec}(f)] + \mathrm{Rec}(f)}.$$
There are several variations of the F measure. For instance, the balanced F
measure weights the recall and precision of the classifier evenly, i.e., α = 1:
$$F_1 = \frac{2\,[\mathrm{Prec}(f) \times \mathrm{Rec}(f)]}{\mathrm{Prec}(f) + \mathrm{Rec}(f)}.$$
Similarly, F2 weighs recall twice as much as precision and F0.5 weighs preci-
sion twice as much as recall. The weights are generally decided based on the
acceptable trade-off of precision and recall.
In our nb-classified breast cancer domain, we obtain
$$F_1 = \frac{2(0.4353 \times 0.5286)}{0.4353 + 0.5286} = \frac{0.46}{0.96} = 0.48,$$
$$F_2 = \frac{3(0.4353 \times 0.5286)}{(2 \times 0.4353) + 0.5286} = \frac{0.69}{1.4} = 0.49,$$
$$F_{0.5} = \frac{1.5(0.4353 \times 0.5286)}{(0.5 \times 0.4353) + 0.5286} = \frac{0.345}{0.74625} = 0.46.$$
This suggests that the results obtained on the positive class are not very good
and that nb favors neither precision nor recall.
7 Note that this α is different from the one used in statistical significance testing. However, because it is the conventional symbol in the case of the F measure, we retain it here; the different contexts of the two usages should avoid any confusion.
For our hypothetical example from Tables 3.6 and 3.7, we have $F_1^{H_A} = (2 \times 0.67 \times 0.4)/(0.67 + 0.4) = 0.5$ and $F_1^{H_B} = (2 \times 0.572 \times 0.8)/(0.572 + 0.8) = 0.67$. This shows that, with precision and recall weighed equally, HB does a better job on the positive class than HA. On the other hand, if we use the F2 measure, we get $F_2^{H_A} = 0.46$ and $F_2^{H_B} = 0.7$, which emphasizes that recall is a greater factor in assessing the quality of HB than precision. Similarly, the use of the F0.5 measure results in $F_{0.5}^{H_A} = 0.55$ and $F_{0.5}^{H_B} = 0.632$, indicating that recall is indeed very important in the case of HB: when precision counts for twice as much as recall, the two classifiers obtain results that are relatively close to one another.
This goes to show that choosing the relative weight for combining precision and recall is very important in F-measure calculations. However, in most practical cases, appropriate weights are not known, which is a significant limitation with regard to the use of such combined measures. We revisit this issue when discussing the weighted versions of other performance measures a bit later. Another limitation of the F measure results from the limitations of its components: just like the precision–recall metric pair, the F measure leaves out the true-negative performance of the classifier.
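The general formula is a one-liner in R. The sketch below reproduces the nb values computed above from the precision and recall given in the text.

# F_alpha = (1 + alpha) * P * R / (alpha * P + R), for nb on the breast
# cancer domain (Prec = 0.4353, Rec = 0.5286).
f_measure <- function(prec, rec, alpha = 1) {
  (1 + alpha) * prec * rec / (alpha * prec + rec)
}
round(sapply(c(`F0.5` = 0.5, F1 = 1, F2 = 2),
             function(a) f_measure(0.4353, 0.5286, a)), 2)
# F0.5 = 0.46, F1 = 0.48, F2 = 0.49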
To reduce this sensitivity to the class distribution, a variation has been proposed that takes into account class ratios instead of fixing the measurements with regard to the class size of a particular class. The class ratio for a given class i refers to the number of instances of class i as opposed to those of the other classes in the dataset. Hence, in the two-class scenario, the class ratio of the positive class as opposed to the negative class, denoted as ratio+, can be obtained as
$$\mathrm{ratio}_{+} = \frac{TP + FN}{FP + TN}.$$
In the multiclass scenario, for the class of interest i, we get
$$\mathrm{ratio}_i = \frac{\sum_{j} c_{ij}}{\sum_{j: j \neq i} c_{ji} + \sum_{j: j \neq i} c_{jj}}.$$
The entries can then be weighted by their respective class ratios. For instance,
in the binary case, the TPR can be weighed by ratio+ whereas the TNR can be
weighed by ratio− , the class ratio of the negative class. Such an evaluation
that takes into account the differing class distributions is referred to as a skew-
sensitive assessment.
The second issue that confounds the interpretation of the entries of the confu-
sion matrix that we wish to discuss here is that of asymmetric misclassification
costs. There are two dimensions to this. Misclassifying instances of a class i
can have a different cost than misclassifying the instances of class j, j = i.
Moreover, the cost associated with the misclassification of class i instances to
class j, j = i can differ from the misclassification of class i instances to class
j , j = j . For instance, consider a learning algorithm applied to the tasks of dif-
ferentiating patients with acute lymphoblastic leukemia (ALL) or acute myloid
leukemia (AML) and healthy people based on their respective gene expression
analyses (see Golub et al., 1999, for an example of this). In such a scenario,
predicting one of the pathologies for a healthy person might be less costly than
missing some patients (with ALL or AML) by predicting them as healthy (of
course, contingent on the fact that further validation can identify the healthy
people as such). On the other hand, missing patients can be very expensive
because the disease can go unnoticed, resulting in devastating consequences.
Further, classifying patients with ALL as patients with AML and vice versa can
also have differing costs.
The asymmetric (mis)classification costs almost always exist in real-world
problems. However, in the absence of information to definitively establish such
costs, a comforting assumption, that of symmetric costs, is made. As a result, the
confusion matrix, by default, considers all errors to be equally important. The
question then becomes this: How can we effectively integrate cost considera-
tions when such knowledge exists? Incorporating such asymmetric costs can be
quite important for both effective learning on the data and sensible evaluation.
This brings us back to our discussion on the distinction of solutions applied
during the learning process and after the learning is done.

Another limitation of the confusion matrix, at least in its classical form, is that it loses the uncertainty information of the classifier. That is, it suggests that we should be equally confident in all the class labels predicted by the classifier or, alternatively, that the labels output by the classifier are all perfectly certain.
Nonetheless, when looked at closely, the information in the table gives us a
more detailed picture. In particular, instances that are misclassified with little
certainty (e.g., Instance 2) can quite likely correspond to instances often called
boundary examples. On the other hand, when misclassifications are made with
high certainty (e.g., Instance 6), then either the classifier or the examples need to
be studied more carefully because such a behavior can be due to a lack of proper
learning or to the presence of noise or outliers, among other reasons. There
can also be other cases in which uncertainty is introduced in the performance
estimates (e.g., stochastic learning algorithms) and for which it is not trivial to
measure the performance of the learning algorithm on the test data. Altogether,
the point we wish to make here is that information about a classifier’s certainty
or uncertainty can be very important. As can be clearly seen, the confusion
matrix, at least in its classical form, does not incorporate this information.
Consequently the lack of classifier uncertainty information is also reflected in
all the performance measures that rely solely on the confusion matrix. The next
chapter discusses the issue in more depth as it moves away from performance
measures that rely solely on the confusion matrix.
Many of these measures can be obtained directly from WEKA's classification screen. The following listing shows the WEKA summary output obtained on the labor dataset with c45 with the relevant extended-output option checked.
Listing 3.1: WEKA's extended output.

=== Summary ===

=== Detailed Accuracy By Class ===

=== Confusion Matrix ===

  a  b   <-- classified as
 14  6 |  a = bad
  9 28 |  b = good
We now link the WEKA terminology to our terminology and indicate the values
obtained by the classifier on this dataset in Table 3.12. All these correspondences
can be established by getting back to the formulas previously listed in Sec-
tions 3.4 and 3.5 and comparing them with the values output by WEKA.
3.7 Summary
In this chapter, we focused on the measures that take into account, in some form
or other, information extracted solely from the confusion matrix. Indeed, these
are some of the most widely utilized measures – despite the limitations that
we discussed – and they are shown to perform reasonably well when used in
the right context. The measures that we discussed here, at least in their classical
form, generally address the performance assessment of a deterministic classifier.
There is also a qualitative aspect to the limitation of the information conveyed by the confusion matrix. Because the entries of the confusion matrix report only the numbers of correctly or incorrectly classified examples, so as to provide the information succinctly, the confusion-matrix-based metrics incur another loss: they ignore which sets of examples are classified correctly or missed by the respective (two or more) algorithms, rather than just how many. Indeed, algorithms that have a highly overlapping set
of instances on which they perform similarly can have learned more accurate
models of the data than the ones achieving the same numbers but on nonover-
lapping sets of instances. Another limiting aspect of the confusion-matrix-based
measures follows from the fact that they do not allow users to visualize the
performance of the learning algorithm with varying decision thresholds. On a
similar account, the measures resulting from such information also tend to be
scalar. Such an attempt to obtain a succinct single-measure description of the
performance results in the loss of information. Although pairs of measures are
sometimes used to compensate for this (e.g., sensitivity and specificity), this does
not address the issue completely. In the next chapter, we extend our discussion of
performance measures to the ones that do not rely only on the information from
a confusion matrix (at least, not just a single confusion matrix). In particular, we
look into graphical measures that allow the user to visualize the performance of
the classifier, for a given criterion, over its full operating range (range of possible
values of the classifier’s decision threshold).
8 Note that WEKA reports Fleiss’s Kappa for the multiclass case.
Comparative analyses of the performance measures themselves have also appeared (Ferri et al., 2009) and are very important because they affect our choice of a performance metric, given what we know of the domain to which it will be applied.
Various studies about evaluating and combining the performance measures
themselves have appeared, such as those of Ling et al. (2003); Caruana and
Niculescu-Mizil (2004); Huang and Ling (2007); and Ferri et al. (2009). We
will see some of the interesting aspects of this research in Chapter 8.
WEKA has become one of the most widely used machine learning tools. The relevant sources are Witten and Frank (2005a, 2005b).
With regard to combining the prior class distribution and the asymmetric
classification costs, the concept of cost ratios was proposed. In general, Flach
(2003) suggests that we use the neutral term skew-sensitive learning rather than
cost-sensitive learning to refer to adjustments to learning or evaluation pertaining
to either the class distribution or the misclassification cost. Using cost ratios for
classifier performance assessment was also discussed by Ciraco et al. (2005). The
first prominent criticism against the use of accuracy and empirical error rate in
the context of learning algorithms was made by Provost et al. (1998). Adapting
binary metrics to multiclass classification and skew-ratio considerations was
also discussed by Lavrač et al. (1999), who propose weighted relative accuracy.
Examples of stochastic algorithms measuring Gibbs and Bayes risk can be found
in (Marchand and Shah, 2005) and (Laviolette et al., 2010).
One of the most widely used agreement statistics has been Cohen's κ (Cohen, 1960). Scott's π statistic was proposed in (Scott, 1955), and Bennett's S coefficient in (Bennett et al., 1954). Earlier attempts to measure agreement included Jaccard's index (Jaccard, 1912) and the Dice coefficient (Dice, 1945), among others; however, these did not take chance correction into account. Many generalizations of Cohen's κ agreement statistic have appeared, each applicable in its respective context. Some of the relevant readings include those of Fleiss (1971), Kraemer (1979), Schouten (1982), and Berry and Mielke (1988). Issues with regard to the behavior of Cohen's κ in the presence of bias and prevalence were also identified and corresponding corrected measures proposed; see, for instance, (Byrt et al., 1993). Finally, Gwet (2002a, 2002b) noted the issues with Cohen's κ measures and introduced the AC1 statistic, defined as
$$AC1 = \frac{P_o - P_e^G}{1 - P_e^G},$$

where $P_e^G = 2P_P(1 - P_P)$ and $P_P$ is defined in Equation (3.3) in Subsection 3.4.2.
4
Performance Measures II
Our discussion in the last chapter focused on performance measures that relied
solely on the information obtained from the confusion matrix. Consequently it
did not take into consideration measures that either incorporate information in
addition to that conveyed by the confusion matrix or account for classifiers that
are not discrete. In this chapter, we extend our discussion to incorporate some
of these measures. In particular, we focus on measures associated with scoring
classifiers. A scoring classifier typically outputs a real-valued score on each
instance. This real-valued score need not necessarily be the likelihood of the test
instance over a class, although such probabilistic classifiers can be considered
to be a special case of scoring classifiers. The scores output by the classifiers
over the test instances can then be thresholded to obtain class memberships
for instances (e.g., all examples with scores above the threshold are labeled as
positive, whereas those with scores below it are labeled as negative). Graphical
analysis methods and the associated performance measures have proven to be
very effective tools in studying both the behavior and the performance of such
scoring classifiers. Among these, the receiver operating characteristic (ROC)
analysis has shown significant promise and hence has gained considerable pop-
ularity as a graphical measure of choice. We discuss ROC analysis in significant
detail. We also discuss some alternative graphical measures that can be applied
depending on the domain of application and assessment criterion of interest.
We also touch on metrics commonly known as reliability metrics that take par-
tial misclassification loss into account. Such metrics also form the basis for
evaluation measure design in continuous learning paradigms such as regression
analysis. We briefly discuss the root-mean-square-error metric as well as metrics
inspired from information theory, generally utilized in the Bayesian analysis of
probabilistic classifiers.
Let us start with the commonly used graphical metrics for performance
evaluation of classifiers.
In its original signal detection setting, the choice of a receiver's operating point depends on the distribution of the noise that corrupts the signal, the strength of the signal itself, and the
desired hit (detecting the signal when the signal is actually present) or false-
alarm (detecting a signal when the signal is actually absent) rate. The selection
of the best operating point is typically a trade-off between the hit rate and the
false-alarm rate of a receiver.
In the context of learning algorithms, ROC graphs have been used in a variety
of ways. ROC is a very powerful graphical tool for visualizing the performance
of a learning algorithm over varying decision criteria, typically in a binary clas-
sification scenario. ROCs have been utilized not only to study the behavior of
algorithms but also to identify optimal behavior regions, perform model selec-
tion, and perhaps most relevant to our context, for the comparative evaluation
of learning algorithms. However, before we proceed with the evaluation aspect,
it would be quite helpful to understand the ROC space, the meaning of the ROC
curve, and its relation to other performance measures.
An ROC curve is a plot in which the horizontal axis (the x axis) denotes
the false-positive rate FPR and the vertical axis (the y axis) denotes the true-
positive rate TPR of a classifier. We discussed these measures in the last chapter.
As you may recall, the TPR is nothing but the sensitivity of the classifier, whereas the FPR is nothing but 1 − TNR (where TNR is the true-negative rate) or, equivalently, 1 − specificity of the classifier. Hence, in this sense, ROC analysis studies the
relationship between the sensitivity and the specificity of the classifier.
Figure 4.1. The ROC space, with the false-positive rate on the x axis and the true-positive rate on the y axis, marking the best classifier point (0, 1), the trivial classifiers, the random-classifier diagonal TPR = FPR, and a classifier f performing worse than random, together with its mirror image f^c across the center of the graph.
The point (1, 0), in contrast, denotes the worst case: a classifier represented by this point gets all its predictions wrong. On the other hand, the point (0, 1)
denotes the ideal classifier, one that gets all the positives right and makes no
errors on the negatives. The diagonal connecting these two points has TPR = 1 –
FPR. Note that 1 – FPR is nothing but TNR, as discussed in the last chapter.
This goes to show that the classifiers along this diagonal perform equally well
on both the positive and the negative classes.
An operating point in the ROC space corresponds to a particular decision
threshold of the classifier that is used to assign discrete labels to the examples.
As just mentioned, the instances achieving a score above the threshold are
labeled positive whereas the ones below are labeled negative. Hence, what the
classifier effectively does is establish a threshold that discriminates between
the instances from the two classes coming from two unknown and possibly
arbitrary distributions. The separation between the two classes then decides the
classifier’s performance for this particular decision threshold. Hence each point
on the ROC space denotes a particular TPR and FPR for a classifier. Now, each
such point will have an associated confusion matrix summarizing the classifier
performance. Consequently an ROC curve is a collection of various confusion
matrices over different varying decision thresholds for a classifier.
Theoretically, by tuning the decision threshold over the continuous interval
between the minimum and maximum scores received by the instances in the
dataset, we can obtain a different TPR and FPR for each value of the scoring
Figure 4.2. The ROC curves for two hypothetical scoring classifiers f1 and f2.
threshold, which should result in a continuous curve in the ROC space (such as
the one shown in Figure 4.2 for two hypothetical classifiers f1 and f2 ). However,
this is not necessarily the case in most practical scenarios. The reasons for this
are twofold. First, the limited size of the dataset limits the number of values
on the ROC curves that can be realized. That is, when the instances are sorted
in terms of the scores achieved as a result of the application of the classifier,
then all the decision thresholds in the interval of scores of any two consecutive
instances will essentially give the same TPR and FPR on the dataset, resulting
in a single point. Hence, in this case, the maximum number of points that can be
obtained are upper bounded by the number of examples in the dataset. Second,
this argument assumes that a continuous tuning of the decision threshold is
indeed possible. This is not necessarily the case for all the scoring classifiers, let
alone the discrete ones for which such tuning cannot be done at all. Classifiers
such as decision trees, for instance, allow for only a finite number of thresholds (upper bounded by the number of possible labels over the leaves of the decision tree). Hence, in the typical scenario of a scoring classifier, varying the decision
threshold results in a step function at each point in the ROC space. An ROC
curve can then be obtained by extrapolation over this set of finite points. Discrete
classifiers, the ones for which such a tuning of the decision threshold is not
possible, yield discrete points in the ROC space. That is, for a given test set T ,
a discrete classifier f will generate one tuple [FPR(f ), TPR(f )] corresponding
Figure 4.3. An example of an ROC plot for discrete classifiers in a hypothetical scenario, showing the points obtained by svm, nb, c4.5, nn, f1, and f2.
to one point in the two-dimensional (2D) ROC space (for instance, the classifier
f or f^c in Figure 4.1 represents such discrete classifiers). Figure 4.3 shows
some examples of points given by six classifiers in the ROC space on some
hypothetical dataset.
The classifiers appearing on the left-hand side on an ROC graph can be
thought of as more conservative in their classification of positive examples, in
the sense that they have small false-positive rates, preferring failure to recognize
positive examples to risking the misclassification of negative examples. The
classifiers on the right-hand side, on the other hand, are more liberal in their
classification of positive examples, meaning that they prefer misclassifying
negative examples to failing to recognize a positive example as such. This can
be seen as quite a useful feature of ROC graphs because different operating points
might be desired in the context of different application settings. For instance,
in the classical case of cancer detection, labeling a benign growth as cancer leads to fewer negative consequences than failing to recognize a cancerous growth as such, whereas false negatives are not as serious in applications such as information retrieval.
Let us illustrate this in the plot of Figure 4.3. We see that the “best” overall
classifier is svm because it is closest to point (0, 1). Nevertheless, if retaining
a small false-positive rate is the highest priority, it is possible that, in certain
circumstance, c4.5 or even nn would be preferred. There is no reason why nb
should be preferred to c4.5 or svm, however, because both its false-positive and
true-positive rates are worse than those of c4.5 or svm. We will come back to this
a bit later. f 1 is a very weak classifier, barely better than random guessing. As
for f2, it can be made (roughly) equivalent to nb once its decisions are reversed, in a manner analogous to that of classifier f in Figure 4.1, which is the mirror of f^c along the center of the graph.
It should be noted that it is not trivial to characterize the performances
of classifiers that are just slightly better than those of the random classifier
(points just above the diagonal). A statistical significance analysis is required
in such cases to ascertain whether the marginally superior performance of such
classifiers, compared with that of a random classifier, is indeed statistically
significant. Finally, each point on the ROC curve represents a different trade-off
between the false positives and false negatives (also known as the cost ratio).
The cost ratio is defined by the slope of the line tangent to the ROC curve at a
given point.
With a finite test set, only finitely many dichotomies over the label assignment to the test instances can be realized,
and hence the score can be thresholded at a finite number of points (resulting
in different performances). Hence this would yield a piecewise ROC graph as
a consequence of a discrete score distribution. Similarly, as mentioned earlier,
classifiers such as decision trees yield only a finite number of scores (and hence
thresholding possibilities), again giving a discrete score distribution. An ROC
curve is obtained by extrapolation of the resulting discrete points on the ROC
graph. However, in the limit, the scores can be generalized to being probability
density functions.
Another observation worth making in the case of ROC analysis is also that
the classifier scores need to be only relative. The scores need not be in any
predefined intervals, nor be likelihoods or probabilities over class memberships.
What indeed matters is the label assignment obtained (and hence the resulting
confusion matrix) when the score interval is thresholded. The overall hope is
that the classifier typically scores the positive examples higher than the negative
examples. In this sense, the scoring classifiers are sometimes also referred to as
ranking classifiers in the context of ROC analysis. However, it is important to
clarify that these algorithms need not be ranking algorithms in the classical sense.
That is, they need not rank all the instances in a desired order. What is important
is that the positive instances are generally ranked higher than the negative ones.
Therefore we henceforward use the term scoring classifiers, avoiding the more
confusing term of ranking classifiers.
An ROC curve thus characterizes the behavior of the classifier over its entire operating range and, further, over different class distributions.
Hence, in considering a 2D ROC space, we essentially relax this assumption.
When considering performances on a 2D slice, we have the implicit assumption
that the TPR and the FPR are independent of the empirical (or expected) class
distributions. However, when this 2D ROC space is considered with respect
to performance measures, such as accuracy, that are sensitive to such class
imbalances, it is wise to incorporate this consideration into the performance
measure itself. In addition, we can take misclassification costs in such cases into
account by including the cost ratios.
More generally, factors such as class imbalance, misclassification costs, and credits for correct classification can be incorporated through a single quantity, the skew ratio. A single quantity suffices not because it can separately account for each of the different concerns previously mentioned; rather, the rationale is that these factors have similar effects on performance measures such as accuracy or empirical error rate, and hence a single quantity can signify their (possibly combined) effect. This does not have a significant impact as long as the performance measure
is used to compare various algorithms’ performances under similar settings.
However, the interpretation of the performance measure and the corresponding
algorithm’s performance with respect to it can vary significantly.
A skew ratio rs can be utilized such that rs < 1 if positive examples are
deemed more important, for instance, because of a class imbalance with fewer
positives in the test set compared with the negatives or because of a high misclas-
sification cost associated with the positives. In an analogous manner, rs > 1 if
the negatives are deemed more important. Accounting for class imbalances, mis-
classification costs, credits for correct classification, and so on hence compels
the algorithm to trade off the performance on positives and negatives in relation
to each other. This trade-off can be represented, in the context of skewed sce-
narios, by the skew ratio. Let us see how, in view of the skew consideration by
means of rs , the classifier performance can be characterized with regard to some
specific performance measure.
4.2.3 Isometrics
The optimal operating point (threshold) for the classifier will need to be selected
in reference to some fixed performance measure. To select an optimal operating
point on the ROC curve, any performance measure can be used as long as it can
be formulated in terms of the algorithm’s TPR and FPR. For instance, given a
skew ratio rs , we can define the (skew-sensitive) formulation of the accuracy of
classifier f as
$$\mathrm{Acc}(f) = \frac{TPR(f) + r_s\,(1 - FPR(f))}{1 + r_s}, \qquad (4.1)$$
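As a sketch, Equation (4.1), reconstructed as above (note that rs = 1 recovers the balanced case that weighs both classes equally), is trivial to compute in R; the operating point used below is hypothetical.

# Skew-sensitive accuracy of Equation (4.1).
skew_acc <- function(tpr, fpr, rs = 1) (tpr + rs * (1 - fpr)) / (1 + rs)

skew_acc(tpr = 0.8, fpr = 0.3, rs = 1)     # 0.75: classes weighed equally
skew_acc(tpr = 0.8, fpr = 0.3, rs = 0.5)   # 0.767: positives weighed more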
Figure 4.4. An ROC graph showing isoaccuracy lines calculated according to Equation (4.1). The subscripts following A denote the respective accuracies for rs = 1 (black lines) and rs = 0.5 (dash-dotted lines).
Figure 4.5. An ROC graph showing isoprecision lines calculated according to Equation (4.2) for rs = 1. The subscripts following P denote the respective precisions.
Figure 4.6. A hypothetical ROC curve. Point po is a suboptimal point; that is, po is not optimal along any isoaccuracy line for any skew ratio.
Note here that this interpretation gives the isometrics in the expected cost sce-
nario. The skew scenario is more general and the expected cost can be interpreted
as its special case. However, analyzing expected skew can be more difficult, if
at all possible, in most scenarios.
Although we do not show representative isocurves here, metrics such as
entropy or the Gini coefficient yield curves on the 2D ROC graph. Interested
readers are referred to the pointers in Section 4.8.
For a given performance measure, we can consider the highest point on the
ROC curve that touches a given isoperformance line of interest (that is, with
desired rs ) to select the desired operating point. This can be easily done by
starting with the desired isoperformance line at the best classifier position in the
ROC graph (FPR = 0, TPR = 1) and gradually sliding it down until it touches
one or more points on the curve. The points thus obtained are the optimal
performances of the algorithm for a desired skew ratio. We can obtain the value
of the performance measure at this optimal point by looking at the intersection
of the isoperformance line and the diagonal connecting the points (FPR = 0,
TPR = 1) and (FPR = 1, TPR = 0). Naturally there are many points that are not
optimal. That is, there is no skew ratio such that these points correspond to the
optimal performance of the learning algorithm. We refer to all such points as
suboptimal. For instance, the point po in Figure 4.6 is a suboptimal point. The
set of points on the ROC curve that are not suboptimal forms the ROC convex
hull, generally abbreviated as ROCCH (see Figure 4.7).
Figure 4.7. The ROC convex hull (ROCCH) formed by the points of the ROC curve that are not suboptimal.
Figure 4.8. The ROCCH and model selection. See text for more details.
Figure 4.9. The ROC curves for two hypothetical scoring classifiers f1 and f2, in which a single classifier is not strictly dominant throughout the operating range.
Figure 4.10. The convex hull over the classifiers f1 and f2 of Figure 4.9.
Table 4.1. Points used to generate an ROC curve. See Example 4.1.

Instance #    1     2     3     4     5     6     7     8     9     10
Score         0.95  0.9   0.8   0.85  0.68  0.66  0.65  0.64  0.5   0.48
True class    p     n     p     p     n     p     n     p     n     n
Example 4.1. Table 4.1 shows the scores that a classifier assigns to the instances
in a hypothetical test set along with their true class labels. According to the
1 Note that we directly use the index i to denote a data point instead of x_i, since this corresponds directly to Listing 4.1, where i is used to iterate over the instances.
Figure 4.11. The ROC graph obtained from the scores and true labels of Table 4.1 (x axis: false-positive rate; y axis: true-positive rate).
algorithm of Listing 4.1, the threshold will first be set at 0.48. At that threshold,
we obtain TPR = FPR = 1 because every positive example has a score above
or equal to 0.48, meaning that all the positive examples are well classified, but
all the negative ones, with scores above 0.48 as well, are misclassified. This
represents the first point on our curve. All the thresholds issued by increments
of 0.05 until 0.5 yield the same results. At 0.5, however, we obtain TPR = 1
and FPR = 0.8. This represents the second point of our curve. The next relevant
threshold is 0.64, which yields TPR = 1 and FPR = 0.6, giving the third point
on the curve. We obtain the rest of the points in a similar fashion to obtain the
graph of Figure 4.11.2
2 Note that the plots obtained by varying the thresholds of continuous or scoring classifiers form a step function over a finite number of points. This would approximate the true curve of the form of Figure 4.2 in the limit of an infinite number of points because, in principle, the thresholds can be varied over the interval [−∞, +∞]. We discussed this earlier in the chapter.
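The threshold-sweeping procedure of Example 4.1 can be sketched in a few lines of R. The sketch below is not the book's Listing 4.1 itself, merely a minimal reimplementation of the idea: each distinct score is used as a threshold, and the resulting (FPR, TPR) pair is recorded.

# ROC points by threshold sweeping, on the data of Table 4.1.
scores <- c(0.95, 0.9, 0.8, 0.85, 0.68, 0.66, 0.65, 0.64, 0.5, 0.48)
labels <- c("p", "n", "p", "p", "n", "p", "n", "p", "n", "n")

roc_points <- function(scores, labels) {
  thresholds <- sort(unique(scores))
  t(sapply(thresholds, function(th) {
    pred <- ifelse(scores >= th, "p", "n")   # label positive at or above th
    c(threshold = th,
      FPR = sum(pred == "p" & labels == "n") / sum(labels == "n"),
      TPR = sum(pred == "p" & labels == "p") / sum(labels == "p"))
  }))
}
roc_points(scores, labels)
# e.g., threshold 0.48 gives TPR = FPR = 1; threshold 0.5 gives TPR = 1,
# FPR = 0.8; threshold 0.64 gives TPR = 1, FPR = 0.6, as in Example 4.1.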
The first statistic aims at establishing the performance that a learning algorithm
can achieve above the random classifier along TPR = FPR. The second signifies
the operating range of the algorithm that yields classifiers with lower expected
cost. The final and probably the most popular summary statistic for the ROC
analysis is the AUC. Hence we discuss this metric in more detail.
The AUC represents the performance of the classifier averaged over all the
possible cost ratios. Noting that the ROC space is a unit square, as previously
discussed, it can be clearly seen that the AUC for a classifier f is such that
AUC(f ) ∈ [0, 1], with the upper bound attained for a perfect classifier (one with
TPR = 1 and FPR = 0). Moreover, as can be noted, the random classifier represented by the diagonal cuts the ROC space in half, and hence AUC(f_random) = 0.5.
For a classifier with a reasonably better performance than random guessing, we
would expect it to have an AUC greater than 0.5. An important point worth
noting here is that an algorithm can have an AUC value of 0.5 also for reasons
other than a random performance. If the classifier assigns the same score to all
examples, whether negative or positive, we would obtain classifiers along the
diagonal TPR = FPR. We can also obtain a similar curve if the classifier assigns
similar score distributions to the two classes. Consequently this would result in an AUC
of approximately 0.5. We can also obtain such a metric value if the classifier
performs very well on half of the examples of one class while at the same time
performing poorly on the other half (that is, it assigns the highest scores to one
half and the lowest to the other).
Another interpretation of an AUC can be obtained for ranking classifiers in
that AUC represents the ability of a classifier to rank a randomly chosen positive
test example higher than a negative one. In this respect, this is shown to be
equivalent to Wilcoxon’s Rank Sum test (also known as the Mann–Whitney U
test). Wilcoxon’s Rank Sum test is a nonparametric test to assess whether two
samples (over orderings) of the observations come from a single distribution.
It will be discussed at greater length in Chapter 6. With regard to the Gini
coefficient (Gini), a measure of statistical dispersion popular in economics, it
has been shown that AUC = (Gini + 1)/2.
Elaborate methods have been suggested to calculate the AUC. We illustrate
the computation according to one such approach in Section 4.6. However, using
Wilcoxon’s Rank Sum statistic, we can obtain a simpler manner of estimating
the AUC for ranking classifiers. To the scores assigned by the classifier to each
test instance, we associate a rank in the order of decreasing scores. That is, the
example with the highest score is assigned the rank 1. Then we can calculate the
AUC as
$$\mathrm{AUC}(f) = \frac{\sum_{i=1}^{|T_p|} (R_i - i)}{|T_p|\,|T_n|},$$
where Tp ⊂ T and Tn ⊂ T are, respectively, the subsets of positive and negative
examples in test set T , and Ri is the rank of the ith example in Tp given by
classifier f .
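As a minimal sketch of this estimate on the data of Table 4.1 (note that base R's rank() assigns rank 1 to the lowest score, so the ranks below follow the ascending Mann–Whitney convention, under which the numerator counts concordant pairs):

scores <- c(0.95, 0.9, 0.8, 0.85, 0.68, 0.66, 0.65, 0.64, 0.5, 0.48)
labels <- c(1, 0, 1, 1, 0, 1, 0, 1, 0, 0)
R  <- rank(scores)                     # ascending ranks over all test scores
Rp <- sort(R[labels == 1])             # ranks of the positive examples
auc <- sum(Rp - seq_along(Rp)) / (sum(labels == 1) * sum(labels == 0))
auc                                    # 0.72 for this test set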
AUC basically measures the probability of the classifier assigning a higher
rank to a randomly chosen positive example than to a randomly chosen negative
example. Even though AUC attempts to be a summary statistic, just as other
single-metric performance measures, it too loses significant information about
the behavior of the learning algorithm over the entire operating range (for
instance, it misses information on concavities in the performance, or trade-off
behaviors between the true-positive and the false-positive performances).
It can be argued that the AUC is a good way to get a score for the general
performance of a classifier and to compare it with that of another classifier.
This is particularly true in the case of imbalanced data in which, as discussed
earlier, accuracy is strongly biased toward the dominant class. However, some
criticisms have also appeared warning against the use of AUC across classifiers
for comparative purposes. One of the most obvious is that, because the classifiers
are typically optimized to obtain the best performance (in context of the given
performance measure), the ROC curves thus obtained in the two cases would be
similar. This then would yield uninformative AUC differences. Further, if the
ROC curves of the two classifiers intersect (such as in the case of Figure 4.9),
the AUC-based comparison between the classifiers can be relativly uninforma-
tive and even misleading. However, a more serious limitation of the AUC for
comparative purposes lies in the fact that the misclassification cost distributions
(and hence the skew-ratio distributions) used by the AUC are different for dif-
ferent classifiers. We do not delve into the details of this limitation here, but
the interested reader is referred to the pointers in Section 4.8 and a brief related
discussion in Chapter 8.
Nonetheless, in the event the ranking property of the classifier is important (for
instance, in information-retrieval systems), AUC can be a more reliable measure
of classifier performance than measures such as accuracy because it assesses
the ranking capability of the classifier in a direct manner. This means that the
measure will correlate the output scores of the classifier and the probability
of correct classification better than accuracy, which focuses on determining
whether all data are well classified, even if some of the data considered are
not the most relevant for the application. The relationship between AUC and
accuracy has also received some attention (see pointers in Section 4.8).
4.2.6 Calibration
The classifiers’ thresholds based on the training set may or may not reflect the
empirical realizations of labelings in the test set. That is, if the possible score
interval is [−∞, +∞], and if no example obtains a score in the interval [t1 , t2 ] ⊂
[−∞, +∞], then no threshold in the interval [t1 , t2 ] will yield a different point
in the ROC space. The extrapolation between two meaningful threshold values
(that is, ones that result in at least one different label over the examples in the
test set) may not be very sensible. One solution to deal with this problem is
calibration. All the scores in the interval [t1 , t2 ] can be mapped to the fraction of
the positive instances obtained as a result of assigning any score in this interval.
That is, for any threshold in the interval [t1 , t2 ], the fraction of instances labeled
as positive remains the same, and hence all the threshold scores in the interval
can be mapped to this particular fraction.
This is a workable solution as long as there are no concavities in the ROC
curve. Concavity in the curve means that there are skew ratios for which the
classifier is suboptimal. This essentially means that better classifier performance
can be obtained for the skew ratios lying in the concave region of the curve
although the empirical estimates do not suggest this. In the case of concavities,
the behavior of the calibrated scores does not mimic the desired behavior of the
slope of the threshold interval. The classifier obtained over calibrated scores can
overfit the data, resulting in poor generalization. A solution for dealing with the
issue of score calibration in the case of nonconvex ROCs has been proposed in
the form of isotonic regression over the scores. The main idea behind the isotonic
regression approach is to map the scores corresponding to the concave interval,
say [t1c , t2c ], to an unbiased estimate of the slope of the line segment connecting
the two points corresponding to the thresholds t1c and t2c . This in effect bypasses
the concave segments by extrapolating a convex segment in the interval. Some
pointers of interest with regard to calibration can be found in Section 4.8.
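As a simplified, self-contained illustration, base R's isoreg() can stand in for the isotonic-regression approach; the scores and labels below are hypothetical:

scores <- c(0.1, 0.3, 0.35, 0.4, 0.7, 0.8, 0.9)   # hypothetical classifier scores
y      <- c(0,   0,   1,    0,   1,   1,   1)     # true labels, in score order
fit <- isoreg(scores, y)        # monotone (nondecreasing) fit of label on score
cbind(score = scores, calibrated = fit$yf)

Here the non-monotone stretch of labels around the scores 0.35 to 0.4 is pooled into a single calibrated value, which is exactly the bypassing of a concave segment just described.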
$$\mathrm{AUC}_{multiclass}(f) = \frac{2}{l(l-1)} \sum_{l_i, l_j \in L} \mathrm{AUC}_{l_i,l_j}(f),$$
where AUCmulticlass (f ) is the total AUC of the multiclass ROC for the classifier
f , L is the set of classes such that |L| = l, and AUCli ,lj (f ) is the AUC of the
two-class ROC curve of f for the classes li and lj .
Finally, as we discussed earlier, ROCs can be a significant tool to assist in
evaluating the performance of hybrid classifiers designed to improve perfor-
mance, in removing concavities from the individual classifiers’ performance, or
in the case of analyzing cascaded classifiers. The details are out of the scope of
this book but some pointers are given in Section 4.8.
In this section, we discuss other visual analysis methods related to the ROC curves. In particular, we devote some space to discussing the cost curves.
But, for now, let us start with the lift charts.
Precision–recall (PR) curves can sometimes be more appropriate than the ROC curves in the event of
highly imbalanced data (Davis and Goadrich, 2006).
Cost-Curve Space
Cost curves plot the misclassification cost as a function of the proportion of
positive instances in the dataset. Let us, for the moment, forget all about costs
and focus just on the error rate. We can then graft the costs onto our explanations
later. The important thing to keep in mind is that there is a point–line duality
between cost curves and ROC curves: cost curves are the point–line duals of the
ROC curves. In ROC space, a discrete classifier is represented by a point. The
points representing several classifiers (produced by manipulating the threshold
of the base classifier) can be joined (and extrapolated) to produce a ROC curve
or the convex hull of the straight lines produced by joining each point together.
In cost space, each of the ROC points is represented by a line and the convex hull
of the ROC space corresponds to the lower envelope created by all the classifier
lines. We illustrate the cost-curve space in Figure 4.12.
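Before walking through the figure, the duality can be made concrete with a small sketch, assuming for now equal misclassification costs: a classifier occupying the point (FPR, TPR) in ROC space becomes the cost-space line error(P(+)) = (1 − TPR) × P(+) + FPR × (1 − P(+)):

roc_to_cost_line <- function(fpr_pt, tpr_pt) {
  # Returns the cost-space line (error rate as a function of P(+))
  # that is the dual of the ROC point (fpr_pt, tpr_pt).
  function(p_pos) (1 - tpr_pt) * p_pos + fpr_pt * (1 - p_pos)
}
err <- roc_to_cost_line(fpr_pt = 0.1, tpr_pt = 0.8)   # hypothetical ROC point
err(c(0, 0.5, 1))   # 0.10, 0.15, 0.20: FPR at P(+) = 0 and FNR at P(+) = 1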
Figure 4.12 shows a space in which the error rate is plotted as a function of
the probability of an example being from the positive class, P (+). On the graph,
we first focus on four lines: the three dashed straight lines and the x axis. Each
of these lines represents a different ideal, terrible or trivial classifier: The x axis
represents the perfect classifier, i.e., no matter what the value of P (+) is, its
error rate is always 0. The horizontal dashed line located on the y axis’s value
of 1 is the opposite classifier: the one that has a 100% error rate, no matter what
P (+) is. The two dashed diagonal lines represent the trivial classifiers. The one
with an ascending slope is the one that always issues a negative classification,
whereas the one with descending slope always issues a positive classification.
[Figure 4.12. The cost-curve space: error rate (y axis) as a function of P(+), the probability of an example being from the positive class (x axis), showing the classifier that is always wrong, the classifier that is always right, and the positive and negative trivial classifiers.]
Clearly, the first of these classifiers gets a 100% error rate when P(+) = 1 and
a 0% error rate when P(+) = 0. Conversely, the second one obtains a 0% error
rate when P(+) = 1 and a 100% error rate when P(+) = 0.
The graph also shows four full descending straight lines and four full ascend-
ing straight lines. By these lines, we mean to represent two families of (hypothet-
ical) classifiers, say, A and B. Each of these lines would thus be represented by a
point in ROC space, and the four A points would be used to generate classifiers
A’s ROC curve whereas the four B points would be used to generate B’s ROC
curve.
The first thing to note is that there are regions of the cost space in which each
of the classifiers from A and B is irrelevant. These represent all the areas of the
lines that fall outside of the bottom triangle. These areas are irrelevant because,
in those, the trivial positive or negative classifiers have lower error rates than
classifiers from A or B. This lower triangle is called the classifiers’ operating
region. In this region, the line closest to the x axis is the best-performing
classifier. It can be seen that, in certain parts of the operating region, one of the
trivial classifiers is recommended whereas, in others, various classifiers from
the A or B family are preferred.
becomes the dominating classifier. This point is similar, in some ways, to the
points at which two ROC curves cross, except that the ROC graph does not
give us any practical information about when classifier A should be used over
classifier B. Such practical information can, on the other hand, be read off the
cost graph. In particular, we can read that for 0.26 ≤ P (+) < 0.4, classifier B2
should be used, but that for 0.4 ≤ P (+) < 0.48, classifier A1 should be used.
This is practical information because P (+) represents the class distribution of
the domain in which the Classifier A and B compound system is deployed. In
contrast, ROC graphs tell us that sometimes A is preferable to B, but we cannot
read off when this is so from the graph.
Cost Considerations
The last point that we would like to discuss about cost curves is their use with
different costs. Remember that, to simplify our discussion of cost curves, we
chose to ignore costs, i.e., we assumed that each class had the same classification
costs. When this is not the case, a very simple modification of the cost curves
needs to be applied. This modification affects only the identity of the axes.
The meaning of the curves and the reading of the graph remain the same. In
this context, rather than representing the error rate, the y axis represents the
normalized expected cost (NEC) or relative expected misclassification cost,
defined as

$$\mathrm{NEC} = \mathrm{FNR} \times P_C[+] + \mathrm{FPR} \times \big(1 - P_C[+]\big),$$
where FNR and FPR are the false-negative and false-positive rates, respectively,
and PC [+], the probability cost function, a modified version of P [+] that takes
costs into consideration, is formally defined as
$$P_C[+] = \frac{P[+] \times C[+|-]}{P[+] \times C[+|-] + P[-] \times C[-|+]},$$
where C[+|−] and C[−|+] denote the cost of predicting a positive when the
instance is actually negative and vice versa. P [−] is the probability of an example
belonging to the negative class.
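The following sketch simply transcribes these definitions in R, with hypothetical error rates and relative costs; here Cfp stands for C[+|−] and Cfn for C[−|+]:

FNR <- 0.2; FPR <- 0.1        # hypothetical error rates of a classifier
Cfp <- 1;   Cfn <- 5          # hypothetical relative misclassification costs
Ppos  <- seq(0, 1, by = 0.25) # a few values of P[+]
PCpos <- (Ppos * Cfp) / (Ppos * Cfp + (1 - Ppos) * Cfn)  # probability cost function
NEC   <- FNR * PCpos + FPR * (1 - PCpos)                 # normalized expected cost
cbind(Ppos, PCpos, NEC)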
The idea behind the relative superiority graphs is that, although the precise costs of each type of error might either not be
available or impossible to quantify, it may be the case that their relative costs
are known. Mapping such relative costs then transforms the ROC curves into
a set of parallel lines from which the superiority of classifiers in the regions
of interest (of relative cost ratios) can be inferred. To replace the associated
AUC, a loss comparison (LC) index is used in the case of relative superiority
curves. Sometimes interpreted as a binary version of the ROC curves, relative superiority curves can be used to identify whether a classifier is superior to the other, with the LC index measuring the confidence in the inferred superiority.
4.3.6 Limitations
We have discussed the limitations of the graphical analysis methods in the text in
the preceding subsections. It is important that one keep these limitations in mind
while drawing conclusions or making inferences based on these measures. Let
us illustrate this point with one last example. ROC analysis suggests that an ideal
classifier is achieved at point (0, 1) in the ROC space (i.e., it has FPR = 0 and
TPR = 1). Note the interesting aspect here: We make an implicit assumption
that the classifier always generates either a positive or a negative label on
every instance. This assumption can be problematic for the case of classifiers
that can abstain from making a prediction. A classifier that can abstain can
theoretically get TPR = 1 if it identifies all the positives correctly. However,
instead of making errors on negatives, if it abstains entirely from making predictions
(whether positive or negative) on the set of negative instances, it can still
achieve a FPR of zero even though, obviously, this classifier may be far from
ideal.
In fact, such shortcomings result from the limitation of the confusion matrix
itself and are reflected in all the measures and analysis methods that rely on the
confusion matrix, including the ROC analysis.
$$\mathrm{RMSE}(f) = \sqrt{\frac{1}{m}\sum_{i=1}^{m} \big(y_i - f(x_i)\big)^2},$$

where yi is the true label of test instance xi and f (xi) represents the label
predicted by the classifier f . Recall the definition of the risk of the classifier
from Chapter 2. The RMSE measures the same classifier risk, the only difference
being that of the loss function. Instead of using a zero–one loss function as in
the case of classification, RMSE uses a squared loss function. Naturally, other
notions of classifier risk can be defined under different settings by adapting the
associated loss function. The squared loss, in a sense, quantifies the error (or
alternatively closeness) of the predicted label to the true label. When specialized
to the case of probabilistic classifiers, this then can be interpreted as a reliability
measure.
Table 4.2. Intermediate values for calculating the RMSE. See Example 4.2

Inst. | Class 1   |         |         | Class 2   |         |         | SqrErr (sum
No.   | predicted | Actual  | Diff²/2 | predicted | Actual  | Diff²/2 | of differences)
1     | 0.95      | 1.00000 | 0.00125 | 0.05      | 0.00000 | 0.00125 | 0.0025
2     | 0.6       | 0.00000 | 0.18    | 0.4       | 1.00000 | 0.18    | 0.36
3     | 0.8       | 1.00000 | 0.02    | 0.2       | 0.00000 | 0.02    | 0.04
4     | 0.75      | 0.00000 | 0.28125 | 0.25      | 1.00000 | 0.28125 | 0.5625
5     | 0.9       | 1.00000 | 0.005   | 0.1       | 0.00000 | 0.005   | 0.01
In the absence of any other information about the application domain and the
specific evaluation needs, RMSE can serve as a good general-purpose perfor-
mance measure because it has been shown to correlate most closely with both
the classification metrics, such as accuracy, and the ranking metrics, such as the
AUC, in addition to its suitability in evaluating probabilistic classifiers (Caru-
ana and Niculescu-Mizil, 2004). Naturally it is reasonable to use it only in the
case of continuous classifier predictions. The qualifications that we made in the
beginning of this paragraph are indeed important; without these the RMSE need
not necessarily be more suitable than any other metric in consideration.
Example 4.2. Here is an example of the way WEKA calculates the RMSE
values. This example suggests how to handle continuous and multiclass data,
although, for simplicity, it deals with only the binary case.3 Let us assume that a
classifier is trained on a binary dataset and that the test set contains five instances.
We also assume that instances 1, 3, and 5 belong to class 1 whereas instances 2
and 4 belong to class 2. The results we obtained are listed in Table 4.2.
The columns titled “Class X predicted” (X = 1 or 2 in this example) show
the numerical predictions obtained for each instance with respect to the class
considered. The actual class value is shown in the next column. The columns titled
"Diff²/2" square the difference between the prediction and the actual value for each
class label and divide it by the number of class labels (2 in this example, because
this is a binary problem). These are summed in the column "SqrErr" for each instance.
The five values of SqrErr are then summed to obtain 0.975. Finally, we
compute the RMSE by dividing 0.975 by 5, the number of instances, and taking
the square root of that ratio. In this example, we thus get RMSE = 0.4416.
3 The explanation was obtained from the WEKA mailing list at https://list.scms.waikato.ac.nz/
pipermail/wekalist/2007-May/009990.html, but a new example was generated.
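The computation of Example 4.2 can be verified with a short R sketch of the calculation just described (this mirrors the arithmetic above, not WEKA's internal implementation):

pred1   <- c(0.95, 0.6, 0.8, 0.75, 0.9)   # predicted probability of class 1
actual1 <- c(1, 0, 1, 0, 1)               # true membership in class 1
# Per-instance squared error, averaged over the two classes (SqrErr in Table 4.2):
sqerr <- ((pred1 - actual1)^2 + ((1 - pred1) - (1 - actual1))^2) / 2
rmse  <- sqrt(sum(sqerr) / length(sqerr))
rmse                                      # 0.4416, as in the example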
of the algorithms, but also in defining the optimization criteria for the learning
problem itself. The main reason for their success is their intuitive nature. These
measures reward a classifier upon correct classification relative to the (typically
empirical) prior on the data. Note that, unlike the cost-sensitive metrics that
we have discussed so far, including the ROC-based measures, the information-
theoretic measures, by virtue of accounting for the data prior, are applicable
to probabilistic classifiers. Further, these metrics are independent of the cost
considerations and can be applied directly to the probabilistic output scenario.
Consequently these measures have found a wide use in Bayesian learning.
We discuss some of the main information-theoretic measures, starting with the
classical Kullback–Leibler divergence.
Kullback–Leibler Divergence
We consider probabilistic classifiers where these measures can be applied most
naturally. Let the true probability distribution over the labels be denoted as p(y).
Let the posterior distribution generated by the learning algorithm after seeing the
data be denoted by P (y|f ). Note that, because f is obtained after looking at the
training samples x ∈ S, this empirically approximates P (y|x), the conditional
posterior distribution of the labels. Then the Kullback–Leibler divergence, also
known as the KL divergence, can be utilized to quantify the difference between
the estimated posterior distribution and the true underlying distribution of the
labels:
$$\mathrm{KL}[p(y)\,\|\,P(y|f)] = \int p(y)\ln p(y)\,dy - \int p(y)\ln P(y|f)\,dy = \left[-\int p(y)\ln P(y|f)\,dy\right] - \left[-\int p(y)\ln p(y)\,dy\right] = -\int p(y)\ln\frac{P(y|f)}{p(y)}\,dy.$$
The first equality basically denotes the difference between the entropies of the
posterior and the prior distributions. For a given random variable y (labels in
our case) and a given distribution p(y), the entropy E, i.e., the average amount
of information needed to represent the labels, is defined as:
$$E[y] = -\int p(y)\log p(y)\,dy.$$
Hence the KL divergence basically just finds the difference between the entropies
of the two distributions P (y|f ) and p(y). In view of this interpretation, the
KL divergence is also known as the relative entropy. In information-theoretic
terms, the relative entropy denotes the average additional information required to
specify the labels by using the posterior distribution instead of the true underlying
distribution of the labels. The base of the logarithm is 2 in this case, and hence
the information content denoted by the entropy as well as the relative entropy
should be interpreted in bits. Alternatively, the natural logarithm can also be used.
This would result in an alternative unit called “nats” which, for all comparative
purposes, is equivalent to the preceding unit except for a ln 2 factor.
For a finite-size dataset S, this can be discretized to
$$\mathrm{KL}[p(y)\,\|\,P(y|f)] = -\sum_{x\in S} p(y)\ln\frac{P(y|f)}{p(y)} = \sum_{x\in S} p(y)\ln\frac{p(y)}{P(y|f)}.$$
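For a discrete toy illustration with hypothetical distributions, the divergence (in bits) can be computed directly:

kl_div <- function(p, q) sum(p * log2(p / q))   # discrete KL divergence, in bits
p <- c(0.6, 0.4)       # empirical prior over two labels
q <- c(0.8, 0.2)       # hypothetical posterior produced by a classifier
kl_div(p, q)           # ~0.151 bits of additional information required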
The information score of the prediction on an instance x with true label y can then be written as

$$\mathbf{I}(x) = I\big(P(y|f) \geq P(y)\big)\big[-\log P(y) + \log P(y|f)\big] + I\big(P(y|f) < P(y)\big)\big[-\log(1 - P(y)) + \log(1 - P(y|f))\big],$$

where P(y) represents the prior probability of a label y, P(y|f) denotes the
posterior probability of the label y (after obtaining the classifier, that is, after
seeing the data), and the indicator function I (a) is equal to 1 iff the predicate a is
true and zero otherwise.4 Note that this single definition, in fact, represents two
cases, which are instantiated depending on a correct or an incorrect classification.
Note as well that, for simplification, log in this formula represents log2.
In the case in which P (y|f ) ≥ P (y), the probability of class y has changed
in the right direction. Let us understand the terms in the measure from an
information-theoretic aspect. Recall from our previous discussion that the
entropy of a variable (labels y in our case) denotes the average amount of
information needed to represent the variables. This is nothing but an expectation
over the entire distribution P (y) over the variable. This basically means that,
for a single instantiation of the variable y, the amount of information needed
is nothing but − log P (y). That is, one needs − log P (y) bits of information
to decide if a label takes on the value y. Similarly, the amount of information
necessary to decide if the label does not take this particular value is nothing but
− log[1 − P (y)]. The metric hence measures the decrease in the information
needed to classify the instances as a result of learning the classifier f .
Now, the average information score Ia of a prediction over a test set T is
$$\mathbf{I}_a = \frac{1}{m}\sum_{j=1}^{m} \mathbf{I}(x_j), \qquad \forall x_j \in T,\ |T| = m,$$
4 Note that the notations for the information score and for the indicator function are slightly different:
For the information score, we used a boldfaced I, whereas we used an italic I for the indicator
function.
Example 4.3. We consider the example in Table 4.2, previously used to illus-
trate the calculation of the RMSE assuming a probabilistic interpretation. In
this table, the first instance is positive and P (y1 = 1) = 0.6 ≤ P (y1 = 1|f ) =
0.95. Note that the P(y)'s are calculated empirically. Because there are 3
out of 5 instances of class 1, we have P(y = 1) = 3/5 = 0.6. Similarly, we
have P (y = 0) = 0.4. Therefore this instance’s information score is calcu-
lated as I = − log(0.6) + log(0.95) = 0.66297. The fourth instance, on the
other hand, is negative and P (y4 = 0) = 0.4 > P (y4 = 0|f ) = 0.25. There-
fore this instance’s information score is calculated as I = − log(1 − 0.4) +
log(1 − 0.25) = 0.32193. Altogether, the information scores obtained for the
entire dataset are listed in Table 4.3. Using these values, we can calculate the
average information score by averaging the information score values obtained
for each instance. In the example, we get Ia = 0.39698.
We can also calculate the relative information score, first, by computing E as

$$E = -(0.6 \log_2 0.6 + 0.4 \log_2 0.4) = 0.97095,$$

and, second, by dividing Ia by E and turning the result into a percentage, i.e.,

$$\mathbf{I}_r = \frac{\mathbf{I}_a}{E} \times 100\% = 40.88\%.$$
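The numbers of Example 4.3 can be reproduced with a short sketch, assuming the two-case definition given earlier:

prior <- c(0.6, 0.4, 0.6, 0.4, 0.6)     # empirical prior of each instance's true class
post  <- c(0.95, 0.4, 0.8, 0.25, 0.9)   # posterior assigned to the true class
I <- ifelse(post >= prior,
            -log2(prior) + log2(post),          # probability moved the right way
            -log2(1 - prior) + log2(1 - post))  # probability moved the wrong way
Ia <- mean(I)                                   # 0.39698
E  <- -(0.6 * log2(0.6) + 0.4 * log2(0.4))      # entropy of the empirical prior
Ir <- 100 * Ia / E                              # 40.88%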
However, this measure too has some limitations, especially in multiclass
scenarios. Because it considers the probability of only the true class, other classes
are ignored. That is, each misclassification is weighted equally. Accordingly, a
lack of calibration in such cases would not be penalized even though correct
calibration is rewarded (the measure is one sided). As a result, this might not lead toward an optimal
posterior over the labels. Let us then look at a strategy proposed to deal with
this issue.
Even though we will not be making full use of the ROCR package, which
is a sophisticated and easy-to-use one, the reader is encouraged to explore
the measures and the features that are implemented in ROCR (see Howell,
2007).
5 ROCR is freely downloadable from: http://rocr.bioinf.mpi-sb.mpg.de/ (2007) (also, see Sing et al.,
2005).
=== Predictions on test data ===
These data will be the basis for our plots in R. In particular, we will build
a dataset containing, on the one hand, the numerical predictions made by the
classifier (c45, in this case), and, on the other hand, the true labels of the data. We
will do this by extracting the values in the last column of the preceding output
(column 6), making them vectors, extracting the actual labels in the second
column of the preceding output, converting them to values of 1 for positive and
0 for negative, and making them vectors as well. The two vectors thus created
will be assigned to two elements of an R object, one called predictions and the
other one called labels.
This can be done as in Listing 4.2.
> laborDTlabels <- c(0,0,1,1,1,1,0,0,1,1,1,1,0,0,1,1,1,1,0,0,1,1,1,1,0,0,1,1,1,
1,0,0,1,1,1,1,0,0,1,1,1,1,0,0,1,1,1,0,0,1,1,1,0,0,1,1,1)
> laborDTSingle <- c("predictions", "labels")
> laborDTSingle$predictions <- laborDTpredictions
> laborDTSingle$labels <- laborDTlabels
[ROC curve plot: true-positive rate vs. false-positive rate.]
results of each fold as well as the label results of each fold. This can be done as
follows.
Listing 4.4: Data preparation for multiple ROC curves.
> laborDT <- c("predictions", "labels")
> laborDT$predictions[[1]] <- c(1, 0.238, .918, .238, .238, 1)
> laborDT$predictions[[2]] <- c(0.15, 1, 1, .86, .15, .86)
> laborDT$predictions[[3]] <- c(.17, .17, .815, .815, .815, .815)
> laborDT$predictions[[4]] <- c(.02, .02, .967, .02, .075, .02)
> laborDT$predictions[[5]] <- c(.17, .17, .963, .963, .963, .963)
> laborDT$predictions[[6]] <- c(0.877, .08, 0.877, .08, .764, 0.877)
> laborDT$predictions[[7]] <- c(.16, 0.837, 0.837, 0.837, 0.837, 0.837)
> laborDT$predictions[[8]] <- c(.067, .067, .705, .973, .803)
> laborDT$predictions[[9]] <- c(.203, .86, .86, .86, .86)
> laborDT$predictions[[10]] <- c(1, .085, .346, 1, .346)
>
> laborDT$labels[[1]] <- c(0,0,1,1,1,1)
> laborDT$labels[[2]] <- c(0,0,1,1,1,1)
> laborDT$labels[[3]] <- c(0,0,1,1,1,1)
> laborDT$labels[[4]] <- c(0,0,1,1,1,1)
> laborDT$labels[[5]] <- c(0,0,1,1,1,1)
> laborDT$labels[[6]] <- c(0,0,1,1,1,1)
> laborDT$labels[[7]] <- c(0,0,1,1,1,1)
> laborDT$labels[[8]] <- c(0,0,1,1,1)
> laborDT$labels[[9]] <- c(0,0,1,1,1)
> laborDT$labels[[10]] <- c(0,0,1,1,1)
>
The data-entry operation just performed will create a data structure named
“laborDT.” (We do not discuss the warning messages output because these do
not affect the results obtained.)
Given the “laborDT” object just constructed, 10 ROC plots and an 11th,
vertical average plot can be built as follows.
This creates the plot of Figure 4.14.6 The 10 ROC curves constructed for each
fold are shown as broken lines. The solid line in bold represents the average of
these 10 lines. The box plots show the estimate of the spread of the 10 curves at
each point (when averaged vertically).
Listing 4.6: Data preparation for multiple ROC curves of two different classifiers.
> laborNB <- c("predictions", "labels")
> laborNB$predictions[[1]] <- c(0.649, 0.037, 1, 1, .985, 1)
> laborNB$predictions[[2]] <- c(0.984, 0.031, 0.999, 1, 0.489, 1)
> laborNB$predictions[[3]] <- c(0.072, 0, 0.997, 0.997, 1, 0.996)
> laborNB$predictions[[4]] <- c(0, 0.001, 0.996, 0.251, 0.944, 0.353)
> laborNB$predictions[[5]] <- c(0, .074, .65, .999, 1, 1)
6 Please note that, in the labor dataset, the technique illustrated here is far from ideal, especially
because each fold contains only 5 or 6 test instances. On the other hand, this example is practical for
illustration purposes, because it contains so few data points.
[Figure 4.14. ROC curves for the 10 folds with their vertical average; true-positive rate vs. false-positive rate.]
Listing 4.7: Plotting multiple ROC curves for two different classifiers.
>
> pred.DT <- prediction(laborDT$predictions, laborDT$labels)
> perf.DT <- performance(pred.DT, "tpr", "fpr")
> pred.NB <- prediction(laborNB$predictions, laborNB$labels)
Figure 4.15. Single ROC curves for c45 and nb on the labor data.
This creates the plot of Figure 4.15, which can be interpreted like the plot of
Figure 4.14, but with two averaged curves instead of a single one.
Please note that R and ROCR come with a large number of options. The pur-
pose of this section is to introduce the readers to the R tools that are immediately
useful for graphical classifier evaluation. However, we encourage the reader to
explore the various options in greater depth.
Both sets of commands return the 10 different AUC values obtained at each
fold by c45 and nb: 0.4375, 0.5, 1, 0.75, 1, 0.5625, 0.75, 1, 0.75, and 0.5833333
for c45 and 1, 0.875, 1, 1, 1, 1, 0.875, 1, 1, 1 for nb. It is clear from these
values, as it is from the previous graph and from the results obtained in previous
chapters, that nb performs much better on this dataset than c45.
Figure 4.16. Single lift curves for c45 and nb on the labor data.
This creates the plot of Figure 4.16, which again shows nb’s clear superiority
over c45 on this domain.
Note that even though the lift charts in their conventional form resemble
ROC curves, a more common usage includes lift value (ratio of the probability of
positive prediction given the classifier and that in the absence of the classifier) on
the y-axis and the rate of positive prediction on the x-axis. This shows increased
likelihood of true positives as a result of using the respective classifier against a
random model. This is hence the version illustrated here.
This creates the plot of Figure 4.17, which, once again, shows clearly nb’s
superior performance over c45’s on the labor domain.
Listing 4.12: Plotting the cost curves for c45 on the labor data.
> plot(0, 0, xlim=c(0,1), ylim=c(0,1), xlab='Probability cost function',
    ylab="Normalized expected cost", main='Cost curve for the performance
    of C45 on the Labor Dataset')
> pred <- prediction(laborDTSingle$predictions, laborDTSingle$labels)
> lines(c(0,1), c(0,1))
> lines(c(0,1), c(1,0))
> perf1 <- performance(pred, 'fpr', 'fnr')
> for (i in 1:length(perf1@x.values)) {
+   for (j in 1:length(perf1@x.values[[i]])) {
+     lines(c(0,1), c(perf1@y.values[[i]][j], perf1@x.values[[i]][j]),
+           col="gray", lty=3)
+   }
+ }
> perf <- performance(pred, 'ecost')
> plot(perf, lwd=3, xlim=c(0,1), ylim=c(0,1), add=T)
Figure 4.17. Single PR curves for c45 and nb on the labor data.
> pred <- prediction(laborDTSingle$predictions, laborDTSingle$labels)

with

> pred <- prediction(laborNBSingle$predictions, laborNBSingle$labels)
=== Detailed Accuracy By Class ===

=== Confusion Matrix ===

  a  b   <-- classified as
 14  6 |  a = bad
  9 28 |  b = good
As can be seen in the preceding output, the RMSE is reported under the name
“Root mean squared error,” and it has the value 0.4669 in the example. The
Kononenko and Bratko measures are also reported, under the names “K and
B Relative Info Score” (or Ir , in our notation) with a value of 1769.6451% in
the current example and “K and B Information Score” with values of 16.5588
bits and 0.2905 bits/instance. In fact, the first of these values corresponds to the
cumulative information score, or the sum of each instance’s information score
value, I, and the second one corresponds to Ia , the average information score,
or the previous value divided by the number of instances.
Figure 4.18. Cost curves for c45 (top) and nb (bottom) on labor data.
4.7 Summary
Complementing the last chapter that focused on performance measures based
solely on information derived from a (single) confusion matrix, this chapter
extended this discussion to measures that take into account criteria such as
skew considerations and prior probabilities. To this end, we discussed various
graphical evaluation measures that enable visualizing the classifier performance,
possibly for a given measure, over its entire operating range under the scoring
classifier assumption. In particular, we discussed in considerable detail the ROC
analysis and also the cost curves. In addition to discussing the summary statistics
with regard to these techniques, we also studied their relationship to other related
graphical measures. We then briefly discussed the metrics generally employed
in the case of continuous and probabilistic classifiers, focusing mainly on the
information-theoretic measures. We concluded with a discussion on specialized
metrics employed for learning problems other than classification as well as the
ones customized for specific application domains. In the next chapter, we turn
our focus to the issue of reliably estimating various performance measures,
mainly with regard to considerations about the size of datasets.
4.8 Bibliographic Remarks

Flach, 2003, and Flach, 2003). Also, for a take on dealing with ROC concavities,
see (Flach and Wu, 2005).
Calibration and isotonic regression were discussed by Zadrozny and Elkan
(2002) and Fawcett and Niculescu-Mizil (2007).
PR curves are less used in machine learning, although Davis and Goadrich
(2006) argue that they are more appropriate than ROC curves in the case of
highly imbalanced datasets. Details on relative superiority graphs can be found
in (Adams and Hand, 1999) and on DET curves in (Martin et al., 1997).
The relationship between RMSE and other metrics has been explored exten-
sively. Caruana and Niculescu-Mizil (2004) opined that RMSE is a very good
general-purpose metric because it correlates most closely with the classification
and the ranking metrics, in addition to the probability metrics. Noting, however,
that it can be applied only to continuous classifiers, Ferri et al. (2009) do not hold
RMSE higher than any other metric. For performance measures for regression
algorithms, see (Rosset et al., 2007). For a review and comparison of different
evaluation metrics used in ordinal classification, please refer to (Gaudette and
Japkowicz, 2009).
See (Geng and Hamilton, 2007) for more details on performance measures
utilized in association rule mining. In addition to the measures mentioned in the
text, Geng and Hamilton (2007) also define and discuss many others, includ-
ing prevalence, relative risk, Jaccard, odds ratio, Gini index, and J Measure,
and characterize them with respect to the criteria they defined. For some of
the measures just named, original source references can also be found in the
bibliographic remarks of the previous chapter.
See (He et al., 2002) for more details on performance measures for clustering
algorithms. Demartini and Mizzaro (2006) give a nice overview and classifi-
cation of the different metrics that were proposed in the field of information
retrieval between 1960 and 2006. Finally, see (Deeks and Altman, 2004) for a
discussion of performance measures used in the medical domain.
5
Error Estimation
We saw in Chapters 3 and 4 the concerns that arise from having to choose
appropriate performance measures. Once a performance measure is decided
upon, the next obvious concern is to find a good method for testing the learning
algorithm so as to obtain as unbiased an estimate of the chosen performance
measure as possible. Also of interest is the related concern of whether the
technique we use to obtain such an estimate brings us as close as possible to the
true measure value.
Ideally we would have access to the entire population and test our classifiers
on it. Even if the entire population were not available, if a lot of representative
data from that population could be obtained, error estimation would be quite
simple. It would consist of testing the algorithms on the data they were trained
on. Although such an estimate, commonly known as the resubstitution error, is
usually optimistically biased, as the number of instances in the dataset increases,
it tends toward the true error rate. Realistically, however, we are given a signifi-
cantly limited-sized sample of the population. A reliable alternative thus consists
of testing the algorithm on a large set of unseen data points. This approach is
commonly known as the holdout method. Unfortunately, such an approach still
requires quite a lot of data for testing the algorithm’s performance, which is
relatively rare in most practical situations. As a result, the holdout method is not
always applicable. Instead, a limited amount of available data needs to be used
and reused ingeniously in order to obtain sufficiently large numbers of samples.
This kind of data reuse is called resampling. Together, resubstitution, holdout,
and resampling constitute the area of error estimation. Error estimation is a
complex issue as the results of our experiments may be meaningless if the data
on which they were obtained are not representative of the actual distribution or
if the algorithms are unstable and their performance unpredictable. Resampling
must thus be done carefully.
Broadly, the resampling techniques can be divided into two categories: simple
resampling and multiple resampling. The simple resampling techniques tend to
use each data point for testing only once. Techniques such as k-fold cross-
validation and leave-one-out are examples of simple resampling techniques (we
may also include resubstitution in this category, although we tend to reserve the
term simple resampling for techniques that apply the algorithm multiple times,
making the most use of the data). Multiple resampling techniques, on the other
hand, do not refrain from testing data points more than once. Examples of
such techniques include random subsampling, bootstrapping, randomization,
and repeated k-fold cross-validation. In this chapter, we discuss various error-
estimation techniques that might be suitable in offering better assurances with
regard to the estimation of an algorithm’s performance measure, especially in a
limited data scenario.
Although both simple resampling and multiple resampling address the prob-
lem caused by the dearth of data, care needs to be taken with regard to the
effect of using such approaches on the assumptions made by subsequent steps
in the evaluation, mainly the statistical significance testing discussed in the next
chapter. Recall that the independence of the data used to obtain the sample statis-
tics is a fundamental assumption made by these tests. This chapter is aimed at
highlighting the basic assumptions and context of application of various resam-
pling approaches. In addition to studying the impact of using various resampling
techniques on the resulting estimate in the context of their respective bias and
variance behaviors in this chapter, we also see some specific approaches to
deal with the issue of resampling in the context of statistical hypothesis testing.
We also discuss the model selection considerations that should be made when
applying these resampling techniques. We conclude the chapter with a series
of examples in R that illustrate how to integrate resampling techniques into a
practical evaluation framework.
Figure 5.1 shows an ontology of the various error-estimation methods dis-
cussed in this chapter. The discussion of various techniques is generally pre-
sented with regard to the error rate (risk) of the learning algorithm on the dataset
as the performance measure. The main reason for adopting this performance
measure is that a concrete bias–variance analysis is available in this case for
classification, allowing us to explain the different aspects of the techniques more
clearly. However, similar arguments would hold in the case of other performance
measures as well.
Here is a roadmap for this chapter. Section 5.1 presents the context with
respect to the risk estimates over the classifiers because this is the most common
setting in which error-estimation methods are utilized (although some methods
presented are more general and can be applied in the context of other performance
measures too). Section 5.2 then presents the holdout method and also demon-
strates how theoretically sound guarantees on the true risk can be obtained in
this setting. This section also highlights the sample size requirement to do so and
presents a motivation for using resampling techniques because only limited data
instances are typically available in most practical scenarios. Before moving on
to discussing the resampling techniques, Section 5.3 introduces a bias–variance
[Figure 5.1. An ontology of error-estimation methods. The all-data regimen divides into no resampling (resubstitution, hold-out) and resampling; resampling divides into simple resampling (cross-validation: nonstratified and stratified k-fold, leave-one-out) and multiple resampling (random subsampling; bootstrapping: E0 and 0.632 bootstrap; randomization: permutation test; repeated k-fold cross-validation: 5×2 CV and 10×10 CV).]

5.1 Introduction
5.1 Introduction
Recall from Chapter 2 our definition of the true or expected risk of the classifier
f:
$$R(f) = \int L(y, f(x))\, dD(x, y), \qquad (5.1)$$
where L(y, f (x)) denotes the loss incurred on the label f (x) assigned to the
example x by f and the true label of x is y; the probability measure D(z) =
D(x, y) here is unknown.
For the zero–one loss, i.e., L(y, f(x)) = 1 when y ≠ f(x) and 0 otherwise,
we can write the expected risk as
$$R(f) = \Pr_{(x,y)\sim D}\big(f(x) \neq y\big). \qquad (5.2)$$
However, in the absence of knowledge of the true distribution, the learner often
computes the empirical risk RS (f ) of any given classifier f on some training
data S with m examples according to
$$R_S(f) = \frac{1}{m}\sum_{i=1}^{m} L(y_i, f(x_i)). \qquad (5.3)$$
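As a trivial sketch of this estimate under the zero–one loss (the labels and predictions are hypothetical):

empirical_risk <- function(y, y_hat) mean(y != y_hat)  # fraction of misclassifications
y     <- c(1, 0, 1, 1, 0)     # hypothetical true labels
y_hat <- c(1, 0, 0, 1, 1)     # hypothetical predictions f(x_i)
empirical_risk(y, y_hat)      # 0.4: two of the five examples are misclassified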
Because the empirical risk of the classifier is used to approximate its true
risk, the goal of error estimation is cut out in a straightforward manner: to
obtain as unbiased an estimate of empirical risk as possible. This empirical
risk on the training set can be measured in a variety of ways, as we will see
shortly. Naturally the simplest approach to risk estimation would be to measure
the empirical risk of a classifier fS trained on the full training data S on the
same dataset S. This measure is often referred to as the resubstitution risk (or
resubstitution error rate). We denote it as R_S^{resub}(f_S). As can be easily seen, this
results in judging the performance of a classifier on the data that were used by the
algorithm to induce that classifier. Consequently, such an estimate is essentially
optimistically biased. The best way to obtain a minimum resubstitution error
estimate is to overfit the classifier to each and every training point. However,
this would essentially lead to poor generalization, as we saw in Chapter 2.
Before we look into other ways of obtaining empirical risk estimates, we
need to consider various factors that should be taken into account in this regard.
The effect that the learning settings can have on the empirical risk estimate can
arise from the following main sources:
- random variation in the testing sets,
- random variation in the training sets,
- random variation within the learning algorithm, and
- random variation with respect to class noise in the dataset.
An effective error-estimation method should take these different sources of
random fluctuation into consideration. In the simplest case, if we have enough
training data, we can consider a separate set of examples to test the classifier
performance. This test set is not used in any way while training the learning
algorithm. Hence, assuming that the test set is representative of the domain in
which the classifier is to be applied, we can obtain a reliable risk estimate, given
enough training data. However, the qualification regarding enough training data
availability is serious. We will exemplify this a bit later. But first, let us go ahead
and look at this method of estimating risk on a separate testing set. This method
has some very important advantages because concrete guarantees on the risk
estimate can be obtained in this case.
$$R_T(f_S) = \frac{1}{|T|}\sum_{i=1}^{|T|} L\big(y_i, f_S(x_i)\big),$$

where f_S(·) denotes the classifier output by the learning algorithm when given
a training set S and with the underlying assumption that the test set T is repre-
sentative of the distribution generating the data.
One of the biggest advantages of a holdout error estimate lies in its inde-
pendence from the training set. Because the estimate is obtained on a separate
test set, some concrete generalizations on this estimate can be obtained. More-
over, there is another crucial difference between a holdout error estimate and
a resampled estimate. A holdout estimate pertains to the classifier output by
the learning algorithm, given the training data. Hence any generalizations made
over this estimate will essentially apply to any classifier with the given test-set
performance. We will see later that this is not the case for the resampled error
estimate. Let us now discuss the general behavior of the test-set error estimate.
Moreover, the sample size requirement grows rapidly with falling ε and
δ, limiting the utility of this approach in most learning scenarios for which the
overall availability of examples is limited. In such cases, researchers often make
use of what are called resampling methods.
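To get a feel for these numbers, the following sketch assumes a standard Hoeffding-style holdout bound, m ≥ ln(2/δ)/(2ε²), for a single classifier's test-set estimate (the exact bound derived for this setting may differ in its constants):

holdout_size <- function(eps, delta) ceiling(log(2 / delta) / (2 * eps^2))
holdout_size(eps = 0.05, delta = 0.05)   # 738 test examples
holdout_size(eps = 0.01, delta = 0.05)   # 18445: the requirement grows quickly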
As the name suggests, these methods are based on the idea of being able to
reuse the data for training and testing. Resampling has some advantages over the
single-partition holdout method. Resampling allows for more accurate estimates
of the error rates while allowing the learning algorithms to train on most data
examples. Also, this method allows the duplication of the analysis conditions
in future experiments on the same data. We discuss these methods in detail in
Sections 5.4 and 5.6 from both applied and theoretical points of view. However,
before proceeding to these, let us understand the bias–variance analysis that
we introduced in Chapter 2 in the context of error estimation. This will be
instrumental in helping us analyze various error-estimation methods with regard
to their strengths, limitations, and applicability in practical scenarios. Although
the discussion in the next section may seem a bit too theoretical, it will enable
the reader to put the subsequent discussion in perspective and obtain practical
insights about various resampling techniques.
$$\bar{y} = \frac{1}{n_{bs}} \sum_{i=1}^{n_{bs}} \sum_{j=1}^{m} f_i(x_j).$$
Then the error and the corresponding bias and variance terms can be estimated
from the evaluation of the final classifier f (chosen after model selection, that
is, generally with optimized parameters) on the test set Stest . The test error of the
classifier is just the average loss of the classifier on each example:
$$R_{S_{test}}(f) = E_{(x,y)\sim S_{test}}\big[L(y, f(x))\big] = \frac{1}{|S_{test}|}\sum_{i=1}^{|S_{test}|} L\big(y_i, f(x_i)\big),$$
where (x, y) ∼ S_test denotes that (x, y) is drawn from S_test.
Note that, in the case of a zero–one loss, the previous equation reduces to
the empirical error of Equation (2.5). However, we again run into the problem
of estimating the decomposition terms in the absence of knowledge of the true
data distribution. A common trick applied in such cases is the assumption of
noiseless data. That is, we assume that the data have no noise and hence the noise
term on the decomposition is zero. With this assumption, we can approximate
the empirical estimates of the average bias as
B A = Ex [BA (x)]
= Ex [L(y † , y)]
|Stest |
∼ 1
= L(yi , y)
|Stest | i=1
The zero-noise assumption just made can be deemed acceptable because the
main interest in studying the bias–variance behavior of the algorithm’s error is
in their variation in response to various factors that affect the learning process.
Hence we are interested in their relative values as opposed to the absolute values.
However, in the case in which the zero-noise assumption is violated, the bias
term previously estimated approximates the sum of the noise term and the bias
term because we approximate the error of the average model.
resampling methods. However, there are a few points worth noting here. The
choice of a particular resampling method also affects the bias–variance charac-
teristic of the algorithm’s error. This then can have important implications for
both the error estimation and subsequent evaluation of classifiers (both absolute
evaluation and with respect to other algorithms) and its future generalization as
a result of the impact of this choice on the process of model selection. As the
approximations of the various terms in the bias–variance decomposition sug-
gest, the size and the method of obtaining the training sets and the test sets can
have important implications. Most prominently, note that the approximations
are averaged over all samples. Moreover, in the case of average prediction, they
are also sampled over different training sets. As soon as the sample size is lim-
ited, these approximations encounter variability in their estimates. Further, if we
have a small number of training sets, the average prediction estimate cannot be
reliably obtained. This then affects the bias behavior of the algorithm. On the
other hand, having too few examples to test would result in high variance.
In practical scenarios, we invariably encounter the situation of a very limited
dataset, let alone insufficient training sets and test sets individually. Resampling
methods further aim at using the available data in a “smart” manner so as to
obtain these estimates relatively reliably. However, various modes of resampling
the data can introduce different limitations on the estimates for the bias and the
variance and hence on the reliability of the error estimates obtained as a result.
For instance, when resampling is done with replacement, as in our previous
example, we run the risk of seeing the same examples again (and missing some
altogether). In such cases, the estimates can be highly variable (especially in the
case of small datasets) because the examples in the training set might not be rep-
resentative of the actual distribution. Another concern would arise when the par-
titioning of the data is done multiple times because in this case the error estimates
are not truly independent if the different test-set partitions overlap. Similarly,
the bias may be underestimated for cases in which the training sets overlap over
multiple partitions. Although we do not focus on quantifying such effects for the
various resampling methods we discuss here, we highlight the limitations and
the effects of sample sizes on the reliability of the error estimate thus obtained.
While discussing the resampling methods, we frequently refer to their behav-
ior in case of small, moderate, or large dataset sizes. However, mapping a
concrete number to the sample size in these categories is nontrivial. The sample
size bound in the case of a holdout test set gives us an idea of how the sample
size is affected by the two parameters, ε and δ. The δ term there is fixed (to
a desired confidence level). However, the ε term implicitly takes into account the
generalization error of the algorithm, which in turn depends on a multitude of fac-
tors, including data dimensionality, classifier complexity, and data distribution.
Throughout our discussion we assume that we do not have the required number
of samples that would justify a holdout-based approach. Hence our reference to
small-, moderate- (or medium-), and large-sized datasets are all bounded by this
constraint. These terms respectively denote ranges over the number of instances
in the data, moving farther below the sample-size bound in descending order.
Interested readers can explore studies, referred to in the Bibliographic Remarks
section at the end of the chapter, that have attempted to obtain more concrete
quantifications over dataset sizes.
With the backdrop of different factors that play significant roles in the result-
ing reliability of the error estimates, let us now move on to the specific resampling
methods, starting with the simple resampling techniques.
1 We will see that the expectation of resampling risk over all possible partitions estimates the risk of
multiple trial resamplings.
where D_k(z) = D_k(x, y) denotes the distribution of the partition S_k. That is,
$$R_S^{resamp}(f) = \frac{1}{k}\sum_{i=1}^{k}\left[\frac{1}{|S_i|}\sum_{j=1}^{|S_i|} L\big(y_j, f_{S\setminus S_i}(x_j)\big)\right],$$
5.4.2 Resubstitution
As discussed earlier, resubstitution trains the classifier on the training set S
and subsequently tests it over the same set S of examples. That is, we have
the full mass of PW on the weight vector w = 1. Under our formalization, the
case of k = 1 is the resubstitution case. We already discussed the fact that,
as the classifier usually overfits the data it was trained on, the error estimate
provided by this method is optimistically biased. This is why this approach is
never recommended in practice for error-estimation purpose. We will see how
more reliable estimates can be obtained that are less biased and hence better
approximators of the true risk.
- Train and test the learning algorithm on S \ S_i and S_i, respectively
- Obtain the empirical risk R_{S_i}(f_i) (or any other performance measure of interest) of the classifier f_i, obtained by training the algorithm on S \ S_i, on S_i
- Increment i by 1
- Average the R_{S_i}(f_i) over all i's to obtain R_S(f), the mean empirical risk of the k-fold cross-validation
- Report R_S(f)
Note that, in Listing 5.1, even though we stick to the notation $R_S(f)$ to denote
the empirical risk of the k-fold CV, we refer to the averaged empirical risk over
the classifiers obtained in each of the k folds. It is important to note that the k
testing sets do not overlap. Each example is therefore used only once for testing
and k − 1 times for training.
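A minimal R sketch of this partitioning idea (kfoldIndices is a hypothetical helper of ours, not one of the chapter's listed functions):

# Assign each of m examples to exactly one of k test folds
kfoldIndices <- function(m, k) {
  folds <- sample(rep(1:k, length.out = m))   # shuffled fold labels
  lapply(1:k, function(i) which(folds == i))  # test indices per fold
}
testFolds <- kfoldIndices(150, 10)
sapply(testFolds, length)  # ten disjoint test sets of 15 examples each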
Let us formalize this technique. In the k-fold CV, we consider the case of $k \ge 2$
and $k < m$ ($k = 1$ is the resubstitution case). The case of $k = m$ is a special case
called leave-one-out (LOO), which has achieved prominence in error estimation,
especially for small dataset sizes. We will discuss this shortly. Getting back to
$k \ge 2$, in the case in which $m$ is evenly divisible by $k$, we can easily characterize
the distribution $P_W$. For such an $m$, the number $n_k^w$ of possible sets of binary
weight vectors defining a valid partition of $m$ examples into $k$ subsets can be
obtained with the following formula:

$$n_k^w = \prod_{i=0}^{k-2} \binom{m - \frac{im}{k}}{\frac{m}{k}}.$$
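As a sanity check of this count (a toy example of our own), for m = 6 and k = 3 the formula gives C(6,2) · C(4,2) = 90 ordered partitions into three subsets of two:

m <- 6; k <- 3
# product over i = 0, ..., k-2 of choose(m - i*m/k, m/k)
prod(sapply(0:(k - 2), function(i) choose(m - i * m / k, m / k)))  # 90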
Stratified k-fold CV
Even if a careful resampling method such as k-fold CV is used, the split into a
training set and a testing set may be uneven. That is, the split may not take into
account the distribution of the examples of various classes while generating the
training and test subsets. This can result in scenarios in which examples of the
kind present in the testing set are either underrepresented in, or entirely absent
from, the training set. This can yield an even more pessimistic performance
estimate. A simple and effective solution to this problem lies in stratifying the
data. Stratification consists of taking note of the representation of each class in
the overall dataset and making sure that this representation is respected in both
the training set and the test set in the resulting partitions of data. For example,
consider a three-class problem with the dataset consisting of classes y1 , y2 and
y3 . Let us assume, for illustration’s sake, that the dataset is composed of 30%
examples of class y1 , 60% examples of class y2 , and 10% examples of class y3 .
A random split of the data into a training set and a testing set may very well
ignore the data of class y3 in either the training or the testing set. This would lead
to an unfair evaluation. Stratification does not allow such a situation to occur
as it ensures that the training and testing sets in every fold or every resampling
event maintain the relative distribution with 30% of class y1 examples, 60% of
class y2 examples, and 10% of class y3 examples.
Informally, in the case of binary data, a straightforward method of achieving
stratification is sketched in the following listing.
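A minimal R sketch of that idea, generalized to any number of classes (stratifiedFolds is a hypothetical helper, not one of the chapter's listings): fold labels are drawn within each class, so every fold inherits the overall class proportions.

stratifiedFolds <- function(labels, k) {
  folds <- integer(length(labels))
  for (cl in unique(labels)) {
    idx <- which(labels == cl)
    # assign fold labels within this class only
    folds[idx] <- sample(rep(1:k, length.out = length(idx)))
  }
  folds
}
table(stratifiedFolds(iris$Species, 5), iris$Species)  # 10 per class per fold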
The number of distinct stratified partitions can be counted analogously to the
earlier case. For a binary problem with $m_p$ positive and $m_n$ negative examples,
the number of possible sets of weight vectors defining a valid stratified partition is

$$n_k^w = \prod_{i=0}^{k-2} \binom{m_p - \frac{im_p}{k}}{\frac{m_p}{k}} \prod_{i=0}^{k-2} \binom{m_n - \frac{im_n}{k}}{\frac{m_n}{k}}.$$
Discussion
k-fold CV is a very practical approach that has a number of advantages. First, it
is very simple to apply and, in fact, is preprogrammed in software systems such
as WEKA and can be invoked at the touch of a button; it is not as computationally
intensive as LOO, discussed next, or the repeated resampling techniques that
will be discussed later; and it is not a repeated approach, thus guaranteeing that
the estimates obtained from the different folds come from nonoverlapping test
subsets of the data.
On the other hand, whereas the testing sets used in k-fold CV are independent
of each other, the classifiers built on the k − 1 folds in each iteration are not
necessarily independent because the algorithm in each case is trained on a
highly overlapping set of training examples. This can then also affect the bias
of the error estimates. However, in the case of moderate to large datasets,
this limitation is mitigated, to some extent, as a result of large-sized subsets.
Another point worth noting here is that, unlike the holdout case that reports
the error rate of a single classifier trained on the training set, the k-fold CV is
an averaged estimate over the error rates of k different classifiers (trained and
tested in each fold). In fact, this observation holds for almost all the resampling
approaches.
Discussion
As can easily be seen, the error estimate obtained at each iteration of the
LOO scheme refers to a lone testing example. As a result, this can yield
estimates with high variance, especially in the case of limited data. However,
the advantage of LOO lies in its ability to utilize almost the full dataset for
training, resulting in a relatively unbiased classifier. Naturally, in the case of
severely limited dataset size, the cost of highly varying risk estimates on test
examples trumps the benefit of being able to use almost the whole dataset for
training. This is because, with a very small dataset, using even the whole set
for training might not guarantee a relatively unbiased classifier; analogously,
averaging the risk estimates would not suffice to mitigate their high variance.
However, as the dataset size increases, LOO
can be quite advantageous except for its computational complexity. Hence LOO
can be quite effective for moderate dataset sizes.
Indeed, for large datasets, LOO may be computationally too expensive to be
worth applying, especially because a k-fold CV can also yield reliable estimates.
Independent of the sample size, there are a couple of special cases, however,
for which LOO can be particularly effective. LOO can be quite beneficial when
there is wide dispersion of the data distribution or when the dataset contains
extreme values. In such cases, the estimate produced by LOO is expected to be
better than the one produced by k-fold CV.
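A minimal sketch of LOO in R, assuming hypothetical helpers trainOn() and lossOn() that stand in for an actual learner and loss function:

looRisk <- function(data) {
  mean(sapply(1:nrow(data), function(i) {
    model <- trainOn(data[-i, ])             # train on all but example i
    lossOn(model, data[i, , drop = FALSE])   # test on the held-out example
  }))
}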
2 We are concerned here with the parameter selection to be precise and not with the optimization
criteria used under different learning settings such as the ERM and SRM algorithms discussed in
Chapter 2.
the kernel width for an SVM with a radial basis function kernel) then this should
be done independently of the test fold as well because, otherwise, for each
fold, the algorithm would have a positive bias toward obtaining the classifier
performing best on the test fold. One solution is to use a nested k-fold cross-
validation. The idea behind the nested k-fold CV is to divide the dataset into k
disjoint subsets, just as was done in the k-fold CV method previously described.
But now, in addition, we perform a separate k-fold CV within the k − 1 folds
during training in order to compare the different parameter instantiations of the
algorithm. Once the best model is identified for that training fold, testing is, as
usual, performed on the kth testing fold. The rationale behind this approach is
to make the algorithm totally unbiased in parameter selection.
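A minimal sketch of nested k-fold CV, assuming hypothetical helpers trainWith(param, data) and accuracyOf(model, data) in place of an actual learner and evaluator:

nestedCV <- function(data, k, params) {
  outer <- sample(rep(1:k, length.out = nrow(data)))
  sapply(1:k, function(i) {
    trainSet <- data[outer != i, ]
    testSet  <- data[outer == i, ]
    inner <- sample(rep(1:k, length.out = nrow(trainSet)))
    # inner CV score of each candidate parameter, using training folds only
    innerScore <- sapply(params, function(p)
      mean(sapply(1:k, function(j) {
        model <- trainWith(p, trainSet[inner != j, ])
        accuracyOf(model, trainSet[inner == j, ])
      })))
    best <- params[[which.max(innerScore)]]
    # the outer test fold is touched only once, by the selected model
    accuracyOf(trainWith(best, trainSet), testSet)
  })
}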
The simple resampling methods discussed so far may not yield the desired
estimates in some scenarios such as extremely limited sample sizes, and more
robust estimation methods are desired. Multiple resampling methods aim to do
so (of course with some inherent costs). Let us then discuss some prominent
multiple resampling methods.
where D k (z) = D k (x, y) denotes the distribution of the partition Sk and the
expectation with respect to w denotes the expectation over all the fixed parti-
tioning defined by W , with each w defining a particular partition.
bootstrap, the two most common bootstrap techniques. Let us first summarize
the simpler ε0 bootstrap (also referred to as the e0 bootstrap) estimate.
Listing 5.4: The ε0 bootstrap.
- Given a dataset S with m examples
- Initialize ε0 = 0
- Initialize i = 1
- Repeat while i ≤ k (typically k ≥ 200):
    - Draw, with replacement, m samples from S to obtain a training set S_boot^i
    - Define T_boot^i = S \ S_boot^i; i.e., the test set contains the examples from S not included in S_boot^i
    - Train the algorithm on S_boot^i to obtain a classifier f_boot^i
    - Test f_boot^i on T_boot^i to obtain the empirical risk estimate ε0_i
    - Set ε0 = ε0 + ε0_i
    - Increment i by 1
- Calculate ε0 = ε0 / k
- Report ε0
Bootstrapping can be quite useful, in practice, in the cases in which the sample
is too small for CV or LOO approaches to yield a good estimate. In such cases,
a bootstrap estimate can be more reliable.
Let us formalize the ε0 bootstrap to better understand the behavior and the
associated intuition of not only the ε0 estimate but also of the .632 bootstrap
technique that it leads to. Going back to our resampling framework, let w, in the
case of a bootstrap resampling, be such that
$$w_{\mathrm{boot}} \in \mathbb{N}^m,$$

such that

$$\forall i,\ 0 \le w_i \le m, \qquad \|w\|_1 = \sum_{i=1}^{m} w_i = m.$$
Consider a uniform distribution on this set of basis vectors; that is, the probability
of sampling each $w^i$ is equal. Then we sample from this distribution $m$ times,
each sampling resulting in a weight vector $w^i$, $i = 1, 2, \ldots, m$. We can
then obtain

$$w_{\mathrm{boot}} = \sum_{i=1}^{m} w^i.$$
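This weight-vector view is easy to verify in R (a small sketch of ours, not one of the chapter's listings): drawing m indices with replacement and tabulating them yields exactly such a w.

m <- 10
idx <- sample(m, m, replace = TRUE)  # one bootstrap draw
w <- tabulate(idx, nbins = m)        # w_i = times example i was drawn
sum(w)                               # equals m, i.e., ||w||_1 = m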
Listing 5.5: The .632 bootstrap.
- Initialize the .632 risk estimate e632 = 0
- Initialize i = 1
- Repeat while i ≤ k (typically k ≥ 200):
    - Draw, with replacement, m samples from S to obtain a training set S_boot^i
    - Define T_boot^i = S \ S_boot^i; i.e., the test set contains the examples from S not included in S_boot^i
    - Train the algorithm on S_boot^i to obtain a classifier f_boot^i
    - Test f_boot^i on T_boot^i to obtain the empirical risk estimate ε0_i
    - Set e632 = e632 + 0.632 · ε0_i
    - Increment i by 1
- Calculate e632 = e632 / k
- Approximate the remaining proportion of the risk using err(f) to give e632 = e632 + 0.368 × err(f)
- Return e632
More formally, let the number of bootstrap samples generated be k. For each
sample, we obtain a bootstrap sample $S_{boot}^i$ and a corresponding bootstrap test
set $T_{boot}^i$, $i \in \{1, 2, \ldots, k\}$. Moreover, a classifier $f_{boot}^i$ is obtained on each
$S_{boot}^i$ and tested on $T_{boot}^i$ to yield a corresponding estimate $\epsilon0_i$. These estimates
can together be used to obtain what is called the .632 bootstrap estimate, defined as

$$e632 = 0.632 \times \frac{1}{k}\sum_{i=1}^{k} \epsilon0_i + 0.368 \times \mathrm{err}(f) = 0.632 \times \epsilon0 + 0.368 \times \mathrm{err}(f),$$

where $\mathrm{err}(f)$ is the resubstitution error rate $R_S^{\mathrm{resub}}(f_S)$, with $f$ being the classifier
$f_S$ obtained by training the algorithm on the whole training set $S$.
Discussion
In empirical studies, the relationship between the bootstrap and the CV-based
estimates has received special attention. Bootstrapping can be a method of
choice when more conventional resampling such as k-fold CV cannot be applied
owing to small dataset sizes. Moreover, in such cases the bootstrap also results
in estimates with low variance as a result of the (artificially) increased dataset size.
Further, the ε0 bootstrap has been empirically shown to be a good error estimator
in cases of a very high true error rate, whereas the .632 bootstrap estimator has
been shown to be a good error estimator on small datasets, especially if the true
error rate is small (i.e., when the algorithm is extremely accurate).
An interesting, perhaps at first surprising, result that emanates from various
empirical studies is that the relative appropriateness of one sampling scheme
over the other is classifier dependent. Indeed, it was found that bootstrapping is
a poor error estimator for classifiers such as the nn or foil (Bailey and Elkan,
1993) that do not benefit from (or simply make no use of) duplicate instances. In
light of the fact that bootstrapping resamples with replacement, this result is not
as surprising as it first appeared to be.
5.6.3 Randomization
The term randomization has been used with regard to multiple resampling meth-
ods in two contexts. The first is what is referred to as randomization over sam-
ples, that is, estimating the effect of different reorderings of the data on the
algorithm’s performance estimate. We refer to such randomization on training
samples as permutation sampling or permutation testing. The second context
in which randomization is used refers to the randomization over labels of the
training examples. The purpose of this testing is to assess the dependence of
the learning algorithm on the actual label assignment as opposed to obtaining
the same or similar classifiers on chance label assignments. Like bootstrapping,
randomization makes the assumption that the sample is representative of the
original distribution. However, instead of drawing samples with replacement, as
bootstrapping does, randomization reorders (shuffles) the data systematically or
randomly a number of times. It calculates the quantity of interest on each reorder-
ing. Because shuffling the data amounts to sampling without replacement, this
marks one key difference between bootstrapping and randomization.
In permutation testing, we basically look at the number of possible reorderings
of the training set S to assess their effect on classifier performance. As we can
easily see, there are a total of m! possible reorderings of the entries of the vector
w and hence those of the examples z in the training set S. We would then consider
a distribution $P_W$ over these $m!$ orderings of the unit vectors $w$, assigning an
equal probability $(= \frac{1}{m!})$ to each of the weight vectors.
Permutation testing can provide a sense of robustness of the algorithm to the
ordering of the data samples and hence a sense of the stability of the performance
estimate thus obtained. However, when it comes to comparing such estimates for
two or more robust algorithms, permutation testing might not be very effective
because the stability of estimates over different permutations is not the prominent
differentiator among such algorithms.
Note that this can be applied not only to validate a given classifier’s perfor-
mance against a random set of labelings, but also to characterize the difference
of two classifiers in a comparative setting. In this regard, this resampling is gen-
erally used as a sanity check test to compare the classifiers’ performance over
random assignments of labels to the examples in S. Hence, keeping the ordering
of the examples in the training set constant, we can randomize the labels while
maintaining the label distribution. Alternatively, we can randomize the examples
while keeping the label assignment constant, as shown in Listing 5.6.
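A minimal R sketch of this label-randomization step (randomizeLabels is a hypothetical helper of ours; it assumes a data frame with a class column): the examples and the overall label distribution stay fixed while the assignment is shuffled.

randomizeLabels <- function(d) {
  d$class <- sample(d$class)  # permute labels; the marginal distribution is preserved
  d
}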
5.7 Discussion
Choosing the best resampling method for a given task should be done carefully
if an objective performance estimate is to be obtained. As we saw earlier,
the bias–variance behavior of the associated loss is largely dependent on such
choices and, in fact, also helps us guide this selection. For instance, it can be
seen that increasing the number of folds in a k-fold CV approach would result
in estimates that are less biased because a larger subset of the data is used
for training. However, doing so would result in decreasing the size of the test
partitions, thereby resulting in an increase in the variance of the estimates.
In addition to their effects on the bias–variance behavior of the resulting
error estimate, selecting a resampling method also relies on some other factors.
One such factor is the nature of the classifiers to be evaluated. For instance,
more-stable classifiers would not need a permutation test. The less robust a
classifier is, the more training data it would need to reach a stable behavior.
In fact, it would also take multiple runs to approximate its average behavior.
There are some dataset-dependent factors affecting the choice of resampling
methods too. These include the size of the dataset as well as its spread (that is,
the representative capability of the data) and the complexity of the domain that
we wish to learn. A highly complex domain, for instance, in the presence of a
limited dataset size, would invariably lead to a biased classifier. Hence the aim
would be to use a resampling scheme that allows the algorithm to use as much data for
learning as possible. A multiple resampling scheme can also be useful in such
scenarios.
Precisely quantifying various parameters involved in the choice of a resam-
pling method is extremely difficult. Let us take an indirect approach in discussing
some important observations by looking at them in a relative sense. One should
read the following discussion while bearing in mind that the terms unbiased or
almost unbiased are strictly relative with respect to the optimal classifier that
can be obtained given the training data.
Other concerns that should be taken into account while opting for a resam-
pling method include the computational complexity involved in employing the
resampling method of choice and the resulting gain in terms of more objective
and representative estimates. For instance, increasing the folds in a k-fold CV
all the way to LOO resampling would mean increasing the number of runs over
the learning algorithm in each fold. Further, if model selection is involved, this
would require nested runs to optimize the learning parameters, thereby further
increasing the computational complexity. Bootstrapping, on the other hand, can
also be quite expensive computationally. See the Bibliographic Remarks at the
end of the chapter for observations from a specific study reported in (Weiss and
Kulikowski, 1991).
As we noted earlier, the relative appropriateness of one sampling scheme
over the other has also been found to be classifier dependent empirically. Kohavi
(1995) further extends this observation to show that, not only are the resampling
techniques sensitive to the classifiers under scrutiny, but they are also sensitive
to the domains on which they are applied. In light of these observations, it would
only be appropriate to end this discussion with the take-home message from the
above observations echoed by Reich and Barai (1999, p. 11):
compared (two) was also hard-coded, as was the choice of the metric returned
(accuracy). Modifying these choices is not very difficult, as we will show when
we do this for our experiments later in the book (see Appendix C).
Also note that we do not discuss the multiple trials of simple resampling
methods mentioned in Subsection 5.6.4 here because these were mainly pro-
posed with regard to comparing two classifiers and hence will be discussed in
association with the appropriate statistical significance tests. Consequently we
have relegated the description of the full versions of these multiple resampling
trials to the next chapter, which is dedicated to statistical significance testing.
Although the resampling methods discussed in this chapter were designed in
a more general framework than the multiple trials of simple resampling methods
just mentioned, to make our illustrations interesting yet simple to follow, in all
but one case, we applied them to the comparison of two classifiers on a single
domain. More specifically, the case study we present here aims at comparing
the performances of naive Bayes (nb) and the c4.5 decision tree learner
(c45) on the UCI Iris dataset. Even though we have not yet discussed it in depth,
the resampling procedure is followed by a t test with an aim to illustrate the
hypothesis testing principle discussed earlier in this context.
A different case study is used to illustrate randomization, however, to better
explore the technique, given its less-frequent use in machine learning experi-
ments. Let us start with cross-validation.
5.8.1 Cross-Validation
We first present the code for nonstratified cross-validation along with the paired
t test for significance testing, followed by the stratified version of cross-
validation in a similar manner. For this illustration, we do not use the cross-
validation procedure provided by RWeka, however, because of the insufficient
information it outputs. Indeed, this information would not enable us to perform
either a significance test or an analysis over individual folds.
Nonstratified Cross-Validation
Listing 5.7 illustrates the nonstratified cross-validation procedure with the aim
of comparing nb and c45 on the Iris dataset. The parameters and variables used
are as follows: k denotes the number of folds desired, dataSet denotes the dataset
on which the cross-validation study is performed, setSize refers to the number
of samples in this dataset, and dimension denotes the number of attributes,
including the class. Because the code is geared at comparing the performance
of two classifiers, the parameters classifier1 and classifier2 indicating the two
classifiers are included. For simplicity, we chose to assess the performances
in terms of accuracy. The function thus reads the WEKA output and retrieves
the accuracy figures in the form of two vectors, with entries in each vector
A paired t test is typically used, as illustrated in Listing 5.8 for the present case (available
as a function implementation in R). We discuss in detail the issues as well as
appropriateness of using this approach in the next chapter.
Listing 5.8: Sample R code for executing nonstratified k-fold cross-validation
followed by a paired t test.
nonstratcvTtest = function(k, dataSet, setSize, dimension,
                           classifier1, classifier2) {
  allResults <- nonstratcv(k, dataSet, setSize, dimension,
                           classifier1, classifier2)
  print("mean accuracy of classifier1:")
  print(mean(allResults[[1]]))
  print("mean accuracy of classifier2:")
  print(mean(allResults[[2]]))
  t.test(allResults[[1]], allResults[[2]], paired=TRUE)
}
Now that the functions to perform the cross-validation and the associated
t test are defined, let us see how to invoke these and look at their output
(Listing 5.9).
Listing 5.9: Invocation and results of the nscv/t-test code.
> library(RWeka)
Loading required package: grid
> NB <- make_Weka_classifier("weka/classifiers/bayes/NaiveBayes")
> iris <- read.arff(system.file("arff", "iris.arff", package = "RWeka"))
> nonstratcvTtest(10, iris, 150, 5, NB, J48)
[1] "mean accuracy of classifier1:"
[1] 95.998
[1] "mean accuracy of classifier2:"
[1] 94.665

        Paired t-test

data:  allResults[[1]] and allResults[[2]]
t = 0.9996, df = 9, p-value = 0.3436
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.683713  4.349713
sample estimates:
mean of the differences
                  1.333
Stratified Cross-Validation
The code for stratified cross-validation is a little bit more complex, as data need
to be proportionately sampled from the different subdistributions of the data. As
a result, we also need to add parameters indicating the number of classes in the
dataset and their respective sizes. This is done with parameters numClasses and
classSize. numClasses indicates the number of classes in the dataset. classSize is
a vector of size numClasses that lists the size of each class. Listing 5.10 gives the
code for stratified k-fold cross-validation.
# Function header reconstructed from the surrounding text;
# parameter order is assumed
stratcv = function(numFolds, dataSet, setSize, dimension,
                   classifier1, classifier2, numClasses, classSize) {
# We assume that the instances are sorted by classes and we
# create subsets of the entire dataset containing homogeneous
# classes
DatasetList <- list()
currentPosition <- 1
for (i in 1:numClasses) {
  SubSamp <- currentPosition:(currentPosition + classSize[i] - 1)
  oneDataset <- dataSet[SubSamp, 1:dimension]
  DatasetList <- c(DatasetList, list(oneDataset))
  currentPosition <- currentPosition + classSize[i]
}
# We create a different shuffling and BaseSamps of the
# instances for each class
shuffledInstanceList <- list()
baseSampList <- list()
testFoldSize <- numeric(numClasses)
for (c in 1:numClasses) {
  testFoldSize[c] <- classSize[c] / numFolds
  shuffledInstance <- sample(classSize[c], classSize[c], replace=FALSE)
  shuffledInstanceList <- c(shuffledInstanceList,
                            list(shuffledInstance))
  baseSampList <- c(baseSampList, list(1:classSize[c]))
}
# This function builds the training and testing pairs for a
# given fold and a given class. The sets built for different
# classes within each fold will then be bound together to form
# single training and testing sets for this fold.
getTestTrainSubset = function(i, c) {
  # ... (body elided in this excerpt; it fills TrainList and
  # TestList with the per-class training and testing subsets
  # used below)
}
# Perform stratified k-fold cross-validation
classifier1ResultArray <- numeric(numFolds)
classifier2ResultArray <- numeric(numFolds)
for (i in 1:numFolds) {
  oneTrain <- rbind(TrainList[[i]][[1]], TrainList[[i]][[2]],
                    TrainList[[i]][[3]])
  oneTest <- rbind(TestList[[i]][[1]], TestList[[i]][[2]],
                   TestList[[i]][[3]])
  classifier1Model <- classifier1(class ~ ., data = oneTrain)
  classifier2Model <- classifier2(class ~ ., data = oneTrain)
  classifier1Evaluation <- evaluate_Weka_classifier(classifier1Model,
                                                    newdata = oneTest)
  classifier1Accuracy <- as.numeric(substr(
    classifier1Evaluation$string, 70, 80))
  classifier2Evaluation <- evaluate_Weka_classifier(classifier2Model,
                                                    newdata = oneTest)
  classifier2Accuracy <- as.numeric(substr(
    classifier2Evaluation$string, 70, 80))
  classifier1ResultArray[i] <- classifier1Accuracy
  classifier2ResultArray[i] <- classifier2Accuracy
}
return(list(classifier1ResultArray, classifier2ResultArray))
}
Listing 5.11 next provides the function definition for the associated t test.
Note that this is similar to the one provided earlier in Listing 5.8 except that the
resampling method invoked is stratcv() instead of nonstratcv().
Listing 5.12 invokes the two methods to output the results. Please note that
stratified cross-validation is not necessary on the Iris dataset, given that the
classes have the same size. The procedure, however, was also tested on classes
of different sizes.
        Paired t-test

data:  allResults[[1]] and allResults[[2]]
t = 1.1523, df = 9, p-value = 0.2789
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.927188  5.929188
sample estimates:
mean of the differences
                  2.001
>
        Paired t-test

data:  allResults[[1]] and allResults[[2]]
t = 0, df = 149, p-value = 1
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.237626  3.237626
sample estimates:
mean of the differences
                      0
>
Since the accuracies here are identical, the statistical significance of performance
difference cannot be established.
by classifier1 and classifier2 at each iteration. This quantity is then used in the
computation of the subsequent t test.
Listing 5.14: Sample R code for executing random subsampling.
randomSubsamp = function(iter, dataSet, setSize, dimension,
                         classifier1, classifier2) {
  proportions <- numeric(iter)
  for (i in 1:iter) {
    Subsamp <- sample(setSize, (2 * setSize) / 3, replace=FALSE)
    Basesamp <- 1:setSize
    oneTrain <- dataSet[Subsamp, 1:dimension]
    oneTest <- dataSet[setdiff(Basesamp, Subsamp), 1:dimension]
    classifier1Model <- classifier1(class ~ ., data = oneTrain)
    classifier2Model <- classifier2(class ~ ., data = oneTrain)
    classifier1Evaluation <- evaluate_Weka_classifier(classifier1Model,
                                                      newdata = oneTest)
    classifier1Accuracy <- as.numeric(substr(
      classifier1Evaluation$string, 70, 80))
    classifier2Evaluation <- evaluate_Weka_classifier(classifier2Model,
                                                      newdata = oneTest)
    classifier2Accuracy <- as.numeric(substr(
      classifier2Evaluation$string, 70, 80))
    pclassifier1 <- (100 - classifier1Accuracy) / (setSize / 3)
    pclassifier2 <- (100 - classifier2Accuracy) / (setSize / 3)
    proportions[i] = pclassifier2 - pclassifier1
  }
  return(proportions)
}
# Function header and t-statistic computation reconstructed from the
# invocation in Listing 5.16 and the standard resampled t-test formula
simpleResampttest = function(iter, dataSet, setSize, dimension,
                             classifier1, classifier2) {
  proportions <- randomSubsamp(iter, dataSet, setSize, dimension,
                               classifier1, classifier2)
  averageProportion <- mean(proportions)
  sum = 0
  for (i in 1:iter) {
    sum = sum + (proportions[i] - averageProportion)^2
  }
  # t statistic: the mean difference divided by its estimated standard error
  t <- averageProportion / sqrt(sum / (iter * (iter - 1)))
  print('The t-value for the simple resampled t-test is')
  print(t)
}
Listing 5.16: Invocation and results of the simple resampling t-test code.
> library(RWeka)
Loading required package: grid
>
> NB <- make_Weka_classifier("weka/classifiers/bayes/NaiveBayes")
> iris <- read.arff(system.file("arff", "iris.arff", package = "RWeka"))
> simpleResampttest(30, iris, 150, 5, NB, J48)
[1] "The t-value for the simple resampled t-test is"
[1] 1.409362
5.8.4 Bootstrapping
The ε0 Bootstrap
We demonstrate the ε0 bootstrap by using c45 and nb on the Iris data, with
respect to accuracy. RWeka can be tuned in many different ways, but this was
not the purpose of this illustration, which is why we chose a very simple example.
(For instance, we did not use the labor data because missing values in the dataset
caused problems for the R interface with RWeka; similarly, we used accuracy
rather than AUC, because the Iris domain is a three-class problem. Although
these various issues could have been dealt with, we feel that they fall beyond
the purpose of this subsection.) Listings 5.17 and 5.18 present, respectively, the
method and the code that invokes it for the ε0 bootstrap.
Listing 5.17: Sample R code for executing the ε0 bootstrap on two different
classifiers.
e0Boot = function(iter, dataSet, setSize, dimension,
                  classifier1, classifier2) {
  classifier1e0Boot <- numeric(iter)
  classifier2e0Boot <- numeric(iter)
  for (i in 1:iter) {
    Subsamp <- sample(setSize, setSize, replace=TRUE)
    Basesamp <- 1:setSize
    oneTrain <- dataSet[Subsamp, 1:dimension]
    oneTest <- dataSet[setdiff(Basesamp, Subsamp), 1:dimension]
    classifier1model <- classifier1(class ~ ., data = oneTrain)
    classifier2model <- classifier2(class ~ ., data = oneTrain)
    classifier1eval <- evaluate_Weka_classifier(classifier1model,
                                                newdata = oneTest)
    classifier1acc <- as.numeric(substr(
      classifier1eval$string, 70, 80))
    classifier2eval <- evaluate_Weka_classifier(classifier2model,
                                                newdata = oneTest)
    classifier2acc <- as.numeric(substr(
      classifier2eval$string, 70, 80))
    classifier1e0Boot[i] = classifier1acc
    classifier2e0Boot[i] = classifier2acc
  }
  return(rbind(classifier1e0Boot, classifier2e0Boot))
}
The code just listed is invoked as follows, with 200 iterations. The result shown
is then analyzed.
Listing 5.18: Invocation and results of the ε0 bootstrap followed by a t test.
> library(RWeka)
> NB <- make_Weka_classifier("weka/classifiers/bayes/NaiveBayes")
> iris <- read.arff(system.file("arff", "iris.arff", package = "RWeka"))
> setSize <- 150
> dimension <- 5
> iterations <- 200
>
> e0Bootstraps <- e0Boot(iterations, iris, setSize, dimension, NB, J48)
>
> e0NB <- mean(e0Bootstraps[1, ])
> e0J48 <- mean(e0Bootstraps[2, ])
>
> e0NB
[1] 95.0564
> e0J48
[1] 93.8675
>
> t.test(e0Bootstraps[1, ], e0Bootstraps[2, ], paired=TRUE)

        Paired t-test

data:  e0Bootstraps[1, ] and e0Bootstraps[2, ]
t = 5.3329, df = 199, p-value = 2.613e-07
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.7492774 1.6285226
sample estimates:
mean of the differences
                 1.1889
>
Unlike with the previous resampling methods, the t test applied to the results of the ε0
bootstrap rejects the hypothesis stipulating that nb and c45 perform equivalently.
nb is thus the preferred classifier on this dataset.
Listing 5.19: Sample R code for executing the .632 bootstrap on two different
classifiers.
# Function header and return statement reconstructed from the
# invocation in Listing 5.20
e632Boot = function(iter, dataSet, setSize, dimension,
                    classifier1, classifier2) {
  # Resubstitution ("apparent") accuracy of each classifier on the
  # full dataset, weighted by .368
  classifier1appModel <- classifier1(class ~ ., data = dataSet)
  classifier1appEvaluation <- evaluate_Weka_classifier(classifier1appModel)
  classifier1appAccuracy <- as.numeric(substr(
    classifier1appEvaluation$string, 70, 80))
  classifier1FirstTerm = .368 * classifier1appAccuracy
  classifier2appModel <- classifier2(class ~ ., data = dataSet)
  classifier2appEvaluation <- evaluate_Weka_classifier(classifier2appModel)
  classifier2appAccuracy <- as.numeric(substr(
    classifier2appEvaluation$string, 70, 80))
  classifier2FirstTerm = .368 * classifier2appAccuracy
  # ε0 bootstrap accuracies, weighted by .632
  e0Terms = e0Boot(iter, dataSet, setSize, dimension,
                   classifier1, classifier2)
  classifier1e632Boot <- classifier1FirstTerm + .632 * e0Terms[1, ]
  classifier2e632Boot <- classifier2FirstTerm + .632 * e0Terms[2, ]
  return(rbind(classifier1e632Boot, classifier2e632Boot))
}
Listing 5.20: Invocation and results of the .632 bootstrap followed by a t test.
> library(RWeka)
> NB <- make_Weka_classifier("weka/classifiers/bayes/NaiveBayes")
> iris <- read.arff(system.file("arff", "iris.arff", package = "RWeka"))
>
> setSize <- 150
> dimension <- 5
> iterations <- 200
>
> e632Bootstraps <- e632Boot(iterations, iris, setSize, dimension, NB, J48)
>
> e632NB <- mean(e632Bootstraps[1, ])
> e632J48 <- mean(e632Bootstraps[2, ])
>
> e632NB
[1] 95.5517
> e632J48
[1] 95.39294
>
> t.test(e632Bootstraps[1, ], e632Bootstraps[2, ], paired=TRUE)

        Paired t-test

data:  e632Bootstraps[1, ] and e632Bootstraps[2, ]
t = 1.1533, df = 199, p-value = 0.2502
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.1126972  0.4302052
sample estimates:
mean of the differences
               0.158754
>
The t-test outcome does not reject the null hypothesis stipulating that the two
classifiers obtain equivalent results.
2.4). Using the t test, we found that nb was significantly more accurate than
c4.5. We now repeat this experiment using the permutation test.
We run the permutation tests over accuracy of two classifiers. In particular,
we show the code (Listing 5.21) for a comparison between c4.5 (classifier1) and
nb (classifier2) on the results of 10 runs of 10-fold cross-validation. The null
hypothesis is that the mean accuracies of the observations of the two classifiers
are equal.
Listing 5.21: Sample R code for the permutation test.
# Function header reconstructed from the invocation in Listing 5.23
permtest = function(iter, classifier1Results, classifier2Results) {
  classifier1mean = mean(classifier1Results)
  classifier2mean = mean(classifier2Results)
  mobt = abs(classifier1mean - classifier2mean)
  alldata = c(classifier1Results, classifier2Results)
  count = 0
  for (i in 1:iter) {
    # Shuffle the results from the two classifiers
    oneperm = sample(alldata)
    # Assign the first half of this data to the first 'bogus'
    # classifier and the second half, to the second.
    pretendclassifier1 = oneperm[1:(length(alldata) / 2)]
    pretendclassifier2 = oneperm[((length(alldata) / 2) + 1):
                                 (length(alldata))]
    # Compute the absolute value of the difference between the
    # mean accuracy of the two 'bogus' classifiers
    mstar = abs(mean(pretendclassifier1) -
                mean(pretendclassifier2))
    # Increase the counter if that difference is greater than
    # the real observed difference.
    if (mstar > mobt) {
      count = count + 1
    }
  }
  probabilityofmobt = count / iter
  return(rbind(mobt, probabilityofmobt))
}
The code just listed is invoked as follows, with 5000 iterations. The result shown
is then analyzed.
The observed difference between the two means was found to be 0.15262.
The permutation test tells us that the probability of obtaining such a difference
under the null hypothesis (i.e., if the two means are indeed equal) is 0; this
allows us to strongly reject the hypothesis that the two classifiers perform
similarly on the labor data.
We repeated the same experiment in a different context: that of comparing
nb and svm on several datasets (see Subsection 6.8.2 in the next chapter). The
data for this comparison are the following with nb standing for classifier1 and
svm standing for classifier2. The running and results of this test are shown in
Listing 5.23.
Listing 5.23: Data preparation for the permutation test.
>
> classifier1 = c(96.43, 73.42, 72.3, 71.7, 71.67, 74.36,
+                 70.63, 83.21, 98.22, 69.62)
> classifier2 = c(99.44, 81.34, 91.51, 66.16, 71.67, 77.08,
+                 62.21, 80.63, 93.58, 99.9)
> iterations = 5000;
>
> permtestresults <- permtest(iterations, classifier1, classifier2);
>
> print('The difference in means obtained is')
[1] "The difference in means obtained is"
> print(permtestresults[1])
[1] 4.196
> print('The probability of obtaining that mean is')
[1] "The probability of obtaining that mean is"
> print(permtestresults[2])
[1] 0.4552
>
In this case, the observed difference in means is 4.196, but this time the
probability of obtaining such a difference under the null hypothesis stipulating
that the two means are equal is found to be 0.4552, i.e., 45.52%. This suggests
that we cannot reject the null hypothesis, because there is almost a one-in-two
chance of obtaining the observed difference under it (note that the calculations
here are done over percentages).
5.9 Summary
In this chapter, we focused on the issue of error estimation and studied various
error-estimation techniques along with the relationship of their bias–variance
behavior to factors such as dataset size and classifier complexity. One of the main
limitations of robust holdout-based estimation has long been the unavailability
of a large enough amount of data, leading to inaccurate estimates. Resampling
methods alleviate this problem by giving alternatives to utilize the limited avail-
able data to obtain relatively robust estimates when the holdout approach cannot
be employed. We discussed both simple and multiple resampling schemes.
Although the former, including techniques such as k-fold cross-validation and
leave-one-out among others, tend to avoid reusing the examples for testing (i.e.,
test each example only once), the latter, including methods such as bootstrap
and randomization, do not limit themselves in this manner. Of course, there
are both advantages and limitations to these two strategies. They were discussed
both directly and in a relative sense in various places. Our discussion highlighted
a common line of understanding with regard to these techniques in that they are
dependent not only on the classifier under consideration, but also on the domain
of application, in addition, of course, to the basic empirical data characteristics
such as dimensionality and size. Once a robust estimate of error (or of any
other performance measure) is obtained, statistical significance tests enable us
to verify if the observed difference is indeed statistically significant (e.g., as in
the t-test examples previously shown). This battery of tests is our focus in the
next chapter.
from Monte Carlo simulations, but differ from them in that they are based on
some real data. Monte Carlo simulations, on the other hand, could be based on
completely hypothetical data. Although the use of resampling methods in the
machine learning context is relatively recent (owing especially to increased com-
putational capability), some of them have been known for a while. For example,
permutation tests were developed by Fisher and described in his book published
and republished between 1935 and 1960 (Fisher, 1960). Cross-validation was
first proposed by Kurtz (1948) and the jackknife (or leave-one-out) was first
invented by Quenouille (1949).
The discussion of the goals of error estimation with regard to taking into
account different variations discussed in Section 5.1 is based on (Dietterich,
1998). With regard to multiple resampling, the study of Dietterich (1998) also
maps the problem of accounting for the preceding variations to statistical signif-
icance testing, especially the type I error and the power of the tests employed.
We focus on these in the next chapter.
Weiss and Kulikowski (1991) and Witten and Frank (2005b) discuss the com-
putational complexity of the k-fold CV and other resampling methods and also
suggest stratification in the event of class imbalance. As discussed by Weiss
and Kulikowski (1991), the leave-one-out estimate is not recommended for very
small samples (fewer than 50 cases). On the other hand, it is recommended for
sample sizes between 50 and 100 cases, as it may yield more reliable estimates
than 10-fold cross-validation in such cases. Above that, the leave-one-out esti-
mate may be computationally too expensive to be worth applying as it does not
provide particular advantages over cross-validation. Irrespective of the sample
size, there are a couple of special cases, however, for which the leave-one-out
estimate is particularly useful, and that is when there is wide dispersion of the
data distribution or when the dataset contains extreme scores (Yu, 2003). In such
cases, the estimate produced by leave-one-out is expected to be better than the
one produced by k-fold cross-validation. However, the k-fold cross-validation
estimate, which uses the same principle as leave-one-out cross-validation, is
easier to apply and has a lesser computational cost.
The discussion on the relationship between bootstrapping and cross-validation
is largely based on the empirical studies of Weiss and Kulikowski (1991), Efron
(1983), Bailey and Elkan (1993), Kohavi (1995), Reich and Barai (1999), and
Jain et al. (1987). Margineantu and Dietterich (2000) also discuss bootstrapping
in cost-sensitive settings. These studies were all done in the context of the
accuracy measure. Currently an open question, it would be interesting to see the
results in the context of, and their dependence on, other performance measures.
There are many variants of resampling methods. See (Weiss and Kapouleas,
1989, Mitchell, 1997), and (Kibler and Langley, 1988) for resampling and other
evaluation methods.
Replicability of the results was emphasized as a necessary characteristic
for an error estimator by Bouckaert (2003, 2004), who noticed that random
subsampling and k-fold cross-validation both suffer from low replicability. This
was also noticed earlier by Dietterich (1998), who suggested replacing 10-
fold cross-validation with 5 × 2-fold cross-validation to improve the stability
of the test. Bouckaert (2003) suggested that k-fold cross-validation repeated
multiple times could be even more effective than 5 × 2-fold cross-validation.
In particular, he investigated the use of 10 × 10-fold cross-validation. An open
question would be to investigate if this approach can be generalized to a k × p-
fold cross-validation approach. With regard to statistical testing too, it would be
interesting to see if the problems posed by a single k-fold cross-validation-based
testing can be alleviated by averaging over numerous trials.
The randomized method is seldom used in the machine learning community
(see, e.g., Jensen and Cohen, 2000). The description that was made in the text
was based on that of Yu (2003) and Howell (2007).
Finally, the RWeka package used to illustrate the implementation of resam-
pling techniques is available at http://cran.r-project.org/web/packages/RWeka/
index.html.
where $m = |T|$ is the number of examples in the testing set. Note that the test set
$T = \{z_1, \ldots, z_m\}$ of $m$ samples is formed from the instantiation of the variables
$Z^m \stackrel{\mathrm{def}}{=} Z_1, \ldots, Z_m$. Every $Z_i$ is distributed according to some distribution $D$
that generates the sample $S$. Each $z_i$ consists of an example $x_i$ and its label $y_i$.
Each example $x_i$ can hence be considered an instantiation of a variable $X_i$ and
its label $y_i$ an instantiation of a variable $Y_i$.
Hence, over all the test sets generated from the instantiations of variables
$Z^m$, the risk of some classifier $f$ can be represented as

$$R(Z^m, f) \stackrel{\mathrm{def}}{=} \frac{1}{m}\sum_{i=1}^{m} L(Y_i, f(X_i)),$$
where $L(\cdot)$ is again the loss function over the misclassification. Now consider
the loss function $L = L_z$ such that $L_z$ is a Bernoulli variable; then the true risk
can be expressed as

$$R(f) \stackrel{\mathrm{def}}{=} \Pr\big(L_z(Y_i, f(X_i)) = 1\big) = p.$$
To bound the true risk $R(f)$, we make use of Hoeffding's inequality,
stated in the following theorem.
Theorem 5.1 (Hoeffding: Bernoulli case). For any sequence $Y_1, Y_2, \ldots, Y_m$ of
variables obeying a Bernoulli distribution with $\Pr(Y_i = 1) = p\ \forall i$, we have

$$\Pr\left(\left|\frac{1}{m}\sum_{i=1}^{m} Y_i - p\right| > \epsilon\right) \le 2\exp(-2m\epsilon^2).$$
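Inverting this bound recovers the holdout sample-size requirement mentioned earlier in the chapter: setting 2 exp(−2mε²) ≤ δ gives m ≥ ln(2/δ)/(2ε²). A quick R check with illustrative values of our own:

eps <- 0.05; delta <- 0.05
ceiling(log(2 / delta) / (2 * eps^2))  # 738 test examples suffice here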
Example 6.1. Consider the results of running c4.5 (c45) and naive Bayes (nb)
algorithms on the breast cancer dataset and using the root-mean-square error
(RMSE) as our performance evaluation metric. Table 3.4 of Chapter 3 presented
these results. We saw that c45 obtained an RMSE of 0.4324 and nb obtained an
RMSE of 0.4534. Without statistical analysis, we would simply conclude that
c45 performs better than nb on the breast cancer domain. This, however, is not
necessarily the case because this result could have been obtained by chance.
What do we mean, though, when we say that the results may have been
obtained by chance? Well, what we are really interested in is whether c45 is
consistently better than nb on this domain. If it is not and if this lack of difference
between the two algorithms can be detected by a null-hypothesis statistical test
that makes small enough type I errors in this kind of situation, then we will know
about the failure of c45 to surpass nb because the statistical test will inform us
that we cannot reject the null hypothesis.
Informally stated, statistical tests work by observing the consistency of the
difference in classifier performance, implicitly or explicitly. Such consistency
estimates can either be obtained over multiple test cases (generally, test sets) or
by performing multiple trials over a given test set. Such consistency would then
indicate that the performance difference between two or more classifiers was
not merely a chance result. Consider the following hypothetical example.
Example 6.2. Assume that classifiers fA and fB were tested on a test set of size
5, using RMSE, and that classifiers fA and fB obtained the following squared
error results, on each testing instance, respectively. Classifier fA : 0.012, 0.015,
0.02, 0.26, 0.009, and classifier fB : 0.061, 0.054, 0.055, 0.062, 0.050. The
RMSE for classifier fA is thus √0.0632 ≈ 0.2514, and the RMSE for classifier fB
is thus √0.0564 ≈ 0.2375. A simple look at the RMSEs of the two classifiers
would thus suggest that classifier fB, the one with the lowest RMSE, exhibits
better performance
than classifier fA . Would you agree with this conclusion? Probably not. Classifier
fA usually performs significantly better than classifier fB , because it typically
obtains squared errors in the [0.009, 0.02] interval whereas classifier fB obtains
squared errors in the [0.050, 0.062] interval. There was only one instance for
which classifier fA performed miserably and obtained a squared error of 0.26.
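These figures can be reproduced in a few lines of R from the squared errors just listed:

sqErrA <- c(0.012, 0.015, 0.02, 0.26, 0.009)
sqErrB <- c(0.061, 0.054, 0.055, 0.062, 0.050)
sqrt(mean(sqErrA))  # 0.2514, the RMSE of classifier fA
sqrt(mean(sqErrB))  # 0.2375, the RMSE of classifier fB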
Given that the classifiers of the preceding example were tested on so few
points, can we really conclude what the RMSE results suggest? That is, are the
preceding test points enough to make any conjectures about the consistency of
the classifier performances? It would perhaps be warranted if classifier fA were
shown on a large testing set to obtain such bad results relatively often while
classifier fB were shown to remain more stable. However, it is also possible that
on a large sample, classifier fB would also have obtained bad results once in
a while. Perhaps it is only by chance that our size 5 sample did not contain a
point on which classifier fB failed miserably. Perhaps the point on which
classifier fA failed is the only point where classifier fA would ever fail, and it
happened, quite by chance, to show up in our sample. Either way, we can see
the problems caused by the sole display of the average RMSE results. Whatever
they show is not the whole story.
The purpose of statistical significance testing is thus to help us gather evi-
dence of the extent to which the results returned by an evaluation metric are
representative of the general behavior of our classifiers. The best way would be
to look at the isolated results themselves, as we just did. But this is not realistic,
given the size of the testing sets that need to be used to obtain sufficient infor-
mation about the classifiers.1 Statistical significance tests thus summarize this
information. In doing so, however, they often make a number of assumptions
that need to be considered before the test is applied to make sure that the results
represent the situation accurately.
The remainder of this chapter discusses a number of statistical tests used in
the case of learning problems and states their assumptions clearly. The reason
for choosing to describe this particular subset of tests is that their assumptions
are most in tune with the situations typically encountered in the field. It should
be noted that we could not consider all possible situations a researcher is likely
to encounter. The reader should thus take into consideration the fact that, in
some cases, he or she may have to look beyond this book and into the statistics
literature to find a more appropriate test. What we hope this book does in this
vein, however, is to attempt to provide the reader with an appropriate launchpad
by providing the necessary tools and understanding to make informed choices.
Note as well that, although statistical tests play an important part in assessing
the validity of the results obtained, it is a mistake to believe that favorable
1 Please note that looking at individual scores is not always useful either. In our example 6.2, for
instance, we just do not have a sufficient number of observations to draw any conclusions. This, in
fact, corresponds to cases in which statistical tests could be neither conclusive nor applied. (See the
following section on the limitations of statistical significance testing.)
statistical results are all that is needed to answer the questions asked earlier.
This is not the case. The choice and availability of the datasets, the way in which
these datasets are resampled, and the testing regimen, as well as the number of
experiments run, are all considerations that should also be kept in mind.
Prior to embarking onto the main matter of this chapter, though, we discuss
yet another, more fundamental, limitation of statistical testing.
2 Of course, not everyone makes this error, as many researchers are aware of the true meaning of the
test.
3 While we use α for significance level elsewhere in the book, we retain p for this discussion in
accordance with its common usage in such references. This should not be confused with other use(s)
of p in the book.
noted that this error is similar to the absurd mistake of assuming that, because
only a small proportion of U.S. citizens are members of Congress, then if some
man is a congressman, one could conclude that he is probably not a U.S. citizen.
This is, hence, confusing the likelihood over H with posterior on H in the
Bayesian sense. Now, of course, Bayes rule can enable us to derive P(H|S)
from P(S|H), but one of the main obstacles in doing so is the lack of any
knowledge of the priors over H and S (Demšar, 2008). In fact, as noted by
Drummond (2006), both Fisher and Neyman–Pearson categorically rejected the
use of Bayesian reasoning in NHST.
Another related misinterpretation of NHST is the assumption that (1 − p) is
the probability of successful replication of our experimental results. Yet the p
value is not related to the issue of replication (Goodman, 2007).
Uncertainty in a clear interpretation over the NHST outcome has been cited
as a reason for a discontinuation of its use (Hubbard and Lindsay, 2008). Con-
fidence intervals have been suggested as alternatives to using NHST, along
with effect sizes (Gardner and Altman, 1986). We introduced the concepts of
confidence intervals and effect size in Chapter 2, along with some examples.
However, there are counterarguments that could (and should) be made to these
suggestions, as we will see a bit later.
Researchers should understand what statistically significant results truly mean.
They should thus refrain from overvaluing such results and,
further, make sure to apply and interpret them (only) in apt contexts. Indeed,
however small a contribution, the role of NHST cannot be dismissed. At the very
least, the impossibility of rejecting a null hypothesis while using a reasonable
effect size is telling: It helps us guard against the claim that our algorithm is
better than others when the evidence to support this claim is too weak. This is
important, and the fact that nothing very specific can be concluded when the
null hypothesis does get rejected does not take away the value of knowing when
it cannot be rejected.
In addition, pragmatically speaking, the use of NHST is quite unlikely to
disappear in the foreseeable future, despite the criticisms. This is not to say,
however, that such tests are inevitable and that better alternatives do not exist
or will never appear. Indeed, scientific progress is pinned on such discoveries
(if some such alternatives already exist out there) and inventions (if we can
come up with better, more-disciplined alternatives). But until this is done, we
are indeed better off at least utilizing the benefits that the current practices
have to offer while trying to minimize the related costs. And this can be done
only by a thorough understanding of these practices and the contexts in which
they operate.
Keeping all the preceding critical observations and associated suggestions
in mind, we thus choose to present the various approaches to NHST so as to
at least help users apply them properly when deemed necessary. Finally, there
is also a social aspect to our motivation in assuming this position. Science
practice, although idealized as a search for the truth, is not independent of social
implications. As such, advising new students or practitioners to go against the
norm in their field might result in their ideas not getting considered. Such
changes progress gradually. In the meantime, we are better off training the new
cohorts as carefully as possible within the confines of the society they work in,
while warning them of the limitations of its customs and hence encouraging a
positive change.
4 Note that Kononenko and Kukar (2007) used a similar breakdown of classifier evaluation approaches,
although in a more limited context than the elaborate one that we present here.
[Figure: an overview of the statistical tests discussed in this chapter, grouped into parametric, nonparametric, and combined parametric-and-nonparametric approaches.]
we will assume that the difference between these means is zero and see whether
this null hypothesis H0 can be rejected. To see whether the hypothesis can be
rejected, we need to find out what kind of differences between two samples
can be expected because of chance alone. We can do this by considering the
mean of every possible sample of the same size as our sample coming from the
first population and comparing it with the mean of every other possible sample
of the same size coming from the second population, i.e., we are looking at
the distribution of differences between sample means. Such a distribution is
a sampling distribution (see Chapter 2) and will tend to normality (a normal
distribution) as the sample size increases.
Given this normality assumption, we would expect the mean difference to
deviate from zero according to the normal distribution centered at zero. We now
check if the obtained difference (along with its variance) indeed displays this
behavior. This is done with the following t statistic:5
$$t = \frac{\bar{d} - 0}{\bar{\sigma}_d / \sqrt{n}}, \qquad (6.1)$$
or, equivalently,
$$t = \frac{pm(f_1) - pm(f_2)}{\bar{\sigma}_d / \sqrt{n}}, \qquad (6.2)$$
5 We know that the mean of the distribution of differences between sample means is zero, but we
do not know what its standard deviation σ is. Therefore we need to estimate it, which we do by
using our sample variance. This leads to an overestimate of the value that would have been obtained
if σ had been known. This is why the distribution of z, perhaps familiar to some readers, cannot be
used to accept or reject the null hypothesis. Instead, we use Student's t distribution, which corrects
for this problem, and we compare t against the t table with n − 1 degrees of freedom.
This statistic lets us compute the probability that such mean differences would
be observed in the situation in which the two means come from samples emanating
from the same distribution. The t table is found in Appendix A.2.
We output this probability if we are solely interested in a one-tailed test, and
we multiply it by two before outputting it if we are interested in a two-tailed test
(see Chapter 2). If this output probability, the p value, is small (by convention,
we use the threshold of 0.05 or 0.01), we would reject H0 at the 0.05 or 0.01
level of significance. Otherwise, we would state that we have no evidence to
conclude that H0 does not hold.
The following example illustrates the use of the t test as is done in many
cases currently. However, along with demonstrating the practical calculations,
the following examples also demonstrate the caveats that one should keep in
mind before such an application.
Our overall average performance measures are the average of the cross-
validated error rates of the respective classifiers in each trial and were computed
in Chapter 2 as the mean of the continuous random variables representing the
performance of each classifier. We recall that the values obtained for c4.5 and
nb were
pm(c45) = 0.2175,
pm(nb) = 0.0649.
and thus
$$\bar{\sigma}_d = \sqrt{\frac{\sum_{i=1}^{n}(d_i - \bar{d})^2}{n-1}} = \sqrt{\frac{0.0321}{10-1}} = 0.05969,$$
so that
$$t = \frac{\bar{d} - 0}{\bar{\sigma}_d/\sqrt{n}} = \frac{0.1526 - 0}{0.05969/\sqrt{10}} = 8.0845.$$
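This arithmetic is easy to reproduce in R. A minimal sketch, reusing the per-run error rates of c4.5 and nb that also appear later in Listing 6.4 (the vector names are those used there):

> c45 <- c(0.2433, 0.1733, 0.1733, 0.2633, 0.1633,
           0.2400, 0.2067, 0.1500, 0.2667, 0.2967)
> nb <- c(0.0733, 0.0667, 0.0167, 0.0700, 0.0733,
          0.0500, 0.1533, 0.0400, 0.0367, 0.0733)
> d <- c45 - nb                          # per-run differences
> t.stat <- mean(d) / (sd(d) / sqrt(length(d)))  # Equation (6.1)
> t.stat                                 # approximately 8.07 (cf. Listing 6.3)
> 2 * pt(-abs(t.stat), df = length(d) - 1)  # two-tailed p value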
where $\sigma_1^2$ and $\sigma_2^2$ are the sample variances of the performance measures of the
two classifiers and $n_1$ and $n_2$ are their respective numbers of trials.
Effect Size
The t test determines whether the observed difference in the performance measures
of the classifiers is statistically significant. However, it cannot tell us
whether this difference, although statistically significant, is also of
any practical importance. That is, it detects the presence of an effect but does not
measure the size of this effect. Measuring the effect size can be done with one of the
statistics available for this purpose. Many methods of assessing effect size can be
found in the statistics literature, such as Pearson's correlation coefficient, Hedges' g,
and the coefficient of determination. The effect size in the case of the t test is generally determined
with Cohen’s d statistic. Please refer back to Chapter 2 for more details on
effect size and associated statistics. The Cohen’s d statistic in the case of two
matched samples was briefly mentioned in Chapter 2. Here we describe it in
greater detail. More specifically, given classifiers f1 and f2 , Cohen’s d statistic
can be calculated as follows (we denote the Cohen’s d statistic with the notation
dcohen to avoid ambiguity):
$$d_{cohen} = \frac{pm(f_1) - pm(f_2)}{\sigma_p},$$
where $\sigma_p$, the pooled standard deviation estimate, is defined as
$$\sigma_p = \sqrt{\frac{\sigma_1^2 + \sigma_2^2}{2}},$$
and $\sigma_1^2$ and $\sigma_2^2$ represent the variances of the distributions of the respective measures
$pm(f_1)$ and $pm(f_2)$. A typical interpretation proposed by Cohen for this effect-size
statistic is the following discretized scale:
• A $d_{cohen}$ value of around 0.2 or 0.3 denotes a small effect, but one that is probably meaningful.
• A $d_{cohen}$ value of about 0.5 signifies a medium effect that is noticeable.
• A $d_{cohen}$ value of 0.8 signifies a large effect.
Note, however, that dcohen need not lie in the [0, 1] interval and can indeed be
even greater than 1, as we subsequently see.
We apply the formula for $d_{cohen}$ with the values obtained in the previous
example. In particular, we compute $\sigma_1^2 = \sigma_{c4.5}^2$ and $\sigma_2^2 = \sigma_{nb}^2$, using R as
follows.^6
6 Note that we cannot use the variances obtained in Chapter 2 because these were based on a sample
size of 100; here we made the problem more manageable by averaging the results obtained on the
10 folds of each run and reducing the problem to a sample size of 10. Please note that, even when
using the 100 values instead of the 10 values when calculating dcohen , we obtain a value greater than
0.8 (1.1221).
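The variance computation itself is a one-liner in R; a minimal sketch, with c45 and nb holding the ten per-run averaged error rates (the same vectors as in Listing 6.4):

> c45 <- c(0.2433, 0.1733, 0.1733, 0.2633, 0.1633,
           0.2400, 0.2067, 0.1500, 0.2667, 0.2967)
> nb <- c(0.0733, 0.0667, 0.0167, 0.0700, 0.0733,
          0.0500, 0.1533, 0.0400, 0.0367, 0.0733)
> var(c45)   # sigma_1^2, approximately 0.00261
> var(nb)    # sigma_2^2, approximately 0.00133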
We thus have
$$d_{cohen} = \frac{0.2175 - 0.0649}{\sqrt{\frac{0.00261 + 0.00133}{2}}} = 3.4381.$$
From Cohen’s guidelines, we conclude that, because dcohen > 0.8, the effect size
is large. That is, the difference in the means of the two populations and hence
the performances of the two classifiers do differ (as confirmed by the t test), and
the difference is practically important as further confirmed by an estimate of the
size of this difference using the Cohen’s d statistic (dcohen ).
Randomness of the Samples. The t test assumes that the samples from which
the means are estimated are representative of the underlying population. That
is, the samples are assumed to be drawn from this population at random.
7 The sample size requirement of 30 is widely found in the literature as the minimum sample
size required for approximating the sample by use of a normal distribution. This number comes from
a large number of simulation studies showing how various distributions converge to a normal
distribution with increasing samples and is used as an empirical guide.
8 As discussed later, however, one caveat of these tests is that they require a large sample size, which,
if available, would make these tests unnecessary.
Equal Variances of the Populations. The paired t test assumes that the two
samples come from populations with equal variance. This is necessary because
we use the sample information to estimate the entire population’s standard
deviation.
The first assumption is easy to verify because it only requires us to check that
our algorithms are applied to testing sets consisting of enough samples for
the assumption to hold. We discussed the sample size requirement in Chapter 2.
Recall that, as a rule of thumb, each set should contain at least 30 samples. A further
implication of this requirement is felt in our resampling strategy: because we
usually run 10-fold cross-validation experiments, this sample size requirement
of 30 or more suggests that the datasets should be of size at least 10 × 30 = 300
for individual trials.
The second assumption is usually difficult for machine learning researchers to
verify for the simple reason that they are typically not the people who gathered
the data used for learning and testing. They must thus trust that the people
responsible for this task built truly random samples and try to gather enough
information about the dataset construction process to pass a judgment. There
are cases, however, when true randomness is difficult to achieve, and this trust
can be problematic. Note that, by randomness, we mean choosing the samples
in an i.i.d. manner. This does not mean making an assumption about the data
distribution. The data can come from any arbitrary distribution. However, the
assumption is based on the notion of i.i.d. sampling from this distribution, which
is indeed hard to verify.
We could verify the third assumption either by observing the calculated
variances or by plotting the two populations (the measures on which the t test
is to be applied) and visually deciding whether they indeed have (almost) equal
variance. Alternatively, the similarity of variances can be tested with the F test,
Bartlett’s test, Levene’s test, or the Brown–Forsythe test. Again, these can be
found in many statistics texts.
The procedure that can be used in R is illustrated in the following example,
where we question whether our use of the t test on the labor dataset was warranted
or not.
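A minimal sketch of such a check in R, with c45 and nb again holding the per-run error rates of the two classifiers; var.test performs the F test for the equality of two variances, and bartlett.test is one of the alternatives mentioned above (Levene's and the Brown–Forsythe tests require add-on packages):

> c45 <- c(0.2433, 0.1733, 0.1733, 0.2633, 0.1633,
           0.2400, 0.2067, 0.1500, 0.2667, 0.2967)
> nb <- c(0.0733, 0.0667, 0.0167, 0.0700, 0.0733,
          0.0500, 0.1533, 0.0400, 0.0367, 0.0733)
> boxplot(c45, nb, names = c("C4.5", "NB"))  # visual check of the spreads
> var.test(c45, nb)             # F test for equality of the two variances
> bartlett.test(list(c45, nb))  # Bartlett's test as an alternative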
[Figure: box plots of the per-run error rates of c4.5 and nb on the labor dataset; the vertical axis ranges from 0.05 to 0.30.]
                          Classifier f2
                          0               1
Classifier f1    0   $c_{00}^{Mc}$   $c_{01}^{Mc}$
                 1   $c_{10}^{Mc}$   $c_{11}^{Mc}$
$$c_{01}^{Mc} = \sum_{i=1}^{|S_{test}|} \left[ I(f_1(x_i) \neq y_i) \wedge I(f_2(x_i) = y_i) \right],$$
$$c_{10}^{Mc} = \sum_{i=1}^{|S_{test}|} \left[ I(f_1(x_i) = y_i) \wedge I(f_2(x_i) \neq y_i) \right],$$
$$c_{11}^{Mc} = \sum_{i=1}^{|S_{test}|} \left[ I(f_1(x_i) = y_i) \wedge I(f_2(x_i) = y_i) \right],$$
with the rest of the notation as before. That is, $c_{00}^{Mc}$ denotes the number of
examples in $S_{test}$ misclassified by both $f_1$ and $f_2$; $c_{01}^{Mc}$ denotes the number of
examples in $S_{test}$ that are misclassified by $f_1$ but correctly classified by $f_2$; $c_{10}^{Mc}$
denotes the number of examples in $S_{test}$ that are misclassified by $f_2$ but correctly
classified by $f_1$; and $c_{11}^{Mc}$ denotes the number of examples in $S_{test}$ that are
classified correctly by both $f_1$ and $f_2$.
The null hypothesis assumes that both $f_1$ and $f_2$ have the same performance
and hence the same error rates, that is, $c_{01}^{Mc} = c_{10}^{Mc} = c_{null}^{Mc}$. The next step is to
compute the following statistic, which is approximately distributed as $\chi^2$:
$$\chi_{Mc}^2 = \frac{(|c_{01}^{Mc} - c_{10}^{Mc}| - 1)^2}{c_{01}^{Mc} + c_{10}^{Mc}}.$$
The $\chi_{Mc}^2$ value is then looked up in the table of $\chi^2$ distribution values (see
Appendices A.3.1 and A.3.2), and the null hypothesis is rejected if the obtained
value exceeds the table value for the desired level of significance.
The $\chi_{Mc}^2$ statistic basically performs a goodness-of-fit test comparing the observed counts
with the distribution of counts expected if the null hypothesis holds. If this statistic
is larger than $\chi_{1,1-\alpha}^2$ (the first subscript denotes the degrees of freedom), then
we reject the null hypothesis at the α significance level, i.e., with 1 − α confidence.
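Because the statistic depends on only two counts, it is straightforward to compute directly; a minimal sketch in R, using the disagreement counts that arise in Example 6.5 below:

> c01 <- 11; c10 <- 2                  # disagreement counts
> chi.mc <- (abs(c01 - c10) - 1)^2 / (c01 + c10)
> chi.mc                               # approximately 4.92
> 1 - pchisq(chi.mc, df = 1)           # p value, approximately 0.0265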
Example 6.5. In this example, we apply the c4.5 (c45) and nb classifiers on
the labor data. Our default resampling method is 10-fold cross-validation, and
we thus look at the results obtained on 10 different folds. (This means that we
end up testing the algorithms on the entire dataset, because we add up the errors
made by each classifier on all the folds.9 ) The specifics of the data, i.e., the actual
classification of each example by both classifiers, are given in Appendix B.
From the listing in Appendix B, we can see that c4.5 (classifier f1 )
makes errors on the instances numbered 1, 4, 5, 8, 11, 22, 23, 24, 31, 34, 38,
49, 53, 55, 57; nb (classifier f2 ) makes errors on instances numbered
1, 7, 11, 22, 24, 40. Therefore the populated McNemar contingency matrix
is as shown in Table 6.2; for $|S_{test}| = 57$, we have
$$c_{00}^{Mc} = 4, \quad c_{01}^{Mc} = 11, \quad c_{10}^{Mc} = 2, \quad c_{11}^{Mc} = 40,$$
and hence
$$\chi_{Mc}^2 = \frac{(|11 - 2| - 1)^2}{11 + 2} = \frac{64}{13} = 4.92.$$
Because $4.92 > 3.841$ (where $\chi_{1,0.05}^2 = 3.841$ per the table in Appendix
A.3.2), we reject the null hypothesis and claim that c4.5 and nb classify the
dataset differently. However, as mentioned previously, it turns out that McNemar's
test should not be used in this case because $c_{01}^{Mc} + c_{10}^{Mc} = 11 + 2 < 20$;
the sign test discussed in Subsection 6.6.1 should be used instead.
9 Note that Dietterich (1998) did not mention this use of the McNemar test. In his paper, he considered
McNemar’s test applied to a testing set proper (not cross-validated) and compared this technique
with the 10-fold cross-validated approach, among others. We, on the other hand, decided to separate
the issues of statistical testing and resampling in this book. We use stratified cross-validation as a
default, unless otherwise stated, in all our experiments. The issue of resampling was discussed in
Chapter 5.
Table 6.2. McNemar contingency matrix for c4.5 and nb on the labor data

                  nb
                  0      1
c4.5      0       4     11
          1       2     40
With regard to the testing of normality using the KS test or the Shapiro–Wilk
test, as mentioned earlier, these tests work well when the sample size is large.
However, for smaller samples, as is generally the case when two classifiers are
tested on multiple domains (generally fewer than 30), these tests are not so
powerful. Hence, as Demšar (2006, p. 6) pointed out, the irony is that, “for using
the t-test we need Normal distributions because we have small samples, but the
small samples also prohibit us from checking the distribution shape.”
Finally, the t test in this setting is also susceptible to outliers. That is,
rare classifier performances that deviate greatly from the performances on the other
domains will tend to skew the distribution, thereby reducing the power of the t
test because they inflate the estimated standard deviation of the performance
measures.
With regard to McNemar's test, which works on nominal variables,^10 as can
easily be seen from the use of the contingency matrix in this case, extending the
test to the case of multiple domains is not straightforward. Other tests are
available, for instance Cochran's test, that aim at extending this approach
to multiple-classifier testing on multiple domains. However, as we will see
later, we have better alternatives for performing such testing. Finally, note also that
McNemar's test works on binary classification data. The related extension to
the multiclass case would be the marginal homogeneity test, which we do not
describe here but which can be found in many standard statistical-hypothesis-testing
texts.
10 This corresponds to classifying data into two categories but not quantifying the extent of classi-
fication.
Qualitatively, these two tests have also been compared on the grounds of their
respective powers and type I error probabilities. As mentioned previously,
Dietterich (1998) compared McNemar's test applied to a testing set proper
(with no cross-validation) to various cross-validated versions of the t test. In
his experiments, McNemar's test was thus at a disadvantage with respect to the
t test. Yet he concluded that McNemar's test has as low, if not a lower, probability
of making a type I error than the other tests in the study, including the t test.
McNemar's test was also shown to have slightly less power than the 5 × 2-CV t
test and much less power than the cross-validated t test (but the cross-validated t
test had a higher probability of making a type I error).^11 Dietterich's conclusions
thus suggest that McNemar's test may be a good alternative to the t test; however,
it is important to recall that McNemar's test applies only under the condition
that the number of disagreements between the two classifiers is large
(generally, $c_{01}^{Mc} + c_{10}^{Mc} \geq 20$), as previously discussed.
11 We recall that a type I error of a statistical test corresponds to the probability of incorrectly detecting
a difference when no such difference exists and that the power of a test corresponds to the ability
of the test to discover a difference when such a difference does exist. Please refer to Chapter 2 for
further details about these concepts.
Example 6.6. We use Table 6.3^14 for our example, which lists the performance
of eight classifiers on 10 different domains. From the table, we see that nb wins
4 times over svm, ties once, and loses 5 times. Conversely, svm wins 5 times,
ties once, and loses 4 times. Because the table of critical values in Appendix A.4
shows that, for 10 datasets, a classifier needs to win at least 8 times for the null
hypothesis to be rejected at significance level α = 0.05 (or α = 0.1 in the
two-tailed test), we cannot reject the null hypothesis with the sign test.
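The same conclusion can be reached through the binomial distribution underlying the sign test; a minimal sketch in R, adopting the common convention of dropping the tied domain:

> wins <- 4; losses <- 5   # nb vs. svm over 10 domains, one tie dropped
> binom.test(wins, wins + losses, p = 0.5)

The two-tailed p value returned is far above 0.05, so the null hypothesis stands, in agreement with the table lookup.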
12 This can also be scaled up for comparing multiple classifiers by constructing a matrix of pairwise
comparisons.
13 A z statistic corresponds to the standard (or unit) normal distribution assumption.
14 Please note that the results in the table are listed in percentages that are easily readable. These
numbers will be divided by 100 in the context of certain statistical tests, as will be further discussed.
Notes: A “v” indicates the significance test's success in favor of the corresponding classifier against
nb, whereas a “∗” indicates this success in favor of nb. No symbol indicates that the results of the
concerned classifier and nb were not found to be statistically significantly different.
To see this, we revert to the same convention we used for the t test. Let us
consider two classifiers $f_1$ and $f_2$ and the two conditions corresponding to their
performance measures $pm(f_1)$ and $pm(f_2)$. The procedure is as follows:
• For each trial i, i ∈ {1, 2, . . . , n}, we calculate the difference in the performance
measures of the two classifiers, $d_i = pm_i(f_2) - pm_i(f_1)$, where the
subscript i denotes the corresponding quantity for the ith trial.
• We rank all the absolute values of the differences; that is, we rank $|d_i|$. In
the case of ties, we assign average ranks to each tied $d_i$.
• Next, we calculate the following sums of ranks:
$$W_{s1} = \sum_{i=1}^{n} I(d_i > 0)\,\mathrm{rank}(d_i),$$
$$W_{s2} = \sum_{i=1}^{n} I(d_i < 0)\,\mathrm{rank}(d_i).$$
• Assuming that there are r differences whose values are zero, there are
two approaches to dealing with this issue. The first is to ignore these
differences, in which case n takes on the new value n − r, and all
further calculations follow in the same manner as previously stated.
The other approach is to split the ranks of these r zero-valued differences
equally between $W_{s1}$ and $W_{s2}$, ignoring one of the zero-valued differences
if r is odd. Hence, in this case, n retains its original value (except
when r is odd, where n becomes n − 1). The sums $W_{s1}$ and $W_{s2}$ are then defined as
$$W_{s1} = \sum_{i=1}^{n} I(d_i > 0)\,\mathrm{rank}(d_i) + \frac{1}{2}\sum_{i=1}^{n} I(d_i = 0)\,\mathrm{rank}(d_i),$$
$$W_{s2} = \sum_{i=1}^{n} I(d_i < 0)\,\mathrm{rank}(d_i) + \frac{1}{2}\sum_{i=1}^{n} I(d_i = 0)\,\mathrm{rank}(d_i).$$
• Next, the statistic $T_{wilcox}$ is calculated as $T_{wilcox} = \min(W_{s1}, W_{s2})$.
• In the case of smaller n (n ≤ 25), exact critical values of $T_{wilcox}$ can be looked
up in the table of critical values for $T_{wilcox}$ to verify whether the null hypothesis
can be rejected.
• In the case of larger n, the distribution of $T_{wilcox}$ can be approximated normally.
We compute the following z statistic:
$$z_{wilcox} = \frac{T_{wilcox} - \mu_{T_{wilcox}}}{\sigma_{T_{wilcox}}},$$
where $\mu_{T_{wilcox}}$ is the mean of the normal approximation of the distribution
of $T_{wilcox}$ when the null hypothesis holds,
$$\mu_{T_{wilcox}} = \frac{n(n+1)}{4},$$
and $\sigma_{T_{wilcox}} = \sqrt{n(n+1)(2n+1)/24}$ is the corresponding standard deviation.
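The rank sums themselves are easy to compute by hand; a minimal sketch in R, for two hypothetical performance vectors pm1 and pm2 (names ours), using the first approach to zero differences:

> pm1 <- c(0.85, 0.80, 0.78, 0.91, 0.87)   # hypothetical accuracies
> pm2 <- c(0.82, 0.81, 0.78, 0.86, 0.84)
> d <- pm2 - pm1
> d <- d[d != 0]          # ignore zero differences; n becomes n - r
> r <- rank(abs(d))       # rank() assigns average ranks to ties by default
> Ws1 <- sum(r[d > 0])
> Ws2 <- sum(r[d < 0])
> min(Ws1, Ws2)           # the T_wilcox statistic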
Table 6.4. Results of classifiers nb and svm on 10 realistic domains, as used in
the Wilcoxon test example
Let us now illustrate this test with the following example that compares the nb
and the svm classifiers on 10 domains, as per the first two columns of Table 6.3,
but divided by 100 to obtain accuracy rates in the [0, 1] interval. The results of
the analysis performed to apply the Wilcoxon test are shown in Table 6.4.
The sums of signed ranks are then computed, yielding the values $W_{s1} = 17$
and $W_{s2} = 28$. According to the algorithm previously listed, $T_{wilcox} = \min(17, 28) = 17$,
which is larger than the critical value of 8 required for n = 10 at the 0.05 level in the
two-sided test; the null hypothesis therefore cannot be rejected.
the training data). This would result in unfairly skewing the test results in favor
of this classifier. A relatively better method, hence, would be to use a single
trial of cross-validation because in this case the performances of classifiers over
each fold can be compared, resulting in (relatively) independent performance
assessments, unlike the multiple-trial version. However, it should be noted that,
even in this case, the training sets of various folds can have a significant over-
lap, especially in the case of small sample sizes and larger numbers of folds
considered in the cross-validated resampling regimen. We subsequently illus-
trate an example of how the sign test and Wilcoxon’s test can be used in the
single-domain scenario.16
16 This, however, in no way condones the view that these tests should be used in a single-domain
scenario.
Table 6.5. Results of classifiers c4.5 and nb used for the Wilcoxon test example on a single
cross-validation run
Example 6.8. In this example, we are comparing c4.5 and nb on the labor
dataset on the first of the 10 runs presented in Table 2.3, trying to establish
whether the two classifiers behave significantly differently from one another. In
more detail, Table 6.5 lists the fold number in the first column; the error rate of
c4.5 on this fold in the first run in the second column; the error rate of nb on this
fold in the first run is in the third column; the difference between the two scores
just listed, in the fourth; the absolute value of this difference in the fifth; the
rank of this absolute value in the sixth (with the folds for which no difference is
found, removed); and the signed rank of this absolute value (where the sign of
column 4 is added to the rank of column 6), in column 7.
The sums of signed ranks are then computed, yielding the values $W_{s1} = 33$
and $W_{s2} = 3$. According to the algorithm previously listed, $T_{wilcox} = \min(33, 3) = 3$,
which is small enough for the null hypothesis to be rejected at the 0.05 level.
17 Sometimes threefold when some preanalysis is performed using the so-called pre-hoc tests.
18 Salzberg (1997) suggests another approach to deal with the problem of multiple comparisons: the
binomial test with the Bonferroni correction for multiple comparisons. However, he himself remarks
that such a test does not have sufficient power and that the Bonferroni correction is too drastic.
Demšar (2006) agrees that the field of statistics produced more powerful tests to deal with these
conditions.
these cases. For the case of ANOVA, we detail two of the main versions as
they both apply to the evaluation of learning approaches, the difference between
which will be clear from the description.
The next term, denoted the total sum of squares, defines the total variation
over the classifiers' performances on all datasets (with degrees of freedom
kn − 1):
$$SS_{Total} = \sum_{j=1}^{k}\sum_{i=1}^{n} (pm_{ij} - \overline{pm})^2.$$
The variation in the error follows naturally as the difference between the total
variation and the combined variation accounted for by $SS_{Block}$ and $SS_{pm}$ [with
degrees of freedom (n − 1)(k − 1)]: $SS_{Error} = SS_{Total} - SS_{pm} - SS_{Block}$.
Let us now define the quantities that enable us to model the variations of
classifiers’ performance measures over datasets. The first quantity, the between
mean squares (MS), is a measure of variability between classifiers:
$$MS_{pm} = \frac{SS_{pm}}{k-1},$$
where k − 1 is the number of degrees of freedom for $SS_{pm}$.
The second quantity, the within mean squares or mean-squares error, mea-
sures the variability within classifiers:
$$MS_{Error} = \frac{SS_{Error}}{(n-1)(k-1)},$$
where (n − 1)(k − 1) is the number of degrees of freedom of $SS_{Error}$.
Finally, the statistic of interest, the F ratio, can be obtained as
$$F = \frac{MS_{pm}}{MS_{Error}}.$$
This F ratio can then be looked up in the table of critical values for F
ratios to assess whether the null hypothesis can be refuted at the desired
significance level. The degrees of freedom used are $df_1 = df_{SS_{pm}} = k - 1$ and
$df_2 = df_{MS_{Error}} = (n - 1)(k - 1)$. Note that the null hypothesis is rejected if the
F ratio obtained is greater than the critical value in the lookup table;
that is, larger F values demonstrate greater statistical significance than smaller ones.
As with the z and t statistics, there are tables of significance levels associated
with the F ratio. (Please refer to Appendix A.6 for these tables.)
Briefly summarizing, the goal of ANOVA is to discover whether the differ-
ences in means (the classifiers’ performances) between different groups (i.e.,
over different datasets) are statistically significant. To do so, ANOVA partitions
the total variance into variance caused by random error (the within-group vari-
ation) and variance caused by actual differences between means (the between-
group variation). If the null hypothesis holds, then the within-group SS should
be about the same as the between-group SS. We can compare these two varia-
tions by using the F test, which checks whether the ratio of the two variations,
measured as mean squares, is significantly greater than one.
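The decomposition is compact enough to compute by hand; a minimal sketch in R, for a hypothetical matrix pm with one row per dataset (block) and one column per classifier (all names ours):

> pm <- matrix(runif(30, 0.7, 0.9), nrow = 10, ncol = 3)  # hypothetical data
> n <- nrow(pm); k <- ncol(pm)
> grand <- mean(pm)
> SS.pm <- n * sum((colMeans(pm) - grand)^2)    # between-classifier variation
> SS.block <- k * sum((rowMeans(pm) - grand)^2) # between-dataset variation
> SS.total <- sum((pm - grand)^2)
> SS.error <- SS.total - SS.pm - SS.block
> (SS.pm / (k - 1)) / (SS.error / ((n - 1) * (k - 1)))    # the F ratio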
An Illustration
We illustrate this process with the following hypothetical example:
Assume that classifiers fA , fB , and fC obtain the percentage accuracy results
shown in Table 6.6 on domains 1–10 (we assume that each entry in the table is
the result of 10-fold cross-validation).
Dividing all the results by 100 to obtain accuracy-rate estimates that can
be modeled by the statistical tests, we have, for Table 6.6,
$$\overline{pm} = \frac{81.47}{100} = 0.8147,$$
with the corresponding $\overline{pm}_{i.}$ and $\overline{pm}_{.j}$ values as shown in Tables 6.7 and 6.8,
respectively. Let us calculate the relevant quantities:
$$SS_{pm} = n\sum_{j=1}^{k}(\overline{pm}_{.j} - \overline{pm})^2 = 10\sum_{j=1}^{3}(\overline{pm}_{.j} - 0.8147)^2,$$
$$SS_{Block} = k\sum_{i=1}^{n}(\overline{pm}_{i.} - \overline{pm})^2 = 3\sum_{i=1}^{10}(\overline{pm}_{i.} - 0.8147)^2.$$
Table 6.7. Per-domain mean accuracies $\overline{pm}_{i.}$

Domain    $\overline{pm}_{i.}$
1         0.8196
2         0.8166
3         0.7968
4         0.8166
5         0.8246
6         0.7784
7         0.8285
8         0.8265
9         0.8149
10        0.8242
Similarly, we get
$$SS_{Total} = \sum_{j=1}^{k}\sum_{i=1}^{n}(pm_{ij} - \overline{pm})^2 = \sum_{j=1}^{3}\sum_{i=1}^{10}(pm_{ij} - 0.8147)^2 = 0.1249,$$
Table 6.8. Per-classifier mean accuracies $\overline{pm}_{.j}$

Classifier    $\overline{pm}_{.j}$
fA            0.8605
fB            0.7286
fC            0.8550
$$MS_{Error} = \frac{SS_{Error}}{(n-1)(k-1)} = \frac{0.006995}{(10-1)(3-1)} = 0.000389.$$
Finally,
$$F = \frac{MS_{pm}}{MS_{Error}} = \frac{0.055675}{0.000389} = 143.12.$$
$$pm_{ij} = \overline{pm} + \alpha_j + e_{ij},$$
where $\alpha_j = \overline{pm}_{.j} - \overline{pm}$ and $e_{ij}$ refers to the random error in the performance
measures, which is distributed normally with mean 0 and variance $\sigma^2$. Hence it
follows that the performance measures for each classifier $f_j$ over the datasets $S_i$
are distributed normally about the mean $\overline{pm} + \alpha_j$ with variance $\sigma^2$.
In terms of the calculation, the only difference with regard to the one-way
repeated-measures ANOVA previously described is that we do not subtract the
block variability (that is, the variability of performance measures over all classifiers
within datasets) from $SS_{Error}$, and all calculations proceed accordingly.
Note that the degrees of freedom for $SS_{Error}$ become nk − k because each of the
k groups contributes n − 1 degrees of freedom.
21 Although other terms can also be found such as circularity and compound symmetry.
• Compute the average rank of each classifier:
$$\bar{R}_{.j} = \frac{1}{n}\sum_{i=1}^{n} R_{ij}.$$
22 Note that here we make the implicit assumption that a higher value of the performance measure (e.g.,
accuracy) is always preferred. However, there are other performance measures, such as the classification
error $R_S(f)$ of classifier f, for which a lower value is an indicator of better performance. In this
respect, the statement $pm_{ij} > pm_{ij'}$ should be interpreted as a representation of this criterion. That
is, classifier $f_j$ will be considered to outperform $f_{j'}$ when $f_j$ performs better: in the case of
a measure such as accuracy, this holds when $Acc(f_j) > Acc(f_{j'})$, whereas in the case of a
measure such as classification error, it holds when $R_S(f_j) < R_S(f_{j'})$. In this sense, a better
representation would be $pm_{ij} \succ pm_{ij'}$.
$$SS_{Error} = \frac{1}{n(k-1)}\sum_{i=1}^{n}\sum_{j=1}^{k} (R_{ij} - \bar{R})^2.$$
• The test statistic, also called the Friedman statistic, is calculated as
$$\chi_F^2 = \frac{SS_{Total}}{SS_{Error}}.$$
• According to the null hypothesis, which states that all the classifiers are
equivalent in their performance and hence that their average ranks $\bar{R}_{.j}$ should
be equal, $\chi_F^2$ follows a $\chi^2$ distribution with k − 1 degrees of freedom
for large n (usually > 15) and k (usually > 5).
• Hence, in the case of large n and k, $\chi_F^2$ can be looked up in the table
for the $\chi^2$ distribution (Appendix A.3). A p value, signifying $P(\chi_{k-1}^2 \geq \chi_F^2)$,
is obtained and, if found to be smaller than the desired significance level,
the null hypothesis can be rejected.
• In the case of smaller n and k, the $\chi^2$ approximation is imprecise, and a
lookup in tables of $\chi_F^2$ values computed specifically for the Friedman test
is advised (Appendix A.7).
Note that the preceding $\chi_F^2$ statistic can be simplified to
$$\chi_F^2 = \frac{12}{n k (k+1)} \sum_{j=1}^{k} R_{.j}^2 - 3n(k+1),$$
where $R_{.j}$ denotes the sum of the ranks of classifier $f_j$ over the n datasets.
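The simplified statistic is easy to compute directly; a minimal sketch in R, for a hypothetical accuracy matrix acc with one row per dataset and one column per classifier (names ours). Note that R's built-in friedman.test additionally applies a tie correction, so its output can differ slightly when ties are present:

> acc <- matrix(runif(30, 0.6, 0.9), nrow = 10, ncol = 3)  # hypothetical data
> n <- nrow(acc); k <- ncol(acc)
> R <- t(apply(-acc, 1, rank))   # rank within each row; rank 1 = highest accuracy
> R.j <- colSums(R)              # rank sums
> chi.F <- 12 / (n * k * (k + 1)) * sum(R.j^2) - 3 * n * (k + 1)
> chi.F
> 1 - pchisq(chi.F, df = k - 1)  # p value via the chi-squared approximation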
An Illustration
To better explain the test, we illustrate our discussion with the example used
previously in Table 6.6 in Subsection 6.7.1. We rewrite Table 6.6 in terms of
the rank obtained by each classifier on each domain to produce Table 6.9; i.e.,
we look across each row and assign ranks 1, 2, and 3 to the largest, the
second-largest, and the third-largest accuracies, respectively. We use average
ranks for ties. If there were no differences between the algorithms, we would
expect the ranks to be evenly spread among the datasets; i.e., on some datasets
algorithm A would win, on others algorithm B or C would win, and we could not
notice any patterns. ($f_A$, $f_B$, and $f_C$ denote the classifiers output by
algorithms A, B, and C, respectively.)
Consider the example previously used in the case of ANOVA. In our example,
we see a pattern: Classifier fB is always ranked third, whereas Classifiers fA and
fC share the first and second places more or less equally. (There may not be any
difference between fA and fC , but this is not what is getting tested here. This
question is considered in the next subsection, which discusses post hoc tests.)
We then compute the Friedman statistic:
$$\chi_F^2 = \frac{12}{n k (k+1)} \sum_{j=1}^{k} R_{.j}^2 - 3n(k+1) = \frac{12}{10 \times 3 \times 4}(15.5^2 + 30^2 + 14.5^2) - 3 \times 10 \times 4 = 15.05.$$
23 This generally corresponds to utilizing the measures and respective degrees of freedom based on
repeated-measures ANOVA when the respective post hoc test is applied.
The Tukey test is designed to avoid the higher risk of type I error encountered
when multiple t tests are used. The null hypothesis, as before, states that the
performances of the two classifiers being compared are equivalent (i.e., not
statistically significantly different). The test is performed in the following manner:
• Compute the mean performance measure of each classifier over the datasets;
that is, compute (in fact, reuse) $\overline{pm}_{.j}$, j ∈ {1, 2, . . . , k}.
• Calculate the standard error as
$$SE = \sqrt{\frac{MS_{Error}}{n}},$$
where $MS_{Error}$ is as defined in one-way ANOVA (or one-way repeated-measures
ANOVA, as the case may be).
• Calculate the q statistic for any two classifiers, say $f_{j_1}$ and $f_{j_2}$:
$$q = \frac{\overline{pm}_{.j_1} - \overline{pm}_{.j_2}}{SE}. \qquad (6.5)$$
• Compare the |q| just calculated with the critical value $q_\alpha$ in the table of
critical q values for the desired significance level α, with (n − 1)(k − 1)
degrees of freedom and number of groups = k for repeated-measures one-way
ANOVA (degrees of freedom nk − k for one-way ANOVA). This table
is available in Appendix A.8.
• Reject the null hypothesis if $|q| > q_\alpha$.
The mean difference in the numerator of Equation (6.5) can also be characterized
in terms of a statistic commonly known as the HSD (honestly significant
difference) between means. That is, when Equation (6.5) is rewritten as
$$HSD = q_\alpha\, SE,$$
the HSD denotes the minimum difference between the two means of interest
required for them to be statistically significantly different at significance level
α. Hence any pair of classifiers with a mean difference greater than the HSD
exhibits a statistically significant difference.
Note that, in the case of unequal sample sizes, which is generally not encountered
in classifier evaluation, the Scheffé test can be used as an alternative
to the Tukey test.
Example of the Tukey Test. To illustrate the process involved in the Tukey test,
we go back to the comparison of the three classifiers $f_A$, $f_B$, and $f_C$ over the 10
domains listed in Table 6.6 and used in Subsection 6.7.1 (recall that the accuracies are
divided by 100). From that example, we recall the means for classifiers $f_A$ (j = 1),
$f_B$ (j = 2), and $f_C$ (j = 3) over all domains: $\overline{pm}_{.1} = 0.8605$, $\overline{pm}_{.2} = 0.7286$,
and $\overline{pm}_{.3} = 0.8550$ (see Table 6.8). We also have
$$SE = \sqrt{\frac{MS_{Error}}{n}} = \sqrt{\frac{0.000389}{10}} = 0.006237.$$
We compute the q statistics for the three pairs of classifiers and obtain
$$q_{12} = \frac{0.8605 - 0.7286}{0.006237} = 21.15, \quad q_{13} = \frac{0.8605 - 0.8550}{0.006237} = 0.88, \quad q_{23} = \frac{0.7286 - 0.8550}{0.006237} = -20.27.$$
Because $|q_{12}|$ and $|q_{23}|$ exceed the critical value $q_\alpha$ for these degrees of freedom
while $|q_{13}|$ does not, the null hypothesis is rejected for the pairs ($f_A$, $f_B$) and
($f_B$, $f_C$) but not for ($f_A$, $f_C$).
Dunnett Test
In the case in which the comparisons of various means of interest are to be
made against a control classifier, the Dunnett test can be applied. Hence this
test is performed when the difference in the performance of each classifier
$f_j$, j ∈ {1, 2, . . . , k} \ {c}, is measured with respect to the control classifier $f_c$.
For instance, the control classifier $f_c$ can be the baseline classifier against which
the comparisons might be needed. We calculate the following t statistic:
$$t_d = \frac{\overline{pm}_{.j} - \overline{pm}_{.c}}{\sqrt{\frac{2\,MS_{Error}}{n}}},$$
with the usual meanings for the notations. The null hypothesis that classifier $f_j$
performs equivalently to the control is refuted by a table lookup for the obtained
t statistic, with degrees of freedom nk − k (the degrees of freedom of $MS_{Error}$
for one-way ANOVA) or (n − 1)(k − 1) (for one-way repeated-measures ANOVA),
and with number of groups = k. The table is given in Appendix A.9.
Example of the Dunnett Test. Assume that classifier $f_B$ in Table 6.6, used in
Subsection 6.7.1, is our control classifier. Then we compare classifier $f_A$ with
classifier $f_B$ by computing
$$t_{AB} = \frac{\overline{pm}_{.j} - \overline{pm}_{.c}}{\sqrt{\frac{2\,MS_{Error}}{n}}} = \frac{0.8605 - 0.7286}{\sqrt{\frac{2 \times 0.000389}{10}}} = 14.96,$$
and, similarly, comparing $f_C$ with $f_B$ yields $t_{CB} = (0.8550 - 0.7286)/0.00882 = 14.33$.
For df = (10 − 1)(3 − 1) = 18 and α = 0.05, we get the value 2.40 from the
table in Appendix A.9. Because our obtained values for both comparisons of
fA and fC with control algorithm fB are greater than 2.40, we can reject the
hypothesis that fA and fB are equivalent and fB and fC are equivalent at the
α = 0.05 significance level. In fact, it can also be rejected at the α = 0.01 level
because the critical value in the table is 3.17, which is still lower than our
obtained values.
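These two comparisons are a few lines in R; a minimal sketch reusing the quantities of the running example:

> MS.error <- 0.000389; n <- 10
> se <- sqrt(2 * MS.error / n)
> (0.8605 - 0.7286) / se    # f_A vs. control f_B: approximately 14.96
> (0.8550 - 0.7286) / se    # f_C vs. control f_B: approximately 14.33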
Bonferroni Test
Another alternative to discover the pairs with significant mean differences is to
use the Bonferroni correction for multiple comparisons. This involves calculat-
ing a t statistic as
$$t_b = \frac{\overline{pm}_{.j_1} - \overline{pm}_{.j_2}}{\sqrt{\frac{2\,MS_{Error}}{n}}}.$$
That is, the Bonferroni correction is similar to the Dunnett test except that,
in this case, all pairwise comparisons are made.
Example of the Bonferroni Test. This statistic is the same as that of the Dunnett test,
and its values were already calculated for the pairs $f_A$, $f_B$ and $f_B$, $f_C$. We thus
only need to compute the remaining comparison, between $f_A$ and $f_C$:
$t_b = (0.8605 - 0.8550)/0.00882 = 0.624$.
Because 0.624 < 2.40 (2.40 was the critical value for α = 0.05 that we obtained
in the previous section when discussing the Dunnett test), we cannot reject the
null hypothesis that states that fA and fC perform equivalently.
Performing all pairwise comparisons, however, has a drawback. As the num-
ber of comparisons increases, so does the probability of making a type I error.
Bonferroni’s correction for multiple comparisons (Salzberg, 1997) attempts to
address the issue by using a tighter scaling of the tb statistic (compare this with
the t test, for instance). But, although this method may work fine for a small
number of comparisons, it becomes more and more conservative as the number
of comparisons increases. A refinement to the Bonferroni approach to deal with
this issue has been proposed and is known as the Bonferroni–Dunn test or simply
the Dunn test.
Bonferroni–Dunn Test
The Bonferroni–Dunn test is basically the same as the Bonferroni test except
that the significance level α is divided by the number of comparisons made.
That is, the lookup for the q statistic is performed not in comparison with qα but
rather with q ncα , where nc is the number of comparisons made. This is especially
beneficial when the comparisons are made against a control ! " classifier because,
in that case, the number of comparison nc is k − 1 unlike k2 = k(k−1) 2
when all
pairwise comparisons are made.
Example of the Bonferroni–Dunn Test. In this example, the only thing that
changes is the value against which the calculated t statistics are compared.
Because we made three comparisons in the Bonferroni test, we divide α = 0.05
by 3 and obtain 0.017. A look at the t table shows that the value for df =
(10 − 1)(3 − 1) = 18 and α = 0.01 (which is close to 0.017) is 3.17.
Even with this correction, we can reject the hypotheses suggesting that $f_A$
and $f_B$ and that $f_C$ and $f_B$ behave equivalently (because the values computed in
the example associated with the Dunnett test are both greater than 3.17), but we
cannot reject the hypothesis suggesting that $f_A$ and $f_C$ perform equivalently.
We cannot run the test for α < 0.05 because the table does not show the critical
values for the significance level we would need in that case.
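R also offers the Bonferroni adjustment directly through p.adjust, which rescales a vector of pairwise p values rather than the significance level; a minimal sketch with hypothetical p values (numbers ours):

> p.raw <- c(1e-9, 1e-9, 0.54)   # hypothetical p values: fA-fB, fB-fC, fA-fC
> p.adjust(p.raw, method = "bonferroni")

Each adjusted p value can then be compared directly with the original α.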
Once the Friedman test rejects the null hypothesis, a post hoc test is needed
to discover whether the rank differences obtained as a result of the Friedman test are,
indeed, significant. This can be done with the Nemenyi test.
• Compute the average rank of each classifier:
$$\bar{R}_{.j} = \frac{1}{n}\sum_{i=1}^{n} R_{ij}.$$
• For any two classifiers $f_{j_1}$ and $f_{j_2}$, compute the q statistic as
$$q = \frac{\bar{R}_{.j_1} - \bar{R}_{.j_2}}{\sqrt{\frac{k(k+1)}{6n}}}.$$
• The null hypothesis is rejected after a comparison of the obtained q value
with the critical value $q_\alpha$ in the table for the desired significance level α.^25
Reject the null hypothesis if the obtained q value exceeds $q_\alpha$.
Note the similarity of this statistic to that of the Tukey test. However, if it is
expressed as a critical difference (CD) over ranks, analogous to the HSD statistic,
this CD represents a different quantity (not the absolute mean difference but the
rank difference). Also, $q_\alpha$ here corresponds to the q values from the Tukey
test scaled by dividing them by $\sqrt{2}$. In this respect, the Nemenyi test works by
computing the average rank of each classifier and taking their differences. In
the cases in which these average rank differences are larger than or equal to the
CD, we can say, with the appropriate amount of certainty, that
the performances of the two classifiers corresponding to these differences are
significantly different from one another.
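The CD itself is a one-line computation; a minimal sketch in R, where q.alpha stands for the critical value read from the appropriate table (the number below is only a stand-in for the tabulated value for k = 3 and α = 0.05):

> k <- 3; n <- 10
> q.alpha <- 2.343                # stand-in for the tabulated critical value
> CD <- q.alpha * sqrt(k * (k + 1) / (6 * n))
> CD    # average-rank differences at least this large are significant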
Example of the Nemenyi Test. To illustrate the process of the Nemenyi test, as
for the other post hoc tests, we go back to the comparison of the three classifiers
of Table 6.6 and compute
$$q_{12} = \frac{\bar{R}_{.1} - \bar{R}_{.2}}{\sqrt{\frac{3(3+1)}{6\times 10}}} = \frac{15.5 - 30}{0.45} = -32.22,$$
$$q_{13} = \frac{\bar{R}_{.1} - \bar{R}_{.3}}{\sqrt{\frac{3(3+1)}{6\times 10}}} = \frac{15.5 - 14.5}{0.45} = 2.22,$$
$$q_{23} = \frac{\bar{R}_{.2} - \bar{R}_{.3}}{\sqrt{\frac{3(3+1)}{6\times 10}}} = \frac{30 - 14.5}{0.45} = 34.44.$$
Other Methods
Other methods exist that basically compute a statistic similar to those previously
discussed but scale the significance level values so as to account for the family-
wise error rate over multiple-classifier comparisons. Some examples include
Hommel’s test, Holm’s test, and Hochberg’s test. Hommel’s test (Hommel,
1988) is a slightly more powerful nonparametric post hoc test. However, its
practical use is difficult as a result of the added complexity in implementing it.
Hence we have not discussed it here. Interested readers are encouraged to look
these up in statistics texts.
26 Chapter 5 already described the random subsampling scheme. Here we focus on the ensuing t test.
In their experiments, Nadeau and Bengio (2003) show that the corrected resam-
pled t test has an acceptable probability of type I error and has much better
power than the cross-validated t test, simple resampled t test, and Dietterich’s
5 × 2 CV, subsequently described.
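A minimal sketch of the corrected resampled t test in its commonly cited form, in which the usual 1/J variance factor is inflated to 1/J + n2/n1, where J is the number of resampling trials, n1 the training-set size, and n2 the test-set size (the function and argument names below are ours):

> corrected.resampled.t <- function(d, n1, n2) {
    # d: vector of per-trial performance differences between the two algorithms
    J <- length(d)
    mean(d) / sqrt((1 / J + n2 / n1) * var(d))
  }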
Dietterich (1998) shows that, under the null hypothesis, $\tilde{t}$ follows a t distribution
with 5 degrees of freedom. The null hypothesis is hence rejected by a t-table
lookup (Appendix A.2) when $\tilde{t} > t$ for the desired level of significance.
Dietterich's (1998) experiments with this new test showed that its probability
of issuing a type I error is lower than that of the k-fold CV paired t test, but the
5 × 2-CV t test has less power than the k-fold CV paired t test.
r × k CV
The preceding method of the 5 × 2-CV t test is a special case of the general
r × k cross-validation approach. In such an approach, r runs of a k-fold cross-
validation are performed on a given dataset for the two algorithms, and the
empirical differences of their performance, along with its statistical significance,
are studied. Increasing the number of runs as well as folds, depending on the
dataset size, can help in addressing issues such as replicability. Indeed, Bouckaert
(2003, 2004) studied the 5 × 2-CV test with an emphasis on replicability and
found the method wanting.28 He proposed a version with r = k = 10. We first
describe the method in its general form, and then we focus on the specific
strengths and limitations of different versions of 10 × 10-CV tests.
Let dij represent the difference in performance of the classifiers obtained
from algorithms A1 and A2 on the test set represented by the ith fold in the
j th run.
For r runs of k-fold cross-validation, we can obtain the average difference of
the empirical error estimates as
$$\bar{d} = \frac{1}{kr}\sum_{i=1}^{k}\sum_{j=1}^{r} d_{ij}.$$
28 A formal definition of replicability is given in the Bibliographic Remarks section at the end of this
chapter.
fold after the ordering has been obtained by sorting the folds in each run;
hence this averages the folds with the ith-lowest performance in each run.
The variance is then obtained as
$$\hat{\sigma}^2 = \frac{\sum_{i=1}^{k}(d_{o(i).} - \bar{d})^2}{k-1}.$$
Based on the variance obtained from any of the preceding methods, a Z score
can be computed as
$$Z = \frac{\bar{d}\,\sqrt{f_d + 1}}{\sigma},$$
where $f_d$ denotes the degrees of freedom, which is k × r − 1 for the use-all-data
scheme, r − 1 for the average-over-folds scheme, and k − 1 for both the average-over-runs
and average-over-sorted-runs schemes.
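A minimal sketch of the average-over-sorted-runs scheme in R, for a hypothetical k × r matrix d of per-fold differences, with folds in rows and runs in columns (names ours):

> d <- matrix(rnorm(100, 0.02, 0.05), nrow = 10, ncol = 10)  # hypothetical 10 x 10 CV
> k <- nrow(d); r <- ncol(d)
> d.sorted <- apply(d, 2, sort)   # sort the folds within each run
> d.o <- rowMeans(d.sorted)       # average of the i-th lowest folds across runs
> d.bar <- mean(d)
> sigma2 <- sum((d.o - d.bar)^2) / (k - 1)
> fd <- k - 1                     # degrees of freedom for this scheme
> d.bar * sqrt(fd + 1) / sqrt(sigma2)   # the Z score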
Let us now look at some specific findings for the special case in which
r = k = 10.
10 × 10 CV
Bouckaert (2003) investigated the previously mentioned four variations in the
context of variance estimation in the case in which r = k = 10 in an attempt to
establish its reliability. Other variations were also considered but turned out to
be either similar or less appropriate, in practice, than the four just considered.
In the use-all-data scheme, a degree of freedom k × r − 1 would give fd =
99. However, Bouckaert (2004) suggests the use of a calibrated paired t test
with fd = 10 instead of 99 because this choice shows excellent replicability. All
the versions of the 10 × 10-CV tests typically do not yield a higher expected
probability of type I errors than predicted for the experiments, although their
type I error is higher than that of simple 10-fold cross-validation. However,
the 10 × 10-CV scheme was shown to have as much, and often more, power.
Based on the empirical observations, a 10 × 10-CV method with the use-all-
data scheme for variance estimation with fd = 10 seems more appropriate when
the aim is to have a test with high power rather than a low type I error. When
replicability is important though, a 10 × 10 CV with the average-over-sorted-
runs scheme for variance estimation seems more suitable. Finally, just as a
parametric test can be used for significance testing, so can a nonparametric
alternative.
This section illustrates, using R, the tests discussed in this chapter: those for two
classifiers on a single domain, those for two classifiers on multiple domains, those
for multiple classifiers on multiple domains, post hoc tests, and tests based on
resampling. Please note that the R default, for all the tests for which it is relevant,
is to run a two-tailed test. We used this default in all these cases. It is possible to
change this default by setting the "alternative" argument, as per the R manual for
the statistical test commands. When rejecting the null hypothesis, we choose the α
offering the highest confidence level warranted by the associated p value.
        Paired t-test

data:  c45 and nb
t = 8.0701, df = 9, p-value = 2.064e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.1096298 0.1950302
sample estimates:
mean of the differences
                0.15233
From the p value we thus reject the null hypothesis, at the 0.01 level of signifi-
cance, that c4.5 and nb perform similarly to one another. If all the assumptions
of the t test were met (which they were not, as discussed in Subsection 6.5.1),
we would conclude that if nb indeed classifies the domain better than c4.5, then
the results we observed most probably did not occur by chance. In addition,
please notice that this procedure also returns the 95% confidence interval of
this difference in means. This result corroborates the one obtained manually in
Subsection 6.5.1.
Let us now obtain the value of effect size (Listing 6.4).
Listing 6.4: Sample R code for computing the effect size using Cohen’s d
statistic.
> c45 = c(0.2433, 0.1733, 0.1733, 0.2633, 0.1633,
          0.2400, 0.2067, 0.1500, 0.2667, 0.2967)
> nb = c(0.0733, 0.0667, 0.0167, 0.0700, 0.0733,
         0.0500, 0.1533, 0.0400, 0.0367, 0.0733)
> d = abs(mean(c45) - mean(nb)) / sqrt((var(c45) + var(nb)) / 2)
> d
[1] 3.430371
The R calculation returns the same result as the manual calculation, and
we thus conclude, as before, that the effect size is large because it is greater
than 0.8.
McNemar’s Test
R implements the McNemar test and can be used as shown in Listing 6.5, where
x is the contingency matrix. We use the same example as the one calculated by
hand in Subsection 6.5.2 and obtain the same result, which tells us that the null
hypothesis can be rejected with a p value of 0.0265.
        McNemar's Chi-squared test with continuity correction

data:  x
McNemar's chi-squared = 4.9231, df = 1, p-value = 0.0265
Listing 6.6: Sample R code for Wilcoxon’s test on the comparison of c4.5 and
nb on the labor data, using the first run of 10-fold cross-validation results of
Table 2.3.
> c4510folds = c(0.5, 0, 0.3333, 0, 0.3333, 0.3333, 0.3333, 0.2,
                 0.2, 0.2)
> nb10folds = c(0.1667, 0, 0, 0, 0, 0.1667, 0, 0.4, 0, 0)
> wilcox.test(nb10folds, c4510folds, paired = TRUE)

        Wilcoxon signed rank test with continuity correction

data:  nb10folds and c4510folds
V = 3, p-value = 0.04030
alternative hypothesis: true location shift is not equal to 0
We point out, however, that there are two causes for concern with regard to
the R statistical software’s implementation of Wilcoxon’s test. The first one is
the fact that the order in which the data are presented to the test seems to matter.
For example, in Listing 6.6, it should be noted that we listed the results of nb
(the better-performing classifier) before those of c4.5. This yielded V = 3. Had
we listed c4.5 before nb, we would have obtained V = 33, which represents the
maximum, rather than the minimum, of the two sums of signed ranks. This is
erroneous, although the p value indicated by the system is the same in both cases,
meaning that the conclusion drawn by R is the same whatever order is used.
The second cause for concern has to do with the set of warnings R issues
along with their meaning. In our preceding example, two such warnings were
issued, as shown in Listing 6.7:
This suggests that the p value issued is not fully reliable. It might thus be best
for the user to run “wilcox.test” twice, using a different order in the presentation
of the algorithms’ results each time, to select the smallest V value issued by
these two runs and to use the Wilcoxon table in Appendix A.5 to compute the
p value manually.
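A minimal sketch of this workaround, with nb10folds and c4510folds as defined in Listing 6.6:

> v1 <- wilcox.test(nb10folds, c4510folds, paired = TRUE)$statistic
> v2 <- wilcox.test(c4510folds, nb10folds, paired = TRUE)$statistic
> min(v1, v2)   # compare against the critical value in Appendix A.5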
        Wilcoxon signed rank test with continuity correction

data:  nbDatasets and SVMdatasets
V = 17, p-value = 0.5536
alternative hypothesis: true location shift is not equal to 0

Warning message:
In wilcox.test.default(nbDatasets, SVMdatasets, paired = TRUE):
  cannot compute exact p-value with zeroes
>
The p value returned by R is very high; thus we do not reject the null
hypothesis. Indeed, for N = 10, the number of datasets included in our study,
the critical values from the Wilcoxon table are 8 and 10 for the two-sided and
one-sided 95% confidence tests, respectively. These two values are smaller than
the value of 17 returned by the R procedure, and so the hypothesis cannot be
rejected. This is the same conclusion as the one that was obtained in Subsec-
tion 6.6.2.
29 Note that we chose one approach to executing repeated-measures one-way ANOVA in R. That is
the approach we considered to be the simplest. Other means of doing so are also possible in R.
makes the difference between the two is the inclusion of the term + Error(X),
where X takes the values “datasets” or “factor(datasets).” With this term, R
performs repeated-measures one-way ANOVA. Without it, it performs one-way
ANOVA.
In our second example, the formula inside aov() is accuracy ~ classifier + Error(dataset).
In effect, this formula can be translated as meaning that we are trying to establish
whether the various classifiers differ in accuracy and explicitly removing the
within-dataset variation that may affect accuracy. If the term “+ Error(dataset)”
were not included, we would simply be trying to establish whether the various
classifiers differ in accuracy (with no worries about the correlations of the
samples, i.e., the fact that classifiers’ accuracies obtained on the same datasets
have a measure of correlation). That is, we would be performing (simple) one-
way ANOVA.
In our first example, the formula inside aov() is slightly more complicated:
accuracy ~ classifier + Error(factor(dataset)).
The difference between the two formulas lies in the use of the function “fac-
tor().” The reason why this function did not need to be used in the second
example and needed to be used in the first is simple: In the second example,
our database used the dataset names, anneal, audio, etc. In the first database, we
specified the datasets numerically, using values 1, 2, 3, and so on. To signify to
R that we wanted these numerical values treated as categories, we had to specify
“factor(dataset)” every time “dataset” occurred in the formula of the first exam-
ple. This was not necessary in the second example. With respect to the classifiers,
because both examples referred to them as categorical values: classA, classB,
or classC in the first example or the actual name of the classifiers in the second;
there was no need to use the function factor(). However, had we referred to clas-
sifiers numerically (1, 2, 3, etc.) we would have needed to use factor(classifier)
in the formulas.
Listing 6.9: Sample R code for executing repeated-measures ANOVA on the
hypothetical dataset evaluating the three classifiers classA, classB, and classC
on 10 domains.
> tt <- read.table("rmanova-example.csv", header=T, sep=",")
> attach(tt)
> tt
   dataset  accuracy  classifier
1 1 0.8583 classA
2 2 0.8591 classA
3 3 0.8612 classA
4 4 0.8582 classA
270 Statistical Significance Testing
5 5 0.8628 classA
6 6 0.8642 classA
7 7 0.8591 classA
8 8 0.8610 classA
9 9 0.8595 classA
10 10 0.8612 classA
11 1 0.7586 classB
12 2 0.7318 classB
13 3 0.6908 classB
14 4 0.7405 classB
15 5 0.7471 classB
16 6 0.6590 classB
17 7 0.7625 classB
18 8 0.7510 classB
19 9 0.7050 classB
20 10 0.7395 classB
21 1 0.8419 classC
22 2 0.8590 classC
23 3 0.8383 classC
24 4 0.8511 classC
25 5 0.8638 classC
26 6 0.8120 classC
27 7 0.8638 classC
28 8 0.8675 classC
29 9 0.8803 classC
30 10 0.8718 classC
>
> summary(aov(accuracy ~ classifier + Error(factor(dataset))))

Error: Within
            Df   Sum Sq  Mean Sq F value   Pr(>F)
classifier   2 0.111307 0.055653  142.42 9.26e-12 ***
Residuals   18 0.007034 0.000391
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
>
Listing 6.10: Sample R code showing the data used for repeated-measures
ANOVA on the realistic experiments of Table 6.3.
> tt <- read.table("rmanova-complex-example.csv", header=T, sep=",")
> attach(tt)
> tt
   dataset      accuracy  classifier
1  Anneal       0.9643    NB
2  Audio        0.7342    NB
3  Balance      0.7230    NB
4  Breast-C     0.7170    NB
5  Contact-L    0.7167    NB
6  Pima         0.7436    NB
7  Glass        0.7063    NB
8  Hepa         0.8321    NB
9  Hypothyr     0.9822    NB
10 Tic-tac-toe  0.6962    NB
11 Anneal       0.9944    SVM
12 Audio        0.8134    SVM
13 Balance      0.9151    SVM
14 Breast-C     0.6616    SVM
15 Contact-L    0.7167    SVM
16 Pima         0.7708    SVM
17 Glass        0.6221    SVM
18 Hepa         0.8063    SVM
19 Hypothyr     0.9358    SVM
20 Tic-tac-toe  0.9990    SVM
21 Anneal       0.9911    IB1
22 Audio        0.7522    IB1
23 Balance      0.7903    IB1
24 Breast-C     0.6574    IB1
25 Contact-L    0.6333    IB1
26 Pima         0.7017    IB1
27 Glass        0.7050    IB1
28 Hepa         0.8063    IB1
29 Hypothyr     0.9152    IB1
30 Tic-tac-toe  0.8163    IB1
31 Anneal       0.8363    ADA-DT
32 Audio        0.4646    ADA-DT
33 Balance      0.7231    ADA-DT
34 Breast-C     0.7028    ADA-DT
35 Contact-L    0.7167    ADA-DT
36 Pima         0.7435    ADA-DT
37 Glass        0.4491    ADA-DT
38 Hepa         0.8254    ADA-DT
39 Hypothyr     0.9321    ADA-DT
40 Tic-tac-toe  0.7254    ADA-DT
41 Anneal       0.9822    BAG-REP
42 Audio        0.7654    BAG-REP
43 Balance      0.8289    BAG-REP
44 Breast-C     0.6784    BAG-REP
45 Contact-L    0.6833    BAG-REP
46 Pima         0.7461    BAG-REP
47 Glass        0.6963    BAG-REP
48 Hepa         0.8450    BAG-REP
49 Hypothyr     0.9955    BAG-REP
50 Tic-tac-toe  0.9207    BAG-REP
51 Anneal       0.9844    C45
Listing 6.11: Sample R code (using the data in Listing 6.10) for executing
repeated-measures ANOVA.
> summary(aov(accuracy ~ classifier + Error(dataset)))

Error: dataset
            Df  Sum Sq Mean Sq
classifier   9 0.82480 0.09164

Error: Within
            Df   Sum Sq  Mean Sq F value  Pr(>F)
classifier  16 0.128039 0.008002  2.0802 0.02344 *
Residuals   54 0.207734 0.003847
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
>
The results obtained on the first example corroborate those obtained manually in Subsection 6.7.1, save for rounding error. We thus strongly reject the hypothesis that classifiers fA, fB, and fC all perform similarly on domains 1–10. We did not treat the second example manually because of its size. However, the R treatment of this example concludes that we can reject the hypothesis that the eight classifiers perform similarly on the 10 domains considered at the 95% significance level (but not at a higher level of significance, as was possible in the first example).
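In R, Friedman's test is run with the built-in friedman.test() function applied to a matrix with one row per domain and one column per classifier. A minimal sketch for the first example, reusing the per-classifier accuracy vectors defined in Listing 6.14 below (the matrix name t matches the output that follows):

> t <- cbind(classA, classB, classC)
> friedman.test(t)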
        Friedman rank sum test

data:  t
Friedman chi-squared = 15, df = 2, p-value = 0.0005531
>
We can see in the first example that the R results corroborate those we obtained manually. The Friedman test thus also concludes that there is a difference in the performance of the three classifiers.
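The same function applies to the realistic experiments; a minimal sketch, assuming the eight per-classifier accuracy vectors of Listing 6.16 have been collected column-wise into a matrix named table (the name matching the output below):

> table <- cbind(nbDatasets, SVMDatasets, IB1Datasets, AdaboostDatasets,
    BaggingDatasets, C45Datasets, RFDatasets, JRipDatasets)
> friedman.test(table)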
        Friedman rank sum test

data:  table
Friedman chi-squared = 20.6872, df = 7, p-value = 0.004262
>
The results show that we can reject the hypothesis that all the algorithms are the
same at the 0.005 level, which is a higher significance level than the one obtained
with the repeated-measures one-way ANOVA, suggesting a greater sensitivity
for the Friedman test in this particular case.
30 That being said, David Howell (2007) suggests that the major reason why statistical software does not implement simple off-the-shelf post hoc tests for repeated-measures tests is that "unrestrained use of such procedures is generally unwise" (sic). His reason is basically that, although the omnibus tests are often robust to violations of certain assumptions, the ensuing post hoc tests are not. It is thus suggested that the manual computations for post hoc tests for repeated-measures one-way ANOVA presented in Subsection 6.7.7 be used only with extreme caution.
Listing 6.14: Sample R code for executing (simple) one-way ANOVA on the
hypothetical example.
> classA = c(85.83, 85.91, 86.12, 85.82, 86.28, 86.42, 85.91,
    86.10, 85.95, 86.12)/100
> classB = c(75.86, 73.18, 69.08, 74.05, 74.71, 65.90, 76.25,
    75.10, 70.50, 73.95)/100
> classC = c(84.19, 85.90, 83.83, 85.11, 86.38, 81.20, 86.38,
    86.75, 88.03, 87.18)/100
> df = stack(data.frame(classA, classB, classC))
> summary(x <- aov(values ~ ind, data = df))
            Df   Sum Sq  Mean Sq F value    Pr(>F)
ind          2 0.111307 0.055653  110.54 9.914e-14 ***
Residuals   27 0.013593 0.000503
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
Listing 6.15: Sample R code for executing Tukey’s test on the results of
Listing 6.14.
> TukeyHSD(x)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = values ~ ind, data = df)

$ind
                  diff         lwr         upr     p adj
classB-classA -0.13188 -0.15675964 -0.10700036 0.0000000
classC-classA -0.00551 -0.03038964  0.01936964 0.8478002
classC-classB  0.12637  0.10149036  0.15124964 0.0000000
Listing 6.16: Sample R code for executing one-way ANOVA on the eight-
classifier and 10-domain realistic example of Table 6.3.
>
> nbDatasets = c(96.43, 73.42, 72.3, 71.7, 71.67, 74.36, 70.63,
    83.21, 98.22, 69.62)/100
> SVMDatasets = c(99.44, 81.34, 91.51, 66.16, 71.67, 77.08,
    62.21, 80.63, 93.58, 99.9)/100
> IB1Datasets = c(99.11, 75.22, 79.03, 65.74, 63.33, 70.17,
    70.5, 80.63, 91.52, 81.63)/100
> AdaboostDatasets = c(83.63, 46.46, 72.31, 70.28, 71.67, 74.35,
    44.91, 82.54, 93.21, 72.54)/100
> BaggingDatasets = c(98.22, 76.54, 82.89, 67.84, 68.33,
    74.61, 69.63, 84.50, 99.55, 92.07)/100
> C45Datasets = c(98.44, 77.87, 76.65, 75.54, 81.67,
    73.83, 66.75, 83.79, 99.58, 85.07)/100
> RFDatasets = c(99.55, 79.15, 80.97, 69.99, 71.67, 74.88,
    79.87, 84.58, 99.39, 93.94)/100
> JRipDatasets = c(98.22, 76.07, 81.60, 68.88, 75.00, 75.00,
    70.95, 78.00, 99.42, 97.39)/100
> df = stack(data.frame(nbDatasets, SVMDatasets, IB1Datasets,
    AdaboostDatasets, BaggingDatasets, C45Datasets, RFDatasets,
    JRipDatasets))
>
> summary(x <- aov(values ~ ind, data = df))
            Df  Sum Sq Mean Sq F value Pr(>F)
ind          7 0.11294 0.01613  1.1088 0.3671
Residuals   72 1.04763 0.01455
>
Listing 6.17: Sample R code for executing Tukey’s test for Listing 6.16.
> TukeyHSD(x)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = values ~ ind, data = df)

$ind
                                     diff         lwr       upr     p adj
BaggingDatasets-AdaboostDatasets  0.10228 -0.06612666 0.2706867 0.5581351
C45Datasets-AdaboostDatasets      0.10729 -0.06111666 0.2756967 0.4964397
IB1Datasets-AdaboostDatasets      0.06498 -0.10342666 0.2333867 0.9280210
JRipDatasets-AdaboostDatasets     0.10863 -0.05977666 0.2770367 0.4801609
nbDatasets-AdaboostDatasets       0.06966 -0.09874666 0.2380667 0.8991796
RFDatasets-AdaboostDatasets       0.12209 -0.04631666 0.2904967 0.3280533
SVMDatasets-AdaboostDatasets      0.11162 -0.05678666 0.2800267 0.4443862
C45Datasets-BaggingDatasets       0.00501 -0.16339666 0.1734167 1.0000000
IB1Datasets-BaggingDatasets      -0.03730 -0.20570666 0.1311067 0.9969888
JRipDatasets-BaggingDatasets      0.00635 -0.16205666 0.1747567 1.0000000
nbDatasets-BaggingDatasets       -0.03262 -0.20102666 0.1357867 0.9987155
RFDatasets-BaggingDatasets        0.01981 -0.14859666 0.1882167 0.9999531
SVMDatasets-BaggingDatasets       0.00934 -0.15906666 0.1777467 0.9999997
IB1Datasets-C45Datasets          -0.04231 -0.21071666 0.1260967 0.9934440
JRipDatasets-C45Datasets          0.00134 -0.16706666 0.1697467 1.0000000
nbDatasets-C45Datasets           -0.03763 -0.20603666 0.1307767 0.9968180
RFDatasets-C45Datasets            0.01480 -0.15360666 0.1832067 0.9999936
SVMDatasets-C45Datasets           0.00433 -0.16407666 0.1727367 1.0000000
JRipDatasets-IB1Datasets          0.04365 -0.12475666 0.2120567 0.9920847
nbDatasets-IB1Datasets            0.00468 -0.16372666 0.1730867 1.0000000
RFDatasets-IB1Datasets            0.05711 -0.11129666 0.2255167 0.9631438
SVMDatasets-IB1Datasets           0.04664 -0.12176666 0.2150467 0.9882591
nbDatasets-JRipDatasets          -0.03897 -0.20737666 0.1294367 0.9960429
RFDatasets-JRipDatasets           0.01346 -0.15494666 0.1818667 0.9999967
SVMDatasets-JRipDatasets          0.00299 -0.16541666 0.1713967 1.0000000
RFDatasets-nbDatasets             0.05243 -0.11597666 0.2208367 0.9769766
SVMDatasets-nbDatasets            0.04196 -0.12644666 0.2103667 0.9937668
SVMDatasets-RFDatasets           -0.01047 -0.17887666 0.1579367 0.9999994
>
Listing 6.18: Sample R code for executing the corrected random subsampling
t test.
correctedResampttest = function(iter, dataSet, setSize, dimension,
                                classifier1, classifier2){
  proportions <- randomSubsamp(iter, dataSet, setSize, dimension,
                               classifier1, classifier2)
  averageProportion <- mean(proportions)
  sum = 0
  for (i in 1:iter){
    sum = sum + (proportions[i] - averageProportion)^2
  }
  # Corrected resampled t-test (Nadeau and Bengio, 2003): the variance
  # estimate is inflated by the test/train size ratio n2/n1; the 1/9
  # ratio below assumes a 90/10 split (this step is reconstructed).
  variance <- sum/(iter - 1)
  t <- averageProportion/sqrt((1/iter + 1/9) * variance)
  print('The t-value for the corrected resampled t-test is')
  print(t)
}
The code just listed is invoked as in Listing 6.19, with 30 iterations. The result
shown is then analyzed.
Listing 6.19: Invocation and results of the corrected resampling t test code.
> library(RWeka)
Loading required package: grid
>
> NB <- make_Weka_classifier("weka/classifiers/bayes/NaiveBayes")
> iris <- read.arff(system.file("arff", "iris.arff",
    package = "RWeka"))
> correctedResampttest(30, iris, 150, 5, NB, J48)
[1] "The t-value for the corrected resampled t-test is"
[1] 0.5005025
# We transform the percentage accuracies returned by Weka into their
# equivalent in the [0,1] interval in order for these values to
# be proper continuous random variables. cl is the classifier number.
  for (i in 1:iter)
    for (j in 1:k)
      for (cl in 1:2)
        allResults[[i]][[cl]][j] <- allResults[[i]][[cl]][j]/100
  return(p)
}
# T52cvVariance takes as input the output of T52cv and outputs a
# vector containing the estimated variance as calculated by the
# 5x2cv t-test and f-test for each of the 5 iterations
T52cvVariance = function(p){
  # pBar contains the average on replication i
  pBar <- numeric(5)
  for (i in 1:5)
    pBar[i] <- (p[i,1] + p[i,2])/2
  # Estimated variance
  sSquared <- numeric(5)
  for (i in 1:5)
    sSquared[i] <- (p[i,1] - pBar[i])^2 + (p[i,2] - pBar[i])^2
  return(sSquared)
}
We now show, in Listing 6.21, the code used to calculate the results of the
5 × 2-CV test introduced by Dietterich (1998).
p <- Tikcv(5, 2, dataSet, setSize, dimension, numClasses,
           classSize, classifier1, classifier2)
sSquared <- T52cvVariance(p)
# calculating t: Dietterich's 5x2cv statistic divides the difference
# observed on the first fold of the first replication by the square
# root of the averaged variance estimates (denominator reconstructed)
denom <- sqrt(sum(sSquared)/5)
t <- p[1,1]/denom
print('The t-value is equal to')
print(t)
This code is invoked as described in Listing 6.22 and returns the output shown
in the listing.
p <- Tikcv(5, 2, dataSet, setSize, dimension, numClasses,
           classSize, classifier1, classifier2)
sSquared <- T52cvVariance(p)
# calculating f: Alpaydin's (1999) 5x2cv F statistic, the sum of the
# squared fold differences over twice the summed variance estimates
# (this step is reconstructed)
f <- sum(p^2)/(2 * sum(sSquared))
print('The f-value is equal to')
print(f)
}
When the 5 × 2-CV F test is applied to the Iris data (Listing 6.24), we get the
results shown there.
10 × 10-CV Schemes
All the 10 × 10-CV schemes described by Bouckaert (2004) run the same 10 ×
10-CV resampling scheme and compute the mean value of this resampling in the
same manner. What changes from scheme to scheme is the computation of the
variance and of the Z value of the hypothesis test that follows the resampling.
The resampling, per se, is invoked using the function that was already defined in
the context of the 5 × 2-CV tests, namely, Tikcv(), although this time the values
of parameters iter and k are 10 and 10, respectively, rather than 5 and 2. The
mean of this resampling is simply calculated by the code in Listing 6.25.
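A minimal sketch of that mean computation, under the same conventions as the listings that follow (x is assumed to be the 10 × 10 matrix of per-run, per-fold differences returned by Tikcv(), and m the resulting mean):

# Compute the mean of the 10 x 10 resampling
m <- 0
for (i in 1:10)
  for (j in 1:10)
    m <- m + x[i,j]
m <- m/100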
We now show the specific code for each of the four 10 × 10-CV schemes
discussed in this book (these vary in how the sample variance is calculated).
10 × 10-CV Use-All-Data
The code for the 10 × 10-CV use-all-data scheme is given in Listing 6.26.
# Compute the variance
v <- 0
for (i in 1:10)
  for (j in 1:10)
    v <- v + (x[i,j] - m)^2
v <- v/99
# Compute Z
z <- m/(sqrt(v)/sqrt(100))
The code is invoked as follows in Listing 6.27 and yields the results shown.
The Z value is located between 3.2905 and 3.7190, which corresponds to levels
of confidence between 99.9% and 99.98%. We thus conclude that the difference
in means between the two classifiers is statistically significant at the 99.9%
confidence level.
# Compute xDotj
xDotj <- numeric(10)
for (j in 1:10){
  xDotj[j] <- 0
  for (i in 1:10)
    xDotj[j] <- xDotj[j] + x[i,j]
  xDotj[j] <- xDotj[j]/10
}
# Compute the variance
v <- 0
for (j in 1:10)
  v <- v + (xDotj[j] - m)^2
v <- v/9
# Compute Z
z <- m/(sqrt(v)/sqrt(10))
Listing 6.29 shows the results obtained when the code is invoked.
This time, the Z value is located between 1.28 and 1.64, which corresponds to levels of confidence between 80% and 90%. Because 1.64 is not reached, we cannot reject, at the 90% confidence level, the hypothesis that the two classifiers perform similarly.
10 × 10-CV Average-Over-Runs
The 10 × 10-CV average-over-runs scheme is very similar to the 10 × 10-CV
average-over-folds scheme. The code is given in Listing 6.30.
Listing 6.30: Sample R code for executing 10 × 10-CV average-over-runs.
# 10 x 10 CV Average Over Runs
T1010cvRuns = function(dataSet, setSize, dimension, numClasses,
                       classSize, classifier1, classifier2){
  # Compute xDoti
  xDoti <- numeric(10)
  for (i in 1:10){
    xDoti[i] <- 0
    for (j in 1:10)
      xDoti[i] <- xDoti[i] + x[i,j]
    xDoti[i] <- xDoti[i]/10
  }
  # Compute the variance
  v <- 0
  for (i in 1:10)
    v <- v + (xDoti[i] - m)^2
  v <- v/9
  # Compute Z
  z <- m/(sqrt(v)/sqrt(10))
In this test, the Z value is slightly below 3.09; we can thus reject the hypothesis that the two classifiers perform similarly at the 95% confidence level, although not at the 99.9% level.
10 × 10-CV Average-Over-Sorted-Runs
The code for the 10 × 10-CV average-over-sorted-runs variation modifies the
code for the 10 × 10-CV average-over-runs only very slightly. These modifi-
cations occur in only a few places. First, the return/last line of the stratcv()
function is replaced with the following line that sorts the results obtained by
each classifier at each iteration. The function is renamed stratsortedcv() and is
identical to stratcv() in all other respects:
return(list(sort(classifier1ResultArray),
sort(classifier2ResultArray)))
The only other difference occurs in function Tikcv(), which now calls stratsortedcv() rather than stratcv() on line 3. The rest of the code is identical in all respects to that used in the 10 × 10-CV average-over-runs scheme, except for the first print statement, which now states
print('The Z Value obtained in the 10x10CV Average over
Sorted Runs Scheme is')
[1] 0.0001264988
[1] "The Z Value obtained in the 10x10CV Average over Runs
Scheme is"
[1] 1.124649
>
We ran this code four times. In two out of four of the runs, the difference in
the mean of the two classifiers appears highly significant (ridiculously highly
significant in one case!). However, this is not the case in the other two runs
which show less than 80% confidence.
Note that, in all the other 10 × 10-CV schemes, we also found the results to
be somewhat unstable. It would thus be important to investigate these methods
more thoroughly before using them systematically.∗
6.10 Summary
This chapter discussed the philosophy of null-hypothesis statistical testing
(NHST) along with its underlying caveats, objections, and advantages with
the perspective of evaluation of machine learning approaches. We also looked at
the common misinterpretations of the statistical tests and how such occurrences
can be avoided by raising our understanding of the statistical tests and their
frame of application. Building on the notions of machine learning and statistics
reviewed in Chapter 2, we then looked at various statistical tests as applied in
the classifier evaluation settings. The description of relevant statistical tests was
studied in four parts. The first part covered the tests relevant to (and used in)
assessing the performances of two classifiers on a single domain. In particular,
we reviewed the most widely employed two-matched-samples t test. An impor-
tant place was given to the description of the assumptions on which the t test is
built and that need to be verified for it to be valid. More general shortcomings
of the t test, as used by machine learning practitioners, were also discussed. A
nonparametric alternative, called McNemar’s test, was then also discussed. We
also showed how these tests can be extended to the case in which two classifiers
are evaluated on multiple domains, the evaluation setting that was the subject
of the second part. In the second part, we discussed two nonparametric tests,
the sign test and Wilcoxon’s Signed-Rank test, discussing their advantages and
limitations, both overall and relative to each other. Also, we showed how these
can be applied in a single-domain setting and the underlying caveats. The third
part then focused on the more general setting of comparing multiple classifiers
on multiple domains. In particular, we focused on the parametric test ANOVA
and its nonparametric equivalent, Friedman’s test. We also showed how these are
omnibus tests that indicate whether at least one classifier performance difference
among all the comparisons being made is statistically significant. The identifi-
cation of the particular differences, however, cannot be achieved by the omnibus
∗ A possible cause for such behavior could also be the manner in which the R environment handles
variables in consecutive runs. This should also be verified.
tests and is, rather, accomplished using the post hoc tests that we discussed next.
We covered major post hoc tests for both ANOVA and the Friedman tests along
with their assumptions, advantages, and limitations. The fourth and final part
discussed how multiple runs of the simple resampling techniques from Chapter 5
can improve on conventional statistical tests for comparing two classifiers on a
single domain, discussed in this chapter. In particular, it discussed tests based on
multiple runs of random subsampling and tests based on multiple runs of k-fold
cross-validation.
Finally, the chapter concluded with an illustration of all the tests using the R
statistical package.
A power analysis of various post hoc tests with regard to classifier evaluation can be found in (Demšar, 2006).
The random subsampling t test and 5 × 2-CV t test are discussed and com-
pared in terms of their type I error and power in (Dietterich, 1998). The corrected
resampled t test comes from Nadeau and Bengio (2003), and the 5 × 2-CV F
test was described in (Alpaydin, 1999). The various repetitions of 10-fold cross-
validated tests are compared in (Bouckaert, 2004).
Along the lines of Bouckaert (2003), replicability is formally defined as
follows.
Definition 6.1. The replicability of an experiment is the probability that two
runs of the experiment on the same dataset, with the same pair of algorithms
and the same method of sampling the data, produce the same outcome.
Although the type I error states the probability that a difference in outcome is
found over all datasets, replicability states this probability over a single dataset.
It can be quantified as

$$\hat{R}_2(e) = \frac{\sum_{1 \le i < j \le n} I(e_i = e_j)}{n(n-1)/2},$$

where $I(\cdot)$ is the indicator function and $e_i$ denotes the outcome of the $i$th of the $n$ runs of the experiment.
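A small R sketch of this estimate (replicability() is a hypothetical helper; e is assumed to be the vector of outcomes of the n runs, e.g., TRUE/FALSE decisions on whether a difference was found):

replicability <- function(e){
  n <- length(e)
  count <- 0
  # count the agreeing pairs of runs
  for (i in 1:(n-1))
    for (j in (i+1):n)
      count <- count + (e[i] == e[j])
  count/(n*(n-1)/2)
}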
On the other hand, one might specifically be interested in failure analysis and
hence might put a strict requirement on the dataset, that the distribution of
the instances be non-Gaussian. Even for generic algorithms, it is sometimes
desirable to assess their performance on particular data characteristics such as
robustness to particular noise models.
Such requirements on the data have resulted in at least two different
approaches to addressing the problem. The first is what we call the data repos-
itory approach and the second is the synthetic or artificial data approach. Both
approaches have been followed widely, either by themselves or in conjunction
with each other. However, just as with any other component that we have dis-
cussed in the book, it should not come as a surprise that both these approaches
have their respective advantages and shortcomings. Let us discuss these two
approaches along with their benefits and limitations in a balanced manner, and
then move on to some recent proposals for both dataset collection and dissemi-
nation and overall evaluation benchmark design.
Table 7.1. Some of the general machine learning and data mining repositories

Name of the Repository   Hyperlink
UCI Repository            http://archive.ics.uci.edu/ml/
StatLib                   http://lib.stat.cmu.edu/
StatLog                   http://www.the-data-mine.com/bin/view/Misc/StatlogDatasets
METAL                     http://www.metal-kdd.org/
DELVE                     http://www.cs.toronto.edu/~delve/
NASA's datasets           http://nssdc.gsfc.nasa.gov/
CMU Data Repository       http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/learning/0.html
the UCI Repository can be understood by looking at its impact in the field
in terms of the number of citations that the paper introducing the repository
has received. With over 1000 citations and growing, it is one of the top “100”
most cited papers in all of computer science. New domains are constantly
added to the repository, which receives active maintenance. Along with the UCI
Repository comes the UCI Knowledge Discovery in Databases (KDD) Archive,
which complements the UCI Repository by specializing in large datasets and
is not restricted to classification tasks. The archive includes very large datasets
including high-dimensional data, time series data, spatial data, and transaction
data. The creators of this database recognize the positive impact that the UCI
Repository has had on the field, but they lament that the datasets it contains
are not realistic for data mining because of limited size both in terms of the
number of samples and in terms of dimensionality, as well as narrow scope.
The UCI KDD Archive addresses these two issues and others, also expand-
ing the type of datasets to problems other than classification (e.g., regression,
time series analysis) and different objects (e.g., images, relational data, spatial
data).
Although a few criticisms and some defense of these datasets have been
voiced, the question of what datasets we should use has not received significant
attention in the community. In addition, although the advantages brought on by
such repositories have been noted, the disadvantages in terms of both the explicit
issues as well as unintended consequences have not received the attention they
deserve. In this section, we aim to bring the trade-off between the advantages
and limitations of the repository-based approach into perspective. Let us start
with the positives.
The UCI KDD Archive creators gathered more complex datasets as a means to
achieve the same quality of research in data mining. Their efforts were supported
by the NSF Information and Data Management program, which indicates that
practically motivated institutions concur in recognizing the usefulness of such
repositories.
There are other advantages, as well, that were actually mentioned by some of
the UCI Repository's most ardent critics. For example, Saitta and Neri (1998)
concede that evaluating algorithms on these datasets can provide important
insights. Holte (1993) found the repository useful in showing that, although some
real-world datasets are hard, many others are not, including a large number in the
UCI Repository, which were shown to be representative of practical domains.
Finally, another feature of the UCI Repository, which can also be construed as an advantage, is that the repository keeps on evolving, with new datasets contributed to it on a regular basis. As well, its scope has widened through
the addition of the UCI KDD Archive to incorporate datasets reflecting new
concerns such as massive databases.
Given all these advantages that make researchers' lives easier, why should we then be cautious when using such wonderful resources? Let us look at some counterarguments now.
On further analysis, Holte notes that the domains that were used did not
represent the whole range of problems expected to appear in real-world sce-
narios (noting that there are some real problems that are particularly “hard” for
machine learning classifiers). However, he concedes that “the number and diver-
sity of the datasets indicates that they represent a class of problems that often
arises.”
Another criticism of repository-based research has come in terms of its
representativeness of the data mining process. For instance, Saitta and Neri
(1998) contend that the UCI Repository contains only “ready-to-use” datasets
that present only a small step in the overall data mining process. Although
researchers agree that evaluating learning algorithms on these datasets can be
useful, as it can provide interesting insights, the main objections are made along
the following lines:
• Because the evaluations are typically done, even though on a common platform, by different researchers (and possibly over different implementations of the same learning approaches), unintentional or even unconscious tuning of some of the learning parameters can give unfair advantages to the algorithms most familiar to the researchers (e.g., the researchers who originally designed the approaches being tested).
• Because the goal of the comparison is not to use the learned knowledge (which would be what is useful in real-world settings), it is not clear that the results reported on these domains are of any use.
• In an effort to make repository datasets usable off the shelf, they are usually vastly oversimplified, resulting in datasets of limited size and complexity, with the difficult cases often set aside.
The first point is not easy to dismiss. In fact, it has profound implications. As a very trivial example, many studies report results concerning novel algorithms or algorithms of interest pitted against other competitive approaches. These other approaches are typically run using their "default" parameters (generally from a machine learning toolbox implementation such as WEKA), whereas the algorithm of interest to the study is carefully tuned. This can be due in part to the limited familiarity of the researchers with the other approaches, even if we discount intentional attempts (which are not entirely impossible). As a result, such assessments may not reflect a true comparison.
The second point stresses the need for domain knowledge that, indeed, must
be involved or accounted for in any real-world application in order to obtain
meaningful results. Although the point is certainly valid in terms of real-world
applicability, it is relatively less troubling when generic approaches are compared
because, in these cases, the main objective is to demonstrate a wide-ranging util-
ity of the proposed approaches on a common testbed. However, when it comes to
task-specific algorithms, this concern is indeed more serious. Nevertheless, using general repositories, such as the UCI, for evaluating task-specific approaches may not be a good idea in the first place. Keeping such points in perspective, various task- or domain-specific repositories are now appearing that provide a more meaningful set of data to evaluate approaches designed for these specific classes of problems.
The final point on the list is indeed true in many cases. However, with
novel datasets being added regularly, especially as a result of data acquisitions
from new and more mature technology, we hope that such reservations will
be addressed in the near future. Furthermore, a full disclosure on the part of
researchers as well as data submission groups would make it easier to interpret
the subsequent results of learning approaches accordingly and be aware of the
caveats implicit in such validation. We must keep in mind, however, that this
issue gains more relevance when the scalability of the algorithms to the real
world is concerned. In this respect, views such as those of Saitta and Neri (1998)
suggest that a good performance on general repository data should be seen as
a means to obtaining good results, but not as the goal of the research. Hence it
is not sufficient to demonstrate better performance of some algorithms against
others on these datasets, but rather it is both important and more relevant to
analyze the reasons behind the different behaviors.
the current issue can be to a great extent addressed by making the underlying sta-
tistical framework more robust (for instance, by using higher significance levels).
classification domains and used more sensitive performance measures than those
of the earlier studies. However, the research in this direction has been limited and
has not gained community-wide attention. One of the reasons for such limited
focus on these studies is probably the limited applied use of machine learning
methods up until very recently. Consequently such studies were either deemed
not very practical to perform on a large scale or not very relevant because of
perceived lack of novelty. However, as the applications of learning approaches
increase in various fields and on a variety of problems, such empirical studies,
although limited in their novelty value, not only offer some interesting insights
into both the learning process and the data characteristics, but also expand our
understanding of the nature of the domains and algorithms, as well as their
impact on evaluation. This issue is increasingly important with the rise in both
the number and size of data repositories and libraries of learning algorithms
aimed at standardizing the evaluation process to a significant extent.
The typical approaches to selecting datasets for evaluating the algorithms have focused on either choosing these datasets from a repository (typically UCI), choosing them at random (of course with minor considerations of size and dimensionality), or using the ones that have appeared in earlier studies in similar contexts. The verdict on the best approach to choosing the most suitable datasets for evaluation is not yet in. An argument can be made for choosing the sets that demonstrate
the strength of (or are most difficult for) a learning approach. Another argu-
ment can also be made for the choice of random but wide-ranging domains for
evaluating generic algorithms, because eventually the purpose of testing such
algorithms is precisely to demonstrate their applicability across the board. The
insights obtained from metalearning approaches as well as other empirical stud-
ies, however, can have some more profound implications in the direction of our
increased understanding.
As the applied thrust for learning approaches gains momentum, more such
studies are needed. Until that is done, though, let us shift our attention toward
another approach to dataset selection, the artificial data approach.
1 However, this argument should be taken with a grain of salt because the underlying assumption is
that the limited data is not representative of the actual domain. Hence empirical models obtained
from such data would inevitably face similar issues, and the resulting samples would, at best, be grossly approximate.
In the wake of such serious objections, why then bother about artificial data
at all? Well, as a rule of thumb, while looking at any approach, it is necessary to
check if we are asking the right questions. This should, in fact, be viewed as a
general rule, at least in evaluation. Throughout the book, when discussing various
concepts, we have explicitly or implicitly highlighted this concern. Indeed,
knowing the basic assumptions of any process, the constraints under which it is
applied and the proper interpretation of the results has been a common theme in
all our discussions. The same principle applies in this case too. Is the aim of the
artificial-data-based approach to give a verdict on superiority of some particular
algorithm? Is such an approach capable of discerning statistically significant
differences over the generalization performance of the classifiers compared? We
would, at least in a broader sense, answer both these questions in the negative.
What is important is the ability to control various aspects of the evaluation process in an artificial data approach.
This trait gives us the ability to perform specific experiments aimed at assess-
ing algorithms’ behavior on certain parameters of interest while controlling for
others in a relatively precise manner. Such flexibility is typically impossible in
any real-world setting. Consider the problem of assessing the impact of chang-
ing environments on classifier performance. The topic of changing environments
has led to a new line of research in statistical machine learning. In the relatively
limited sense that it is pursued currently, the problem of changing environments
refers to the scenario in which the distributions on which the algorithm is eventu-
ally tested (deployed) differs from the domain distribution on which it has been
trained. Theoretical advancements have been made in this direction. However,
such efforts have currently focused on limited scenarios (which are controlled
to track for changing distributions). Although the problem is ubiquitous in the
practical world, very few datasets are publicly available that address this issue.
Given both the dearth of appropriate real-world data and the extremely limited
tuning flexibility over them, a controlled study is almost impossible. Accord-
ingly, the artificial data approach has been immensely helpful in obtaining such
preliminary but encouraging and verifiable results. The work of Alaiz-Rodriguez
and Japkowicz (2008) has relied on generating artificial datasets to study empir-
ically the behavior of various algorithms’ performance. Such observations can
potentially complement the theoretical insights obtained by learning theoretic
approaches, as well as understanding the premise of the problem itself. In par-
ticular, their empirical study relied on the artificial data approach to validate a
version of the Occam’s razor hypothesis, which states that, all other things being
equal, simple classifiers are preferable over complex ones. More specifically,
the study was aimed to test the hypothesis emphasized by Hand (2006) that
simple classifiers are more robust to changing environments than complex ones.
They in fact found that, in many cases, Hand's (2006) hypothesis was not veri-
fied. Artificial datasets have been significantly relied on in this case because the
major repositories do not contain datasets that are reasonably representative of
this does not necessarily mean that efficient data generators can be obtained to
expand on these empirical observations.
There have also been proposals, based on community participation, so as to
develop access to data, learning algorithms, and their implementation, as well as
methodologies to perform coherent algorithm evaluation that not only can scale
up to real-world situations, but are also robust and statistically verifiable. Among
the proposals recently put forward to address the issue, an interesting argument
is that of community participation in the process. Taking inspiration from the
success of collaborative projects such as Wikipedia, in which the collaboration
comes from the effort of the community, this line of argument proposes partic-
ipation of relevant researchers as well as groups in the evaluative process. An
example of such a proposal is that by Japkowicz (2008). In this, she contends that
the World Wide Web can be a powerful tool in facilitating such an effort. The
basic idea underlying the argument is that the resources for the evaluative pro-
cess can be made available in collaboration with various groups with a stake in
the process. For instance, data mining and other groups (e.g., hospitals, clinical
research centers, news groups) could contribute real-world data for evaluation
and outline their goals clearly. Meanwhile, the machine learning community
could offer public releases of algorithmic implementations, with comprehen-
sive documentation. Participants from the statistical analysis community could
provide guidelines for robust statistical evaluation and assessment techniques
in concordance with the data provider’s stated purpose. Performing subsequent
evaluation of approaches as well as other studies and analysis, for instance in
data characterization or the effects of statistical analysis techniques, could not
only benefit respective communities, but could also provide important insights to
other participating parties with regard to real-world implications of their work.
For instance, data-gathering groups could analyze their data in order to make
their respective data-acquisition process more robust (e.g., when the concerns
regarding missing values or noisy data are realized), whereas the statistically ori-
ented groups would gain insight into the empirical applications of their analysis
techniques. The learning community could also obtain a deeper understanding
of the real-world application settings and the ways to make their approaches
amenable to these applications in a robust and reliable manner. Of course, for
such projects to succeed, it would be imperative for each party to provide as
much precision and detail about their contributed components as possible.
With a growth in participating groups and proper organization, such efforts
could have profound implications, not only on specific elements of the overall
evaluation process, but also on the interrelationships among various disciplines.
This could also bolster an appreciation of their strengths and the limitations
under which they operate. This would hence not only facilitate the evaluation,
but also strengthen the foundation of interdisciplinary research, the future of
science. Such arguments have been voiced earlier too, as evidenced by the very
existence of the UCI Repository itself.
7.5 Summary
This short chapter discussed both practical and (partly) philosophical arguments for and against different approaches to selecting and using domains of application for the purpose of evaluating learning algorithms. The main approaches currently employed for generic evaluation in
different studies include choosing the datasets from general repositories such as
the UCI Repository for Machine Learning or using synthetic datasets. Both these
methods have their respective advantages and limitations. In the case of the for-
mer, real-world datasets can better characterize the settings and the challenges
that the learning algorithm can face when applied in such a setting, but also
suffer from limitations such as the community experiment effects and limited
control over the test criteria with regard to the algorithm of interest. The latter
approach, on the other hand, enables evaluating specific aspects of the learning
algorithms because the data behavior can be precisely controlled, but is limited
in its ability to replicate real-world settings. Repositories have also appeared
for approaches designed to address a specific task of interest (e.g., image seg-
mentation). Analogous arguments can be made with regard to their utility and to alternatives based on simulated data as well. Relatively novel arguments for coming
up with better approaches for domain selection and usage have appeared in the
form of community participation over Web-based solutions. The jury is still out
on such new proposals.
This chapter completes the discussion of the various components of the eval-
uation framework for learning algorithms that we started in Chapter 3 with
performance measures. With regard to different components, such as perfor-
mance measures, error estimation, statistical significance testing and dataset
selection, we mainly focused on the approaches that have become mainstream
(although in very limited use) in the sense that their behavior, advantages, and
limitations against competing approaches are relatively well understood. In the
next chapter, we present a brief discussion on the attempts either to offer alter-
native approaches to address different components or extensions of the current
approaches that have appeared relatively recently. These attempts can be impor-
tant steps forward in developing our understanding of a coherent evaluation
framework.
and remains the most complete and authoritative repository of domains for
machine learning. It is still maintained and often expanded to include new con-
tributed datasets. Other application-specific repositories have also appeared,
such as the IBSR (http://www.cma.mgh.harvard.edu/ibsr/), the GEO database
(http://www.ncbi.nlm.nih.gov/geo/), and the Stanford microarray database
(http://smd-www.stanford.edu/).
The no-free-lunch theorems were formulated in (Wolpert, 1996) and then
with regard to optimization in (Wolpert and Macready, 1997).
The question of how relevant the UCI Repository domains are to data mining
research was scientifically investigated by Soares (2003). To do so, he compared
the distribution of the relative performance of various algorithms on a set of
data known to be relevant to data mining research with that on a large subset of
datasets contained in the UCI Repository. His statistical analysis revealed that
there is no evidence that the UCI domains (at least, those containing over 500
samples) are less relevant than the domains known to be relevant. Soares (2003)
also investigated the claim that machine learning researchers overfit the UCI
Repository. In particular, Soares (2003) tested whether an algorithm overfits the
Repository by testing whether its rank is higher in the repository than it is in the
domains that are not included in the repository. Once again, his study revealed
that there is no statistical evidence that suggests that algorithms rank higher
in the repository than in the other datasets, and thus there is no evidence that
supports the claim of overfitting. However, more research needs to be done to
guarantee reliability of the employed statistical framework.
The community experiment effect has been discussed by, among others,
Salzberg (1997) and Bay et al. (2000).
Smith-Miles (2008) gives a comprehensive survey of metalearning research
for algorithm selection. Although she looks at various kinds of algorithms, we
focused on only classification algorithms here as we followed her discussion.
Other references to metalearning approaches can be found in the works of
Rendell and Cho (1990), Aha (1992), Brodley (1993), and, more recently, Ali
and Smith (2006).
The study by Ganti and colleagues referred to in the text can be found in (Ganti
et al., 2002). Deformations of real data and their extension by use of artificial
methods discussed in the text refers to (Narasimhamurthy and Kuncheva, 2007).
8
Recent Developments
We reviewed the major components of the evaluation framework in the previous chapters and described in detail the various techniques pertaining to each, together
with their applications. The focus of this chapter is to complement this review
by outlining various advancements that have been made relatively recently, but
that have not yet become mainstream. We also look into some approaches aimed
at addressing problems arising from the developments on the machine learning
front in various application settings, such as ensemble classifiers.
Just as with the traditional developments in performance evaluation, recent
attempts at improving as well as designing new performance metrics have led
the way. These have resulted in both improvements to existing performance
metrics, thereby claiming to ameliorate the issues with the current versions, and
proposals for novel metrics aimed at addressing the areas of algorithm evalu-
ation not satisfactorily addressed by current metrics. We discuss in brief some
of these advancements in Section 8.1. In Section 8.2, we focus on the attempts
at unifying these performance metrics as well as studying their interrelation in
the form of both theoretical and experimental frameworks. A natural extension
to such studies is the design of more general or broader measures of perfor-
mance, either as a result of insights obtained from the theoretical framework
or by combining existing metrics based on observations from the experimental
frameworks. Such metric combinations for obtaining general measures are the
focus of Section 8.3. Then, in Section 8.4, we outline some advancements in
statistical learning theory, the branch of machine learning aimed at character-
izing the theoretical aspects of learning algorithms, that can potentially lead to
more informed algorithm evaluation approaches that can take into account not
just the empirical performance, but also the specific properties of the learning
algorithms. Finally, in Section 8.5, we discuss recent findings and developments
with regard to other aspects of algorithmic evaluation.
(2006, 2009), among others. These are indeed clearly articulated by Hand (2009),
who also proposes an alternative summary statistic called the H measure to
alleviate these limitations.1 This measure depends on the class priors, unlike the
AUC, and addresses one of the main concerns of the AUC, that of treating the
cost considerations as a classifier-specific problem. This indeed should not be
the case because relative costs should be the property of the problem domain,
independently of the learning algorithm applied.
Other novel metrics have also been proposed such as the scored AUC (abbre-
viated SAUC) (Wu et al., 2007) aimed at addressing the dependency of AUC
on score imbalances (fewer positive scores than negative, for instance), implic-
itly mitigating the effects of class imbalance. In a similar vein, Klement (2010)
shows how to build a more precise scored ROC curve and calculate a SAUC from
it. Santos-Rodríguez et al. (2009) investigate the utility of the adjusted Rand
index (ARI), a commonly used measure in unsupervised learning, for perfor-
mance assessment as well as model selection in classification. These and other
novel metrics have yet to be rigorously studied and validated against current
measures and so are not yet mainstream.
The issue of asymmetric cost, in which the cost of misclassifying an instance
of one class differs from that of other class(es), has also received considerable
attention, albeit in the context of specific metrics. The inherent difficulty in
obtaining specific cost has long been appreciated by the machine learning com-
munity leading, in part to cost- or skew-ratio approaches such as ROC analysis,
based on the premise that even though quantifying specific misclassification
costs might be difficult, it might be possible to provide relative costs. Other
efforts have also been made with regard to cost-sensitive learning, as it is quite
often referred to. See, for instance, (Santos-Rodríguez et al., 2009) with regard to using Bregman divergences for this purpose, (O'Brien et al., 2008) for cost-sensitive learning using Bayesian theory, (Liu and Zhou, 2006) for approaches based on training instance weighting, and (Landgrebe et al., 2004) for examples of such attempts in experimental settings; see also (Zadrozny and Elkan, 2002; Lachiche and Flach, 2003; Zadrozny et al., 2003). Previous attempts to
perform cost-sensitive learning with regard to individual classifiers include ones
such as (Bradford et al., 1998, Kukar and Kononenko, 1998), and (Fan et al.,
1999).
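As a simple illustration of such asymmetric costs (a sketch with hypothetical counts and costs, not an example from the text), the cost-sensitive counterpart of the raw error count weighs each cell of the confusion matrix by its misclassification cost:

# rows: true class; columns: predicted class (hypothetical counts)
confusion <- matrix(c(90, 10,
                       5, 95), nrow = 2, byrow = TRUE)
# misclassifying class 1 as class 2 is five times costlier here
costs <- matrix(c(0, 5,
                  1, 0), nrow = 2, byrow = TRUE)
sum(confusion * costs)   # total misclassification cost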
Other proposed metrics include extensions to existing metrics and new mea-
sures for ensemble classifiers. Various approaches with regard to combination
of classifiers and their subsequent evaluation have been proposed. Some specific
works include those of Kuncheva et al. (2003) and Melnik et al. (2004) for ana-
lyzing accuracy-based measures, and Lebanon and Lafferty (2002), Freund
et al. (2003), and Cortes and Mohri (2004) for alternative measures in such
scenarios. Theoretical guarantees and analysis have also been proposed with
1 to 5, 1 being the worst and 5 the best. More sophisticated classifiers, such as
neural networks, can be rated as 1 on the scale whereas simple decision rules
can be rated as 5. However, the relationships between different scale values
may not be easy to interpret. For example, what about a rule-learning system,
which is often more understandable than a neural network? Should it be given
a 5? Should both the decision tree and the rule learner be given 5 or should the
decision tree be demoted to 4? How about naive Bayes? Does it belong at 3,
perhaps? If so, is the difference between 4 (the decision tree) and 5 (the rule-
based learner) the same as the difference between 3 (naive Bayes) and 4 (the
decision tree)? Probably not. Decision trees and rule-based learners may seem
closer in understandability than naive Bayes and decision trees; again, this is a
subjective opinion. Consequently the representation of the scale itself for such
criteria is important. For instance, Nakhaeizadeh and Schnabl (1998) suggest
the following raw scale for understandability:
0 0 1 ==> low understandability
0 1 1 ==> medium understandability
1 1 1 ==> high understandability
Although the scale can be simplified to the first two bits, the third bit can
be used to denote finer scale values. Increasing the number of bits can result
in a finer resolution, but the approach can become impractical with too many
ordinal values. Alternatively, a single output taking natural number values can
be used, but again this representation remains arbitrary (because the differences
between 1 and 2, and 3 and 4 are not necessarily equal, even though they are in
this representation) and is not robust (if someone arranged the scale from 0 to 4
rather than 1 to 5, the values would not be correct anymore).
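One representation that at least preserves the ordering without asserting equal spacing between values is R's ordered factor type (a side sketch, not a proposal from the text):

> understandability <- factor("medium",
    levels = c("low", "medium", "high"), ordered = TRUE)
> understandability < "high"
[1] TRUE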
Taking into account qualitative criteria along with the empirical-performance-
based assessment of classifier performance can indeed be more insightful. How-
ever, striking a proper balance between their trade-offs, which is inevitable,
has not been properly formalized yet. Some attempts have been made at com-
bining such criteria, as we will briefly see a little later. For now, however, we
shift our focus toward some attempts at studying the performance measures
in a unified manner and studying their interrelationship both theoretically and
experimentally.
in a specific interval, typically [0, 1], with necessary sign adjustments so that
higher values are better.
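In Nakhaeizadeh and Schnabl's formulation, the efficiency of an algorithm a then takes a ratio form along the following lines (a reconstruction from the surrounding description; the weights $w_i$, $w_j$ and the normalized metric values $m_i(a)$, $m_j(a)$ are assumed notation):

$$\mathrm{eff}(a) = \frac{\sum_i w_i\, m_i(a)}{\sum_j w_j\, m_j(a)},$$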
where the index i runs through positive metrics and j runs through negative
metrics.
As can be easily seen here, assigning the $w_i$'s is, unfortunately, nontrivial.
Nakhaeizadeh and Schnabl (1997) propose to set them by optimizing the effi-
ciency of all the algorithms simultaneously. These efficiencies should be as close
to 100% as possible and none should exceed 100%. This optimization problem
can then be solved with linear programming techniques. In addition to this gen-
eral idea, Nakhaeizadeh and Schnabl (1998) go on to discuss how the approach
deals with subjective judgments of how the different criteria are assessed. This
is implemented by applying restrictions on the automatic weight computations
previously discussed that correspond to the user’s preferences.
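This simultaneous optimization is the classical data envelopment analysis (DEA) setup; a minimal sketch of its standard linearization using the lpSolve package follows. It is not necessarily the exact procedure of Nakhaeizadeh and Schnabl; dea_efficiency, pos, and neg are hypothetical names.

library(lpSolve)

# DEA-style efficiency of algorithm a0 (a sketch).
# pos[a, i]: value of positive metric i for algorithm a (higher is better)
# neg[a, j]: value of negative metric j for algorithm a (lower is better)
dea_efficiency <- function(pos, neg, a0){
  nA <- nrow(pos); nP <- ncol(pos); nN <- ncol(neg)
  # Maximize the weighted positive metrics of a0 ...
  obj <- c(pos[a0, ], rep(0, nN))
  # ... subject to its weighted negative metrics summing to 1 ...
  norm <- matrix(c(rep(0, nP), neg[a0, ]), nrow = 1)
  # ... and no algorithm's efficiency exceeding 100%.
  caps <- cbind(pos, -neg)
  res <- lp("max", objective.in = obj,
            const.mat = rbind(norm, caps),
            const.dir = c("=", rep("<=", nA)),
            const.rhs = c(1, rep(0, nA)))
  res$objval   # the efficiency of a0, in [0, 1]
}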
Figure 8.1. The traditional and proposed approaches to classifier performance evaluation.
Source: (Japkowicz et al., 2008).
Figure 8.2. Classifier view on all the domains using the five evaluation metrics.
inferences, based on the nature of the classifiers that these algorithms output.
That is, we can quantify some of the qualitative criteria, describing the perfor-
mance of the learning algorithms on the domains of interest. Indeed, different
learning algorithms can have different learning biases owing to the different
classifier spaces that they explore. However, the fact that they operate under
similar optimization constraints can be telling in terms of their respective abil-
ities to yield general results. Consider, for instance, a framework within which
the quality of an algorithm (in terms of how well the resulting classifier will
generalize to future data) is judged, by combining the performance it obtains on
some training data with a measure that quantifies the complexity of the resulting
classifier. Among the many ways in which such complexity can be characterized,
one is the extent to which the algorithm can compress the training data (i.e., identify a subset of the most important examples sufficient to represent the classifier). This is the
classical sample compression framework of learning. Characterizing two learn-
ing algorithms, such as decision trees and decision lists, within this framework
would then enable measuring their relative performances with regard to both the
empirical risk and compression constraints. That is, this will essentially enable
a user to characterize which algorithm manages to obtain the most meaningful
trade-off between these quantities under its respective learning bias. Training-set
bounds that characterize the generalization error of learning algorithms in terms
of their empirical risk, compression, and possibly other criteria of interest (pri-
ors on data, for instance) can be used to compare the performances of learning
algorithms. These give rise to the so-called algorithm-dependent approach to evaluation. Even though various training-set bounds have appeared for learning algorithms under various learning frameworks, these, with very few exceptions, generally work only in asymptotic limits. We subsequently discuss a practical bound.
Providing generalization error bounds on a classifier also involves character-
izing an algorithm within a probabilistic framework (akin to demonstrating the
confidence in the presented results). In the context of the approaches presented
in this book with regard to both reliability as well as hypothesis testing, the prob-
ably approximately correct (PAC) framework draws a close parallel. It provides
approximate guarantees on the true error of a classifier with a given confidence
parameter δ (similar to the α confidence parameter but not necessarily with the same interpretation), as can be seen in the sample bounds subsequently presented.
With regard to the dependence that these guarantees have on the algorithm and
the framework that characterizes it, the generalization error bounds or risk
bounds can be categorized as the training-set bounds and the test-set (holdout)
bounds. The training-set bounds typically rely on two aspects: the error of the
algorithm on the training set and the properties of the algorithm itself. The first
aspect obviously results from the limited data availability, requiring the empir-
ical risk on the training set to approximate the true risk. However, an insight
into the algorithm’s behavior can prove to be a significant help in reducing our
reliance solely on the training error and hence helps avoid overfitting so as to
yield better estimates on the true risk.
The holdout bounds, on the other hand, are the guarantees on the true error of
the classifier obtained on a given test set and can be obtained without reference
to the learning algorithm in question. With regard to this algorithm-independent
way of doing evaluation (as we have done so far) too, learning theory gives more
meaningful results in the form of alternative confidence bounds on the test-set
performance of the classifier. We briefly presented one such bound in Chapter 2,
which was based on Hoeffding’s inequality. Let us take a look at a tighter version
of this bound, based on the binomial tail inversion, and see how it can better model the holdout error of the classifier.
As should be obvious by now, although the test-set bounds can readily be
utilized for performance assessments, there are many challenges with regard to
training-set bounds owing to the difficulties in efficiently characterizing different
algorithms in the theoretical frameworks of interest.
In the holdout setting, we wish to model the behavior of the true risk as close to the observed empirical risk as possible. That is, we are not interested in modeling random
variables merely in the [0, 1] interval, but rather in modeling the empirical risk
of the classifier for lower values (values closer to zero). However, for smaller
values of empirical risk, a binomial distribution cannot be approximated by a
Gaussian. This observation was also made by Langford (2005). Consequently,
applying a Gaussian assumption results in estimates that are overly pessimistic
when obtaining an upper bound and overly optimistic when obtaining a lower
bound around the empirical risk. Langford (2005) also showed a comparison
between the behavior of the two distributions with an empirical example of
upper bounds on the risk of a decision tree classifier on test datasets. Let us
derive a holdout risk bound and illustrate this effect empirically. To do so, we
use the holdout bound derived by Shah (2008) and Shah and Shanian (2009)
by using a binomial tail inversion argument (the derivation of these bounds is
provided at the end of this chapter).
We define the binomial tail inversion to be the largest true error such that the
probability of observing λ or fewer errors is at least δ as
$$\overline{\mathrm{Bin}}(m, \lambda, \delta) = \max\{p : \mathrm{Bin}(m, \lambda, p) \ge \delta\},$$
where $\mathrm{Bin}(m, \lambda, p)$ is the cumulative binomial probability of observing $\lambda$ or fewer errors among $m$ examples when the true risk is $p$.
Now, if each of the examples of a test set is obtained i.i.d. from some arbitrary
underlying distribution D, then an upper bound on the true risk of the classifier
R(f ), output by the algorithm, can be defined as follows:
Theorem 8.1. For all classifiers f , for all D, for all δ ∈ (0, 1]:
$$\Pr_{T \sim D^m}\!\left(R(f) \le \overline{\mathrm{Bin}}(m, \lambda, \delta)\right) \ge 1 - \delta.$$
From this result, it follows that $\overline{\mathrm{Bin}}(m, \lambda, \delta)$ is the smallest upper bound that
holds with probability at least 1 − δ on the true risk R(f ) of any classifier f
with an observed empirical risk RT (f ) on a set of m examples.3
In an analogous manner, a lower bound on R(f ) can be derived:
Theorem 8.2. For all classifiers f , for all D, for all δ ∈ (0, 1]:
$$\Pr_{T \sim D^m}\!\left(R(f) \ge \min_{p}\{p : 1 - \mathrm{Bin}(m, \lambda, p) \ge \delta\}\right) \ge 1 - \delta.$$
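In practice, both bounds can be computed numerically with base R alone. The following is a minimal sketch (the function names are ours, not from any package) that obtains them by root finding on the cumulative binomial distribution:

# Upper bound (Theorem 8.1): the largest p such that observing lambda or
# fewer errors out of m still has probability at least delta.
bin_inv_upper <- function(m, lambda, delta) {
  stopifnot(lambda < m)
  uniroot(function(p) pbinom(lambda, m, p) - delta, c(0, 1), tol = 1e-12)$root
}

# Lower bound (Theorem 8.2): the smallest p such that 1 - Bin(m, lambda, p)
# is at least delta.
bin_inv_lower <- function(m, lambda, delta) {
  stopifnot(lambda < m)
  uniroot(function(p) (1 - pbinom(lambda, m, p)) - delta, c(0, 1), tol = 1e-12)$root
}

# Example: 15 errors observed on a 200-example test set, delta = 0.05.
c(lower = bin_inv_lower(200, 15, 0.05),
  upper = bin_inv_upper(200, 15, 0.05))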
Naturally, the smaller the value of δ, the higher the confidence in the estimate of the true risk (and by consequence, the looser
the bound) and vice versa. Such models generally provide these guarantees
over the future classifier performance in terms of its empirical performance and
possibly some other quantities obtained from training data and some measure
of the complexity of the classifier space that the learning algorithm explores.
Such measures have appeared in the form of Vapnik–Chervonenkis dimen-
sion (VC dimension), Rademacher complexities, and so on (see, for instance,
Herbrich, 2002, for discussion). Bounds on specific resampling techniques have
also appeared, the most prominent of these being the leave-one-out error bounds
(see, e.g., Vapnik and Chapelle, 2000). There are other learning frameworks
that do not explicitly include the algorithm’s dependence on the classifier space
complexity in the risk bound and hence have an advantage over conventional
bounds that do. This is because the complexity measure grows with the size (and
complexity) of the classifier space and often results in unrealistic bounds.
A brief introduction to statistical learning theory can be found in Bousquet
et al. (2004).
Successful attempts in the direction of attaining practical, realizable bounds have appeared, a few of them specifically designed within the sample compression framework (see, for instance, Shah, 2006). Briefly, this framework
relies on characterizing a classifier in terms of two complementary sources of
information, viz., a compression set Si , and a message string σ . The compression
set is a (preferably) small subset of the training set S, and the message string is
the additional information that can be used to reconstruct the classifier from the
compression set. Consequently, this requires the existence of a reconstruction function that can reconstruct the classifier solely from this information. The
risk bound that we subsequently present as an example bounds the risk of the
classifier represented by f = (σ, Si ) over all such reconstruction functions.
The bound presented is due to Laviolette et al. (2005), who also utilized this
bound to perform successful model selection in the case of the scm algorithm.
Theorem 8.3. For any reconstruction function R that maps arbitrary subsets
of a training set and message strings to classifiers, for any prior distribution
PI on the compression sets and for any compression-set-dependent distribution
of messages PM(Si ) (where M denotes the set of messages that can be supplied
with compression set Si ), and for any δ ∈ (0, 1], the following relation holds
with probability 1 − δ over random draws of S ∼ D m :
$$\forall f: \quad R(f) \le 1 - \exp\left(\frac{-1}{m - |S_i| - mR_S(f)}\left[\ln\binom{m - |S_i|}{mR_S(f)} + \ln\frac{1}{P_I(i)\,P_{M(S_i)}(\sigma)\,\delta}\right]\right), \qquad (8.1)$$
where RS (f ) is the mean empirical risk of f on S\Si (i.e., on the examples that
are not in the compression set).
As can be seen, the preceding bound will be tight when the algorithm can find
a classifier with a small compression set (a property known as sparsity) along
with a low empirical risk. The preceding bounds apply to all the classifiers in a
given classifier space uniformly, unlike the test-set bound. Hence the training-
set bound focuses precisely on what the learning algorithm can learn (in terms
of its reconstruction) and its empirical performance on the training data. As
also discussed before, training-set bounds such as the one shown above also provide an optimization problem for learning, and, theoretically, a classifier that minimizes the risk bound should be selected. However, this statement should be considered more carefully. As also discussed by Langford (2005), choosing a classifier based on the risk bound only guarantees a better worst-case bound on the true risk of the classifier. This is different from obtaining an improved estimate of the true risk. Generally, measures such as the empirical risk that guide the model selection behave better in this respect. Some successful examples of
learning from bound minimization do exist, however. See for instance, Laviolette
et al. (2005) and Shah (2006).
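The numerical behavior of bound (8.1) is also easy to explore directly. The following sketch (our own illustration; the prior masses and the counts are hypothetical values chosen for the example) evaluates the bound and shows how sparsity, i.e., a small compression set, tightens it:

# Evaluate the sample compression bound (8.1).
#   m   : training-set size
#   si  : compression-set size |S_i|
#   k   : number of training errors, i.e., k = m * R_S(f)
#   p_i : prior mass P_I(i) of the compression set (hypothetical value below)
#   p_m : message mass P_M(S_i)(sigma) (hypothetical value below)
compression_bound <- function(m, si, k, p_i, p_m, delta) {
  1 - exp((-1 / (m - si - k)) *
            (lchoose(m - si, k) + log(1 / (p_i * p_m * delta))))
}

# A sparser classifier (smaller compression set) yields a tighter bound.
compression_bound(m = 500, si = 10,  k = 15, p_i = 1e-4, p_m = 0.01, delta = 0.05)
compression_bound(m = 500, si = 150, k = 15, p_i = 1e-4, p_m = 0.01, delta = 0.05)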
Sample compression algorithms such as the scm can naturally be characterized within this framework. So can algorithms such as decision trees (see, e.g., Shah, 2007).
However, algorithms such as svm, although characterizable within the sample
compression framework, are not originally designed with sparsity as the learn-
ing bias. Hence such a comparison will always yield biased estimates. On the
other hand, sample compression algorithms, such as the scm, consider the clas-
sifier space that is defined after having the training set at hand (because each
classifier is defined in terms of a subset of the training set), a notion widely
known as data-dependent settings. Therefore a complexity measure such as the
VC dimension, defined without reference to the data and applicable in the case
of svms, cannot characterize the complexity of the classifier space that sample
compression algorithms explore.
Other considerations also come into play here, such as the resulting nature of
the optimization problem when such frameworks are used. Also, how to obtain
tight-enough training-set bounds currently remains an active research question.
Examples such as those previously shown are few. It would be interesting to
see advances on this front in the near future and their impact on the field of
classifier evaluation. The test-set bounds, on the other hand, provide a readily applicable alternative to the confidence-interval-based approaches, offering a more meaningful characterization of a classifier's empirical performance.
8.6 Summary
This chapter provided a glimpse into the different developments taking place in
various directions of machine learning research that can have a direct or indirect
impact on the issue of evaluation of learning algorithms. Rather than being exhaustive, we have tried to discuss some of the main threads in this direction, providing details where we felt the issues are important and can have a long-term impact in the evaluation context. Among the advancements we discussed are general frameworks for characterizing performance measures, metric-combination approaches, and insights from statistical learning theory, in addition to some important isolated advances and analysis efforts. However, a definitive word on the status of these various approaches is still awaited, and so is their adoption into the mainstream evaluation
framework. The reader is encouraged to follow the references provided in the
text and the ever-growing literature in the field to obtain details regarding these
and other developments.
In the next chapter, we wrap up our discussion from the chapters so far by
putting into perspective how these different components and areas of evaluation
constitute different components of a complete evaluation framework. We will
also emphasize how the interdependencies of these components need to be taken
into account when choices are made at different stages of evaluation.
The empirical risk of a classifier $f$ on a test set $T$ of $m$ examples is defined as
$$R_T(f) \stackrel{\mathrm{def}}{=} \frac{1}{m}\sum_{i=1}^{m} I\big(f(x_i) \ne y_i\big) = \mathbb{E}_{(x,y)\sim T}\, I\big(f(x) \ne y\big).$$
If the true risk of $f$ is $R(f)$, the number of errors on $T$ follows a binomial distribution:
$$\Pr_{T \sim D^m}\big(mR_T(f) = \lambda \mid R(f)\big) = \binom{m}{\lambda}\,(R(f))^{\lambda}\,(1 - R(f))^{m-\lambda}.$$
We use the cumulative probability, which is the probability of λ or fewer errors
on m examples:
$$\mathrm{Bin}\big(m, \lambda, R(f)\big) = \sum_{i=0}^{\lambda} \binom{m}{i}\,(R(f))^{i}\,(1 - R(f))^{m-i}.$$
The binomial tail inversion is then
$$\overline{\mathrm{Bin}}(m, \lambda, \delta) = \max\{p : \mathrm{Bin}(m, \lambda, p) \ge \delta\},$$
which is the largest true error such that the probability of observing λ or fewer errors is at least δ.
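As a quick numerical sanity check (our own illustration in base R), the cumulative sum above coincides with R's built-in binomial distribution function:

# Verify that the cumulative sum equals pbinom(lambda, m, p) in base R.
m <- 100; lambda <- 7; p <- 0.1
manual <- sum(choose(m, 0:lambda) * p^(0:lambda) * (1 - p)^(m - 0:lambda))
all.equal(manual, pbinom(lambda, m, p))  # TRUE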
Then, the risk bound on the true risk of the classifier is defined as follows (Langford, 2005). For all classifiers f , for all D, for all δ ∈ (0, 1],
$$\Pr_{T \sim D^m}\!\left(R(f) \le \overline{\mathrm{Bin}}(m, \lambda, \delta)\right) \ge 1 - \delta.$$
From this result, it follows that $\overline{\mathrm{Bin}}(m, \lambda, \delta)$ is the smallest upper bound that
holds with probability at least 1 − δ, on the true risk R(f ) of any classifier f
with an observed empirical risk RT (f ) on a set of m examples.
In an analogous manner, a lower bound on R(f ) can be found:
$$\Pr_{T \sim D^m}\!\left(R(f) \ge \min_{p}\{p : 1 - \mathrm{Bin}(m, \lambda, p) \ge \delta\}\right) \ge 1 - \delta.$$
[Table 8.2. Results of various classifiers on UCI datasets, illustrating the difference between the traditional confidence-interval approach and the holdout bound based on characterizing the empirical error with a binomial distribution.]
As can be seen in Table 8.2, the limits of intervals in the classical Gaussian confidence-interval approach are not restricted to the [0, 1] interval, rendering
them meaningless in most scenarios. Indeed, the empirical risk of the classifier,
by definition, should always be constrained in the [0, 1] range, and so should
its true risk. Hence obtaining confidence intervals that spill over this known
interval does not make much sense. On the other hand, the risk bound approach
is guaranteed to lie in the [0, 1] interval. The confidence interval technique is
also limited in a zero-empirical-risk scenario and cannot yield a confidence
interval in this case. This can be seen directly because the resulting Gaussian in
this case has both a zero mean and a zero variance. Hence, in the case of zero
empirical risk, the confidence interval technique becomes overly optimistic. The
risk bound, on the other hand, still yields a finite upper bound [of course very
small because RT (f ) = 0].
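This contrast is simple to reproduce. The sketch below (our own illustration) compares the two-sided Gaussian interval with the one-sided binomial tail inversion bounds, which coincide with the Clopper-Pearson limits and are therefore available in closed form through beta quantiles:

# Gaussian confidence limits vs. binomial tail inversion bounds (each
# one-sided at level delta); the latter equal the Clopper-Pearson limits.
risk_limits <- function(m, lambda, delta = 0.05) {
  r_hat <- lambda / m
  se    <- sqrt(r_hat * (1 - r_hat) / m)
  bin_lo <- if (lambda == 0) 0 else qbeta(delta, lambda, m - lambda + 1)
  bin_hi <- if (lambda == m) 1 else qbeta(1 - delta, lambda + 1, m - lambda)
  rbind(gaussian = c(lower = r_hat - qnorm(1 - delta) * se,
                     upper = r_hat + qnorm(1 - delta) * se),
        binomial = c(lower = bin_lo, upper = bin_hi))
}

risk_limits(m = 50, lambda = 2)  # the Gaussian lower limit spills below 0
risk_limits(m = 50, lambda = 0)  # zero empirical risk: the Gaussian interval
                                 # degenerates; the binomial upper bound is
                                 # still finite, namely 1 - delta^(1/m)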
9
Conclusion
[Figure 9.1. The evaluation framework template, with components including learning algorithms selection, datasets selection, and performing the evaluation.]
Within this context, we consider the different components of the evaluation
framework of Figure 9.1. Note that we show dependencies in graphical form
for various components of the evaluation with the exception of the learning
algorithms selection, in which case the considerations are largely qualitative
and not only depend on the evaluation requirements but also characterize the
evaluation process itself.
With regard to the representation of the process and dependencies in the other components (Figures 9.2 to 9.5), we use the following conventions: components are represented by dash-dotted boxes, solid black boxes represent the information produced by the corresponding process or step, and diamond boxes represent a test or verification step. Arrows are of two types: solid and dashed. A solid arrow indicates that the information of the process or component from which it originates is required to enable the action, process, or component to which it points. A dashed arrow, on the other hand, denotes feedback from the component or process at its origin. For instance, a dashed arrow from a diamond box to an oval box may signify the feedback of the decision or verification action that the diamond box represents on the possible actions represented by the oval box. In this sense, a bidirectional relationship notation would be more accurate for such feedback, but we use single dashed arrows to keep the figures simple. Note that a black box inside a dash-dotted box denotes information or a process that exerts dependencies on the current component but is itself part of the component denoted by the dash-dotted box that encloses it.
Let us start discussing each component, starting with algorithm selection.
The preceding description makes the algorithm selection process look fairly straightforward. However, this is certainly not the case, owing to various practical hassles. Not all approaches claimed to be effective and projected as serious candidates for various domains have their implementations available. This shifts the onus of developing an implementation from the original inventor of the approach onto the researcher carrying out the evaluation. Moreover, even when making the effort of implementing the claimed approach(es) seems worthwhile, the limited familiarity of the researcher carrying out the evaluation with the nitty-gritty details of the algorithm or its implementation makes this extremely difficult. In many cases, such details may not be available at all in the public domain. Hence it turns out that including all possible candidate algorithms in the evaluation may not be feasible, or even possible, after all. This optimal alternative then needs to be traded off in favor of approaches deemed close in performance to the claimed state of the art, even though they are not quite as
strong. Such problems are very common in the case of applied research, in which
the proposed approaches are composed of independent or interdependent com-
ponents put together in a processing pipeline. Implementing these then involves
difficulties not only in terms of the availability of the various components, but
also in figuring out the exact nature of their relationship to other components of
the processing pipeline. If the learning algorithm happens to be only one of the
components of such a pipeline, then even more care needs to be taken to make
sure that the other components are controlled before making any inference on
the performance behavior of the algorithm.
The next natural question in selecting the candidate algorithms is that of how
many algorithms need to be included in the evaluation. Of course, it would be
easy to answer this question if there were a universal winner, i.e., an algorithm
that proved to be better (on the criteria of interest, of course) than all other
candidates. Evaluating the algorithm of interest against this universal winner
would possibly be sufficient, at least as a first step in the evaluation. However,
this evidently is almost never the case. As a result, the answer to this question
becomes highly subjective. For instance, when a specific application domain is
involved, one might want to include the state-of-the-art approaches to evaluate
the algorithm of interest against. Of course, there can be cases in which no (or
very few) approaches have been proposed with the chosen application in mind.
As a result, even though there may be multiple algorithms that can be applied to
the domain, they may not be optimal, at least in their classical form. This may
give an unfair advantage to the application-specific approach (optimized with
the application of interest in mind). Both the evaluation and interpretation of
the subsequent results, at the very least (if optimizing the candidate algorithms
is not possible), should bear this caveat in mind. Similarly, if the overall goal
of evaluation is to determine the best algorithm for a given domain or a set of
domains, then a reasonable first step is to include algorithms with a wide range of
learning biases (e.g., linear classifiers, decision-based classifiers). Accordingly,
selected will probably not be deployed on the exact same domain. The desired
application area can even be broad, such as text classification. Consequently,
the domains considered would need to cover a broad spectrum of the variabil-
ities in data characteristics such as dimensionality, class distributions, noise
levels, class overlaps, and so on. In the context of generic learning algorithms,
these variabilities need to be considered in the more general context of various
domains too.
In both cases, the dependence on the other evaluation framework components,
such as the intended performance measures and error estimation, should also
be kept in mind. For instance, accuracy may not be a suitable measure for a domain with skewed class distributions. Similarly, the size of the dataset would
also affect the error-estimation process. Reverse dependencies exist as well with
regard to related concerns. For instance, in the case in which leave-one-out error estimation is of interest (say, because of concerns over theoretical guarantees), a very large dataset may not only prove to be computationally prohibitive, but may also result in estimates with high variance. A 10-fold cross-validation method,
on the other hand, would require at least a reasonably sized dataset to enable
reliable estimates. The choice of the number of datasets would also in turn
affect the resulting statistical testing. Although the large number of domains
would help in making more concrete inferences on the broad effectiveness of
the approaches, the size and other characteristics of these domains affect the
confidence in their performances’ statistical differences.
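To make the computational trade-off concrete, here is a bare-bones sketch of k-fold cross-validation in base R (fit_fn and err_fn are placeholder arguments for a learner and its error measure; they are not part of any package). Setting k equal to the number of examples recovers leave-one-out, which makes its cost on large datasets apparent:

# Plain k-fold cross-validation; k = nrow(data) gives leave-one-out.
# `fit_fn(train)` must return a model; `err_fn(model, test)` its error rate.
kfold_cv <- function(data, k = 10, fit_fn, err_fn) {
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))  # fold labels
  errs <- sapply(seq_len(k), function(i) {
    err_fn(fit_fn(data[folds != i, , drop = FALSE]),
           data[folds == i, , drop = FALSE])
  })
  c(estimate = mean(errs), fold_sd = sd(errs))  # mean error and its spread
}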
A simple two-stage approach, as shown in Figure 9.2, can be effective.
Whereas the first stage enables filtering out the domains not relevant to the goals
of the evaluation and other compatibility issues based on algorithm selections,
the second stage allows for further refinement, based on finer considerations
on domain characteristics and their dependence on other components such as
error estimation and statistical testing. Other issues can appear when multi-
ple approaches and widely varying domains are evaluated. For instance, in a
generic evaluation, if the chosen domains are too different from one another,
the results can reflect this variability in the performances of the algorithms,
rendering them less meaningful. Averaging these performances does not help in such cases, because the resulting estimates reveal only marginal (if any) differences. Approaches such as clustering the algorithms based on their performance along domain characteristics of interest can be useful in such scenarios. Even visu-
alization methods (e.g., Japkowicz et al., 2008) can be useful because such
methods allow the decoupling of classifier performance from the dependence on
the domains. However, such issues are naturally quite subjective and will need
to be addressed in their respective contexts.
Questions also arise in addition to the ones discussed in detail in Chapter 7.
First comes the obvious question of how many datasets are sufficient for evalu-
ation so that clear inferences can be made about the algorithms’ performances.
Second, one can ask where the datasets can be obtained. In other words,
what is the availability of the required datasets? And finally, one can wonder
how the relevance of some domains over all the other available ones can be
determined.
The answer to the first question is, in some sense, both related to the third concern and a factor affecting it. In a generic approach, it is felt that the more varied the
datasets, the better the performance analysis. However, it should be noted that
the more varied the datasets are, the greater are the chances of obtaining high
performance variance in the algorithms, thus jeopardizing sensible interpretation
of their comparison. On the other hand, conclusions based on too few datasets
may be prone to coincidental trends and may not reflect the true difference in the
performance of the algorithms. Interestingly, with a large number of domains
too, the issue of the multiplicity effect becomes significant, as discussed in
Chapter 7. In cases in which a wide variety of datasets is considered, a cluster analysis can prove to be better: the datasets are grouped into clusters (say, of 3–5 datasets as a rule of thumb) representing their important common characteristics, so as to yield less variable performance estimates within each group. Algorithms' behavior
can then be studied over these individual groupings.
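As an illustration of this grouping step, the following is a minimal sketch in R (the dataset names and characteristic values are purely illustrative) that clusters datasets by a few such characteristics using k-means:

# Cluster datasets by their characteristics so that algorithms can be
# compared within more homogeneous groups. Values below are illustrative.
traits <- data.frame(
  row.names  = c("d1", "d2", "d3", "d4", "d5", "d6"),
  n_examples = c(57, 150, 700, 8100, 20000, 45000),
  n_features = c(16, 4, 9, 22, 16, 50),
  minority_p = c(0.35, 0.33, 0.34, 0.48, 0.04, 0.10)  # minority-class share
)
set.seed(42)
km <- kmeans(scale(traits), centers = 2, nstart = 25)  # standardize first
split(rownames(traits), km$cluster)  # the dataset groupings to study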
The second question of how to obtain datasets was discussed at length in
Chapter 7, where we analyzed the effects of using application-specific datasets,
repository-based data, and synthetic data with characteristics of interest, the latter used either in the wake of the unavailability of real-world data (say, because of copyright or privacy issues) or to evaluate the algorithms' performance over specific criteria of interest.
Finally, the answer to the last question, that of selecting a subset of datasets
from all those that fulfill the previously mentioned initial requirements, would
depend on both the nature of the algorithms and the evaluation requirements
and goals [e.g., discarding ones with high missing attribute values in case these
result in asymmetric performance (dis)advantages for algorithms].
The next issue, once the algorithms and the dataset(s) are decided on, is that
of deciding on the yardstick(s) on which the performances of the algorithms are
to be measured and subsequently analyzed. Let us then take a peek at the issues
involved in choosing the performance measure(s) of interest.
[Figure 9.3. Selecting the performance measure(s): a coarse-grained performance measure filter, followed by a fine-grained filter that verifies whether the candidate measures are efficient with respect to the error-estimation techniques and satisfy the theoretical-guarantee requirements, yielding the selected performance measure(s).]
In medical applications, for example, the sensitivity and specificity of a test matter
much more than its overall accuracy. The performance measure will, thus, have
to be chosen appropriately.
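To make the distinction concrete, here is a small sketch in base R (with a hypothetical confusion matrix) computing these class-focused measures alongside accuracy; note how a high accuracy can coexist with poor sensitivity:

# A hypothetical 2x2 confusion matrix: rows = actual, columns = predicted.
cm <- matrix(c(60, 40,    # actual positives: 60 TP, 40 FN
               5, 895),   # actual negatives:  5 FP, 895 TN
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("pos", "neg"),
                             predicted = c("pos", "neg")))
TP <- cm["pos", "pos"]; FN <- cm["pos", "neg"]
FP <- cm["neg", "pos"]; TN <- cm["neg", "neg"]
c(accuracy    = (TP + TN) / sum(cm),  # 0.955, and yet...
  sensitivity = TP / (TP + FN),       # ...only 0.60 of positives are caught
  specificity = TN / (TN + FP),
  ppv         = TP / (TP + FP),
  npv         = TN / (TN + FN))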
Once several candidate performance measures have been chosen, a finer-grained filter needs to be applied to ensure that they are in synchronization with our choice of an error-estimation technique and that they are associated with appropriate confidence measures or guarantees. The ease as well as the computational com-
plexity of calculating a performance measure would play a crucial role when
assessing its dependence on the error-estimation technique. Other guarantees on
the performance measures might be desirable too, in certain scenarios. Consider,
for instance, performance guarantees in the form of either confidence intervals or
upper bounds on the generalization performances. Not all performance measures
have means of computing the tight confidence intervals associated with them.
For example, although point-wise bounds over ROC curves as well as confidence
bands around the ROCs have been suggested as measures of confidence for ROC
analysis, in many cases such bands are not very tight (Elazmeh et al., 2006).
Similarly, the learning-theoretic analysis that we discussed briefly in Chapter 8 is relevant for only a few measures, most specifically the empirical risk. Further,
subsequent validation (or significance testing) over such measures can affect
their choice.
We discussed different viewpoints as well as specific performance measures in Chapters 3 and 4. These two chapters discussed the various strengths, limitations, and contexts of application of these measures, which should be helpful in making the required choices. Given the performance measures of interest, the next issue
will be selecting techniques best suited to obtain their estimate(s) objectively.
Hence we turn our focus to the issue of selecting the error-estimation method.
[Figure 9.4. Refining the error-estimation method (Phase 2): the choice is checked against the theoretical-guarantee requirements and the problem of a large number of experiments.]
[Figure 9.5. Choosing the statistical test(s) (Phase 2): the constraints and interests are verified against the quantity of interest, the synchronization with the error-estimation method, the assumptions made by the statistical test, and the behavior and evaluation of the test, yielding the candidate statistical tests.]
2 Note that the rank test and the randomization test are indeed related in that randomization tests can
be interpreted as a brute-force version of rank tests. Hence randomization tests should not be used
when an exact solution exists in the form of a rank test (Cohen, 1995).
In the meantime, we conclude this discussion, and, indeed, the entire book with
the following remarks.
fully understand his or her practices, we hope that this book sheds light on the underlying reasons for these practices. With a greater integration of learning-
based approaches in various applications, this will, incidentally, facilitate an
interdisciplinary dialogue. We thus hope that this book fills the existing void in
this area in the machine learning and data mining literature and proves to be a
productive first step toward meaningful evaluation.
This appendix presents all the statistical tables necessary for constructing the
confidence intervals or for running a hypothesis test of the kind discussed in this
book. In particular, we present nine kinds of tables, although, in some cases, a table is broken up into several subtables. More specifically, the following nine tables are presented:
1. the Z table
2. the t table
3. the χ 2 table (two subtables)
4. the table of critical values for the sign test
5. the Wilcoxon table for the signed-rank test
6. the F-ratio table (four subtables)
7. the Friedman table
8. critical values for the Tukey test
9. critical values for the Dunnett test
The t table: critical values of t for df (rows) and one-tailed significance levels α = .40, .30, .25, .20, .15, .10, .05, .025, .01, .005, .001, .0005 (columns).
5 0.2672 0.5594 0.7267 0.9195 1.156 1.476 2.015 2.571 3.365 4.032 5.893 6.869
6 0.2648 0.5534 0.7176 0.9057 1.134 1.440 1.943 2.447 3.143 3.707 5.208 5.959
7 0.2632 0.5491 0.7111 0.8960 1.119 1.415 1.895 2.365 2.998 3.499 4.785 5.408
8 0.2619 0.5459 0.7064 0.8889 1.108 1.397 1.860 2.306 2.896 3.355 4.501 5.041
9 0.2610 0.5435 0.7027 0.8834 1.100 1.383 1.833 2.262 2.821 3.250 4.297 4.781
10 0.2602 0.5415 0.6998 0.8791 1.093 1.372 1.812 2.228 2.764 3.169 4.144 4.587
11 0.2596 0.5399 0.6974 0.8755 1.088 1.363 1.796 2.201 2.718 3.106 4.025 4.437
12 0.2590 0.5386 0.6955 0.8726 1.083 1.356 1.782 2.179 2.681 3.055 3.930 4.318
13 0.2586 0.5375 0.6938 0.8702 1.079 1.350 1.771 2.160 2.650 3.012 3.852 4.221
14 0.2582 0.5366 0.6924 0.8681 1.076 1.345 1.761 2.145 2.624 2.977 3.787 4.140
15 0.2579 0.5357 0.6912 0.8662 1.074 1.341 1.753 2.131 2.602 2.947 3.733 4.073
16 0.2576 0.5350 0.6901 0.8647 1.071 1.337 1.746 2.120 2.583 2.921 3.686 4.015
17 0.2573 0.5344 0.6892 0.8633 1.069 1.333 1.740 2.110 2.567 2.898 3.646 3.965
18 0.2571 0.5338 0.6884 0.8620 1.067 1.330 1.734 2.101 2.552 2.878 3.610 3.922
19 0.2569 0.5333 0.6876 0.8610 1.066 1.328 1.729 2.093 2.539 2.861 3.579 3.883
20 0.2567 0.5329 0.6870 0.8600 1.064 1.325 1.725 2.086 2.528 2.845 3.552 3.850
21 0.2566 0.5325 0.6864 0.8591 1.063 1.323 1.721 2.080 2.518 2.831 3.527 3.819
22 0.2564 0.5321 0.6858 0.8583 1.061 1.321 1.717 2.074 2.508 2.819 3.505 3.792
23 0.2563 0.5317 0.6853 0.8575 1.060 1.319 1.714 2.069 2.500 2.807 3.485 3.768
24 0.2562 0.5314 0.6848 0.8569 1.059 1.318 1.711 2.064 2.492 2.797 3.467 3.745
25 0.2561 0.5312 0.6844 0.8562 1.058 1.316 1.708 2.060 2.485 2.787 3.450 3.725
26 0.2560 0.5309 0.6840 0.8557 1.058 1.315 1.706 2.056 2.479 2.779 3.435 3.707
27 0.2559 0.5306 0.6837 0.8551 1.057 1.314 1.703 2.052 2.473 2.771 3.421 3.690
28 0.2558 0.5304 0.6834 0.8546 1.056 1.313 1.701 2.048 2.467 2.763 3.408 3.674
29 0.2557 0.5302 0.6830 0.8542 1.055 1.311 1.699 2.045 2.462 2.756 3.396 3.659
30 0.2556 0.5300 0.6828 0.8538 1.055 1.310 1.697 2.042 2.457 2.750 3.385 3.646
32 0.2555 0.5297 0.6822 0.8530 1.054 1.309 1.694 2.037 2.449 2.738 3.365 3.622
34 0.2553 0.5294 0.6818 0.8523 1.052 1.307 1.691 2.032 2.441 2.728 3.348 3.601
36 0.2552 0.5291 0.6814 0.8517 1.052 1.306 1.688 2.028 2.434 2.719 3.333 3.582
38 0.2551 0.5288 0.6810 0.8512 1.051 1.304 1.686 2.024 2.429 2.712 3.319 3.566
40 0.2550 0.5286 0.6807 0.8507 1.050 1.303 1.684 2.021 2.423 2.704 3.307 3.551
50 0.2547 0.5278 0.6794 0.8489 1.047 1.299 1.676 2.009 2.403 2.678 3.261 3.496
60 0.2545 0.5272 0.6786 0.8477 1.045 1.296 1.671 2.000 2.390 2.660 3.232 3.460
120 0.2539 0.5258 0.6765 0.8446 1.041 1.289 1.658 1.980 2.358 2.617 3.160 3.373
∞ 0.2533 0.5244 0.6745 0.8416 1.036 1.282 1.645 1.960 2.326 2.576 3.090 3.291
The χ² table (first subtable): values of χ² for df (rows) and cumulative probabilities P = .0005, .001, .005, .01, .025, .05, .10, .20, .30, .40 (columns).
5 0.1581 0.2102 0.4117 0.5543 0.8312 1.145 1.610 2.343 3.000 3.655
6 0.2994 0.3811 0.6757 0.8721 1.237 1.635 2.204 3.070 3.828 4.570
7 0.4849 0.5985 0.9893 1.239 1.690 2.167 2.833 3.822 4.671 5.493
8 0.7104 0.8571 1.344 1.646 2.180 2.733 3.490 4.594 5.527 6.423
9 0.9717 1.152 1.735 2.088 2.700 3.325 4.168 5.380 6.393 7.357
10 1.265 1.479 2.156 2.558 3.247 3.940 4.865 6.179 7.267 8.295
11 1.587 1.834 2.603 3.053 3.816 4.575 5.578 6.989 8.148 9.237
12 1.934 2.214 3.074 3.571 4.404 5.226 6.304 7.807 9.034 10.18
13 2.305 2.617 3.565 4.107 5.009 5.892 7.042 8.634 9.926 11.13
14 2.697 3.041 4.075 4.660 5.629 6.571 7.790 9.467 10.82 12.08
15 3.108 3.483 4.601 5.229 6.262 7.261 8.547 10.31 11.72 13.03
16 3.536 3.942 5.142 5.812 6.908 7.962 9.312 11.15 12.62 13.98
17 3.980 4.416 5.697 6.408 7.564 8.672 10.09 12.00 13.53 14.94
18 4.439 4.905 6.265 7.015 8.231 9.390 10.86 12.86 14.44 15.89
19 4.912 5.407 6.844 7.633 8.907 10.12 11.65 13.72 15.35 16.85
20 5.398 5.921 7.434 8.260 9.591 10.85 12.44 14.58 16.27 17.81
21 5.896 6.447 8.034 8.897 10.28 11.59 13.24 15.44 17.18 18.77
22 6.404 6.983 8.643 9.542 10.98 12.34 14.04 16.31 18.10 19.73
23 6.924 7.529 9.260 10.20 11.69 13.09 14.85 17.19 19.02 20.69
24 7.453 8.085 9.886 10.86 12.40 13.85 15.66 18.06 19.94 21.65
25 7.991 8.649 10.52 11.52 13.12 14.61 16.47 18.94 20.87 22.62
26 8.538 9.222 11.16 12.20 13.84 15.38 17.29 19.82 21.79 23.58
27 9.093 9.803 11.81 12.88 14.57 16.15 18.11 20.70 22.72 24.54
28 9.656 10.39 12.46 13.56 15.31 16.93 18.94 21.59 23.65 25.51
29 10.23 10.99 13.12 14.26 16.05 17.71 19.77 22.48 24.58 26.48
30 10.80 11.59 13.79 14.95 16.79 18.49 20.60 23.36 25.51 27.44
32 11.98 12.81 15.13 16.36 18.29 20.07 22.27 25.15 27.37 29.38
34 13.18 14.06 16.50 17.79 19.81 21.66 23.95 26.94 29.24 31.31
36 14.40 15.32 17.89 19.23 21.34 23.27 25.64 28.73 31.12 33.25
38 15.64 16.61 19.29 20.69 22.88 24.88 27.34 30.54 32.99 35.19
40 16.91 17.92 20.71 22.16 24.43 26.51 29.05 32.34 34.87 37.13
50 23.46 24.67 27.99 29.71 32.36 34.76 37.69 41.45 44.31 46.86
60 30.34 31.74 35.53 37.48 40.48 43.19 46.46 50.64 53.81 56.62
70 37.47 39.04 43.28 45.44 48.76 51.74 55.33 59.90 63.35 66.40
80 44.79 46.52 51.17 53.54 57.15 60.39 64.28 69.21 72.92 76.19
90 52.28 54.16 59.20 61.75 65.65 69.13 73.29 78.56 82.51 85.99
100 59.90 61.92 67.33 70.06 74.22 77.93 82.36 87.95 92.13 95.81
The χ² table (second subtable): values of χ² for df (rows) and cumulative probabilities P = .50, .60, .70, .80, .90, .95, .975, .99, .995, .999, .9995 (columns).
5 4.351 5.132 6.064 7.289 9.236 11.07 12.83 15.09 16.75 20.52 22.11
6 5.348 6.211 7.231 8.558 10.64 12.59 14.45 16.81 18.55 22.46 24.10
7 6.346 7.283 8.383 9.803 12.02 14.07 16.01 18.48 20.28 24.32 26.02
8 7.344 8.351 9.524 11.03 13.36 15.51 17.53 20.09 21.95 26.12 27.87
9 8.343 9.414 10.66 12.24 14.68 16.92 19.02 21.67 23.59 27.88 29.67
10 9.342 10.47 11.78 13.44 15.99 18.31 20.48 23.21 25.19 29.59 31.42
11 10.34 11.53 12.90 14.63 17.28 19.68 21.92 24.72 26.76 31.26 33.14
12 11.34 12.58 14.01 15.81 18.55 21.03 23.34 26.22 28.30 32.91 34.82
13 12.34 13.64 15.12 16.98 19.81 22.36 24.74 27.69 29.82 34.53 36.48
14 13.34 14.69 16.22 18.15 21.06 23.68 26.12 29.14 31.32 36.12 38.11
15 14.34 15.73 17.32 19.31 22.31 25.00 27.49 30.58 32.80 37.70 39.72
16 15.34 16.78 18.42 20.47 23.54 26.30 28.85 32.00 34.27 39.25 41.31
17 16.34 17.82 19.51 21.61 24.77 27.59 30.19 33.41 35.72 40.79 42.88
18 17.34 18.87 20.60 22.76 25.99 28.87 31.53 34.81 37.16 42.31 44.43
19 18.34 19.91 21.69 23.90 27.20 30.14 32.85 36.19 38.58 43.82 45.97
20 19.34 20.95 22.77 25.04 28.41 31.41 34.17 37.57 40.00 45.31 47.50
21 20.34 21.99 23.86 26.17 29.62 32.67 35.48 38.93 41.40 46.80 49.01
22 21.34 23.03 24.94 27.30 30.81 33.92 36.78 40.29 42.80 48.27 50.51
23 22.34 24.07 26.02 28.43 32.01 35.17 38.08 41.64 44.18 49.73 52.00
24 23.34 25.11 27.10 29.55 33.20 36.42 39.36 42.98 45.56 51.18 53.48
25 24.34 26.14 28.17 30.68 34.38 37.65 40.65 44.31 46.93 52.62 54.95
26 25.34 27.18 29.25 31.79 35.56 38.89 41.92 45.64 48.29 54.05 56.41
27 26.34 28.21 30.32 32.91 36.74 40.11 43.19 46.96 49.64 55.48 57.86
28 27.34 29.25 31.39 34.03 37.92 41.34 44.46 48.28 50.99 56.89 59.30
29 28.34 30.28 32.46 35.14 39.09 42.56 45.72 49.59 52.34 58.30 60.73
30 29.34 31.32 33.53 36.25 40.26 43.77 46.98 50.89 53.67 59.70 62.16
32 31.34 33.38 35.66 38.47 42.58 46.19 49.48 53.49 56.33 62.49 65.00
34 33.34 35.44 37.80 40.68 44.90 48.60 51.97 56.06 58.96 65.25 67.80
36 35.34 37.50 39.92 42.88 47.21 51.00 54.44 58.62 61.58 67.99 70.59
38 37.34 39.56 42.05 45.08 49.51 53.38 56.90 61.16 64.18 70.70 73.35
40 39.34 41.62 44.16 47.27 51.81 55.76 59.34 63.69 66.77 73.40 76.09
50 49.33 51.89 54.72 58.16 63.17 67.50 71.42 76.15 79.49 86.66 89.56
60 59.33 62.13 65.23 68.97 74.40 79.08 83.30 88.38 91.95 99.61 102.7
70 69.33 72.36 75.69 79.71 85.53 90.33 95.02 100.4 104.2 112.3 115.6
80 79.33 82.57 86.12 90.41 96.58 101.9 106.6 112.3 116.3 124.8 128.3
90 89.33 92.76 96.52 101.1 107.6 113.1 118.1 124.1 128.3 137.2 140.8
100 99.33 102.9 106.9 111.7 118.5 124.3 129.6 135.8 140.2 149.4 153.2
5 5 – – – 35 11 13 15 17
6 6 6 – – 36 12 14 16 18
7 7 7 7 – 37 11 13 17 17
8 6 8 8 8 38 12 14 16 18
9 7 7 9 9 39 13 15 17 17
10 8 8 10 10 40 12 14 16 18
11 7 9 9 11 45 13 15 17 19
12 8 8 10 10 46 14 16 18 20
13 7 9 11 11 49 13 15 19 19
14 8 10 10 12 50 14 16 18 20
15 9 9 11 11 55 15 17 19 21
16 8 10 12 12 56 14 16 18 20
17 9 9 11 13 59 15 17 19 21
18 8 10 12 12 60 14 18 20 22
19 9 11 11 13 65 15 17 21 23
20 10 10 12 14 66 16 18 20 22
21 9 11 13 13 69 15 19 23 25
22 10 12 12 14 70 16 18 22 24
23 9 11 13 15 75 17 19 23 25
24 10 12 14 14 76 16 20 22 24
25 11 11 13 15 79 17 19 23 25
26 10 12 14 14 80 16 20 22 24
27 11 13 13 15 89 17 21 23 27
28 10 12 14 16 90 18 20 24 26
29 11 13 15 15 99 19 21 25 27
30 10 12 14 16 100 18 22 26 28
The F-ratio table, critical values at α = 0.10 (upper tail):
v1 = 1 2 3 4 5 6 7 8 10 12 24 ∞
v2 = 1 39.86 49.50 53.59 55.83 57.24 58.20 58.91 59.44 60.19 60.71 62.00 63.33
2 8.526 9.000 9.162 9.243 9.293 9.326 9.349 9.367 9.392 9.408 9.450 9.491
3 5.538 5.462 5.391 5.343 5.309 5.285 5.266 5.252 5.230 5.216 5.176 5.134
4 4.545 4.325 4.191 4.107 4.051 4.010 3.979 3.955 3.920 3.896 3.831 3.761
5 4.060 3.780 3.619 3.520 3.453 3.405 3.368 3.339 3.297 3.268 3.191 3.105
6 3.776 3.463 3.289 3.181 3.108 3.055 3.014 2.983 2.937 2.905 2.818 2.722
7 3.589 3.257 3.074 2.961 2.883 2.827 2.785 2.752 2.703 2.668 2.575 2.471
8 3.458 3.113 2.924 2.806 2.726 2.668 2.624 2.589 2.538 2.502 2.404 2.293
9 3.360 3.006 2.813 2.693 2.611 2.551 2.505 2.469 2.416 2.379 2.277 2.159
10 3.285 2.924 2.728 2.605 2.522 2.461 2.414 2.377 2.323 2.284 2.178 2.055
11 3.225 2.860 2.660 2.536 2.451 2.389 2.342 2.304 2.248 2.209 2.100 1.972
12 3.177 2.807 2.606 2.480 2.394 2.331 2.283 2.245 2.188 2.147 2.036 1.904
13 3.136 2.763 2.560 2.434 2.347 2.283 2.234 2.195 2.138 2.097 1.983 1.846
14 3.102 2.726 2.522 2.395 2.307 2.243 2.193 2.154 2.095 2.054 1.938 1.797
15 3.073 2.695 2.490 2.361 2.273 2.208 2.158 2.119 2.059 2.017 1.899 1.755
16 3.048 2.668 2.462 2.333 2.244 2.178 2.128 2.088 2.028 1.985 1.866 1.718
17 3.026 2.645 2.437 2.308 2.218 2.152 2.102 2.061 2.001 1.958 1.836 1.686
18 3.007 2.624 2.416 2.286 2.196 2.130 2.079 2.038 1.977 1.933 1.810 1.657
19 2.990 2.606 2.397 2.266 2.176 2.109 2.058 2.017 1.956 1.912 1.787 1.631
20 2.975 2.589 2.380 2.249 2.158 2.091 2.040 1.999 1.937 1.892 1.767 1.607
21 2.961 2.575 2.365 2.233 2.142 2.075 2.023 1.982 1.920 1.875 1.748 1.586
22 2.949 2.561 2.351 2.219 2.128 2.060 2.008 1.967 1.904 1.859 1.731 1.567
23 2.937 2.549 2.339 2.207 2.115 2.047 1.995 1.953 1.890 1.845 1.716 1.549
24 2.927 2.538 2.327 2.195 2.103 2.035 1.983 1.941 1.877 1.832 1.702 1.533
25 2.918 2.528 2.317 2.184 2.092 2.024 1.971 1.929 1.866 1.820 1.689 1.518
26 2.909 2.519 2.307 2.174 2.082 2.014 1.961 1.919 1.855 1.809 1.677 1.504
27 2.901 2.511 2.299 2.165 2.073 2.005 1.952 1.909 1.845 1.799 1.666 1.491
28 2.894 2.503 2.291 2.157 2.064 1.996 1.943 1.900 1.836 1.790 1.656 1.478
29 2.887 2.495 2.283 2.149 2.057 1.988 1.935 1.892 1.827 1.781 1.647 1.467
30 2.881 2.489 2.276 2.142 2.049 1.980 1.927 1.884 1.819 1.773 1.638 1.456
32 2.869 2.477 2.263 2.129 2.036 1.967 1.913 1.870 1.805 1.758 1.622 1.437
34 2.859 2.466 2.252 2.118 2.024 1.955 1.901 1.858 1.793 1.745 1.608 1.419
36 2.850 2.456 2.243 2.108 2.014 1.945 1.891 1.847 1.781 1.734 1.595 1.404
38 2.842 2.448 2.234 2.099 2.005 1.935 1.881 1.838 1.772 1.724 1.584 1.390
40 2.835 2.440 2.226 2.091 1.997 1.927 1.873 1.829 1.763 1.715 1.574 1.377
60 2.791 2.393 2.177 2.041 1.946 1.875 1.819 1.775 1.707 1.657 1.511 1.291
120 2.748 2.347 2.130 1.992 1.896 1.824 1.767 1.722 1.652 1.601 1.447 1.193
∞ 2.706 2.303 2.084 1.945 1.847 1.774 1.717 1.670 1.599 1.546 1.383 1.000
The F-ratio table, critical values at α = 0.05 (upper tail):
v1 = 1 2 3 4 5 6 7 8 10 12 24 ∞
v2 = 1 161.4 199.5 215.7 224.6 230.2 234.0 236.8 238.9 241.9 243.9 249.1 254.3
2 18.51 19.00 19.16 19.25 19.30 19.33 19.35 19.37 19.40 19.41 19.45 19.50
3 10.13 9.552 9.277 9.117 9.013 8.941 8.887 8.845 8.786 8.745 8.639 8.526
4 7.709 6.944 6.591 6.388 6.256 6.163 6.094 6.041 5.964 5.912 5.774 5.628
5 6.608 5.786 5.409 5.192 5.050 4.950 4.876 4.818 4.735 4.678 4.527 4.365
6 5.987 5.143 4.757 4.534 4.387 4.284 4.207 4.147 4.060 4.000 3.841 3.669
7 5.591 4.737 4.347 4.120 3.972 3.866 3.787 3.726 3.637 3.575 3.410 3.230
8 5.318 4.459 4.066 3.838 3.687 3.581 3.500 3.438 3.347 3.284 3.115 2.928
9 5.117 4.256 3.863 3.633 3.482 3.374 3.293 3.230 3.137 3.073 2.900 2.707
10 4.965 4.103 3.708 3.478 3.326 3.217 3.135 3.072 2.978 2.913 2.737 2.538
11 4.844 3.982 3.587 3.357 3.204 3.095 3.012 2.948 2.854 2.788 2.609 2.404
12 4.747 3.885 3.490 3.259 3.106 2.996 2.913 2.849 2.753 2.687 2.505 2.296
13 4.667 3.806 3.411 3.179 3.025 2.915 2.832 2.767 2.671 2.604 2.420 2.206
14 4.600 3.739 3.344 3.112 2.958 2.848 2.764 2.699 2.602 2.534 2.349 2.131
15 4.543 3.682 3.287 3.056 2.901 2.790 2.707 2.641 2.544 2.475 2.288 2.066
16 4.494 3.634 3.239 3.007 2.852 2.741 2.657 2.591 2.494 2.425 2.235 2.010
17 4.451 3.592 3.197 2.965 2.810 2.699 2.614 2.548 2.450 2.381 2.190 1.960
18 4.414 3.555 3.160 2.928 2.773 2.661 2.577 2.510 2.412 2.342 2.150 1.917
19 4.381 3.522 3.127 2.895 2.740 2.628 2.544 2.477 2.378 2.308 2.114 1.878
20 4.351 3.493 3.098 2.866 2.711 2.599 2.514 2.447 2.348 2.278 2.082 1.843
21 4.325 3.467 3.072 2.840 2.685 2.573 2.488 2.420 2.321 2.250 2.054 1.812
22 4.301 3.443 3.049 2.817 2.661 2.549 2.464 2.397 2.297 2.226 2.028 1.783
23 4.279 3.422 3.028 2.796 2.640 2.528 2.442 2.375 2.275 2.204 2.005 1.757
24 4.260 3.403 3.009 2.776 2.621 2.508 2.423 2.355 2.255 2.183 1.984 1.733
25 4.242 3.385 2.991 2.759 2.603 2.490 2.405 2.337 2.236 2.165 1.964 1.711
26 4.225 3.369 2.975 2.743 2.587 2.474 2.388 2.321 2.220 2.148 1.946 1.691
27 4.210 3.354 2.960 2.728 2.572 2.459 2.373 2.305 2.204 2.132 1.930 1.672
28 4.196 3.340 2.947 2.714 2.558 2.445 2.359 2.291 2.190 2.118 1.915 1.654
29 4.183 3.328 2.934 2.701 2.545 2.432 2.346 2.278 2.177 2.104 1.901 1.638
30 4.171 3.316 2.922 2.690 2.534 2.421 2.334 2.266 2.165 2.092 1.887 1.622
32 4.149 3.295 2.901 2.668 2.512 2.399 2.313 2.244 2.142 2.070 1.864 1.594
34 4.130 3.276 2.883 2.650 2.494 2.380 2.294 2.225 2.123 2.050 1.843 1.569
36 4.113 3.259 2.866 2.634 2.477 2.364 2.277 2.209 2.106 2.033 1.824 1.547
38 4.098 3.245 2.852 2.619 2.463 2.349 2.262 2.194 2.091 2.017 1.808 1.527
40 4.085 3.232 2.839 2.606 2.449 2.336 2.249 2.180 2.077 2.003 1.793 1.509
60 4.001 3.150 2.758 2.525 2.368 2.254 2.167 2.097 1.993 1.917 1.700 1.389
120 3.920 3.072 2.680 2.447 2.290 2.175 2.087 2.016 1.910 1.834 1.608 1.254
∞ 3.841 2.996 2.605 2.372 2.214 2.099 2.010 1.938 1.831 1.752 1.517 1.000
The F-ratio table, critical values at α = 0.025 (upper tail):
v1 = 1 2 3 4 5 6 7 8 10 12 24 ∞
v2 = 1 647.8 799.5 864.2 899.6 921.8 937.1 948.2 956.7 968.6 976.7 997.2 1018
2 38.51 39.00 39.17 39.25 39.30 39.33 39.36 39.37 39.40 39.41 39.46 39.50
3 17.44 16.04 15.44 15.10 14.88 14.73 14.62 14.54 14.42 14.34 14.12 13.90
4 12.22 10.65 9.979 9.605 9.364 9.197 9.074 8.980 8.844 8.751 8.511 8.257
5 10.01 8.434 7.764 7.388 7.146 6.978 6.853 6.757 6.619 6.525 6.278 6.015
6 8.813 7.260 6.599 6.227 5.988 5.820 5.695 5.600 5.461 5.366 5.117 4.849
7 8.073 6.542 5.890 5.523 5.285 5.119 4.995 4.899 4.761 4.666 4.415 4.142
8 7.571 6.059 5.416 5.053 4.817 4.652 4.529 4.433 4.295 4.200 3.947 3.670
9 7.209 5.715 5.078 4.718 4.484 4.320 4.197 4.102 3.964 3.868 3.614 3.333
10 6.937 5.456 4.826 4.468 4.236 4.072 3.950 3.855 3.717 3.621 3.365 3.080
11 6.724 5.256 4.630 4.275 4.044 3.881 3.759 3.664 3.526 3.430 3.173 2.883
12 6.554 5.096 4.474 4.121 3.891 3.728 3.607 3.512 3.374 3.277 3.019 2.725
13 6.414 4.965 4.347 3.996 3.767 3.604 3.483 3.388 3.250 3.153 2.893 2.595
14 6.298 4.857 4.242 3.892 3.663 3.501 3.380 3.285 3.147 3.050 2.789 2.487
15 6.200 4.765 4.153 3.804 3.576 3.415 3.293 3.199 3.060 2.963 2.701 2.395
16 6.115 4.687 4.077 3.729 3.502 3.341 3.219 3.125 2.986 2.889 2.625 2.316
17 6.042 4.619 4.011 3.665 3.438 3.277 3.156 3.061 2.922 2.825 2.560 2.247
18 5.978 4.560 3.954 3.608 3.382 3.221 3.100 3.005 2.866 2.769 2.503 2.187
19 5.922 4.508 3.903 3.559 3.333 3.172 3.051 2.956 2.817 2.720 2.452 2.133
20 5.871 4.461 3.859 3.515 3.289 3.128 3.007 2.913 2.774 2.676 2.408 2.085
21 5.827 4.420 3.819 3.475 3.250 3.090 2.969 2.874 2.735 2.637 2.368 2.042
22 5.786 4.383 3.783 3.440 3.215 3.055 2.934 2.839 2.700 2.602 2.331 2.003
23 5.750 4.349 3.750 3.408 3.183 3.023 2.902 2.808 2.668 2.570 2.299 1.968
24 5.717 4.319 3.721 3.379 3.155 2.995 2.874 2.779 2.640 2.541 2.269 1.935
25 5.686 4.291 3.694 3.353 3.129 2.969 2.848 2.753 2.613 2.515 2.242 1.906
26 5.659 4.265 3.670 3.329 3.105 2.945 2.824 2.729 2.590 2.491 2.217 1.878
27 5.633 4.242 3.647 3.307 3.083 2.923 2.802 2.707 2.568 2.469 2.195 1.853
28 5.610 4.221 3.626 3.286 3.063 2.903 2.782 2.687 2.547 2.448 2.174 1.829
29 5.588 4.201 3.607 3.267 3.044 2.884 2.763 2.669 2.529 2.430 2.154 1.807
30 5.568 4.182 3.589 3.250 3.026 2.867 2.746 2.651 2.511 2.412 2.136 1.787
32 5.531 4.149 3.557 3.218 2.995 2.836 2.715 2.620 2.480 2.381 2.103 1.750
34 5.499 4.120 3.529 3.191 2.968 2.808 2.688 2.593 2.453 2.353 2.075 1.717
36 5.471 4.094 3.505 3.167 2.944 2.785 2.664 2.569 2.429 2.329 2.049 1.687
38 5.446 4.071 3.483 3.145 2.923 2.763 2.643 2.548 2.407 2.307 2.027 1.661
40 5.424 4.051 3.463 3.126 2.904 2.744 2.624 2.529 2.388 2.288 2.007 1.637
60 5.286 3.925 3.343 3.008 2.786 2.627 2.507 2.412 2.270 2.169 1.882 1.482
120 5.152 3.805 3.227 2.894 2.674 2.515 2.395 2.299 2.157 2.055 1.760 1.310
∞ 5.024 3.689 3.116 2.786 2.567 2.408 2.288 2.192 2.048 1.945 1.640 1.000
The F-ratio table, critical values at α = 0.01 (upper tail):
v1 = 1 2 3 4 5 6 7 8 10 12 24 ∞
v2 = 1 4052 4999 5403 5625 5764 5859 5928 5981 6056 6106 6235 6366
2 98.50 99.00 99.17 99.25 99.30 99.33 99.36 99.37 99.40 99.42 99.46 99.50
3 34.12 30.82 29.46 28.71 28.24 27.91 27.67 27.49 27.23 27.05 26.60 26.13
4 21.20 18.00 16.69 15.98 15.52 15.21 14.98 14.80 14.55 14.37 13.93 13.46
5 16.26 13.27 12.06 11.39 10.97 10.67 10.46 10.29 10.05 9.888 9.466 9.020
6 13.75 10.92 9.780 9.148 8.746 8.466 8.260 8.102 7.874 7.718 7.313 6.880
7 12.25 9.547 8.451 7.847 7.460 7.191 6.993 6.840 6.620 6.469 6.074 5.650
8 11.26 8.649 7.591 7.006 6.632 6.371 6.178 6.029 5.814 5.667 5.279 4.859
9 10.56 8.022 6.992 6.422 6.057 5.802 5.613 5.467 5.257 5.111 4.729 4.311
10 10.04 7.559 6.552 5.994 5.636 5.386 5.200 5.057 4.849 4.706 4.327 3.909
11 9.646 7.206 6.217 5.668 5.316 5.069 4.886 4.744 4.539 4.397 4.021 3.602
12 9.330 6.927 5.953 5.412 5.064 4.821 4.640 4.499 4.296 4.155 3.780 3.361
13 9.074 6.701 5.739 5.205 4.862 4.620 4.441 4.302 4.100 3.960 3.587 3.165
14 8.862 6.515 5.564 5.035 4.695 4.456 4.278 4.140 3.939 3.800 3.427 3.004
15 8.683 6.359 5.417 4.893 4.556 4.318 4.142 4.004 3.805 3.666 3.294 2.868
16 8.531 6.226 5.292 4.773 4.437 4.202 4.026 3.890 3.691 3.553 3.181 2.753
17 8.400 6.112 5.185 4.669 4.336 4.102 3.927 3.791 3.593 3.455 3.084 2.653
18 8.285 6.013 5.092 4.579 4.248 4.015 3.841 3.705 3.508 3.371 2.999 2.566
19 8.185 5.926 5.010 4.500 4.171 3.939 3.765 3.631 3.434 3.297 2.925 2.489
20 8.096 5.849 4.938 4.431 4.103 3.871 3.699 3.564 3.368 3.231 2.859 2.421
21 8.017 5.780 4.874 4.369 4.042 3.812 3.640 3.506 3.310 3.173 2.801 2.360
22 7.945 5.719 4.817 4.313 3.988 3.758 3.587 3.453 3.258 3.121 2.749 2.305
23 7.881 5.664 4.765 4.264 3.939 3.710 3.539 3.406 3.211 3.074 2.702 2.256
24 7.823 5.614 4.718 4.218 3.895 3.667 3.496 3.363 3.168 3.032 2.659 2.211
25 7.770 5.568 4.675 4.177 3.855 3.627 3.457 3.324 3.129 2.993 2.620 2.169
26 7.721 5.526 4.637 4.140 3.818 3.591 3.421 3.288 3.094 2.958 2.585 2.131
27 7.677 5.488 4.601 4.106 3.785 3.558 3.388 3.256 3.062 2.926 2.552 2.097
28 7.636 5.453 4.568 4.074 3.754 3.528 3.358 3.226 3.032 2.896 2.522 2.064
29 7.598 5.420 4.538 4.045 3.725 3.499 3.330 3.198 3.005 2.868 2.495 2.034
30 7.562 5.390 4.510 4.018 3.699 3.473 3.304 3.173 2.979 2.843 2.469 2.006
32 7.499 5.336 4.459 3.969 3.652 3.427 3.258 3.127 2.934 2.798 2.423 1.956
34 7.444 5.289 4.416 3.927 3.611 3.386 3.218 3.087 2.894 2.758 2.383 1.911
36 7.396 5.248 4.377 3.890 3.574 3.351 3.183 3.052 2.859 2.723 2.347 1.872
38 7.353 5.211 4.343 3.858 3.542 3.319 3.152 3.021 2.828 2.692 2.316 1.837
40 7.314 5.179 4.313 3.828 3.514 3.291 3.124 2.993 2.801 2.665 2.288 1.805
60 7.077 4.977 4.126 3.649 3.339 3.119 2.953 2.823 2.632 2.496 2.115 1.601
120 6.851 4.787 3.949 3.480 3.174 2.956 2.792 2.663 2.472 2.336 1.950 1.381
∞ 6.635 4.605 3.782 3.319 3.017 2.802 2.639 2.511 2.321 2.185 1.791 1.000
k=3 k=4
P 10 5 2.5 1 0.1 P 10 5 2.5 1 0.1
n=3 6.000 6.000 – – – n=3 6.600 7.400 8.200 9.000 –
4 6.000 6.500 8.000 8.000 – 4 6.300 7.800 8.400 9.600 11.10
5 5.200 6.400 7.600 8.400 10.00 5 6.360 7.800 8.760 9.960 12.60
6 5.333 7.000 8.333 9.000 12.00 6 6.400 7.600 8.800 10.20 12.80
7 5.429 7.143 7.714 8.857 12.29 7 6.429 7.800 9.000 10.54 13.46
8 5.250 6.250 7.750 9.000 12.25 8 6.300 7.650 9.000 10.50 13.80
9 5.556 6.222 8.000 9.556 12.67 9 6.200 7.667 8.867 10.73 14.07
10 5.000 6.200 7.800 9.600 12.60 10 6.360 7.680 9.000 10.68 14.52
11 5.091 6.545 7.818 9.455 13.27 11 6.273 7.691 9.000 10.75 14.56
12 5.167 6.500 8.000 9.500 12.67 12 6.300 7.700 9.100 10.80 14.80
13 4.769 6.615 7.538 9.385 12.46 13 6.138 7.800 9.092 10.85 14.91
14 5.143 6.143 7.429 9.143 13.29 14 6.343 7.714 9.086 10.89 15.09
15 4.933 6.400 7.600 8.933 12.93 15 6.280 7.720 9.160 10.92 15.08
16 4.875 6.500 7.625 9.375 13.50 16 6.300 7.800 9.150 10.95 15.15
17 5.059 6.118 7.412 9.294 13.06 17 6.318 7.800 9.212 11.05 15.28
18 4.778 6.333 7.444 9.000 13.00 18 6.333 7.733 9.200 10.93 15.27
19 5.053 6.421 7.684 9.579 13.37 19 6.347 7.863 9.253 11.02 15.44
20 4.900 6.300 7.500 9.300 13.30 20 6.240 7.800 9.240 11.10 15.36
21 4.952 6.095 7.524 9.238 13.24 ∞ 6.251 7.815 9.348 11.34 16.27
k=3 (continued)
22 4.727 6.091 7.364 9.091 13.45
23 4.957 6.348 7.913 9.391 13.13
24 5.083 6.250 7.750 9.250 13.08
25 4.880 6.080 7.440 8.960 13.52
26 4.846 6.077 7.462 9.308 13.23
27 4.741 6.000 7.407 9.407 13.41
28 4.571 6.500 7.714 9.214 13.50
29 5.034 6.276 7.517 9.172 13.52
30 4.867 6.200 7.400 9.267 13.40
31 4.839 6.000 7.548 9.290 13.42
32 4.750 6.063 7.563 9.250 13.69
33 4.788 6.061 7.515 9.152 13.52
34 4.765 6.059 7.471 9.176 13.41
∞ 4.605 5.991 7.378 9.210 13.82
k=5
P 10 5 2.5 1 0.1
n=3 7.467 8.533 9.600 10.13 11.47
4 7.600 8.800 9.800 11.20 13.20
5 7.680 8.960 10.24 11.68 14.40
6 7.733 9.067 10.40 11.87 15.20
7 7.771 9.143 10.51 12.11 15.66
8 7.700 9.200 10.60 12.30 16.00
9 7.733 9.244 10.67 12.44 16.36
∞ 7.779 9.488 11.14 13.28 18.47
k=6
P 10 5 2.5 1 0.1
n=3 8.714 9.857 10.81 11.76 13.29
4 9.000 10.29 11.43 12.71 15.29
Number of Groups
df WG α 2 3 4 5 6 7 8 9 10
.05 3.64 4.60 5.22 5.67 6.03 6.33 6.58 6.80 6.99
5
.01 5.70 6.98 7.80 8.42 8.91 9.32 9.67 9.97 10.24
.05 3.46 4.34 4.90 5.30 5.63 5.90 6.12 6.32 6.49
6
.01 5.24 6.33 7.03 7.56 7.97 8.32 8.61 8.87 9.10
.05 3.34 4.16 4.68 5.06 5.36 5.61 5.82 6.00 6.16
7
.01 4.95 5.92 6.54 7.01 7.37 7.68 7.94 8.17 8.37
.05 3.26 4.04 4.53 4.89 5.17 5.40 5.60 5.77 5.92
8
.01 4.75 5.64 6.20 6.62 6.96 7.24 7.47 7.68 7.86
.05 3.20 3.95 4.41 4.76 5.02 5.24 5.43 5.59 5.74
9
.01 4.60 5.43 5.96 6.35 6.66 6.91 7.13 7.33 7.49
.05 3.15 3.88 4.33 4.65 4.91 5.12 5.30 5.46 5.60
10
.01 4.48 5.27 5.77 6.14 6.43 6.67 6.87 7.05 7.21
.05 3.11 3.82 4.26 4.57 4.82 5.03 5.20 5.35 5.49
11
.01 4.39 5.15 5.62 5.97 6.25 6.48 6.67 6.84 6.99
.05 3.08 3.77 4.20 4.51 4.75 4.95 5.12 5.27 5.39
12
.01 4.32 5.05 5.50 5.84 6.10 6.32 6.51 6.67 6.81
.05 3.06 3.73 4.15 4.45 4.69 4.88 5.05 5.19 5.32
13
.01 4.26 4.96 5.40 5.73 5.98 6.19 6.37 6.53 6.67
.05 3.03 3.70 4.11 4.41 4.64 4.83 4.99 5.13 5.25
14
.01 4.21 4.89 5.32 5.63 5.88 6.08 6.26 6.41 6.54
.05 3.01 3.67 4.08 4.37 4.59 4.78 4.94 5.08 5.20
15
.01 4.17 4.84 5.25 5.56 5.80 5.99 6.16 6.31 6.44
.05 3.00 3.65 4.05 4.33 4.56 4.74 4.90 5.03 5.15
16
.01 4.13 4.79 5.19 5.49 5.72 5.92 6.08 6.22 6.35
.05 2.98 3.63 4.02 4.30 4.52 4.70 4.86 4.99 5.11
17
.01 4.10 4.74 5.14 5.43 5.66 5.85 6.01 6.15 6.27
.05 2.97 3.61 4.00 4.28 4.49 4.67 4.82 4.96 5.07
18
.01 4.07 4.70 5.09 5.38 5.60 5.79 5.94 6.08 6.20
.05 2.96 3.59 3.98 4.25 4.47 4.65 4.79 4.92 5.04
19
.01 4.05 4.67 5.05 5.33 5.55 5.73 5.89 6.02 6.14
.05 2.95 3.58 3.96 4.23 4.45 4.62 4.77 4.90 5.01
20
.01 4.02 4.64 5.02 5.29 5.51 5.69 5.84 5.97 6.09
.05 2.92 3.53 3.90 4.17 4.37 4.54 4.68 4.81 4.92
24
.01 3.96 4.55 4.91 5.17 5.37 5.54 5.69 5.81 5.92
.05 2.89 3.49 3.85 4.10 4.30 4.46 4.60 4.72 4.82
30
.01 3.89 4.45 4.80 5.05 5.24 5.40 5.54 5.65 5.76
.05 2.86 3.44 3.79 4.04 4.23 4.39 4.52 4.63 4.73
40
.01 3.82 4.37 4.70 4.93 5.11 5.26 5.39 5.50 5.60
.05 2.83 3.40 3.74 3.98 4.16 4.31 4.44 4.55 4.65
60
.01 3.76 4.28 4.59 4.82 4.99 5.13 5.25 5.36 5.45
.05 2.80 3.36 3.68 3.92 4.10 4.24 4.36 4.47 4.56
120
.01 3.70 4.20 4.50 4.71 4.87 5.01 5.12 5.21 5.30
.05 2.77 3.31 3.63 3.86 4.03 4.17 4.29 4.39 4.47
∞
.01 3.64 4.12 4.40 4.60 4.76 4.88 4.99 5.08 5.16
1 This table is abridged from Table 29 in E.S. Pearson and H.O. Hartley (Eds.), Biometrika tables for
statisticians (3rd ed., Vol 1), Cambridge University Press, 1970.
2 This table is abridged from C.W. Dunnett, New tables for multiple comparisons with a control,
Biometrics, 1964, 482–491.
Appendix B
Additional Information on the Data
Tables B.1 and B.2 show the results obtained using 10-fold cross-validation by c45 and nb, respectively, on each instance of the labor data, as output by WEKA. The first column lists the instance number; the second column lists the
instance label, where class 1 corresponds to class “bad” and class 2 corresponds
to class “good”; the third column lists the predicted class, using the same naming
convention; column 4 uses the “+” symbol to indicate whether the predicted label
differs from the actual one and a blank if they are in agreement; finally, the last
two values, which are complementary and add up to 1, indicate the confidence
of the prediction. The first value indicates how much the classifier believes the
instance to be of class 1 (bad), and the second indicates how much the classifier
believes the instance to be of class 2 (good). The dominant value is preceded
by a “*” symbol and corresponds to the value of the predicted label.
Please note that the numbers denoting the instances in the first column are not
sequential. After number 6 or 7 is reached, a 1–6 or 1–7 sequence is repeated.
This is because every 1–6 or 1–7 sequence represents a different fold. Indeed,
it can be seen that 10 different sequences are present in each classifier run,
corresponding to the 10 folds of 10-fold cross-validation. Note, however, that
despite the repetition, the instances are different. For example, instance 2 of fold
1 is different from instance 2 of fold 2. In fact, it can be seen that the number of
instances present in each classifier run corresponds to the number of examples
in the dataset. That is because cross-validation tests each instance exactly once,
as discussed in Chapter 5.
[Table B.1. c45 applied to the labor data: Predictions on test data.]
[Table B.2. Naive Bayes applied to the labor data: Predictions on test data.]
that took place prior to the political push to eradicate nuclear proliferation. In addition, they have to construct weather modeling systems to simulate the transport of
radioxenon to the monitoring stations. On the other hand, all the background
data are readily available in large quantity at each monitoring station. The
explosion part of the dataset is thus constructed from both sources, whereas
the background data correspond to the actual readings done at the monitoring
stations. In more detail, the data are composed of radioxenon measurements
from four or five CTBTO monitoring sites. Each data point is represented by a
quadruplet representing the four activity concentrations of Xe-131m, Xe-133m,
Xe-133, and Xe-135 for a given air sample. An additional feature represents the
class of the point and corresponds to either the class “Background” or the class
“Background plus Explosion.”
One difficulty with this dataset is its small dimensionality, which leaves the data quite convoluted. Adding to this difficulty is the fact that the dataset
is highly imbalanced with 8072 explosions (positive) versus 623 normal back-
ground (negative) samples. Learning from such imbalanced domains is, in itself,
a problem of interest in machine learning and has led to several interesting find-
ings. However, a detailed discussion on these is beyond the current scope of the
book. We, for the purpose of this case study based on a preliminary exploratory
analysis, downsampled the positive class samples so as to obtain balanced classes
with 623 examples each. Readers interested in more details on data processing and algorithmic techniques for dealing with class imbalances on this problem are referred to Stocki et al. (2008). Another reason for balancing the dataset is
that, although we discussed the performance measures that are recommended in
the case of class imbalances, we did not want to shift the focus of the study to
dealing with class imbalances only. Therefore, although we demonstrate the use
of the performance measures most appropriate for class imbalances, we do not
have to restrict our attention solely to them.
The purpose of the study was to compare the performance of the decision
tree classifier (c45) with that of AdaBoost (ada) on this domain. However,
to illustrate the difficulty of the learning domain, we subsequently give the
results of typically strong classifiers with nevertheless simpler biases than c45.
Applying Naive Bayes (nb) and k-Nearest Neighbors (ibk) along with c45
and ada for a preliminary analysis gives the results for various performance
metrics of interest, as shown in Table C.1. These results were obtained from WEKA, which, by default, uses 10 × 10-fold stratified cross-validation as its error-estimation method. We used this default procedure, as well as all of WEKA's default classifier parameter values, in this study.
Table C.1 shows that, even though the results of nb and ibk can seem reason-
able on some isolated metrics (e.g., TPR, recall, and F measure for ibk and FPR
for nb), their results show an exceptionally high degree of variation between
metrics, even over a balanced domain. Indeed, characterizing their behavior over
their full operating ranges against that of c45 using the ROC analysis further
370 Appendix C
Table C.1. Initial results of c45, nb, and ibk on the 2008 ICDM
Data Mining Contest Health Canada dataset
confirmed this fact. Drawing the ROC curves for these classifiers (using the
RWeka and ROCR packages in R), as shown in Figure C.1, further confirms that
the performances of nb and ibk are indeed not too far from that of a random
classifier (which is expected to appear along the diagonal). In fact, the only
marginally better performances of c45 and ada themselves demonstrate
the difficulty of learning the domain. We hence exclude nb and ibk from further
consideration in this study and focus on a comparative evaluation of c45
against ada.

Figure C.1. ROC curves for nb, c45, and ibk. Only the curve for c45 lifts above the random
line.

On the other hand, drawing the ROC performances of c45 and ada
(see Figure C.2) shows that these two classifiers trade off performance across
different portions of the operating range. However, the information yielded by
the ROC curves is not very useful here because the two curves overlap heavily.

Figure C.2. ROC curves for c45 and ada. The two curves are very similar, although c45
seems to dominate more often than ada. However, it is not clear whether this dominance
is statistically significant.
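For reference, ROC curves like those of Figures C.1 and C.2 can be drawn with a few lines of R using the RWeka and ROCR packages mentioned above; this is a minimal sketch in which the data frame xenon is a hypothetical placeholder and, for brevity, the curve is drawn on the training data rather than on a proper test set:

# ROC curve for J48 (c45): score the data, then plot TPR against FPR.
library(RWeka)
library(ROCR)
model <- J48(class ~ ., data = xenon)
probs <- predict(model, newdata = xenon, type = "probability")
pred  <- prediction(probs[, 2], xenon$class)
perf  <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf)             # ROC curve
abline(0, 1, lty = 2)  # random-classifier diagonal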
Hence, let us focus on metrics that can highlight the classifiers' performances on
the individual classes, in particular, sensitivity, specificity, positive predictive value
(PPV), and negative predictive value (NPV). The results are listed in Table C.2.
Note that, as mentioned in Chapter 3, although WEKA may not seem to output
these values, it actually does so, albeit in a somewhat hidden way. In particular, in the part
titled “Detailed Accuracy By Class,” the “yes” TPR corresponds to sensitivity;
the “no” recall corresponds to specificity; the “yes” precision corresponds to
the PPV; and the “no” precision corresponds to the NPV. We can verify this by
comparing the ada entries in Table C.2 with the partial WEKA output for ada
shown in Listing C.1.
Listing C.1: WEKA output on AdaBoost.
=== Detailed Accuracy By Class ===

To compare the two classifiers on these metrics over their full operating ranges, we
can also plot their sensitivity and specificity curves (the ROCR
package in R can be used for this purpose).

Figure C.3. Sensitivity and specificity curves for c45 and ada.

It can be noted in Figure C.3 that,
even if overall c45 seems to dominate ada on this graph, around a specificity of
0.5, ada is more sensitive than c45.
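A curve like those of Figure C.3 can be obtained from the same kind of ROCR prediction object; a minimal sketch, reusing the hypothetical pred object from the earlier ROC sketch:

# Sensitivity plotted against specificity for one classifier.
perf_ss <- performance(pred, measure = "sens", x.measure = "spec")
plot(perf_ss)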
Let us then verify whether the differences between the two classifiers in terms of
the sensitivity and specificity pair are indeed statistically significant (at the
default decision threshold used by WEKA, on which we also relied). Note, however,
that model selection could have been performed in order to choose an
optimal threshold.
This brings up the question of which statistical test is the most suitable
for this purpose. Note that, for this part of the study, we still did not
vary the error-estimation method and kept WEKA's default 10 × 10-
fold stratified cross-validation. Two additional error-estimation methods were
also experimented with, as will be seen subsequently. Because we are
interested in comparing two classifiers on a single domain (over the individual
metrics of sensitivity and specificity), the three candidate statistical tests are the
paired t test and its nonparametric alternatives, McNemar's test and Wilcoxon's
Signed-Ranks test. Because McNemar's test does not have a straightforward way
to integrate the measures of interest here (sensitivity and specificity) into its
computations, we focus on the other two tests.
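In R, the paired comparisons just described can be run with standard calls; the following sketch assumes that spec_j48 and spec_ada are hypothetical vectors holding the ten per-fold specificity estimates of the two classifiers (they are placeholders, not the study's data):

# Paired t test, Wilcoxon signed-rank test, and Cohen's d over ten
# matched per-fold estimates.
t.test(spec_j48, spec_ada, paired = TRUE)
wilcox.test(spec_j48, spec_ada, paired = TRUE)
diffs <- spec_j48 - spec_ada
d <- mean(diffs) / sd(diffs)  # Cohen's d for paired samples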
As just mentioned, we also decided to use two additional error-estimation
strategies that yield additional measures of statistical significance from Chap-
ter 5: bootstrapping and the permutation test.
Listings C.2 and C.3 show, for sensitivity and specificity respectively, the
results of the paired t test and its corresponding effect size using Cohen’s d
statistic, Wilcoxon’s Signed-Ranks test, and .632 bootstrap and permutation
(randomization) test estimates.
Listing C.2: Statistical comparison of c45 (j48) and ada on sensitivity.

#1) Paired t-test
data: j48 and AdaBoost
t = -4.1275, df = 9, p-value = 0.002569
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.27462727 -0.08017273
sample estimates:
mean of the differences
              -0.1774

#2) Cohen's d statistic
> d
[1] 1.896586   (> .8, thus of practical significance)

#3) Wilcoxon signed rank test with continuity correction
data: j48 and AdaBoost
V = 0, p-value = 0.02225
alternative hypothesis: true location shift is not equal to 0

#4) .632 Bootstrapping
> b632J48
[1] 0.4157718
> b632Adaboost
[1] 0.5523992
Listing C.3: Statistical comparison of c45 (j48) and ada on specificity.

#2) Cohen's d statistic
> d
[1] 2.388927   (> .8, thus of practical significance)

#3) Wilcoxon signed rank test with continuity correction
data: j48 and AdaBoost
V = 55, p-value = 0.005793
alternative hypothesis: true location shift is not equal to 0

#4) .632 Bootstrapping
> b632J48
[1] 0.8228896
> b632Adaboost
[1] 0.7028262

#5) Permutation (randomization) test
> mobt
[1] 0.1695
# probability of mobt
[1] 0.0026   (We reject with probability .0026 the hypothesis that
              J48 and AdaBoost display the same specificity)
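The permutation (randomization) estimate of item #5 can be approximated along the following lines; this is a sketch of a matched-pairs randomization test on a hypothetical vector diffs of per-fold specificity differences, not the exact code used for the listing:

# Matched-pairs randomization test: randomly flip the signs of the paired
# differences and record how often the mean is as extreme as observed.
perm_test <- function(diffs, n_perm = 10000) {
  obs  <- mean(diffs)
  sims <- replicate(n_perm,
                    mean(diffs * sample(c(-1, 1), length(diffs), replace = TRUE)))
  mean(abs(sims) >= abs(obs))  # two-sided p-value estimate
}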
As can be seen from the results, all tests concur that c45's
performance is statistically significantly different from (better than) that of
AdaBoost with regard to specificity, whereas the reverse
is true with regard to sensitivity. Hence the analysis suggests that AdaBoost
would be the classifier of choice (among the ones evaluated) when the goal
is high sensitivity (at the expense of some false alarms), whereas c45 would
be more apt (again, among the ones evaluated) when the issue of false alarms
(possibly leading to unjustified suspicions of forbidden nuclear weapons testing)
is more important. Note that this remains an illustrative exercise and
is in no way representative of the actual approaches utilized for this purpose,
which are significantly more sophisticated and are rigorously validated before being
deployed. The interesting aspect of this study, however, is its demonstration of
the flexibility that the tools discussed in this book provide. We now turn to
our second case study, in which several generic domains are involved.
and tic-tac-toe were the domains that were generally easy to classify. svm, rf,
c45, rip, and bag were the systems shown to win and tie most often in the
aggregated t tests against each of the other classifiers.

Table C.3. Results obtained on eight classifiers and 10 domains, using accuracy
Let us now redo the evaluation analysis with the perspectives obtained from
this book. The results used in Chapter 1 were all based on the accuracy measure
alone. The results on the accuracy estimates are shown here in Table C.3 (see Footnote 1).
However, in the absence of knowledge of the best performance measure to use
or a concrete measure of interest, one would be inclined to take into account more
generic measures that can characterize the performances of the classifiers. In
particular, in addition to accuracy, we use the RMSE, the AUC, the F measure,
and the Kononenko and Bratko (KB) relative information score. The results
of applying these measures are shown in Tables C.4 (RMSE), C.5 (AUC),
C.6 (F measure), and C.7 and C.8 (KB). Before we go further, let us see what a visual
exploratory analysis can show us about the performance of these classifiers.
1 Note, however, that the numbers in Table C.3 do not match exactly with those in Chapter 1 because
all the experiments were rerun from scratch for this analysis.
Table C.4. Results obtained on eight classifiers and 10 domains, using RMSE
(Column order: nb, 1nn, svm, ada, bag, c45, rf, rip, AVG)
Contact lenses 0.2965 0.3052 0.3741 0.2904 0.3028 0.2333 0.2587 0.267 0.291
Anneal 0.2028 0.0384 0.3111 0.269 0.0573 0.0571 0.0474 0.0664 0.1312
Audiology 0.1355 0.1417 0.1934 0.1724 0.123 0.1208 0.1216 0.1343 0.1428
Balanced scale 0.2785 0.3791 0.345 0.3602 0.2857 0.3574 0.3102 0.3333 0.3312
Pima diabetes 0.4194 0.5402 0.4793 0.4157 0.403 0.4388 0.4199 0.4274 0.4430
Glass 0.337 0.29 0.3162 0.3026 0.2366 0.2832 0.2175 0.2759 0.2824
Hepatitis 0.3459 0.2902 0.3435 0.3601 0.3522 0.404 0.3419 0.4077 0.3557
Hypothyroid 0.1379 0.2904 0.3214 0.1216 0.0422 0.0385 0.0651 0.0488 0.1332
Breast cancer 0.4512 0.2904 0.5477 0.4355 0.4505 0.444 0.4663 0.4494 0.4419
Tic-tac-toe 0.4308 0.2904 0.1103 0.3011 0.2843 0.3344 0.275 0.1376 0.2705
AVG 0.3036 0.2856 0.3342 0.3029 0.2538 0.2712 0.2524 0.2548
Table C.5. Results obtained on eight classifiers and 10 domains, using AUC
(Column order: nb, 1nn, svm, ada, bag, c45, rf, rip, AVG)
Contact lenses 0.95 0.765 0.915 0.835 0.935 0.945 0.975 0.94 0.9075
Anneal 0.9954 0.9375 0.9826 0.831 0.9655 0.931 0.9676 0.755 0.9207
Audiology 0.7033 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5254
Balanced scale 0.9934 0.8598 0.9261 0.889 0.9596 0.8448 0.9577 0.8619 0.9115
Pima diabetes 0.815 0.6677 0.7131 0.8049 0.8218 0.7514 0.7945 0.7195 0.7610
Glass 0.729 0.7984 0.7783 0.7083 0.9063 0.7938 0.9209 0.8019 0.8046
Hepatitis 0.8567 0.6785 0.7685 0.8326 0.8257 0.6678 0.8378 0.6224 0.7613
Hypothyroid 0.9317 0.6709 0.5918 0.9914 0.9965 0.9962 0.9986 0.988 0.8956
Breast cancer 0.7025 0.6043 0.5836 0.6991 0.6416 0.6063 0.6471 0.5975 0.6353
Tic-tac-toe 0.7443 0.7851 0.9759 0.8021 0.9726 0.9013 0.9792 0.9738 0.8918
AVG 0.8421 0.7267 0.7735 0.7893 0.8525 0.7938 0.8578 0.776
Table C.6. Results obtained on eight classifiers and 10 domains, using the F measure
(Column order: nb, 1nn, svm, ada, bag, c45, rf, rip, AVG)
Contact lenses 0.3767 0.31 0.3867 0.4017 0.3667 0.4767 0.3867 0.4733 0.3973
Anneal 0.6317 0.7 0.6833 0 0.495 0.5067 0.7 0.3767 0.5117
Audiology 0 0 0 0 0 0 0 0 0
Balanced scale 0.9421 0.8475 0.9115 0.746 0.8727 0.8267 0.8773 0.8356 0.8574
Pima diabetes 0.8183 0.7785 0.8339 0.8145 0.8177 0.8063 0.8119 0.8159 0.8121
Glass 0.5762 0.7248 0.5276 0.6269 0.7519 0.6957 0.7852 0.6834 0.6715
Hepatitis 0.6405 0.4691 0.6297 0.4846 0.3678 0.4085 0.4578 0.3646 0.4778
Hypothyroid 0.9767 0.9563 0.9665 0.9737 0.9986 0.9983 0.9961 0.9977 0.9830
Breast cancer 0.8131 0.7811 0.7974 0.8066 0.8017 0.8378 0.7953 0.8126 0.8057
Tic-tac-toe 0.4868 0.718 0.9749 0.5976 0.857 0.779 0.8959 0.9642 0.7842
AVG 0.6262 0.6285 0.6712 0.5452 0.6329 0.6336 0.6706 0.6324
Table C.7. Results obtained on eight classifiers and 10 domains, using the KB relative information score
(Column order: nb, 1nn, svm, ada, bag, c45, rf, rip, AVG)
Pima diabetes 2793.16 2512.03 3597.15 2368.57 2472.69 2489.41 2435.09 2039.3 2588.425
Glass 862.8 1308.59 276.1 596.08 1200.5 1250.15 1385.06 1072.85 994.01625
Hepatitis 619.19 526.32 764.25 450.32 257.37 298.73 383.58 204.84 438.075
Hypothyroid 16533.44 8311.91 −179227 22044.55 34802.82 35498.28 30242.13 34768.23 371.795
Breast cancer 555.38 549.69 619.1 367.74 136.22 320.52 275.03 268.84 386.565
Tic-tac-toe 1770.41 5356.4 9208.81 3280.18 5336.09 6044.11 5555.74 8881.55 5679.16125
AVG 3356.65 3219.91 −16685.04 3344.18 5726.59 5925.74 5378.50 6016.54
Table C.8. Results obtained on eight classifiers and 10 domains, using KB relative
information scores that were mapped to the [0,1] range using the procedure described
in Footnote 2
The plot of Figure C.4 shows an aggregate view of the classifier performance
on all five evaluation measures over all the domains.

Figure C.4. Classifier view on all the domains using the five evaluation metrics.

Let us explain how the plot should be read. The classifiers are listed in the
window at the bottom left; the domains used are listed in the bottom middle;
and the performance measures used are listed in the bottom right window. The
algorithms, domains, and measures are numerically labeled starting at 0 in the
order in which they appear in the respective windows. That is, nb is classifier 0,
svm is classifier 1, and so on for the classifiers. Similarly, over the domains,
contact lenses is domain 0, anneal is domain 1, and so on. Note, however, that
the domain labeling is not relevant in this plot because we are focusing on the
classifier view, as indicated by the tab titled “Focused on” on the right. The
plot depicts the relative distances (under the Euclidean metric) of the classifiers
in terms of their aggregate performance as well as their distance to the ideal
classifier (in terms of the best performance over the aggregate measures).
Some simple observations can be made. rf (classifier 6) and svm (classifier 1)
represent the two extremes of performance in this framework, with the former
being the closest to the ideal classifier. The other classifiers tend to cluster
between these two, with the exception of ada (classifier 3), which trails behind
the cluster. Although these results show some similarities to the earlier findings,
there are some interesting differences. For instance, as in Chapter 1, we see that
nb (classifier 0) and 1nn (classifier 2) are slightly behind the tree- or rule-based
classifiers (c45, rf, rip, and bag) and that ada (classifier 3) is even further
back; svm, however, unlike in Chapter 1, where it was considered the second best, is
shown to be inferior in terms of aggregate performance over all the domains.
This demonstrates the dependence of these results on the evaluation measures,
the manner in which these are aggregated, or both.
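The intuition behind such distances can be conveyed in a simplified form with just two of the aggregate measures; the sketch below computes each classifier's Euclidean distance to an ideal point (RMSE of 0, AUC of 1) from the AVG rows of Tables C.4 and C.5 (the actual plot is based on a projection over all five measures, so this is only a rough analogue):

# Distance of each classifier to an ideal classifier in a two-measure
# space (values are the AVG rows of Tables C.4 and C.5).
m <- cbind(rmse = c(0.3036, 0.2856, 0.3342, 0.3029, 0.2538, 0.2712, 0.2524, 0.2548),
           auc  = c(0.8421, 0.7267, 0.7735, 0.7893, 0.8525, 0.7938, 0.8578, 0.7760))
rownames(m) <- c("nb", "1nn", "svm", "ada", "bag", "c45", "rf", "rip")
sort(sqrt((m[, "rmse"] - 0)^2 + (m[, "auc"] - 1)^2))  # rf comes out closest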
Let us then see if the domains can be clustered (in fact, ranked) in this
framework in terms of the ease or difficulty of learning from them as measured
by the aggregate measures. Figure C.5 plots this analysis.
The domain view of the analysis, as indicated by the “Focused on” tab on
the right, shows that the easiest domains to classify are domains 3, 4, 8, and
9, corresponding to balanced scale, pima diabetes, breast cancer, and tic-tac-
toe. The domains with an intermediate level of difficulty are domains 0, 1, 5,
and 6, corresponding to contact lenses, anneal, glass, and hepatitis. And the
two most difficult domains to classify are domains 2 and 7, corresponding to
audiology and hypothyroid. Keep in mind, however, that these results depend
on the classifiers and the metrics involved in the study. Were these to change,
so would the ranking. Again, note the difference with interpretations drawn
solely on accuracy in Chapter 1, indicating anneal, hypothyroid, and tic-tac-toe
as the easiest to classify, further emphasizing the reliance of the findings on the
performance measures and the aggregation process.
From this, we can then break down the plot of Figure C.4 to study classifier
performances in terms of domain difficulty. Considering the three broad groups
of domains with easy, intermediate, and high levels of difficulty in learning
them, as previously identified, we did a similar analysis (figures not shown) to
discover some interesting characteristics of classifier performances. In terms of
the aggregate performances and within the framework of analysis, we can note
that all classifiers seem to be almost equally effective (with the possible exception of
nb and ada) on easy domains, whereas svm and ada show a marked
deterioration in performance with increasing difficulty of domains. Overall, rf
appears to be the most robust of the classifiers across these groups.
Figure C.5. Domain view using all the classifiers and the five evaluation metrics.
2 Please note that the data used in the ANOVA were different from those in these tables for the accuracy
and the information scores. In particular, in both these cases, the data were mapped into the [0,1]
range. This was easy in the case of accuracy, as all the values could simply be divided by 100. In the
case of the information score, the mapping was trickier. In particular, for each dataset, the average
value reported in the last column (AVG) of Table C.7 was considered to have a mapped value of 0.5.
All the other values for this dataset were then multiplied by 0.5/AVG. However, if any of the values in
Table C.7 was negative, then, prior to applying the preceding step, we began by shifting the scores
by the negative amount (so that that negative value became 0 and all the others were shifted by that
amount) and then proceeded as in the other cases. The transformed data from Table C.7 are shown in
Table C.8.
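In R, the footnote's mapping can be sketched as follows, under one plausible reading in which the row average is recomputed after the shift (the function name map_kb is ours); applied to the Pima diabetes row of Table C.7, it reproduces the corresponding values used in Listing C.9:

# Map one dataset's row of KB scores: shift so the minimum becomes 0 if
# negatives occur, then scale so that the row average maps to 0.5.
map_kb <- function(scores) {
  if (min(scores) < 0) scores <- scores - min(scores)
  scores * (0.5 / mean(scores))
}
map_kb(c(2793.16, 2512.03, 3597.15, 2368.57,   # Pima diabetes row
         2472.69, 2489.41, 2435.09, 2039.3))   # of Table C.7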
To confirm these observations statistically, we now run the relevant omnibus tests and
follow these with the respective post hoc tests (the Dunnett test for ANOVA and
the Nemenyi test for Friedman's test in our case).
Let us start with the one-way repeated-measures ANOVA, whose results are
listed in Listing C.4.
Listing C.4: The results of the omnibus ANOVA test on the eight classifiers, 10
domains, and five evaluation measures. (Values are imported from respective
csv files for each measure)
> tt <- read.table("rmanova-chapter9-accuracy.csv", header=T, sep=",")
> attach(tt)
> summary(aov(Accuracy ~ classifier + Error(dataset)))
# Accuracy (the analogous read.table/aov calls for the other four
# measures are omitted; their outputs follow below)

Error: dataset
          Df  Sum Sq Mean Sq F value Pr(>F)
Residuals  9 0.81416 0.09046

Error: Within
           Df   Sum Sq  Mean Sq F value    Pr(>F)
classifier  7 0.108316 0.015474  4.2245 0.0007091 ***
Residuals  63 0.230757 0.003663
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# RMSE
Error: dataset
          Df  Sum Sq Mean Sq F value Pr(>F)
Residuals  9 0.99017 0.11002

Error: Within
           Df   Sum Sq  Mean Sq F value  Pr(>F)
classifier  7 0.061714 0.008816  1.9412 0.07767 .
Residuals  63 0.286119 0.004542
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# AUC
Error: dataset
          Df  Sum Sq Mean Sq F value Pr(>F)
Residuals  9 1.29361 0.14373

Error: Within
           Df  Sum Sq Mean Sq F value   Pr(>F)
classifier  7 0.14657 0.02094  3.3505 0.004243 **
Residuals  63 0.39370 0.00625
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# F measure
Error: dataset
          Df Sum Sq Mean Sq F value Pr(>F)
Residuals  9 6.0324  0.6703

Error: Within
           Df  Sum Sq Mean Sq F value Pr(>F)
classifier  7 0.10585 0.01512  1.3814 0.2289
Residuals  63 0.68962 0.01095

# KB information score (note the near-zero between-dataset variation,
# a consequence of the [0,1] mapping described in Footnote 2)
Error: dataset
          Df     Sum Sq    Mean Sq F value Pr(>F)
Residuals  9 2.5602e-13 2.8447e-14

Error: Within
           Df  Sum Sq Mean Sq F value Pr(>F)
classifier  7 0.40546 0.05792   1.656 0.1364
Residuals  63 2.20353 0.03498
From the results, we can reject the hypothesis that the results are all similar
for the eight classifiers at the 99% confidence level for accuracy and AUC.
For RMSE, this hypothesis can be rejected only at the 90% confidence level,
and it cannot be rejected for the F measure and KB's information score. Let
us then follow this up with Dunnett's post hoc test on the results of accuracy,
AUC, and RMSE. Note that the number of degrees of freedom for these experiments is
(10 − 1)(8 − 1) = 63 (10 domains and eight classifiers) and the significance
level is α = 0.05. Accordingly, we get a critical value of approximately 1.671 for the
one-tailed test, or approximately 2.0 for the two-tailed test (i.e., looking at
α = 0.025), from the t table of Appendix A (see Footnote 3) for the resulting Dunnett test
statistic. Under the assumption that rf is superior to the other algorithms, we
can use the one-tailed test, and thus the value of 1.671. Consequently, if the
absolute value of Dunnett's t statistic (denoted t_{f1,f2} for classifiers f1 and f2)
is smaller than 1.671, then we cannot reject the hypothesis that both classifiers
perform equivalently.
For the accuracy measure, we then get the following results:
t_{nb,rf} = (0.7716 − 0.8272)/√(2 × 0.003663/10) = −2.05
t_{1nn,rf} = −1.46
t_{svm,rf} = −0.43
t_{ada,rf} = −4.24
t_{bag,rf} = −0.14
t_{c45,rf} = −0.35
t_{rip,rf} = −0.23
These results thus suggest (taking absolute values) that, as far as accuracy
is concerned, rf performs significantly better (at significance level 0.05)
than two classifiers: nb and ada.
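These statistics are straightforward to compute from the ANOVA output; a small sketch (the function name dunnett_t is ours), using the within-error mean square and the classifiers' mean performances over the n = 10 domains:

# Dunnett-style statistic against the control rf, from the ANOVA's
# within-error mean square (MSE) over n matched domains.
dunnett_t <- function(mean_f, mean_rf, mse, n = 10) {
  (mean_f - mean_rf) / sqrt(2 * mse / n)
}
dunnett_t(0.7716, 0.8272, 0.003663)  # reproduces t_{nb,rf} = -2.05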
Similarly, for AUC, we get the following results, suggesting that rf performs
significantly better, at a significance level of 0.05, than 1nn, svm, ada, c45,
and rip.
t_{nb,rf} = (0.8421 − 0.8578)/√(2 × 0.00625/10) = −0.45
t_{1nn,rf} = −3.75
t_{svm,rf} = −2.41
t_{ada,rf} = −1.96
t_{bag,rf} = −0.15
t_{c45,rf} = −1.83
t_{rip,rf} = −2.34
Finally, the following post hoc test results for RMSE show rf to be signifi-
cantly better than nb, svm, and ada at a significance level of 0.05.
t_{nb,rf} = (0.3036 − 0.2524)/√(2 × 0.004542/10) = 1.71
t_{1nn,rf} = 1.11
t_{svm,rf} = 2.73
t_{ada,rf} = 1.68
t_{bag,rf} = 0.05
t_{c45,rf} = 0.63
t_{rip,rf} = 0.08

3 The t table does not list a value for 63 degrees of freedom, so we chose the closest: 60 degrees of freedom.
Overall, based on ANOVA–Dunnett, it then seems that rf is slightly more
effective than a number of other classifiers, at least with respect to AUC and, to
some extent, with respect to RMSE for the chosen domains.
Let us now investigate the effect of the parametric assumptions on the preceding
findings by using a potentially more sensitive nonparametric test, Friedman's test,
and follow it with Nemenyi's post hoc test in the event of null-hypothesis rejec-
tions. Listings C.5, C.6, C.7, C.8, and C.9 give the results of running Friedman's
test in R for the results on accuracy, AUC, RMSE, F measure, and KB informa-
tion score, respectively.
Listing C.5: The results of the omnibus Friedman test on the eight classifiers
and 10 domains, using accuracy.
> NBaccuracy  = c(.7617, .8559, .7264, .9053, .7576, .4945, .8381, .9531, .727, .6964)
> IB1accuracy = c(.7217, .9913, .7529, .7816, .7062, .6995, .814, .9152, .6859, .8085)
> SVMaccuracy = c(.725, .9746, .8077, .8757, .768, .5736, .8577, .9358, .6552, .9833)
> Adaaccuracy = c(.7217, .8363, .4646, .7177, .7492, .4489, .8137, .9297, .7162, .7272)
> Bagaccuracy = c(.7567, .9876, .76, .8337, .7566, .7248, .82, .9956, .691, .9098)
> J48accuracy = c(.835, .9858, .7727, .7782, .7449, .6763, .7922, .9954, .7428, .8528)
> RFaccuracy  = c(.7567, .9941, .7709, .8011, .7444, .7616, .8247, .9919, .697, .93)
> JRIPaccuracy = c(.8067, .9826, .7311, .803, .7518, .6678, .7813, .9942, .7145, .9755)
> t = matrix(c(NBaccuracy, IB1accuracy, SVMaccuracy, Adaaccuracy, Bagaccuracy,  # call restored by
+             J48accuracy, RFaccuracy, JRIPaccuracy), nrow = 10, byrow = FALSE) # analogy with C.6-C.9
> friedman.test(t)

        Friedman rank sum test

data: t
Friedman chi-squared = 15.6205, df = 7, p-value = 0.02882
Listing C.6: The results of the omnibus Friedman test on the eight classifiers
and 10 domains, using AUC.
> NBAUC  = c(0.95, 0.9954, 0.7033, 0.9934, 0.815, 0.729, 0.8567, 0.9317, 0.7025, 0.7443)
> IB1AUC = c(0.765, 0.9375, 0.5, 0.8598, 0.6677, 0.7984, 0.6785, 0.6709, 0.6043, 0.7851)
> SVMAUC = c(0.915, 0.9826, 0.5, 0.9261, 0.7131, 0.7783, 0.7685, 0.5918, 0.5836, 0.9759)
> AdaAUC = c(0.835, 0.831, 0.5, 0.889, 0.8049, 0.7083, 0.8326, 0.9914, 0.6991, 0.8021)
> BagAUC = c(0.935, 0.9655, 0.5, 0.9596, 0.8218, 0.9063, 0.8257, 0.9965, 0.6416, 0.9726)
> J48AUC = c(0.945, 0.931, 0.5, 0.8448, 0.7514, 0.7938, 0.6678, 0.9962, 0.6063, 0.9013)
> RFAUC  = c(0.975, 0.9676, 0.5, 0.9577, 0.7945, 0.9209, 0.8378, 0.9986, 0.6471, 0.9792)
> JRIPAUC = c(0.94, 0.755, 0.5, 0.8619, 0.7195, 0.8019, 0.6224, 0.988, 0.5975, 0.9738)
> t = matrix(c(NBAUC, IB1AUC, SVMAUC, AdaAUC, BagAUC, J48AUC,
+             RFAUC, JRIPAUC), nrow = 10, byrow = FALSE)
> friedman.test(t)

        Friedman rank sum test

data: t
Friedman chi-squared = 24.5, df = 7, p-value = 0.0009302
Listing C.7: The results of the omnibus Friedman test on the eight classifiers
and 10 domains, using RMSE.
> NBRMSE  = c(0.2965, 0.2028, 0.1355, 0.2785, 0.4194, 0.337, 0.3459, 0.1379, 0.4512, 0.4308)
> IB1RMSE = c(0.3052, 0.0384, 0.1417, 0.3791, 0.5402, 0.29, 0.2902, 0.2904, 0.2904, 0.2904)
> SVMRMSE = c(0.3741, 0.3111, 0.1934, 0.345, 0.4793, 0.3162, 0.3435, 0.3214, 0.5477, 0.1103)
> AdaRMSE = c(0.2904, 0.269, 0.1724, 0.3602, 0.4157, 0.3026, 0.3601, 0.1216, 0.4355, 0.3011)
> BagRMSE = c(0.3028, 0.0573, 0.123, 0.2857, 0.403, 0.2366, 0.3522, 0.0422, 0.4505, 0.2843)
> J48RMSE = c(0.2333, 0.0571, 0.1208, 0.3574, 0.4388, 0.2832, 0.404, 0.0385, 0.444, 0.3344)
> RFRMSE  = c(0.2587, 0.0474, 0.1216, 0.3102, 0.4199, 0.2175, 0.3419, 0.0651, 0.4663, 0.275)   # restored from Table C.4
> JRIPRMSE = c(0.267, 0.0664, 0.1343, 0.3333, 0.4274, 0.2759, 0.4077, 0.0488, 0.4494, 0.1376)  # restored from Table C.4
> t = matrix(c(NBRMSE, IB1RMSE, SVMRMSE, AdaRMSE, BagRMSE, J48RMSE,
+             RFRMSE, JRIPRMSE), nrow = 10, byrow = FALSE)
> friedman.test(t)

        Friedman rank sum test

data: t
Friedman chi-squared = 13.9333, df = 7, p-value = 0.05238
Listing C.8: The results of the omnibus Friedman test on the eight classifiers
and 10 domains, using the F measure.
> NBFMEAS  = c(0.3767, 0.6317, 0, 0.9421, 0.8183, 0.5762, 0.6405, 0.9767, 0.8131, 0.4868)
> IB1FMEAS = c(0.31, 0.7, 0, 0.8475, 0.7785, 0.7248, 0.4691, 0.9563, 0.7811, 0.718)
> SVMFMEAS = c(0.3867, 0.6833, 0, 0.9115, 0.8339, 0.5276, 0.6297, 0.9665, 0.7974, 0.9749)
> AdaFMEAS = c(0.4017, 0, 0, 0.746, 0.8145, 0.6269, 0.4846, 0.9737, 0.8066, 0.5976)
> BagFMEAS = c(0.3667, 0.495, 0, 0.8727, 0.8177, 0.7519, 0.3678, 0.9986, 0.8017, 0.857)
> J48FMEAS = c(0.4767, 0.5067, 0, 0.8267, 0.8063, 0.6957, 0.4085, 0.9983, 0.8378, 0.779)
> RFFMEAS  = c(0.3867, 0.7, 0, 0.8773, 0.8119, 0.7852, 0.4578, 0.9961, 0.7953, 0.8959)
> JRIPFMEAS = c(0.4733, 0.3767, 0, 0.8356, 0.8159, 0.6834, 0.3646, 0.9977, 0.8126, 0.9642)
> t = matrix(c(NBFMEAS, IB1FMEAS, SVMFMEAS, AdaFMEAS, BagFMEAS, J48FMEAS,
+             RFFMEAS, JRIPFMEAS), nrow = 10, byrow = FALSE)
> friedman.test(t)

        Friedman rank sum test

data: t
Friedman chi-squared = 5.691, df = 7, p-value = 0.5763
Listing C.9: The results of the omnibus Friedman test on the eight classifiers
and 10 domains, using the KB relative information score.
> NBKB  = c(0.4781805, 0.5187728, 0.5531521, 0.5058934, 0.5395482,
           0.4339969, 0.7067169, 0.5449937, 0.7183527, 0.1558427)
> IB1KB = c(0.5447975, 0.6393088, 0.6238450, 0.6031634, 0.4852430,
           0.6582337, 0.6007191, 0.5221051, 0.7109930, 0.4715869)
> SVMKB = c(0.3729195, 0.0000000, 0.0871274, 0.3620662, 0.6948530,
           0.1388810, 0.8722821, 0.0000000, 0.8007708, 0.8107599)
> AdaKB = c(0.1847168, 0.3276748, 0.2424619, 0.2638829, 0.4575311,
           0.2998341, 0.5139759, 0.5603366, 0.4756509, 0.2887929)
> BagKB = c(0.4841322, 0.6266453, 0.6313791, 0.5482123, 0.4776437,
           0.6038634, 0.2937511, 0.5958554, 0.1761929, 0.4697988)
> J48KB = c(0.7083431, 0.6319375, 0.6766433, 0.5532177, 0.4808735,
           0.6288378, 0.3409576, 0.5977915, 0.4145745, 0.5321341)
> RFKB  = c(0.5963652, 0.6327713, 0.5953899, 0.6138330, 0.4703806,
           0.6966989, 0.4378017, 0.5831585, 0.3557358, 0.4891372)
> JRIPKB = c(0.6305452, 0.6228880, 0.5900011, 0.5497311, 0.3939268,
           0.5396542, 0.2337956, 0.5957591, 0.3477294, 0.7819474)
> t = matrix(c(NBKB, IB1KB, SVMKB, AdaKB, BagKB, J48KB, RFKB,
+             JRIPKB), nrow = 10, byrow = FALSE)
> friedman.test(t)

        Friedman rank sum test

data: t
Friedman chi-squared = 14.7667, df = 7, p-value = 0.03911
The p values obtained for all these tests tell us that, although for the F measure
the hypothesis that all classifiers perform similarly cannot be rejected, it
can be rejected at the 95% confidence level for accuracy, AUC, and the KB infor-
mation score, and at the 90% confidence level for the RMSE. To discover where these
differences lie, we apply Nemenyi's post hoc test. We start by calculating each
algorithm's rank sum over each evaluation measure. The resulting rank sums
for the classifiers on all domains are tabulated for each performance measure
in Tables C.9 (accuracy), C.10 (RMSE), C.11 (AUC), and C.12 (KB score).
Following each of these tables are the corresponding Nemenyi test calculations.
Table C.10. Rank-sum results obtained on eight classifiers and 10 domains, using RMSE
Table C.11. Rank-sum results obtained on eight classifiers and 10 domains, using AUC
Table C.12. Rank-sum results obtained on eight classifiers and 10 domains, using KB
relative information score
For the RMSE, using rf as control, the Nemenyi test calculations give the following results:

q_{1nn,rf} = 19.18
q_{svm,rf} = 1.37
q_{ada,rf} = 28.31
q_{bag,rf} = −1.83
q_{c45,rf} = 2.28
q_{rip,rf} = 5.94
In the case of RMSE, the q values for 1nn, ada, and rip exceed the critical
value of 3.048 and are thus significant. Note the contrast with the previous Dunnett test,
which found a significant difference in a somewhat different set of comparisons (nb, svm, and ada).
Moving on to the AUC, we obtain the following results for the Nemenyi test
calculations, again using rf as control:
q_{nb,rf} = (R_nb − R_rf)/√(k(k + 1)/(6n)) = (30 − 25)/√((8 × 9)/60) = 4.56
q_{1nn,rf} = 34.69
q_{svm,rf} = 24.65
q_{ada,rf} = 23.73
q_{bag,rf} = 9.13
q_{c45,rf} = 25.56
q_{rip,rf} = 23.73
In the case of the AUC, all the results are found to be statistically
significant because all of the q values exceed the value of
3.048.
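The q statistics above follow directly from the rank sums; a small sketch (the function name nemenyi_q is ours), with k = 8 classifiers and n = 10 domains:

# Nemenyi-style q statistic from rank sums, as used in the text.
nemenyi_q <- function(ranksum_f, ranksum_rf, k = 8, n = 10) {
  (ranksum_f - ranksum_rf) / sqrt(k * (k + 1) / (6 * n))
}
nemenyi_q(30, 25)  # reproduces q_{nb,rf} = 4.56 for the AUC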
Finally, on the KB scores, with rf as control, the Nemenyi test calculations
give these results:
q_{nb,rf} = (R_nb − R_rf)/√(k(k + 1)/(6n)) = (50 − 36)/√((8 × 9)/60) = 12.78
q_{1nn,rf} = −2.73
q_{svm,rf} = 12.78
q_{ada,rf} = 25.56
q_{bag,rf} = 10.95
q_{c45,rf} = −5.48
q_{rip,rf} = 11.87
Because all the (absolute) values but one (q_{1nn,rf}) exceed 3.048, all the
differences are deemed significant with respect to the KB relative information score,
except for the comparison between 1nn and rf.
Several remarks can be made concerning this study as compared with that of
Chapter 1. First and foremost, the use of several metrics, and not just accuracy
alone, is an eye-opener with respect to the strengths and weaknesses of each
method. That a classifier such as svm can jump from the top position to the
bottom is certainly indicative of the caveats of relying on a single metric such as
accuracy. Second, the use of visualization methods presents a nice advantage by
providing a quick summary of results, thereby allowing us to focus immediately
on the points of interest (and even helping to identify them). This obviates the need
to track individual results over each classifier. Relevant conclusions are hence
easier to obtain once the criteria of interest are identified. In the
current case, we focused on the strength of rf compared with the other classifiers
on various measures. Making use of omnibus tests for statistical significance
followed by relevant post hoc tests was not only less cumbersome but also
more sensible, given the large number of pairwise comparisons that would otherwise
be required. Hence, we see how a broader understanding of the various evaluation
methods and tools at our disposal allows us to perform a more principled
evaluation.
Bibliography
N. M. Adams and D. J. Hand. Comparing classifiers when the misallocation costs are
uncertain. Pattern Recognition, 32:1139–1147, 1999.
D. Aha. Generalizing from case studies: A case study. In Proceedings of the 9th Interna-
tional Workshop on Machine Learning (ICML ’92), pp. 1–10. Morgan Kaufmann, San
Mateo, CA, 1992.
R. Alaiz-Rodríguez and N. Japkowicz. Assessing the impact of changing environments
on classifier performance. In Proceedings of the 21st Canadian Conference on Artificial
Intelligence (AI 2008). Springer, New York, 2008.
R. Alaiz-Rodríguez, N. Japkowicz, and P. Tischer. Visualizing classifier performance on
different domains. In Proceedings of the 2008 20th IEEE International Conference
on Tools with Artificial Intelligence (ICTAI ’08), pp. 3–10. IEEE Computer Society,
Washington, D.C., 2008.
S. Ali and K. A. Smith. Kernel width selection for SVM classification: A meta-learning
approach. International Journal of Data Warehousing and Mining, 1:78–97, 2006.
E. Alpaydın. Combined 5 × 2 cv F test for comparing supervised classification learning algorithms.
Neural Computation, 11:1885–1892, 1999.
A. Andersson, P. Davidsson, and J. Linden. Measure-based classifier performance evalua-
tion. Pattern Recognition Letters, 20:1165–1173, 1999.
J. S. Armstrong. Significance tests harm progress in forecasting. International Journal of
Forecasting, 23:321–327, 2007.
A. Asuncion and D. J. Newman. UCI machine learning repository. University of Califor-
nia, Irvine, School of Information and Computer Science, 2007. URL: http://www.ics.
uci.edu/~mlearn/MLRepository.html.
T. L. Bailey and C. Elkan. Estimating the accuracy of learned concepts. In Proceedings of
the 1993 International Joint Conference on Artificial Intelligence, pp. 895–900. Morgan
Kaufmann, San Mateo, CA, 1993.
S. D. Bay, D. Kibler, M. J. Pazzani, and P. Smyth. The UCI KDD archive of large data
sets for data mining research and experimentation. SIGKDD Explorations, 2(2):81–85,
December 2000.
C. Bellinger, J. Lalonde, M. W. Floyd, V. Mallur, E. Elkanzi, D. Ghazi, J. He, A. Mouttham,
M. Scaiano, E. Wehbe, and N. Japkowicz. An evaluation of the value added by informative
metrics. In Proceedings of the Fourth Workshop on Evaluation Methods for Machine
Learning, 2009.
E. M. Bennett, R. Alpert, and A. C. Goldstein. Communications through limited response
questioning. Public Opinion Q, 18:303–308, 1954.
C. Cortes and M. Mohri. Confidence intervals for the area under the ROC curve. In Advances
in Neural Information Processing Systems, Vol. 17. MIT Press, Cambridge, MA,
2005.
J. Davis and M. Goadrich. The relationship between precision-recall and ROC curves.
In Proceedings of the International Conference on Machine Learning, pp. 233–240.
Association for Computing Machinery, New York, 2006.
J. J. Deeks and D. G. Altman. Diagnostic tests 4: Likelihood ratios. British Medical Journal,
329:168–169, 2004.
G. Demartini and S. Mizzaro. A classification of IR effectiveness metrics. In Proceedings of
the European Conference on Information Retrieval, pp. 488–491. Vol. 3936 of Springer
Lecture Notes. Springer, Berlin, 2006.
J. Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine
Learning Research, 7:1–30, 2006.
J. Demšar. On the appropriateness of statistical tests in machine learning. In Proceedings of
the ICML’08 Third Workshop on Evaluation Methods for Machine Learning. Association
for Computing Machinery, New York, 2008.
L. R. Dice. Measures of the amount of ecologic association between species. Ecology,
26:297–302, 1945.
T. G. Dietterich. Approximate statistical tests for comparing supervised classification learn-
ing algorithms. Neural Computation, 10:1895–1924, 1998.
P. Domingos. A unified bias-variance decomposition and its applications. In Proceedings
of the 17th International Conference on Machine Learning, pp. 231–238. Morgan Kauf-
mann, San Mateo, CA, 2000.
C. Drummond. Machine learning as an experimental science (revised). In Proceedings
of the AAAI’06 Workshop on Evaluation Methods for Machine Learning I. American
Association for Artificial Intelligence, Menlo Park, CA, 2006.
C. Drummond. Finding a balance between anarchy and orthodoxy. In Proceedings of the
ICML’08 Third Workshop on Evaluation Methods for Machine Learning. Association
for Computing Machinery, New York, 2008.
C. Drummond and N. Japkowicz. Warning: Statistical benchmarking is addictive. Kick-
ing the habit in machine learning. Journal of Experimental and Theoretical Artificial
Intelligence, 22(1):67–80, 2010.
B. Efron. Estimating the error rate of a prediction rule: Improvement on cross-validation.
Journal of the American Statistical Association, 78:316–331, 1983.
B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap, Chapman and Hall, New
York, 1993.
W. Elazmeh, N. Japkowicz, and S. Matwin. A framework for measuring classification
difference with imbalance. In Proceedings of the 2006 European Conference on Machine
Learning (ECML/PKDD 2006). Springer, Berlin, 2006.
W. Fan, S. J. Stolfo, J. Zhang, and P. K. Chan. Adacost: Misclassification cost-sensitive
boosting. In Proceedings of the 16th International Conference on Machine Learning,
pp. 97–105. Morgan Kaufmann, San Mateo, CA, 1999.
T. Fawcett. ROC graphs: Notes and practical considerations for data mining researchers.
Technical Note HPL 2003–4, Hewlett-Packard Laboratories, 2004.
T. Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27:861–874,
2006.
T. Fawcett and A. Niculescu-Mizil. PAV and the ROC convex hull. Machine Learning, 68
(1):97–106, 2007. doi: http://dx.doi.org/10.1007/s10994-007-5011-0.
C. Ferri, P. A. Flach, and J. Hernandez-Orallo. Improving the AUC of probabilistic esti-
mation trees. In Proceedings of the 14th European Conference on Machine Learning,
pp. 121–132. Springer, Berlin, 2003.
C. Ferri, J. Hernandez-Orallo, and R. Modroiu. An experimental comparison of perfor-
mance measures for classification. Pattern Recognition Letters, 30:27–38, 2009.
R. A. Fisher. Statistical Methods and Scientific Inference. 2nd ed. Hafner, New York,
1959.
R. A. Fisher. The Design of Experiments. 2nd ed. Hafner, New York, 1960.
P. A. Flach. The geometry of ROC space: Understanding machine learning metrics through
ROC isometrics. In Proceedings of the 20th International Conference on Machine Learn-
ing, pp. 194–201. American Association for Artificial Intelligence, Menlo Park, CA,
2003.
P. A. Flach and S. Wu. Repairing concavities in ROC curves. In Proceedings of the 19th
International Joint Conference on Artificial Intelligence (IJCAI’05), pp. 702–707. Pro-
fessional Book Center, 2005.
J. L. Fleiss. Measuring nominal scale agreement among many raters. Psychological Bulletin,
76:378–382, 1971.
G. Forman. A method for discovering the insignificance of one’s best classifier and the
unlearnability of a classification task. In Proceedings of the First International Workshop
on Data Mining Lessons Learned (DMLL-2002), 2002.
M. R. Forster. Key concepts in model selection: Performance and generalizability. Journal
of Mathematical Psychology, 44:205–231, 2000.
Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for
combining preferences. Journal of Machine Learning Research, 4:933–969, 2003.
M. Friedman. The use of ranks to avoid the assumption of normality implicit in the analysis
of variance. Journal of the American Statistical Association, 32:675–701, 1937.
M. Friedman. A comparison of alternative tests of significance for the problem of m
rankings. Annals of Mathematical Statistics, 11:86–92, 1940.
J. Fuernkranz and P. A. Flach. ROC 'n' rule learning – Towards a better understanding of
covering algorithms. Machine Learning, 58:39–77, 2005.
V. Ganti, J. Gehrke, R. Ramakrishnan, and W. Y. Loh. A framework for measuring dif-
ferences in data characteristics. Journal of Computer and System Sciences, 64:542–578,
2002.
M. Gardner and D. G. Altman. Confidence intervals rather than p values: Estimation rather
than hypothesis testing. British Medical Journal, 292:746–750, 1986.
L. Gaudette and N. Japkowicz. Evaluation methods for ordinal classification. In Proceed-
ings of the 2009 Canadian Conference on Artificial Intelligence. Springer, New York,
2009.
L. Geng and H. Hamilton. Choosing the right lens: Finding what is interesting in data
mining. In F. Guillet and H. J. Hamilton, editors, Quality Measures in Data Mining,
pp. 3–24. Vol. 43 of Springer Studies in Computational Intelligence Series, Springer,
Berlin, 2007.
G. Gigerenzer. Mindless statistics. Journal of Socio-Economics, 33:587–606, 2004.
J. Gill and K. Meir. The insignificance of null hypothesis significance testing. Political
Research Quarterly, pp. 647–674, 1999.
T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller,
M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander. Molec-
ular classification of cancer: Class discovery and class prediction by gene expression
monitoring. Science, 286:531–537, 1999.
S. N. Goodman. A comment on replication, p-values and evidence. Statistics in Medicine,
11:875–879, 1992.
W. S. Gosset (pen name: Student). The probable error of a mean. Biometrika, 6:1–25, 1908.
K. Gwet. Kappa statistic is not satisfactory for assessing the extent of agreement
between raters. Statistical Methods for Inter-Rater Reliability Assessment Series, 1:1–6,
2002a.
K. Gwet. Inter-rater reliability: Dependency on trait prevalence and marginal homogeneity.
Statistical Methods for Inter-Rater Reliability Assessment Series, 2:1–9, 2002b.
D. J. Hand. Classifier technology and the illusion of progress. Statistical Science, 21:1–15,
2006.
D. J. Hand. Measuring classifier performance: A coherent alternative to the area under the
ROC curve. Machine Learning, 77:103–123, 2009.
D. J. Hand and R. J. Till. A simple generalisation of the area under the ROC curve for
multiple class classification problems. Machine Learning, 45:171–186, 2001.
J. A. Hanley and B. J. McNeil. The meaning and use of the area under a receiver operating
characteristic (ROC) curve. Radiology, 143:29–36, 1982.
L. L. Harlow and S. A. Mulaik, editors. What If There Were No Significance Tests? Lawrence
Erlbaum, Mahwah, NJ, 1997.
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data
Mining, Inference and Prediction. Springer-Verlag, New York, 2001.
J. He, A. H. Tan, C. L. Tan, and S. Y. Sung. On quantitative evaluation of clustering
systems. In W. Wu and H. Xiong, editors, Information Retrieval and Clustering. Kluwer
Academic, Dordrecht, The Netherlands, 2002.
X. He and E. C. Frey. The meaning and use of the volume under a three-class ROC surface
(vus). IEEE Transactions Medical Imaging, 27:577–588, 2008.
R. Herbrich. Learning Kernel Classifiers. MIT Press, Cambridge, MA, 2002.
T. Hill and P. Lewicki. STATISTICS Methods and Applications. StatSoft, Tulsa, OK, 2007.
P. Hinton. Statistics Explained. Routledge, London, 1995.
S. Holm. A simple sequentially rejective multiple test procedure. Scandinavian Journal of
Statistics, 6(2):65–70, 1979.
R. C. Holte. Very simple classification rules perform well on most commonly used data
sets. Machine Learning, 11:63–91, 1993.
G. Hommel. A stagewise rejective multiple test procedure based on a modified Bonferroni
test. Biometrika, 75:383–386, 1988.
L. R. Hope and K. B. Korb. A Bayesian metric for evaluating machine learning algorithms.
In Australian Conference on Artificial Intelligence, pp. 991–997. Vol. 3399 of Springer
Lecture Notes in Computer Science. Springer, New York, 2004.
D. C. Howell. Statistical Methods for Psychology. 5th ed. Duxbury Press, Thomson Learn-
ing, 2002.
D. C. Howell. Resampling Statistics: Randomization and the Bootstrap. On-Line Notes,
2007. URL: http://www.uvm.edu/~dhowell/StatPages/Resampling/Resampling.html.
J. Huang and C. X. Ling. Constructing new and better evaluation measures for machine
learning. In Proceedings of the 20th International Joint Conference on Artificial Intelli-
gence (IJCAI ’07), pp. 859–864, 2007.
J. Huang, C. X. Ling, H. Zhang, and S. Matwin. Proper model selection with significance
test. In Proceedings of the European Conference on Machine Learning (ECML-2008),
pp. 536–547. Springer, Berlin, 2008.
R. Hubbard and R. M. Lindsay. Why p values are not a useful measure of evidence in
statistical significance testing. Theory and Psychology, 18:69–88, 2008.
J. P. A. Ioannidis. Why most published research findings are false. Public Library of Science
Medicine, 2(8):e124, 2005.
P. Jaccard. The distribution of the flora in the alpine zone. New Phytologist, 11(2):37–50,
1912.
A. K. Jain, R. C. Dubes, and C. Chen. Bootstrap techniques for error estimation. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 9:628–633, 1987.
N. Japkowicz. Classifier evaluation: A need for better education and restructuring. In Pro-
ceedings of the ICML’08 Third Workshop on Evaluation Methods for Machine Learning,
July 2008.
N. Japkowicz, P. Sanghi, and P. Tischer. A projection-based framework for classifier per-
formance evaluation. In Proceedings of the 2008 European Conference on Machine
Learning and Knowledge Discovery in Databases (ECML PKDD ’08) – Part I, pp. 548–
563. Springer-Verlag, Berlin, 2008.
D. Jensen and P. Cohen. Multiple comparisons in induction algorithms. Machine Learning,
38:309–338, 2000.
H. Jin and Y. Lu. Permutation test for non-inferiority of the linear to the optimal combination
of multiple tests. Statistics and Probability Letters, 79:664–669, 2009.
M. Kendall. A new measure of rank correlation. Biometrika, 30:81–89, 1938.
D. F. Kibler and P. Langley. Machine learning as an experimental science. In Proceedings
of the Third European Working Session on Learning (EWSL), pp. 81–92. Pitman, New
York, 1988.
Index
Critical value, 67, 232, 234–235, 237–238, 242, 249, 252–256, 351, 355, 362–363, 390
Cross-entropy, 146
Cross-validated t-test, 188
Cross-Validation (CV), 172
10-fold Cross Validation, 10, 43, 48, 64, 84, 128, 147–148, 159, 199–200, 203–204, 219, 223, 228, 232, 242, 263, 266, 300, 341, 364
Cumulative distribution function (CDF), 45, 61
Data mining, 2, 6, 19–20, 206, 246, 294–297, 307, 350, 370
Dataset, 15
Dataset selection, 11, 21, 42, 292, 301, 306, 336–337, 340, 342, 345, 347
De-Facto approach/culture, 7–8
DET curve, 136, 156, 160
Deterministic algorithm, 75
Domain specific metric, 145
Dunnett Test, 253–255, 351, 363, 382, 384
Effect size, 71–72, 211, 213, 221–222, 258, 264–265, 373
Efficiency method, 318–320
Empirical risk, 27–28
Empirical risk minimization, 29–30, 54, 159
Error estimation, 35, 74, 84, 161
Error rate, 86–87, 89, 94, 96, 104, 109, 133–134, 159, 162, 167, 175, 182, 187
Evaluation framework, 162, 292, 306, 308, 317, 329–330, 335–338, 340–349
Evaluation framework template, 336–338
Evaluation measure (performance measure), 75–77
Evaluation metric (performance metric), 11–12, 42, 82, 84, 145, 160, 207–209, 316, 322–323, 379–381
Expected cost, 120, 122, 128, 135
Expected risk, 27–28, 163, 176, 202
Expected value of a random variable, 45–46
Exploratory data analysis, 66, 211
False negative, 69, 79, 89, 91, 95, 116–117, 135
False negative rate, 95
False positive, 69
False positive rate, 84, 94–95, 113–116, 121–128, 135, 149, 151–152, 370–371
F-measure, 104, 108, 157, 372
F-Ratio table, 351, 357–360
Friedman table, 351, 361
Friedman Test, 11, 239, 248–251, 255, 257–258, 268, 273–275, 290, 385, 388
Gaussian distribution, 53–54, 60
Generalization, 16, 32, 34, 39, 41, 69, 87, 90, 93, 110, 131, 144, 164–165, 170, 176, 240, 297, 303, 323, 344
Generalization error, 16, 28–29, 32, 41, 54, 170, 323–324
Generic algorithm, 293–294, 301, 336–337, 340
Geometric distribution, 56
Geometric mean, 96, 100–101, 103
Gini coefficient, 122, 139, 159
Gold standard, 145
Graphical performance measure, 112
H Measure, 81, 159, 310
Holdout method, 70, 161–162, 167, 169, 176, 202, 226
Holdout risk bound, 325–326, 345
Honestly significant difference (HSD), 252
Hypothesis testing, 42, 59–62, 64–68, 162, 184, 188, 211, 216–217, 230, 239, 290, 324
Indicator function, 32, 85–86, 141, 230, 291, 325
Inductive inference, 23–24, 48
Information theoretic measure, 137–138, 143, 159
Information Reward, 140, 143
Information Score, 84, 140–142, 157, 317, 376, 384–385, 388, 391
Iso-curves, 122
Isometrics, 119–124, 133, 314
Iso-precision lines, 120
jackknife, 175, 186, 203
Kappa statistic, 93
Kendall coefficient, 144
k-fold Cross-Validation, 7, 14, 127–128, 162, 171–173, 189–191, 193–194, 202–204, 258, 260, 262, 280, 290, 344
Kononenko and Bratko's Information Score, 84, 140–142
Kullback-Leibler divergence, 139
Learning bias, 31, 33, 177, 323–324, 329, 339
Leave One Out, 162, 171, 173, 175–176, 194, 202, 327, 341
Lift chart, 132
Lift curve, 153–154
Likelihood ratio, 97–99
Loss function, 27, 32, 34–38, 40, 48, 52, 75, 106, 137, 144, 168, 204
Mann-Whitney U test, 129
McNemar's Test, 68, 217, 226–231, 236, 264–265, 289, 373
McNemar's contingency matrix, 227–229
Machine learning, 23–42, 279–280
Matched samples design, 216
Matched pairs design, 216, 320–321
Measure-based method, 318
Meta-learning, 294
Model selection, 11–12, 29, 31–32, 39–40, 177–178
Monotonic performance measure, 65, 77, 216
Multiclass, 25, 38, 97, 110, 131, 144, 230, 315
Multi-class classification, 85, 97, 110, 174–175
Multiclass focus, 85–86, 101
Multiclass ROC curve, 131
Multiple resampling, 161–163, 178–179, 183, 185–186, 188, 202–203
Multiplicity effect, 15, 299–300, 345
Negative predictive value (NPV), 99, 371
Nemenyi Test, 256–257, 275, 382, 388, 390–391
Nested k-fold cross validation, 178
“No Free Lunch” theorems, 292
noise, 36
Non-parametric test, 110, 131, 144, 164–165, 170, 176, 197, 240, 303, 323, 344
Normal distribution, 47, 53–57, 60, 62–63, 65, 68, 71, 112, 218, 222, 224, 230, 235, 247, 347–348
Normalized expected cost (NEC), 135
Null hypothesis, 65, 216
Null hypothesis statistical testing (NHST), 207, 289
Omnibus statistical test, 251
One-tailed test/One-sided test, 70, 220
One-Way ANOVA, 214, 245–246
One-Way repeated measure ANOVA, 240–245
Online learning, 25, 311
Ontology of performance measures, 81–82
Ontology of error-estimation methods, 163
Operating point, 112, 114, 116, 119, 122–123
Overview of the statistical tests, 214
Overfitting, 30, 32, 39, 167, 229, 299, 302, 325
Panacea approach to evaluation, 4, 7, 349
Parameter estimation, 34
Parameter selection, 4, 15, 76, 178
Parametric test, 15, 69, 215, 217, 226, 230, 239, 247, 263, 289, 347–348
Parametric hypothesis testing, 68
Passive learning, 25
Perfect calibration, 140, 143
Performance metric, 4, 21, 83, 85–89, 94, 308–309, 311
Permutation test, 186, 199–203, 329, 373
Poisson distribution, 55–56
Positive predictive values (PPV), 13
Power, 69–70, 145, 203, 226, 230, 256, 260, 275
Precision, 13, 70–71, 84, 99–100, 102, 104, 109
Precision-Recall curve (PR Curve), 132–133
Prior class probability, 13
Probabilistic algorithm, 75
Probability density function (PDF), 45, 54
Probability distribution, 44–46, 55, 58, 67, 139, 331
Probability space, 44
Qualitative metric, 311–312, 319
Quantitative metric, 309–311
R, 42
Random variable, 42, 44–48, 52, 54, 58, 60
Randomization, 15, 162, 183–185, 188, 202, 345, 348
Random subsampling, 162, 179, 194–196, 258–259, 278–279, 290–291
Ranking classifier, 75, 118, 129, 248
Recall, 137–138
Receiver Operating Characteristic (ROC), 111, 112–113
regression, 32–33
Regularization, 29–31, 34
Relative Superiority Graph, 135–136, 156, 160
Reliability metric, 76, 111, 137, 145, 157–159, 316–317
Repeated-measures design, 216
Replicability, 185, 203, 262–263, 291, 294
Repository approach, 294
Resampled matched pair t-test, 225
Resampling, 166–167, 175–176
Resampling framework, 171–172, 180
Resampling statistics, 14–15
Resubstitution error, 161, 164, 181, 202
risk, 27–28
r × k CV, 261–263
ROC Analysis, 112–113
ROC Convex Hull (ROCCH), 122
ROC curve, 124–128, 134–135, 148–153
ROC Curve generation, 124–126
ROCR, 112, 146–148, 152–153, 370–372
ROC Space, 113–120, 123, 128, 130, 133–134, 136, 314
Root Mean Squared Error (RMSE), 137–143, 157
Sample mean, 45–46, 48, 51–53, 57–58, 218, 240
Sample standard deviation, 47, 60–61, 176, 218
Sampling distribution, 57–59, 63, 216, 218, 247, 332, 348
S Coefficient, 90–91, 110
Scoring classifier, 117–118
Scott's π (pi) coefficient, 91
Semi-supervised learning, 24–25
Sensitivity, 95–99
Sign Test, 228–229, 231–233, 236–239, 267, 289, 355
Silver standard, 89
Simple and intuitive measure (SIM), 319–320
Simple resampling, 161–163, 171–173, 176–179, 185–186, 215, 258–260, 290–291
Single class focus, 87, 94, 102
Skew, 104–106
Skew ratio, 110, 112, 119–120, 123, 130, 310
Specialization, 34, 39
Specificity, 13, 82, 95
Standard deviation, 47, 52
Standard error, 60–61, 251
Statistical distribution, 46
Statistical hypothesis testing, 61–62, 65–66, 68, 162, 230, 290
Statistical significance testing, 6, 14–15, 21, 57, 74, 162, 185, 188, 203, 206
Statistical learning theory, 16, 202, 308, 323–325, 329–330
Statistical table, 22, 351
Statistical test, 206–215
Stratified k-fold Cross-Validation, 189, 191, 193
Structural risk minimization, 29, 30
Summary statistic, 128–130, 310
Supervised learning, 1–3, 21, 24–25, 310
Task specific algorithm, 293–294
Total variation, 246
Training set bound, 126–128, 176, 324
True negative, 19, 96, 102, 104, 113
True negative rate, 95–96, 113
True positive, 79, 84, 94–95, 116, 128, 132
True positive rate, 94–95, 128, 132
True risk, 27–29, 61, 68, 162, 164–165, 204, 324
t-table, 261
t-test, 224
Tukey Test, 251–253, 257, 275, 362
Two-tailed test/Two-sided test, 67–68
Type I error, 14, 69–71, 203, 208, 225–226, 231, 239, 255, 261, 263, 291, 329
Type II error, 42, 69–70, 347
UCI Machine Learning Repository, 7
variance, 11, 37, 40, 46
visualization approach, 321–323
visualization-based combination metric, 321–323
WEKA, 7, 13, 22, 73, 83–84, 92, 107–108, 110, 138, 146–147, 157, 163, 174, 187–188, 273, 278, 364, 373
Wilcoxon's Signed Rank Test, 233–235, 236
Wilcoxon's Rank Sum Test, 129
Wilcoxon Table, 235, 238, 267–268, 356
Within-group variation, 240, 242
Zero-one loss, 27, 32, 36, 39, 163, 169
Z-table, 351