
Diagnosing Acute Appendicitis With Very Simple Classification Rules


Aleksander Øhrn
Knowledge Systems Group
Dept. of Computer and Information Science
Norwegian University of Science and Technology
Trondheim, Norway
[email protected]

Abstract
A medical database with 257 patients thought to have acute appendicitis has been analyzed. Binary classifiers composed of very simple univariate if-then classification rules (1R rules) were synthesized, and are shown to
perform well for determining the true disease status. Discriminatory performance was measured by the area under the receiver operating characteristic (ROC) curve. Although a 1R classifier seemingly performs slightly
better than a team of experienced physicians when only readily available
clinical variables are employed, an analysis of cross-validated simulations
shows that this perceived improvement is not statistically significant (p <
0.613). However, further addition of biochemical test results to the model
yields a 1R classifier that is significantly better than both the physicians
(p < 0.03) and a 1R classifier based on clinical variables only (p < 0.0003).

1 Introduction
Acute appendicitis is one of the most common problems in clinical surgery in
the western world, and its diagnosis is sometimes difficult to make, even for
experienced physicians. The costs of the two types of diagnostic errors in the
binary decision-making process are also very different. Unnecessary operations
are clearly desirable to avoid, but failing to operate at an early enough stage
may lead to perforation of the appendix, a serious condition that causes
morbidity and occasionally death. Therefore,
a high rate of unnecessary surgical interventions is usually accepted. Analysis
of collected data with the objective of improving various aspects of diagnosis
is therefore potentially valuable.

This paper analyzes a database of patients thought to have acute appendicitis.


The main objective of this study is to address the following two questions:
1. Based only upon readily available clinical attributes, does a computer
model perform better than a team of physicians at diagnosing acute appendicitis?
2. Does a computer model based upon both clinical attributes and biochemical attributes perform better than a model based only upon the clinical
attributes?
These two issues have previously been addressed in the medical literature by
Hallan et al. using the same database of patients [5, 4]. Multivariate logistic regression, the de facto standard method for analysis of binary data in the health
sciences, was used in those studies. This paper addresses the same issues, but
does so using one of the simplest approaches to rule-based classification imaginable, namely a collection of univariate if-then rules. Univariate if-then rules
are also referred to as 1R rules.
Section 2 of this paper provides an overview of the data material, while Section 3 reviews the applied methodology. The results are given in Section 4
and are analyzed statistically in Section 5. A discussion and conclusions can
be found in Section 6 and Section 7. Some technical details can be found in
Appendix A and Appendix B.

2 Data Material
The contents of a medical database with 257 patients thought to have acute appendicitis are summarized in Table 1. The 257 patients were referred by general
practitioners to the department of surgery at a district general hospital in Norway, and were all suspected to have acute appendicitis after an initial examination in the emergency room. Attributes {a1, ..., a14} are readily available
clinical attributes, while attributes {a15, ..., a18} are the results of biochemical
tests. The outcome attribute d is the final diagnosis of acute appendicitis, and
was based on histological examination of the excised appendix.
After the clinical variables were recorded, the physician also gave an estimate
of the probability that the patient had acute appendicitis, based on these variables. The
estimated probabilities were given in increments of 10% from 0% to 100%. Nine
residents with two to six years of surgical training participated in the study.
For a detailed description of the patient group and the attribute semantics,
see [5, 4]. The same set of 257 patients was analyzed in [4], while a superset
containing 305 patients was analyzed in [5]. Logistic regression was used in
both studies.

Attribute   Description                                        Summary
a1          Age (years)                                        3-86 (22)
a2          Male sex?                                          0.553
a3          Duration of pain (hours)                           2-600 (22)
a4          Anorexia?                                          0.693
a5          Nausea or vomiting?                                0.708
a6          Previous surgery?                                  0.093
a7          Aggravation of pain by movement?                   0.615
a8          Aggravation of pain by coughing?                   0.599
a9          Normal micturition?                                0.872
a10         Tenderness in right lower quadrant?                0.860
a11         Rebound tenderness in right lower quadrant?        0.553
a12         Guarding or rigidity?                              0.307
a13         Classic migration of pain?                         0.494
a14         Rectal temperature (°C)                            36.4-40.3 (37.7)
a15         Erythrocyte sedimentation rate (mm)                1-90 (10)
a16         C-reactive protein concentration (mg/l)            0-260 (12)
a17         White blood cell count (×10^9)                     2.9-31 (12.1)
a18         Neutrophil count (%)                               38-93 (80)
d           Acute appendicitis?                                0.381

Table 1: Summary of attributes recorded for the 257 patients thought to have acute
appendicitis. For binary attributes, the prevalence is given. For numerical attributes,
the range and median are given.

3 Methodology
Let U denote the universe of patients, let A denote the set of classifier input
attributes, and let d denote the outcome attribute. Attributes can be viewed as
functions, so a(x) means the observed value of attribute a for patient x. The set
of 1R rules is then defined as follows:
$$\mathrm{1R} = \bigcup_{x \in U} \, \bigcup_{a \in A} \, \{\, \text{if } (a = a(x)) \text{ then } (d = d(x)) \,\} \qquad (1)$$

If numerical attributes are to be properly incorporated into classification rules,
they need to be discretized. Discretization amounts to searching for intervals or
bins, where all cases that fall within the same interval are grouped together.
This enables numerical attributes to be treated as categorical ones, and several
algorithms for this purpose are available. In this study, for simplicity, all numerical attributes were discretized using a simple equal frequency binning
technique. This fully automatic approach simply divides the attribute domain
into a predetermined number of intervals such that each interval contains approximately the same number of cases.
A classification rule "if α then β" can have several numerical factors associated
with it. A fundamental notion is the support of a pattern described by a rule,
which is defined as the number of cases in U that match the pattern. From this,
the accuracy and coverage of the rule can be computed. Accuracy is an estimate
of Pr(β | α), while coverage is an estimate of Pr(α | β).
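
To make Equation 1 and these rule factors concrete, the following sketch generates all 1R rules from an already discretized table of cases and attaches support, accuracy and coverage to each; the dictionary-based representation and the attribute names are hypothetical, not those of the actual tool used.

```python
from collections import Counter

def one_r_rules(cases, attributes, outcome="d"):
    """All 1R rules 'if (a = v) then (d = k)' observed in the cases,
    together with their support, accuracy and coverage."""
    pattern_support = Counter()      # support of (a = v) and (d = k)
    antecedent_support = Counter()   # support of (a = v)
    outcome_support = Counter()      # support of (d = k)
    for case in cases:
        outcome_support[case[outcome]] += 1
        for a in attributes:
            antecedent_support[(a, case[a])] += 1
            pattern_support[(a, case[a], case[outcome])] += 1
    rules = []
    for (a, v, k), sup in pattern_support.items():
        rules.append({
            "if": (a, v), "then": k, "support": sup,
            "accuracy": sup / antecedent_support[(a, v)],   # estimate of Pr(then-part | if-part)
            "coverage": sup / outcome_support[k],           # estimate of Pr(if-part | then-part)
        })
    return rules

# Tiny hypothetical example with one clinical and one discretized biochemical attribute.
cases = [{"a10": 1, "a16": "high", "d": 1}, {"a10": 1, "a16": "low", "d": 0},
         {"a10": 0, "a16": "low", "d": 0}, {"a10": 1, "a16": "high", "d": 1}]
for rule in one_r_rules(cases, ["a10", "a16"]):
    print(rule)
```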
A binary classifier c realizes a decision function d̂_c, as shown below. The decision function is assumed to be composed of two functions, a certainty function π
and a threshold function τ, where π measures the classifier's certainty that a
patient x has outcome 1. The function τ is a simple threshold function that
evaluates to 0 if π(x) falls below a chosen threshold, and to 1 otherwise. A set
of unordered classification rules can be used to realize the intermediate function
π(x) through a process called voting. Voting is outlined in Appendix A.

$$\hat{d}_c : U \xrightarrow{\pi} [0, 1] \xrightarrow{\tau} \{0, 1\} \qquad (2)$$

A receiver operating characteristic (ROC) curve is a graphical method for assessing the discriminatory performance of a binary classifier [6], independent of
both error costs and the prevalence of disease. By varying the threshold 
across the full spectrum of possible values, one obtains several pairs of estimates for sensitivity (true positive rate) and specificity (true negative rate).
The ROC curve is a plot of the complement of the specificity (the false positive rate) on the x-axis against the sensitivity on the y -axis. The area under
the ROC curve (AUC) computed using the trapezoidal method of integration
can be shown to equal the Wilcoxon-Mann-Whitney statistic, or the probability
that π will assign a higher value to a diseased individual than to a non-diseased
one, if the pair is randomly drawn from the population the ROC curve is derived from. An AUC of 0.5 signifies that the classifier performs no better than
tossing a coin, while an area of 1.0 signifies perfect discrimination.
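
The equivalence between the trapezoidal AUC and the Wilcoxon-Mann-Whitney statistic is easy to verify numerically; the sketch below is a bare-bones illustration with made-up certainty values, not code from the study.

```python
import numpy as np

def auc_mann_whitney(certainty, outcome):
    """Probability that a randomly drawn diseased case receives a higher certainty
    than a randomly drawn non-diseased case (ties counted as 1/2)."""
    certainty, outcome = np.asarray(certainty, dtype=float), np.asarray(outcome)
    pos, neg = certainty[outcome == 1], certainty[outcome == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def auc_trapezoid(certainty, outcome):
    """AUC obtained by sweeping the threshold and integrating the ROC curve."""
    certainty, outcome = np.asarray(certainty, dtype=float), np.asarray(outcome)
    thresholds = np.r_[np.inf, np.unique(certainty)[::-1], -np.inf]
    tpr = np.array([(certainty[outcome == 1] >= t).mean() for t in thresholds])
    fpr = np.array([(certainty[outcome == 0] >= t).mean() for t in thresholds])
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))   # trapezoidal rule

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
print(auc_mann_whitney(scores, labels), auc_trapezoid(scores, labels))   # both 0.8125
```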
In order to make the most out of scarce data, cross-validation (CV) was employed. In k -fold CV the set of cases is randomly divided into k disjoint blocks
of cases, usually of equal size. We then apply a classifier trained using k − 1
blocks to the remaining block to assess its performance. Repeating this for each
of the k blocks enables us to average the estimates from each iteration to obtain
an unbiased performance estimate.
In the training stage of the CV pipeline, the union of the k − 1 blocks was
first discretized using an equal frequency binning technique with three bins.
Intuitively, this corresponds to labeling the values "low", "medium" or "high".
1R rules were subsequently computed from the discretized union of blocks. In
the testing stage, the hold-out block was first discretized using the same bins
that were computed in the training stage, and the cases in the discretized holdout block were then classified using standard voting among the previously
computed 1R rules. The performance on the hold-out block was logged.
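
The essential detail of this pipeline, that the cut points are estimated on the k − 1 training blocks only and then reused unchanged on the hold-out block, can be sketched as follows; the synthetic data, fold handling and names are illustrative assumptions rather than the actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: one numerical attribute and a binary outcome for 20 cases.
values = rng.normal(size=20)
outcome = rng.integers(0, 2, size=20)

k = 10
fold_of = rng.permutation(np.arange(20) % k)        # random assignment of cases to k blocks

for fold in range(k):
    train, test = fold_of != fold, fold_of == fold
    # Training stage: cut points from the union of the k - 1 training blocks only.
    cuts = np.quantile(values[train], [1 / 3, 2 / 3])
    train_bins = np.digitize(values[train], cuts)   # used to induce the 1R rules
    # Testing stage: the hold-out block is discretized with the *same* cuts.
    test_bins = np.digitize(values[test], cuts)
    # ... induce 1R rules from (train_bins, outcome[train]), vote on test_bins,
    #     and log the AUC obtained on the hold-out block ...
```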
In the case of the probability estimates given by the physicians, these directly
define the physicians' realization of the function π.

4 Results
The results from a 10-fold CV run are given in Table 2. The physicians and the
simple 1R classifier both made use of the clinical variables only, while the extended 1R classifier had additional access to the results of the biochemical tests.
On average, the extended 1R classifier performed somewhat better than both
the simple 1R classifier and the team of physicians. The simple 1R classifier and
the physicians perform approximately the same, with the former achieving a
slightly better average score. The mean AUC scores give a measure of how
well one might expect the 1R classifiers to perform on a set of unseen cases,
if the classifiers are trained using all the data. As such, CV does not so much
evaluate a particular model as the method that produces the models.
Table 3 lists the results from 5 replications of 2-fold CV. With 2-fold CV the
testing sets are larger, and the Hanley-McNeil standard deviations of the AUC
estimates correspondingly smaller. Also, the two folds for each replication are
completely independent since neither the training sets nor the test sets overlap.
Again, the extended 1R classifier seems to do better than both the simple 1R
classifier and the physicians. The simple 1R classifier and the physicians again
display a similar degree of performance.
It is not difficult to produce a classifier that classifies the training data perfectly. Although performance on the training data gives a very optimistically biased estimate, 1R rules
are so simple that they do not possess enough degrees of freedom to overfit the data
much. Reference ROC curves obtained when applying the classifiers to the full
set of 257 patients from which they were constructed are displayed in Figure 1.
All simulations were carried out using the ROSETTA software system [10].
The same set of 257 patients has been previously analyzed by Hallan et al. [4]
using multivariate logistic regression (LR). In that study the set of cases was
randomly split in two halves, and an LR-model derived from one half was
applied to the other half. This was done for 20 random splits, and the mean
AUC and the standard deviation of the 20 samples were calculated. An LR-model based upon only the clinical attributes had a mean AUC of 0.854 (0.028),
while an LR-model based on both the clinical attributes and the biochemical
attributes had a mean AUC of 0.920 (0.024). Carlin et al. [2] have also analyzed
the same set of patients, but used rough set (RS) methods. This was also done
with 20 random splits, and the mean AUC and the standard deviation of the
20 samples was 0.850 (0.024) for a model based on clinical variables, and 0.923
(0.023) for a model based on both the clinical attributes and the biochemical
attributes. In all fairness it should be said that both the LR and RS studies
only included clinical attributes {a2, a8, a10, a11, a12, a13}, and did not include
biochemical attribute a15. However, this was done because Hallan et al. found
that adding other clinical attributes or attribute a15 did not improve the LR-models further.

Iteration   Physicians        Simple 1R         Extended 1R
1           0.837 (0.081)     0.879 (0.071)     0.932 (0.053)
2           0.831 (0.089)     0.875 (0.078)     0.916 (0.065)
3           0.872 (0.074)     0.908 (0.064)     0.961 (0.042)
4           0.827 (0.104)     0.887 (0.087)     0.891 (0.086)
5           0.639 (0.113)     0.709 (0.106)     0.827 (0.087)
6           0.733 (0.103)     0.818 (0.089)     0.915 (0.063)
7           0.729 (0.102)     0.815 (0.088)     0.839 (0.082)
8           0.935 (0.061)     0.892 (0.077)     0.967 (0.043)
9           0.914 (0.077)     0.940 (0.065)     0.996 (0.017)
10          0.912 (0.084)     0.858 (0.104)     0.951 (0.064)

Mean        0.823 (0.089)     0.858 (0.083)     0.920 (0.060)
Median      0.834 (0.087)     0.877 (0.083)     0.924 (0.064)
SD          0.095 (0.016)     0.065 (0.015)     0.055 (0.022)

Table 2: Results from a single 10-fold CV run. AUC quantities are given for each iteration, with standard deviations in parentheses. The AUC standard deviations were
computed using Hanley-McNeil's formula [6].

Replication  Fold   Physicians        Simple 1R         Extended 1R
1            1      0.795 (0.045)     0.844 (0.040)     0.911 (0.031)
1            2      0.844 (0.037)     0.840 (0.037)     0.919 (0.027)
2            1      0.813 (0.040)     0.826 (0.039)     0.893 (0.032)
2            2      0.819 (0.042)     0.838 (0.040)     0.925 (0.028)
3            1      0.755 (0.045)     0.823 (0.040)     0.898 (0.031)
3            2      0.881 (0.035)     0.868 (0.037)     0.924 (0.028)
4            1      0.794 (0.042)     0.826 (0.039)     0.899 (0.031)
4            2      0.839 (0.040)     0.865 (0.037)     0.930 (0.027)
5            1      0.791 (0.043)     0.833 (0.039)     0.899 (0.031)
5            2      0.845 (0.038)     0.821 (0.041)     0.901 (0.031)

Mean                0.818 (0.041)     0.838 (0.039)     0.910 (0.030)
Median              0.816 (0.041)     0.835 (0.039)     0.906 (0.031)
SD                  0.036 (0.003)     0.017 (0.001)     0.013 (0.002)

Table 3: Results from 5 different 2-fold CV runs, each with a different seed to the random number generator. See Table 2 for legend.

[Figure 1 appears here: an ROC plot with the false positive rate on the x-axis and the true positive rate on the y-axis.]
Figure 1: Reference ROC curves obtained when applying the classifiers to the full set
of 257 patients from which they were constructed. The middle solid line represents
the simple 1R classifier, while the top dotted line represents the extended 1R classifier.
The physicians are represented by the bottom dashed line. The AUC values and their
Hanley-McNeil standard deviations [6] of the three classifiers are 0.817 (0.029) for the
physicians, 0.859 (0.026) for the simple 1R classifier, and 0.924 (0.019) for the extended
1R classifier.
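
For reference, the Hanley-McNeil standard deviations quoted above can be reproduced from an AUC value and the sizes of the two outcome groups alone. The sketch below implements the published formula [6]; the group sizes of roughly 98 diseased and 159 non-diseased patients are an assumption derived from the 0.381 prevalence among the 257 cases.

```python
from math import sqrt

def hanley_mcneil_se(auc, n_pos, n_neg):
    """Standard error of an AUC estimate (Hanley & McNeil, 1982)."""
    q1 = auc / (2 - auc)            # Pr(two diseased cases both outrank one non-diseased case)
    q2 = 2 * auc ** 2 / (1 + auc)   # Pr(one diseased case outranks two non-diseased cases)
    var = (auc * (1 - auc)
           + (n_pos - 1) * (q1 - auc ** 2)
           + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg)
    return sqrt(var)

print(hanley_mcneil_se(0.924, 98, 159))   # roughly 0.019, cf. the extended 1R value above
```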

5 Analysis
In order to draw any trustworthy conclusions from the results in Section 4,
a statistical analysis has been performed. The standard tool for comparing
correlated AUC values is Hanley-McNeil's method [7]. However, this method
is usually employed for a single two-way split only and not in a CV setting. In
a CV setting one might very well ask what the models to assess really are. One
could of course perform the Hanley-McNeil test for each fold, but it is unclear
how to combine the collection of obtained p-values. Furthermore, one might
question the usefulness of this approach altogether, since if many folds are
used the resulting test sets might be rather small. Computing the AUC based
on only a few cases means that the resulting estimates will have a very high
degree of variability, something which in turn means that the Hanley-McNeil
test will almost certainly not detect any significant differences that might be
present. As an extreme example consider leave-one-out CV, where the test
set would consist of a single case. Then the standard deviation of the AUC
estimate would not even be defined.
The results in Table 3 have been analyzed using the method of Hanley and
McNeil on a per-fold basis. For 2-fold CV, the test sets may be large enough
for such an analysis to be viable. Table 4 contains the p-values per fold per
replication. Considering the median p-value rather than the mean in order
to be more robust to outliers, there is no significant difference between the
physicians and the simple 1R classifier (p < 0.585). However, the extended
1R classifier is significantly better than both the physicians (p < 0.026) and the
simple 1R classifier (p < 0.018).
There are statistical analysis methods that have been specifically designed for
combining CV together with detection of differences in performance. One such
method is the 5x2CV test, originally proposed by Dietterich [3] for comparing
error rates. An improved version of the 5x2CV test due to Alpaydin [1] is described in Appendix B. Applying the improved 5x2CV test to the results in
Table 3 again yields the same conclusions. There is no significant difference between the physicians and the simple 1R classifier (p < 0.613), but the extended
1R classifier is significantly better than both the physicians (p < 0.03) and the
simple 1R classifier (p < 0.0003).

6 Discussion
This study has focused on discrimination only, and has not touched upon the
issue of calibration. A binary classifier is said to be well calibrated if the intermediate value π(x) closely mimics the probability Pr(d = 1 | x). Calibration is one
of the issues discussed in [5], and is a very important feature if the classifier is to
be used in an interactive decision-support setting. Preliminary investigations

Replication  Fold   Physicians vs.    Simple 1R vs.     Extended 1R vs.
                    Simple 1R         Extended 1R       Physicians
1            1      0.3043            0.0323            0.0127
1            2      0.9171            0.0076            0.0559
2            1      0.7798            0.0289            0.0558
2            2      0.6946            0.0057            0.0153
3            1      0.1634            0.0064            0.0016
3            2      0.7558            0.0756            0.2573
4            1      0.4699            0.0084            0.0112
4            2      0.5733            0.0305            0.0322
5            1      0.4098            0.0250            0.0192
5            2      0.5963            0.0114            0.1746

Mean                0.5664            0.0232            0.0636
Median              0.5848            0.0182            0.0257
SD                  0.2325            0.0214            0.0846

Table 4: Pairwise statistical analysis of the results in Table 3. All p-values are 2-sided
and computed using the Hanley-McNeil method for comparing correlated AUC values [7].

suggest that 1R classifiers with standard voting do not exhibit good calibration. However, this might not matter much since in principle most models can
be transformed to obtain acceptable calibration while retaining their discriminatory abilities.
In Section 1, it was argued that performing a large number of unnecessary
operations was preferable to missing any cases of acute appendicitis. This corresponds to prioritizing test sensitivity over test specificity, and means selecting a threshold that determines a point on the ROC curve that is close
to (1, 1). Selection of classifiers and thresholds under various cost scenarios is
discussed in [11].
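
As a toy illustration of such a choice (a minimal sketch, not the procedure of [11]), one can pick the threshold that minimizes the expected misclassification cost under assumed costs and prevalence; all numbers below are made up for the example.

```python
import numpy as np

def best_threshold(certainty, outcome, cost_fn=10.0, cost_fp=1.0, prevalence=0.381):
    """Threshold on the certainty that minimizes expected cost per patient, given the
    cost of a missed appendicitis (cost_fn) and of an unnecessary operation (cost_fp)."""
    certainty, outcome = np.asarray(certainty, dtype=float), np.asarray(outcome)
    best = (np.inf, None)
    for t in np.unique(certainty):
        sens = (certainty[outcome == 1] >= t).mean()
        spec = (certainty[outcome == 0] < t).mean()
        cost = cost_fn * (1 - sens) * prevalence + cost_fp * (1 - spec) * (1 - prevalence)
        best = min(best, (cost, float(t)))
    return best   # (expected cost, chosen threshold)

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
print(best_threshold(scores, labels))
```

With a high cost assigned to missed cases, the chosen operating point lands toward the high-sensitivity end of the curve, as argued above.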
A point that is often made in favor of inducing rule-based classifiers is the potential for knowledge discovery, since classification rules are a representation that
can be inspected and interpreted by non-experts. Although this is true to a
certain degree, it is extremely rare to see scientific papers where
the rules are used for anything other than black-box classification. 1R rules may
not be very interesting for knowledge discovery since they do not relate any
attributes together in their if-part, but the computational effort to induce them
is negligible and the resulting set of rules is often quite small and manageable. Moreover, simulations by Holte [8] showed that the best individual 1R
rules were usually able to come within a few percentage points of the error
rate that more complex models can achieve, on a spread of common benchmark domains. The present study suggests that this might be true for other
performance measures, too.
The already small 1R models can probably be reduced even more without sacrificing the discriminatory performance, by filtering away those rules that deal
with less important attributes. The results from [2] and the attribute selection
done in [4] seem to support this conjecture.
The equal frequency binning technique used for discretizing the numerical attributes was chosen for the sake of simplicity. Other more advanced discretization techniques could potentially yield slightly better results. For example, preliminary experiments suggest that the AUC score for the extended 1R model
can be increased by about one percentage point using the automatic algorithm
outlined in [9]. However, achieving optimal results using 1R models was not
the main objective of this study.
At first glance it may seem a bit odd that it is easier to detect a significant
difference between the two 1R classifiers than between the extended 1R classifier and the physicians, since the average extended 1R performance is closer
to the average simple 1R performance than to the average performance by the
physicians. This can be attributed to the fact that the estimates made by the
physicians display greater variance than the 1R estimates, which are thus more
easily separable.
On a technical note, it should be stated that the 5x2CV test outlined in Appendix B makes a simplifying assumption when it comes to applying the test
to AUC performance measures. The σ² in Equation 7 is really σ²_ij, i.e., not constant across all i, j. The value of σ²_ij can be computed using the lookup table
given by Hanley and McNeil [7] together with some additional correlation information. However, the simplification that σ²_ij ≈ σ² for all i, j may not be all
that bad, and hopefully the errors that this simplification introduces may even
cancel each other out.

7 Conclusions
Based on the results in Section 4 and the analysis in Section 5, the answers to
the two main questions raised in Section 1 are:
1. Based only upon readily available clinical attributes, does a computer
model perform better than a team of physicians at diagnosing acute appendicitis? No, not significantly, at least not with a set of very simple 1R
classification rules as the computer model.
2. Does a computer model based upon both clinical attributes and biochemical attributes perform better than a model based only upon the clinical
attributes? Yes, even with a set of very simple 1R classification rules as the
computer model there is a significant improvement when biochemical attributes
are additionally taken into account.


Although not directly comparable, it hardly seems likely that the results reported in the literature and repeated in Section 4 based on logistic regression [5, 4] or rough set models [2] are significantly different from the 1R results
reported in this study. Hence, based on the principle of parsimony, a collection
of very simple 1R classification rules seems like a good rule-based candidate for
diagnosing acute appendicitis as measured by the area under the ROC curve,
all other things being equal.

Acknowledgments
Thanks to Stein Hallan for sharing the appendicitis data, and to Tor-Kristian
Jenssen and Ulf Carlin for comments and stimulating discussions. Ulf Carlin
also supplied some valuable background material. This work was supported
in part by grant 74467/410 from the Norwegian Research Council.

References
[1] E. Alpaydin. Combined 5x2CV F test for comparing supervised classification learning algorithms. Research Report 98-04, IDIAP, Martigny, Switzerland, May 1998.
[2] U. Carlin, J. Komorowski, and A. Øhrn. Rough set analysis of patients with suspected acute appendicitis. In Proc. Seventh Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU'98), pages 1528-1533, Paris, France, July 1998. EDK, Éditions Médicales et Scientifiques.
[3] T. G. Dietterich. Statistical tests for comparing supervised classification learning algorithms. Technical report, Oregon State University, Department of Computer Science, Oct. 1996.
[4] S. Hallan, A. Åsberg, and T.-H. Edna. Additional value of biochemical tests in suspected acute appendicitis. European Journal of Surgery, 163(7):533-538, July 1997.
[5] S. Hallan, A. Åsberg, and T.-H. Edna. Estimating the probability of acute appendicitis using clinical criteria of a structured record sheet: The physician against the computer. European Journal of Surgery, 163(6):427-432, June 1997.
[6] J. A. Hanley and B. J. McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143:29-36, Apr. 1982.
[7] J. A. Hanley and B. J. McNeil. A method of comparing the areas under receiver operating characteristic curves derived from the same cases. Radiology, 148:839-843, Sept. 1983.
[8] R. C. Holte. Very simple classification rules perform well on most commonly used datasets. Machine Learning, 11(1):63-91, Apr. 1993.
[9] H. S. Nguyen and A. Skowron. Quantization of real-valued attributes. In Proc. Second International Joint Conference on Information Sciences, pages 34-37, Wrightsville Beach, NC, Sept. 1995.
[10] A. Øhrn, J. Komorowski, A. Skowron, and P. Synak. The design and implementation of a knowledge discovery toolkit based on rough sets: The ROSETTA system. In L. Polkowski and A. Skowron, editors, Rough Sets in Knowledge Discovery 1: Methodology and Applications, number 18 in Studies in Fuzziness and Soft Computing, chapter 19, pages 376-399. Physica-Verlag, Heidelberg, Germany, 1998.
[11] F. Provost and T. Fawcett. Analysis and visualization of classifier performance: Comparison under imprecise class and cost distributions. In Proc. Third International Conference on Knowledge Discovery and Data Mining, pages 43-48, Huntington Beach, CA, 1997. AAAI Press.

A Voting
Let R denote an unordered set of classification rules. In the setting of a binary outcome
classification where R is to realize the certainty function π, the standard voting process
goes as follows when presented with a given case x to classify:

1. The set R is scanned for rules that fire, i.e., rules that have an if-part that matches
x. Let R′ ⊆ R denote the set of firing rules.

2. If R′ = ∅, then no rules apply. Typically, a fallback output π(x) = Pr(d = 1) is
then invoked.
3. An election process among the rules in R′ is performed in order to resolve possible conflicts. The election process is performed as follows:

   (a) Let each rule r ∈ R′ cast a number of votes votes(r) in favor of the classification the rule indicates. The number of votes each rule gets to cast may
   vary. Typically, votes(r) equals the support of the rule.

   (b) Compute a normalization factor norm(x). Typically, the normalization factor is computed by summing up all the cast votes.

   (c) Divide the accumulated number of votes for classification (d = 1) by the
   normalization factor norm(x) in order to arrive at a certainty coefficient
   π(x).

$$R'_i = \{\, r \in R' \mid r \text{ predicts } (d = i) \,\} \qquad (3)$$

$$\mathrm{votes}((d = i)) = \sum_{r \in R'_i} \mathrm{votes}(r) \qquad (4)$$

$$\mathrm{norm}(x) = \sum_{i} \sum_{r \in R'_i} \mathrm{votes}(r) \qquad (5)$$

$$\pi(x) = \mathrm{votes}((d = 1)) \,/\, \mathrm{norm}(x) \qquad (6)$$

It is worth analyzing the special case that arises when R comprises the set of all 1R
rules (as defined by Equation 1) and voting based on support (as defined in Section 3)
is used. Let the quantity votes(r) equal the number of objects that match both the if-part
and the then-part of rule r simultaneously. Note that if r ∈ R′, then the if-part of r
reads (a = a(x)). We then have the following relationship, where all probabilities are
as estimated from the set U of training cases:

$$
\begin{aligned}
\pi(x) &= \mathrm{votes}((d = 1)) \,/\, \mathrm{norm}(x) \\
&\propto \mathrm{votes}((d = 1)) \\
&= \sum_{r \in R'_1} \mathrm{votes}(r) \\
&= \sum_{a \in A} \mathrm{support}((a = a(x)) \text{ and } (d = 1)) \\
&= \sum_{a \in A} \mathrm{accuracy}(\text{if } (a = a(x)) \text{ then } (d = 1)) \cdot \mathrm{support}((a = a(x))) \\
&= \sum_{a \in A} \Pr((d = 1) \mid (a = a(x))) \cdot \Pr((a = a(x))) \cdot |U| \\
&\propto \sum_{a \in A} \Pr((d = 1) \mid (a = a(x))) \cdot \Pr((a = a(x))) \\
&= \sum_{a \in A} \Pr((a = a(x)) \text{ and } (d = 1))
\end{aligned}
$$

Hence, using 1R rules and voting based on support, the computed certainty coefficient
π(x) is proportional to the summed estimated probabilities of each of the observed attribute values of x occurring together with pattern (d = 1).

B The 5x2CV F-test

The 5x2CV F-test, proposed by Alpaydin [1] as a robust improvement to a test proposed by Dietterich [3], can be used to quantitatively compare the performance of two
classifiers. As its name implies, the test is based on performing five replications of 2-fold
CV.
Let $p_{ij}$ denote the difference between the performance measures of the two classifiers
on fold $j \in \{1, 2\}$ of replication $i \in \{1, \ldots, 5\}$. The average difference in performance
on replication $i$ is $\bar{p}_i = (p_{i1} + p_{i2})/2$ and the estimated variance is $s_i^2 = (p_{i1} - \bar{p}_i)^2 + (p_{i2} - \bar{p}_i)^2$.

Under the null hypothesis that the two classifiers perform equally well, $p_{ij}$ can be
treated as following a Normal distribution with mean 0 and unknown variance $\sigma^2$.
Hence, $p_{ij}/\sigma$ follows a standard Normal distribution. Then, $p_{ij}^2/\sigma^2$ is $\chi^2$ distributed
with 1 degree of freedom, and $N$ defined below follows a $\chi^2$ distribution with 10 degrees of freedom.

$$N = \sum_{i=1}^{5} \sum_{j=1}^{2} \frac{p_{ij}^2}{\sigma^2} \sim \chi^2_{10} \qquad (7)$$

If $p_{i1}$ and $p_{i2}$ are independent and follow Normal distributions, then $s_i^2/\sigma^2$ is $\chi^2$ distributed with 1 degree of freedom. Hence, $M$ defined below is $\chi^2$ distributed with 5
degrees of freedom.

$$M = \sum_{i=1}^{5} \frac{s_i^2}{\sigma^2} \sim \chi^2_5 \qquad (8)$$

If $N \sim \chi^2_{10}$ and $M \sim \chi^2_5$, then $N/10$ divided by $M/5$ will follow an $F$ distribution with
10 and 5 degrees of freedom.

$$f = \frac{N/10}{M/5} = \frac{\sum_{i=1}^{5} \sum_{j=1}^{2} p_{ij}^2}{2 \sum_{i=1}^{5} s_i^2} \sim F_{10,5} \qquad (9)$$

We then reject the null hypothesis that the two classifiers perform equally well if the
statistic $f$ is sufficiently large. For 95% confidence, $f = 4.74$.
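
Computing the statistic in Equation 9 from the per-fold differences is straightforward; the sketch below uses SciPy only for the tail probability of the F(10, 5) distribution, and the input differences were read off the extended 1R and physician columns of Table 3.

```python
import numpy as np
from scipy.stats import f as f_dist

def five_by_two_cv_f_test(p):
    """Combined 5x2CV F test (Alpaydin). p is a 5x2 array of per-fold
    performance differences between the two classifiers."""
    p = np.asarray(p, dtype=float)
    p_bar = p.mean(axis=1, keepdims=True)
    s2 = ((p - p_bar) ** 2).sum(axis=1)          # per-replication variance estimates
    f_stat = (p ** 2).sum() / (2 * s2.sum())     # Equation 9
    return f_stat, f_dist.sf(f_stat, 10, 5)      # statistic and upper-tail p-value

# AUC differences, extended 1R minus physicians, from Table 3.
diffs = [[0.116, 0.075], [0.080, 0.106], [0.143, 0.043], [0.105, 0.091], [0.108, 0.056]]
print(five_by_two_cv_f_test(diffs))
```

On these differences the statistic comes out at roughly 6.1, which is in line with the p < 0.03 reported for this comparison in Section 5.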
