
Statistical Science

2003, Vol. 18, No. 1, 104–117


© Institute of Mathematical Statistics, 2003

Class Prediction by Nearest Shrunken Centroids, with Applications to DNA Microarrays
Robert Tibshirani, Trevor Hastie, Balasubramanian Narasimhan and Gilbert Chu

Abstract. We propose a new method for class prediction in DNA microarray studies based on an enhancement of the nearest prototype classifier. Our technique uses “shrunken” centroids as prototypes for each class to identify the subsets of the genes that best characterize each class. The method is general and can be applied to other high-dimensional classification problems. The method is illustrated on data from two gene expression studies: lymphoma and cancer cell lines.

Key words and phrases: Sample classification, gene expression arrays.

Robert Tibshirani is Professor, Departments of Health Research and Policy, and Statistics, Stanford University, Stanford, California 94305-5405 (e-mail: tibs@stat.stanford.edu). Trevor Hastie is Professor, Departments of Statistics, and Health Research and Policy, Stanford University, Stanford, California 94305-5405 (e-mail: [email protected]). Balasubramanian Narasimhan is Senior Research Associate, Departments of Statistics, and Health Research and Policy, Stanford University, Stanford, California 94305-5405 (e-mail: [email protected]). Gilbert Chu is Professor, Departments of Biochemistry, and Medical Oncology, Stanford University, Stanford, California 94305-5151 (e-mail: [email protected]).

1. INTRODUCTION

Class prediction with high-dimensional features is an important problem and has recently received a great deal of attention in the context of DNA microarrays. The task is to classify and predict the diagnostic category of a sample, based on its gene expression profile. Recent proposals for this problem include Golub et al. (1999), Hedenfalk et al. (2001), Hastie, Tibshirani, Botstein and Brown (2001) and the artificial neural network approach in Khan et al. (2001).

The microarray problem is a unique and challenging classification task because there are a large number of inputs (genes) from which to predict classes and a relatively small number of samples. It is especially important to identify which genes contribute toward the classification. This can aid in biological understanding of the disease process and is also important in development of clinical tests for early diagnosis. In this article we propose a simple approach to the problem that performs well and is easy to understand and interpret.

As an example, we consider data from Alizadeh et al. (2000), which is available from the authors’ web site. These data consist of expression measurements on 4,026 genes from samples of 59 lymphoma patients. The samples are classified into diffuse large B-cell lymphoma (DLCL), follicular lymphoma (FL) and chronic lymphocytic leukemia (CLL). We selected a random subset of 20 samples and set them aside as a test set; the remaining 39 samples formed the training set.

We began with a nearest centroid classification. Figure 1 (light grey bars) shows the training-set centroids (average expression of each gene) within each of the three classes. The overall average expression of the corresponding gene has been subtracted, so that these values are differences from the overall centroid.

To apply the nearest centroid classification, we take the gene expression profile of the test sample and compute its squared distance from each of the three class centroids. The predicted class is the one whose centroid is closest to the expression profile of the test sample. This procedure makes zero errors on the 20 test samples, but has the major drawback that it uses all 4,026 genes.
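To make the baseline procedure concrete, here is a minimal sketch of the plain nearest centroid rule just described: class centroids are averaged from the training samples and a test profile is assigned to the class whose centroid is closest in squared distance. The function names, array shapes and the toy data are our own illustrative choices, not part of the authors' software.

```python
import numpy as np

def nearest_centroid_fit(X, y):
    """Compute one centroid per class from training data.

    X : (n_samples, n_genes) expression matrix
    y : (n_samples,) integer class labels
    """
    classes = np.unique(y)
    centroids = np.vstack([X[y == k].mean(axis=0) for k in classes])
    return classes, centroids

def nearest_centroid_predict(X_test, classes, centroids):
    """Assign each test sample to the class whose centroid is closest
    in squared Euclidean distance."""
    # dists[i, k] = squared distance from sample i to centroid k
    dists = ((X_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[np.argmin(dists, axis=1)]

# Tiny synthetic example (2 classes, 5 genes)
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = np.repeat([0, 1], 10)
X[y == 1, :2] += 2.0          # shift class 1 on the first two genes
classes, cents = nearest_centroid_fit(X, y)
print(nearest_centroid_predict(X, classes, cents))
```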


FIG. 1. Centroids (grey) and shrunken centroids (red) for the lymphoma/leukemia data set. Each centroid has the overall centroid subtracted; hence, what we see are contrasts. The horizontal units are log ratios of expression. Going from left to right, the number of training samples is 27, 5 and 7. The order of the genes is determined by hierarchical clustering.

We propose the “nearest shrunken centroid” method, which uses denoised versions of the centroids as prototypes for each class. The optimally shrunken centroids, derived using a method described below, are shown as red bars in Figure 1. Classification is then made to the nearest (shrunken) centroid. The resulting procedure has zero test errors. In addition, only 81 genes have a nonzero red bar for one or more classes in Figure 1 and, hence, are the only ones that contribute toward the classification. The amount of shrinkage is determined by cross-validation.

In the preceding example, the (unshrunken) nearest centroid method had the same error rate as the nearest shrunken centroid procedure. This is not always the case. Table 1 shows results taken from Tibshirani, Hastie, Narasimhan and Chu (2002) on classification of small round blue cell tumors. The data are taken from Khan et al. (2001).

TABLE 1
Results on classification of small round blue cell tumors

Method                              Test error rate   Number of genes used
Nearest centroid                    4/25              2,308
Nearest shrunken centroids          0/25              43
Neural network                      0/25              96
Regularized discriminant analysis   0/25              2,308

There are 25 test samples and 2,308 genes. The neural network and regularized discriminant analysis methods used in the table are described in Section 7.
We gave a brief description of the nearest shrunken
centroid method in Tibshirani, Hastie, Narasimhan and
Chu (2002), focussing on the biological findings from
two different applications. Here we give a broader and
more thorough statistical treatment.
In Section 2 we describe the basic method. We detail our procedure for adaptive choice of thresholds in Section 3. Additional issues and comparisons are discussed in Sections 4–8, including application of the method to capturing heterogeneity with an “abnormal” class compared to a control class, in Section 6. Finally we conclude with a brief discussion in Section 9.

2. NEAREST SHRUNKEN CENTROIDS

2.1 Details of the Proposal

Let xij be the expression for genes i = 1, 2, . . . , p and samples j = 1, 2, . . . , n. Each sample belongs to one of K classes 1, 2, . . . , K. Let Ck be indices of the nk samples in class k. The ith component of the centroid for class k is x̄ik = Σ_{j∈Ck} xij/nk, the mean expression in class k for gene i; the ith component of the overall centroid is x̄i = Σ_{j=1}^{n} xij/n.

Our idea is to shrink the class centroids toward the overall centroid. However, we first normalize by the within-class standard deviation for each gene. Let

(1)  d_{ik} = \frac{\bar{x}_{ik} - \bar{x}_i}{m_k \cdot s_i},

where si is the pooled within-class standard deviation for gene i,

(2)  s_i^2 = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{j \in C_k} (x_{ij} - \bar{x}_{ik})^2,

and mk = √(1/nk − 1/n) makes the denominator in Equation (1) equal to the estimated standard error of the numerator. Thus dik is a t-statistic for gene i, comparing class k to the average class. (In fact, we also add a regularization parameter s0 to the values si; see Section 9.) We can write

(3)  \bar{x}_{ik} = \bar{x}_i + m_k s_i d_{ik}.

Our proposal shrinks each dik toward zero, giving d′ik and new shrunken centroids or prototypes

(4)  \bar{x}'_{ik} = \bar{x}_i + m_k s_i d'_{ik}.

The shrinkage we use is called soft thresholding: the absolute value of each dik is reduced by an amount Δ and is set to zero if the result is less than zero. Algebraically, this is expressed as

(5)  d'_{ik} = \operatorname{sign}(d_{ik})\,(|d_{ik}| - \Delta)_+,

where the subscript plus means positive part (t+ = t if t > 0 and zero otherwise). This is shown in Figure 2. Since many of the x̄ik will be noisy and close to the overall mean x̄i, soft thresholding usually produces “better” (more reliable) estimates of the true means (Donoho and Johnstone, 1994). The proposed method has the attractive property that many of the components (genes) are eliminated as far as class prediction is concerned if the shrinkage parameter Δ is large enough. Specifically, if Δ causes d′ik to shrink to zero for all classes k, then the centroid for gene i is x̄i, the same for all classes. Thus gene i does not contribute to the nearest centroid computation. We chose Δ by cross-validation, as illustrated below.

FIG. 2. Soft threshold function.
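The following sketch translates Equations (1)–(5) directly into code: standardized differences dik are computed, soft-thresholded by Δ, and used to rebuild shrunken class centroids. It is an illustration under our own naming conventions (it also folds in the s0 offset mentioned parenthetically above and described in Section 9); it is not the PAM implementation itself.

```python
import numpy as np

def shrunken_centroids(X, y, delta):
    """Nearest-shrunken-centroid prototypes, following Equations (1)-(5).

    X : (n, p) expression matrix (samples x genes), y : (n,) labels.
    Returns class labels, shrunken centroids (K x p), and d'_ik (K x p).
    """
    classes = np.unique(y)
    n, p = X.shape
    K = len(classes)
    overall = X.mean(axis=0)                                           # overall centroid
    centroids = np.vstack([X[y == k].mean(axis=0) for k in classes])   # class centroids
    nk = np.array([(y == k).sum() for k in classes])

    # Pooled within-class standard deviation, Equation (2)
    within_ss = sum(((X[y == k] - centroids[j]) ** 2).sum(axis=0)
                    for j, k in enumerate(classes))
    s = np.sqrt(within_ss / (n - K))
    s0 = np.median(s)                        # regularization constant (Section 9)
    m = np.sqrt(1.0 / nk - 1.0 / n)          # m_k

    # Standardized differences, Equation (1), with s0 added to s_i
    d = (centroids - overall) / (m[:, None] * (s + s0))

    # Soft thresholding, Equation (5)
    d_shrunk = np.sign(d) * np.maximum(np.abs(d) - delta, 0.0)

    # Shrunken centroids, Equation (4); genes with d'_ik = 0 for all k
    # revert to the overall centroid and drop out of the classifier
    shrunk_centroids = overall + m[:, None] * (s + s0) * d_shrunk
    return classes, shrunk_centroids, d_shrunk
```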
Note that the standardization by si in (1) has the effect of giving higher weight to genes that have stable expression within samples of the same class. This same standardization is inherent in other common statistical methods, such as linear discriminant analysis (see Section 7).

The top panel of Figure 3 shows the training, 10-fold cross-validation and test errors as the shrinkage parameter Δ is varied. The top of the plot indicates the number of genes retained (for the training data) at that particular threshold. The left end of the figure represents no shrinkage, while the right end represents complete shrinkage. The test error is minimized near Δ = 0.463; when the curve is flat near the minimum, we typically chose the largest value of Δ (smallest number of genes) that achieves the minimal error. The upper axis shows the number of active genes with at least one nonzero component d′ik, as Δ is varied. At Δ = 0.463 there are about 3,666 active genes. The numbers of genes with nonzero d′ik in each class are (3200, 2497, 3133).

FIG. 3. Lymphoma/leukemia data: training error (tr, blue), cross-validation error (cv, green) and test error (te, red) as the threshold parameter Δ is varied. In the top panel, the default soft threshold scaling is used: a solution with Δ = 0.463 and 3,666 genes is chosen. In the bottom panel, adaptive threshold scaling was used; the value Δ = 4.01 is chosen, resulting in a subset of just 81 genes, with the same test error rate as in the top panel.

Note that the selection of genes for a given value of Δ is carried out separately for each of the 10 cross-validation trials. This is important to avoid selection bias and an unrealistically optimistic cross-validation error rate. As pointed out by Ambroise and McLachlan (2002), a number of authors have made the mistake of selecting genes based on all of the training data (expression values and classes) and then subjecting only the selected genes to cross-validation. This can produce a wildly optimistic estimate for misclassification error: it is easy to simulate two-class examples in which the class labels are independent of the expression values (test error = 50%), but cross-validation after selection reports an error of zero.
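The selection-bias point can be reproduced with a few lines of simulation. The hedged sketch below builds pure-noise two-class data, then contrasts a cross-validation loop that selects genes once on all the data with one that repeats the selection inside every fold; the helper functions and fold layout are ours, and a plain nearest centroid rule stands in for the full shrunken-centroid fit.

```python
import numpy as np

def top_genes(X, y, num):
    """Rank genes by an absolute two-class t-like statistic; keep the best `num`."""
    m0, m1 = X[y == 0].mean(0), X[y == 1].mean(0)
    s = X.std(0) + 1e-8
    return np.argsort(-np.abs(m0 - m1) / s)[:num]

def centroid_error(Xtr, ytr, Xte, yte):
    """Error of a plain two-class nearest centroid rule."""
    c0, c1 = Xtr[ytr == 0].mean(0), Xtr[ytr == 1].mean(0)
    pred = (((Xte - c1) ** 2).sum(1) < ((Xte - c0) ** 2).sum(1)).astype(int)
    return (pred != yte).mean()

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 1000))        # labels independent of expression
y = np.repeat([0, 1], 20)
folds = np.tile(np.arange(5), 8)       # 5-fold assignment

# Wrong: select genes once on all data, then cross-validate
g_all = top_genes(X, y, 20)
biased = np.mean([centroid_error(X[folds != f][:, g_all], y[folds != f],
                                 X[folds == f][:, g_all], y[folds == f])
                  for f in range(5)])

# Right: redo the gene selection inside every fold
honest_errs = []
for f in range(5):
    tr, te = folds != f, folds == f
    g = top_genes(X[tr], y[tr], 20)    # selection redone within the fold
    honest_errs.append(centroid_error(X[tr][:, g], y[tr], X[te][:, g], y[te]))
honest = np.mean(honest_errs)

# The biased estimate is usually far below 0.5; the honest one hovers near 0.5
print(biased, honest)
```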
Formula (1) takes into account the size of each class and effectively applies a larger threshold to a smaller (higher variance) class. Even after this adjustment, some classes may be farther away than others from the overall centroid and, hence, may be easier to distinguish. In this case, many of the nonzero genes for that class may not be needed for accurate classification. Thus we might try to vary the class thresholds to minimize the total number of nonzero genes needed to achieve a given error rate. The details of how we do this are discussed in Section 3. In this case the procedure increased the thresholds for the first and third classes, and was very successful: as shown in the bottom panel of Figure 3, it reduced the number of genes to just 81 without increasing the test error.

2.2 Finding the Predictors that Matter

Figure 4 shows the shrunken differences d′ik for the 81 genes that have at least one nonzero difference. Figure 5 shows the heat map of the chosen 81 genes. Within each of the horizontal partitions, we have ordered the genes by hierarchical clustering, and similarly for the samples within each vertical partition. Clear separation of the classes is evident. The top set of genes characterizes CLL with some genes overexpressed and others underexpressed. Similarly the middle set of genes characterizes FL. The genes in the bottom set of the figure are overexpressed in DLCL, and underexpressed in FL and CLL.

FIG. 4. Shrunken differences d′ik for the 81 genes that have at least one nonzero difference.

2.3 The Log-Likelihood

It is quite common to have a small number of samples in each class, especially when the number of classes is large. This can result in a cross-validation curve that has discrete jumps and high variability. To help with this problem, we can use the mean cross-validated log-likelihood rather than misclassification error. Since our model produces class probability estimates [see Equation (8) in Section 2.4], the log-likelihood of a test sample x∗ with class label y∗ is log p̂y∗(x∗). The mean log-likelihood curve is typically smoother than the misclassification error curve.

Figure 6 shows the test set log-likelihood and misclassification error curves for the lymphoma data. (This is for illustration only; we are not suggesting use of the test error to select Δ.) They give a similar picture, although the choice of the smallest model where the log-likelihood starts to dip yields more genes than that from the misclassification error curve. In the next section we make use of the log-likelihood in estimation of class probabilities.

2.4 Class Probabilities and Discriminant Functions

We classify test samples to the closest shrunken centroid, again standardizing by si. We also make a correction for the relative abundance of members of each class. Details are given next.

Suppose we have a test sample (vector) with expression levels x∗ = (x1∗, x2∗, . . . , xp∗). We define the discriminant score for class k as

(6)  \delta_k(x^*) = \sum_{i=1}^{p} \frac{(x_i^* - \bar{x}'_{ik})^2}{s_i^2} - 2 \log \pi_k.

The first part of (6) is simply the standardized squared distance of x∗ to the kth shrunken centroid. The second part is a correction based on the class prior probability πk, where Σ_{k=1}^{K} πk = 1. This prior gives the overall proportion of class k in the population.

The classification rule is then

(7)  C(x^*) = \ell \quad \text{if } \delta_\ell(x^*) = \min_k \delta_k(x^*).

If the smallest distances are close and hence ambiguous, the prior correction gives a preference for larger classes, since they potentially account for more errors. We usually estimate the πk by the sample priors π̂k = nk/n. If the sample prior is not representative of the population, then more realistic priors or even uniform priors πk = 1/K can instead be used. We can use the discriminant scores to construct estimates of the class probabilities by analogy to Gaussian linear discriminant analysis:

(8)  \hat{p}_k(x^*) = \frac{e^{-\delta_k(x^*)/2}}{\sum_{\ell=1}^{K} e^{-\delta_\ell(x^*)/2}}.

The left panel of Figure 7 displays these probabilities for the lymphoma data. For illustration, we used the largest value of Δ (= 4.41) that minimizes the test error in the bottom panel of Figure 3, rather than the cross-validation-minimizing value of 4.01 used earlier. The value Δ = 4.41 yields 48 genes. We derived the probabilities using the centroids that were defined by applying this value of Δ to the test set.

In Figure 6, the value Δ = 4.04 gives exactly the same test error (in fact, the same class predictions) as Δ = 4.41, but gives a higher log-likelihood value. The estimated probabilities resulting from Δ = 4.04 are shown in the right panel of Figure 7. These probabilities are more extreme than those in the left panel. The rightmost probabilities are preferred, since they produce a higher log-likelihood score.

FIG. 5. Heat map of the chosen 81 genes. Within each of the horizontal partitions, we have ordered the genes by hierarchical clustering, and similarly for the samples within each vertical partition. The data for all 59 samples are shown.
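A compact sketch of Equations (6)–(8) follows: the discriminant score, the classification rule and the class-probability estimates. The array shapes and function names are assumptions made for illustration; the same probabilities are what feed the log-likelihood criterion of Section 2.3.

```python
import numpy as np

def discriminant_scores(X_test, shrunk_centroids, s, priors):
    """Equation (6): standardized squared distance to each shrunken centroid,
    minus 2*log(prior).  Shapes: X_test (m, p), shrunk_centroids (K, p),
    s (p,) pooled within-class SDs, priors (K,)."""
    d2 = (((X_test[:, None, :] - shrunk_centroids[None, :, :]) / s) ** 2).sum(axis=2)
    return d2 - 2.0 * np.log(priors)

def classify(scores, classes):
    """Equation (7): pick the class with the smallest discriminant score."""
    return classes[np.argmin(scores, axis=1)]

def class_probabilities(scores):
    """Equation (8): softmax of -scores/2, computed stably."""
    z = -0.5 * scores
    z = z - z.max(axis=1, keepdims=True)      # guard against overflow
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)
```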

FIG. 6. Test set mean log-likelihood curve (red) and test set misclassification error curve (green). The latter has been translated so that it fits in the same plotting region. The broken line shows where the log-likelihood curve starts to dip, while the dotted line shows where the misclassification error starts to rise.

FIG. 7. Estimated test set probabilities using the 48 gene model from minimizing misclassification error (left) and the 78 gene model from maximizing the log-likelihood (right). Probabilities are partitioned by the true class. There are no classification errors in the test set.

3. ADAPTIVE CHOICE OF THRESHOLDS

In this section we describe the procedure for adaptive threshold choice in the nearest shrunken centroid method. We define a scaling vector (θ1, θ2, . . . , θK) and initially set θk = 1 for all k. These scalings are included in the denominator of expression (1), that is,

(9)  d_{ik} = \frac{\bar{x}_{ik} - \bar{x}_i}{m_k \theta_k \cdot s_i}.

We scale the values so that min_j(θj) = 1: values greater than 1 mean that a larger threshold is effectively used for class k.

We applied the following procedure:

1. Find the class k with the largest number of training errors averaged over the grid of Δ values used.
2. Decrease θk by 10% and then rescale all θj so that min_j(θj) = 1.
3. Repeat the above steps for a number of iterations (here 10) and find the solution that gives the lowest average error, among the values of (θ1, θ2, . . . , θK) visited.

Note that this procedure is based entirely on the training set and does not use information from cross-validation or a test set. It is admittedly heuristic, but does produce useful results in practice.
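Read as pseudocode, the three-step heuristic above might look as follows. The `class_errors` callable, which must return per-class training errors averaged over the grid of Δ values for a given scaling vector, is assumed rather than shown, since it simply wraps the shrunken-centroid fit of Section 2.

```python
import numpy as np

def adapt_thresholds(class_errors, K, n_iter=10, step=0.9):
    """Heuristic of Section 3.  `class_errors(theta)` is an assumed callable
    returning a length-K array of training errors per class, averaged over
    the grid of threshold values, for scaling vector theta.
    Returns the best theta visited."""
    theta = np.ones(K)
    best_theta, best_total = theta.copy(), np.inf
    for _ in range(n_iter):
        errs = class_errors(theta)
        if errs.sum() < best_total:          # keep the lowest-average-error solution
            best_total, best_theta = errs.sum(), theta.copy()
        k = int(np.argmax(errs))             # class with the most training errors
        theta = theta.copy()
        theta[k] *= step                     # decrease its theta by 10%
        theta /= theta.min()                 # rescale so min_j(theta_j) = 1
    return best_theta
```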
For the lymphoma data, we obtained the solution (θ1, θ2, θ3) = (1.88, 1.00, 1.52), which is the value we used to produce Figures 1 and 4. Most of the errors in the original solution occurred in class FL; the new thresholds are larger for classes DLCL and CLL, and hence many fewer genes are used to discriminate these classes. Remarkably, the total number of genes used has decreased from 3,666 to 81 without raising the test error.

To test this procedure further, we simulated some data consisting of 10 samples in each of four classes and 1,000 genes. We ran two different simulations, with the results shown in the top and bottom panels of Figure 8. For a concise description, let r(a, n) represent the number a repeated n times.

FIG. 8. Simulated data: mean ± 1 standard deviation of the test error over five simulations, for default (equal) thresholds (red) and adaptive thresholds (green). In the setup for the top panel, the class centroids are unevenly spaced; in the bottom panel, the within-class variances are unequal.

FIG. 9. Centroids for each of four classes for the two simulation scenarios. The standard deviations for each class are indicated at the top of the plot.

All expression values were independent Gaussian with variance 1. In the first simulation, the class centroids were [r(3, 500), r(0.4, 500)], [r(0.5, 100), r(0, 900)], [r(0, 100), r(0.5, 100), r(0, 800)] and [r(0, 100), r(0, 100), r(0.5, 100), r(0, 700)]. The centroids are shown in the top panel of Figure 9. Thus the first class is far from the others, in the space spanned by the first 500 genes. The top panel of Figure 8 shows the mean ± 1 standard deviation of the test error over five simulations. The methods used were default (equal) thresholds (red) and adaptive thresholds (green). The average values of the adaptive threshold were 2.0, 1.0, 1.0 and 1.0. The adaptive threshold method generally has lower test error.

In the second simulation, the means in the four classes were [r(0.5, 300), r(0, 700)], [r(0.5, 150), r(−0.5, 150), r(0, 700)], [r(−0.5, 150), r(0.5, 150), r(0, 700)] and [r(−0.5, 150), r(−0.5, 150), r(0, 700)]. The centroids are shown in the bottom panel of Figure 9. The standard deviations in each class were 2, 1.5, 1.5 and 1.0. Thus each class centroid is equidistant from the overall centroid (the origin), but the within-class standard deviations are different. The bottom of Figure 8 shows the results: again the adaptive threshold does better in terms of test error; the average values of the adaptive threshold were 1.4, 1.1, 1.2 and 1.0. With equal thresholds, the majority of nonzero genes were in class 1: under the adaptive thresholds, the distribution was more balanced.

4. SOFT VERSUS HARD THRESHOLDING

An alternative to the soft thresholding (5) would be to keep all differences greater in absolute value than Δ and discard the others; that is,

(10)  d'_{ik} = d_{ik} \cdot I(|d_{ik}| > \Delta).

This is sometimes known as hard thresholding. It differs from soft thresholding in that differences greater than Δ are unchanged, rather than shrunken toward zero by the amount Δ. One drawback of hard thresholding is its “jumpy” nature: as the threshold Δ is increased, a gene with a full contribution dik suddenly is set to zero.
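The two rules are easy to compare side by side. The short sketch below, with names of our own choosing, implements Equation (5) and Equation (10) and prints their effect on a small grid of values.

```python
import numpy as np

def soft_threshold(d, delta):
    """Equation (5): shrink toward zero by delta, truncating at zero."""
    return np.sign(d) * np.maximum(np.abs(d) - delta, 0.0)

def hard_threshold(d, delta):
    """Equation (10): keep d unchanged if |d| exceeds delta, else set it to zero."""
    return d * (np.abs(d) > delta)

d = np.linspace(-3, 3, 7)
print(soft_threshold(d, 1.0))   # values move continuously toward zero
print(hard_threshold(d, 1.0))   # values jump between zero and their full size
```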
To investigate the relative behavior of hard versus soft thresholding, we generated standard normal expression data for 1,000 genes and 40 samples, with 20 samples in each of two classes. For the first 100 genes, we added a random effect µi ∼ N(0, 0.5²) to each expression level in class 2 for each gene i. Hence 100 of the 1,000 genes are differentially expressed in the two classes by varying amounts. This experiment was repeated 10 times and the results were averaged. The left panel of Figure 10 shows the test error for hard and soft thresholding, as the threshold Δ is varied, while the right panel displays the mean squared error Σi(µ̂i − µi)²/p, where µ̂i = Σ_{j=1}^{20} xij/20 − Σ_{j=21}^{40} xij/20. In the left panel, we see that soft thresholding yields lower test error at its minimum; the right panel shows that soft thresholding does a much better job of estimating the gene expression differences.

FIG. 10. Simulated data in two classes. Left: Test misclassification error as the threshold Δ is varied, using hard thresholding (h) and soft thresholding (s). Right: The estimation error Σi(µ̂i − µi)²/p, where µi and µ̂i are the true and estimated difference in expression between class 1 and class 2 for gene i. Results are averages over 10 simulations: standard error of the average is about 0.015 in the left panel and 0.01 in the right panel.

5. NATIONAL CANCER INSTITUTE CANCER LINES AND SUBCLASS DISCOVERY

Here we describe how to use nearest centroid shrinkage to discover subclasses. We consider data from Ross et al. (2000) that consist of measurements on 6,830 genes on 61 cell lines. The samples have been categorized into eight different cancer classes: breast (BRE), CNS, colon (COL), leukemia (LEU), melanoma (MEL), non-small cell lung cancer (NSC), ovarian (OVA) and renal (REN). We randomly chose a training set of size 40 and a test set of size 21, so that the classes were well represented in both sets. Default (equal) soft thresholding was used, with the prior probabilities set to the sample class proportions. The results are shown in Figure 11. The best cross-validated error rate occurs at about 5,000 genes, giving a test error of 5/21. Adaptive thresholding failed to improve this result.

FIG. 11. NCI cancer cell lines: training, cross-validation and test error curves.

We also tried both support vector machines (Ramaswamy et al., 2001) and regularized discriminant analysis (Section 7). Both gave five errors on the test set. However, neither method gave a simple picture of the data.

It may be valuable biologically to look for distinct subclasses of diseases in microarray analyses. Next we show a generalization of the nearest shrunken centroid procedure that facilitates the discovery of potentially important subclasses. Consider the problem illustrated in Figure 12. The values indicate average gene expression. There are two subclasses in class 2, and each of these can be distinguished from class 1 based on a small set of genes. However, nearest shrunken centroids will fail here, because the overall centroids for each class are the same. Linear separating classifiers, such as support vector machines (SVM), and linear discriminant analysis will also do poorly here. Either could be made to work with a suitable nonlinear transformation of the features (or choice of kernel for the SVM); while these may give low prediction error, they may not reveal the biologically important subclasses that are present.

FIG. 12. Two class problem with distinct subclasses. Numbers indicate the average gene expression.

TABLE 2
NCI subclass results: test errors (out of 21 samples) for the nearest shrunken centroid model with no subclasses (second column from left) and two subclasses per class (third column from left); the columns on the right show the resulting number of errors when a pair of subclasses for a given class is fused into one subclass

Number of genes   Zero subclasses   Two subclasses   Fusing subclasses for each class
                                                     BRE  CNS  COL  LEU  MEL  NSC  OVA  REN
6830              5                 6                6    6    6    6    6    7    5    8
6827              5                 6                5    5    7    6    6    7    6    6
6122              5                 5                6    5    5    5    5    5    5    5
3571              7                 6                8    7    6    6    6    6    7    6
1695              9                 6                8    7    6    6    6    6    7    6
696               9                 7                9    6    7    7    7    7    8    8
293               9                 6                8    7    7    6    6    7    7    6
119               10                6                8    8    8    6    8    7    7    8
42                10                12               13   14   14   12   12   12   12   12
17                14                14               14   14   16   14   14   14   14   13

For any class, our idea is to apply r-means clustering to the samples in that class, resulting in r subclasses for that class. Doing this for each of the K classes results in a total of K · r subclasses. We apply nearest shrunken centroids to this r · K class problem. If the predicted class from this large problem is h, then our final predicted class is the class k that contains subclass h.
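A hedged sketch of this subclass construction is given below: a small k-means routine (written inline so the example is self-contained) splits each class into r clusters, the samples are relabeled with r·K subclass labels, and a mapping back to the parent classes is kept so that a predicted subclass h can be converted into a final class prediction. All names and the tiny clustering routine are our own; any k-means implementation could be substituted.

```python
import numpy as np

def r_means(X, r, n_iter=20, seed=0):
    """Tiny Lloyd's algorithm: split the rows of X into r clusters."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), r, replace=False)]
    for _ in range(n_iter):
        labels = ((X[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(r):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(0)
    return labels

def make_subclasses(X, y, r=2):
    """Relabel each class into r subclasses; remember the parent of each subclass."""
    sub_y = np.empty_like(y)
    parent = {}
    next_id = 0
    for k in np.unique(y):
        idx = np.where(y == k)[0]
        labels = r_means(X[idx], r)
        for j in range(r):
            parent[next_id] = k
            sub_y[idx[labels == j]] = next_id
            next_id += 1
    return sub_y, parent

# After fitting nearest shrunken centroids on (X, sub_y), a predicted
# subclass h is mapped back to its parent class with parent[h].
```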
With typical sample sizes, the choice r = 2 will be most reasonable. Table 2 shows the results on the National Cancer Institute (NCI) data. Without subclasses, the test error rates start to rise when fewer than 2,000 or 3,000 genes are used. Using subclasses, we achieve about the same error rate with as few as 119 genes. The right part of the table shows that for 119 genes the subclasses are most important for BRE, CNS, COL, MEL and REN. The 119 gene solution is displayed in Figure 13 and shows some distinct subclasses among some of the main classes.

6. CAPTURING HETEROGENEITY

In discriminating an “abnormal” from a “normal” group, the average gene expression may not differ between the groups. However, the variability in expression may be greater in the abnormal group, due to heterogeneity in the abnormal population. This is illustrated in Figure 14. Nearest centroid classification will not work in this case, since the class centroids are not separated. The subclass method of the previous section might help: we propose an alternative approach here.

We define new features x′ij = |xij − m̄i|, where m̄i is the mean expression for gene i in the normal group. Then we apply nearest shrunken centroids to the new features x′ij.
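The transformation itself is a one-liner; the sketch below (with assumed argument names) computes x′ij = |xij − m̄i| from the normal-group means, after which the usual nearest-shrunken-centroid machinery is applied unchanged.

```python
import numpy as np

def heterogeneity_features(X, y, normal_label):
    """Replace each expression value by its absolute deviation from the
    normal-group mean of that gene: x'_ij = |x_ij - m_bar_i|."""
    m_bar = X[y == normal_label].mean(axis=0)   # gene-wise mean in the normal group
    return np.abs(X - m_bar)

# The transformed matrix is then fed to the usual nearest-shrunken-centroid fit.
```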
To illustrate this, we generated the expression of 1,000 genes in 40 samples—20 from a normal group and 20 from an abnormal group.

FIG. 13. NCI subclass results. Shown are pairs of centroids for each class for the genes that survived the thresholding.

All expression values were generated independently as standard Gaussian except for the first 200 genes in the abnormal group, which had mean zero, but standard deviation 2. An independent test set of size 200 was also generated. Nearest centroid shrinkage on the transformed features x′ij showed a test error rate of near zero, with 150 or more nonzero genes. Figure 15 compares the results of nearest shrunken centroids on the raw expression values xij and the transformed expression values x′ij. Nearest centroid shrinkage on the raw values does poorly with an error rate greater than 40%, while use of the transformed values reduces the error rate to near zero.

FIG. 14. Illustration of heterogeneity in gene expression. Abnormal group A has the same average gene expression as the normal group N, but shows larger variability.

By transforming to the distance from the normal centroid, the use of the features x′ij might also provide discrimination in situations where the abnormal class is not heterogeneous, but is instead mean-shifted. The right panel of Figure 15 investigates this. The expression of the first 200 genes in the abnormal class has mean 0.5 and standard deviation 1 (versus 0 and 1 for the normal class). Now nearest shrunken centroids on the raw features is much more powerful, while use of the transformed features works poorly. We conclude that use of neither the raw nor transformed features dominates the other, and both should be tried on a given problem.

We have successfully used the heterogeneity model to predict toxicity from radiation sensitivity using transcriptional responses to DNA damage in lymphoid cells (Rieger et al., 2003).

FIG. 15. Left: Test error for data simulated from the heterogeneous two-class problem, using nearest shrunken centroids on raw expression values (red) and transformed expression values |xij − m̄i| (blue). Right: Same as in the left panel, but data are simulated from the mean-shifted homogeneous two-class problem.

7. RELATIONSHIP TO OTHER APPROACHES

The discriminant scores (6) are similar to those used in linear discriminant analysis (LDA), which arise from using the Mahalanobis metric to compute the distance to centroids:

(11)  \delta_k^{\mathrm{LDA}}(x^*) = (x^* - \bar{x}_k)^T W^{-1} (x^* - \bar{x}_k) - 2 \log \pi_k.

Here we use vector notation and W is the pooled within-class covariance matrix. With thousands of genes and tens of samples (p ≫ n), W is huge and any sample estimate will be singular (and hence its inverse is undefined). Our scores can be seen to be a heavily restricted form of LDA, necessary to cope with the large number of variables (genes). The differences are the following:

• We assume a diagonal within-class covariance matrix for W; without this, LDA would be ill-conditioned and fail.
• We use shrunken centroids rather than centroids as a prototype for each class.
• As the shrinkage parameter Δ increases, an increasing number of genes will have all their d′ik = 0, k = 1, . . . , K, due to the soft thresholding in (5). Such genes contribute no discriminatory information in (6), and in fact cancel in Equation (8).

Both our scores (6) and the LDA scores (11) are linear in xi∗. If we expand the square in (6), discard the terms involving xi∗² (since they are independent of the class index k and hence do not contribute toward class discrimination) and multiply by −1/2, we get

(12)  \tilde{\delta}_k(x^*) = \sum_{i=1}^{p} \frac{x_i^* \bar{x}'_{ik}}{s_i^2} - \frac{1}{2} \sum_{i=1}^{p} \frac{\bar{x}'^{\,2}_{ik}}{s_i^2} + \log \pi_k,

which is linear in xi∗. Because of the sign change, our rule classifies to the largest δ̃k(x∗). Likewise the LDA discriminant scores have the equivalent linear form

(13)  \tilde{\delta}_k^{\mathrm{LDA}}(x^*) = x^{*T} W^{-1} \bar{x}_k - \tfrac{1}{2} \bar{x}_k^T W^{-1} \bar{x}_k + \log \pi_k.

Regularized discriminant analysis (RDA; Friedman, 1989) leaves the centroids alone and modifies the covariance matrix in a different way,

(14)  \delta_k^{\mathrm{RDA}}(x^*) = (x^* - \bar{x}_k)^T (W + \lambda I)^{-1} (x^* - \bar{x}_k),

where λ is a parameter (like our Δ). The fattened W + λI is nonsingular, and as λ gets large, this procedure approaches the nearest centroid procedure (with no variance scaling or centroid shrinking). A slightly modified version uses W + λD, where D = diag(s₁², s₂², . . . , s_p²). As λ gets large, this approaches the variance weighted nearest centroid procedure. In practice, we normalize this regularized covariance by dividing by 1 + λ, leading to the convex combination (1 − α)W + αD, where α = λ/(1 + λ). Although the relative distances do not change, this is important when making the adjustment for the class priors.
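As a rough illustration of this family of scores, the sketch below computes discriminant scores with the normalized covariance (1 − α)W + αD discussed above; the function signature, the inclusion of the prior correction and the use of a direct matrix inverse are our own simplifications, not Friedman's (1989) implementation.

```python
import numpy as np

def rda_scores(X_test, centroids, W, s2, alpha, priors):
    """Discriminant scores with a regularized covariance
    (1 - alpha) * W + alpha * diag(s2), in the spirit of Equation (14).
    X_test (m, p), centroids (K, p), W (p, p) pooled covariance,
    s2 (p,) gene-wise variances, priors (K,)."""
    reg = (1.0 - alpha) * W + alpha * np.diag(s2)
    reg_inv = np.linalg.inv(reg)                   # nonsingular for alpha > 0 and positive s2
    diffs = X_test[:, None, :] - centroids[None, :, :]      # (m, K, p)
    scores = np.einsum('mkp,pq,mkq->mk', diffs, reg_inv, diffs)
    return scores - 2.0 * np.log(priors)
```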
Although RDA shows some promise, it is more complicated than our nearest shrunken centroid procedure. Furthermore, in the process of its regularization, it does not select a subset of genes as the shrunken centroid procedure does. We are considering other hybrid approaches of RDA and nearest centroids in ongoing research projects.

8. NEAREST CENTROID CLASSIFIER VERSUS LDA

As discussed in the previous section, the nearest centroid classifier is equivalent to Fisher's linear discriminant analysis if we restrict the within-class covariance matrix to be diagonal. When is this restriction a good one?

FIG. 16. Simulation results: bias and variance (top panels) and mean-squared error and misclassification error (bottom panels) for linear discriminant analysis and the nearest centroid classifier. Details of the simulation are given in the text. The nearest centroid classifier outperforms LDA because of its smaller variance.

Consider a two class microarray problem with p genes and n samples. For simplicity we consider the standard (unshrunken) nearest centroid classifier and standard (full within covariance) LDA. The recent thesis of Levina (2002) did some theoretical comparisons of these methods. She assumed p → ∞, n → ∞ and p/n → γ ∈ (0, 1), and analyzed the worst case error of each method. The relative performance of the two methods depends on the correlation structure of the features (samples). Her results show that if p is a large fraction of n, for a large class of correlation structures, nearest centroid classification outperforms full LDA.

Now in our problem, usually we have p ≫ n: in that case, LDA is not even defined without some regularization. Hence to proceed we assume that p is a little less than n and hope that what we learn will extend to the case p > n. Let xj be a p-vector of gene expression values in class j. Suppose x1 ∼ N(0, Σ) and x2 ∼ N(µ, Σ), where Σ is a full (nondiagonal) matrix. Then LDA uses the maximum likelihood unbiased estimate of Σ⁻¹µ, while nearest centroid uses a biased estimate. However, the LDA method estimates Σ⁻¹µ in a multivariate manner, and hence will tend to have higher variance. What is the resulting bias–variance tradeoff and how does it translate into misclassification error?

We did an experiment with p = 30 and n = 40, with 20 samples in each of two classes. We set the ijth element of Σ to ρ^|i−j|, where ρ was varied from 0 to 0.8. Each of the components of the mean vector µ was set to ±1 at random: such a mixed vector is needed to give full LDA a potential advantage over LDA with a diagonal covariance. For each simulation, an independent test set of size 500 was also generated. The results of 100 simulations from this model are shown in Figure 16. Bias, variance and mean-squared error refer to estimation of Σ⁻¹µ. For small correlations, the underlying (diagonal covariance) model for nearest centroids is approximately correct and the method wins; LDA shows a small improvement in bias for larger correlations, but this is more than offset by the increased variance. Overall the nearest centroid method has lower mean-squared error and test misclassification error in all cases.
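The experiment is simple to re-create in outline. The hedged sketch below draws two Gaussian classes with Σij = ρ^|i−j| and a mixed ±1 mean vector, then compares test error when the pooled covariance is used in full against the diagonal restriction that corresponds to the nearest centroid rule; the sample sizes, the value of ρ and the number of repetitions are illustrative, and the numbers will vary with the random seed.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n_per, rho = 30, 20, 0.4
Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))   # Sigma_ij = rho^|i-j|
mu = rng.choice([-1.0, 1.0], size=p)                                   # mixed +/-1 mean vector

def test_error(use_full_cov, n_test=500):
    # training data for the two classes
    X1 = rng.multivariate_normal(np.zeros(p), Sigma, n_per)
    X2 = rng.multivariate_normal(mu, Sigma, n_per)
    c0, c1 = X1.mean(0), X2.mean(0)
    pooled = np.cov(np.vstack([X1 - c0, X2 - c1]).T)
    W = pooled if use_full_cov else np.diag(np.diag(pooled))           # full LDA vs diagonal
    Winv = np.linalg.pinv(W)
    # independent test set
    Xt = np.vstack([rng.multivariate_normal(np.zeros(p), Sigma, n_test),
                    rng.multivariate_normal(mu, Sigma, n_test)])
    yt = np.repeat([0, 1], n_test)
    d0 = np.einsum('ip,pq,iq->i', Xt - c0, Winv, Xt - c0)
    d1 = np.einsum('ip,pq,iq->i', Xt - c1, Winv, Xt - c1)
    return ((d1 < d0).astype(int) != yt).mean()

print('full LDA  :', np.mean([test_error(True)  for _ in range(20)]))
print('diag (NC) :', np.mean([test_error(False) for _ in range(20)]))
```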
Now for real microarray problems, p ≫ n, and both LDA and nearest centroid methods can be improved by appropriate regularization or shrinkage. We have not included regularization in the above comparison, but the above results suggest that the bias–variance tradeoff will cause the nearest centroid method to outperform full LDA.

9. DISCUSSION

The nearest shrunken centroid classifier is potentially useful in any high-dimensional classification problem. In addition to its application to gene expression arrays, it could also be applied to other kinds of emerging genomic data, including mass spectroscopy for protein measurements, tissue arrays and single nucleotide polymorphism arrays.

Our proposal can also be applied in conjunction with unsupervised methods. For example, it is now standard to use hierarchical clustering methods on expression arrays to discover clusters in the samples (Eisen, Spellman, Brown and Botstein, 1998). The methods described here can identify subsets of the genes that succinctly characterize each cluster.

Finally, we touch on computational issues. The computations involved in the nearest shrunken centroid method are straightforward. One important detail: in the denominator of the statistics dik in Equation (1) we add the same positive constant s0 to each of the si values. This guards against the possibility of large dik values arising by chance from genes at very low expression levels. We set s0 equal to the median value of the si over the set of genes. A similar strategy was used in the significance analysis of microarrays (SAM) methodology of Tusher, Tibshirani and Chu (2001).

We have developed a package in the Excel and R language called prediction analysis for microarrays. It implements all of the nearest shrunken centroids methodology discussed in this article and is available at the website http://www-stat.stanford.edu/∼tibs/PAM.

REFERENCES

Alizadeh, A. A., Eisen, M. B., Davis, R. E., Ma, C., Lossos, I. S., Rosenwald, A., Boldrick, J. C., Sabet, H., Tran, T., Yu, X., Powell, J. I., Yang, L., Marti, G. E., Moore, T., Hudson, Jr., J., Lu, L., Lewis, D. B., Tibshirani, R., Sherlock, G., Chan, W. C., Greiner, T. C., Weisenburger, D. D., Armitage, J. O., Warnke, R., Levy, R., Wilson, W., Grever, M. R., Byrd, J. C., Botstein, D., Brown, P. O. and Staudt, L. M. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403 503–511.

Ambroise, C. and McLachlan, G. (2002). Selection bias in gene extraction on the basis of microarray gene-expression data. Proc. Natl. Acad. Sci. U.S.A. 99 6562–6566.

Donoho, D. and Johnstone, I. (1994). Ideal spatial adaptation by wavelet shrinkage. Biometrika 81 425–455.

Eisen, M. B., Spellman, P. T., Brown, P. O. and Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. U.S.A. 95 14863–14868.

Friedman, J. (1989). Regularized discriminant analysis. J. Amer. Statist. Assoc. 84 165–175.

Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D. and Lander, E. S. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286 531–537.

Hastie, T., Tibshirani, R., Botstein, D. and Brown, P. (2001). Supervised harvesting of expression trees. Genome Biology 2(1) research/0003.

Hedenfalk, I., Duggan, D., Chen, Y., Radmacher, M., Bittner, M., Simon, R., Meltzer, P., Gusterson, B., Esteller, M., Raffeld, M., Yakhini, Z., Ben-Dor, A., Dougherty, E., Kononen, J., Bubendorf, L., Fehrle, W., Pittaluga, S., Gruvberger, S., Loman, N., Johannsson, O., Olsson, H., Wilfond, B., Sauter, G., Kallioniemi, O., Borg, A. and Trent, J. (2001). Gene-expression profiles in hereditary breast cancer. New England Journal of Medicine 344 539–548.

Khan, J., Wei, J., Ringner, M., Saal, L., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C., Peterson, C. and Meltzer, P. (2001). Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine 7 673–679.

Levina, E. (2002). Statistical issues in texture analysis. Ph.D. dissertation, Dept. Statistics, Univ. California, Berkeley.

Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J., Poggio, T., Gerald, W., Loda, M., Lander, E. and Golub, T. (2001). Multiclass cancer diagnosis using tumor gene expression signatures. Proc. Natl. Acad. Sci. U.S.A. 98 15149–15154.

Rieger, K., Hong, W., Tusher, V., Tang, J., Tibshirani, R. and Chu, G. (2003). Toxicity of radiation therapy associated with abnormal transcriptional responses to DNA damage. Submitted.

Ross, D., Scherf, U., Eisen, M., Perou, C., Rees, C., Spellman, P., Iyer, V., Jeffery, S., Van de Rijn, M., Waltham, M., Pergamenschikov, A., Lee, J., Lashkari, D., Shalon, D., Myers, T., Weinstein, J., Botstein, D. and Brown, P. (2000). Systematic variation in gene expression patterns in human cancer cell lines. Nature Genetics 24 227–235.

Tibshirani, R., Hastie, T., Narasimhan, B. and Chu, G. (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc. Natl. Acad. Sci. U.S.A. 99 6567–6572.

Tusher, V. G., Tibshirani, R. and Chu, G. (2001). Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. U.S.A. 98 5116–5121.
