
Expert Systems with Applications 34 (2008) 375–383
www.elsevier.com/locate/eswa

Two-stage classification methods for microarray data


Tzu-Tsung Wong *, Ching-Han Hsu
Institute of Information Management, National Cheng Kung University, 1 Ta-Sheuh Road, Tainan City 701, Taiwan, ROC
* Corresponding author. Tel.: +886 6 2757575x53722. E-mail address: [email protected] (T.-T. Wong).

Abstract

Gene expression data are a key factor for the success of medical diagnosis, and two-stage classification methods are therefore developed for processing microarray data. The first stage of this kind of classification method is to select a pre-specified number of genes, which are likely to be the most relevant to the occurrence of a disease, and to pass these genes to the second stage for classification. In this paper, we use four gene selection mechanisms and two classification tools to compose eight two-stage classification methods, and test these eight methods on eight microarray data sets to analyze their performance. The first interesting finding is that the genes chosen by different categories of gene selection mechanisms are less than half in common, yet they result in insignificantly different classification accuracies. A subset-gene-ranking mechanism can be beneficial to classification accuracy, but its computational effort is much heavier. Whether the classification tool employed at the second stage should be accompanied by a dimension reduction technique depends on the characteristics of a data set.
© 2006 Elsevier Ltd. All rights reserved.

Keywords: Dimension reduction; Gene selection; Microarray data; Two-stage classification method

1. Introduction

Gene expression data are a key factor for the success of medical diagnosis (Quackenbush, 2001), and classification methods are therefore developed for processing microarray data. In a microarray data set, an instance usually contains the expression data of several thousand genes, which are the features for identifying the occurrence of a specific disease. Even though the techniques for obtaining microarray data have been improved, the cost of acquiring such an instance is still high. Thus, the number of instances in a microarray data set is generally far less than the number of genes in an instance. Since most of the genes are irrelevant to the disease of interest, using all features in classification can slow down the performance and contaminate the results with a lot of noise. Most traditional classification tools are infeasible for processing data sets with few instances and a huge number of features. So, many studies have developed new classification methods for filtering genes that are critical to the occurrence of some disease.

A two-stage classification method first selects a pre-specified number of genes that is much smaller than the number of genes in an instance. The selected genes are then passed to the second stage for classification. There exists a wide variety of techniques for feature selection and classification, and most of these techniques have been studied deeply and thoroughly. Any gene selection mechanism and classification tool can be combined to form a two-stage classification method for microarray data. This demonstrates the flexibility and applicability of the two-stage classification method. In particular, biologists can adopt the tools available on hand to analyze microarray data without studying new classification tools specifically developed for microarray data. This study attempts to provide some insights and guidelines for designing such a two-stage classification method.

Like the way feature selection tools are categorized, there are two alternatives in designing the gene selection mechanism for the first stage: individual gene ranking or subset gene ranking (Lu & Han, 2003). The tools for classification


can be categorized in many ways. In this study, we need classification tools that are naturally suited to processing continuous features like gene expression data. Since the expression data of thousands of genes are likely to contain noise, a dimension reduction technique is an appropriate tool to filter such noise. We therefore would like to investigate whether a classification tool should be preceded by a dimension reduction technique in analyzing microarray data.

This paper is organized as follows. The literature relevant to classifying microarray data is reviewed in Section 2. Section 3 presents the possible designs of two-stage classification methods and introduces the eight two-stage classification methods that will be investigated in this study. In Section 4, we propose a procedure for evaluating the performance of the eight two-stage classification methods. This procedure is used in Section 5 to analyze the experimental results obtained from eight microarray data sets for cancer detection. The conclusions and directions for future study are addressed in Section 6.

2. Related works

Traditional classification methods are generally inapplicable to microarray data, which possess the special characteristics pointed out in Section 1. New classification methods have therefore been developed for processing microarray data in recent years, as summarized in Table 1. Some of the methods listed in Table 1 are brand new and specifically developed for processing microarray data. However, most of them modify well-known techniques and assemble them together to deal with microarray data. The latter approach can produce a method that is easier to learn and apply for analyzing microarray data.

A microarray instance usually contains thousands of genes or features, and most of them are irrelevant to the disease of interest. A feature selection mechanism can not only improve the classification efficiency, but also remove the interference caused by the irrelevant genes. Experimental results also showed that a feature selection mechanism could increase the accuracy of most classification methods (Dudoit, Fridlyand, & Speed, 2002; Golub et al., 1999).

Lu and Han (2003) divided gene selection mechanisms into two categories: individual gene ranking and gene subset ranking. Individual-gene-ranking mechanisms calculate the correlation between each gene and the class value and select the genes whose correlations are larger than a pre-defined threshold. An example of this approach is the feature selection mechanism adopted by Antoniadis, Lambert-Lacroix, and Leblanc (2003). This approach is usually simpler and more efficient, but it may exclude genes that are important for disease detection only when they work together. It may also select discerning genes that carry redundant information for classification. Gene-subset-ranking mechanisms remove the gene with the smallest impact on disease detection one by one to find a group of genes that serve together to achieve the best classification result. The gene selection method proposed by Li, Weinberg, Darden, and Pedersen (2001) is an example of this approach. This approach can find a discerning gene subset, but its computational complexity is high.

There are many ways to categorize classification methods, such as whether the learning scheme is lazy or whether it is probability-based. Without considering the noise contained in the available data, any classification tool can be applied to microarray data directly; Li et al. (2001) is an example of this case. Some studies attempted to remove the noise before classifying, such as the method presented by Jörnsten and Yu (2003).

Table 1
Recent classification methods for microarray data
Source Tools
Friedman et al. (2000) Bayesian network
Li et al. (2001) Genetic algorithm and k-nearest neighbors
Khan et al. (2001) Artificial neural network
Zhang et al. (2001) Binary decision tree
Nguyen and Rocke (2002) t statistics, partial least squares, and logistic discrimination analysis
Albrecht et al. (2003) Perceptron and simulated annealing
Antoniadis et al. (2003) Wilcoxon score, minimum average variance estimation, and logistic discrimination analysis
Jörnsten and Yu (2003) Between-to-within-class sum of squares, Rissanen’s minimum description length, and linear discrimination analysis
Lee et al. (2003) Bayesian model
Desper et al. (2004) Phylogenetic method
Simek et al. (2004) Singular value decomposition and support vector machine
Asyali and Alci (2005) Fuzzy c-means and normal mixture model
Georgii et al. (2005) Quantitative association rule
Qiu et al. (2005) Ensemble dependence model
Martella (2006) Factor mixture models
Wu (2006) Penalized linear regression model

3. Structure of two-stage classification methods

A two-stage classification method includes a gene selection mechanism at the first stage and a classification tool that predicts the class of a new instance based on the genes chosen at the first stage. Since the number of genes in a microarray instance is generally more than 1000, and most of the genes cannot provide useful information for classification, a gene selection mechanism is necessary for processing microarray data.

As pointed out in Section 2, the mechanism for gene selection can be either individual gene ranking or subset gene ranking. Since gene expression data are generally contaminated by noise, a technique that can filter the noise contained in the data may enhance the prediction accuracy of a classification tool. Thus, in this study we divide classification tools into two categories based on whether they include some means of dimension reduction. The goal of dimension reduction is to transform the information contained in the original data into another form that can be represented by a smaller number of new features. An obvious benefit of dimension reduction is the improvement in the speed of learning. It can sometimes also remove noise from the data to increase prediction accuracy. At the same time, we might worry that useful information for prediction may be lost after the data transformation. Thus, it should be of interest to know whether a classification tool preceded by a dimension reduction technique will improve the prediction accuracy in analyzing microarray data.

In summary, we have two options in designing the mechanism for gene selection at the first stage, and we can pick a classification method with or without a dimension reduction procedure for the second stage. So, there are four possible ways to compose a two-stage classification method for microarray data, as shown in Fig. 1. The following two subsections introduce four gene selection mechanisms and two classification tools for designing two-stage classification methods. Based on these tools, we can compose eight two-stage classification methods for analyzing microarray data.

[Fig. 1 sketches the flow: microarray data → Stage 1, the gene selection mechanism (individual gene ranking or gene subset ranking) → Stage 2, the classification tool (with or without a dimension reduction procedure) → learning results.]
Fig. 1. Possible designs of two-stage classification methods for processing microarray data.

3.1. Gene selection mechanisms

A microarray instance i with n genes and one class value can be represented by (x_{i1}, x_{i2}, ..., x_{in}, y_i). Suppose that a microarray instance with class value 1 comes from an abnormal tissue, and that this class value is 0 if it comes from a normal tissue. Let N_k be the number of training instances with class k for k = 0, 1, and let \bar{x}_{jk} and s^2_{jk} be the mean and the variance of gene j calculated from the training instances with class k, respectively. According to Nguyen and Rocke (2002), the t value of gene j is

t_j = \frac{\bar{x}_{j0} - \bar{x}_{j1}}{\sqrt{s_{j0}^2/N_0 + s_{j1}^2/N_1}}.

The genes are then sorted in descending order according to their t values. If the number of genes for classification is set to r, the t-statistics mechanism selects the first r/2 and the last r/2 genes in the order.
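As an illustration, this mechanism can be sketched in a few lines of Python. This is our own minimal sketch, not the authors' code; it assumes `X` is an instances-by-genes NumPy array and `y` a 0/1 class vector.

```python
import numpy as np

def select_by_t(X, y, r):
    """Mechanism T: keep the r/2 genes with the largest t values
    and the r/2 genes with the smallest (most negative) ones."""
    X0, X1 = X[y == 0], X[y == 1]
    n0, n1 = len(X0), len(X1)
    t = (X0.mean(axis=0) - X1.mean(axis=0)) / np.sqrt(
        X0.var(axis=0, ddof=1) / n0 + X1.var(axis=0, ddof=1) / n1)
    order = np.argsort(-t)               # descending t values
    return np.concatenate([order[:r // 2], order[-(r // 2):]])
```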
Define the BW ratio (the ratio of the between-group sum of squares to the within-group sum of squares) of gene j as follows:

BW_j = \frac{\sum_i \sum_k I(y_i = k)(\bar{x}_{jk} - \bar{x}_j)^2}{\sum_i \sum_k I(y_i = k)(x_{ij} - \bar{x}_{jk})^2},

where \bar{x}_j is the global mean of gene j and I(·) is an indicator function whose value is 1 if the condition in I holds, and 0 otherwise. Dudoit et al. (2002) proposed that a gene with a larger BW ratio can interpret a larger proportion of the variance of the class. Thus, this mechanism orders the genes in descending order and chooses the first r genes for the next stage.
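A matching sketch for the BW mechanism, under the same assumptions about `X` and `y` (again our own illustration):

```python
import numpy as np

def select_by_bw(X, y, r):
    """Mechanism BW: keep the r genes with the largest ratios of
    between-group to within-group sums of squares."""
    grand = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for k in (0, 1):
        Xk = X[y == k]
        between += len(Xk) * (Xk.mean(axis=0) - grand) ** 2
        within += ((Xk - Xk.mean(axis=0)) ** 2).sum(axis=0)
    return np.argsort(-(between / within))[:r]
```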
Park, Pagano, and Bonetti (2001) designed a non-parametric mechanism for gene selection. They defined the score of gene j as follows:

S_j = \sum_{i \in D_0} \sum_{m \in D_1} I(x_{ij} - x_{mj}),

where D_k is the set of training instances with class k and I(x_{ij} - x_{mj}) = 1 if x_{ij} > x_{mj}, and 0 otherwise. If the value of S_j is larger than N_0 N_1/2, it is replaced by N_0 N_1 - S_j. Consider the expression values of gene j sorted in ascending order. In fact, S_j is the minimal number of interchanges on the values in the sorted order such that the instances with the same class are grouped together. A gene with score 0 implies that the training instances with different class values can be divided into two groups without performing any interchange on the values in the sorted order; hence its value is determinant in knowing the class value of a microarray instance. Thus, this mechanism chooses the r genes with the smallest score values.
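This score counts, for each gene, the class-0/class-1 pairs that are out of order; a brute-force sketch (ours, and quadratic in the number of instances) is:

```python
import numpy as np

def select_by_score(X, y, r):
    """Mechanism S: keep the r genes with the smallest folded
    pair-counting scores of Park et al. (2001)."""
    X0, X1 = X[y == 0], X[y == 1]
    n0, n1 = len(X0), len(X1)
    # S[j] = number of (class-0, class-1) pairs with x_ij > x_mj
    S = (X0[:, None, :] > X1[None, :, :]).sum(axis=(0, 1))
    S = np.minimum(S, n0 * n1 - S)       # fold values above n0*n1/2
    return np.argsort(S)[:r]
```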

The above three gene selection mechanisms, which compute the t statistics, the BW ratio, and the score value, will be denoted by mechanisms T, BW, and S, respectively. They all apply some measure to determine which genes are the most relevant to the class value; hence they are individual-gene-ranking mechanisms. Li et al. (2001) applied the genetic algorithm combined with the k-nearest neighbors for gene selection, which is a subset-gene-ranking mechanism. It will be denoted by GK.

The genetic algorithm has a set of initial solutions, called chromosomes, each constituted by a fixed number of randomly selected genes. A fitness function is then applied to evaluate the feasibility of the chromosomes. For each chromosome, every pair of instances has a Euclidean distance derived from the genes in the chromosome. If the k nearest neighbors of an instance all have the same class, this class will be the predicted class of the instance. Otherwise, the class of this instance is undetermined.

Let n_j be the number of correctly classified instances when we use the genes in chromosome j to find the nearest neighbors. Then n_j/(N_0 + N_1) is the fitness of chromosome j. A pre-specified number of chromosomes that have the highest fitness values will be passed on to generate the chromosomes for the next iteration. Mutation can occur in this process. A chromosome with a fitness larger than a pre-specified threshold is stored in a list. The genetic algorithm stops when the number of chromosomes in the list reaches a pre-specified number. The frequency of a gene is the number of chromosomes in the list containing the gene. All genes are sorted in descending order according to their frequencies, and the GK mechanism chooses the first r genes in the order.
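The core of the GK mechanism is the fitness evaluation; the sketch below shows one plausible reading of it (our own code; the selection, mutation, and restart logic of the full genetic algorithm follows Li et al. (2001) and is omitted here).

```python
import numpy as np

def chromosome_fitness(X, y, genes, k=3):
    """Fraction of training instances whose k nearest neighbors,
    measured on the chromosome's genes only, unanimously share the
    instance's own class; undetermined instances count as errors."""
    Z = X[:, genes]
    n_correct = 0
    for i in range(len(Z)):
        d = np.linalg.norm(Z - Z[i], axis=1)
        d[i] = np.inf                    # exclude the instance itself
        neighbors = np.argsort(d)[:k]
        if np.all(y[neighbors] == y[i]):
            n_correct += 1
    return n_correct / len(Z)
```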
3.2. Classification tools

Our interest in classification tools is to see whether a dimension reduction procedure involved in a classification tool can affect the prediction accuracy on microarray data. Most classification tools, such as naïve Bayesian classifiers and decision trees, favor processing discrete attributes. Since gene expression data are numeric, classification tools that favor continuous attributes are more appropriate for application to microarray data. We therefore pick the k-nearest neighbors and the logistic discrimination analysis combined with the partial least squares for dimension reduction as the classification tools in this study; they will be denoted by K and PL, respectively.

The k-nearest neighbors uses the genes selected at the first stage to find the k training instances that are the closest to a new instance. The majority class of the k training instances will be the predicted class of the new instance. It can be applied for classification regardless of the number of genes used for calculating the Euclidean distance between a pair of instances. However, when the number of genes is large, say more than 20, it will be inappropriate to apply the logistic discrimination analysis.
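A minimal sketch of the K tool, under the same `X`/`y` conventions as the earlier sketches (our own illustration):

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, genes, k=3):
    """Tool K: majority class among the k training instances
    closest to the new instance on the selected genes."""
    d = np.linalg.norm(X_train[:, genes] - x_new[genes], axis=1)
    votes = y_train[np.argsort(d)[:k]]
    return int(np.bincount(votes).argmax())
```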
Suppose that the class value can be either 0 or 1. Define p = P{y = 1 | d} for an instance, where d is a row vector derived from the expression values of the genes chosen at the first stage for the instance. The logistic discrimination analysis assumes that the values of ln(p/(1 − p)) for all instances can be fitted well by a regression line dβ, where β^T = (β_0, β_1, ..., β_q). Parameter β_i can be estimated by its maximum likelihood estimate \hat{β}_i for i = 0, 1, ..., q. Then, for any new instance, we first derive its d and calculate

\hat{p} = \frac{\exp(d\hat{\beta})}{1 + \exp(d\hat{\beta})}.

If \hat{p} > 1/2, the predicted class of the new instance is 1, and 0 otherwise. Note that in applying the logistic discrimination analysis, we need to calculate the values of \hat{β}_i for i = 0, 1, ..., q. As in linear regression, the number of independent variables should be as small as possible. Thus, a dimension reduction technique, such as the partial least squares, can transform the original r-dimensional space composed of the r genes chosen at the first stage into a smaller q-dimensional space to make the logistic discrimination analysis more applicable.

The partial least squares finds q column vectors, called PLS components, such that the covariance between the class values and the linear combination of the gene expression values given by a PLS component is maximized. These PLS components are used to transform the expression values of the r genes in an instance into q values that are plugged into the logistic regression model for classification. Nguyen and Rocke (2002) noted that PLS components usually result in a better performance than the components generated by principal component analysis.
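One way to assemble the PL tool from standard libraries is sketched below. This is our own composition using scikit-learn, not the authors' implementation; it treats the PLS scores as the inputs of an ordinary logistic regression.

```python
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import LogisticRegression

def fit_pl(X_train, y_train, genes, q=5):
    """Tool PL: project the r selected genes onto q PLS components,
    then fit a logistic model on the q-dimensional scores."""
    pls = PLSRegression(n_components=q).fit(X_train[:, genes], y_train)
    clf = LogisticRegression().fit(pls.transform(X_train[:, genes]), y_train)
    return pls, clf

def predict_pl(pls, clf, X_new, genes):
    """Predict class 1 when the fitted probability exceeds 1/2."""
    return clf.predict(pls.transform(X_new[:, genes]))
```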
4. Performance evaluation

A two-stage classification method composed of gene selection mechanism A and classification tool B will be denoted by A/B. For example, the methods proposed by Li et al. (2001) and Nguyen and Rocke (2002) can be represented by GK/K and T/PL, respectively. In this study, we test the two-stage classification methods composed of the four gene selection mechanisms and the two classification tools introduced in the previous section.

The testing results will be able to provide some guidelines for designing a two-stage classification method for microarray data. In this section, we present the evaluation method for the eight methods that will be tested on eight microarray data sets.

As mentioned at the beginning of this paper, the special characteristics of microarray data are: available instances are few, the number of features is large, and most features are irrelevant or noisy. The gene selection mechanism in a two-stage classification method can be a proper tool to deal with the last two characteristics of microarray data. The first characteristic may cause some problems in performance evaluation. When the number of available instances is small, the classification result of every instance will have a significant impact in evaluating the prediction accuracy of a classification method. Since not all genes are relevant to the occurrence of some disease, the number of genes selected at the first stage can also affect the classification accuracy. It is of interest to evaluate the accuracies when either the tools or the numbers of genes for classification are different.

Most classification tools use either the 3-fold or the leave-one-out cross-validation to evaluate the accuracy on a microarray data set, as summarized in Table 2. The 3-fold cross-validation randomly divides the instances in a data set into three folds. Each fold is in turn used for testing, and the other two folds are used for training. Thus, the 3-fold cross-validation produces three prediction accuracies for comparison. The leave-one-out cross-validation holds one instance out to test the learning results derived from the remaining instances. Every instance is in turn used for testing, and the accuracy is estimated by the number of correct classifications over the total number of instances.

Traditionally, we apply either the 10-fold or the 5-fold cross-validation on a data set to obtain 10 or 5 estimated classification accuracies in a repetition. Under proper control, the accuracies resulting from different classification methods are matched samples, and the paired t test can be used to identify whether their prediction accuracies are significantly different. The leave-one-out cross-validation can produce only one estimate of the accuracy on a data set. Since a data set is just a sample of the entire population, the leave-one-out cross-validation is inappropriate for comparing the performance of two classification methods. When the number of instances in a data set is less than 100, the number of instances in a fold for the 5-fold cross-validation will be less than 20 on average. This means that every instance can affect more than 5% of the accuracy of its fold. This is the reason why most researchers divided the available data into 3 folds instead of 5 folds in analyzing microarray data. Note that only one repetition of the 3-fold cross-validation will not generate enough classification accuracies for comparison. In this study, the 3-fold cross-validation for every data set will be repeated five times.

Let a_j and b_j be the accuracies of methods A and B, respectively, for data fold j, j = 1, 2, ..., 15. Set d_j = a_j − b_j for j = 1, 2, ..., 15 and \bar{d} = \sum_{j=1}^{15} d_j / 15. Then \bar{d} is an estimate of the accuracy difference δ between methods A and B, and the standard deviation of \bar{d} equals

s_{\bar{d}} = \sqrt{\sum_{j=1}^{15} (d_j - \bar{d})^2 / 210}.

By the paired t test, the accuracies of A and B will be significantly different if the p-value corresponding to the test statistic t = \bar{d}/s_{\bar{d}} is less than the significance level α.
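In code, this reduces to a standard paired t test with 14 degrees of freedom; the explicit form below (our own sketch) matches what `scipy.stats.ttest_rel` would return for the same 15 paired accuracies.

```python
import numpy as np
from scipy import stats

def paired_t_pvalue(acc_a, acc_b):
    """Paired t test on the 15 fold accuracies of methods A and B."""
    d = np.asarray(acc_a) - np.asarray(acc_b)
    d_bar = d.mean()
    s_dbar = np.sqrt(((d - d_bar) ** 2).sum() / (15 * 14))  # the /210 term
    t = d_bar / s_dbar
    return t, 2 * stats.t.sf(abs(t), df=14)  # two-sided p-value
```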
According to the sensitivity analysis performed by Li et al. (2001), the appropriate number of genes for classifying the lymphoma data should be between 50 and 200. In order to investigate the impact of the number of genes for classification, we pick either 50, 100, 150, or 200 genes at the first stage and pass them to the second stage. Let z_{ij} be the testing accuracy of data fold j when the number of genes for classification is 50·i, and let μ_i = (z_{i1} + z_{i2} + ... + z_{i,15})/15 for i = 1, 2, 3, 4. Based on the single-factor analysis of variance (ANOVA), if the p-value corresponding to the test statistic F is smaller than α, the μ_i are not all the same. This implies that the number of genes for classification does affect the accuracy. The value of α will be 0.10 for the significance tests performed in the remainder of this article.
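This single-factor ANOVA is available directly in SciPy; a short sketch (ours) of the decision rule:

```python
from scipy import stats

def gene_count_matters(z50, z100, z150, z200, alpha=0.10):
    """Each argument holds the 15 fold accuracies for one gene count;
    returns True when the four mean accuracies differ significantly."""
    f_stat, p_value = stats.f_oneway(z50, z100, z150, z200)
    return p_value < alpha
```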
Table 2
Methods for evaluating classification performance on microarray data

Source                    Evaluation method
Li et al. (2001)          Randomly select N instances for training
Nguyen and Rocke (2002)   Randomly select N instances for training
Antoniadis et al. (2003)  Leave-one-out cross-validation
Jörnsten and Yu (2003)    3-fold cross-validation
Lee et al. (2003)         Leave-one-out cross-validation
Albrecht et al. (2003)    Repeated 3-fold cross-validation with equal fold size
Desper et al. (2004)      Repeated 3-fold cross-validation with equal fold size
Simek et al. (2004)       Leave-one-out cross-validation

5. Experimental study

In this study, the gene selection mechanism applied at the first stage will be either T, BW, S, or GK, and the tool for classification employed at the second stage will be either K or PL. So, the number of two-stage classification methods investigated in this paper is eight. In this section, we introduce the characteristics of the eight microarray data sets, the data pre-processing procedures, and the parameter settings for the methods. Then the eight methods are tested on the eight data sets for performance evaluation.

5.1. Data pre-processing and parameter settings

The logistic discrimination analysis is a classification tool for data sets with only two possible class values. We therefore downloaded eight microarray data sets from the web site http://sdmc.lit.org.sg/GEDatasets/Datasets.html to test the eight two-stage classification methods.

Table 3
The characteristics of microarray data sets

Data set   No. of instances   No. of genes   Description
Breast     97     24,481   46 instances with class "relapse" and 51 instances with class "non-relapse"
CNS        60     7129     21 instances with class "0" and 39 instances with class "1"
Colon      62     2000     40 instances from tumor tissue and 22 from normal tissue
Leukemia   72     7129     47 instances with class "ALL" and 25 instances with class "AML"
Lung       181    12,533   31 instances with class "MPM" and 150 instances with class "ADCA"
Lymphoma   47     4026     24 instances with class "germinal" and 23 instances with class "activated"
Ovarian    253    15,154   91 instances from normal tissue and 162 instances from tumor tissue
Prostate   136    12,600   77 instances from tumor tissue and 59 instances from normal tissue

The characteristics of the eight data sets are summarized in Table 3, where CNS represents central nervous system. Note that each of the eight data sets contains at least two thousand genes and fewer than 260 instances. These characteristics make traditional classification methods, such as decision trees, infeasible for processing the eight data sets.

Since the data sets are donated by different organizations, their pre-processing procedures differ. The data in set "breast" are transformed by the logarithm function with base 10, the data in set "colon" are transformed by the logarithm function with base 2, and the data in set "ovarian" are normalized to be between 0 and 1. The data in sets "CNS", "lung", and "prostate" are standardized to be observations of a standard normal distribution. The data in sets "lymphoma" and "leukemia" are transformed by the logarithm function with base 2 and then standardized.
dardized. ance interpretability. We will pick the first q new features
In the genetic algorithm, one population is set to have in the order that can explain more than 90% of the data
100 chromosomes, and 10 populations are processed inde- covariance, and the value of q is as small as possible. Then
pendently. The chromosome with the highest fitness in the maximum likelihood estimates of the q + 1 parameters
each population will be transmitted to the next generation, in constructing a linear regression model for the logistic
and the other 99 chromosomes for the next generation are discrimination analysis are derived. In the eight data sets,
selected by sampling such that a chromosome with we found that q = 5 is a proper choice.
a higher fitness has a larger probability to be chosen. As
pointed out by Li et al. (2001), the results of the gene 5.2. Experimental results
selection are insensitive to the choice of the chromosome
length when this length is between 20 and 50. For the sake The number of genes selected at the first stage can affect
of computational efficiency, each chromosome is set to computational efficiency and prediction accuracy. We
have 20 genes. The mutation mechanism is the same as therefore first test whether the prediction accuracy are sig-
the one employed in Li et al. (2001). Once a chromosome nificantly different when the number of genes for classifica-
with a high fitness is found, it is saved as a high-fitness tion is either 50, 100, 150, or 200. Table 5 summarizes the
chromosome, and the genetic algorithm is restarted. This p-values of the ANOVA tests, where the bold values indi-
procedure repeats until 2000 high-fitness chromosomes cate that various numbers of genes for classification can
are found. However, the threshold for determining result in significantly different prediction accuracies. We
whether a chromosome has a high fitness depends on the can see from Table 5 that in most cases, the prediction
characteristics of a data set (Li et al., 2001). When this accuracy is insensitive to the number of genes for classifica-
threshold is too low, the prediction accuracy will decrease. tion when this value is either 50, 100, 150, or 200. The two
On the contrary, when this threshold is too high, the most significant p-values in the columns corresponding to
genetic algorithm may not be able to find enough high- T/PL and S/PL are due to significant small prediction
fitness chromosomes. Table 4 summarizes the fitness thresh- accuracies when the number of genes for classification is
olds for these data sets. 50. For the sake of computational efficiency, the number
For the k-nearest-neighbors classification, the value of k of genes for classification will be 100 in the remainder of
is set to be 3. A larger value of k will result a similar clas- this section.
sification accuracy and increase the computational effort Before studying the prediction capability of the two-
(Li et al., 2001). The majority class of the three neighbors stage classification methods, it should be of interest to

Table 5
The p-values of the ANOVA tests on the number of genes for classification
Data set T/K BW/K S/K GK/K T/PL BW/PL S/PL GK/PL
Breast 0.5041 0.8641 0.9880 0.6244 0.3986 0.4637 0.8918 0.8275
CNS 0.1755 0.1102 0.5363 0.8604 0.9878 0.9397 0.9566 0.7893
Colon 0.9617 0.8338 0.8286 0.9970 0.9330 0.1940 0.1414 0.6657
Leukemia 0.7611 0.9980 0.9846 0.9885 0.8693 0.6144 0.9529 0.7475
Lung 0.4670 0.9981 0.8174 0.9872 0.0001 0.9243 0.6322 0.5682
Lymphoma 0.1189 0.0698 0.3270 0.9897 0.8060 0.7101 0.5359 0.9632
Ovarian 0.2955 0.1277 0.0575 0.5384 0.7586 0.2733 0.0000 0.6552
Prostate 0.0057 0.5416 0.1229 0.8938 0.6564 0.8310 0.5864 0.7255

Table 6
The common gene percentages of the four gene selection mechanisms
Data set T vs. BW T vs. S T vs. GK BW vs. S BW vs. GK S vs. GK
Breast 0.94 0.55 0.16 0.56 0.16 0.49
CNS 0.70 0.58 0.14 0.57 0.13 0.14
Colon 0.88 0.84 0.34 0.83 0.34 0.33
Leukemia 0.54 0.61 0.39 0.65 0.36 0.41
Lung 0.07 0.52 0.32 0.46 0.37 0.49
Lymphoma 0.96 0.74 0.31 0.73 0.31 0.35
Ovarian 0.83 0.62 0.58 0.61 0.56 0.46
Prostate 0.76 0.30 0.11 0.25 0.08 0.23
Average 0.71 0.60 0.29 0.58 0.29 0.36

Before studying the prediction capability of the two-stage classification methods, it should be of interest to know whether the various gene selection mechanisms filter similar genes for classification. Table 6 summarizes the percentages of common genes for the four gene selection mechanisms. We can see that the three individual-gene-ranking mechanisms generally find over 50% but lower than 80% common genes for classification. However, the common gene percentage between any one of the three individual-gene-ranking mechanisms and GK, the representative of subset-gene-ranking mechanisms, is usually lower than 40%. We also calculated the common gene percentages for the four gene selection mechanisms when the number of genes selected at the first stage is either 50, 150, or 200, and obtained similar results. Conservatively speaking, more than half of the genes chosen by different categories of gene selection mechanisms for classification will be different.
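We read the entries of Table 6 as the shared fraction of two equally sized selected gene lists; a one-line sketch under that assumption (the exact definition is not spelled out above):

```python
def common_gene_percentage(genes_a, genes_b):
    """Fraction of the r selected genes shared by two mechanisms."""
    a, b = set(genes_a), set(genes_b)
    return len(a & b) / len(a)
```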
In comparing the prediction accuracies of the eight two-stage classification methods, we first fixed the classification tool applied at the second stage and performed the paired t test. The two classification tools applied at the second stage are the k-nearest neighbors and the logistic discrimination analysis combined with the partial least squares for dimension reduction; Tables 7 and 8 show the results of the hypothesis testing on the resulting mean prediction accuracies, respectively, where a bold value indicates that the mean prediction accuracies are significantly different and its superscript shows the gene selection mechanism with the larger mean prediction accuracy. When the classification tool is the k-nearest neighbors, we can see from Table 7 that the GK mechanism outperforms the other three mechanisms in almost all data sets. However, when the classification tool is the PL, the advantage of the GK mechanism appears only in classifying data set "prostate", as shown in Table 8. This suggests that a classification tool applied at the first stage for filtering genes is likely to enhance its own prediction accuracy at the second stage. Both Tables 7 and 8 show that the T, BW, and S mechanisms result in similar mean prediction accuracies in most cases, even though the common gene percentages of these three gene selection mechanisms are usually less than 80%.

Table 7
The p-values of the paired t tests when the classification tool applied at the second stage is the k-nearest neighbors
Data set T vs. BW T vs. S T vs. GK BW vs. S BW vs. GK S vs. GK
Breast 0.1781 0.2631 0.0083GK 0.8396 0.0086GK 0.0050GK
CNS 0.0960B 0.1232 0.0000GK 0.7680 0.0085GK 0.0036GK
Colon 0.2531 0.8017 0.1475 0.6026 0.0751GK 0.1220
Leukemia 0.5572 0.3576 0.0047GK 0.4580 0.0006GK 0.0067GK
Lung 0.0237B 0.0019S 0.0008GK 0.6866 0.0996GK 0.1627
Lymphoma 0.5501 0.4215 0.3593 0.4886 0.1890 0.1379
Ovarian 0.8363 0.0467S 0.0031GK 0.1039 0.0129GK 0.1951
Prostate 0.3415 0.7654 0.0000GK 0.4441 0.0017GK 0.0000GK

Table 8
The p-values of the paired t tests when the classification tool applied at the second stage is the logistic discrimination analysis combined with the partial
least squares
Data set T vs. BW T vs. S T vs. GK BW vs. S BW vs. GK S vs. GK
Breast 0.2582 0.0317S 0.1193 0.0511S 0.1855 0.4390
CNS 0.7959 0.9870 0.4042 0.7904 0.5487 0.4420
Colon 0.7926 0.0639T 0.8659 0.2524 0.7726 0.3428
Leukemia 0.5782 0.2025 0.3341 0.6532 0.5495 0.7370
Lung 0.7006 0.8675 0.5393 0.4072 0.9087 0.2765
Lymphoma 0.8887 0.3257 0.1289 0.2935 0.1016 0.0243S
Ovarian 0.5518 0.0884T 0.3275 0.0727B 0.5850 0.0746GK
Prostate 0.2150 0.5283 0.0195GK 0.7686 0.0258GK 0.0462GK

Next, we fixed the gene selection mechanism to investigate the impact of the classification tool applied at the second stage. The experimental results are summarized in Table 9, where a bold value indicates that the mean prediction accuracies are significantly different and its superscript shows the classification tool with the larger mean prediction accuracy. The k-nearest neighbors is superior in the three data sets "breast", "colon", and "prostate"; the PL tool outperforms in the two data sets "lymphoma" and "ovarian"; and the two classification tools perform almost equally in the other three data sets. As a general rule, the k-nearest neighbors is slightly better than the PL tool, but their difference is insignificant. In particular, dimension reduction can be a beneficial technique for some data sets in improving the classification accuracy.

Table 9
The p-values of the paired t tests for different classification tools applied at the second stage

Data set   T          BW         S          GK
Breast     0.0011K    0.0005K    0.3363     0.0019K
CNS        0.9570     0.4932     0.3148     0.0015K
Colon      0.0000K    0.0001K    0.0000K    0.0000K
Leukemia   0.6818     0.5521     0.5197     0.1403
Lung       0.0008PL   0.3301     0.2454     0.8095
Lymphoma   0.0042PL   0.0183PL   0.0019PL   0.7709
Ovarian    0.0123PL   0.0162PL   0.9004     0.0898PL
Prostate   0.0000K    0.0000K    0.0000K    0.0000K

In summary, the common gene percentage between two individual-gene-ranking mechanisms is larger than the common gene percentage between an individual-gene-ranking mechanism and a subset-gene-ranking mechanism. A subset-gene-ranking mechanism seems to be beneficial in filtering genes for classification. However, in our experiments, the GK mechanism is over 100 times slower than the three individual-gene-ranking mechanisms in selecting genes. An analyst must consider whether spending a lot of time to gain a limited accuracy improvement is worthwhile. When the mechanism for gene selection remains unchanged, the k-nearest neighbors is slightly better than the combination of the partial least squares and the logistic discrimination analysis. However, the characteristics of a data set are critical in determining whether a dimension reduction technique should be applied at the second stage. If properly used, a dimension reduction technique can significantly improve the prediction accuracies of some microarray data sets.
6. Conclusions

The causality of a disease is believed to be highly dependent on gene expression data. Many classification methods have therefore been developed for extracting such information from microarray data. With respect to the other classification methods for microarray data, two-stage classification methods have higher applicability and understandability. The gene selection mechanism at the first stage can be either individual gene ranking or subset gene ranking, and the classification tool at the second stage can include a dimension reduction technique or not. In this study, eight two-stage classification methods, composed of four gene selection mechanisms and two classification tools and corresponding to these four possible designs, are tested on eight microarray data sets.

Based on the testing results, the genes chosen by different categories of gene selection mechanisms are less than half in common. At the first stage for gene selection, a subset-gene-ranking mechanism can gain a limited improvement in classification accuracy, while its computational effort is generally much heavier. When a two-stage classification method includes a classification tool at the first stage for gene selection, this method is likely to have a higher prediction accuracy if the same classification tool is used at the second stage. A dimension reduction technique has the potential risk of losing information but can filter some noise contained in the data. Thus, whether the classification tool applied at the second stage should be preceded by a dimension reduction procedure depends on the characteristics of a data set.

A subset-gene-ranking mechanism usually employs a classification tool to determine whether a gene subset is highly discernible for a disease. A classification tool needs to introduce some inductive bias for predicting unseen instances (Mitchell, 1997). The inductive bias can be a language bias, a search bias, or a combination of these two kinds of bias. When a classification tool is employed

at the first stage for selecting genes, no matter what its inductive bias is, the inductive bias will confine the search space of the genes for classification. This means that some discernible gene subsets better than the best gene subsets found by the classification tool may be excluded from the search space by its inductive bias. Thus, it should be desirable to develop a subset-gene-ranking mechanism without any classification tool involved. This kind of gene selection mechanism would also let us fairly judge the performance of the classification tool applied at the second stage.

Our testing results show that a dimension reduction technique can filter the noise in some microarray data sets to increase the classification accuracy, but not in all data sets. So, the conditions under which a dimension reduction technique should be applied to a microarray data set will be interesting to biologists. Note that those conditions are based on the expression data of the genes selected at the first stage, not on the original data.

References

Albrecht, A., Vinterbo, S. A., & Ohno-Machado, L. (2003). An Epicurean learning approach to gene-expression data classification. Artificial Intelligence in Medicine, 28, 75–87.
Antoniadis, A., Lambert-Lacroix, S., & Leblanc, F. (2003). Effective dimension reduction methods for tumor classification using gene expression data. Bioinformatics, 19, 563–570.
Asyali, M. H., & Alci, M. (2005). Reliability analysis of microarray data using fuzzy c-means and normal mixture modeling based classification methods. Bioinformatics, 21, 644–649.
Desper, R., Khan, J., & Schäffer, A. A. (2004). Tumor classification using phylogenetic methods on expression data. Journal of Theoretical Biology, 228, 477–496.
Dudoit, S., Fridlyand, J., & Speed, T. (2002). Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association, 97, 77–87.
Friedman, N., Linial, M., Nachman, I., & Pe'er, D. (2000). Using Bayesian networks to analyze expression data. Journal of Computational Biology, 7, 601–620.
Georgii, E., Richter, L., Rückert, U., & Kramer, S. (2005). Analyzing microarray data using quantitative association rules. Bioinformatics, 21, 123–129.
Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., et al. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531–537.
Jörnsten, R., & Yu, B. (2003). Simultaneous gene clustering and subset selection for sample classification via MDL. Bioinformatics, 19, 1100–1109.
Khan, J., Wei, J. S., Ringnér, M., Saal, L. H., Ladanyi, M., Westermann, F., et al. (2001). Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine, 7, 673–679.
Lee, K. E., Sha, N., Dougherty, E. R., Vannucci, M., & Mallick, B. K. (2003). Gene selection: a Bayesian variable selection approach. Bioinformatics, 19, 90–97.
Li, L., Weinberg, C. R., Darden, T. A., & Pedersen, L. G. (2001). Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method. Bioinformatics, 17, 1131–1142.
Lu, Y., & Han, J. (2003). Cancer classification using gene expression data. Information Systems, 28, 243–268.
Martella, F. (2006). Classification of microarray data with factor mixture models. Bioinformatics, 22, 202–208.
Mitchell, T. M. (1997). Machine learning. McGraw-Hill.
Nguyen, D. V., & Rocke, D. M. (2002). Tumor classification by partial least squares using microarray gene expression data. Bioinformatics, 18, 39–50.
Park, P., Pagano, M., & Bonetti, M. (2001). A nonparametric scoring algorithm for identifying informative genes from microarray data. Proceedings of the Pacific Symposium on Biocomputing, 6, 52–63.
Qiu, P., Wang, Z. J., & Liu, K. J. R. (2005). Ensemble dependence model for classification and prediction of cancer and normal gene expression data. Bioinformatics, 21, 3114–3121.
Quackenbush, J. (2001). Computational analysis of microarray data. Nature Reviews Genetics, 2, 418–427.
Simek, K., Fujarewicz, K., Swierniak, A., Kimmel, M., Jarzab, B., Wiench, M., et al. (2004). Using SVD and SVM methods for selection, classification, clustering and modeling of DNA microarray data. Engineering Applications of Artificial Intelligence, 17, 417–427.
Wu, B. L. (2006). Differential gene expression detection and sample classification using penalized linear regression models. Bioinformatics, 22, 472–476.
Zhang, H., Yu, C., Singer, B., & Xiong, M. (2001). Recursive partitioning for tumor classification with gene expression microarray data. Proceedings of the National Academy of Sciences of the United States of America, 98, 6730–6735.
