variable for various combinations of the classification variables. An analysis of variance may be written as a linear model. The two-way analysis of variance is an extension of the one-way analysis of variance: there are two independent variables, called factors, and two-way ANOVA determines how a response is affected by both of them. Each factor has two or more levels within it, and the degrees of freedom for each factor is one less than its number of levels. The two-way ANOVA also captures interactions between rows and columns, that is, differences between rows that are not the same at each column, equivalently variation between columns that is not the same at each row. Each component of the two-way ANOVA table consists of a sum of squares, degrees of freedom, a mean square, and an F ratio. Each F ratio is the ratio of the mean-square value for that source of variation to the residual mean square (with repeated-measures ANOVA, the denominator of one F ratio is the mean square for matching rather than the residual mean square) [26].
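The paper gives no code for this step, but the two-way ANOVA table just described can be sketched in a few lines of Python using statsmodels; the data frame, the factor names "row" and "col", and the response "y" below are illustrative placeholders, not values from the paper.

    import pandas as pd
    from statsmodels.formula.api import ols
    from statsmodels.stats.anova import anova_lm

    # Toy data: two factors ("row" with 2 levels, "col" with 3 levels) and a
    # response y; each factor's degrees of freedom is one less than its
    # number of levels, as described above.
    df = pd.DataFrame({
        "row": ["a", "a", "a", "b", "b", "b"] * 2,
        "col": ["x", "y", "z", "x", "y", "z"] * 2,
        "y":   [4.1, 5.0, 6.2, 3.9, 5.5, 7.0, 4.3, 4.8, 6.0, 4.0, 5.7, 7.2],
    })

    # "C(row) * C(col)" fits both main effects plus the row-by-column
    # interaction; anova_lm reports the sum of squares, degrees of freedom,
    # F ratio, and p-value for each source of variation.
    model = ols("y ~ C(row) * C(col)", data=df).fit()
    print(anova_lm(model, typ=2))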
2.2 Step 2: Finding the minimum gene subset dimensional table.
After selecting some top genes in the importance ranking list, we attempt to classify the data set with one gene: we input each selected gene into our classifiers. If no good accuracy is obtained, we go on to classify the data set with all possible 2-gene combinations within the selected genes. If still no good accuracy is obtained, we repeat this procedure with all of the 3-gene combinations, and so on, until we obtain a good accuracy.
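As a hedged sketch of this search (our own illustration, not the paper's code), the loop below enumerates 1-gene, then 2-gene, then 3-gene combinations of the top-ranked genes and stops at the first combination whose cross-validated accuracy clears a threshold; the function name and the 0.95 threshold are assumptions.

    from itertools import combinations
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    def find_min_gene_subset(X, y, ranked_genes, threshold=0.95, max_size=3):
        """Sketch only. X: samples x genes matrix, y: class labels,
        ranked_genes: indices of the top genes from the ANOVA ranking."""
        for size in range(1, max_size + 1):          # 1 gene, then 2, then 3 ...
            for combo in combinations(ranked_genes, size):
                acc = cross_val_score(SVC(), X[:, list(combo)], y, cv=5).mean()
                if acc >= threshold:                 # "good accuracy" obtained
                    return combo, acc
        return None, 0.0                             # no subset reached the threshold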
In this paper, we used the following classifier to test the 2-gene combinations.

2.2.1 Support Vector Machines (SVMs)

Support Vector Machines (SVMs) [21] were originally designed for binary classification. SVMs [22] have since become a popular learning method, because they translate the input data into a larger feature space in which the instances are linearly separable, thus increasing efficiency. In SVM methods, a kernel, which can be considered a similarity measure, is used together with a map function to recode the input data. Even though the mathematics behind the SVM is straightforward, finding the best choices for the kernel function and its parameters can be challenging when the method is applied to real data sets. We use LIBSVM, developed by Chang and Lin [23]. Usually, the recommended kernel function for nonlinear problems is the Gaussian radial basis function [24], because it resembles the sigmoid kernel for certain parameters and requires fewer parameters than a polynomial kernel. The kernel parameter γ and the parameter C, which controls the complexity of the decision function versus the minimization of the training error, can be determined by running a 2-dimensional grid search: values for pairs of parameters (C, γ) are generated in a predefined interval with a fixed step, and the performance of each combination is computed and used to determine the best pair of parameters.
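This grid search can be sketched with scikit-learn, whose SVC classifier wraps the same LIBSVM library [23]; the grid ranges and the toy data below are illustrative assumptions, not the paper's actual search interval.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    # Pairs (C, gamma) are generated on a fixed logarithmic grid and the
    # cross-validated performance of each pair is compared.
    param_grid = {
        "C": np.logspace(-2, 4, 7),       # decision-function complexity vs. error
        "gamma": np.logspace(-4, 2, 7),   # RBF kernel parameter
    }
    X, y = make_classification(n_samples=60, n_features=20, random_state=0)
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)
    print(search.best_params_)            # the best (C, gamma) pair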
The non-sparse property of the solution leads to a really slow evaluation process. Thus, for microarray datasets, a data reduction [25] can be performed in terms of the genes or features of the dataset considered. Redundant or highly correlated features can be replaced with a smaller number of uncorrelated features that capture the same information. This is done by applying Principal Component Analysis (PCA) before using the SVM algorithm. The method is performed by solving an eigenvector problem or by using iterative algorithms, and the result is a set of orthogonal vectors called principal components. The mapping of the larger set into the new smaller set is done by projecting the initial instances onto the principal components. The first principal component is defined as the direction given by a linear regression fit through the input data; this direction holds the maximum variance of the input data. The second component is orthogonal to the first vector, uncorrelated with it, and defined to maximize the remaining variance. This procedure is repeated until the last vector is obtained.
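A minimal sketch of this reduction, assuming scikit-learn's PCA in place of the paper's unspecified implementation; the 0.95 retained-variance setting is an illustrative assumption.

    from sklearn.decomposition import PCA
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import SVC

    # Project the instances onto the leading orthogonal principal components
    # (enough of them to retain 95% of the variance) before training the SVM.
    model = make_pipeline(PCA(n_components=0.95), SVC(kernel="rbf"))
    # model.fit(X_train, y_train) and model.score(X_test, y_test), where
    # X_train/X_test are placeholder names for the expression matrices of
    # the training and test samples.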
The envisioned research follows the main steps of knowledge discovery processes:
Gene selection - the irrelevant attributes (genes) are removed and the selected data is represented as a two-dimensional table.
Preprocessing - if the selected table contains missing values or empty cell entries, it must be preprocessed in order to remove some of the incompleteness. Statistics should be run to obtain more information about the data.
Training and validation samples - the initial table is divided into at least two tables by a cross validation procedure; one will be used in the training step, the other in the validation or testing step.
Interpretation and evaluation - the validation or test data set is then used to test the classification performance of the methods in terms of efficiency and accuracy.

2.2.2 Algorithm Description

We used five fold cross validation in the experiments because formal training and test datasets are not available for this data set. More specifically, we randomly divide the data in each class into five groups. In each fold, the data points in four of the groups are used as a training set, and the data points in the remaining group are used as a test set; hence, we have five folds of the data. The training and test sets in each fold are independent, and the experiment on each fold is run independently. Cross validation is therefore used here for separating the data set into several groups of training and testing sets, not for avoiding overfitting [1]. Fig. 1 shows the procedure for cross validation.
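A sketch of this five fold procedure, assuming scikit-learn's StratifiedKFold to perform the per-class split described above; the expression matrix below is random toy data, and only the class sizes (42, 9, 11) are taken from the lymphoma data set described in Section III.

    import numpy as np
    from sklearn.model_selection import StratifiedKFold
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(62, 20))            # toy matrix: 62 samples x 20 genes
    y = np.repeat([0, 1, 2], [42, 9, 11])    # class sizes of the lymphoma data

    # Each class is divided into five groups; in each fold, four groups train
    # the classifier and the remaining group tests it, so the five test sets
    # are independent.
    for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
        clf = SVC().fit(X[train_idx], y[train_idx])
        print("fold accuracy:", clf.score(X[test_idx], y[test_idx]))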
III. RESULTS

In the lymphoma data set [13] there are 42 samples derived from Diffuse Large B-cell Lymphoma (DLBCL), nine samples from Follicular Lymphoma (FL), and 11 samples from Chronic Lymphocytic Leukemia (CLL). The entire data set includes the expression data of 4026 genes. In this data set a small part of the data is missing, and a k-nearest neighbor algorithm was applied to fill in those missing values [10]. In the first step, we randomly divided the 62 samples into two parts: 31 samples for testing and 31 samples for training. We ranked the entire set of 4026 genes according to their ANOVA values in the training set. We then picked out the top 20 genes, tested their 190 possible 2-gene combinations (see Table 1), and picked the combination with the highest ANOVA value (see Table 2).
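These imputation and ranking steps can be sketched as follows, assuming scikit-learn's KNNImputer in place of the k-nearest neighbor filler of [10] and a one-way ANOVA F statistic as the per-gene ranking criterion; k = 5, the missing-value rate, and the toy data are illustrative assumptions.

    import numpy as np
    from scipy.stats import f_oneway
    from sklearn.impute import KNNImputer

    rng = np.random.default_rng(0)
    X = rng.normal(size=(62, 4026))            # toy stand-in for the expression data
    X[rng.random(X.shape) < 0.02] = np.nan     # a small part of the data is missing
    y = np.repeat([0, 1, 2], [42, 9, 11])      # DLBCL, FL, CLL labels

    X = KNNImputer(n_neighbors=5).fit_transform(X)   # fill the missing values

    # Rank genes by the ANOVA F statistic across the three classes and keep
    # the top 20 for the combination search.
    F = np.array([f_oneway(X[y == 0, j], X[y == 1, j], X[y == 2, j]).statistic
                  for j in range(X.shape[1])])
    top20 = np.argsort(F)[::-1][:20]
    print(top20)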
TABLE 1.
Gene pair    Correct rate    Error rate
1,4          1               0
1,8          1               0
1,9          1               0
1,14         1               0
1,15         1               0
1,16         1               0
1,18         1               0
2,4          1               0
2,8          1               0
2,9          1               0
2,11         1               0
2,14         1               0
2,15         1               0
2,16         1               0
2,18         1               0
4,7          1               0
4,12         1               0
4,17         1               0
7,8          1               0
7,9          1               0
7,18         1               0
8,17         1               0
9,12         1               0
9,17         1               0
11,17        1               0
12,14        1               0
12,18        1               0
14,17        1               0
17,18        1               0
18,20        1               0

The five fold CV accuracy for the training data reached 97.15 percent for the SVM, and the corresponding testing accuracies varied from 96.77 to 100 percent. We compared all possible combinations of tests; the results are shown in Table 2. Compared with the existing method, our approach obtained very good accuracy.

TABLE 2.
Knnimpute    No. of folds    No. of genes    No. of comb.    CV Acc (%)    Test Acc (%)
(Data,3)     5               20              2               91.7          96.77
(Data,3)     5               20              3               93.97         97.6
(Data,3)     5               10              2               92.11         96.77
(Data,3)     5               10              3               93.31         100
(Data,3)     10              20              3               93.42         97.3
(Data,3)     10              20              2               91.26         96.77
(Data,3)     10              10              2               91.25         96.77
(Data,3)     10              10              3               92.47         100
(Data,5)     5               20              2               93.11         98.39
(Data,5)     5               20              3               96.4          98.4
(Data,5)     5               10              2               94.62         98.38
TABLE 4.
Knnimpute    No. of folds    No. of genes    No. of comb.    CV Acc (%)    Test Acc (%)
(Data,3)     5               20              2               89.85         96.77
(Data,3)     5               10              3               90.29         97.77
(Data,3)     5               10              2               89.16         96.77
(Data,5)     5               10              2               89.66         98.39
(Data,5)     5               10              3               90.08         96.77
(Data,5)     5               20              2               88.75         98.36
(Data,5)     5               20              3               90.05         96.87
(Data,3)     10              20              2               88.66         96.77

Using the T-test, we obtain 93.85 percent accuracy. A comparison of all three classifiers is shown in Table 5.

TABLE 5.
Classifier    Accuracy (%)
T-test        93.85
SVM           97.91
BPN           97.43

Comparing all three classifiers, our SVM classifier obtained very good accuracy.

IV. CONCLUSION

For our purpose of finding the smallest gene subsets for accurate cancer classification, both ANOVA and CV are highly effective ranking schemes, and the SVM is a sufficiently good classifier. As we have learned from the results on the lymphoma dataset, the gene combination that gives good separation may not be unique. In the lymphoma data set, we clustered the 20 selected genes using the K-means method; Matlab 7.0 was used to implement this procedure. Finally, we obtained very good accuracy compared to the T-test method.
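The paper implements this clustering step in Matlab 7.0; an equivalent sketch in Python, assuming scikit-learn's KMeans and an illustrative choice of two clusters, would be:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    genes = rng.normal(size=(20, 62))    # toy data: 20 selected genes x 62 samples

    # Group the selected genes by the similarity of their expression profiles;
    # the number of clusters (2) is an illustrative assumption, not the paper's.
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(genes)
    print(labels)                        # cluster assignment of each gene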
REFERENCES
[1] L. Wang, F. Chu, and W. Xie, "Accurate cancer classification using expressions of very few genes," IEEE/ACM Transactions on Computational Biology and Bioinformatics, 4, 40-52, 2007.
[2] Golub et al., "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring," Science, 286, 531-537.
[3] Perou et al., "Molecular portraits of human breast tumors," Nature, 406, 747-752.
[4] Cho J. H., Lee J. H., and Lee I. B., "New gene selection method for classification of cancer subtypes considering within-class variation," FEBS Letters, 551, 3-7.
[5] Kim H. and Park H., "Multi-class gene selection for classification of cancer subtypes based on generalized LDA."
[6] Shipp M. A. et al., "Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning," Nature Medicine, 8, 68-74.
[7] Van't Veer L. J. et al., "Gene expression profiling predicts clinical outcome of breast cancer," Nature, 415, 530-536.
[8] Vapnik V., The Nature of Statistical Learning Theory, Springer-Verlag, New York.
[9] Alter O., Brown P. O., and Botstein D., "Singular value decomposition for genome-wide expression data processing and modeling," Proceedings of the National Academy of Sciences, USA, 97(18), 10101-10106.
[10] Alter O., Brown P. O., and Botstein D., "Generalized singular value decomposition for comparative analysis of genome-scale expression data sets of two different organisms," Proceedings of the National Academy of Sciences, USA, 100(6), 3351-3356.
[11] Troyanskaya O. et al., "Missing value estimation methods for DNA microarrays," Bioinformatics, 17(6), 520-525.
[12] Oba S. et al., "A Bayesian missing value estimation method for gene expression profile data," Bioinformatics, 19(16), 2088-2096.
[13] Friedland S., Niknejad A., and Chihara L., "A simultaneous reconstruction of missing data in DNA microarrays," Institute of Mathematics and its Applications preprint series, No. 1948.
[14] A. A. Alizadeh et al., "Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling," Nature, 403, 503-511.
[15] Y. Lee and C. K. Lee, "Classification of multiple cancer types by multicategory support vector machines using gene expression data," Bioinformatics, 19, 1132-1139.
[16] M. P. Brown et al., "Knowledge-based analysis of microarray gene expression data by using support vector machines," Proceedings of the National Academy of Sciences, USA, 97, 262-267.
[17] Rosenberg S. A., "Classification of lymphoid neoplasms," Blood, 84, 1359-1360.
[18] Schena M., Shalon D., Davis R. W., and Brown P. O., "Quantitative monitoring of gene expression patterns with a complementary DNA microarray," Science, 270, 467-470.
[19] J. M. Khan et al., "Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks," Nature Medicine, 7, 673-679.
[20] C. Ambroise and G. J. McLachlan, "Selection bias in gene extraction on the basis of microarray gene-expression data," Proceedings of the National Academy of Sciences, USA, 99, 6562-6566.
[21] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, pp. 273-297, 1995.
[22] T. Joachims, "Making large-scale SVM learning practical," in B. Scholkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pp. 169-184, MIT Press, Cambridge, MA, 1999.
[23] C.-C. Chang and C.-J. Lin, "LIBSVM: a library for support vector machines," software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2001.
[24] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, Cambridge, England, 2000.
[25] Y.-J. Lee and O. L. Mangasarian, "RSVM: reduced support vector machines," Proceedings of the First SIAM International Conference on Data Mining, Chicago, April 5-7, 2001.
[26] H. Zhang, N. Ye, J. He, A. Roontiva, and J. Aguay, "Two-way ANOVA to identify impacts of multiple interactive behavioral factors on the neuronal population dependency during the reaching motion," 30th Annual International IEEE EMBS Conference, Vancouver, British Columbia, Canada, August 20-24, 2008.

Mrs. A. Bharathi received her Bachelor of Engineering degree from Kongu Engineering College, Perundurai, in 1998 and her Master of Engineering degree from Bannari Amman Institute of Technology, Sathyamangalam, in 2007, and she is pursuing her Doctor of Philosophy in Computer Science and Engineering at Anna University, Coimbatore. She is currently an Assistant Professor in the Department of IT, Bannari Amman Institute of Technology, Sathyamangalam. Her professional activities include guiding ten completed UG projects and seven ongoing UG and three PG projects. She has published and presented 10 papers in international and national conferences and has also published 2 international and 3 national journal papers.

Dr. A. M. Natarajan received his Bachelor of Engineering degree from PSG College of Technology, Coimbatore, in 1968, his Master of Engineering degree from PSG College of Technology, Coimbatore, in 1970, and his Doctor of Philosophy in Systems Engineering from Bharathiar University, Coimbatore, in 1984. He was the Principal of Kongu Engineering College at the time of relieving. He is currently the Chief Executive and a Professor at Bannari Amman Institute of Technology, Sathyamangalam. His professional activities include having guided 15 Ph.D. candidates and currently guiding 21 Ph.D. candidates in the fields of CSE, EEE, and ECE. He has published and presented more than 150 papers in international and national journals and conferences.
Fig. 1. The procedure for five fold cross validation: each fold is tested in turn after training on the remaining folds, until all five folds have been tested.