Fusion Feature Selection: New Insights Into Feature Subset Detection in Biological Data Mining

R. Rajavel 1, R. Athilakshmi 1*, Shomona Gracia Jacob 2

1 Electronics and Communication Engineering, SSN College of Engineering, Chennai
2 Computer Science and Engineering, SSN College of Engineering, Chennai
* [email protected]

ABSTRACT: In bioinformatics, the steadily growing number of gene expression samples and the associated dimensionality of DNA microarray research data create a large gap that calls for more efficient feature selection and classification algorithms for gene expression data. This study presents a new fusion feature selection algorithm that combines Correlation Feature Selection (CFS) with the Velocity Clamping Particle Swarm Optimization (VCPSO) algorithm. The hybrid model capitalizes on both filters and wrappers and selects optimal feature subsets for classifying genes with different classifiers. The proposed hybrid mechanism was evaluated on two datasets, namely neurodegenerative brain disorder protein data and microarray cancer data. Experimental results show that the proposed CFS-VCPSO-SVM selection method eliminates redundant features and lowers the dimension of gene expression data for classification. The proposed CFS-VCPSO-SVM approach also achieved high classification accuracy in categorizing the different classes of disorders and cancer.

1. INTRODUCTION
The traditional method of classifying microarray gene expression data using a pure filter or a pure wrapper is modified here through a fusion of feature selection algorithms. A pure filter is fast, but its learning results are not sufficient for wide feature datasets. Wrappers, on the other hand, give better learning results but are very slow on wide feature sets that contain hundreds or even thousands of features. The purpose of a fusion of algorithms is to incorporate the benefits of the wrapper into the efficient filter in order to extract the informative genes that improve the classification of gene expression data, and to intensify the learning process of classifiers so that the results are stable (Hsu, Hsieh, & Lu, 2011).
In this research, the authors attempted to utilize the benefits of both filters and wrappers by modifying the existing feature selection techniques as detailed in the methodology. Here, the feature sets are first filtered based on minimum redundancy and maximum relevance to the target class, and the reduced feature sets are further tuned by a wrapper procedure named Velocity Clamping Particle Swarm Optimization (VCPSO). This fusion methodology was applied to select relevant features from the neurodegenerative brain disorder protein data and the benchmark microarray cancer datasets. Both datasets are characterized by innumerable gene features, which in turn raises the need to identify prognostic (significant) genes that play an important role in detecting the disease earlier. Moreover, the presence of inessential, insignificant and undesirable genes in the dataset degrades both the computing efficiency and the classification accuracy of machine learning algorithms (Golub et al., 1999; Wu, Dai, Shi, Yan, & Zhang, 2012). The proposed fusion feature selection strategy combines correlation calculation and swarm optimization techniques to reduce the computational cost and search complexity. We also demonstrate the performance of the proposed fusion feature selection method by comparing its classification accuracy with that of other existing feature selection methods.
Following this, the strength of the proposed CFS-VCPSO algorithm is tested on the selected datasets with four different classifiers. The paper is organized as follows: the literature survey is discussed in Section 2; the proposed fusion feature selection methodology is described in Section 3; the gene expression datasets and parameter settings are defined in Section 4; the experimental results are presented in Section 5; and the conclusions are given in Section 6.

2. LITERATURE SURVEY
The main task in machine learning is identifying an optimal set of features for a given problem from which
to construct a classification model. Here we first review early research concerning filters, followed by wrapper techniques. Hoque et al. (2014) introduced a filter-based feature selection framework to find the ideal subset of features. Their method combines gene-gene mutual information and gene-class mutual information and shows significant results in classifying high-dimensional data. Ding and Peng (2005) proposed a minimum redundancy maximum relevance (MRMR) feature selection framework, which considers maximum relevance towards the class and minimum redundancy among genes in classifying microarray gene data. The MRMR method selects features with improved classification accuracy and better generalization. Another work on filters was reported by Wald et al. (2014). They applied the CFS technique to a diverse collection of bioinformatics datasets and built an effective classification model that reduces the problems of class inconsistency, high dimensionality, and information redundancy. Tan et al. (2008) proposed a hybrid framework for feature selection that combines a genetic algorithm (GA) with different existing feature selection algorithms for selecting an optimal number of genes from DNA microarray gene expression data. They concluded that hybrid approaches are more effective in finding optimal genes with better classification accuracy. Zhang et al. (2009) proposed an effective feature selection scheme that employs the mutual information maximization (MIM) method to improve the classification accuracy of multi-label problems. Shen et al. (2008) integrated tabu search and PSO in a hybrid feature selection algorithm to select an optimal number of genes on microarray datasets, but the results were not satisfactory. Kang and Song (2017) introduced a new filter based on robust weighting schemes for microarray gene data selection. They utilized the SAM (Significance Analysis of Microarrays) technique, in which the genes with a statistically significant difference in expression are identified. Different variants of SAM were formed using modified t-tests and named M-SAM1 and M-SAM2. Their results demonstrated that the modified methods identified target genes better than the original method, SAM. Zhou et al. (2006) developed a Mutual Information Rough Set (MIRS) model along with BPSO for classification of cancer genes in microarray gene expression data. They concluded that their model is superior in classifying microarray data when compared to classical feature selection methods. Ramani and Jacob (2013) proposed a gene selection algorithm for microarray cancer datasets in which genes are identified based on the Rank-Weight Feature-selection (RWFS) approach. Here, they assigned weights to attributes selected by different feature selection methods, followed by ranking. They concluded that their methods were effective in predicting gene features of benchmark cancer datasets.
The algorithms above make use of filters and/or wrappers for feature selection. For microarray gene expression data classification, several classical filter-wrapper approaches have been applied. However, a filter cannot guarantee the learning results of the classifier because it ignores the particular heuristics and preferences of the classifier, which might lower the classification accuracy. Wrappers, on the other hand, detect high feature dependencies but become computationally inefficient when the search space grows.
In this study, a novel fusion feature selection strategy is proposed which integrates the benefits of the CFS (Correlation Feature Selection) algorithm with the VCPSO (Velocity Clamping Particle Swarm Optimization) algorithm to select the non-redundant and relevant genes of the gene expression data and categorize the data more accurately. The mechanism capitalizes on the efficiency of filters and the learning accuracy of wrappers. Moreover, combining the efficient filter with a different search strategy is required since it accelerates the learning process of classifiers and stabilizes the classification accuracy. CFS measures correlations among features and with the class, and finds feature subsets that are highly correlated with the class and least correlated with one another, while VCPSO addresses the early convergence issue of PSO by clamping the velocity within a search space. The main contributions of the proposed fusion framework are as follows:
• Novelty: The proposed CFS-VCPSO is a computationally efficient method used to solve the early convergence issues of PSO by limiting the velocity within a search space. A velocity clamping method is utilized which controls the searching ability of PSO in a way that significantly improves the performance of the algorithm.
• Effectiveness: The proposed hybrid algorithm capitalizes on the advantages of CFS and VCPSO. The optimal gene set selected by our algorithm improves classification accuracy compared with existing feature selection approaches.
• Robustness: To analyze the performance of our proposed fusion approach, four different classifiers were executed on the selected gene expression data with the selected feature sets.

3. FUSION FEATURE SELECTION ALGORITHM CFS-VCPSO SELECTION

3.1. Correlated Feature Selection
In order to decrease the search space of the high-dimensional datasets, we used the Correlation-based Feature Selection (CFS) algorithm, which filters a feasible feature subset from a given sample space by including only those features which are mutually
uncorrelated but have greater predictive ability toward a class (Hall, 1999). If the features are mutually uncorrelated, redundancy is eliminated, whereas greater relevance of the features to the class ensures better prognostic ability.

The CFS criterion is defined as follows:

$$ \mathrm{CFS} = \max_{S} \left[ \frac{r_{cf_1} + r_{cf_2} + \dots + r_{cf_n}}{\sqrt{n + 2\,(r_{f_1 f_2} + \dots + r_{f_i f_j} + \dots + r_{f_{n-1} f_n})}} \right] \tag{1} $$

To measure the worth of a feature subset S consisting of n features, the algorithm uses the following heuristic merit of a subset:

$$ HM_S = \frac{n\,\bar{r}_{g,c}}{\sqrt{n + n(n-1)\,\bar{r}_{g,g}}} \tag{2} $$

where $\bar{r}_{g,c}$ is the average value of the gene-class correlation and $\bar{r}_{g,g}$ is the average value of the gene-gene correlation.
A high value of $HM_S$ indicates strong correlation between the gene features and the class and low redundancy among the genes in the subset.

3.2. Particle Swarm Optimization
Particle Swarm Optimization (PSO) is a computational technique based on swarm intelligence that generates a sequence of improved solutions from a given set of solutions (Eberhart & Kennedy, 1995). Velocity Clamping PSO (VCPSO) is a variant of PSO that prevents a particle from leaving the search space by clamping its velocity within the search limit. In standard particle swarm optimization, parameters such as swarm size, inertia weight, and neighborhood size play an important role in convergence. A number of variations of the standard PSO have been developed to enhance the convergence speed and the quality of the solutions discovered by PSO (Shi & Eberhart, 1998; Eberhart & Shi, 2000; Zhan et al., 2009; Zhang, 2015). This work proposes an additional velocity clamping strategy based on boundary handling mechanisms for the case in which the velocity of a particle leaves the defined boundaries. In PSO, a large number of particles move through the search space to obtain the best feasible solution for a given objective function within a limited number of iterations. Before moving a particle to a new position, the algorithm checks whether the particle stays within the search space. If the velocity of the particle is below the maximum velocity, the particle is allowed to move to the new position; otherwise the velocity of the particle is set to $V_{max,j}$. This strategy is represented by the following equation.

$$ V_{ij}^{new}(t+1) = \begin{cases} V_{ij}(t+1), & V_{ij}(t+1) < V_{max,j} \\ V_{max,j}, & \text{otherwise} \end{cases} \tag{3} $$

If the particle's velocity is smaller than $V_{min,j}$, it is clamped to $V_{min,j}$ by the following equation.

$$ V_{ij}^{new}(t+1) = \begin{cases} V_{ij}(t+1), & V_{ij}(t+1) > V_{min,j} \\ V_{min,j}, & \text{otherwise} \end{cases} \tag{4} $$

The maximum and minimum velocities are initialized by Eqs. (5) and (6).

$$ V_{max,j} = \lambda\,(x_{max,j} - x_{min,j}) \tag{5} $$

$$ V_{min,j} = \lambda\,(x_{min,j} - x_{max,j}) \tag{6} $$

where $x_{max,j}$ and $x_{min,j}$ are the maximum and minimum positions of particles in the jth dimension obtained from initial test runs of a particle, and $\lambda$ is a constant factor in [0, 1]. The complete framework of the proposed model CFS-VCPSO is shown in Fig. 1 and explained in detail in Section 3.3.

3.3. CFS-VCPSO-Selection
In the pre-selection phase, we started with a data pre-processing step in which, for every dataset, we replaced the missing gene expression values with average values. The whole dataset was standardized to have a mean of zero and a standard deviation of one by using the following equation:

$$ x_{new} = \frac{x - \mu}{\sigma} \tag{7} $$

Here $\mu$ is the mean and $\sigma$ is the standard deviation of the given data, and $x_{new}$ is the new value of a gene expression $x$.
Following that, we applied the multivariate CFS filter, which retains the features that are strongly related to the class yet uncorrelated with each other. The reduced feature set $D_R$ was further optimized in the gene optimization phase using a meta-heuristic wrapper approach, Velocity Clamping Particle Swarm Optimization (VCPSO), with which the model achieved maximum classification accuracy. Four different classifiers are used to estimate the predictive ability of the proposed approach, with stratified 10-fold cross-validation. Specifically, in the gene optimization phase, the objective is to find an optimal gene subset $\hat{x}$ such that

$$ f(\hat{x}) \ge f(x), \quad \forall x \in D_R \tag{8} $$

where $x$ is a vector of the reduced feature set $D_R$ and $f(x)$ is the fitness function. In PSO, a swarm of particles is represented as potential solutions, and
each particle is associated with two vectors, i.e., the velocity vector ($V^t$) and the position vector ($X^t$), defined as

$$ V_m^t = (V_{m1}^t, V_{m2}^t, \dots, V_{mn}^t) \tag{9} $$

$$ X_m^t = (X_{m1}^t, X_{m2}^t, \dots, X_{mn}^t) \tag{10} $$

The best position found so far by a particle is denoted by 'pbest', and the position of the particle with the best fitness value in the swarm is denoted by 'gbest'. The following formula is used to compute the velocity.

$$ V^{t+1} = w^t V^t + r_1 c_1 (p_{best} - x) + r_2 c_2 (g_{best} - x) \tag{11} $$

[Figure 1: flowchart. Dataset -> Pre-process Dataset (Standardization) -> CFS Filter for Gene subset selection -> VCPSO based Gene Subset Optimization -> Training data / Testing data -> Classifier (Model learning, Model Validation (10-fold)) -> Classification Accuracy]
Fig. 1. Proposed model of CFS-VCPSO selection

The current velocity $V^{t+1}$ is computed by adding two components to the previous velocity $V^t$ of the particle. The difference between 'pbest' and the current position 'x' of the particle constitutes the first component, whereas the difference between 'gbest' and the current position 'x' of the particle constitutes the second component, weighted by the learning constants c1, c2 and the random numbers r1, r2. Using this computed velocity, the next position of the particle is updated using the following equation.

$$ X^{t+1} = X^t + V^{t+1} \tag{12} $$

The movement of each particle is guided by its own best-known position (pbest) within the search space in addition to the swarm's best-known position (gbest). In each iteration, every particle updates its current position and velocity based on the pbest and gbest values. Once improved positions are discovered, they in turn guide the movements of the swarm. This amalgam of correlation calculation and swarm optimization is proposed as the fusion methodology. The key aspect of the model is that it integrates the fast and efficient dimensionality reduction of the multivariate filter (CFS) with the simple yet intelligent Particle Swarm Optimization approach. The feature subsets returned by VCPSO were used to train the SVM classifier model. The performance of the model was evaluated by 10-fold cross-validation by giving test data as input. Finally, feature subsets with high classification accuracy and high AUC value were taken as the optimal genes returned by the model.

Algorithm 1: CFS-VCPSO feature selection
1: Read the gene attributes g1, g2, g3, ..., gn into an array f[].
2: Pre-process the dataset according to Eq. (7)
3: Apply the CFS algorithm to remove redundant features.
   a) Find the correlation between the gene attributes in the subset, ρg,g
   b) Find the correlation between the gene attributes and the class, ρg,c
   c) Measure the worth of the feature subset according to Eq. (2)
4: Repeat step 3 with different feature subsets.
5: Print the reduced dataset DR that contains the feature attributes with high class correlation and low feature correlation based on Eq. (1).
6: Read DR and set the VCPSO parameters according to Table 1.
7: Initialize the swarm population
8: repeat
9:   for all particles i in the swarm do
10:     Evaluate the fitness function f(Xi)
11:     if f(Xi) > f(Pbest(i)) then
12:       Update the particle's best position, Pbest(i) = Xi
13:     end if
14:     if f(Xi) > f(gbest) then
15:       Update the swarm's global best position, gbest = Xi
16:     end if
17:   end for
18:   for all particles i in the swarm do
19:     for all features (dimensions) j of particle i do
20:       Vij(t+1) = ω * Vij(t) + c1 r1 (Pbest(ij) - xij) + c2 r2 (gbest(j) - xij)
21:       Limit particle velocity Vij(t+1) according to Eq. (3) and Eq. (4):
22:       if Vij(t+1) > Vmax then
23:         Vij(t+1) = Vmax
24:       end if
25:       if Vij(t+1) < Vmin then
26:         Vij(t+1) = Vmin
27:       end if
28:       Update the next position of the particle according to Eq. (12)
29:     end for
30:   end for
31: until all iterations are done
32: Train the SVM classifier on the features returned by PSO.
33: Test the trained model by giving test data to the model.
34: return optimal gene set x̂ according to Eq. (8)
35: return maximum classification accuracy achieved by the model.

4. GENE EXPRESSION DATASETS AND PARAMETER SETTINGS
This research aims at achieving maximum classification accuracy in high-dimensional biological datasets. The first dataset is the neurodegenerative brain disorder gene expression data of Alzheimer's and Parkinson's disease, based on structural and physicochemical properties of proteins and peptides derived from amino-acid sequences. This is a sequel to the research on detecting lung cancer tumors based on structural and physicochemical properties of proteins (Ramani & Jacob, 2013; Hosseinzadeh et al., 2012). Their method discovered a feasible number of features for differentiating small cell lung cancer from non-small cell lung cancer with improved classification accuracy. The predicted class information of lung cancer tumors could reveal the protein function and genetic markers for the two diseases, thereby enabling the development of possible targets for drug formulation. The complete description of the data extraction and feature nomenclature is given in [21]. The second dataset is the gene selection in microarray cancer data, comprising five benchmark datasets: brain tumor, glioblastoma, lung cancer, leukemia and gastric cancer. These datasets were downloaded from the Biolabs Data Set Repository, which stores both experimental values and gene names [22]. The proposed fusion feature selection algorithm combined the CFS algorithm with bio-inspired PSO along with the modifications (summarized in Algorithm 1). The advantage of CFS is that it does not require the user to specify any thresholds or the number of features to be selected, although both are simple to incorporate if desired. Most importantly, CFS is a filter and, as such, does not incur the high computational cost associated with repeatedly invoking a learning algorithm. The advantage of using PSO is that it is based on swarm intelligence and its search process is very fast. The proposed model attempts to select relevant and non-redundant genes by combining the computationally fast CFS filter and the efficient wrapper VCPSO in a single approach. In our solution, the inertia weight ω takes values between 0.9 and 0.4: over the iterations, ω is decremented gradually from 0.9 to 0.4. In order to analyze the performance of our proposed model, stratified 10-fold cross-validation was employed. The parameters used in the Velocity Clamping PSO of the gene optimization phase are listed in Table 1. The main characteristics of the gene expression data of the neurodegenerative brain disorder protein dataset and the microarray cancer datasets are tabulated in Table 2.

Table 1 Parameters for Velocity Clamping PSO algorithm

| Parameters | Values |
|---|---|
| Swarm size | 20 |
| Total number of iterations | 20 |
| ω | 0.4-0.9 |
| C1 | 2 |
| C2 | 2 |
| Vmin | -6 |
| Vmax | 6 |

5. EXPERIMENTAL RESULTS
In this section, the effectiveness of our proposed fusion methodology for feature selection is evaluated by comparing the experimental results obtained on the Neurodegenerative Brain Disorder Dataset and the Microarray Cancer Datasets with four other algorithms, ReliefF (Kononenko, 1994), MIM (Mutual Information Maximization) (Lu et al., 2017), RWFS (Rank-Weight Feature Selection) (Ramani & Jacob, 2013) and CFS-PSO (Yang et al., 2008), on the basis of classification accuracy and the number of genes selected. The SVM with the same parameter setting is applied to the selected gene subsets of all five algorithms. The parameters for SVM execution were set as follows: the penalty coefficient and the gamma value were set at 0.12 and 0.13, while the kernel
function was Sigmoid. The metrics used for measuring the performance of the classifier were AUC and classification accuracy. The classification accuracy rates are shown in Table 4. On two datasets (Gastric Cancer and Lung Cancer) the proposed method improved the classification accuracy to about 99%. The next highest classification accuracy, about 98%, was observed on the Brain Tumor, Glioblastoma and Leukemia datasets. For the neurodegenerative brain disorder dataset, the classification accuracy estimated for our proposed method was 94.9%. In all datasets, on average, a small number of features was selected by the proposed method, as shown in Table 3. The next best performance on all six datasets is reported by CFS-PSO. It is evident from Table 4 that the proposed method achieves the highest accuracy on five of the six datasets and is comparable to CFS-PSO on the Brain Tumor dataset.
Next, the performance of the selected feature subsets of all five feature selection algorithms is evaluated by the second metric, AUC, on all datasets. The AUC results indicate that the feature subset selected by the proposed algorithm produces a classifier with larger AUC than that of the other methods, as shown in Table 5. The ROC curves of the five feature selection methods based on 10-fold cross-validation on all datasets are shown in Figures 2, 3, 4, 5, 6 and 7 respectively. Each curve plots the false positive rate on the x-axis and the true positive rate on the y-axis, and the diagonal line corresponds to the random classifier. For the neurodegenerative brain disorder dataset, the AUC estimated for our proposed method was 0.94. For the Brain Tumor and Gastric Cancer datasets, it was 0.98. For Lung Cancer and Leukemia it was 0.99, and for Glioblastoma it was 0.97. From the plots, it was noted that the CFS-VCPSO method yields a higher-running curve and a larger AUC value for all datasets when compared to the other feature selection methods. This means the feature subsets selected by the proposed approach have the best classification performance among all five feature selection methods.
The fitness of the selected gene subset of CFS-VCPSO selection is further evaluated by three classification algorithms, Random Forest (RF), Naïve Bayes (NB) and Decision Tree (DT), along with SVM. The SVM classifier showed the best performance in all datasets and achieved the highest accuracy of 99% on the Lung Cancer and Gastric Cancer datasets, as shown in Table 6. Next to SVM, the RF classifier showed equivalent performance on the Neurodegenerative Brain Disordered protein data, while the NB classifier showed equivalent performance on the Leukemia dataset. From this experiment, we concluded that the SVM is the most suitable classifier for the CFS-VCPSO-Selection algorithm. In summary, the four classifiers reach classification accuracy rates higher than 90% in most cases (Table 6), which indicates the robustness of the CFS-VCPSO-Selection method.

6. CONCLUSIONS
In this work, a fusion technique for feature selection has been proposed that combines the multivariate filter CFS, the wrapper approach VCPSO and the Support Vector Machine classifier with stratified cross-validation. We compared our model with four popular methods, namely ReliefF, MIM, RWFS and CFS-PSO, on two large datasets.
From the results, it is evident that the CFS-VCPSO-SVM algorithm is capable of reducing the dimensions of the original gene expression datasets and removing data redundancies while yielding acceptable classification accuracy for huge feature sets. The reduced gene subset of the CFS-VCPSO approach is further evaluated by three classification algorithms along with SVM. From the results, we observed that CFS-VCPSO together with the SVM classifier performs well on the datasets considered in this research. The proposed fusion model reveals the importance of an integrated feature selection approach prior to classification for complex datasets. Extensions to this work will focus on enhancing the efficiency of the velocity updates that influence the particle's position using different variants of PSO.

Table 2 Characteristics of Gene expression Data of Neuro-degenerative Brain Disordered protein and Microarray Cancer.

| Datasets | No. of total genes | Total no. of samples | No. of classes | References |
|---|---|---|---|---|
| (a) Neurodegenerative Brain Disordered protein data | 1437 | 111 | 3 | (Jacob & Athilakshmi, 2016) |
| (b) Microarray Cancer | | | | |
| Brain Tumor | 7129 | 40 | 5 | (Mramor et al., 2007) |
| Glioblastoma | 12625 | 50 | 4 | [22] |
| Lung Cancer | 12600 | 203 | 3 | [22] |
| Leukemia | 5147 | 72 | 4 | [22] |
| Gastric Cancer | 4522 | 30 | 3 | [22] |

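The pre-processing applied to the datasets (mean imputation of missing values followed by the standardization of Eq. (7)) can be sketched for a single gene column as follows. This is only a minimal illustration: the function name `impute_and_standardize`, the use of `None` to mark a missing value, and the choice of the population standard deviation are assumptions, not details given in the paper.

```python
from statistics import mean, pstdev


def impute_and_standardize(column):
    """Fill missing entries (None) with the column mean, then apply Eq. (7)."""
    observed = [v for v in column if v is not None]
    mu = mean(observed)  # average of the observed expression values
    filled = [mu if v is None else v for v in column]
    sigma = pstdev(filled)  # assumption: population standard deviation
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in filled]
    return [(v - mu) / sigma for v in filled]


# Hypothetical expression values for one gene across four samples.
expression = [2.0, None, 4.0, 6.0]
z = impute_and_standardize(expression)
print(z)
```

Because the missing entry is replaced by the mean, it maps to zero after standardization, and the resulting column has zero mean and unit standard deviation.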
Table 3 Comparison of average number of Genes selected for six datasets

| Datasets | ReliefF | MIM | RWFS | CFS-PSO | CFS-VCPSO |
|---|---|---|---|---|---|
| (a) Neurodegenerative Brain Disordered protein Dataset | 954 | 923 | 54 | 122 | 33 |
| (b) Microarray Cancer Datasets | | | | | |
| Brain Tumor | 6323 | 6129 | 3 | 199 | 9 |
| Glioblastoma | 6745 | 5625 | 5 | 255 | 31 |
| Lung Cancer | 5401 | 4941 | 3 | 317 | 22 |
| Leukemia | 4203 | 3980 | 6 | 276 | 27 |
| Gastric Cancer | 3412 | 3122 | 4 | 145 | 12 |

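The filter stage behind these subset sizes scores a candidate subset with the heuristic merit of Eq. (2). A minimal sketch of that score is given below, following the form in Hall (1999) with the square root in the denominator. The use of absolute Pearson correlation for both the gene-class and gene-gene terms, the numeric class labels, and the names `pearson` and `cfs_merit` are assumptions made for illustration; the paper does not state which correlation measure was used.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb)  # undefined for constant columns


def cfs_merit(features, target):
    """Heuristic merit of a subset: n*r_gc / sqrt(n + n(n-1)*r_gg)."""
    n = len(features)
    # Average gene-class correlation.
    r_gc = sum(abs(pearson(f, target)) for f in features) / n
    # Average gene-gene correlation over all unordered pairs.
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if pairs:
        r_gg = sum(abs(pearson(features[i], features[j])) for i, j in pairs) / len(pairs)
    else:
        r_gg = 0.0
    return n * r_gc / sqrt(n + n * (n - 1) * r_gg)


# Hypothetical toy data: two genes and a binary class encoded as 0/1.
y = [0, 0, 1, 1]
g1 = [1.0, 2.0, 3.0, 4.0]
g2 = [2.0, 1.0, 4.0, 3.0]
print(cfs_merit([g1], y), cfs_merit([g1, g1], y), cfs_merit([g1, g2], y))
```

In this toy case an exact duplicate of a gene leaves the merit unchanged, whereas a second gene that is relevant to the class but only partly correlated with the first raises it, which is the behaviour the criterion is meant to reward.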
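Table 4 Comparisons of mean classification accuracy (%) obtained for six datasets

| Datasets | ReliefF | MIM | RWFS | CFS-PSO | CFS-VCPSO |
|---|---|---|---|---|---|
| (a) Neurodegenerative Brain | 55.2 | 66.7 | 71.9 | 91.5 | 94.9 |
| Brain Tumor | 74.3 | 79.3 | 77.5 | 98.1 | 98 |
| Glioblastoma | 66.1 | 68.3 | 90 | 92.2 | 98 |
| Lung Cancer | 74.3 | 79.3 | 94.1 | 98.3 | 99.1 |
| Leukemia | 71.6 | 77.7 | 65 | 92.9 | 98.3 |
| Gastric Cancer | 80.3 | 74.2 | 93.3 | 97.7 | 99 |
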
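The accuracies above are means over stratified 10-fold cross-validation. The fold construction can be sketched as below; this is a simplified, deterministic round-robin assignment written for illustration, not the procedure of any particular toolkit, and the names `stratified_folds` and `mean_accuracy` as well as the class labels are hypothetical.

```python
from collections import defaultdict


def stratified_folds(labels, k=10):
    """Assign sample indices to k folds so each fold keeps similar class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        # Deal the samples of one class across the folds in turn.
        for pos, idx in enumerate(by_class[label]):
            folds[(offset + pos) % k].append(idx)
        offset += len(by_class[label])
    return folds


def mean_accuracy(fold_accuracies):
    """Mean of the per-fold accuracies, as reported in a results table."""
    return sum(fold_accuracies) / len(fold_accuracies)


# Hypothetical label vector: 30 samples in three equally sized classes.
labels = ["A"] * 10 + ["B"] * 10 + ["C"] * 10
folds = stratified_folds(labels, k=10)
print([sorted(labels[i] for i in fold) for fold in folds])
```

Each fold serves once as test data while the remaining nine are used for training, and the per-fold accuracies are then averaged with `mean_accuracy`.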
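Table 5 Comparisons of AUC obtained for six datasets

| Datasets | ReliefF | MIM | RWFS | CFS-PSO | CFS-VCPSO |
|---|---|---|---|---|---|
| (a) Neurodegenerative Brain | 0.64 | 0.82 | 0.86 | 0.89 | 0.94 |
| Brain Tumor | 0.68 | 0.79 | 0.87 | 0.91 | 0.99 |
| Glioblastoma | 0.71 | 0.78 | 0.88 | 0.90 | 0.99 |
| Lung Cancer | 0.79 | 0.81 | 0.82 | 0.87 | 0.96 |
| Leukemia | 0.51 | 0.77 | 0.81 | 0.88 | 0.97 |
| Gastric Cancer | 0.76 | 0.79 | 0.83 | 0.92 | 0.98 |
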
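For readers unfamiliar with the metric, the area under the ROC curve equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. A minimal sketch for the binary case is shown below; the datasets in this study are multi-class, so how the per-class values were combined is not stated here, and the function name `auc_from_scores` and the example scores are hypothetical.

```python
def auc_from_scores(labels, scores):
    """Binary AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores for six samples (1 = positive class).
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4]
print(auc_from_scores(labels, scores))
```

In this example eight of the nine positive/negative pairs are ordered correctly, so the AUC is 8/9; a classifier that assigns the same score to every sample obtains 0.5, the value of the diagonal in a ROC plot.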
Table 6 Performance of classifiers on selected gene subset of CFS-VCPSO selection

| Datasets | Support Vector Machine (%) | Decision Tree (%) | Naive-Bayes (%) | Random Forest (%) |
|---|---|---|---|---|
| (a) Neurodegenerative Brain Disordered protein Dataset | 94.9 | 78.4 | 75.7 | 94.9 |
| Brain Tumor | 98 | 95 | 95 | 95 |
| Glioblastoma | 98 | 94 | 94 | 96 |
| Lung Cancer | 99.1 | 97.1 | 97.1 | 91.2 |
| Leukemia | 98.3 | 87.5 | 98.3 | 94.4 |
| Gastric Cancer | 99 | 96.7 | 75.7 | 78.4 |

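The gene subsets evaluated here come from the velocity-clamped particle update of Section 3.2. One update of a single particle, combining Eqs. (3)-(6), (11) and (12), can be sketched as follows. The linear inertia schedule is an assumption (the text only says that ω is decremented gradually from 0.9 to 0.4), and the function names, the `rng` parameter and the example values are hypothetical; the fitness evaluation with a classifier is deliberately left out.

```python
import random


def velocity_limits(x_min, x_max, lam):
    """Velocity bounds from the position range, Eqs. (5) and (6)."""
    v_max = lam * (x_max - x_min)
    v_min = lam * (x_min - x_max)
    return v_min, v_max


def clamp(v, v_min, v_max):
    """Velocity clamping, equivalent to applying Eqs. (3) and (4) in turn."""
    return max(v_min, min(v, v_max))


def inertia(t, total, w_start=0.9, w_end=0.4):
    """Assumed linear decrease of the inertia weight over the iterations."""
    return w_start - (w_start - w_end) * t / (total - 1)


def vcpso_step(x, v, pbest, gbest, w, v_min, v_max, c1=2.0, c2=2.0, rng=random):
    """One velocity and position update for a single particle."""
    new_x, new_v = [], []
    for xj, vj, pj, gj in zip(x, v, pbest, gbest):
        r1, r2 = rng.random(), rng.random()
        vj_next = w * vj + c1 * r1 * (pj - xj) + c2 * r2 * (gj - xj)  # Eq. (11)
        vj_next = clamp(vj_next, v_min, v_max)
        new_v.append(vj_next)
        new_x.append(xj + vj_next)  # Eq. (12)
    return new_x, new_v


# Example with the velocity bounds of Table 1 (Vmin = -6, Vmax = 6).
x_next, v_next = vcpso_step(
    x=[0.0, 0.0], v=[100.0, -100.0],
    pbest=[0.0, 0.0], gbest=[0.0, 0.0],
    w=inertia(0, 20), v_min=-6.0, v_max=6.0,
)
print(x_next, v_next)
```

In the example both raw velocities exceed the bounds, so they are clamped to +6 and -6 and the particle moves by exactly that amount in each dimension, which is how clamping keeps a particle from overshooting the search space.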
Fig. 2. Neurodegenerative Brain Disorder protein
Fig. 3. Brain Tumor dataset
Fig. 4. Glioblastoma dataset
Fig. 5. Lung cancer dataset
Fig. 6. Leukemia cancer dataset
Fig. 7. Gastric cancer dataset

References
[1] Hsu, H.H., Hsieh, C.W., & Lu, M.D. (2011). Hybrid feature selection by combining filters and wrappers. Expert Systems with Applications, 38, 8144-8150.
[2] Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., ... Caligiuri, M.A. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531–537.
[3] Wu, M.Y., Dai, D.Q., Shi, Y., Yan, H., & Zhang, X.F. (2012). Biomarker identification and cancer classification based on microarray data using Laplace Naive Bayes model with mean shrinkage. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 9, 1649–1662.
[4] Hoque, N., Bhattacharyya, D.K., & Kalita, J.K. (2014). MIFS-ND: a mutual information-based feature selection method. Expert Systems with Applications, 41, 6371–6385.
[5] Ding, C., & Peng, H. (2005). Minimum Redundancy Feature Selection from Microarray Gene Expression Data. Journal of Bioinformatics and Computational Biology, 3, 185-205.
[6] Wald, R., Khoshgoftaar, T.M., & Napolitano, A. (2014). Using Correlation-Based Feature Selection for a diverse collection of Bioinformatics Datasets. In Proceedings of International Conference on Bioinformatics and Bioengineering, Boca Raton, FL, USA.
[7] Tan, F., Fu, X., Zhang, Y., & Bourgeois, A.G. (2008). A genetic algorithm-based method for feature subset selection. Soft Computing, 12, 111–120.
[8] Zhang, M.L., Peña, J.M., & Robles, V. (2009). Feature selection for multi-label naive Bayes classification. Information Sciences, 179, 3218–3229.
[9] Shen, Q., Shi, W.M., & Kong, W. (2008). Hybrid particle swarm optimization and tabu search approach for selecting genes for tumor classification using gene expression data. Computational Biology and Chemistry, 32, 53–60.
[10] Kang, S., & Song, J. (2017). Robust Gene selection methods using weighting schemes for microarray data analysis. BMC Bioinformatics, 18, 1-15.
[11] Zhou, W., Zhou, C., Zhu, H., Liu, G., & Chang, X. (2006). Feature Selection for Microarray Data Analysis Using Mutual Information and Rough Set Theory. In Proceedings of International Conference Intelligent Computing Computational Intelligence and Bioinformatics, Berlin, Heidelberg.
[12] Ramani, R.G., & Jacob, S.G. (2013). Benchmarking classification models for cancer prediction from gene expression data: A novel approach and new findings. Studies in Informatics and Control, 22, 133–142.
[13] Hall, M.A. (1999). Correlation-based feature selection for machine learning (Doctoral dissertation). The University of Waikato, Hamilton, New Zealand.
[14] Eberhart, R., & Kennedy, J. (1995). A new optimizer using particle swarm theory. In Proceedings of International Symposium on Micro Machine and Human Science, Nagoya, Japan.
[15] Shi, Y., & Eberhart, R.C. (1998). In Proceedings of International Conference on Evolutionary Computation, Anchorage, AK, USA, 69–73.
[16] Eberhart, R.C., & Shi, Y. (2000). Comparing inertia weights and constriction factors in particle swarm optimization. In Proceedings of International Conference on Evolutionary Computation, La Jolla, CA, USA.
[17] Zhan, Z.H., Zhang, J., Li, Y., & Chung, H.S.H. (2009). IEEE Transactions on Systems, Man, and Cybernetics, 39(6), 1362–1381.
[18] Zhang, Y. (2015). A Comprehensive Survey on Particle Swarm Optimization Algorithm and Its Applications. Mathematical Problems in Engineering, 931256, 1-38.
[19] Ramani, R.G., & Jacob, S.G. (2013). Improved classification of lung cancer tumors based on structural and physicochemical properties of proteins using data mining models. PLoS ONE, 8(3), e58772.
[20] Hosseinzadeh, F., Ebrahimi, M., Goliaei, B., & Shamabadi, N. (2012). Classification of Lung Cancer Tumors Based on Structural and Physicochemical Properties of Proteins by Bioinformatics Models. PLoS ONE, 7(7), e40017.
[21] Jacob, S.G., & Athilakshmi, R. (2016). Extraction of Protein Sequence features for Prediction of Neuro-degenerative Brain Disorders: Pioneering the CGAP database. In Proceedings of the International Conference on Informatics and Analytics, Pondicherry, India, Aug 2016.
[22] Mramor, M., Leban, G., Demsar, J., & Zupan, B. (2007). Visualization-based Cancer Microarray Data Classification Analysis. Bioinformatics, 23, 2147-2154.
[23] Kononenko, I. (1994). Estimating attributes: Analysis and extensions of RELIEF. In Proceedings of European Conference on Machine Learning, Springer, Berlin, Heidelberg.
[24] Lu, H., Chen, J., Yan, K., Jin, Q., Xue, Y., Yu, X., & Gao, Z. (2017). A hybrid feature selection algorithm for gene expression data classification. Neurocomputing, 256, 56-62.
[25] Yang, C.S., Chuang, L.Y., Ke, C.H., & Yang, C.H. (2008). A Hybrid Feature Selection Method for Microarray Classification. IAENG International Journal of Computer Science, 35, IJCS_35_3_05.
