Fusion Feature Selection: New Insights Into Feature Subset Detection in Biological Data Mining
ABSTRACT: In bioinformatics, the steadily growing number of gene expression samples and the associated dimensions
in DNA microarray research data create a gap that calls for more efficient classification algorithms able to select
optimal features in gene expression data. This study presents a new fusion feature selection algorithm that combines
Correlation Feature Selection (CFS) with the Velocity Clamping Particle Swarm Optimization (VCPSO) algorithm. This
hybrid model capitalizes on both filters and wrappers and selects the optimal feature subsets to classify genes using
different classifiers. The proposed hybrid mechanism was evaluated on two datasets, viz., neurodegenerative brain
disorder protein data and microarray cancer data. Experimental results reveal that the proposed CFS-VCPSO-SVM
selection method eliminates redundant features significantly and lowers the dimension of gene expression data for
classification. The study also shows that the proposed CFS-VCPSO-SVM approach achieved high classification
accuracy in categorizing the different classes of disorders and cancer.
to construct a classification model. Here we first review early research concerning the filters, followed by the wrapper techniques. Hoque et al. (2014) introduced a filter-based feature selection framework to find the ideal subset of features. Their method combines gene-gene mutual information and gene-class mutual information and shows significant results in classifying high-dimensional data. Ding and Peng (2005) proposed a minimum redundancy maximum relevance (MRMR) feature selection framework, which considers maximum relevance towards the class and minimum redundancy among genes in classifying microarray gene data. The MRMR method selects features with improved classification accuracy and better generalization properties. Another work on filters was reported by Wald et al. (2014). They applied the CFS technique to a diverse collection of bioinformatics datasets and built an effective classification model that reduces the problems of class inconsistency, high dimensionality, and information redundancy. Tan et al. (2008) proposed a hybrid framework for feature selection that combines a genetic algorithm (GA) with different existing feature selection algorithms for selecting an optimal number of genes from DNA microarray gene expression data. They concluded that hybrid approaches are more effective in finding optimal genes with better classification accuracy. Zhang et al. (2009) proposed an effective feature selection scheme that employs the mutual information maximization (MIM) method to improve the classification accuracy of multi-label problems. Shen et al. (2008) integrated tabu search and PSO in a hybrid feature selection algorithm to select an optimal number of genes on microarray datasets, but the results were not satisfactory. Kang and Song (2017) introduced a new filter based on robust weighting schemes for microarray gene data selection. They utilized the SAM (Significance Analysis of Microarrays) technique, in which the genes with a statistically significant difference in expression are identified. Different variants of SAM were formed using modified t-tests and named M-SAM1 and M-SAM2. Their results demonstrated that the modified methods identified target genes better than the original method, SAM. Zhou et al. (2006) developed a Mutual Information Rough Set (MIRS) model along with BPSO for classification of cancer genes in microarray gene expression data. They concluded that their model is superior in classifying microarray data when compared to classical feature selection methods. Ramani and Jacob (2013) proposed a gene selection algorithm for microarray cancer datasets in which genes are identified based on a Rank-Weight Feature-selection (RWFS) approach. Here, they assigned weights to attributes selected by different feature selection methods, followed by ranking. They concluded that their methods were effective in predicting gene features of benchmark cancer datasets.

The algorithms above use filters, wrappers, or both for feature selection. For microarray gene expression data classification, several studies have relied on classical filter-wrapper approaches. However, a filter cannot guarantee the learning results of the classifier because it ignores the classifier's particular heuristics and preferences, which may lower the classification accuracy. Wrappers, on the other hand, detect feature dependencies well but become computationally inefficient when the search space grows.

In this study, a novel fusion feature selection strategy is proposed which integrates the strengths of the CFS (Correlation Feature Selection) algorithm with the VCPSO (Velocity Clamping Particle Swarm Optimization) algorithm to select the non-redundant and relevant genes of the gene expression data and categorize the data more accurately. The mechanism capitalizes on the effectiveness of the filters and the learning accuracy of the wrappers. Moreover, combining an efficient filter with a different search strategy is required since it accelerates the learning process of classifiers and stabilizes the classification accuracy. CFS measures the correlation between feature subsets and finds feature subsets that are highly relevant to the class and least correlated with one another, while VCPSO addresses the early convergence issue of PSO by clamping the velocity within a search space. The main contributions of the proposed fusion framework are as follows:

Novelty: The proposed CFS-VCPSO is a computationally efficient method that addresses the early convergence issue of PSO by limiting the velocity within a search space. A velocity clamping method controls the searching ability of PSO in a way that significantly improves the performance of the algorithm.

Effectiveness: The proposed hybrid algorithm capitalizes on the advantages of CFS and VCPSO. The optimal gene set selected by our algorithm improves classification accuracy compared with existing feature selection approaches.

Robustness: To analyze the performance of our proposed fusion approach, four different classifiers were executed on the selected gene expression data with the selected feature sets.

3. FUSION FEATURE SELECTION ALGORITHM CFS-VCPSO SELECTION

3.1. Correlated Feature Selection
In order to decrease the search space of the high-dimensionality datasets, we used the Correlation-based Feature Selection (CFS) algorithm, which filters a feasible feature subset from a given sample space by including only those features that are mutually
uncorrelated but have greater predictive ability toward a class (Hall, 1999). If the features are mutually uncorrelated, redundancy is eliminated, whereas the greater relevance of the features to the class ensures better prognostic ability.

The CFS criterion is defined as follows:

$$\mathrm{CFS} = \max_{HM_S} \left[ \frac{r_{cf_1} + r_{cf_2} + \cdots + r_{cf_n}}{k + 2\,(r_{f_1 f_2} + \cdots + r_{f_i f_j} + \cdots + r_{f_k f_l})} \right] \tag{1}$$

To measure the worth of a feature subset S consisting of n features, the algorithm uses the following merit criterion, the heuristic merit of a subset:

$$HM_S = \frac{n\,\bar{r}_{g,c}}{n + (n-1)\,\bar{r}_{g,g}} \tag{2}$$

where $\bar{r}_{g,c}$ is the average value of gene-class correlation and $\bar{r}_{g,g}$ is the average value of gene-gene correlation. A high value of $HM_S$ indicates stronger correlation between the gene features and the class and lower redundancy among the genes in the subset.

$$V_{ij}^{new}(t+1) = \begin{cases} V_{ij}(t+1), & V_{ij}(t+1) < V_{max,j} \\ V_{max,j}, & \text{otherwise} \end{cases} \tag{3}$$

If the particle's velocity is smaller than $V_{min}$, it is clamped to $V_{min}$ by the following equation.

$$V_{ij}^{new}(t+1) = \begin{cases} V_{ij}(t+1), & V_{ij}(t+1) > V_{min,j} \\ V_{min,j}, & \text{otherwise} \end{cases} \tag{4}$$

The maximum and minimum velocities are initialized by equations (5) and (6).

$$V_{max,j} = \lambda\,(x_{max,j} - x_{min,j}) \tag{5}$$

$$V_{min,j} = \lambda\,(x_{min,j} - x_{max,j}) \tag{6}$$

where $x_{max,j}$ and $x_{min,j}$ are the maximum and minimum positions of particles in the jth dimension, obtained from initial test runs of a particle, and $\lambda$ is a constant factor in [0, 1]. The complete framework of the proposed CFS-VCPSO model is shown in Fig. 1 and explained in detail in Section 3.3.

3.3. CFS-VCPSO-Selection
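As a sketch of the filter stage that feeds the combined procedure, the heuristic merit of Section 3.1 can be computed along the following lines. This is a minimal illustration and not the authors' implementation: it assumes Pearson correlation (in absolute value) as the correlation measure, and it uses the square-root form of the merit given by Hall (1999), $k\,\bar{r}_{g,c} / \sqrt{k + k(k-1)\,\bar{r}_{g,g}}$. The function names `pearson` and `cfs_merit` and the toy data are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # assumption: a constant column counts as uncorrelated
    return cov / (sx * sy)


def cfs_merit(columns, labels, subset):
    """CFS-style merit of `subset` (indices into `columns`), Hall (1999) form."""
    k = len(subset)
    if k == 0:
        return 0.0
    # Average absolute gene-class correlation over the subset.
    r_gc = sum(abs(pearson(columns[i], labels)) for i in subset) / k
    # Average absolute gene-gene correlation over all pairs in the subset.
    pairs = list(combinations(subset, 2))
    if pairs:
        r_gg = sum(abs(pearson(columns[i], columns[j])) for i, j in pairs) / len(pairs)
    else:
        r_gg = 0.0
    return k * r_gc / math.sqrt(k + k * (k - 1) * r_gg)


# Toy data (hypothetical): one list of expression values per gene.
labels = [0, 0, 1, 1]
columns = [
    [0.0, 1.0, 2.0, 3.0],    # gene 0: relevant to the class
    [0.0, 2.0, 4.0, 6.0],    # gene 1: exact rescaling of gene 0 (redundant)
    [1.0, -1.0, 1.0, -1.0],  # gene 2: uncorrelated with the class
]

merit_single = cfs_merit(columns, labels, [0])
merit_redundant = cfs_merit(columns, labels, [0, 1])
merit_irrelevant = cfs_merit(columns, labels, [0, 2])
```

On this toy data, adding the redundant gene leaves the merit unchanged and adding the irrelevant gene lowers it, which is the behaviour the merit criterion is meant to reward in a subset search.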
each particle is associated with two vectors, i.e., the velocity vector ($V^t$) and the position vector ($X^t$), defined as

$$V_m^t = (V_{m1}^t, V_{m2}^t, \ldots, V_{mn}^t) \tag{9}$$

$$X_m^t = (X_{m1}^t, X_{m2}^t, \ldots, X_{mn}^t) \tag{10}$$

The position of the particle with the present fitness value is denoted by ‘pbest’ and the position of the particle with the best fitness value of the swarm is denoted by ‘gbest’. The following formula is used to compute the velocity.

$$V^{t+1} = w^t V^t + r_1 c_1 (pbest - x) + r_2 c_2 (gbest - x) \tag{11}$$

position ‘x’ of the particle constitutes the first component whereas the difference of best fitness value ‘gbest’ and the current position ‘x’ of the particle constitutes the second component along with learning constants c1, c2 and random numbers r1, r2. Using this computed velocity, the forthcoming position of the particle is updated using the following equation.

$$X^{t+1} = X^t + V^{t+1} \tag{12}$$

The movements of each particle are guided by its own best-known position (pbest) within the search space in addition to the swarm's best-known position (gbest). In each iteration, every particle updates its current position and velocity based on its pbest and gbest values. Once improved positions are discovered, they in turn guide the movements of the swarm. This amalgam of correlation calculation and swarm optimization is proposed as the fusion methodology. The key aspect of the model is that it integrates the fast and efficient dimensionality reduction of the multivariate filter (CFS) with the simple yet intelligent Particle Swarm Optimization approach. The feature subsets returned by VCPSO were used to train the SVM classifier model. The performance of the model was evaluated by 10-fold cross-validation by giving test data as input. Finally, feature subsets with high classification accuracy and high AUC value were taken as the optimal genes returned by the model.

[Fig. 1. Framework of the proposed CFS-VCPSO model. Flowchart steps: Dataset; Pre-process Dataset (Standardization); CFS Filter for Gene subset selection; VCPSO based Gene Subset Optimization]
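The velocity update of Eq. (11), the clamping rule of Eqs. (3) to (6), and the position update of Eq. (12) can be sketched for a single particle as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function names, the default values of the inertia weight `w`, the learning constants `c1` and `c2`, and the factor `lam` are hypothetical, and positions are treated as plain real-valued vectors because the text does not specify how particles encode gene subsets.

```python
import random


def velocity_bounds(x_min, x_max, lam=0.5):
    """Eqs. (5)-(6): per-dimension velocity limits from the position range."""
    v_max = [lam * (hi - lo) for lo, hi in zip(x_min, x_max)]
    v_min = [lam * (lo - hi) for lo, hi in zip(x_min, x_max)]
    return v_min, v_max


def clamp(v, v_min, v_max):
    """Eqs. (3)-(4): keep every velocity component inside [v_min_j, v_max_j]."""
    return [min(max(vj, lo), hi) for vj, lo, hi in zip(v, v_min, v_max)]


def vcpso_step(x, v, pbest, gbest, v_min, v_max,
               w=0.7, c1=2.0, c2=2.0, rng=random):
    """One particle update: Eq. (11), then clamping, then Eq. (12)."""
    new_v = []
    for xj, vj, pj, gj in zip(x, v, pbest, gbest):
        r1, r2 = rng.random(), rng.random()  # fresh random numbers per dimension
        new_v.append(w * vj + r1 * c1 * (pj - xj) + r2 * c2 * (gj - xj))
    new_v = clamp(new_v, v_min, v_max)
    new_x = [xj + vj for xj, vj in zip(x, new_v)]
    return new_x, new_v


# Usage with hypothetical values: large initial velocities are cut back to the limits.
rng = random.Random(0)
v_min, v_max = velocity_bounds([0.0, 0.0], [1.0, 1.0], lam=0.5)
x, v = vcpso_step([0.2, 0.9], [5.0, -5.0], [0.5, 0.5], [0.8, 0.1],
                  v_min, v_max, rng=rng)
```

Clamping is applied after the velocity of Eq. (11) is computed and before the position is moved, so a particle can never travel further in one iteration than the fraction `lam` of the position range in each dimension.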
Algorithm 1: CFS-VCPSO feature selection:
Fig. 2. Neurodegenerative Brain Disorder protein
Fig. 5. Lung cancer dataset
References
[1] Hsu, H.H., Hsieh, C.W., & Lu, M.D. (2011). Hybrid feature selection by combining filters and wrappers. Expert Systems with Applications, 38, 8144–8150.
[2] Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., ... Caligiuri, M.A. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531–537.
[3] Wu, M.Y., Dai, D.Q., Shi, Y., Yan, H., Zhang, X.F. (2012). Biomarker identification and cancer classification based on microarray data using Laplace Naive Bayes model with mean shrinkage. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 9, 1649–1662.
[4] Hoque, N., Bhattacharyya, D.K., Kalita, J.K. (2014). MIFS-ND: a mutual information-based feature selection method. Expert Systems with Applications, 41, 6371–6385.
[5] Ding, C., Peng, H. (2005). Minimum Redundancy Feature Selection from Microarray Gene Expression Data. Journal of Bioinformatics and Computational Biology, 3, 185–205.
[6] Wald, R., Khoshgoftaar, T.M., Napolitano, A. (2014). Using Correlation-Based Feature Selection for a diverse collection of Bioinformatics Datasets. In Proceedings of International Conference on Bioinformatics and Bioengineering, Boca Raton, FL, USA.
[7] Tan, F., Fu, X., Zhang, Y., Bourgeois, A.G. (2008). A genetic algorithm-based method for feature subset selection. Soft Computing, 12, 111–120.
[8] Zhang, M.L., Peña, J.M., Robles, V. (2009). Feature selection for multi-label naive Bayes classification. Information Sciences, 179, 3218–3229.
[9] Shen, Q., Shi, W.M., Kong, W. (2008). Hybrid particle swarm optimization and tabu search approach for selecting genes for tumor classification using gene expression data. Computational Biology and Chemistry, 32, 53–60.
[10] Kang, S., Song, J. (2017). Robust Gene selection methods using weighting schemes for microarray data analysis. BMC Bioinformatics, 18, 1–15.
[11] Zhou, W., Zhou, C., Zhu, H., Liu, G., Chang, X. (2006). Feature Selection for Microarray Data Analysis Using Mutual Information and Rough Set Theory. In Proceedings of International Conference Intelligent Computing Computational Intelligence and Bioinformatics, Berlin, Heidelberg.
[12] Ramani, R.G., Jacob, S.G. (2013). Benchmarking classification models for cancer prediction from gene expression data: A novel approach and new findings. Studies in Informatics and Control, 22, 133–142.
[13] Hall, M.A. (1999). Correlation-based feature selection for machine learning (Doctoral dissertation). The University of Waikato, Hamilton, New Zealand.
[14] Eberhart, R., Kennedy, J. (1995). A new optimizer using particle swarm theory. In Proceedings of International Symposium on Micro Machine and Human Science, Nagoya, Japan.
[15] Shi, Y., Eberhart, R.C. (1998). In Proceedings of International Conference on Evolutionary Computation, Anchorage, AK, USA, 69–73.
[16] Eberhart, R.C., Shi, Y. (2000). Comparing inertia weights and constriction factors in particle swarm optimization. In Proceedings of International Conference on Evolutionary Computation, La Jolla, CA, USA.
[17] Zhan, Z.H., Zhang, J., Li, Y., Chung, H.S.H. (2009). IEEE Transactions on Systems, Man, and Cybernetics, 39(6), 1362–1381.
[18] Zhang, Y. (2015). A Comprehensive Survey on Particle Swarm Optimization Algorithm and Its Applications. Mathematical Problems in Engineering, 931256, 1–38.
[19] Ramani, R.G., Jacob, S.G. (2013). Improved classification of lung cancer tumors based on structural and physicochemical properties of proteins using data mining models. PLoS ONE, 8(3), e58772.
[20] Hosseinzadeh, F., Ebrahimi, M., Goliaei, B., Shamabadi, N. (2012). Classification of Lung Cancer Tumors Based on Structural and Physicochemical Properties of Proteins by Bioinformatics Models. PLoS ONE, 7(7), e40017.
[21] Jacob, S.G., Athilakshmi, R. (2016). Extraction of Protein Sequence features for Prediction of Neuro-degenerative Brain Disorders: Pioneering the CGAP database. In Proceedings of the International Conference on Informatics and Analytics, Pondicherry, India, Aug 2016.
[22] Mramor, M., Leban, G., Demsar, J., Zupan, B. (2007). Visualization-based Cancer Microarray Data Classification Analysis. Bioinformatics, 23, 2147–2154.
[23] Kononenko, I. (1994). Estimating attributes: Analysis and extensions of RELIEF. In Proceedings of European Conference on Machine Learning, Springer, Berlin, Heidelberg.
[24] Lu, H., Chen, J., Yan, K., Jin, Q., Xue, Y., Yu, X., Gao, Z. (2017). A hybrid feature selection algorithm for gene expression data classification. Neurocomputing, 256, 56–62.
[25] Yang, C.S., Chuang, L.Y., Ke, C.H., Yang, C.H. (2008). A Hybrid Feature Selection Method for Microarray Classification. IAENG International Journal of Computing, 35, IJCS_35_3_05.