- Research
- Open access
Optimized hybrid investigative based dimensionality reduction methods for malaria vector using KNN classifier
Journal of Big Data volume 8, Article number: 29 (2021)
Abstract
RNA-Seq data are used in biological applications and in decision making for the classification of genes. Much recent work has focused on reducing the dimension of RNA-Seq data, and several dimensionality reduction approaches have been proposed for transforming these data. In this study, a novel optimized hybrid investigative approach is proposed. It combines an optimized genetic algorithm with Principal Component Analysis and Independent Component Analysis (GA-O-PCA and GA-O-ICA), which are used to identify an optimum subset and latent correlated features, respectively. A KNN classifier is applied to the reduced mosquito Anopheles gambiae dataset to enhance accuracy and scalability in gene expression analysis. The proposed algorithm fetches relevant features from the high-dimensional input feature space, using a fast feature-ranking algorithm to select them. The performance of the model is evaluated and validated using classification accuracy and compared with existing approaches in the literature. The experimental results are promising for selecting relevant genes and classifying gene expression data, indicating that the approach can complement prevailing machine learning methods.
Introduction
A significant problem in gene expression analysis is the selection of genes from high-throughput biological data [1]. Gene expression data are known for having small samples with a large number of irrelevant, redundant and noisy genes [2]. These genes degrade the performance of classification models [1]. Dimensionality reduction techniques have been widely used to fetch relevant discriminative subsets from gene expression data; they also reduce the computational burden and improve classification accuracy [3].
In gene expression data analysis, overfitting and the curse of dimensionality are known to degrade classification capability [3]; the latter arises from the high-dimensional input space. To overcome the curse of dimensionality, several dimensionality reduction techniques have been exploited in the literature [4]. Determining the optimal subset of genes, which helps reveal hidden features of genes and enhances their interpretability, is crucial [2]. The aim of dimensionality reduction is to discover a minimal subset of genes that improves prediction performance, which will be helpful to clinicians in decision making and treatment [4].
Several authors have addressed the curse of dimensionality. Metaheuristics have also been proposed, yet these approaches suffer from correlated features, high throughput, and increased computational time for fetching gene subsets [5, 6]. A systematic approach to fetching an optimal gene subset therefore remains a crucial issue.
Feature selection (filter, wrapper and embedded) [7,8,9] and feature extraction (supervised and unsupervised) [10,11,12] are established dimensionality reduction approaches. They have addressed several problems, such as performance enhancement, yet hybrid models and optimization are still needed to obtain better results [13]. What is required is a way of finding an optimal subset of genes that handles high-dimensional optimization difficulties with reasonable solutions [5].
The genetic algorithm (GA) is a wrapper-based feature selection technique built on an optimization procedure. GA is an adaptive heuristic search approach that finds an optimal subset of features in complex problems such as high-dimensional data [14]. Two problems associated with high-dimensional data are overfitting (a model with too many parameters) and the curse of dimensionality (an increase in basic error) [15]. A discriminative gene subset is essential for classification: it avoids overfitting and allows the model to fit (correlate with) training sets that cannot be replicated. This helps in achieving predictive accuracy [13].
GA is proficient at finding optimal subsets in high-dimensional data and has been used extensively, yet it is computationally expensive and prone to overfitting [14]. To overcome this limitation, optimization strategies have been used to improve both the search for optimum feature subsets and classification accuracy [14].
Principal component analysis (PCA) and independent component analysis (ICA) are linear feature extraction methods that have been extensively used [10, 15]. They are standard methods for fetching subsets of gene samples for classification and have received growing attention in recent times [16]. Hybrid approaches have proven significant because of their excellent performance in solving the dimensionality problems that hinder classification. It is therefore essential to develop efficient models that are computationally fast and easy to implement for the classification of gene expression data [17].
Several experiments have been carried out in the literature [3, 4, 7,8,9, 18,19,20,21,22]. However, these approaches need enhancement to support decisions on how to eradicate the transmission of malaria in West Africa, as malaria is a scourge in Africa [23].
This study proposes a hybrid dimensionality reduction model for the classification of malaria vector data. An optimized genetic algorithm (GA-O) is used to fetch a subset of relevant genes. PCA and ICA are then applied to this subset to fetch latent components in the data. The outputs of GA-O with PCA and of GA-O with ICA are classified using KNN on a mosquito Anopheles gambiae dataset. The study aims to reduce classification complexities such as computational cost, and to fetch relevant gene subsets and relationships among genes that clinicians can use for decision making.
The rest of this paper presents the literature review, the materials and methods, the results and discussion, and the conclusions.
Literature review
Many dimensionality reduction and classification approaches have been explored in the literature; they are assessed on several measures, such as computational complexity and accuracy, among others [24]. A few of these works are reviewed here; they point to emerging research directions such as optimization and hybridization [25].
A dimensionality reduction and class prediction approach was proposed for gene expression data. It introduced an innovative procedure that uses feature extraction to capture correlation in the reduced data and feature selection to eliminate redundancy. The approach was tested and compared with the state of the art [26].
A distributed feature selection approach was proposed for classifying gene expression data. It tries to detect possibly infected genes in a distributed way so that samples can be classified effectively. The large dataset was subdivided and its features distributed among processors, and a filter-based approach with a fuzzy inference system was applied to each subset. The resulting features were ranked and showed better performance [27].
An approach for optimizing feature selection and classification was proposed for gene expression data. It uses a hybrid model combining mutual information with a genetic algorithm on high-dimensional data; the reduced data are passed to partial least squares using a t-score mechanism. The reduced data are then classified, attaining an improved classification accuracy of about 93%, yet the approach calls for enhancement in terms of maximizing accuracy and further minimizing the number of gene features [25].
A hybridized neuro-fuzzy and feature reduction approach was proposed for classification, to deal with uncertainty issues. The results showed a considerable improvement in accuracy and in the elimination of redundant information, and the authors proposed applying the approach to real-life gene expression classification problems [28].
Computational approaches for integrating single-cell data analysis were described, with a focus on the joint analysis of multimodal signals from individual cells. The authors anticipate that, in the years to come, integrative multimodal single-cell analysis will become a significant method and will be used extensively for characterizing new mechanisms [29].
A machine learning technique for malaria outbreak detection was proposed, comparing several techniques such as decision tree, Gaussian and logistic regression methods. The problem was framed as binary classification (outbreak or no outbreak) on test data samples obtained from Maharashtra, India, and the results of comparable experiments were contrasted with the performance of the models. The models detected malaria outbreaks in the testing dataset without any false positive or false negative errors [30].
Materials and methods
Datasets
The RNA-Seq gene expression data used here come from mosquito Anopheles gambiae (Ag.) larvae from the western region of Kenya. The dataset comprises profiles of deltamethrin-susceptible and deltamethrin-resistant mosquitoes with substantial resistance mechanisms; it has 7 attributes (Tests, Genes, Gene identities, Locus, Susceptible, Resistant, Status) and a predictor, over 2457 instances [31] (Table 1).
Methodology
RNA-Seq gene expression profiling is a widely used technology for the diagnosis of several diseases, such as cancer and malaria, among others. It captures several aspects of the transcriptome and is a principal existing technology for high-throughput genetic studies. It provides enhanced insight into the transcriptome of cells, alternative therapies and improved diagnoses [32], and it identifies early, hidden variations occurring in disease conditions in response to different environments and therapeutics, producing large quantities of sequencing data [7, 33]. Classification of RNA-Seq gene expression data has provided valuable evidence for classifying and assessing relevant medications for diseases [34]. RNA-Seq is the predominant method for quantifying gene expression and gaining a better understanding of various biological tissues. Diagnosis remains a significant challenge for RNA-Seq, however: owing to the high dimensionality of gene expression data, models can give ill-fitting results.
In this study, the dataset is from the mosquito Anopheles gambiae. The gene samples are normalized using MATLAB 2015 and passed into the optimized genetic algorithm. The resulting reduced sample is then passed into PCA and ICA separately. The further-reduced data are split into training and testing sets, and classification is conducted using KNN.
Dimension reduction
Dimensionality reduction is a recognized technique for eliminating unwanted noise and unnecessary features. Gene expression data comprise high-dimensional features that add computational burden and degrade the performance of classification models. Dimensionality reduction procedures are essential for eliminating redundant and irrelevant features that impair efficiency, by reducing the ratio of features to samples; they also help reduce the risk of overfitting. Dimensionality reduction is carried out through two kinds of methods: feature selection and feature extraction [35, 36].
Feature selection
For technologies such as RNA-Seq transcriptomics, constructing relevant feature identifiers for transcript sequences is essential for training and testing models. Feature selection chooses suitable features for the classification model by removing irrelevant and redundant ones, which mitigates the curse of dimensionality. It makes the learning phase of classification more effective and improves model performance. Feature selection on large datasets such as RNA-Seq data involves both supervised and unsupervised learning. For classification problems, ranking features by significance is essential, and selecting the best-ranked features improves the prediction model's performance. Feature selection techniques are grouped into filter, wrapper and embedded types [37].
Genetic algorithm (GA)
The genetic algorithm is a wrapper-based evolutionary algorithm for selecting relevant features, used as a search engine for optimization problems. GA is based on the principle of survival of the fittest and mimics processes linked to natural genetics. It consists of initial population generation, fitness assessment, parent selection, crossover and mutation [38, 39].
GA is a heuristic search method. In its simple form, it starts from a sample of randomly generated candidate solutions (phenotypes or individuals), each assigned a fitness value by the objective function. The corresponding chromosomes or genotypes typically comprise sets of properties encoded as binary strings of 0s and 1s [40]. GA has weaknesses: it is very sensitive to the initial population, it does not guarantee optimality, and the quality of its results declines as the problem dimension rises. Nevertheless, it has been shown to produce solutions of reasonable quality, which motivates optimizing it for gene selection.
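The GA stages listed above (initial population, fitness assessment, parent selection, crossover, mutation) can be sketched in plain Python. This is a minimal illustration only, not the optimized GA used in this study: the tournament selection, elitism, rates, population size and the toy fitness function with its `RELEVANT` set are all assumptions made for the example.

```python
import random


def run_ga(n_features, fitness, pop_size=20, generations=40,
           crossover_rate=0.8, mutation_rate=0.02, seed=0):
    """Evolve binary chromosomes; bit i == 1 means feature i is selected."""
    rng = random.Random(seed)
    # Initial population: random binary strings of 0s and 1s.
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]

    def pick_parent():
        # Tournament selection (assumed): the fitter of two random chromosomes.
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    best = max(population, key=fitness)
    for _ in range(generations):
        children = [best[:]]  # elitism: carry the best chromosome forward
        while len(children) < pop_size:
            p1, p2 = pick_parent(), pick_parent()
            if rng.random() < crossover_rate:
                # Single-point crossover.
                cut = rng.randint(1, n_features - 1)
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Bit-flip mutation.
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        population = children
        best = max(population, key=fitness)
    return best


# Hypothetical toy problem: features 0, 3 and 7 are the "relevant" ones.
RELEVANT = {0, 3, 7}


def toy_fitness(chromosome):
    # Reward selected relevant features, penalise the size of the subset.
    hits = sum(chromosome[i] for i in RELEVANT)
    return hits - 0.1 * sum(chromosome)


best = run_ga(n_features=12, fitness=toy_fitness)
selected = [i for i, bit in enumerate(best) if bit]
print("selected feature indices:", selected)
```

In a wrapper setting, the toy fitness would be replaced by the accuracy of a classifier trained on the selected columns, which is what makes GA-based selection computationally expensive.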
Feature extraction
Feature extraction is a technique for identifying the essential features or characteristics present in data. Examples of its use are pattern recognition and the detection of common instances in a set of observations. High-dimensional data call for feature extraction to produce a more compact and precise representation. Feature extraction transforms the selected feature variables into new ones, reducing the effect of the curse of dimensionality. There are two broad families of feature extraction procedures: linear methods, which assume the data lie on a low-dimensional linear subspace (PCA and ICA are examples), and non-linear methods, which assume a low-dimensional structure embedded in the high-dimensional feature space and capture non-linear relations between features [18].
Principal component analysis (PCA)
PCA is a linear feature extraction method commonly used in genetic studies. By constructing k-dimensional features from the original n-dimensional feature space (k < n), PCA projects the feature space from a high to a lower dimension. PCA is acknowledged as an essential method for exploring high-dimensional gene expression data and is widely used for RNA-Seq data. It applies an orthogonal transformation that converts a set of correlated variables into a set of uncorrelated variables. PCA may be used to analyze the relationships among a set of variables and to reduce dimensionality [41].
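The projection PCA performs can be illustrated with a small standard-library sketch: centre the data, form the covariance matrix, find its leading eigenvector by power iteration, and project onto it. The toy data and the helper names are assumptions for illustration; the study itself extracted 10 components in MATLAB, whereas this sketch extracts only the first.

```python
import math


def center(rows):
    """Subtract each column's mean from that column."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(rows):
    """Sample covariance matrix of already-centred rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_component(cov, iterations=200):
    """Leading eigenvector of the covariance matrix by power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def project(rows, component):
    """Score of each centred row along one principal component."""
    return [sum(a * b for a, b in zip(r, component)) for r in rows]


# Toy data (hypothetical): the second variable is roughly twice the first.
data = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.1]]
centred = center(data)
pc1 = top_component(covariance(centred))
scores = project(centred, pc1)
print("first principal direction:", pc1)
print("one-dimensional scores:", scores)
```

Because the two toy variables are strongly correlated, a single component captures almost all of the variance, which is the sense in which PCA replaces correlated variables with fewer uncorrelated ones.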
Independent component analysis (ICA)
ICA decomposes multivariate signals into statistically independent, non-Gaussian components, and so supports finding hidden features in multidimensional data. Beyond decorrelating the data, ICA seeks independence by reducing higher-order statistical dependencies. ICA models the observation X as a linear combination of the independent components I. If B denotes the inverse of the weight (unmixing) matrix R, the columns of B are the basis feature vectors of the observation X.
ICA has been used extensively for biological data, recognition tasks and other purposes [42].
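In the standard ICA formulation, which the description above appears to follow, the relationship between the observation and the components can be written compactly. The symbols X, I, B and R are those used in the text; reading R as the unmixing (weight) matrix and B as its inverse is an assumption based on the usual ICA model, not something the text states explicitly.

```latex
X = B\,I, \qquad \hat{I} = R\,X, \qquad B = R^{-1}
```

Here the columns of B act as basis vectors for X, and ICA estimates R so that the entries of the recovered components are as statistically independent as possible.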
PCA is a linear transformation technique used to reduce the dimension and number of features. ICA is also linear, but it seeks statistically independent rather than merely uncorrelated components; when the data are preprocessed, ICA has been shown to perform better [18].
Classification
In data mining, classification is a supervised learning method. It is a common task that predicts class labels for new data from a predefined set of class labels. Building a classifier comprises two steps [43]:
- The learning step, in which the classification model is built from a collection of training data with known class labels.
- The prediction step, in which the model predicts the class labels of unseen data and the accuracy of the KNN classifier is calculated.
K-Nearest Neighbor (KNN)
The K-nearest neighbour (KNN) algorithm is a supervised learning technique that classifies new instances of gene data according to the classes of their neighbours. It classifies new objects based on the attributes of training examples. KNN classifiers do not fit a model during training; they are memory-based. The selected features serve as inputs. Given a query point, the distances between the query instance and the training examples are computed and sorted, and the K training examples nearest to the query are selected. The classes Y of these nearest neighbours are collected, and the simple majority class among them is used as the prediction for the query instance. Ties can be broken at random [44].
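The procedure described above (compute distances, sort, take the K nearest, vote) can be sketched as follows. Euclidean distance, K = 3 and the two-feature toy samples are assumptions for illustration; the tie-breaking here is deterministic rather than random, as noted in the comments.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_x, train_y, query, k=3):
    """Label a query by majority vote among its k nearest training points."""
    # Distance from the query to every stored training example.
    distances = [(euclidean(x, query), label)
                 for x, label in zip(train_x, train_y)]
    # Sort by distance and keep the k closest neighbours.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    # Counter.most_common keeps first-seen order on equal counts, so a tie
    # goes to the label of the nearer neighbour (the text breaks ties randomly).
    return votes.most_common(1)[0][0]


# Hypothetical reduced features for six mosquitoes.
train_x = [[1.0, 1.2], [0.8, 1.0], [1.1, 0.9],
           [3.0, 3.2], [3.1, 2.9], [2.8, 3.0]]
train_y = ["susceptible"] * 3 + ["resistant"] * 3

print(knn_predict(train_x, train_y, [2.9, 3.1]))
print(knn_predict(train_x, train_y, [1.0, 1.0]))
```

There is no training step beyond storing the examples, which is why KNN is described as memory-based; all the cost is paid at prediction time.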
The increasing dimensionality of biological data is a significant problem for simple, conventional analysis methods, so it is important to complement traditional approaches with more complex, multi-stage learning strategies. Most typical procedures for handling high-dimensional data, such as RNA-Seq data, involve several complexities. Combining different dimensionality reduction methods can exploit the particular advantages of each, with the gene subset obtained from one procedure serving as the input to the other. In general, feature extraction techniques complement feature selection: feature selection picks the initial subset of genes and eliminates redundant ones, and feature extraction then derives the final features from that subset. Combining various feature extraction methods can also be useful [31, 39, 45, 46].
This study proposes an effective dimensionality reduction method for classifying malaria vector data.
RNA-Seq has tremendous potential for finding, defining and tracing cell lines, and dimensionality reduction helps to reveal structure in the data. Still, the data remain challenging, and current algorithms need further development to reveal suitable characteristics. A fusion (hybrid) approach proves to be robust but requires effective modelling procedures.
The proposed classification technique consists of three phases, namely:
- Feature selection.
- Feature extraction.
- Classification.
Figure 1 illustrates the proposed hybrid system for classifying the malaria gene expression dataset. The framework consists of three subsystems: a feature selection subsystem, a feature extraction subsystem and a classification subsystem.
The feature selection subsystem uses an optimized GA, following Algorithm 1 to pick an optimum subset by assessing chromosome fitness. The feature extraction subsystem uses PCA and ICA because of their efficient data projection and their invariance to irrelevant orderings. The outputs are then classified using KNN.
The significance of genetic algorithm optimization lies in the evolutionary nature of the algorithm: it maintains numerous search points that simultaneously and independently explore the search space to produce a good result. In this study, genetic algorithm feature selection is optimized to minimize the number of features while retaining discriminant ones. Feature extraction then transforms the reduced data into latent components. The aim is to benefit from both dimensionality reduction methods in the classification of malaria vector data.
Algorithm 1:
Experiments in this study were performed on an Intel Core 5 machine with 16 GB of RAM and a 64-bit operating system. All algorithms were coded in C++ in the MATLAB 2015 environment.
Confusion matrices were used to evaluate classification, comparing training and testing performance in terms of accuracy, sensitivity and other metrics [31] (Fig. 2).
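For a two-class problem, the metrics mentioned above follow directly from the four cells of the confusion matrix. The sketch below uses the standard definitions; the label names and the ten example predictions are hypothetical, and the code assumes both classes occur in the true and predicted labels so that no denominator is zero.

```python
def confusion_counts(y_true, y_pred, positive):
    """Return (tp, fp, tn, fn) for one class treated as positive."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def metrics(tp, fp, tn, fn):
    """Standard metrics derived from a 2x2 confusion matrix."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),   # true positive rate (recall)
        "specificity": tn / (tn + fp),   # true negative rate
        "precision": tp / (tp + fp),
    }


# Hypothetical predictions for ten samples.
y_true = ["resistant"] * 4 + ["susceptible"] * 6
y_pred = ["resistant", "resistant", "resistant", "susceptible",
          "susceptible", "susceptible", "susceptible", "susceptible",
          "resistant", "susceptible"]

counts = confusion_counts(y_true, y_pred, positive="resistant")
scores = metrics(*counts)
print(counts, scores)
```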
Results and discussion
This study classifies a malaria vector dataset using a public dataset with 2457 samples and 7 features [31] in MATLAB. An optimized genetic algorithm was applied to the dataset to pick pertinent features; using a threshold of 0.5, 708 significant features were selected. Classifier performance was compared with the state of the art for evaluation.
The 708 features selected by the optimized genetic algorithm were first passed into the PCA algorithm, which extracted 10 latent variables in 1.4623 seconds. The extracted features were classified using the KNN algorithm with 10-fold cross-validation, a technique for evaluating predictive models that partitions the sample into training and testing sets. The KNN confusion matrix was then evaluated using performance metrics.
The 708 selected features were also passed into the ICA algorithm, which extracted 25 latent variables in 0.42794 seconds. The latent features were classified by KNN with 10-fold cross-validation, and the confusion matrix was evaluated.
The GAO + PCA + KNN and GAO + ICA + KNN algorithms were run on the malaria vector dataset, and the performance evaluations of the experiments are tabulated below.
This study offers several significant suggestions for analyzing gene expression data. A potential application is to provide insight into genetic and technical considerations that can clarify the revealed structures and identify genes suitable for the prediction, analysis and detection of malaria infection and transmission, and for drug design (Fig. 3).
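The partitioning behind 10-fold cross-validation can be sketched as below, using the 2457 instances reported for the dataset. This is an assumption-laden illustration: the folds here are contiguous index blocks, whereas practical implementations usually shuffle (and often stratify) the samples first, and the study does not say which variant MATLAB applied.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train, test) index lists for k contiguous folds."""
    # Spread the remainder so that fold sizes differ by at most one.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0)
             for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# 2457 instances split into 10 folds: each sample is tested exactly once.
folds = list(k_fold_indices(2457, k=10))
for number, (train, test) in enumerate(folds, start=1):
    print(f"fold {number}: {len(train)} training, {len(test)} testing")
```

Each fold serves once as the test set while the remaining nine are used for training, and the reported metric is typically the average over the ten runs.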
The GAO with PCA with K-NN results
The GAO with ICA with K-NN results
As shown in Table 2, this study attained reliable performance compared with existing algorithms.
This study proposed a hybrid dimensionality reduction approach using an optimized genetic algorithm. PCA and ICA were applied to the selected features, and KNN with 10-fold cross-validation was used for classification. The results showed an improvement, as revealed in Table 2: compared with the state of the art, accuracy improved (Fig. 4).
Numerous investigators have studied the underlying classification problems using machine learning methods, in order to provide dependable detection and prediction of malaria infection and transmission. The results achieved here could be used by clinicians studying prevalent malaria infections: the procedure can help compile curated diagnostic datasets for training classifiers and motivate approaches that significantly increase dataset size, addressing the overfitting difficulties associated with training. Profiling thousands of genes offers deep insight into malaria classification problems, with ample data for drug discovery, prediction and diagnosis of malaria and its treatment, as well as for understanding the roles of genes and the interactions between them under normal and abnormal conditions. This study improved classification performance and demonstrated less dependence on the training set (Table 3).
Conclusions
Analysis of RNA-Seq data offers valuable benefits and contributes greatly to solving the problems of gene expression profiling. Applications of RNA-Seq rely on dimensionality reduction and classification approaches, because the curse of dimensionality inherent in gene expression data is a critical problem. Several strategies have been proposed to advance the technology and to predict and detect diseases from samples, and dimensionality reduction has proved able to overcome these challenges. Yet further inquiry is needed. Recently, hybrid methods have also been used to classify gene expression data. In this study, dimensionality reduction was performed using GA with ICA and GA with PCA separately, and their performance was evaluated with KNN classification; GA + ICA + KNN outperformed the GA + PCA + KNN based method.
The purpose of this study is to present a method that reduces the number of variables while keeping the informative ones for enhanced classification, which clinicians can use for decision making. The study proposed an enhanced, staged dimensionality reduction and classification approach using malaria gene expression data, in which the relevant features are retrieved to obtain better performance.
Future work will apply hybrid dimensionality reduction procedures with other classifiers, such as deep learning, to identify relevant classes in gene expression data.
Availability of data and materials
The datasets for this study are available on request to the corresponding author.
Abbreviations
- RNA-Seq: Ribonucleic acid sequencing
- GA: Genetic algorithm
- GAO: Optimized genetic algorithm
- PCA: Principal component analysis
- ICA: Independent component analysis
- KNN: K-nearest neighbor
- NN: Neural network
- DNA: Deoxyribonucleic acid
- MATLAB: Matrix laboratory
- ID: Identity
- CCA: Canonical component analysis
- GLM: Generalized linear model
References
Al Haggar M. Bioinformatics in high throughput sequencing: application in evolving genetic diseases. J Data Mining Genomics Proteom. 2013. https://doi.org/10.4172/2153-0602.1000131.
Liu S, Xu C, Zhang Y, Liu J, Yu B, Liu X, Dehmer M. Feature selection of gene expression data for cancer classification using double RBF-kernels. BMC Bioinform. 2018;19:1. https://doi.org/10.1186/s12859-018-2400-2.
Pashaei E, Pashaei E, Aydin N. Gene selection using hybrid binary black hole algorithm and modified binary particle swarm optimization. Genomics. 2019;111(4):669–86.
Shukla AK, Singh P, Vardhan M. A new hybrid wrapper TLBO and SA with SVM approach for gene expression data. Inf Sci. 2019;503:238–54.
Cai J, Luo J, Wang S, Yang S. Feature selection in machine learning: a new perspective. Neurocomputing. 2018;300:70–9. https://doi.org/10.1016/j.neucom.2017.11.077.
Mafarja M, Mirjalili S. Whale optimization approaches for wrapper feature selection. Appl Soft Comput. 2018;62:441–53. https://doi.org/10.1016/j.asoc.2017.11.006.
Tadist K, Najah S, Nikolov NS, Mrabti F, Zahi A. Feature selection methods and genomic big data: a systematic review. J Big Data. 2019;6:1. https://doi.org/10.1186/s40537-019-0241-0.
Liu Y, Ju S, Wang J, Su C. A new feature selection method for text classification based on independent feature space search. Math Prob Eng. 2020. https://doi.org/10.1155/2020/6076272.
Chen CW, Tsai YH, Chang FR, Lin WC. Ensemble feature selection in medical datasets: Combining filter, wrapper, and embedded feature selection results. Exp Syst. 2020;37:5. https://doi.org/10.1111/exsy.12553.
Aziz R, Verma CK, Srivastava N. Dimension reduction methods for microarray data: a review. AIMS Bioeng. 2017;4(1):179–97.
Wenric S, Shemirani R. Using supervised learning methods for gene selection in RNA-Seq case-control studies. Front Genet. 2018. https://doi.org/10.3389/fgene.2018.00297.
Bajaj V, Taran S, Khare SK, Sengur A. Feature extraction method for classification of alertness and drowsiness states EEG signals. Appl Acoustics. 2020;163:107224. https://doi.org/10.1016/j.apacoust.2020.107224.
Li M, Wang H, Yang L, Liang Y, Shang Z, Wan H. Fast hybrid dimensionality reduction method for classification based on feature selection and grouped feature extraction. Expert Syst Appl. 2020;150:113277. https://doi.org/10.1016/j.eswa.2020.113277.
Chiesa M, Maioli G, Colombo GJ, Piacentini L. GARS: Genetic Algorithm for the identification of a Robust Subset of features in high-dimensional datasets. BMC Bioinformatics. 2020;21:1. https://doi.org/10.1186/s12859-020-3400-6.
Kong W, Vanderburg CR, Gunshin H, Rogers JT, Huang X. A review of independent component analysis application to microarray gene expression data. Biotechniques Future Science. 2018;45(5):501–20. https://doi.org/10.2144/000112950.
Mohan A, Rao MD, Sunderrajan S, Pennathur G. Automatic classification of protein structures using physicochemical parameters. Interdisciplinary Sciences: Computational Life Sciences. 2014;6(3):176–86. https://doi.org/10.1007/s12539-013-0199-0.
Chuang L, Chu Y, Li JC, Yang C. A hybrid BPSO-CGA approach for gene selection and classification of microarray data. J Comput Biol. 2012;19(1):68–82. https://doi.org/10.1089/cmb.2010.0064.
Hira ZM, Gillies DF. A review of feature selection and feature extraction methods applied on microarray data. Adv Bioinform. 2015. 1–13. https://doi.org/10.1155/2015/198363.
Wang J, Du P, Niu T, Yang W. A novel hybrid system based on a new proposed algorithm multi-objective whale optimization algorithm for wind speed forecasting. Appl Energy. 2017;208:344–60.
Arowolo MO, Abdulsalam SO, Isisaka RM, Gbolagade KA. A hybrid dimensionality reduction model for classification of microarray dataset. Int J Inform Technol Computer Sci. 2017;9(11):57–63.
Pragadeesh C, Jeyaraj R, Siranjeevi K, Abishek R, Jeyakumar G. Hybrid feature selection using micro genetic algorithm on microarray gene expression data. J Intell Fuzzy Syst. 2019;36(3):2241–6. https://doi.org/10.3233/jifs-169935.
Lin Z, Zhang G. Genetic algorithm-based parameter optimization for EO-1 Hyperion remote sensing image classification. Eur J Remote Sens. 2019;53(1):124–31.
Hodgson SH, Muller J, Lockstone HE, Hill AVS, Marsh K, Draper SJ, Knight JC. Use of gene expression studies to investigate the human immunological response to malaria infection. Malaria J. 2019;18:1. https://doi.org/10.1186/s12936-019-3035-0.
Rashid AN, Ahmed M, Sikos LF, Haskell-Dowland P. Cooperative co-evolution for feature selection in big data with random feature grouping. J Big Data. 2020;7:1. https://doi.org/10.1186/s40537-020-00381-y.
Lakshmanan B, Jenitha T. Optimized feature selection and classification in Microarray gene expression cancer data. Indian J Public Health Res Dev. 2020;11(1):347. https://doi.org/10.37506/v11/i1/2020/ijphrd/193842.
Badaoui F, Amar A, Ait Hassou L, Zoglat A, Okou CG. Dimensionality reduction and class prediction algorithm with application to microarray big data. J Big Data. 2017;4:1. https://doi.org/10.1186/s40537-017-0093-4.
Ayyad SM, Saleh AI, Labib LM. A new distributed feature selection technique for classifying gene expression data. Int J Biomath. 2019;12(04):1950039. https://doi.org/10.1142/s1793524519500396.
Das H, Naik B, Behera HS. A hybrid neuro-fuzzy and feature reduction model for classification. Adv Fuzzy Syst. 2020. https://doi.org/10.1155/2020/4152049.
Forcato M, Romano O, Bicciato S. Computational methods for the integrative analysis of single-cell data. Brief Bioinform. 2020. https://doi.org/10.1093/bib/bbaa042.
Comert G, Begashaw N, Turhan-Comert A. Malaria outbreak detection with machine learning methods. 2020. https://doi.org/10.1101/2020.07.21.214213.
Arowolo MO, Adebiyi MO, Adebiyi AA, Okesola JO. PCA Model For RNA-Seq Malaria Vector Data Classification Using KNN And Decision Tree Algorithm. 2020 International Conference in Mathematics, Computer Engineering and Computer Science (ICMCECS). 2020. 1–8.
Zhao S, Leung WPF, Bottner A, Ngo K, Liu X. Comparison of RNA-Seq and microarray in transcriptome profiling of activated T cells. PLoS ONE. 2014;9(1).
Fan J, Slowikowski K, Zhang F. Single-cell transcriptomics in cancer: computational challenges and opportunities. Exp Mol Med. 2020;52:1452–65. https://doi.org/10.1038/s12276-020-0422-0.
Raddatz BB, Spitzbarth I, Matheis KA, Kalkuhl A, Deschl U, Baumgärtner W, Ulrich R. Microarray-based gene expression analysis for veterinary pathologists: A review. Vet Pathol. 2017;54(5):734–55. https://doi.org/10.1177/0300985817709887.
Shen L, Jiang H, He M, Liu G. Collaborative representation-based classification of microarray gene expression data. PLoS ONE. 2017;12:2.
Sahu B, Dehuri S, Jagadev A. A study on relevance of feature selection methods in microarray data. Open Bioinform J. 2018;11:117–39.
Jabeen A, Ahmad N, Raza K. Machine learning-based state-of-the-art methods for the classification of RNA-Seq data. 2017. https://doi.org/10.1101/120592.
Uma SM, Kirubakaran E. A hybrid heuristic dimensionality reduction technique for microarray gene expression data classification: a blending of GA, PSO and ACO. International Journal of Data Mining Modelling Management. 2016;8(2):160–79.
Motieghader H, Najafi A, Sadeghi B, M-Nejad A. A Hybrid gene selection algorithm for microarray cancer classification using genetic algorithm and learning automata. Inform Med Unlocked. 2017;9:246–54.
Wang L, Wang Y, Chang Q. Feature selection methods for big data bioinformatics: a Survey from the search perspective. Methods. 2017;111:21–31.
Jain D, Singh V. An efficient hybrid feature selection model for dimensionality reduction. International Conference on Computational Intelligence and Data Science. Procedia Comput Sci. 2018;123:333–41.
Hashemi FSG, Ismail MR, Yusop MR, Hashemi MSG, Shahraki MHN, Rastegari H, Miah G, Aslani F. Intelligent mining of large-scale bio-data: bioinformatics applications. Reviews. 2018;2020(28):1.
Arowolo MO, Adebiyi MO, Adebiyi AA. An efficient PCA Ensemble learning approach for prediction of RNA-Seq malaria vector gene expression data classification. Int J Eng Res Technol. 2020;13(1):163–9.
Bose J. Hybrid GA/KNN/SVM algorithm for classification of data. BioHouse J Computer Sci. 2016;2(2):5–11.
Sun L, Kong X, Xu J, Xue Z, Zhai R, Zhang S. A hybrid gene selection method based on Relief-F and ant colony optimization algorithm for tumor classification. Nat Res Acad. 2019;9:8978.
Hyung PC, Nguyen VH, Do T. Novel hybrid DCNN-SVM model for classifying RNA-Sequencing gene expression data. 2019. 533–547.
Feng C, Liu C, Zhang H, Guan R, Li D, Zhou F, Liang Y, Feng X. Dimension reduction and clustering models for single-cell RNA-Seq data: A comparative study. Int J Mol Sci. 2020;21(2181):1–21.
Susmi SJ, Nehimiah HK. Hybrid dimensionality reduction techniques with genetic algorithm and neural network for classifying leukemia gene expression data. Indian J Sci Technol. 2018;9(1):1–8.
Acknowledgements
The authors would like to thank Landmark University for supporting this work and the experiments carried out in this research.
Funding
No funding is currently available for this work.
Author information
Contributions
MO Arowolo carried out the research as a PhD student under the supervision and mentoring of Prof. O Olugbara, Prof. AA Adebiyi and Dr MO Adebiyi, who handled technical issues and advised on every stage of the work. MO Arowolo wrote the manuscript, while O Olugbara, MO Adebiyi and AA Adebiyi revised it. All authors read and approved the final manuscript.
Authors' information
Micheal Olaolu Arowolo is a faculty member of the Department of Computer Science, Landmark University, Omu-Aran, and a PhD student in computer science. His research interests are machine learning, data mining, bioinformatics, artificial intelligence, gene expression analysis and computer arithmetic.
Dr Marion Olubunmi Adebiyi is a faculty member of the Department of Computer Science at Landmark University, Omu-Aran, Nigeria. She holds a B.Sc degree from the University of Ilorin, Ilorin, Nigeria, and M.Sc and PhD degrees in Computer Science from Covenant University, Nigeria. Her research interests include bioinformatics of infectious (African) diseases and populations, organism inter-pathway analysis, high-throughput data analytics, homology modelling and artificial intelligence. She has published widely in reputable local and international journals. She is a member of the Nigerian Computer Society (NCS), the Computer Registration Council of Nigeria (CPN) and IEEE.
Professor Ayodele Ariyo Adebiyi is a Professor of Computer Science. He is currently the Head of the Department of Computer Science at Landmark University, Omu-Aran, Nigeria, a sister university to Covenant University. He holds a BSc degree in Computer Science and an MBA degree from the University of Ilorin, Ilorin, Nigeria, and MSc and PhD degrees in Management Information System (MIS) from Covenant University, Nigeria. His research interests include the application of soft computing techniques to real-life problems, software engineering and information system research. He has successfully mentored and supervised several postgraduate students at Masters and PhD level. He has published widely in reputable local and international journals. He is a member of the Nigerian Computer Society (NCS), the Computer Registration Council of Nigeria (CPN) and IEEE.
Prof Oludayo Olugbara graduated with a first-class Bachelor of Science (Hons) in Mathematics from the University of Ilorin in 1991. After completing the National Youth Service Corps, he was a junior research fellow at the University of Ilorin. In 1993 he commenced his Master’s degree in Mathematics with specialization in Computer Science at the University of Ilorin and completed it in 1995. He holds a PhD degree in Computer Science from the University of Zululand in South Africa. He is a Professor of Information Technology at the Durban University of Technology in South Africa. He is a holder of academic awards and scholarships, including the International Federation of Information Processing (IFIP) TC2 sponsored by Microsoft Research Cambridge in 2007 and a respected research paper award at the International Conference on Machine Learning and Data Analysis, organized by the International Association of Engineers (IAENG), San Francisco, USA, in 2012. He is a University Scholar at the University of Ilorin, a Member of Marquis Who’s Who in the World (USA), a Member of the Association for Computing Machinery (ACM, USA), a Member of the Computer Society of South Africa (CSSA) and a member of other academic associations. He was named an honorary referee of the Maejo International Journal of Science and Technology, Thailand, in 2007–2010 and 2011. In December 2015, he was named an outstanding scientist by the Center for Advanced Research and Design of Venus International Foundation in India. He became an established researcher through a rating from the National Research Foundation (NRF) of South Africa in 2017. He has examined several postgraduate theses and dissertations and has assessed research publications for professorial appointments both nationally and internationally. He has published widely and is a reviewer for many reputable journals.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Arowolo, M.O., Adebiyi, M.O., Adebiyi, A.A. et al. Optimized hybrid investigative based dimensionality reduction methods for malaria vector using KNN classifier. J Big Data 8, 29 (2021). https://doi.org/10.1186/s40537-021-00415-z
DOI: https://doi.org/10.1186/s40537-021-00415-z