Abstract—In performing data mining, a common task is to search for the most appropriate algorithm(s) to retrieve important information from data. With an increasing number of available data mining techniques, it may be impractical to experiment with many techniques on a specific dataset of interest to find the best algorithm(s). In this paper, we demonstrate the suitability of tree-based multi-variable linear regression in predicting algorithm performance. We take into account prior machine learning experience to construct meta-knowledge for supervised learning. The idea is to use summary knowledge about datasets, along with the past performance of algorithms on these datasets, to build this meta-knowledge. We augment pure statistical summaries with descriptive features and a misclassification cost, and discover that transformed datasets obtained by reducing a high dimensional feature space to a smaller dimension still retain significant characteristic knowledge necessary to predict algorithm performance. Our approach works well for both numerical and nominal data obtained from real world environments.

Keywords-Meta-learning; regression; dimensionality reduction; combined metric

I. INTRODUCTION

Learning from data is of interest to many disparate fields such as banking, bioinformatics, business, computer vision and education. The field of data mining uses a large collection of machine learning algorithms whose goal is to extract useful information from collected data. For any given dataset, a common question is which learning algorithm is best suited for it. Performing experiments with several algorithms using the data, or getting advice from machine learning experts, can help assess which algorithms might be the best candidates, but this is not always practical.

Using the idea of meta-learning, we solve the problem of selecting a machine learning algorithm for a particular dataset by supervised learning. In this work, we present an efficient way to deal with the non-standard formats of real-world datasets in order to obtain training data. Our work further addresses the well-known problem of algorithm selection.

The remainder of the paper is organized as follows. Section 2 presents related work. Section 3 motivates the solution, followed by our approach in Section 4. Section 5 describes our experiments. Discussion is in Section 6. Finally, Section 7 summarizes the paper and provides directions for future study.

II. RELATED WORK

In the past, researchers have attempted to predict algorithm behavior based on data characteristics. Example approaches include STATLOG [1] and METAL [2], which use machine learning on acquired knowledge of how various machine learning algorithms perform on various datasets. To construct meta-knowledge, statistical summaries and non-statistical information (such as the number of instances and the number of classes) are used. Metrics from information theory (such as entropy and mutual information) are used for nominal data. An approach called landmarking [3] captures a set of informative features computed from selected learners (e.g., C5.0tree) on selected datasets. Going further, [4] introduce the use of model-based properties in meta-learning. These features include characteristics of induced decision trees (e.g., nodes per attribute, nodes per instance, or average gain-ratio difference) computed on the same datasets [5].

We have identified several issues with past approaches to selecting machine learning algorithms. For example, statistical features used in past studies [1], [6], [7] include the mean of summary statistics (e.g., means of standard deviations, skewness, and kurtosis across all features) of a dataset. We believe that averaging values of summary statistics dilutes statistical meaning. For instance, a left skew on one attribute might cancel out the right skew on another attribute in the final mean of skewnesses. The mean of skewness or kurtosis values across all features loses its discriminating character when the distribution is non-uniform [6].

Using many or all features may also be a source of problems, both for real datasets and for the meta-knowledge derived from them. A smaller, optimized number of meta-features may improve the quality of training data and produce a better predictive meta-model. Researchers deal with a high number of dimensions in two ways: feature selection and feature extraction. Feature selection retains the most discriminating features [8]. Feature extraction, on the other hand, generates a new set of features as predictors in the form of composite attributes [9].
performance of all classifiers using the transformed datasets. These values represent the labels in the training examples corresponding to the generated features for each dataset. Next, we construct our regression model using the set of training examples. Finally, we use this model to produce the predicted performance and the ranked list. The whole process is described in the following sub-sections.

A. Dimensionality Reduction with a Variable Number of Features

Dimensionality reduction uses different techniques to generate a new, small set of features from an original dataset. A well-known technique for dimensionality reduction is Principal Components Analysis (PCA). We use PCA due to its computational efficiency compared, for example, to the Singular Value Decomposition method [11]. To work with PCA, nominal features need to be converted into numeric form. We take a straightforward approach: a nominal attribute that takes m distinct values is converted into m new binary features. As non-linear correlation cannot be detected by PCA, we adapt the approach of [12] to include mutual information by discretizing all numeric values.
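As an illustration, the conversion and reduction steps might look like the sketch below. It uses pandas and scikit-learn (the latter is cited later in the paper for synthetic data generation); the toy data frame and its column names are hypothetical, and the choice of 4 components anticipates the setting justified below.

```python
# Sketch: one-hot encode nominal attributes, then reduce to a fixed number of
# principal components. The example frame and column names are made up.
import pandas as pd
from sklearn.decomposition import PCA

df = pd.DataFrame({
    "color":  ["red", "blue", "red", "green", "blue"],   # nominal: 3 values -> 3 binary columns
    "width":  [1.2, 0.7, 1.9, 0.4, 2.3],                  # numeric attributes pass through
    "height": [3.1, 2.2, 5.0, 1.1, 4.0],
    "label":  [0, 1, 0, 1, 0],
})

X = pd.get_dummies(df.drop(columns="label"))   # m distinct values become m binary features
X_reduced = PCA(n_components=4).fit_transform(X)
print(X_reduced.shape)                         # (n_instances, 4)
```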
Table III
AVERAGE ERROR RATE WITH NUMBER OF FEATURES

No. of Features   2        3        4        5        6
Avg Error Rate    0.3873   0.3859   0.3701   0.3705   0.3701

One important question is how many features to generate in the transformed dataset for a typical classification problem. To justify a particular number of features, we perform experiments that reduce the dimension of the datasets to 2, 3, 4, 5 and 6 features. After reducing dimensionality, we classify each dataset with 23 different algorithms and record the accuracy for each classification problem. For simplicity, accuracy, which is correlated with the SAR metric used later in the paper (explained in Section 4.3), is used in this preliminary experiment to justify the choice of the number of features, because it is produced directly by most classification algorithms. The error rate (1 - accuracy) of each transformed dataset is used to compute the average error rate for each number of features (see Table III). The number of features with the lowest average error rate is selected as the number of features used for dimensionality reduction. Table III indicates that the higher the number of features, the lower the average error rate in the classification tasks. However, this holds only when the features are independent. Our experiments with up to 10 generated features (not presented in this paper) confirm a similar pattern. With the current generated features, the feasible choices are 4, 5 and 6. It is reasonable not to use higher numbers (e.g., from 7 onward), as dimensionality reduction techniques cannot generate more features than the number of features in the original datasets. Choosing 5 or 6 features for the transformed data is feasible, but doing so would exclude original datasets with fewer than 5 or 6 features. We therefore choose 4 as the number of features in the transformed datasets used to generate the meta-data (training set) for our study.
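A minimal sketch of this preliminary experiment, with a single scikit-learn classifier and two bundled datasets standing in for the 23 Weka learners and the full dataset collection (the paper evaluates 2 to 6 components on datasets with enough features):

```python
# Sketch: average error rate (1 - accuracy) as a function of the number of
# PCA components, averaged over a small stand-in collection of datasets.
import numpy as np
from sklearn.datasets import load_iris, load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

datasets = [load_iris(return_X_y=True), load_wine(return_X_y=True)]

for k in (2, 3, 4):                            # candidate numbers of components
    errors = []
    for X, y in datasets:
        Xk = PCA(n_components=k).fit_transform(X)
        acc = cross_val_score(DecisionTreeClassifier(random_state=0), Xk, y, cv=5).mean()
        errors.append(1.0 - acc)               # error rate = 1 - accuracy
    print(f"{k} components: average error rate = {np.mean(errors):.4f}")
```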
B. Meta-features

Table I describes the meta-features of the training set, which are obtained from the transformed datasets (see Figure 1). The first feature indicates the learner under consideration, whereas the next two features describe the relation of the number of classes to the number of attributes and to the number of instances, respectively. The set of four summary statistics, viz., median, mean, standard deviation and skewness, is computed for each of the new mixed features of the compressed datasets. This results in 4*4 = 16 meta-features based on statistical summaries: Median1, Median2, Median3, Median4, Mean1, Mean2, Mean3, Mean4, Std1, Std2, Std3, Std4, Skewness1, Skewness2, Skewness3 and Skewness4. These four statistics provide good predictive power while keeping the number of features in the meta-data relatively low. We also use two information-theoretic features: the class entropy of the target attribute and the total correlation. The last attribute represents the learner's actual performance, which is an experimentally derived value. This attribute (Performance) indicates the performance of each algorithm on a particular dataset based on the SAR metric (Squared error, Accuracy, ROC Area), described later.

Among the information-theoretic metrics, class entropy (the EntroClass attribute) indicates how much information is needed to specify one target class, whereas total correlation (the TotalCorr attribute) measures the amount of information shared among the attributes. To offset the possible loss of critical information caused by inappropriate bin boundaries in discretization, we include an additional feature, the ratio of the number of classes to the dataset's size (the ClassInt attribute), to measure the trade-off between the number of instances and the number of target classes. We also calculate the ratio of the number of features to the number of target classes (the AttrClass attribute).
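For concreteness, a hedged sketch of how one meta-example might be assembled from a 4-component transformed dataset. The attribute names follow the description above; TotalCorr and the Performance label (the learner's SAR) would be computed separately and are omitted here, and the learner name is only a placeholder.

```python
# Sketch: build one meta-example (a row of the meta-data) from a transformed
# dataset with 4 components. EntroClass is the Shannon entropy of the labels.
import numpy as np
from scipy.stats import skew

def meta_features(X_reduced, y, learner_name):
    n, k = X_reduced.shape                        # k = 4 components
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    row = {
        "Learner":    learner_name,
        "AttrClass":  k / len(counts),            # features per target class
        "ClassInt":   len(counts) / n,            # classes per instance
        "EntroClass": float(-(p * np.log2(p)).sum()),
    }
    for i in range(k):                            # 4 statistics x 4 components = 16 features
        col = X_reduced[:, i]
        row[f"Median{i+1}"]   = float(np.median(col))
        row[f"Mean{i+1}"]     = float(col.mean())
        row[f"Std{i+1}"]      = float(col.std())
        row[f"Skewness{i+1}"] = float(skew(col))
    return row

rng = np.random.default_rng(0)                    # random stand-in data
print(meta_features(rng.normal(size=(100, 4)), rng.integers(0, 3, size=100), "SomeLearner"))
```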
C. Measurement metrics

In reality, different evaluation metrics can present conflicting results when assessing algorithms' performance, especially when multiple algorithms based on different data mining approaches are involved. This problem has been discussed extensively in [13], [14]. Using the common accuracy metric for algorithm performance has some drawbacks. It does not reflect the differing importance of performance on each class. Accuracy is also known to be sensitive to unbalanced classes when the assumption of a normal distribution between classes is not guaranteed.
As a result, to evaluate an algorithm's performance, we propose to use a combined metric that takes advantage of three commonly used metrics in classification. This metric, SAR, proposed in [13], is computed as SAR = [Accuracy + AUC + (1 - RMSE)]/3, where AUC and RMSE are the Area Under the ROC Curve and the Root Mean Square Error, respectively. SAR ∈ [0,1], and higher values are better. On the other hand, we use the RMSE metric for the regression task, when we compare several candidate models for predicting algorithm performance. In regression, RMSE is more suitable, as it indicates the difference between observed and predicted values.
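A small sketch of how SAR might be computed for one classifier on one binary dataset, with scikit-learn metrics standing in for the Weka measurements used in the paper (taking the RMSE over predicted class probabilities is one plausible reading of the definition):

```python
# Sketch: SAR = [Accuracy + AUC + (1 - RMSE)] / 3, higher is better.
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

def sar(y_true, y_prob, threshold=0.5):
    y_pred = (y_prob >= threshold).astype(int)
    acc  = accuracy_score(y_true, y_pred)
    auc  = roc_auc_score(y_true, y_prob)
    rmse = np.sqrt(np.mean((y_true - y_prob) ** 2))   # RMSE of predicted probabilities
    return (acc + auc + (1.0 - rmse)) / 3.0

y_true = np.array([0, 1, 1, 0, 1])
y_prob = np.array([0.2, 0.9, 0.6, 0.4, 0.7])          # made-up predictions
print(sar(y_true, y_prob))                            # value in [0, 1]
```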
To generate a list of algorithms for classifying a particular dataset, we use a minimum performance threshold (0.6) on the SAR metric. With a threshold value greater than 0.5, the final list of algorithms includes only those with high performance as measured by the combined metric. Using a threshold of 0.6 can be justified, since including low-performing algorithms in the final list is not productive. Retaining only high-performing algorithms for a specific dataset significantly reduces the computational expense of deciding whether more features need to be collected in the original dataset or further hyper-parameter tuning is required. Finally, we generate a ranked list of algorithms by predicted performance, indicating how a particular algorithm may behave on an unknown dataset.
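As a trivial sketch, the thresholding and ranking step could look as follows (the algorithm names and predicted SAR values are made up):

```python
# Sketch: keep algorithms whose predicted SAR reaches the 0.6 threshold,
# then rank them by predicted performance (descending).
predicted_sar = {"RandomForest": 0.82, "NaiveBayes": 0.55, "SVM": 0.74, "J48": 0.61}

ranked = sorted(
    ((name, score) for name, score in predicted_sar.items() if score >= 0.6),
    key=lambda item: item[1],
    reverse=True,
)
print(ranked)   # [('RandomForest', 0.82), ('SVM', 0.74), ('J48', 0.61)]
```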
D. Data source

We use two types of datasets, real and synthetic. From the UCI repository [15], we select 73 datasets from a variety of domains. The collection includes many datasets that have been used in similar work on algorithm selection [16], [17], [10].

Real world datasets often come with noise and imprecision due to errors in measuring devices and human error, so drawing precise conclusions from mined results suffers from the fact that assumptions regarding the data distribution cannot be guaranteed. The artificial datasets also counter the limitation that the selected real datasets cover only a small number of data domains. We use generators from the open source tool scikit-learn [18] to generate synthetic datasets with at least four attributes. The number of classes varies from 2 to 10, with a randomized number of instances between 100 and 4,000. Random noise is introduced in all of these datasets. From the generated synthetic datasets, we select only those with at least 4 features, resulting in 37 synthetic datasets in a collection of 100 datasets.
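The paper cites scikit-learn [18] for the generators but does not list the exact call or parameters; the following sketch uses make_classification with settings chosen to match the description above, so the details are assumptions.

```python
# Sketch: one synthetic dataset with 2-10 classes, 100-4,000 instances,
# at least four attributes, and random label noise.
import numpy as np
from sklearn.datasets import make_classification

rng = np.random.default_rng(42)
X, y = make_classification(
    n_samples=int(rng.integers(100, 4001)),
    n_features=8,                  # at least four attributes
    n_informative=6,
    n_classes=int(rng.integers(2, 11)),
    n_clusters_per_class=1,
    flip_y=0.05,                   # random label noise
    random_state=42,
)
print(X.shape, np.unique(y).size)
```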
E. Regression Models

Our proposed prediction model generates a performance value for each algorithm of choice. These predicted performances are used to compute the final ranked list. Among several regression models, we use Regression Tree models and other state-of-the-art models. The basic idea of a regression tree is to split the data space into many sub-regions so that a model tree can obtain high accuracy by computing the regression model with small sets of predictors separately in these sub-regions. This property gives the Regression Tree the ability to apply linear models to non-linear data, and non-linear data is common and is also what we face in our case. The tree models are Classification and Regression Trees (CART) [19], Conditional Tree [20], Model Tree [21], Rule Based System [22], Bagging CART [23], Random Forest [24], Gradient Boost Regression [25] and Cubist [26]. (A brief illustrative comparison sketch follows the model descriptions below.)

∙ The CART tree splits on attributes to minimize a loss function. Each final split determines a sub-region of the data space that indicates a linear relationship.
∙ The Conditional Decision Tree (CDT) applies statistical tests to select the split points of attributes, to avoid selection bias in the splits.
∙ The Model Tree represents each leaf as a linear regression model. The model tree aims to reduce the error rate at each node when constructing the tree.
∙ The Rule Based System simplifies a decision tree by removing parts of rules that have low predictive power, to avoid overfitting.
∙ Bagging CART uses bootstrapping with aggregation over regression models to reduce the variance of the prediction. Each model can be built independently.
∙ Random Forest uses a random selection of features to split each node, with bootstrap samples, when building trees.
∙ Gradient Boost Regression extends AdaBoost [27] using gradient boosting. It adds new models that learn the misclassification errors in order to reduce bias.
∙ Cubist tries to reduce the conditions of a rule without increasing the error rate. Cubist can adjust the model prediction using a training set to improve its performance.

The remaining models include Neural Network [28], Support Vector Regression [29], K-Nearest Neighbor [30], Partial Least Squares or PLS [31], Ridge Regression [32], Least Angle Regression or LARS [33], Elastic Net [34] and Multivariate Adaptive Regression Splines or MARS [35].

∙ The Neural Network connects its predictors to the response through its hidden units. Each unit receives information from the previous layer and generates output for the next layer.
∙ Support Vector Regression (SVR) searches for data points to support a regression. SVR uses a loss function with a penalty, for high performance in the presence of outliers.
∙ K-Nearest Neighbor Regression (KNN regression) locates the K nearest neighbors in the predictor space to predict a new value using a summary statistic.
∙ The PLS model tries to obtain a linear combination of the independent variables that maximizes the covariance needed to separate groups.
∙ Ridge regression uses regularization to penalize the model by shrinking the coefficients toward 0, to counter highly correlated predictors in a dataset.
∙ LARS is based on the LASSO model [36]. It calculates its next move in the least angle direction among the currently most correlated covariates.
∙ The Elastic Net is a generalized combination of two penalized regression techniques, L1 (LASSO) and L2 (Ridge), introduced by [32] to exploit the advantages of both.
∙ MARS initially uses surrogate features with only one or two predictors at a time that indicate a clearly linear relationship, to produce the best fit given the initial set.
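The sketch below illustrates the general idea of comparing several candidate meta-regressors by held-out RMSE. It is only an approximation of the paper's setup: Cubist, the Model Tree and the conditional tree have no direct scikit-learn implementation, so a subset of the models is shown, and random data stands in for the real meta-features and SAR labels.

```python
# Sketch: fit several candidate meta-regressors and compare held-out RMSE.
import numpy as np
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import ElasticNet, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 21))                       # stand-in meta-feature matrix
y = rng.uniform(0.3, 0.9, size=300)                  # stand-in SAR labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "CART":           DecisionTreeRegressor(random_state=0),
    "Bagging CART":   BaggingRegressor(DecisionTreeRegressor(), random_state=0),
    "Random Forest":  RandomForestRegressor(random_state=0),
    "Gradient Boost": GradientBoostingRegressor(random_state=0),
    "Ridge":          Ridge(),
    "Elastic Net":    ElasticNet(),
}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    rmse = np.sqrt(mean_squared_error(y_te, pred))   # lower is better
    print(f"{name:>14s}  RMSE = {rmse:.4f}")
```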
Our approach differs from [37] and [38], which either use only artificial datasets (the former) or associate a set of rules with learner performance (the latter).

Table IV
DATASETS USED AS EXAMPLES TO DEMONSTRATE THE FINAL RANKING LIST

Name             Nominal   Numeric   Class   Instances
Real data
credit-g              13         7       2        1000
Kr-vs-kp              37         0       2        3196
Abalone                1         7      28        4117
Waveform               0        40       3        5000
Shuttle                9         0       7       58000
Synthetic data
art-02                 0        19       5        1038
art-07                 0        23       3        4430
art-12                 0        12      10        1276
art-17                 0        21      10        3266
art-24                 0        18       2        3552

F. Evaluation approach

We use two statistical significance tests to assess the results of our approach: Spearman's rank correlation test [39], to measure how close two rankings are, and the Friedman test, to validate the true ranking; the latter is robust when a normal distribution cannot be guaranteed.
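A minimal sketch of both tests using SciPy; the performance vectors are made-up stand-ins for the predicted and observed values.

```python
# Sketch: Spearman's rank correlation between predicted and observed
# performance, and a Friedman test across algorithms over the same datasets.
import numpy as np
from scipy.stats import spearmanr, friedmanchisquare

predicted = np.array([0.81, 0.74, 0.69, 0.66, 0.62, 0.58])   # stand-in predicted SAR
observed  = np.array([0.79, 0.75, 0.71, 0.63, 0.64, 0.55])   # stand-in observed SAR
rho, p_rho = spearmanr(predicted, observed)
print(f"Spearman rho = {rho:.3f}, p = {p_rho:.3g}")

rng = np.random.default_rng(1)
scores = rng.uniform(0.5, 0.9, size=(3, 10))                  # 3 algorithms x 10 datasets
stat, p_fried = friedmanchisquare(scores[0], scores[1], scores[2])
print(f"Friedman chi2 = {stat:.3f}, p = {p_fried:.3g}")
```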
V. EXPERIMENTS

Experiments are conducted on a system with an Intel Core i5 CPU at 2.4 GHz and 6 GB of RAM. The Weka API is used for all classification tasks, Python scripts for creating the synthetic datasets, and R scripts for the remaining work.

We use the stratified k-fold method [40] to handle unbalanced real-world datasets. We estimate the predicted performance using the regression models for all algorithms (refer to Table II). Finally, the predicted and observed performance values of the algorithms on the test set are ordered and the following statistical tests are performed.
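As a small illustration of the stratified evaluation step, with a bundled dataset and one scikit-learn classifier standing in for the Weka-based setup described above:

```python
# Sketch: stratified k-fold evaluation of a single classifier; the per-fold
# scores could feed the SAR computation for that learner and dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)   # preserves class ratios per fold
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print(scores.mean())
```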
Using Spearman’s rank correlation test report in Table Table V
VII, we see that all Spearman coefficients fall into four S PEARMAN R ANK C OEFFICIENT
groups (refer Table V): very strong (such as Neural Net,
Coefficient very weak weak moderate strong very strong
MARS, Bagging CART, Cubist, Conditional Decision Tree,
range .001-.19 .2-.39 .4-.59 .6-.79 .8-1
Random Forest), strong (SVR, CART, Model Tree, Rule
based System) and moderate (KNN, PLS, Ridge Regre-
The top 5 rank lists indicate that no single learner
sion, LARS and Elastic Net) and weak (Gradient Boost).
dominates on all datasets (Table VIII). The well-known
However, all p-values are significantly small, which imply
Figure 2. Performance of algorithms on real-world examples.
Table VI
RMSES BY MULTIPLE REGRESSION MODELS

Tree Models         RMSE     Other Models     RMSE
CART                0.9291   Neural Net       0.9699
Conditional D.T.    0.9166   SVR              0.9714
Model Tree          0.9308   KNN              0.9692
Rule Based System   0.9166   PLS              0.9669
Bagging CART        0.9251   Ridge            0.5731
Random Forest       0.9216   LARS             0.9668
Gradient Boost      0.9439   Elastic Net      0.9668
Cubist              0.9025   MARS             0.9626

Table VII
SPEARMAN'S RANKING CORRELATION

Tree Models         Spearman rank coef.   Other Models     Spearman rank coef.
CART                0.7721                Neural Network   0.8104
Conditional D.T.    0.8856                SVR              0.7246
Model Tree          0.6438                KNN              0.4849
Rule Based Sys.     0.6438                PLS              0.5619
Bagging CART        0.8374                Ridge            0.5731
Random Forest       0.8977                LARS             0.5790
Gradient Boost      0.3470                Elastic Net      0.5790
Cubist              0.8807                MARS             0.8385
Figure 3. Performance of algorithms on synthetic datasets.
[8] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," The Journal of Machine Learning Research, vol. 3, pp. 1157–1182, 2003.
[9] H. Liu and H. Motoda, Feature Extraction, Construction and Selection: A Data Mining Perspective. Springer, 1998.
[10] C. Thornton, F. Hutter et al., "Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms," in Proceedings of the 19th ACM SIGKDD. ACM, 2013, pp. 847–855.
[11] L. De Lathauwer, B. De Moor et al., "Singular value decomposition," in Proc. EUSIPCO-94, Edinburgh, Scotland, UK, vol. 1, 1994, pp. 175–178.
[12] D. N. Reshef, Y. A. Reshef et al., "Detecting novel associations in large data sets," Science, vol. 334, no. 6062, pp. 1518–1524, 2011.
[20] T. Hothorn, K. Hornik, and A. Zeileis, "Unbiased recursive partitioning: A conditional inference framework," Journal of Computational and Graphical Statistics, vol. 15, no. 3, pp. 651–674, 2006.
[21] J. R. Quinlan et al., "Learning with continuous classes," in 5th Australian Joint Conference on Artificial Intelligence, vol. 92. Singapore, 1992, pp. 343–348.
[22] G. Holmes, M. Hall, and E. Frank, Generating Rule Sets from Model Trees. Springer, 1999.
[25] J. H. Friedman, "Stochastic gradient boosting," Computational Statistics & Data Analysis, vol. 38, no. 4, pp. 367–378, 2002.
[26] J. R. Quinlan, "Combining instance-based and model-based learning," in Proceedings of the Tenth International Conference on Machine Learning, 1993, pp. 236–243.
[27] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," in Computational Learning Theory. Springer, 1995, pp. 23–37.
[28] C. M. Bishop et al., "Neural networks for pattern recognition," 1995.
[29] H. Drucker, C. J. Burges, L. Kaufman, A. Smola, and V. Vapnik, "Support vector regression machines," Advances in Neural Information Processing Systems, vol. 9, pp. 155–161, 1997.
[37] C. Köpf, C. Taylor et al., "Meta-analysis: from data characterisation for meta-learning to meta-regression," in Proceedings of the PKDD-00 Workshop on Data Mining, Decision Support, Meta-Learning and ILP. Citeseer, 2000.
[38] H. Bensusan and A. Kalousis, "Estimating the predictive accuracy of a classifier," in Machine Learning: ECML 2001. Springer, 2001, pp. 25–36.
[39] M. Kuhn and K. Johnson, Applied Predictive Modeling. Springer, 2013.