Doan 2015

2015 IEEE 15th International Conference on Data Mining Workshops
Selecting Machine Learning Algorithms using Regression Models
Tri Doan Jugal Kalita

Department of Computer Science Department of Computer Science
University of Colorado University of Colorado
Colorado Springs Colorado Springs
[email protected] [email protected]
Abstract—In performing data mining, a common task is Finally, Section 7 summarizes the paper and provides di-
to search for the most appropriate algorithm(s) to retrieve rections for future study.
important information from data. With an increasing number
of available data mining techniques, it may be impractical II. R ELATED W ORK
to experiment with many techniques on a specific dataset
of interest to find the best algorithm(s). In this paper, we In the past, researchers have attempted to predict algo-
demonstrate the suitability of tree-based multi-variable linear rithm behavior based on data characteristics.
regression in predicting algorithm performance. We take into Example approaches include STATLOG [1], and METAL
account prior machine learning experience to construct meta- [2] that use machine learning on acquired knowledge of
knowledge for supervised learning. The idea is to use summary
knowledge about datasets along with past performance of how various machine learning algorithms perform on various
algorithms on these datasets to build this meta-knowledge. We datasets. To construct meta-knowledge, statistical summaries
augment pure statistical summaries with descriptive features and non-statistical information (such as number of instances
and a misclassification cost, and discover that transformed and number of classes) are used. Metrics from information
datasets obtained by reducing a high dimensional feature space theory (such as entropy and mutual information) are used for
to a smaller dimension still retain significant characteristic
knowledge necessary to predict algorithm performance. Our nominal data. An approach called landmarking [3] captures
approach works well for both numerical and nominal data a set of informative features computed from the selected
obtained from real world environments. learners (e.g., C5.0tree) on selected datasets. Going further,
Keywords-Meta-learning; regression; dimensionality reduc- [4] introduce the use of model-based properties in meta-
tion; combined metric learning. These features include characteristics of induced
decision trees (e.g., nodes per attribute, nodes per instance,
I. I NTRODUCTION or average gain-ratio difference) using the same datasets [5].
We have identified several issues with past approaches
Learning from data is of interest to many disparate to select machine learning algorithms. For example, statis-
fields such as banking, bioinformatics, business, computer tical features that have been used in past studies [1], [6],
vision and education. The field of data mining uses a large [7] include the mean of summary statistics (e.g., means
collection of machine learning algorithms whose goal is to of standard deviations, skewness, and kurtosis across all
extract useful information from collected data. For any given features) of a dataset. We believe that averaging values of
dataset, a common question is which learning algorithm summary statistics dilutes statistical meaning. For instance,
is best suited for it. Performing experiments with several a left-skew on one attribute might cancel out the right-skew
algorithms using the data, or getting advice from machine on another attribute in the final mean of skewnesses. The
learning experts, can help assess which algorithms might be mean of skewness or kurtosis values across all features loses
the best candidates, but this is not always practical. its discriminating character when the distribution is non-
Using the idea of meta-learning, we solve the problem uniform [6].
of selecting a machine learning algorithm for a particular Using many or all features may also be a source of
dataset by supervised learning. In this work, we represent problem for both real datasets and meta-knowledge derived
the efficient way to deal with a non-standard format in real- from them. A smaller optimized number of meta-features
world dataset to obtain training data. Our work is further may improve the quality of training data to produce a better
address the problem of well-known algorithm selection predictive meta-model. Researchers deal with a high number
recently. of dimensions in two ways: feature selection and feature
The remainder of the paper is organized as follows. extraction. Feature selection retains the most discriminating
Section 2 presents related work. Section 3 motivates the features [8]. Feature extraction, on the other hand, generates
solution, followed by our approach in Section 4. Section a new set of features as predictors in the form of composite
5 describes our experiments. Discussion is in Section 6. attributes [9].
978-1-4673-8493-3/15 $31.00 © 2015 IEEE 1498

DOI 10.1109/ICDMW.2015.43
As a breakthrough of among studies of algorithm selec-
tion, Auto-Weka [10] allows one to find the best algorithm
with corresponding optimal parameters among algorithm
of interest. While its result provides optimal parameters
for each algorithm, it suffers a computational problem that
our work desire to address. Two main reasons include
non heuristic to select algorithms to perform the hyper-
parameter search for a given dataset and each execution
starting from scratch. First reason results in wasting time
for searching hyper-parameter of algorithms that are actual
low performance in a specific data domain. Second reason Figure 1. Proposed model for algorithm selection
lies on the approach such that Auto-Weka is built ignores
Table I
knowledge gained from past experience. M ETA - FEATURES USED
In addition, there is very little work on algorithm selection
using regression, especially a regression tree model. Our Feature Description
work study the use of the family of tree models and Algorithm Data mining algorithm’s name
ClassInt Ratio of no of classes to instances
demonstrate the advantage of tree model. AttrClass Ratio of no of features to no of classes
Median* Middle value in a dataset
III. M ETA - LEARNING IN STEPS Mean* Average value in a dataset
∙ Create a training set from original datasets and eval- Std* Standard deviation
Skewness* Measure of asymmetry of the prob. dis.
uate performance of algorithm on these transformed Entro-Class Class entropy for target attribute
datasets. TotalCorr Amount of information shared among variables
∙ Select a best regression model and predict algorithm Performance Measure of performance of each algorithm
performance on unknown dataset. * 4 statistics computed for each of the four meta-features
∙ Generate a ranked list of machine learning algorithms.
Table II
To create meta-data, each original dataset is used to DATA MINING ALGORITHMS IN THIS STUDY
generate one meta-instance in the training set. Since real-
world datasets come with different numbers of features, we Learner Description
J48 Non-commercial decision tree C4.5
transform each original dataset into a dataset with a fixed Decision Stump A decision stump
number of features before performing training. Our approach PART A version of C4.5 using decision list
uses dimension reduction to generate a fixed number of Decision Table Simple decision table
JRip Propositional rule learner
features. Transformed datasets can be generated with a OneR Uses minimum-error attribute
feature generation approach which takes all features from ZeroR 0-R classifier
original dataset into account. In addition, the runtime for the IBk K-NN classifier
KStar Entropy-based
transformed dataset is likely to be lower than for the original LWL Locally Weighed Learning
datasets because computational time for a large number of Naive Bayes Naive Bayes classifier
features often results in higher run time of an algorithm. As AdaBoost M1 Boosting a nominal class classifier
Bagging Reduces variance
each feature has its own predictive power, a feature selection Stacking Combines the output from others
approach may not have provided an adequate solution to Logit Boost Additive logistic regression classifier
generate the same fixed number of features. The reason is Random Committe Ensemble of randomizable base classifiers
Random Forest Forest of random trees
that if we want the features to guarantee high predictive Vote Uses majority vote to label new instance
power, feature selection is likely to select different numbers Logistic Logistic regression with a ridge estimator
of features for different datasets whereas our meta-learning Multilayer Perceptron Neural network with backpropagation
Simple Logistic Linear logistic regression
training set requires a fixed number of features. SMO Sequential minimal optimization
With transformed datasets obtained, we use supervised SVM Support Vector Machine
learning to solve the problem of predicting performance of
specific algorithms on specific datasets. We apply both linear
regression and non-linear regression. Finally, the outcomes IV. A PPROACH
of the regression methods are used to generate a ranked list Figure 1 illustrates our proposed approach. At first, the
of algorithms by performance for a specific dataset. Intu- original datasets are preprocessed to generate training ex-
itively, using regression is suitable as data mining practition- amples. To do this, each dataset is transformed into a
ers may be interested in comparing how various algorithms corresponding intermediate dataset with a fixed number of
perform on a dataset first by performing regression to predict features. This allows us to generate meta-features for the
performance of individual algorithms on the dataset. training set. Weka is used to obtain actual classification
1499
performance of all classifiers using the transformed datasets. transformed data is feasible but if we do so, we will exclude
These values represent the labels in training examples cor- original datasets with fewer than 5 or 6 features. We choose
responding to generated features for each dataset. Next, we 4 as the number of features to produce transformed datasets
construct our regression model using the set of training to generate meta-data (training set) for our study.
examples. Finally, we use this model to produce predicted
performance and the ranked list. The whole process is B. Meta-features
described in following sub-sections. Table I describes the meta-features of the training set
which are obtained from transformed dataset (see Figure
A. Dimensionality Reduction with variable feature issue 1). The first feature indicates the learner under consider-
Dimensionality reduction uses different techniques to gen- ation whereas the next two features describe the relation
erate a new small set of features from an original dataset. between the number of classes to the number of attributes
A commonly well-known technique for dimension reduction and the number of instances, respectively. The set of four
is Principal Components Analysis (PCA). We use PCA due summary statistics, viz., Median, Mean, Standard deviation
to its computational efficiency, say compared to the Single and Skewness, is computed for each of the new mixed
Value Decomposition method [11]. To work with PCA, a features of the compressed datasets. It results in 4*4 =16
nominal feature needs to be converted into numeric. We take meta-features based on statistical summaries. These includes
a straightforward approach: A nominal attribute that takes Median1, Median2, Median3, Median4, Mean1, Mean2,
m distinct values, is converted to m new binary features. Mean3, Mean4, Std1, Std2, Std3, Std4, Skewness1, Skew-
As non-linear correlation cannot be detected by PCA, we ness2, Skewness3, Skewness4 features. These four statistics
adapt the approach by [12] to include mutual information provide good predictive power while keeping the number of
by discretizing all numeric values. features in the meta-data relatively low. We also use two
information theory features such as class entropy for target
Table III attribute and total correlation. The last attribute represents
AVERAGE E RROR RATE WITH NUMBER OF FEATURES
the learner’s actual performance, which is an experimentally
No of Features 2 3 4 5 6 derived value. This attribute (Performance) indicates the
Avg Error Rate 0.3873 0.3859 0.3701 0.3705 0.3701 performance of each algorithm on a particular dataset based
on the SAR metric (Squared error, Accuracy, ROC Area),
One important question is how many features to generate described later.
in the transformed dataset for a typical classification prob- Among these information theoretic metrics, class entropy
lem. To justify a certain number of features, we perform (the EntroClass attribute) indicates how much information is
experiments by reducing the dimension of the datasets into needed to specify one target class whereas total correlation
2, 3, 4, 5 and 6 features. For each dataset, we implement (the TotalCorr attribute) measures the amount of information
classification for each of the datasets after reducing di- shared among the attributes. To offset the possible loss of
mensionality using 23 different algorithms and record the critical information that might be caused by inappropriate
accuracy measurement in each classification problem. Accu- bin boundaries in discretization, we include an additional
racy metric, which is correlated with the SAR metric used feature: the ratio of the number of classes to the dataset’s
later in the paper (explained in Section 4.3), is used in this size (the ClassInt attribute) to measure the trade-off between
preliminary experiment to justify the choice of the number of the number of instances and the number of target classes.
features for simplicity because it is generated directly from We also calculate the ratio of the number of features to the
most classifier algorithms. The error rate (determined by 1- number of target classes (the AttrClass attribute).
accuracy) of each transformed dataset is used to compute
the averages of error rate for each number of features (see C. Measurement metrics
Table III). The number of features with the lowest average In reality, different evaluation metrics can present con-
error rate is selected as the number of features used for flicting results when assessing algorithms’ performance,
dimensionality reduction. Table III indicates that the more especially when multiple algorithms based on different data
the number of features, the lower the average error rates mining approaches are involved. This problem has been
generated in classification tasks. However, it is only true in discussed extensively in the study by [13], [14]. Using
case of existing independent features. Our experiment with common accuracy metric for algorithm performance has
up to 10 generated features (not present in paper) confirm some drawbacks. It does not allow to show the different
similar pattern. With current features generated, the feasible importance of performance on each class. Accuracy is
choices are 4, 5 and 6. It is reasonable not to use higher also known to be sensitive to unbalanced classes when
numbers (e.g., from 7 onward) as dimensionality reduction the assumption of a normal distribution between classes
techniques cannot generate more features than the number is not guarantee. As a result, to evaluate an algorithm’s
of features in original datasets. Choosing 5 or 6 features for performance, we propose to use a combined metric that takes
1500
advantage of three commonly used metrics in classification. with small sets of predictors separately in these sub-regions.
This metric, SAR proposed in the study [13], is computed This feature gives the Regression Tree the ability to apply
as SAR = [Accuracy+AUC+(1-RMSE)]/3 where AUC and linear models to non-linear data since non-linear data is
RMSE are Area Under the Curve, and Root Mean Square common seen and is also in our case. They are Classification
Error, respectively. SAR ∈ [0,1] where the higher the better. and Regression Trees (CART) [19], Conditional Tree [20],
On the other hand, we use RMSE metric for regression Model Tree [21], Rule Based System [22], Bagging CART
task when we provide the comparison of several candidate [23], Random Forest [24], Gradient Boost Regression [25]
models for predicting algorithm performance. In regression, and Cubist [26].
RMSE is more suitable as it indicates the difference between ∙ The CART tree splits attributes to achieve minimize a
observed and predicted values. loss function. Each final split determines a sub-region
To generate a list of algorithms for classifying a particular of data space that indicate a linear relationship.
dataset, we use a minimum performance threshold (0.6) ∙ The Conditional Decision Tree (CDT) applies statistical
using the SAR metric. With a threshold parameter value tests to select split points of attributesto avoid selection
bigger than 0.5, the final list of algorithms includes only bias with splits.
those with high performance measured by the combined ∙ The Model Tree represents each leaf as a linear regres-
metric. Using threshold of 0.6 can be justified since in- sion model. Model tree aims at use reduction of error
cluding less performance algorithms in the final list is rate at each node when constructing a tree.
not productivity. Retaining only high performed algorithms ∙ The Rule based system simplifies a Decision Tree by
for a specific dataset reduces significantly computational removing parts of rules having low predictive power to
expense when we decide either more features need to be avoid overfitting.
collected in original dataset or further hyper-parameters is ∙ Bagging CART uses bootstrapping with aggregation
required. Finally, we generate a ranked list of algorithms by regression to reduce the variance of prediction. Each
predicted performance indicating how a particular algorithm model can be built independently.
may behave given an unknown dataset. ∙ Random Forest uses a random selection of features to
D. Data source split each node with bootstrap samples when building
trees.
We use two types of datasets, real and synthetic. From the ∙ Gradient Boost Regression extend the AdaBoost [27]
UCI repository [15], we select 73 datasets from a variety of using gradient boosting. It adds new models to learn
domains. The collection of data includes many datasets that misclassification errors in order to reduce bias.
have been used in similar work on algorithm selection [16], ∙ The Cubist tries to reduce the condition or a rule
[17], [10]. without increasing error rate. Cubist can adjust the
As real world datasets often come with noise and im- model prediction using a training set to improve its
precision due to error in measuring devices and human performance.
error, drawing precise conclusions from mined results suffers
from the fact that any assumption regarding data distribution The remaining models include Neural Network [28],
cannot be guaranteed. The artificial datasets also counter Support Vector Regression [29], K-Nearest Neighbor [30],
the limitation that selected real datasets cover only a small Partial Least Squared or PLS [31], Ridge Regression [32],
number of data domains. We use generators from the open Least Angle Regression or LARS [33], Elastic Net [34] and
source tool scikit-learn [18] to generate synthetic datasets Multi Variate Adaptive Regression Splines or MARS [35].
with at least four attributes. The number of classes vary from ∙ Neural Network connects its predictors to response
2 to 10 with a randomized number of instances between through its hidden units. Each unit receives information
100 and 4,000. Random noise is introduced in all of these from previous layers and generates output to next layer.
datasets. From generated synthetic datasets, we select only ∙ Support Vector Regression (SVR) searches data points
datasets with at least 4 features, resulting in 37 synthetic to support a regression. SVR uses a loss function with
datasets in a collection of 100 datasets. penalty for high performance in the presence of outliers.
∙ K-Nearest Neighbor Regression (KNN regression) lo-
E. Regression Models cates K nearest neighbors in the predictor space to
Our proposed predicting model generates performance for predict new value using the summary statistic
each of algorithms of choice. These predicted performances ∙ The PLS model tries to obtain a linear combination of
are used to compute a final ranked list. Among several independent variables that can maximize the covariance
regression models, we use Regression Tree models and other needed to separate groups.
state-of-the-art models. The basic idea of regression tree is ∙ The Ridge uses regularization to penalize by shrinking
to split data space into many subregions so that a model tree the coefficients toward 0 to counter against highly
can obtain high accuracy by computing the regression model correlated predictors in a dataset.
1501
Table IV strong evidence against the null hypothesis or that we reject
DATASETS USED AS EXAMPLES TO DEMONSTRATE THE FINAL RANKING
LIST
the null hypothesis. On the other hand, Friedman rank sum
test comes out as: 𝜒2 = 186.2428, df = 8, p-value = 2.2e-
Name Nominal Numeric Class Instances 18. Based on the result, we reject the null hypothesis that
Real data
credit-g 13 7 2 1000 all algorithms are the same at the 0.005 level. All codes
Kr-vs-kp 37 0 2 3196 in the study are provided in https://github.com/uccs-tdoan/
Abalone 1 7 28 4117 algorithmSelection
Waveform 0 40 3 5000
Shuttle 9 0 7 58000 Performances below and above the threshold (.6) with the
Synthetic data horizontal line indicates performance worse or better than
art-02 0 19 5 1038
art-07 0 23 3 4430
random guess, respectively (see Figures 2 and 3). These
art-12 0 12 10 1276 diagrams illustrate how algorithms are expected to perform
art-17 0 21 10 3266 in case of unknown datasets (in these figures, we have 5 real
art-24 0 18 2 3552
world and 5 synthetic datasets).
VI. D ISCUSSION
∙ LARS is based on the LASSO model [36]. It calculate Given training data obtained, we demonstrate the exper-
its move in the least angle direction for the next step iment to select the best regression model that we use to
among the currently most correlated covariates. generate predicted performances. The RMSE values in Table
∙ The Elastic Net is a generalized combination of two VI show that all the regression models are competitive in
penalized regression techniques: L1 (LASSO) and L2 prediction of algorithm performances. While the RMSEs are
(Ridge) introduced by [32] to exploit both advantages. less than 1 for all models, Cubist stands out as the best model
∙ MARS initially use surrogates features with only one in this study. Especially, the largest RMSE produced by
or two predictors at a time that indicate clearly linear Model Tree (Gradient Boost model) is still good compared
relationship to produce the best fit given the initial set. to the smallest RMSE produced by the best model (MARS)
Our approach differs from [37] and [38], which either use under the category of other linear and non-linear regression
only artificial datasets (the former) or associate a set of rules models. In reality, MARS is considered to be among the
with learner’s performance (the latter). most appealing models in regression. LARS, Elastic Net,
PLS, Neural Net perform well following the MARS model.
F. Evaluation approach Among the tree models, Model Tree and Gradient Boost are
We use two statistical significance tests to assess the result considered comparable to the others.
of our approach: Spearman’s rank correlation test [39] to Table VII shows a monotonically increasing relation-
measure how close the two ranks are, other is Friedman test ship between predicted ranks and observed ranks (indicated
to validate true ranking which is robust in case of normal by 𝜌 >0 Spearman’s rank coefficient). This positive rank
distribution is not guarantee. correlation implies the validity of generated ranks where
the predicted high rank is consistent with the observed
V. E XPERIMENTS rank. These results support our rank list based on predicted
Experiments are conducted on a system with Intel Core performance. Figure 2 and Figure 3 illustrate the predicted
i5, CPU 2.4Ghz and 6GB RAM. Weka API is used for performance with error ranges for all 23 algorithms using
all classification tasks, Python scripts for creating synthetic examples generated by the Cubist model on few extra real
datasets and R scripts for remain works. world and synthetic datasets, respectively. When the lowest
We use the stratified k-fold method [40] for tackling error is higher than the threshold (dotted line), we are
real world unbalanced datasets We estimate the predicted confident that the corresponding algorithm is capable of
performance using the regression models for all algorithms working well for the dataset and we include it in the ranked
(refer to Table II). Finally, predicted and observed values of list. The predicted values with error ranges imply further
performance of algorithms based on the test set are ordered improvement may be possible using hyperparameter optimal
and the following statistical tests are performed. search if tuning option is available.
Using Spearman’s rank correlation test report in Table Table V
VII, we see that all Spearman coefficients fall into four S PEARMAN R ANK C OEFFICIENT
groups (refer Table V): very strong (such as Neural Net,
Coefficient very weak weak moderate strong very strong
MARS, Bagging CART, Cubist, Conditional Decision Tree,
range .001-.19 .2-.39 .4-.59 .6-.79 .8-1
Random Forest), strong (SVR, CART, Model Tree, Rule
based System) and moderate (KNN, PLS, Ridge Regre-
The top 5 rank lists indicate that no single learner
sion, LARS and Elastic Net) and weak (Gradient Boost).
dominates on all datasets (Table VIII). The well-known
However, all p-values are significantly small, which imply
1502
Figure 2. Performance of algorithms on realworld examples.
Table VI
RMSE S BY MULTIPLE R EGRESSION M ODELS Table VII
S PEARMAN ’ S RANKING CORRELATION
Tree Models RMSE Other Models RMSE
CART 0.9291 Neuron Net 0.9699 Spearman Spearman
Tree Models Other Models
Conditional D.T. 0.9166 SVR 0.9714 rank coef. rank coef.
Model Tree 0.9308 KNN 0.9692 CART 0.7721 Neuron Network 0.8104
Rule Based Conditional D.T. 0.8856 SVR 0.7246
0.9166 PLS 0.9669
System Model Tree 0.6438 KNN 0.4849
Bagging CART 0.9251 Ridge 0.5731 Rule Based Sys. 0.6438 PLS 0.5619
Random Forest 0.9216 LARS 0.9668 Bagging CART 0.8374 Ridge 0.5731
Gradient Boost 0.9439 Elastic Net 0.9668 Random Forest 0.8977 LARS 0.5790
Cubist 0.9025 MARS 0.9626 Gradient Boost 0.3470 Elastic Net 0.5790
Cubist 0.8807 MARS 0.8385
SVM learner appears in half of all top 5 rank lists but is

recommended as first choice only one time across the real Table VIII
R ANKING OF ALGORITHM PERFORMANCE ON DATASETS
world and synthetic datasets. Some learners such as J48
(decision tree) and Part often occur in the top 5 lists and Rank First Second Third Fourth Fifth
abalone SVM AdaBoost J48 IBk Logistic
thus can be use as baseline when algorithms are compared waveform S.Logistic Logistic MultiL J48 JRip
or new learning algorithms are developed. Several learners shuttle Part Bagging IBk Kstar RForest
g-credit SLogistic Logistic Bagging LWL LBoost
such as KStar, LWL, Logistic with long training time also KRvsKP Part Bagging RandC. Logistic SLogistic
make their way into the ranking lists. In Figure 2, we observe art-02 MultiL SVM IBk Logistic simpleL
art-07 MultiL IBk RandomC Logistic RForest
that only half the algorithms are “suitable” or the abalone art-12 SVM IBk MultiL RandomC SMO
dataset whereas the shuttle dataset has a wide range of art-17 IBk MultiL SVM SMO RandC
“suitable” algorithms. Half the “suitable” algorithms have art-24 MultiL SVM IBk Logistic RandC
high performance on the second dataset, implying choices
1503
Figure 3. Performance of algorithms on synthetic datasets.
for practitioners. R EFERENCES

We note that our rank results obtained with the combined [1] R. King and C. e. a. Feng, “Statlog: comparison of classifi-
SAR metric are more robust than those obtained only with cation algorithms on large real-world problems,” Appl Artif
Intell an Intern Journal, vol. 9, no. 3, pp. 289–333, 1995.
the accuracy metric [7]. For instance, [7] rank MultiLayer
Perceptron in first place but we rank it seventh due to low [2] H. Berrer and I. e. a. Paterson, “Evaluation of machine-
AUC of this algorithm’s performance on abalone. learning algorithm ranking advisors,” in In Proceedings of
the PKDD-2000 Workshop on DataMining, Decision Support,
Meta-Learning and ILP. Citeseer, 2000.
[3] H. Bensusan and C. Giraud-Carrier, “Discovering task neigh-

VII. C ONCLUSION AND F UTURE W ORK bourhoods through landmark learning performances,” in Prin-
ciples of Data Mining and Knowledge Discovery. Springer,
2000, pp. 325–330.
Given an unknown dataset, our proposed approach are
able give a robust ranked list of learning algorithms with [4] Y. Peng and P. A. e. a. Flach, “Decision tree-based data
characterization for meta-learning,” 2002.
default parameter setting. Furthermore the ranked list can
be use to select only feasible algorithms for Auto-Weka [5] H. Bensusan and C. G. e. a. Giraud-Carrier, “A higher-order
execution. One limitation in this work is the minimum 4 approach to meta-learning.” in ILP Work-in-progress reports,
features of original datasets. We will address this limitation 2000.
in future work and investigate the practical use of our [6] C. Castiello and G. e. a. Castellano, “Meta-data: Charac-
proposed approach in a field of NLP (Natural Language terization of input features for meta-learning,” in Modeling
Processing). There is a possible extension of this work Decisions for Artificial Intelligence. Springer, 2005, pp. 457–
including integration with Auto-Weka and combination of 468.
performance with run time factor. Our future work also is [7] S. Ali and K. A. Smith, “On learning algorithm selection for
to extend this study in case of a large scale data, particular classification,” Appl Soft Compu, vol. 6, no. 2, pp. 119–138,
for big data challenge. 2006.
1504
[8] I. Guyon and A. Elisseeff, “An introduction to variable [25] J. H. Friedman, “Stochastic gradient boosting,” Computa-
and feature selection,” The Journal of Machine Learning tional Statistics & Data Analysis, vol. 38, no. 4, pp. 367–378,
Research, vol. 3, pp. 1157–1182, 2003. 2002.
[9] H. Liu and H. Motoda, Feature extraction, construction and [26] J. R. Quinlan, “Combining instance-based and model-based
selection: A data mining perspective. Springer, 1998. learning,” in Proceedings of the Tenth International Confer-
ence on Machine Learning, 1993, pp. 236–243.
[10] C. Thornton and F. e. a. Hutter, “Auto-weka: Combined
selection and hyperparameter optimization of classification [27] Y. Freund and R. E. Schapire, “A desicion-theoretic general-
algorithms,” in Proceedings of the 19th ACM SIGKDD. ization of on-line learning and an application to boosting,” in
ACM, 2013, pp. 847–855. Computational learning theory. Springer, 1995, pp. 23–37.
[11] L. De Lathauwer and B. e. a. De Moor, “Singular value [28] C. M. Bishop et al., “Neural networks for pattern recogni-
decomposition,” in Proc. EUSIPCO-94, Edinburgh, Scotland, tion,” 1995.
UK, vol. 1, 1994, pp. 175–178.
[29] H. Drucker, C. J. Burges, L. Kaufman, A. Smola, and
[12] D. N. Reshef and Y. A. e. a. Reshef, “Detecting novel V. Vapnik, “Support vector regression machines,” Advances in
associations in large data sets,” science, vol. 334, no. 6062, neural information processing systems, vol. 9, pp. 155–161,
pp. 1518–1524, 2011. 1997.
[30] O. Kramer, Dimensionality Reduction with Unsupervised

[13] R. Caruana and A. Niculescu-Mizil, “Data mining in metric
Nearest Neighbors. Springer, 2013.
space: an empirical analysis of supervised learning perfor-
mance criteria,” in Proceedings of the 10th ACM SIGKDD.
[31] H. Wold et al., “Estimation of principal components and re-
ACM, 2004, pp. 69–78.
lated models by iterative least squares,” Multivariate analysis,
vol. 1, pp. 391–420, 1966.
[14] C. Ferri and J. e. a. Hernández-Orallo, “An experimental com-
parison of performance measures for classification,” Pattern [32] A. E. Hoerl and R. W. Kennard, “Ridge regression: Biased es-
Recognition Letters, vol. 30, no. 1, pp. 27–38, 2009. timation for nonorthogonal problems,” Technometrics, vol. 12,
no. 1, pp. 55–67, 1970.
[15] C. Blake and C. J. Merz, “{UCI} repository of machine
learning databases,” 1998. [33] B. Efron, T. Hastie, I. Johnstone, R. Tibshirani et al., “Least
angle regression,” The Annals of statistics, vol. 32, no. 2, pp.
[16] R. Leite and P. e. a. Brazdil, “Selecting classification algo- 407–499, 2004.
rithms with active testing,” in ML and Data Mining in Pattern
Recognition. Springer, 2012, pp. 117–131. [34] H. Zou and T. Hastie, “Regularization and variable selection
via the elastic net,” Journal of the Royal Statistical Society:
[17] M. Reif and F. e. a. Shafait, “Meta-learning for evolution- Series B (Statistical Methodology), vol. 67, no. 2, pp. 301–
ary parameter optimization of classifiers,” Machine learning, 320, 2005.
vol. 87, no. 3, pp. 357–380, 2012.
[35] J. H. Friedman, “Multivariate adaptive regression splines,”
[18] F. Pedregosa and G. e. a. Varoquaux, “Scikit-learn: Machine The annals of statistics, pp. 1–67, 1991.
learning in Python,” vol. 12, pp. 2825–2830, 2011.
[36] R. Tibshirani, “Regression shrinkage and selection via the
[19] L. Breiman and J. e. a. Friedman, Classification and regres- lasso,” Journal of the Royal Statistical Society, pp. 267–288,
sion trees. CRC press, 1984. 1996.
[20] T. Hothorn, K. Hornik, and A. Zeileis, “Unbiased recursive [37] C. Köpf and C. e. a. Taylor, “Meta-analysis: from data
partitioning: A conditional inference framework,” Journal of characterisation for meta-learning to meta-regression,” in Pro-
Computational and Graphical statistics, vol. 15, no. 3, pp. ceedings of the PKDD-00 workshop on data mining, decision
651–674, 2006. support, meta-learning and ILP. Citeseer, 2000.
[21] J. R. Quinlan et al., “Learning with continuous classes,” in 5th [38] H. Bensusan and A. Kalousis, “Estimating the predictive
Australian joint conference on artificial intelligence, vol. 92. accuracy of a classifier,” in Machine Learning: ECML 2001.
Singapore, 1992, pp. 343–348. Springer, 2001, pp. 25–36.
[22] G. Holmes, M. Hall, and E. Prank, Generating rule sets from [39] M. Kuhn and K. Johnson, Applied predictive modeling.
model trees. Springer, 1999. Springer, 2013.
[40] P. Refaeilzadeh, L. Tang, and H. Liu, “Cross-validation,” in

[23] L. Breiman, “Bagging predictors,” Machine learning, vol. 24,
Encyclopedia of database systems. Springer, 2009, pp. 532–
no. 2, pp. 123–140, 1996.
538.
[24] ——, “Random forests,” Machine learning, vol. 45, no. 1,
pp. 5–32, 2001.
1505

Doan 2015

Uploaded by

Doan 2015

Uploaded by

2015 IEEE 15th International Conference on Data Mining Workshops

Selecting Machine Learning Algorithms using Regression Models

Tri Doan Jugal Kalita

978-1-4673-8493-3/15 $31.00 © 2015 IEEE 1498

SVM learner appears in half of all top 5 rank lists but is

high performance on the second dataset, implying choices

for practitioners. R EFERENCES

[3] H. Bensusan and C. Giraud-Carrier, “Discovering task neigh-

[30] O. Kramer, Dimensionality Reduction with Unsupervised

[40] P. Refaeilzadeh, L. Tang, and H. Liu, “Cross-validation,” in

You might also like