Variable Selection Using The caret Package
Max Kuhn
[email protected]
Many feature selection routines use a "wrapper" approach to find appropriate variables: an algorithm searches the feature space, repeatedly fitting the model with different predictor sets. The best predictor set is determined by some measure of performance (e.g. R^2, classification accuracy, etc.). Examples of search functions are genetic algorithms, simulated annealing and forward/backward/stepwise selection methods. In theory, each of these search routines could converge to an optimal set of predictors.
An example of one search routine is backwards selection (a.k.a. recursive feature elimination).
First, the algorithm fits the model to all predictors. Each predictor is ranked by its importance to the model. Let S be a sequence of ordered numbers which are candidate values for the number of predictors to retain (S1 > S2, . . .). At each iteration of feature selection, the Si top-ranked predictors are retained, the model is refit and performance is assessed. The value of Si with the best performance is determined and the top Si predictors are used to fit the final model. Algorithm 1 has a more complete definition.
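As a concrete illustration, the loop structure of backwards selection can be sketched in a few lines of base R. This is not caret's implementation: the linear model, the use of absolute t-statistics as the importance measure, the simulated data and the candidate subset sizes are all assumptions made for the example.

```r
## A minimal sketch of recursive feature elimination in base R.
## Assumptions: linear model fit, |t|-statistic as the importance measure.
set.seed(1)
n <- 100
x <- data.frame(matrix(rnorm(n * 10), n, 10))
y <- 2 * x[[1]] - 3 * x[[2]] + rnorm(n)

## Fit the model to all predictors and rank them once by importance
full <- lm(y ~ ., data = x)
imp <- abs(coef(summary(full))[-1, "t value"])
ranked <- names(sort(imp, decreasing = TRUE))

## Refit with the top Si predictors for each candidate subset size
sizes <- c(10, 5, 2, 1)
rmse <- sapply(sizes, function(s) {
  keep <- ranked[1:s]
  fit <- lm(y ~ ., data = x[, keep, drop = FALSE])
  sqrt(mean(residuals(fit)^2))  # apparent (resubstitution) RMSE
})
best <- sizes[which.min(rmse)]
```

Note that assessing performance on the training data, as in this sketch, always favors the largest subset (here, `best` is 10, since the residual sum of squares can only decrease as predictors are added); held-back samples are needed to pick a subset size honestly.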
The algorithm has an optional step (line 1.8) where the predictor rankings are recomputed from the model fit to the reduced feature set. Svetnik et al. (2004) showed that, for random forest models, there was a decrease in performance when the rankings were re-computed at every step. However, in other cases, when the initial rankings are not good (e.g. linear models with highly collinear predictors), re-calculation can slightly improve performance.
One potential issue is over-fitting to the predictor set: the wrapper procedure could focus on nuances of the training data that are not found in future samples (i.e. over-fitting to both predictors and samples).
For example, suppose a very large number of uninformative predictors were collected and one such predictor was randomly correlated with the outcome. The RFE algorithm would give a good rank to this variable and the prediction error (on the same data set) would be lowered. It would take a different test/validation set to discover that this predictor was uninformative. Ambroise and McLachlan (2002) referred to this as "selection bias."
Since feature selection is part of the model building process, resampling methods (e.g. cross-validation, the bootstrap) should factor in the variability caused by feature selection when calculating performance. For example, the RFE procedure in Algorithm 1 estimates model performance on line 1.6, but this estimate is computed during the selection process itself. Ambroise and McLachlan (2002) and Svetnik et al. (2004) showed that improper use of resampling to measure performance will result in models that perform poorly on new samples.
To get performance estimates that incorporate the variation due to feature selection, it is suggested that the steps in Algorithm 1 be encapsulated inside an outer layer of resampling (e.g. 10-fold cross-validation). Algorithm 2 shows a version of the algorithm that uses resampling.
While this will provide better estimates of performance, it is more computationally burdensome.
For users with access to machines with multiple processors, the first For loop in Algorithm 2 (line
2.1) can be easily parallelized. Another complication to using resampling is that multiple lists of
the “best” predictors are generated at each iteration. At first this may seem like a disadvantage, but
it does provide a more probabilistic assessment of predictor importance than a ranking based on a
single fixed data set. At the end of the algorithm, a consensus ranking can be used to determine
the best predictors to retain.
2.1 for Each resampling iteration do
2.2 Partition data into training and test/hold-back set via resampling
2.3 Tune/train the model on the training set using all predictors
2.4 Predict the held-back samples
2.5 Calculate variable importance or rankings
2.6 for Each subset size Si, i = 1 . . . S do
2.7 Keep the Si most important variables
2.8 Tune/train the model on the training set using Si predictors
2.9 Predict the held-back samples
2.10 [Optional] Recalculate the rankings for each predictor
2.11 end
2.12 end
2.13 Calculate the performance profile over the Si using the held-back samples
2.14 Determine the appropriate number of predictors
2.15 Estimate the final list of predictors to keep in the final model
2.16 Fit the final model based on the optimal Si using the original training set
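The nested structure of the resampled algorithm can be sketched in base R as follows. The linear model, the |t|-statistic ranking, the simulated data and 5-fold cross-validation are illustrative assumptions, not caret's implementation.

```r
## A sketch of RFE wrapped in an outer resampling loop (base R only).
## Assumptions: linear model, |t|-statistic importance, 5-fold CV.
set.seed(2)
n <- 100
x <- data.frame(matrix(rnorm(n * 10), n, 10))
y <- 2 * x[[1]] - 3 * x[[2]] + rnorm(n)
sizes <- c(10, 5, 2, 1)
folds <- sample(rep(1:5, length.out = n))

## Held-back RMSE for every fold and subset size (the two nested loops)
perf <- sapply(1:5, function(k) {
  train <- folds != k
  full <- lm(y[train] ~ ., data = x[train, ])
  ranked <- names(sort(abs(coef(summary(full))[-1, "t value"]),
                       decreasing = TRUE))
  sapply(sizes, function(s) {
    keep <- ranked[1:s]
    fit <- lm(y[train] ~ ., data = x[train, keep, drop = FALSE])
    pred <- predict(fit, x[!train, keep, drop = FALSE])
    sqrt(mean((y[!train] - pred)^2))
  })
})

## Average the profile over folds and pick the best subset size
profile <- rowMeans(perf)
bestSize <- sizes[which.min(profile)]
```

Because performance is now measured on samples held back from the fitting and ranking steps, the profile no longer automatically favors the largest subset.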
For a specific model, a set of functions must be specified in rfeControl$functions. Section 3.2
below has descriptions of these sub–functions. There are a number of pre–defined sets of functions
for several models, including: linear regression (in the object lmFuncs), random forests (rfFuncs),
naive Bayes (nbFuncs), bagged trees (treebagFuncs) and functions that can be used with caret’s
train function (caretFuncs). The latter is useful if the model has tuning parameters that must be
determined at each iteration.
3.1 An Example
To test the algorithm, the "Friedman 1" benchmark (Friedman, 1991) was used. There are five informative variables generated by the equation

y = 10 sin(π x1 x2) + 20 (x3 − 0.5)^2 + 10 x4 + 5 x5 + N(0, σ^2)
In the simulation used here, of the 50 predictors, 45 are pure noise variables: 5 are uniform on [0, 1] and 40 are random univariate standard normals. The predictors are centered and scaled.
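The simulation code itself is not reproduced here, but data of this form might be generated as follows. The sample size and the noise level sigma = 1 are assumptions made for the sketch.

```r
## Sketch of "Friedman 1" data: 10 uniform predictors (x1-x5 informative,
## x6-x10 noise) plus 40 standard-normal noise predictors. The sample size
## and sigma are assumptions; the original simulation code is not shown.
set.seed(100)
n <- 100
sigma <- 1
x <- matrix(runif(n * 10), n, 10)
y <- 10 * sin(pi * x[, 1] * x[, 2]) + 20 * (x[, 3] - 0.5)^2 +
     10 * x[, 4] + 5 * x[, 5] + rnorm(n, sd = sigma)

## Append the 40 random univariate standard normals
x <- cbind(x, matrix(rnorm(n * 40), n, 40))
colnames(x) <- paste0("var", 1:ncol(x))

## Center and scale the predictors, as in the text
x <- scale(x)
```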
The simulation will fit models with subset sizes of 25, 20, 15, 10, 5, 4, 3, 2, 1.
As previously mentioned, to fit linear models, the lmFuncs set of functions can be used. To do
this, a control object is created with the rfeControl function. We also specify that 10–fold cross–
validation should be used in line 2.1 of Algorithm 2. The number of folds can be changed via the
number argument to rfeControl (defaults to 10). The verbose option prevents copious amounts
of output from being produced and the returnResamp argument specifies that the 10 performance
estimates should be saved only for the optimal subset size.
> set.seed(10)
> ctrl <- rfeControl(functions = lmFuncs, method = "cv", verbose = FALSE,
+ returnResamp = "final")
> lmProfile <- rfe(x, y, sizes = subsets, rfeControl = ctrl)
> lmProfile
The output shows that the best subset size was estimated to be 3 predictors. This set includes some of the informative variables, but not all of them. The predictors function can be used to get a text string of the variable names that were picked in the final model. lmProfile is a list of class "rfe" that contains an object fit, the final linear model fit with the remaining terms. This model can be used to get predictions for future or test samples.
> predictors(lmProfile)
> lmProfile$fit
Call:
lm(formula = y ~ ., data = tmp)
Coefficients:
(Intercept) var4 var5 var2
14.613 2.625 1.967 1.648
> lmProfile$resample
There are also several plot methods to visualize the results. plot(lmProfile) produces the performance profile across different subset sizes, as shown in Figure 1. Also, the resampling results are stored in the sub-object lmProfile$resample and can be used with several lattice functions. Univariate lattice functions (densityplot, histogram) can be used to plot the resampling distribution, while bivariate functions (xyplot, stripplot) can be used to plot the distributions for different subset sizes. In the latter case, the option returnResamp = "all" in rfeControl can be used to save all the resampling results. See Figure 4 for two examples.
[Figure 1 appears here: resampled RMSE (top panel) and resampled R^2 (bottom panel) plotted against the number of variables.]
Figure 1: Performance profiles for recursive feature elimination using linear models. These images
were generated by plot(lmProfile) and plot(lmProfile, metric = "Rsquared").
To use feature elimination for an arbitrary model, a set of functions must be passed to rfe for each
of the steps in Algorithm 2. This section defines those functions and uses the existing random forest
functions as an illustrative example.
This function builds the model based on the current data set (lines 2.3, 2.8 and 2.16). The arguments for the function must be:
• x: the current training set of predictor data with the appropriate subset of variables
• y: the current outcome data
• first: a single logical value for whether the current predictor set has all possible variables (e.g. line 2.3)
• last: similar to first, but TRUE when the last model is fit with the final subset size and predictors (line 2.16)
The function should return a model object that can be used to generate predictions. For random
forest, the fit function is simple:
> rfFuncs$fit
For feature selection without re-ranking at each iteration, the random forest variable importances only need to be computed on the first iteration, when all of the predictors are in the model. This can be accomplished by passing importance = first to the randomForest function.
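As an illustration of the required signature, a user-defined fit function might look like the following. This is a sketch in the style of lmFuncs using a linear model, not caret's actual code; the name myFit is hypothetical.

```r
## A sketch of a user-defined fit function with the required signature
## (x, y, first, last, ...). This mimics a linear-model fit in the style
## of lmFuncs; it is not caret's actual implementation.
myFit <- function(x, y, first, last, ...) {
  tmp <- as.data.frame(x)
  tmp$y <- y
  lm(y ~ ., data = tmp)
}

## The returned object can be used to generate predictions later
dat <- data.frame(a = rnorm(20), b = rnorm(20))
mod <- myFit(dat, 3 * dat$a + rnorm(20), first = TRUE, last = FALSE)
```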
This function returns a vector of predictions (numeric or factors) from the current model (lines 2.4 and 2.9). The input arguments must be:
• object: the model generated by the fit function
• x: the current set of predictor data for the held-back samples
For random forests, the function is a simple wrapper for the predict function:
> rfFuncs$pred
function (object, x)
{
predict(object, x)
}
<environment: namespace:caret>
For classification, it is probably a good idea to ensure that the resulting factor vector of predictions has the same levels as the input data.
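A small sketch of why the levels matter: if some class is never predicted for the held-back samples, rebuilding the factor with the full set of training levels keeps downstream performance calculations consistent. The data here are made up for illustration.

```r
## If class "C" is never predicted, a naive factor() would drop that level,
## which can break performance summaries; supplying the training levels
## explicitly keeps the factor consistent.
obsLevels <- c("A", "B", "C")
rawPred <- c("A", "A", "B")                 # class "C" never predicted
pred <- factor(rawPred, levels = obsLevels)
```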
This function is used to return the predictors in the order of the most important to the least important (lines 2.5 and 2.10). Inputs are:
• object: the model generated by the fit function
• x: the current set of predictor data
• y: the current outcome data
The function should return a data frame with a column called var that has the current variable names. The first row should be the most important predictor, etc. Other columns can be included in the output and will be returned in the final rfe object.
For random forests, the function below uses caret's varImp function to extract the random forest importances and orders them. For classification, randomForest will produce a column of importances for each class. In this case, the default ranking function orders the predictors by the average importance across the classes.
> rfFuncs$rank
function (object, x, y)
{
vimp <- varImp(object)
if (is.factor(y)) {
if (all(levels(y) %in% colnames(vimp))) {
avImp <- apply(vimp[, levels(y), drop = TRUE], 1,
mean)
vimp$Overall <- avImp
}
}
vimp <- vimp[order(vimp$Overall, decreasing = TRUE), , drop = FALSE]
vimp$var <- rownames(vimp)
vimp
}
<environment: namespace:caret>
This function determines the optimal number of predictors based on the resampling output (line
2.14). Inputs for the function are:
• x: a matrix with columns for the performance metrics and the number of variables, called
Variables
• metric: a character string of the performance measure to optimize (e.g. RMSE, Accuracy)
This function should return an integer corresponding to the optimal subset size.
caret comes with two example functions for this purpose: pickSizeBest and pickSizeTolerance. The former simply selects the subset size that has the best value. The latter takes into account the whole profile and tries to pick a subset size that is small without sacrificing too much performance. For example, suppose we have computed the RMSE over a series of subset sizes:
RMSE Variables
1 3.215 1
2 2.819 2
3 2.414 3
4 2.144 4
5 2.014 5
6 1.997 6
7 2.025 7
8 1.987 8
9 1.971 9
10 2.055 10
11 1.935 11
12 1.999 12
13 2.047 13
14 2.002 14
15 1.895 15
16 2.018 16
These are depicted in Figure 2. The solid circle identifies the subset size with the absolute smallest RMSE. However, there are many smaller subsets that produce approximately the same performance but with fewer predictors. In this case, we might be able to accept a slightly larger error in exchange for fewer predictors.
The pickSizeTolerance function determines the absolute best value and then the percent difference of the other points relative to this value. In the case of RMSE, this would be

RMSE_tol = 100 × (RMSE − RMSE_opt) / RMSE_opt

where RMSE_opt is the absolute best error rate. These "tolerance" values are plotted in the bottom panel of Figure 2. The solid triangle is the smallest subset size that is within 10% of the optimal value.
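Applying this tolerance calculation to the RMSE profile tabulated above (in base R, rather than calling pickSizeTolerance itself):

```r
## The RMSE profile from the table above, indexed by subset size 1-16
rmse <- c(3.215, 2.819, 2.414, 2.144, 2.014, 1.997, 2.025, 1.987,
          1.971, 2.055, 1.935, 1.999, 2.047, 2.002, 1.895, 2.018)
vars <- 1:16

## Percent difference of each point from the absolute best RMSE
tol <- 100 * (rmse - min(rmse)) / min(rmse)
bestAbs <- vars[which.min(rmse)]   # subset size with the smallest RMSE: 15
within10 <- min(vars[tol <= 10])   # smallest size within 10% of optimal: 5
```

Size 5 has an RMSE of 2.014, about 6.3% above the optimum of 1.895, so it is the smallest subset within the 10% tolerance.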
This approach can produce good results for many tree-based models, such as random forests, where there is a plateau of good performance for larger subset sizes. For trees, this is usually because unimportant variables are infrequently used in splits and do not significantly affect performance.
After the optimal subset size is determined, this function is used to calculate the best rankings for each variable across all the resampling iterations (line 2.15). Inputs for the function are:
• y: a list of variable importances for each resampling iteration and each subset size (generated by the user-defined rank function). In the example, for each of the cross-validation groups
[Figure 2 appears here: RMSE (top panel) and tolerance values (bottom panel) plotted against the number of variables.]
Figure 2: An example of where a smaller subset size is not necessarily the best choice. The solid circle in the top panel indicates the subset size with the absolute smallest RMSE. If the percent differences from the smallest RMSE are calculated (lower panel), the user may want to accept a pre-specified drop in performance as long as the drop is within some limit of the optimal.
the output of the rank function is saved for each of the 10 subset sizes (including the original subset). If the rankings are not recomputed at each iteration, the values will be the same within each cross-validation iteration.
• size: the number of predictors to retain
This function should return a character string of predictor names (of length size) in the order of most important to least important.
For random forests, only the first importance calculation (line 2.5) is used since these are the
rankings on the full set of predictors. These importances are averaged and the top predictors are
returned.
> rfFuncs$selectVar
Note that if the predictor rankings are recomputed at each iteration (line 2.10) the user will need
to write their own selection function to use the other ranks.
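The averaging described above can be sketched as follows. The importance values here are made up for illustration, and the data frames mimic the output of the rank function (a var column plus an Overall importance).

```r
## A sketch of consensus ranking: average per-resample importances and
## return the top `size` variable names. The values are illustrative.
impList <- list(
  data.frame(Overall = c(9, 7, 1), var = c("var1", "var2", "var3")),
  data.frame(Overall = c(8, 6, 2), var = c("var1", "var2", "var3"))
)
size <- 2

## Average importance per variable across resamples, then keep the top `size`
allImp <- do.call(rbind, impList)
avg <- tapply(allImp$Overall, allImp$var, mean)
keep <- names(sort(avg, decreasing = TRUE))[1:size]
```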
For random forests, we fit the same series of model sizes as the linear model. The option to save all the resampling results across subset sizes was changed for this model, and these results are used to show the lattice plot capabilities in Figure 4.
> set.seed(10)
> ctrl$functions <- rfFuncs
> ctrl$returnResamp <- "all"
> rfProfile <- rfe(x, y, sizes = subsets, rfeControl = ctrl)
> print(rfProfile)
4 Session Information
• R version 2.9.0 (2009-04-17), x86_64-apple-darwin9.6.0
• Locale: en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
• Base packages: base, datasets, graphics, grDevices, grid, methods, splines, stats, tools, utils
• Other packages: caret 4.19, class 7.2-46, e1071 1.5-19, ellipse 0.3-5, gbm 1.6-3, Hmisc 3.5-0,
ipred 0.8-6, kernlab 0.9-8, klaR 0.5-8, lattice 0.17-22, MASS 7.2-46, mlbench 1.1-5, nnet 7.2-46,
pls 2.1-0, proxy 0.4-1, randomForest 4.5-30, rpart 3.1-43, survival 2.35-4
5 References
Ambroise, C. and McLachlan, G. J. (2002) "Selection bias in gene extraction on the basis of microarray gene-expression data," Proceedings of the National Academy of Sciences, 99, 6562–6566
[Figure 3 appears here: resampled RMSE (top panel) and resampled R^2 (bottom panel) for the random forest model plotted against the number of variables.]
[Figure 4 appears here: RMSE cross-validation estimates plotted against the number of variables, and density plots of the estimates for subset sizes below 5.]
Figure 4: Resampling RMSE estimates for random forests across different subset sizes. These plots
were generated using xyplot(rfProfile) and densityplot(rfProfile, subset = Variables <
5)
Svetnik, V., Liaw, A., Tong, C. and Wang, T. (2004) "Application of Breiman's random forest to modeling structure-activity relationships of pharmaceutical molecules," Multiple Classifier Systems, Fifth International Workshop, 3077, 334–343