
IEEE GEOSCIENCE AND REMOTE SENSING LETTERS 1

A Systematic Approach for Variable Selection With Random Forests: Achieving Stable Variable Importance Values
Amir Behnamian, Koreen Millard, Sarah N. Banks, Lori White, Murray Richardson, and Jon Pasher

Abstract— Random Forests variable importance measures are often used to rank variables by their relevance to a classification problem and subsequently reduce the number of model inputs in high-dimensional data sets, thus increasing computational efficiency. However, as a result of the way that training data and predictor variables are randomly selected for use in constructing each tree and splitting each node, it is also well known that if too few trees are generated, variable importance rankings tend to differ between model runs. In this letter, we characterize the effect of the number of trees (ntree) and class separability on the stability of variable importance rankings and develop a systematic approach to define the number of model runs and/or trees required to achieve stability in variable importance measures. Results demonstrate that either a large ntree for a single model run or values averaged across multiple model runs with fewer trees are sufficient for achieving stable mean importance values. While the latter is far more computationally efficient, both methods tend to lead to the same ranking of variables. Moreover, the optimal number of model runs differs depending on the separability of classes. Recommendations are made to users regarding how to determine the number of model runs and/or trees required to achieve stable variable importance rankings.

Index Terms— Mean decrease in accuracy (MDA), mean decrease in Gini (MDG) index, random forest, variable reduction.

Manuscript received June 13, 2017; revised July 27, 2017; accepted August 19, 2017. This work was supported in part by Environment and Climate Change Canada and in part by Defence Research and Development Canada. (Corresponding author: Amir Behnamian.)
A. Behnamian, S. N. Banks, L. White, and J. Pasher are with Environment and Climate Change Canada, National Wildlife Research Centre, Ottawa, ON K1S 5B6, Canada (e-mail: [email protected]; [email protected]; [email protected]; [email protected]).
K. Millard is with Defence Research and Development Canada, Ottawa, ON K1A 0Z4, Canada (e-mail: [email protected]).
M. Richardson is with the Department of Geography and Environmental Studies, Carleton University, Ottawa, ON K1S 5B6, Canada (e-mail: [email protected]).
Digital Object Identifier 10.1109/LGRS.2017.2745049

I. INTRODUCTION

Random Forests, based on ensembles of classification and regression trees, has become a widely used classification approach in various fields, including remote sensing. It is relatively easy to implement in a variety of software packages (e.g., R Statistics and Python) and is also computationally efficient. The latter is especially relevant today, since high-dimensional data sets from different sources are widely available and are commonly used for image classification. However, in many cases, not all data sets and predictor variables provide relevant information to the classifier, and thus it is often desirable to reduce the model data load to the fewest number of inputs with maximal predictive accuracy. This is especially relevant for large data sets (e.g., Landsat imagery for all of Canada; RADARSAT-2 archive data; and, with its four-day repeat pass cycle, high-frequency temporal data via the RADARSAT Constellation Mission in the near future [1]–[3]) and/or data acquired from multiple sensors. Reducing model data load can reduce processing times and storage requirements, and can also be used to inform long-term analyses, as attention can focus on just the sensors and variables that provide relevant information to a given classification problem. Furthermore, it has also been demonstrated that with very high dimensional data sets, results can be noisier than those of models where only the most important variables are used [4]. Both the mean decrease in accuracy (MDA) and the mean decrease in Gini (MDG) are commonly used statistical measures of variable importance for determining which predictor variables are best suited to differentiate the classes of interest and for reducing the dimensionality of large data sets [4]–[7]. MDA quantifies variable importance by measuring the change in prediction accuracy when the values of the variable are randomly permuted. MDG is the sum of all decreases in Gini impurity due to a given variable, normalized by the number of trees (ntree) [8], [9]. However, because of the random way in which training data and variables are selected to determine the split at each node in Random Forests, importance rankings differ from one model run to another, especially if only a small ntree is generated [4], [7], [10]–[12]. As such, users should not rely on rankings derived from a single model run [13]–[15].

II. BACKGROUND

A conservative approach to dealing with varying importance values is to average outputs from a sufficiently large number of forests with a sufficiently large ntree (e.g., 50 forests with more than 1000 trees each), followed by a "forward" or "reverse" stepwise approach that reduces model inputs to only the most important predictor variables, until the minimum out-of-bag error (OOBE) is achieved [12], [16]. It is notable that an iterative variable importance reduction (i.e., recalculating variable importance) is computationally expensive for big data sets (in this context, and throughout this letter, computational expense refers specifically to the amount of time required to generate importance values and/or predict the classification) [6].
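The letter's experiments use the randomForest package in R; as a hedged illustration only, an MDA-style permutation importance averaged over repeated forests can be sketched in Python with scikit-learn. All data, function names, and parameter choices below are our own assumptions, not from the letter:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic stand-in for the predictor stack: the letter uses ~50 real
# Landsat/RADARSAT-2/DEM variables; here we fabricate 10 for illustration.
X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=4, random_state=0)

def mean_permutation_importance(n_runs, n_trees):
    """Average MDA-style permutation importance over repeated forests."""
    runs = []
    for seed in range(n_runs):
        rf = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
        rf.fit(X, y)
        # Permute each variable in turn and record the drop in accuracy.
        # (randomForest's MDA permutes out-of-bag samples; this sketch
        # permutes the training set, which is a simplification.)
        result = permutation_importance(rf, X, y, n_repeats=5,
                                        random_state=seed)
        runs.append(result.importances_mean)
    return np.mean(runs, axis=0)

importance = mean_permutation_importance(n_runs=5, n_trees=50)
ranking = np.argsort(importance)[::-1]  # most to least important variable
```

Averaging over several seeds here plays the same role as averaging over several forests in the letter: it damps the run-to-run variation in the importance values before ranking.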

1545-598X © 2017 Crown Copyright



Several automated methods of variable selection with Random Forests exist (implemented in software such as R). For example, ggRandomForests [17], [18] assumes that the variables used in the splits closest to the root are the most important. However, it appears to base this ranking on a single forest. varSelRF [10], [16] runs a single model, removes the 20% least important variables, and then recalculates errors iteratively with the new set of variables until an unacceptable level of error is reached. While this process is iterative, the rankings are based on a single model run. VSURF runs Random Forests in a two-step process [11]. First, it ranks the variables based on the average over 50 runs and removes those ranked below a threshold. Then, it sequentially builds Random Forest models and monitors OOBEs by adding variables, starting from the most important and excluding those variables that do not improve the OOBE (based on the average error over 25 runs).

While it is known that the ntree used in a Random Forest model can impact the stability of variable importance [11], a systematic analysis of the convergence of importance values to a stable mean has not been undertaken in previous studies (in this context, and throughout this letter, a stable mean importance value is one that closely approximates the true mean importance value, which is unknown to the user). Running a model multiple times and subsequently averaging importance values will eventually lead to a stable ranking of important variables. However, the optimum number of model runs using an optimum ntree should be determined in order to maximize computational efficiency. Several attempts have been made to determine an optimum value for the latter parameter [19], but these have not addressed a link to the former.

III. OBJECTIVE

Given the inherent random variation of importance values, we hypothesized that average variable importance values will converge (to their true, but unknown, mean) after a certain number of model runs. This may occur across relatively few models, and thus unnecessary processing can be avoided. The primary objective of this letter is to develop a systematic approach for determining the number of model runs (i.e., forests) required to achieve a stable mean variable importance value. We also address whether the point of convergence varies as a function of the ntree generated per Random Forest model, as well as the separability of the classes (referring to the physical separation of class values within multivariate feature space). With respect to the latter, we hypothesized that the convergence of variable importance will depend on the separability of the classes, as classification accuracies are higher and more stable in cases where there is good separability. Thus, in this letter, we have analyzed two data sets: one with "poor" and one with "good" separability.

IV. METHOD

A. Study Areas

Two study sites are considered in this analysis. The first site (hereafter referred to as Coronation Gulf) encompasses the entirety of Coronation Gulf, Bathurst Inlet, and Dease Strait, Nunavut (Table I), where the focus is to classify shoreline types. This site and data set have been previously presented in [6]. The other study site (hereafter referred to as Alfred Bog) centers on a large peatland complex in South Eastern Ontario and was previously presented in [4]. For this site, the focus is to discriminate peatland types and to differentiate peatland and nonpeatland classes (Table I). For additional site-specific details, as well as information on model training and validation data, readers are referred to [2], [4], and [6].

TABLE I
LAND COVER CLASSES CONSIDERED IN THIS LETTER

B. Remote Sensing Data

For both study areas, a combination of Landsat, RADARSAT-2, and digital elevation model variables was provided as inputs to the model. In total, 49 variables were used for Coronation Gulf and 50 for Alfred Bog. Banks et al. [6] provide all image processing details, which were followed exactly for the Alfred Bog study area, with the exception that Landsat 8 imagery and Shuttle RADAR Topography Mission data were used in place of Landsat 5 imagery and Canadian Digital Elevation Data. RADARSAT-2 data for Alfred Bog were also Boxcar filtered instead of Enhanced Lee filtered, and two additional variables (Shannon entropy: phase and intensity) were used for this site (described in [2]) and not for Coronation Gulf. The Julian date of the RADARSAT-2 acquisition was also not included among the set of variables for the Alfred Bog site.

With Users and Producers accuracies for seven land covers ≥84%, Coronation Gulf was selected to reflect the "good separability" case [6]. Alfred Bog represents the "poor separability" case, since Users and Producers accuracies for five classes were much lower (≥63%) [2].

C. Random Forests and Variable Importance

The Random Forests model was implemented using the randomForest [8] package in R. To address the objectives of this letter, four sets of models with different ntree values (50, 200, 500, and 10 000) were each run for 25 iterations to assess the stability of variable importance rankings, as well as the effect of the ntree on the point at which importance values converge to a stable mean. Each time, the model was run using identical training data, and for each run, variable importance was calculated (both MDA and MDG). Mean importance values were then calculated using the following equation:

$$\overline{VI}_p(i) = \left(\sum_{j=1}^{i} VI_p(j)\right) \Big/ \, i \qquad (1)$$

where $p$ is the predictor variable of interest listed in Section IV-B, $VI_p(j)$ is the corresponding variable importance value for an individual run $j$, and $\overline{VI}_p(i)$ is the mean importance value over $i$ runs.

To further investigate the number of runs required to achieve stable mean importance values, the convergence of the deviation of mean importance values from their true mean at each model run was calculated for all predictor variables using the following equation (assuming that the average of 25 runs provides a good approximation of the true mean importance value):

$$D(i) = \frac{\left[\sum_{p=1}^{P}\left(\overline{VI}_p(i) - \overline{VI}_p(25)\right)^{2}\right]^{0.5}}{P} \qquad (2)$$

where $P$ is the total number of predictor variables (49 for Coronation Gulf and 50 for Alfred Bog).

Note that the degree of correlation of variables was considered outside the scope of this letter, but it is an important consideration when using Random Forests [4], [6], [11], [20]. For example, Genuer et al. [11] showed that the addition of highly correlated replications of a true predictor variable leads to a decrease in the magnitude of the importance of the true variable, and likely results in a decrease in the variability of importance values of the true variables (but not the corresponding correlated ones). Additionally, in this letter, no effort was made to assess the effect of mtry (i.e., the number of variables tried to determine the optimal split at each node), since the mtry default value (i.e., the square root of the number of predictor variables) has been shown to achieve results that are close to optimal [21]–[24]; increasing this value would greatly decrease the computational efficiency of the algorithm, which is one of the primary benefits associated with Random Forests.

TABLE II
REQUIRED NUMBER OF RUNS FOR VARIABLE IMPORTANCE STABILITY

V. RESULTS AND DISCUSSION

Fig. 1 shows the plots of mean variable importance rankings for different numbers of trees (50, 200, 500, and 10 000) for both sites. In the good separability case (Coronation Gulf) with only 50 trees [Fig. 1(A)], the MDA values over the first 25 model runs are highly variable. For instance, the ranking of variable 4 (bold blue line) changed from 4 to 8 after 21 runs. Results are similar in the poor separability case [Alfred Bog; see Fig. 1(B)], though separation between the top eight variables is not reached even after 21 model runs (e.g., the variables represented by the blue and orange lines are still crossing). Fig. 1(a) and (b) presents the ranking and the error bars (representing the 95% confidence interval) of the mean importance for the first 30 most important variables. For Coronation Gulf, the first 20 variables exhibit a gradual decrease in the value of the mean importance and the root mean square values, but this is only the case for the first eight or nine variables with the Alfred Bog data set. This could be a result of the lesser-ranked variables not containing any additional information, or of that additional information not being relevant, given the land cover classes of interest.

The variability, and as a result the stability, of the mean importance value for each variable improves further as the ntree is increased, and this is true for both the poor separability case [see Fig. 1(C) and (D)] and the good separability case [see Fig. 1(E) and (F)]. For example, when the ntree is increased to 200 and 500, the MDA importance ranking of all variables, including the top eight most important variables, stabilizes after fewer runs. This lower variation is also reflected in the magnitude of the error bars [see Fig. 1(c)–(f)].

These results clearly demonstrate that the convergence of mean importance values to a close approximation of their true mean requires more runs for models built with fewer trees (see Fig. 1). With 10 000 trees, the low variation of mean importance values based on sequential averaging [Fig. 1(g) and (h)], in addition to the fact that there is almost no cross-ranking among the variables [Fig. 1(G) and (H)], indicates that values closely approximate the real mean. As such, the maximum deviation of the sequential mean importance values from the mean with 10 000 trees [Fig. 2 (green line)] can be used as a threshold to specify the point of convergence of mean importance values for both Coronation Gulf and Alfred Bog. This threshold is drawn in Fig. 2 using a dashed horizontal line. These values are also listed in Table II and shown in Fig. 1 (vertical dotted lines). As can be observed, the convergence point is consistently lower for the good separability case (Coronation Gulf) than for the poor separability case (Alfred Bog; see Fig. 2). A similar analysis was also performed using MDG (listed in Table II; results not provided in detail here for brevity). Results were similar and indicated that the predicted point of stability is similar to that calculated from the MDA. However, with 200 or more trees, convergence occurred after fewer iterations with MDG, indicating that MDG may be slightly more stable. This observation is consistent with the findings of Liaw and Wiener [7]. We also found larger differences in importance values between the most and least important variables with MDG compared with MDA, meaning that a visual threshold between the important and nonimportant variables was easier to identify.

It is worth noting that the OOBE reached a minimal value with as few as 50 trees for Coronation Gulf and slightly higher for Alfred Bog (OOBE ∼14% and ∼18%, respectively), and remained the same regardless of the ntree generated for each model, as shown in Fig. 2(c) (which also suggests the approximate lowest limit of the ntree value to users).

Fig. 1. Sequential averages of the variable importance based on the MDA (plots with capital letters) and the variable importance ranking over 25 runs (plots with small letters). (Left) Coronation Gulf data set. (Right) Alfred Bog data set. (A), (a), (B), and (b) 50 trees. (C), (c), (D), and (d) 200 trees. (E), (e), (F), and (f) 500 trees. (G), (g), (H), and (h) 10 000 trees. Dotted lines: number of runs at which convergence is achieved. Only the first 30 important variables are illustrated here.

Fig. 2. Deviation of all predictor variables from their true mean at each model run. (a) Coronation Gulf. (b) Alfred Bog. Dashed lines: convergence threshold. (c) OOBE against the ntree for both the Coronation Gulf and Alfred Bog data sets. Dotted lines: minimum value of OOBE.
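The convergence procedure built on the sequential means of (1) and the deviation measure of (2) can be sketched numerically as follows. This is a minimal illustration with synthetic importance values standing in for real Random Forest output; the noise scale, threshold value, and all variable names are our own assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: importance values for P variables over 25 runs.
# In practice each row would come from one Random Forest fit.
P, n_runs = 10, 25
true_importance = np.linspace(1.0, 0.1, P)
vi = true_importance + rng.normal(scale=0.05, size=(n_runs, P))

# Eq. (1): sequential (running) mean importance after each run i.
seq_mean = np.cumsum(vi, axis=0) / np.arange(1, n_runs + 1)[:, None]

# Eq. (2): deviation of the sequential means from the 25-run mean,
# taking the 25-run average as a proxy for the unknown true mean.
def deviation(i):
    diff = seq_mean[i - 1] - seq_mean[-1]
    return np.sqrt(np.sum(diff ** 2)) / P

d = np.array([deviation(i) for i in range(1, n_runs + 1)])

# Declare convergence at the first run where D(i) drops below a threshold
# (the letter derives the threshold from a large-ntree model, e.g. with
# 10 000 trees; here the value is simply assumed).
threshold = 0.01
converged_at = int(np.argmax(d < threshold)) + 1
```

With real model output, `converged_at` would be the run count reported in Table II for a given ntree and data set.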

Furthermore, running one Random Forest model with a large ntree required considerably more time than running multiple models with fewer trees. For example, based on the training data from Alfred Bog (number of training data points = 500 and P = 50) using a desktop computer (Intel i7 6700HQ at 2.6 GHz and 16 GB of DDR4 RAM at 2400 MHz), one Random Forest model with 50 trees required 0.114 s and one Random Forest model with 10 000 trees required 21.54 s (both averaged over 1000 replicates). Thus, the minimum required time to achieve stable importance rankings with multiple 50-tree models is 2.62 s, which is one order of magnitude less than the time required for one Random Forest with 10 000 trees. This difference has important implications for the operational uses of Random Forests with much larger data sets. Specifically, these results show that, in obtaining stable mean importance values, it is more computationally efficient to run many iterations of the model with a small ntree than to run a single stable forest of 10 000 trees. Note that, in this case, both approaches led to approximately the same ranking of variables (i.e., the top ten most important tended to remain constant, while the ranking of less important variables varied slightly, and the OOBE was not significantly different). The method used here to determine the optimum number of model runs based on the ntree in each forest can be fully automated by the user. This requires a two-step process: 1) defining a threshold for the deviation of mean importance values from their true mean by calculating D(i = 2) with a large ntree (for example, 10 000), and 2) comparing convergence plots such as those in Fig. 2(a) or (b) with the calculated threshold value (e.g., for a given ntree determined using [19]).

VI. CONCLUSION

Importance rankings of MDA and MDG can be variable between runs of Random Forests, even if the same settings (e.g., the ntree) are used. Therefore, it is recommended that, in order to select variables based on their importance ranking, Random Forests should be run more than once and the variability of values must be assessed. We have demonstrated that variable importance rankings based on the average of sequential models eventually stabilize, but that the minimum number of runs required to achieve stability depends on both the ntree used to build the models and the separability of the classes in the input data. We have demonstrated that convergence to a stable mean can be achieved either by using a very large ntree (10 000 or more) or by taking the average variable importance over an optimal number of runs. While both approaches tend to lead to the same ranking of variables (especially for the most important), the latter has also been found to be more computationally efficient. A systematic approach to determine the optimum number of runs to achieve a stable mean variable importance has been demonstrated, and recommendations have been made to the user on how to repeat this process.

REFERENCES

[1] A. Mellor, A. Haywood, C. Stone, and S. Jones, "The performance of random forests in an operational setting for large area sclerophyll forest classification," Remote Sens., vol. 5, no. 6, pp. 2838–2856, 2013.
[2] L. White, K. Millard, S. Banks, M. Richardson, J. Pasher, and J. Duffe, "Moving to the RADARSAT constellation mission: Comparing synthesized compact polarimetry and dual polarimetry data with fully polarimetric RADARSAT-2 data for image classification of peatlands," Remote Sens., vol. 9, no. 6, p. 573, 2017.
[3] A. A. Thompson, "Overview of the RADARSAT constellation mission," Can. J. Remote Sens., vol. 41, no. 5, pp. 401–407, 2015.
[4] K. Millard and M. Richardson, "On the importance of training data sample selection in random forest image classification: A case study in peatland ecosystem mapping," Remote Sens., vol. 7, no. 7, pp. 8489–8515, 2015.
[5] J. M. Corcoran, J. F. Knight, and A. L. Gallant, "Influence of multi-source and multi-temporal remotely sensed and ancillary data on the accuracy of random forest classification of wetlands in Northern Minnesota," Remote Sens., vol. 5, no. 7, pp. 3212–3238, 2013.
[6] S. Banks, K. Millard, J. Pasher, M. Richardson, H. Wang, and J. Duffe, "Assessing the potential to operationalize shoreline sensitivity mapping: Classifying multiple Wide Fine Quadrature Polarized RADARSAT-2 and Landsat 5 scenes with a single Random Forest model," Remote Sens., vol. 7, no. 10, pp. 13528–13563, 2015.
[7] A. Liaw and M. Wiener, "Classification and regression by randomForest," R News, vol. 2, no. 3, pp. 18–22, 2002.
[8] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and Regression Trees. Boca Raton, FL, USA: CRC Press, 1984.
[9] L. Breiman, Manual—Setting Up, Using, and Understanding Random Forests V4.0, 2003. [Online]. Available: http://oz.berkeley.edu/users/breiman/Using_random_forests_v4.0.pdf
[10] R. Díaz-Uriarte and S. A. de Andrés, "Variable selection from random forests: Application to gene expression data," Spanish Nat. Cancer Center, Tech. Rep., 2005. [Online]. Available: https://arxiv.org/abs/q-bio/0503025
[11] R. Genuer, J.-M. Poggi, and C. Tuleau-Malot, "Variable selection using random forests," Pattern Recognit. Lett., vol. 31, no. 14, pp. 2225–2236, 2010.
[12] M. L. Calle and V. Urrea, "Letter to the editor: Stability of random forest importance measures," Briefings Bioinform., vol. 12, no. 1, pp. 86–89, 2011.
[13] M. Immitzer, C. Atzberger, and T. Koukal, "Tree species classification with random forest using very high spatial resolution 8-band WorldView-2 satellite data," Remote Sens., vol. 4, no. 9, pp. 2661–2693, 2012.
[14] K. Millard and M. Richardson, "Wetland mapping with LiDAR derivatives, SAR polarimetric decompositions, and LiDAR–SAR fusion using a random forest classifier," Can. J. Remote Sens., vol. 39, no. 4, pp. 290–307, 2013.
[15] S. V. Beijma, A. Comber, and A. Lamb, "Random forest classification of salt marsh vegetation habitats using quad-polarimetric airborne SAR, elevation and optical RS data," Remote Sens. Environ., vol. 149, pp. 118–129, Jun. 2014.
[16] R. Díaz-Uriarte and S. A. de Andrés, "Gene selection and classification of microarray data using random forest," BMC Bioinform., vol. 7, no. 1, p. 3, 2006.
[17] J. Ehrlinger, "ggRandomForests: Exploring random forest survival," Dec. 2016. [Online]. Available: https://arxiv.org/abs/1612.08974
[18] H. Ishwaran, U. B. Kogalur, E. Z. Gorodeski, A. J. Minn, and M. S. Lauer, "High-dimensional variable selection for survival data," J. Amer. Stat. Assoc., vol. 105, no. 489, pp. 205–217, 2010.
[19] T. M. Oshiro, P. S. Perez, and J. A. Baranauskas, "How many trees in a random forest?" in Proc. Int. Workshop Mach. Learn. Data Mining Pattern Recognit. (MLDM), Jul. 2012, pp. 154–168.
[20] C. Strobl, A.-L. Boulesteix, T. Kneib, T. Augustin, and A. Zeileis, "Conditional variable importance for random forests," BMC Bioinform., vol. 9, pp. 307–318, Dec. 2008.
[21] L. Breiman, "Random forests," Mach. Learn., vol. 45, no. 1, pp. 5–32, 2001.
[22] Ö. Akar and O. Güngör, "Integrating multiple texture methods and NDVI to the Random Forest classification algorithm to detect tea and hazelnut plantation areas in northeast Turkey," Int. J. Remote Sens., vol. 36, no. 2, pp. 442–464, 2015.
[23] R. Sonobe, H. Tani, X. Wang, N. Kobayashi, and H. Shimamura, "Random forest classification of crop type using multi-temporal TerraSAR-X dual-polarimetric data," Remote Sens. Lett., vol. 5, no. 2, pp. 157–164, 2014.
[24] V. Svetnik, A. Liaw, C. Tong, and T. Wang, "Application of Breiman's random forest to modeling structure-activity relationships of pharmaceutical molecules," in Proc. Int. Workshop Multiple Classifier Syst., 2004, pp. 334–343.
