LGRS2745049 RandomForests Paper
Abstract— Random Forests variable importance measures are often used to rank variables by their relevance to a classification problem and subsequently reduce the number of model inputs in high-dimensional data sets, thus increasing computational efficiency. However, as a result of the way that training data and predictor variables are randomly selected for use in constructing each tree and splitting each node, it is also well known that if too few trees are generated, variable importance rankings tend to differ between model runs. In this letter, we characterize the effect of the number of trees (ntree) and class separability on the stability of variable importance rankings and develop a systematic approach to define the number of model runs and/or trees required to achieve stability in variable importance measures. Results demonstrate that both a large ntree for a single model run and averaged values across multiple model runs with fewer trees are sufficient for achieving stable mean importance values. While the latter is far more computationally efficient, both methods tend to lead to the same ranking of variables. Moreover, the optimal number of model runs differs depending on the separability of classes. Recommendations are made to users regarding how to determine the number of model runs and/or trees required to achieve stable variable importance rankings.

Index Terms— Mean decrease in accuracy (MDA), mean decrease in Gini (MDG) index, random forest, variable reduction.

Manuscript received June 13, 2017; revised July 27, 2017; accepted August 19, 2017. This work was supported in part by Environment and Climate Change Canada and in part by Defence Research and Development Canada. (Corresponding author: Amir Behnamian.)
A. Behnamian, S. N. Banks, L. White, and J. Pasher are with Environment and Climate Change Canada, National Wildlife Research Centre, Ottawa, ON K1S 5B6, Canada (e-mail: [email protected]; [email protected]; [email protected]; [email protected]).
K. Millard is with Defence Research and Development Canada, Ottawa, ON K1A 0Z4, Canada (e-mail: [email protected]).
M. Richardson is with the Department of Geography and Environmental Studies, Carleton University, Ottawa, ON K1S 5B6, Canada (e-mail: [email protected]).
Digital Object Identifier 10.1109/LGRS.2017.2745049

I. INTRODUCTION

Random Forests, based on ensembles of classification and regression trees, has become a widely used classification approach in various fields, including remote sensing. It is relatively easy to implement in a variety of software packages (e.g., R Statistics and Python) and is also computationally efficient. The latter is especially relevant today, since high-dimensional data sets from different sources are widely available and are commonly used for image classification. However, in many cases, not all data sets and predictor variables provide relevant information to the classifier, and thus, it is oftentimes desirable to reduce the model data load to the fewest number of inputs with maximal predictive accuracy. This is especially relevant for large data sets (e.g., Landsat imagery for all of Canada, RADARSAT-2 archive data, and, with its four-day repeat pass cycle, high-frequency temporal data via the RADARSAT Constellation Mission in the near future [1]–[3]) and/or data acquired from multiple sensors. Reducing model data load can reduce processing times and storage requirements, and can also be used to inform long-term analyses, as attention can focus on just the sensors and variables that provide relevant information to a given classification problem. Furthermore, it has also been demonstrated that with very high dimensional data sets, results can be noisier than in models where only the most important variables are used [4]. Both the mean decrease in accuracy (MDA) and the mean decrease in Gini (MDG) are commonly used statistical measures of variable importance for determining which predictor variables are best suited to differentiate the classes of interest and for reducing the dimensionality of large data sets [4]–[7]. MDA quantifies variable importance by measuring the change in prediction accuracy when the values of the variable are randomly permuted. MDG is the sum of all decreases in Gini impurity due to a given variable, normalized by the number of trees (ntree) [8], [9]. However, because of the random way in which training data and variables are selected to determine the split at each node in Random Forests, importance rankings differ from one model run to another, especially if only a small ntree is generated [4], [7], [10]–[12]. As such, users should not rely on rankings derived from a single model run [13]–[15].

II. BACKGROUND

A conservative approach to dealing with varying importance values is to average outputs from a sufficiently large number of forests with a sufficiently large ntree (e.g., 50 forests with more than 1000 trees), followed by a "forward" or "reverse" stepwise approach to reduce model inputs to only the most important predictor variables, until the minimum out-of-bag error (OOBE) is achieved [12], [16]. It is notable that an iterative variable importance reduction (i.e., recalculating variable importance) is computationally expensive for big data sets (in this context, and throughout this letter, computational expense refers specifically to the amount of time required to generate importance values and/or predict the classification) [6].
Fig. 1. Sequential averages of the variable importance based on the MDA (plots with capital letters) and the variable importance ranking over 25 runs (plots with lowercase letters). (Left) Coronation data set. (Right) Alfred Bog data set. (A), (a), (B), and (b) 50 trees. (C), (c), (D), and (d) 200 trees. (E), (e), (F), and (f) 500 trees. (G), (g), (H), and (h) 10 000 trees. Dotted lines: number of runs at which convergence is achieved. Only the first 30 most important variables are illustrated here.
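The sequential averaging plotted in Fig. 1 is just a running mean of a variable's importance over successive model runs. A minimal sketch (the function name and the importance values are hypothetical, not taken from the letter):

```python
def sequential_averages(run_values):
    """Running mean of one variable's importance over successive model runs."""
    means, total = [], 0.0
    for i, v in enumerate(run_values, start=1):
        total += v
        means.append(total / i)
    return means

# Hypothetical MDA values for one variable over five runs
vals = [0.8, 1.2, 1.0, 0.9, 1.1]
print(sequential_averages(vals))  # → [0.8, 1.0, 1.0, 0.975, 1.0]
```

As more runs are averaged, the running mean fluctuates less, which is the stabilization the convergence plots are designed to detect.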
Fig. 2. Deviation of all predictor variables from their true mean at each model run. (a) Coronation. (b) Alfred Bog. Dashed lines: convergence threshold. (c) OOBE against ntree for both the Coronation Gulf and Alfred Bog data sets. Dotted lines: minimum value of OOBE.
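The convergence test illustrated in Fig. 2(a) and (b) amounts to finding the first run at which every variable's running-mean importance lies within a threshold of its "true" mean (estimated from a model with a very large ntree). A minimal sketch under these assumptions, with hypothetical names and numbers:

```python
def runs_to_converge(runs, true_means, threshold):
    """
    First run index at which the running-mean importance of every variable
    stays within `threshold` of its 'true' mean (e.g., from a 10 000-tree run).
    Returns None if convergence is not reached within the supplied runs.
    """
    n_vars = len(true_means)
    totals = [0.0] * n_vars
    for i, run in enumerate(runs, start=1):
        for j, v in enumerate(run):
            totals[j] += v
        if all(abs(totals[j] / i - true_means[j]) <= threshold
               for j in range(n_vars)):
            return i
    return None

# Two variables, two runs of hypothetical importance values
true_means = [1.0, 0.5]
runs = [[1.5, 0.1], [0.7, 0.7]]
n_runs = runs_to_converge(runs, true_means, threshold=0.2)
```

With these toy numbers the first run is too noisy, but averaging in the second brings every variable within the threshold; a tighter threshold would simply demand more runs.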
BEHNAMIAN et al.: SYSTEMATIC APPROACH FOR VARIABLE SELECTION WITH RANDOM FORESTS

to users). Furthermore, running one Random Forest model with a large ntree required considerably more time than running multiple models with fewer trees. For example, based on the training data from Alfred Bog (number of training data points = 500 and P = 50) using a desktop computer (Intel i7 6700HQ at 2.6 GHz and 16 GB of DDR4 RAM at 2400 MHz), one Random Forest model with 50 trees required 0.114 s and one Random Forest model with 10 000 trees required 21.54 s (both averaged over 1000 replicates). Thus, the minimum required time to achieve stable importance rankings with the latter approach is 2.62 s, which is one order of magnitude less than the time required for one Random Forest with 10 000 trees. This difference has important implications for the operational use of Random Forests with much larger data sets. Specifically, these results show that, in obtaining stable mean importance values, it is more computationally efficient to run many iterations of the model with a small ntree than to run a single stable forest of 10 000 trees. Note that, in this case, both approaches led to approximately the same ranking of variables (i.e., the top ten most important tended to remain constant, while the ranking of less important variables varied slightly, and the OOBE was not significantly different). The methods used here to determine the optimum number of model runs based on the ntree in each forest can be fully automated by the user. This requires a two-step process: 1) defining a threshold for the deviation of mean importance values from their true mean by calculating D(i = 2) with a large ntree (for example, 10 000), and 2) comparing convergence plots such as those in Fig. 2(a) or (b) with the calculated threshold value (e.g., for a given ntree determined using [19]).

VI. CONCLUSION

Importance rankings based on MDA and MDG can vary between runs of Random Forests, even if the same settings (e.g., ntree) are used. Therefore, it is recommended that, in order to select variables based on their importance ranking, Random Forests should be run more than once and the variability of values assessed. We have demonstrated that variable importance rankings based on the average of sequential models eventually stabilize, but that the minimum number of runs required to achieve stability depends on both the ntree used to build the models and the separability of the classes in the input data. We have demonstrated that convergence to a stable mean can be achieved either by using a very large ntree (10 000 or more) or by taking the average variable importance over an optimal number of runs. While both approaches tend to lead to the same ranking of variables (especially for the most important), the latter has also been found to be more computationally efficient. A systematic approach to determine the optimum number of runs to achieve a stable mean variable importance has been demonstrated, and recommendations have been made to the user on how to repeat this process.

REFERENCES

[1] A. Mellor, A. Haywood, C. Stone, and S. Jones, "The performance of random forests in an operational setting for large area sclerophyll forest classification," Remote Sens., vol. 5, no. 6, pp. 2838–2856, 2013.
[2] L. White, K. Millard, S. Banks, M. Richardson, J. Pasher, and J. Duffe, "Moving to the RADARSAT constellation mission: Comparing synthesized compact polarimetry and dual polarimetry data with fully polarimetric RADARSAT-2 data for image classification of peatlands," Remote Sens., vol. 9, no. 6, p. 573, 2017.
[3] A. A. Thompson, "Overview of the RADARSAT constellation mission," Can. J. Remote Sens., vol. 41, no. 5, pp. 401–407, 2015.
[4] K. Millard and M. Richardson, "On the importance of training data sample selection in random forest image classification: A case study in peatland ecosystem mapping," Remote Sens., vol. 7, no. 7, pp. 8489–8515, 2015.
[5] J. M. Corcoran, J. F. Knight, and A. L. Gallant, "Influence of multi-source and multi-temporal remotely sensed and ancillary data on the accuracy of random forest classification of wetlands in Northern Minnesota," Remote Sens., vol. 5, no. 7, pp. 3212–3238, 2013.
[6] S. Banks, K. Millard, J. Pasher, M. Richardson, H. Wang, and J. Duffe, "Assessing the potential to operationalize shoreline sensitivity mapping: Classifying multiple Wide Fine Quadrature Polarized RADARSAT-2 and Landsat 5 scenes with a single Random Forest model," Remote Sens., vol. 7, no. 10, pp. 13528–13563, 2015.
[7] A. Liaw and M. Wiener, "Classification and regression by randomForest," R News, vol. 2, no. 3, pp. 18–22, 2002.
[8] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and Regression Trees. Boca Raton, FL, USA: CRC Press, 1984.
[9] L. Breiman. (2003). Manual–Setting Up, Using, and Understanding Random Forests V4.0. [Online]. Available: http://oz.berkeley.edu/users/breiman.Using_random_forests_v4.0.pdf
[10] R. Diaz-Uriarte and S. A. de Andrés. (2005). "Variable selection from random forests: Application to gene expression data," Spanish Nat. Cancer Center, Tech. Rep. [Online]. Available: https://arxiv.org/abs/q-bio/0503025
[11] R. Genuer, J.-M. Poggi, and C. Tuleau-Malot, "Variable selection using random forests," Pattern Recognit. Lett., vol. 31, no. 14, pp. 2225–2236, 2010.
[12] M. L. Calle and V. Urrea, "Letter to the editor: Stability of random forest importance measures," Briefings Bioinform., vol. 12, no. 1, pp. 86–89, 2011.
[13] M. Immitzer, C. Atzberger, and T. Koukal, "Tree species classification with random forest using very high spatial resolution 8-band WorldView-2 satellite data," Remote Sens., vol. 4, no. 9, pp. 2661–2693, 2012.
[14] K. Millard and M. Richardson, "Wetland mapping with LiDAR derivatives, SAR polarimetric decompositions, and LiDAR–SAR fusion using a random forest classifier," Can. J. Remote Sens., vol. 39, no. 4, pp. 290–307, 2013.
[15] S. V. Beijma, A. Comber, and A. Lamb, "Random forest classification of salt marsh vegetation habitats using quad-polarimetric airborne SAR, elevation and optical RS data," Remote Sens. Environ., vol. 149, pp. 118–129, Jun. 2014.
[16] R. Díaz-Uriarte and S. A. de Andres, "Gene selection and classification of microarray data using random forest," BMC Bioinform., vol. 7, no. 1, p. 3, 2006.
[17] J. Ehrlinger. (Dec. 2016). "ggRandomForests: Exploring random forest survival." [Online]. Available: https://arxiv.org/abs/1612.08974
[18] H. Ishwaran, U. B. Kogalur, E. Z. Gorodeski, A. J. Minn, and M. S. Lauer, "High-dimensional variable selection for survival data," J. Amer. Stat. Assoc., vol. 105, no. 489, pp. 205–217, 2010.
[19] T. M. Oshiro, P. S. Perez, and J. A. Baranauskas, "How many trees in a random forest?" in Proc. Int. Workshop Mach. Learn. Data Mining Pattern Recognit. (MLDM), Jul. 2012, pp. 154–168.
[20] C. Strobl, A.-L. Boulesteix, T. Kneib, T. Augustin, and A. Zeileis, "Conditional variable importance for random forests," BMC Bioinform., vol. 9, pp. 307–318, Dec. 2008.
[21] L. Breiman, "Random forests," Mach. Learn., vol. 45, no. 1, pp. 5–32, 2001.
[22] Ö. Akar and O. Güngör, "Integrating multiple texture methods and NDVI to the Random Forest classification algorithm to detect tea and hazelnut plantation areas in northeast Turkey," Int. J. Remote Sens., vol. 36, no. 2, pp. 442–464, 2015.
[23] R. Sonobe, H. Tani, X. Wang, N. Kobayashi, and H. Shimamura, "Random forest classification of crop type using multi-temporal TerraSAR-X dual-polarimetric data," Remote Sens. Lett., vol. 5, no. 2, pp. 157–164, 2014.
[24] V. Svetnik, A. Liaw, C. Tong, and T. Wang, "Application of Breiman's random forest to modeling structure-activity relationships of pharmaceutical molecules," in Proc. Int. Workshop Multiple Classifier Syst., 2004, pp. 334–343.