Lecture 20: Bagging, Random Forests, Boosting
Reading: Chapter 8
Classification and Regression trees, in a nutshell
Example: Heart dataset.

[Figure: classification tree fit to the Heart data; the first split is on Thal.]
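A minimal R sketch of how such a tree might be fit, assuming the Heart data is available as Heart.csv (as on the ISLR website) with response column AHD; the file name and column names are assumptions, not code from the lecture.

# Minimal sketch: fit a classification tree to the Heart data.
library(tree)

Heart <- na.omit(read.csv("Heart.csv", stringsAsFactors = TRUE))
Heart$X <- NULL               # drop the row-index column, if present

heart.tree <- tree(AHD ~ ., data = Heart)
plot(heart.tree)
text(heart.tree)              # split labels like "Thal:a", as in the figure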
Categorical predictors
Missing data
Bagging

- Bagging = Bootstrap Aggregating
- In the Bootstrap, we replicate our dataset by sampling with replacement:
  - Original dataset: x = c(x1, x2, ..., x100)
  - Bootstrap samples:
    boot1 = sample(x, 100, replace = TRUE), ...,
    bootB = sample(x, 100, replace = TRUE).
- We used these samples to get the Standard Error of a parameter estimate
  (a code sketch of this computation follows below):

  SE(\hat{\beta}_1) \approx \sqrt{ \frac{1}{B-1} \sum_{b=1}^{B} \Big( \hat{\beta}_1^{(b)} - \frac{1}{B} \sum_{k=1}^{B} \hat{\beta}_1^{(k)} \Big)^2 }
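A minimal R sketch of the bootstrap standard-error computation above; the simulated data and the choice of B are illustrative assumptions, not values from the lecture.

# Minimal sketch: bootstrap estimate of SE(beta_1_hat) for a simple
# linear regression on simulated data.
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
dat <- data.frame(x = x, y = y)

B <- 1000
beta1.boot <- numeric(B)
for (b in 1:B) {
  idx <- sample(1:n, n, replace = TRUE)   # resample rows with replacement
  fit <- lm(y ~ x, data = dat[idx, ])     # refit on the bootstrap sample
  beta1.boot[b] <- coef(fit)["x"]
}

# Bootstrap SE: sample standard deviation of the B replicated estimates
se.beta1 <- sqrt(sum((beta1.boot - mean(beta1.boot))^2) / (B - 1))
se.beta1                                  # compare with summary(lm(y ~ x, data = dat))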
When does Bagging make sense?
Bagging decision trees

- Disadvantage: Every time we fit a decision tree to a Bootstrap sample, we get a different tree T^b.
  → Loss of interpretability
- For each predictor, add up the total amount by which the RSS (or Gini index) decreases every time we use the predictor in T^b.
- Average this total over the Bootstrap estimates T^1, ..., T^B (a code sketch of this computation follows below).

[Figure: variable importance for the Heart data. Predictors, from least to most important: Fbs, RestECG, ExAng, Sex, Slope, Chol, Age, RestBP, MaxHR, Oldpeak, ChestPain, Ca, Thal. Horizontal axis: Variable Importance (0–100).]
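A minimal sketch of this variable-importance computation using the randomForest package (bagging corresponds to a random forest that considers all p predictors at each split); the data-loading details are assumptions, as in the earlier sketch.

# Minimal sketch: bagging + variable importance on the Heart data.
library(randomForest)

Heart <- na.omit(read.csv("Heart.csv", stringsAsFactors = TRUE))
Heart$X <- NULL
p <- ncol(Heart) - 1                      # number of predictors

set.seed(1)
bag.heart <- randomForest(AHD ~ ., data = Heart,
                          mtry = p,       # consider all predictors at each split
                          importance = TRUE)

importance(bag.heart)                     # per-predictor decrease in Gini / accuracy
varImpPlot(bag.heart)                     # plot like the one on this slide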
Out-of-bag (OOB) error

[Figure: Error (0.10–0.30) versus Number of Trees; curves: Test: Bagging, Test: RandomForest, OOB: Bagging, OOB: RandomForest.]
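A minimal sketch of how the OOB error curve in such a figure might be reproduced with the randomForest package; the data-loading details are again assumptions.

# Minimal sketch: OOB error as a function of the number of trees.
library(randomForest)

Heart <- na.omit(read.csv("Heart.csv", stringsAsFactors = TRUE))
Heart$X <- NULL
p <- ncol(Heart) - 1

set.seed(1)
bag.heart <- randomForest(AHD ~ ., data = Heart, mtry = p, ntree = 500)

# err.rate has one row per tree; the "OOB" column is the running OOB error,
# computed from the trees that did not see each observation.
head(bag.heart$err.rate)
plot(bag.heart$err.rate[, "OOB"], type = "l",
     xlab = "Number of Trees", ylab = "OOB error")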
Random Forests

- We fit a decision tree to different Bootstrap samples.
- When growing the tree, we select a random sample of m < p predictors to consider at each step.
- This will lead to very different (or “uncorrelated”) trees from each sample.
- Finally, average the predictions of the trees (see the sketch after this list).
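A minimal sketch comparing a random forest (m ≈ √p) with bagging (m = p) via their OOB errors, under the same assumed Heart-data setup as above.

# Minimal sketch: random forest vs. bagging on the Heart data.
library(randomForest)

Heart <- na.omit(read.csv("Heart.csv", stringsAsFactors = TRUE))
Heart$X <- NULL
p <- ncol(Heart) - 1

set.seed(1)
rf.heart  <- randomForest(AHD ~ ., data = Heart, mtry = floor(sqrt(p)))  # random forest
bag.heart <- randomForest(AHD ~ ., data = Heart, mtry = p)               # bagging

# Final OOB error of each fit (last row of the running error-rate matrix)
tail(rf.heart$err.rate[, "OOB"], 1)
tail(bag.heart$err.rate[, "OOB"], 1)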
Random Forests vs. Bagging

[Figure: Error (0.10–0.30) versus Number of Trees; curves: Test: Bagging, Test: RandomForest, OOB: Bagging, OOB: RandomForest.]
Random Forests, choosing m

[Figure: Test Classification Error (0.2–0.5) versus Number of Trees, for m = p, m = p/2, and m = √p.]

The optimal m is usually around √p, but m can also be treated as a tuning parameter (a tuning sketch follows below).
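A minimal sketch of treating m as a tuning parameter: compare test classification error over a small grid of mtry values. The train/test split and the grid are illustrative assumptions.

# Minimal sketch: tuning m (mtry) on the Heart data by test error.
library(randomForest)

Heart <- na.omit(read.csv("Heart.csv", stringsAsFactors = TRUE))
Heart$X <- NULL
p <- ncol(Heart) - 1

set.seed(1)
train <- sample(nrow(Heart), nrow(Heart) / 2)

m.grid   <- c(p, floor(p / 2), floor(sqrt(p)), 2)
test.err <- sapply(m.grid, function(m) {
  fit  <- randomForest(AHD ~ ., data = Heart, subset = train, mtry = m)
  pred <- predict(fit, newdata = Heart[-train, ])
  mean(pred != Heart$AHD[-train])         # test classification error
})
data.frame(m = m.grid, test.error = test.err)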
Boosting

1. Set f̂(x) = 0 and r_i = y_i for i = 1, ..., n.
2. For b = 1, ..., B, iterate:
   2.1 Fit a decision tree f̂^b with d splits to the residuals r_1, ..., r_n, treated as the response.
   2.2 Update the prediction and the residuals:
       f̂(x) ← f̂(x) + λ f̂^b(x),
       r_i ← r_i − λ f̂^b(x_i).

We first use the samples that are easiest to predict, then slowly downweight these cases, moving on to harder samples. A code sketch of this loop follows below.
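A minimal sketch of this boosting loop for a regression problem, using rpart to fit the small trees; the simulated data and the choices of B, λ, and d are illustrative assumptions, not values from the lecture.

# Minimal sketch of the boosting algorithm on this slide.
library(rpart)

set.seed(1)
n  <- 200
x1 <- runif(n); x2 <- runif(n)
y  <- sin(4 * x1) + 2 * x2 + rnorm(n, sd = 0.3)
dat <- data.frame(x1 = x1, x2 = x2, y = y)

B <- 500; lambda <- 0.01; d <- 2          # number of trees, shrinkage, tree size
f.hat <- rep(0, n)                        # 1. f_hat(x) = 0
r     <- dat$y                            # 1. r_i = y_i

for (b in 1:B) {                          # 2. for b = 1, ..., B
  dat$r  <- r
  # 2.1 fit a small tree to the residuals (maxdepth used as a proxy for "d splits")
  tree.b <- rpart(r ~ x1 + x2, data = dat,
                  control = rpart.control(maxdepth = d, cp = 0))
  pred.b <- predict(tree.b, dat)
  f.hat  <- f.hat + lambda * pred.b       # 2.2 update the prediction
  r      <- r - lambda * pred.b           # 2.2 update the residuals
}

mean((dat$y - f.hat)^2)                   # training MSE of the boosted model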
Boosting vs. random forests

[Figure: Test Classification Error (0.05–0.25) versus Number of Trees; curves: Boosting with depth = 1, Boosting with depth = 2, and RandomForest with m = √p.]