ML 10 Decision Trees


Decision Tree Learning

How do I decide what to do today?

An example from the textbook, built up feature by feature into the following
non-binary decision tree.

Non-binary decision tree

Work to Do?
  Yes → Stay In
  No  → Outlook?
          Sunny    → Go to Beach
          Overcast → Go Running
          Rainy    → Friends Busy?
                       Yes → Stay In
                       No  → Go to Movies
Binary decision tree

Work to Do?
  Yes → Stay In
  No  → Sunny?
          Yes → Go to Beach
          No  → Overcast?
                  Yes → Go Running
                  No  → Friends Busy?
                          Yes → Stay In
                          No  → Go to Movies
Components of a decision tree

Features:   Work to Do?, Sunny?, Overcast?, Friends Busy?
Categories: Stay In, Go to Beach, Go Running, Go to Movies
Sample binary decision tree

Terminology: the root (root node = root feature) is F1; each branch is a
split; every internal node is a feature; every leaf is a class.

F1
  Yes → C1
  No  → F2
          Yes → C2
          No  → F3
                  Yes → C3
                  No  → F4
                          Yes → C1
                          No  → C4
Full binary decision tree

[Figure: a full binary tree on features F1–F4; every node at every level is
split, so the 16 leaves correspond to the 2^4 possible combinations of the
four binary feature values.]
Complete binary decision tree

[Figure: a binary tree on features F1–F4 whose last level is only partially
filled.]

• Full at every level except possibly the last
• All nodes are as far left as possible
Decision stump

[Figure: a single decision node F1 whose two branches lead directly to the
results R1 and R2.]
Decision trees

• Decision trees classify instances by sorting them down the tree from the
  root to some leaf node, which provides the classification of the instance.
• Each node in the tree specifies a test of some feature of the instance,
  and each branch from that node corresponds to one of the possible values
  (or range of values) of that feature.

Decision tree classification
• Starting at the root node, test the feature specified by that node.
• Then move down the branch corresponding to the value (or range of values)
  that the given instance has for that feature.
• Repeat this process for the subtree rooted at the new node until a leaf is
  reached, i.e., there are no remaining features to be examined.
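A minimal sketch of this traversal in Python, using a hypothetical nested-dict
encoding of the weekend-activity tree shown earlier (the dict layout and the
sample instance are my own illustration, not from the slides):

# Internal nodes map a feature name to {branch value -> subtree}; leaves are class labels.
tree = {
    "Work to Do?": {
        "Yes": "Stay In",
        "No": {
            "Outlook?": {
                "Sunny": "Go to Beach",
                "Overcast": "Go Running",
                "Rainy": {"Friends Busy?": {"Yes": "Stay In", "No": "Go to Movies"}},
            }
        },
    }
}

def classify(node, instance):
    # Walk from the root to a leaf, following the branch that matches the
    # instance's value for each tested feature.
    while isinstance(node, dict):
        feature, branches = next(iter(node.items()))
        node = branches[instance[feature]]
    return node

print(classify(tree, {"Work to Do?": "No", "Outlook?": "Rainy", "Friends Busy?": "No"}))
# -> Go to Movies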

Decision tree learning
• Training a decision tree consists of
determining which feature to assign to each
node and which value (or range of values) of
that feature to assign to each branch.
• Most decision tree learning algorithms utilize a
top-down, greedy search through the space of
possible decision trees.

Greedy algorithms
• A greedy algorithm makes the locally optimal choice at each stage in the
  hope of finding a globally optimal solution (or something close to it).

Subway problem: choosing a route

• A commute from 72nd St & Broadway to the WeWork in Dumbo Heights.

[Map: subway lines 1, 2, 3, A, C, and F running south from 72nd St; near
Dumbo, the F stops at York St., the A/C at High St., and the 2/3 at
Clark St.]
Greedy algorithm for choosing my route

• At each node where there's a choice, take the first train that arrives.
• If two arrive simultaneously, take the express.
• Don't switch trains on the same line.

[Map: the same subway map as on the previous slide.]
Greedy algorithm drawbacks

Following these rules, I:
• never take the F, although York St. is the closest of the three Dumbo
  stations to the office;
• take a C even when the displays say that an A will be arriving momentarily;
• take the 1 even if the A/C line is out of service.

[Map: the same subway map as on the previous slides.]
Measures of a decision tree

• The depth of a decision tree is the length of the longest path from the
  root of the tree to a leaf.
• The size of a decision tree is the number of nodes in the tree.
Full binary decision tree

[Figure: the full binary tree on F1–F4 shown earlier, annotated with its
depth and size.]

• n binary features can be used to classify samples into at most 2^n classes.
• depth d = length of the longest path from the root to a leaf
• size s = number of nodes in the tree; for a full binary tree, s = 2^(d+1) − 1
• For this tree, d = 4 and s = 2^5 − 1 = 31.
Decision stump

[Figure: a single decision node F1 with results R1 and R2.]

• Only one decision node, i.e., depth = 1 and size = 3.
Finding the optimal binary decision tree

• There are many possible decision trees.
• How do we choose the optimal one?
• We start at the tree root and split on the feature that results in the
  largest information gain (IG).
• If we don't prune the tree, we continue down the tree, splitting at each
  node until a stopping criterion is satisfied.
Impurity

Impurity is a measure of how homogeneous or heterogeneous a set of objects
is. A set is said to be pure (or homogeneous) if it contains only a single
class.

There are a number of commonly used measures of impurity.
Notation
• f is the feature (and its values) used to perform the split
• the parent node is the node at which the split is made
• m is the number of child nodes of the parent node
for a binary tree, m = 2
• Dp is the dataset of the parent node
• Dj is the dataset of the jth child node
• Np is the number of samples in the parent node
• Nj is the number of samples in the jth child node
• I is the impurity, a measure of the heterogeneity of a
dataset
• IG is the information gain at a particular split in the tree;
it is the difference between the impurity of the parent node
and the weighted sum of the impurities of the child nodes
Information Gain
Information gain at a particular split in the tree is the
difference between the impurity of the parent node and the
weighted sum of the impurities of the child nodes.

IG(Dp, f) = I(Dp) − Σ_{j=1}^{m} (Nj / Np) · I(Dj)
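A small sketch of this computation (the function name and the example split
are my own; the impurity function is passed in so that any of the measures
defined on the next slides can be used):

from typing import Callable, List

def information_gain(parent: List[int],
                     children: List[List[int]],
                     impurity: Callable[[List[float]], float]) -> float:
    # IG(Dp, f) = I(Dp) - sum_j (Nj / Np) * I(Dj), with each dataset given
    # as a list of per-class counts, e.g. [9, 1].
    def probs(counts):
        n = sum(counts)
        return [c / n for c in counts]
    n_parent = sum(parent)
    weighted = sum(sum(c) / n_parent * impurity(probs(c)) for c in children)
    return impurity(probs(parent)) - weighted

# Quick check with the Gini impurity and the first split of the worked
# example later in the slides: [10, 10] split into [9, 1] and [1, 9].
gini = lambda p: 1.0 - sum(pi ** 2 for pi in p)
print(information_gain([10, 10], [[9, 1], [1, 9]], gini))  # ≈ 0.32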

Impurity measures

• IG Gini impurity
• IH Entropy
• IE Classification Error

Gini impurity, IG

The Gini impurity, IG, is a measure of how often a randomly chosen element
from the set would be incorrectly labeled if it were randomly labeled
according to the distribution of labels in the subset.

IG({pi}) = Σ_{i=1}^{m} pi (1 − pi) = Σ_{i=1}^{m} pi − Σ_{i=1}^{m} pi²
         = 1 − Σ_{i=1}^{m} pi² = Σ_{i≠j} pi pj

pi denotes the fraction of the elements in the set that are in class i.

Entropy, IH

The entropy, IH, is a measure of information content and of disorder. It is
a measure that is in wide use in statistical physics.

IH({pi}) = − Σ_{i=1}^{m} pi log₂ pi

Classification error, IE

The classification error, IE, is a simpler measure of impurity. It is a
measure of how often a randomly chosen element from the set would be
incorrectly labeled if it were labeled with the most prevalent label in the
subset.

IE({pi}) = 1 − max{pi}
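A short sketch of the three measures for a vector of class probabilities;
it reproduces the two-class values used in the worked examples a few slides
below (the function names are mine):

import math
from typing import Sequence

def gini(p: Sequence[float]) -> float:
    # I_G = 1 - sum_i p_i^2
    return 1.0 - sum(pi ** 2 for pi in p)

def entropy(p: Sequence[float]) -> float:
    # I_H = -sum_i p_i log2 p_i (terms with p_i = 0 contribute nothing)
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def classification_error(p: Sequence[float]) -> float:
    # I_E = 1 - max_i p_i
    return 1.0 - max(p)

for p in ([0.3, 0.7], [0.5, 0.5]):
    print(p, round(gini(p), 4), round(entropy(p), 4), round(classification_error(p), 4))
# [0.3, 0.7] -> 0.42  0.8813  0.3
# [0.5, 0.5] -> 0.5   1.0     0.5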

Impurity measures for sets with two classes

[Plot: Gini impurity, entropy, and classification error as functions of
p1 = 1 − p2, for p1 from 0 to 1. All three measures are 0 at p1 = 0 and
p1 = 1 and peak at p1 = 0.5, where the entropy reaches 1.0 and the Gini
impurity and classification error reach 0.5.]
Impurity measures for sets with two classes

[Plot: the same curves with the entropy scaled so that it also peaks at 0.5,
making the three measures easier to compare; the vertical axis runs from
0 to 0.5.]
Examples of impurity measures
n = 10, m = 2

p1= 0.3, p2 = 0.7 p1= 0.5, p2 = 0.5 p1= 0.7, p2 = 0.3

Gini impurity, IG:  IG({pi}) = Σ_{i≠j} pi pj

n = 10, m = 2

p1 = 0.3, p2 = 0.7:  IG = 0.3·0.7 + 0.7·0.3 = 0.42
p1 = 0.5, p2 = 0.5:  IG = 0.5·0.5 + 0.5·0.5 = 0.50
p1 = 0.7, p2 = 0.3:  IG = 0.7·0.3 + 0.3·0.7 = 0.42
Entropy, IH:  IH({pi}) = − Σ_{i=1}^{m} pi log₂ pi

n = 10, m = 2

p1 = 0.3, p2 = 0.7:  IH = −0.3·log₂ 0.3 − 0.7·log₂ 0.7 ≈ 0.881
p1 = 0.5, p2 = 0.5:  IH = −0.5·log₂ 0.5 − 0.5·log₂ 0.5 = 1.0
p1 = 0.7, p2 = 0.3:  IH = −0.7·log₂ 0.7 − 0.3·log₂ 0.3 ≈ 0.881
Classification error, IE:  IE({pi}) = 1 − max{pi}

n = 10, m = 2

p1 = 0.3, p2 = 0.7:  IE = 1 − max{0.3, 0.7} = 0.3
p1 = 0.5, p2 = 0.5:  IE = 1 − max{0.5, 0.5} = 0.5
p1 = 0.7, p2 = 0.3:  IE = 1 − max{0.7, 0.3} = 0.3
Worked example: growing a tree with the entropy criterion

The training set has 20 samples described by two continuous features, x1 and
x2, with 10 samples in each class (value = [10, 10], entropy = 1.0).

Depth 1: split the root on x1 <= 65.75.
  True:  samples = 10, value = [9, 1], entropy = 0.469
  False: samples = 10, value = [1, 9], entropy = 0.469

Depth 2: split the left child on x1 <= 57.5.
  True:  samples = 7, value = [7, 0], entropy = 0.0
  False: samples = 3, value = [2, 1], entropy = 0.918
Depth 2: split the right child on x2 <= 67.0.
  True:  samples = 3, value = [1, 2], entropy = 0.918
  False: samples = 7, value = [0, 7], entropy = 0.0

Depth 3: split the [2, 1] node on x1 <= 58.75 into [0, 1] and [2, 0], and
split the [1, 2] node on x1 <= 92.0 into [1, 0] and [0, 2]. All leaves are
now pure (entropy = 0.0).

[Scatter plots accompany each step, showing the (x1, x2) plane partitioned
into axis-aligned rectangles by the splits made so far.]
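The node summaries above are in the format scikit-learn produces. A sketch of
how such a tree could be grown and inspected (the data here are randomly
generated stand-ins, not the 20 samples from the slides):

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Placeholder data: 20 samples, two features, two classes.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(20, 2))
y = (X[:, 0] > 65).astype(int)

# Grow the tree greedily, choosing at each node the split with the largest
# information gain (criterion="entropy").
clf = DecisionTreeClassifier(criterion="entropy").fit(X, y)

# Text rendering of the learned thresholds and leaf class counts;
# sklearn.tree.plot_tree would draw the node boxes (entropy / samples /
# value) as shown in the slides.
print(export_text(clf, feature_names=["x1", "x2"], show_weights=True))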
Typical criteria to stop training

• Only leaf nodes remain
• No further features remain to be examined
• Additional splitting fails to reduce the impurity by a specified amount
• A specified maximum tree depth has been reached
• A specified number of leaf nodes has been generated
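For reference, these criteria correspond roughly to scikit-learn's
DecisionTreeClassifier constructor parameters; a sketch with arbitrary
values:

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    criterion="entropy",
    max_depth=3,                 # maximum tree depth
    max_leaf_nodes=8,            # maximum number of leaf nodes
    min_impurity_decrease=0.01,  # required impurity reduction per split
    min_samples_leaf=1,          # smallest allowed leaf
)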

Classifying a new sample
• Follow the tree down until a leaf is reached.
• Classify the sample based on the training samples in that leaf.
• If the leaf is homogeneous, use the category of the
training samples in the leaf.
• If the leaf is heterogeneous, apply one of the following:
• Use the category that occurs most frequently in the
training samples in the leaf.
• Choose a category randomly with the probability for
each category being the fraction of training samples
of that category in that leaf.
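A tiny sketch of the two rules for a heterogeneous leaf (the leaf counts are
hypothetical):

import numpy as np

leaf_counts = np.array([2, 1])               # training samples per class in the leaf
probs = leaf_counts / leaf_counts.sum()      # class fractions in the leaf

majority_class = int(np.argmax(leaf_counts))                 # most frequent category
sampled_class = int(np.random.choice(len(probs), p=probs))   # drawn with those fractions
print(majority_class, probs, sampled_class)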

Advantages of decision trees

• Efficient and scalable


• Handles both discrete and continuous features
• Robust against monotonic input transformations
• Robust against outliers
• Automatically ignores irrelevant features; no
need for feature selection (but feature
engineering may be useful)
• Results are usually interpretable

Overfitting is often an issue

• n binary features can be used to classify samples into up to 2^n classes.
• There are many possible decision trees:
  • For N binary features there are 2^(2^N) possible binary decision trees.
  • If one or more of the features is a rational number, there are an
    infinite number of possible decision trees, and such features can be
    used repeatedly.
• Pruning is one way to reduce overfitting.
Pruning a decision tree

Pruning is a technique in machine learning that reduces the size of decision
trees by removing sections of the tree that provide little power to classify
instances.

Pruning reduces the complexity of the final classifier, and hence improves
predictive accuracy by reducing overfitting.

Pruning algorithms
• Reduced error pruning
• Cost complexity pruning
Reduced Error Pruning
• A fast and efficient bottom-up greedy algorithm.
• Start by testing the nodes nearest the leaves:
  • Consider replacing a node with its most popular class, thereby making it
    a leaf node.
  • If the prediction accuracy is not affected, keep the change under
    consideration; if it is, move on to another node.
• Prune nodes iteratively, always pruning the node that most increases the
  decision tree's accuracy over the set of validation data.
• Continue pruning until further pruning would decrease the accuracy of the
  tree over the validation set.
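A compact sketch of reduced error pruning on a hypothetical tree
representation (nested dicts that store a majority class per node; the
representation, helper names, and validation data are my own, not from the
slides):

import copy

def predict(node, x):
    # Leaves are class labels; internal nodes hold a numeric test and children.
    while isinstance(node, dict):
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node

def accuracy(tree, X_val, y_val):
    return sum(predict(tree, x) == y for x, y in zip(X_val, y_val)) / len(y_val)

def reduced_error_prune(tree, X_val, y_val):
    # Repeatedly replace the internal node whose collapse to its majority
    # class helps validation accuracy the most; stop when every possible
    # collapse would lower the accuracy.
    def paths(node, path=()):
        # bottom-up enumeration of paths ("left"/"right" keys) to internal nodes
        if isinstance(node, dict):
            yield from paths(node["left"], path + ("left",))
            yield from paths(node["right"], path + ("right",))
            yield path

    def collapsed_at(tree, path):
        new = copy.deepcopy(tree)
        if not path:                       # collapsing the root leaves a single leaf
            return new["majority"]
        parent = new
        for key in path[:-1]:
            parent = parent[key]
        parent[path[-1]] = parent[path[-1]]["majority"]
        return new

    while isinstance(tree, dict):
        base = accuracy(tree, X_val, y_val)
        candidates = [collapsed_at(tree, p) for p in paths(tree)]
        scored = [(accuracy(t, X_val, y_val), i) for i, t in enumerate(candidates)]
        best_acc, best_i = max(scored)
        if best_acc < base:
            break                          # any further pruning hurts validation accuracy
        tree = candidates[best_i]
    return tree

Each accepted collapse removes at least one internal node, so the loop is
guaranteed to terminate.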
Cost Complexity Pruning

• Create a series of trees


• Choose one that minimizes a cost function that
includes terms for impurity and complexity.

Cost-complexity cost function

The total cost Cα(T) of tree T is defined as

    Cα(T) = R(T) + α|T|

where
• R(T) is the fraction of cases in the training sample that are misclassified
  by T, the resubstitution error;
• α|T| is the complexity penalty, where |T| is the number of leaves in T.
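As a quick numerical check (using values from the example subtrees tabulated
below), a one-line sketch of this cost:

def total_cost(resubstitution_error, n_leaves, alpha):
    # C_alpha(T) = R(T) + alpha * |T|
    return resubstitution_error + alpha * n_leaves

print(total_cost(0.00, 6, alpha=0.025))  # 0.15, the C_0.025 value for the full tree A below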

Summary of the Algorithm for Cost-Complexity Pruning

A) Choose Tmax, the tree that is to be pruned to the "right size."
B) Compute T1 from Tmax. T1 is the smallest subtree of Tmax that has the same
   resubstitution error as Tmax.
C) Generate the rest of the decreasing sequence of subtrees of Tmax,
       T1 > T2 > ... > {t1},
   where {t1} is the root node of the tree, such that Tk is the smallest
   minimizing subtree for α ∈ [αk, αk+1).
D) Select the final tree from the sequence by choosing the one with the
   lowest error rate on a set of test data.
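scikit-learn implements minimal cost-complexity pruning directly; a sketch of
obtaining the sequence of α values and a pruned tree (the dataset is a
placeholder, and note that scikit-learn measures R(T) with the tree's
impurity criterion rather than the misclassification rate):

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=20, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

# The pruning path returns the effective alphas at which nodes get pruned
# (the alpha_k of step C) and the total impurity of each pruned tree.
path = DecisionTreeClassifier(criterion="entropy").cost_complexity_pruning_path(X, y)
print(path.ccp_alphas, path.impurities)

# Refitting with ccp_alpha set picks the smallest minimizing subtree for that alpha.
pruned = DecisionTreeClassifier(criterion="entropy", ccp_alpha=0.03).fit(X, y)
print(pruned.get_n_leaves())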
Enumerating pruned subtrees of the example tree

Each candidate below is listed with its number of leaves |T| and its
resubstitution error R(T).

A) |T| = 6, R(T) = 0.00 — the full tree grown above (all leaves pure).
B) |T| = 5, R(T) = 0.05 — prune the x1 <= 92.0 split; the [1, 2] node becomes
   a leaf.
C) |T| = 5, R(T) = 0.05 — prune the x1 <= 58.75 split; the [2, 1] node
   becomes a leaf.
D) |T| = 4, R(T) = 0.10 — prune both depth-3 splits.
E) |T| = 4, R(T) = 0.05 — prune the x2 <= 67.0 branch; the [1, 9] node
   becomes a leaf.
F) |T| = 4, R(T) = 0.05 — prune the x1 <= 57.5 branch; the [9, 1] node
   becomes a leaf.
G) |T| = 3, R(T) = 0.10 — keep only the x1 <= 65.75 and x1 <= 57.5 splits.
H) |T| = 3, R(T) = 0.10 — keep only the x1 <= 65.75 and x2 <= 67.0 splits.
I) |T| = 2, R(T) = 0.10 — keep only the root split x1 <= 65.75.
J) |T| = 1, R(T) = 0.50 — the root alone, with no splits.

[Each subtree is drawn in the slides alongside the corresponding partition of
the (x1, x2) plane.]
All subtrees

ID   |T|   R(T)   depth   Cα(T)
A     6    0.00     3     6α
B     5    0.05     3     0.05 + 5α
C     5    0.05     3     0.05 + 5α
D     4    0.10     2     0.10 + 4α
E     4    0.05     3     0.05 + 4α
F     4    0.05     3     0.05 + 4α
G     3    0.10     2     0.10 + 3α
H     3    0.10     2     0.10 + 3α
I     2    0.10     1     0.10 + 2α
J     1    0.50     0     0.50 + α
Cost Complexity Function

[Plot: Cα(T) = R(T) + α|T| versus α (0 to 0.04) for each candidate subtree:
A [6 leaves], B,C [5], D [4], E,F [4], G,H [3], I [2], and J [1].]

[Plot: the same curves zoomed to Cα(T) ≤ 0.25, showing that A has the lowest
cost for small α and I for larger α, with the crossover near α = 0.025.]
Candidate subtrees

ID    |T|   R(T)   depth   Cα(T)        α where T = argmin Cα(T)
A      6    0.00     3     6α           [0, 0.025)
B, C   5    0.05     3     0.05 + 5α
D      4    0.10     2     0.10 + 4α
E, F   4    0.05     3     0.05 + 4α    0.025
G, H   3    0.10     2     0.10 + 3α
I      2    0.10     1     0.10 + 2α    [0.025, ∞)
J      1    0.50     0     0.50 + α
Candidate subtrees

ID    |T|   R(T)   depth   Cα(T)        α where T = argmin Cα(T)   C0.025(T)
A      6    0.00     3     6α           [0, 0.025)                 0.150
B, C   5    0.05     3     0.05 + 5α                               0.175
D      4    0.10     2     0.10 + 4α                               0.200
E, F   4    0.05     3     0.05 + 4α    0.025                      0.150
G, H   3    0.10     2     0.10 + 3α                               0.175
I      2    0.10     1     0.10 + 2α    [0.025, ∞)                 0.150
J      1    0.50     0     0.50 + α                                0.525
Candidate subtrees

The sequence of candidate subtrees is TA > TI.

Choose the one that minimizes the error rate on a set of test data.
Pruned Subtrees

Let T denote a binary tree and |T| the number of leaf nodes of T.
The branch Tj of tree T consists of node j and all of its descendants.
The tree obtained by pruning T at node j is denoted by T − Tj.
A pruned subtree of T is any tree that can be obtained by pruning T at zero
or more nodes. If T′ is a pruned subtree of T, we denote this by T′ ≤ T or
by T ≥ T′.
A non-trivial binary tree has a large number of pruned subtrees. In fact,
the number of pruned subtrees of T is ≈ 1.5028369^|T|.
Number of Pruned Subtrees

[Plot: number of pruned subtrees versus the number of leaves in the full tree
(1 to 24); the count reaches into the tens of thousands.]

[Plot: the same relationship with ln(number of pruned subtrees) on the
vertical axis, showing the exponential growth as a straight line.]
Pruned Subtrees

The number of pruned subtrees of T is ≈ 1.5028369^|T|.

For our sample tree, |T| = 6, and thus 1.5028369^6 ≈ 12.

This is close to the value of 10 that we obtained by enumerating the possible
subtrees.

Cost Complexity Function

The impetus behind cost-complexity pruning is to greatly reduce the number of
pruned subtrees that must be examined.
Let R(T) be the fraction of cases in the training sample that are
misclassified by T. This is referred to as the resubstitution error.
We define the total cost Cα(T) of tree T as
    Cα(T) = R(T) + α|T|.
Thus Cα(T) consists of two terms: the resubstitution error R(T) and a penalty
for the complexity of the tree, α|T|.

Cost Complexity Function …

We denote a tree T to be pruned to the "right size" by Tmax.
If α is fixed, there is a smallest minimizing subtree T(α) of Tmax such that
  1) Cα(T(α)) = min_{T ≤ Tmax} Cα(T)
  2) If Cα(T) = Cα(T(α)), then T(α) ≤ T.
It can be shown that for every value of α > 0 there is a smallest minimizing
subtree T(α). If two subtrees achieve this minimum, they cannot be
incompatible; that is, it cannot happen that neither is a subtree of the
other.

Sequence of Subtrees of Tmax

Although there are an infinite number of possible values of α, there are only
a finite number of subtrees of Tmax.
We can construct a decreasing sequence of subtrees of Tmax,
    T1 > T2 > ... > {t1}
(where {t1} is the root node of the tree), such that Tk is the smallest
minimizing subtree for α ∈ [αk, αk+1).
This result makes it possible to create an efficient algorithm to find the
smallest minimizing subtrees for different values of α.
The first tree in the sequence, T1, is the smallest subtree of Tmax with the
same resubstitution error as Tmax; that is, T1 = T(α = 0).

Computing T1 from Tmax

To compute T1 from Tmax, we find any pair of leaf nodes with a common parent
that can be pruned without increasing the resubstitution error, and we
continue until there are no more such pairs of nodes. The result is a tree
with the same total cost as Tmax at α = 0; since the resulting tree is
smaller, it is preferred.

Algorithm for Computing T1 from Tmax

A) Set T′ = Tmax.
B) Combine pairs of nodes:
   1) Choose any pair of leaves l and r with a common parent t in T′ such
      that R(t) = R(l) + R(r).
   2) Set T′ = T′ − Tt.
   3) Stop when there are no more such pairs.
C) Set T1 = T′.

Computing the Sequence of Subtrees

Let Tt denote the branch of T with root node t. If we were to prune at t, its
contribution to the total cost of T − Tt would become
    Cα({t}) = R(t) + α,
where R(t) = r(t) p(t), r(t) is the resubstitution error at node t, and p(t)
is the proportion of cases that fall into node t.

The contribution of Tt to the total cost of T is
    Cα(Tt) = R(Tt) + α|Tt|,
where R(Tt) is the sum of R(t′) over the leaves t′ of Tt.

T − Tt becomes the better tree when Cα({t}) = Cα(Tt), since at that value of
α the two have the same total cost but T − Tt is the smaller of the two.

Computing the Sequence of Subtrees …

When Cα({t}) = Cα(Tt),
    R(Tt) + α|Tt| = R(t) + α|{t}| = R(t) + α.
Solving for α, we obtain
    α = (R(t) − R(Tt)) / (|Tt| − 1).

Thus, for any node t in T1, if we increase α, then when
    α = (R(t) − R(T1,t)) / (|T1,t| − 1),
the tree obtained by pruning at t becomes better than T1.
We denote this value of α by α̂.
Computing the Sequence of Subtrees …

We compute α̂ for each node in T1 and then select the "weakest link(s)," i.e.,
the nodes for which
    g(t) = (R(t) − R(Tt)) / (|Tt| − 1)
is smallest. We prune T1 at this node (or nodes) to obtain T2. We then repeat
the process to obtain T3, T4, ..., until we have reached {t1}, the root of
the tree, which gives us the decreasing sequence of subtrees of Tmax,
    T1 > T2 > ... > {t1}.

Algorithm for Computing the Sequence of Subtrees

A) Set T1 = T(0), α1 = 0, and k = 1.
B) While Tk > {t1}:
   1) For all non-terminal nodes t ∈ Tk, compute
        g_k(t) = (R(t) − R(T_{k,t})) / (|T_{k,t}| − 1).
   2) Set α_{k+1} = min_t g_k(t).
   3) Visit the nodes in top-down order and prune whenever g_k(t) = α_{k+1},
      to obtain T_{k+1}.
   4) Set k = k + 1.
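A sketch of one round of this weakest-link computation, hard-coding the
internal nodes of the full example tree A with the class counts from the
worked example (the data layout and helper name are mine):

N = 20  # total number of training samples

# Each internal node of tree A, with the class counts of the samples that
# reach it and the class counts of the leaves of its branch T_t.
nodes = {
    "x1 <= 65.75 (root)": ([10, 10], [[7, 0], [0, 1], [2, 0], [1, 0], [0, 2], [0, 7]]),
    "x1 <= 57.5":         ([9, 1],   [[7, 0], [0, 1], [2, 0]]),
    "x2 <= 67.0":         ([1, 9],   [[1, 0], [0, 2], [0, 7]]),
    "x1 <= 58.75":        ([2, 1],   [[0, 1], [2, 0]]),
    "x1 <= 92.0":         ([1, 2],   [[1, 0], [0, 2]]),
}

def R(counts):
    # R(t) = r(t) p(t): misclassified fraction at t times the fraction of cases in t
    return (sum(counts) - max(counts)) / N

for name, (counts, branch_leaves) in nodes.items():
    g = (R(counts) - sum(R(leaf) for leaf in branch_leaves)) / (len(branch_leaves) - 1)
    print(f"{name:20s} g(t) = {g:.3f}")

# The weakest links are the two depth-1 nodes with g(t) = 0.025, so
# alpha_2 = 0.025 and pruning there yields subtree I from the tables above.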

Selection of the Final Tree

The final tree is chosen from the previously constructed decreasing sequence
of subtrees of Tmax,
    T1 > T2 > ... > {t1}.
This is typically done by selecting the tree in the sequence that has the
lowest error rate on a set of test data.
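A sketch of this selection with scikit-learn, scoring one pruned tree per
candidate α on held-out data (the dataset is a placeholder):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Candidate alphas come from the pruning path computed on the training data.
alphas = DecisionTreeClassifier(criterion="entropy") \
    .cost_complexity_pruning_path(X_train, y_train).ccp_alphas

# Fit one pruned tree per alpha and keep the one with the best held-out score.
scores = [(DecisionTreeClassifier(criterion="entropy", ccp_alpha=a)
           .fit(X_train, y_train).score(X_test, y_test), a) for a in alphas]
best_score, best_alpha = max(scores)
print(best_alpha, best_score)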

