ML 10 Decision Trees


Decision Tree Learning

How do I decide what to do today?

An example from the textbook, built up feature by feature into the following
non-binary decision tree.

Non-binary decision tree

Work to Do?
  Yes → Stay In
  No  → Outlook?
          Sunny    → Go to Beach
          Overcast → Go Running
          Rainy    → Friends Busy?
                       Yes → Stay In
                       No  → Go to Movies
Binary decision tree

Work to Do?
  Yes → Stay In
  No  → Sunny?
          Yes → Go to Beach
          No  → Overcast?
                  Yes → Go Running
                  No  → Friends Busy?
                          Yes → Stay In
                          No  → Go to Movies
Components of a decision tree

Features:   Work to Do?, Sunny?, Overcast?, Friends Busy?
Categories: Stay In, Go to Beach, Go Running, Go to Movies
Sample binary decision tree

Terminology: the root (root node = root feature) is F1; each branch is a
split; every internal node is a feature; every leaf is a class.

F1
  Yes → C1
  No  → F2
          Yes → C2
          No  → F3
                  Yes → C3
                  No  → F4
                          Yes → C1
                          No  → C4
Full binary decision tree

[Figure: a full binary tree on features F1–F4; every node at every level is
split, so the 16 leaves correspond to the 2^4 possible combinations of the
four binary feature values.]
Complete binary decision tree

[Figure: a binary tree on features F1–F4 whose last level is only partially
filled.]

• Full at every level except possibly the last
• All nodes are as far left as possible
Decision stump

[Figure: a single decision node F1 whose two branches lead directly to the
results R1 and R2.]
Decision trees

• Decision trees classify instances by sorting them down the tree from the
  root to some leaf node, which provides the classification of the instance.
• Each node in the tree specifies a test of some feature of the instance,
  and each branch from that node corresponds to one of the possible values
  (or range of values) of that feature.

Decision tree classification
• Starting at the root node, test the feature specified by that node.
• Then move down the branch corresponding to the value (or range of values)
  that the given instance has for that feature.
• Repeat this process for the subtree rooted at the new node until a leaf is
  reached, i.e., there are no remaining features to be examined.
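A minimal sketch of this traversal in Python, using a hypothetical nested-dict
encoding of the weekend-activity tree shown earlier (the dict layout and the
sample instance are my own illustration, not from the slides):

# Internal nodes map a feature name to {branch value -> subtree}; leaves are class labels.
tree = {
    "Work to Do?": {
        "Yes": "Stay In",
        "No": {
            "Outlook?": {
                "Sunny": "Go to Beach",
                "Overcast": "Go Running",
                "Rainy": {"Friends Busy?": {"Yes": "Stay In", "No": "Go to Movies"}},
            }
        },
    }
}

def classify(node, instance):
    # Walk from the root to a leaf, following the branch that matches the
    # instance's value for each tested feature.
    while isinstance(node, dict):
        feature, branches = next(iter(node.items()))
        node = branches[instance[feature]]
    return node

print(classify(tree, {"Work to Do?": "No", "Outlook?": "Rainy", "Friends Busy?": "No"}))
# -> Go to Movies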

Decision tree learning
• Training a decision tree consists of
determining which feature to assign to each
node and which value (or range of values) of
that feature to assign to each branch.
• Most decision tree learning algorithms utilize a
top-down, greedy search through the space of
possible decision trees.

Greedy algorithms
• A greedy algorithm makes the locally optimal choice at each stage in the
  hope of finding a globally optimal solution (or something close to it).

Subway problem: choosing a route

• A commute from 72nd St & Broadway to the WeWork in Dumbo Heights.

[Map: subway lines 1, 2, 3, A, C, and F running south from 72nd St; near
Dumbo, the F stops at York St., the A/C at High St., and the 2/3 at
Clark St.]
Greedy algorithm for choosing my route

• At each node where there's a choice, take the first train that arrives.
• If two arrive simultaneously, take the express.
• Don't switch trains on the same line.

[Map: the same subway map as on the previous slide.]
Greedy algorithm drawbacks

Following these rules, I:
• never take the F, although York St. is the closest of the three Dumbo
  stations to the office;
• take a C even when the displays say that an A will be arriving momentarily;
• take the 1 even if the A/C line is out of service.

[Map: the same subway map as on the previous slides.]
Measures of a decision tree

• The depth of a decision tree is the length of the longest path from the
  root of the tree to a leaf.
• The size of a decision tree is the number of nodes in the tree.
Full binary decision tree

[Figure: the full binary tree on F1–F4 shown earlier, annotated with its
depth and size.]

• n binary features can be used to classify samples into at most 2^n classes.
• depth d = length of the longest path from the root to a leaf
• size s = number of nodes in the tree; for a full binary tree, s = 2^(d+1) − 1
• For this tree, d = 4 and s = 2^5 − 1 = 31.
Decision stump

[Figure: a single decision node F1 with results R1 and R2.]

• Only one decision node, i.e., depth = 1 and size = 3.
Finding the optimal binary decision tree

• There are many possible decision trees.
• How do we choose the optimal one?
• We start at the tree root and split on the feature that results in the
  largest information gain (IG).
• If we don't prune the tree, we continue down the tree, splitting at each
  node until a stopping criterion is satisfied.
Impurity

Impurity is a measure of how homogeneous or heterogeneous a set of objects
is. A set is said to be pure (or homogeneous) if it contains only a single
class.

There are a number of commonly used measures of impurity.
Notation
• f is the feature (and its values) used to perform the split
• the parent node is the node at which the split is made
• m is the number of child nodes of the parent node
for a binary tree, m = 2
• Dp is the dataset of the parent node
• Dj is the dataset of the jth child node
• Np is the number of samples in the parent node
• Nj is the number of samples in the jth child node
• I is the impurity, a measure of the heterogeneity of a
dataset
• IG is the information gain at a particular split in the tree;
it is the difference between the impurity of the parent node
and the weighted sum of the impurities of the child nodes
Information Gain
Information gain at a particular split in the tree is the
difference between the impurity of the parent node and the
weighted sum of the impurities of the child nodes.

IG(Dp, f) = I(Dp) − Σ_{j=1}^{m} (Nj / Np) · I(Dj)
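A small sketch of this computation (the function name and the example split
are my own; the impurity function is passed in so that any of the measures
defined on the next slides can be used):

from typing import Callable, List

def information_gain(parent: List[int],
                     children: List[List[int]],
                     impurity: Callable[[List[float]], float]) -> float:
    # IG(Dp, f) = I(Dp) - sum_j (Nj / Np) * I(Dj), with each dataset given
    # as a list of per-class counts, e.g. [9, 1].
    def probs(counts):
        n = sum(counts)
        return [c / n for c in counts]
    n_parent = sum(parent)
    weighted = sum(sum(c) / n_parent * impurity(probs(c)) for c in children)
    return impurity(probs(parent)) - weighted

# Quick check with the Gini impurity and the first split of the worked
# example later in the slides: [10, 10] split into [9, 1] and [1, 9].
gini = lambda p: 1.0 - sum(pi ** 2 for pi in p)
print(information_gain([10, 10], [[9, 1], [1, 9]], gini))  # ≈ 0.32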

Impurity measures

• IG Gini impurity
• IH Entropy
• IE Classification Error

Gini impurity, IG

The Gini impurity, IG, is a measure of how often a randomly chosen element
from the set would be incorrectly labeled if it were randomly labeled
according to the distribution of labels in the subset.

IG({pi}) = Σ_{i=1}^{m} pi (1 − pi) = Σ_{i=1}^{m} pi − Σ_{i=1}^{m} pi²
         = 1 − Σ_{i=1}^{m} pi² = Σ_{i≠j} pi pj

pi denotes the fraction of the elements in the set that are in class i.

Entropy, IH

The entropy, IH, is a measure of information content and of disorder. It is
a measure that is in wide use in statistical physics.

IH({pi}) = − Σ_{i=1}^{m} pi log₂ pi

Classification error, IE

The classification error, IE, is a simpler measure of impurity. It is a
measure of how often a randomly chosen element from the set would be
incorrectly labeled if it were labeled with the most prevalent label in the
subset.

IE({pi}) = 1 − max{pi}
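A short sketch of the three measures for a vector of class probabilities;
it reproduces the two-class values used in the worked examples a few slides
below (the function names are mine):

import math
from typing import Sequence

def gini(p: Sequence[float]) -> float:
    # I_G = 1 - sum_i p_i^2
    return 1.0 - sum(pi ** 2 for pi in p)

def entropy(p: Sequence[float]) -> float:
    # I_H = -sum_i p_i log2 p_i (terms with p_i = 0 contribute nothing)
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def classification_error(p: Sequence[float]) -> float:
    # I_E = 1 - max_i p_i
    return 1.0 - max(p)

for p in ([0.3, 0.7], [0.5, 0.5]):
    print(p, round(gini(p), 4), round(entropy(p), 4), round(classification_error(p), 4))
# [0.3, 0.7] -> 0.42  0.8813  0.3
# [0.5, 0.5] -> 0.5   1.0     0.5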

Impurity measures for sets with two classes

[Plot: Gini impurity, entropy, and classification error as functions of
p1 = 1 − p2, for p1 from 0 to 1. All three measures are 0 at p1 = 0 and
p1 = 1 and peak at p1 = 0.5, where the entropy reaches 1.0 and the Gini
impurity and classification error reach 0.5.]
Impurity measures for sets with two classes

[Plot: the same curves with the entropy scaled so that it also peaks at 0.5,
making the three measures easier to compare; the vertical axis runs from
0 to 0.5.]
Examples of impurity measures
n = 10, m = 2

p1= 0.3, p2 = 0.7 p1= 0.5, p2 = 0.5 p1= 0.7, p2 = 0.3

Gini impurity, IG:  IG({pi}) = Σ_{i≠j} pi pj

n = 10, m = 2

p1 = 0.3, p2 = 0.7:  IG = 0.3·0.7 + 0.7·0.3 = 0.42
p1 = 0.5, p2 = 0.5:  IG = 0.5·0.5 + 0.5·0.5 = 0.50
p1 = 0.7, p2 = 0.3:  IG = 0.7·0.3 + 0.3·0.7 = 0.42
Entropy, IH:  IH({pi}) = − Σ_{i=1}^{m} pi log₂ pi

n = 10, m = 2

p1 = 0.3, p2 = 0.7:  IH = −0.3·log₂ 0.3 − 0.7·log₂ 0.7 ≈ 0.881
p1 = 0.5, p2 = 0.5:  IH = −0.5·log₂ 0.5 − 0.5·log₂ 0.5 = 1.0
p1 = 0.7, p2 = 0.3:  IH = −0.7·log₂ 0.7 − 0.3·log₂ 0.3 ≈ 0.881
Classification error, IE:  IE({pi}) = 1 − max{pi}

n = 10, m = 2

p1 = 0.3, p2 = 0.7:  IE = 1 − max{0.3, 0.7} = 0.3
p1 = 0.5, p2 = 0.5:  IE = 1 − max{0.5, 0.5} = 0.5
p1 = 0.7, p2 = 0.3:  IE = 1 − max{0.7, 0.3} = 0.3
Worked example: growing a tree with the entropy criterion

The training set has 20 samples described by two continuous features, x1 and
x2, with 10 samples in each class (value = [10, 10], entropy = 1.0).

Depth 1: split the root on x1 <= 65.75.
  True:  samples = 10, value = [9, 1], entropy = 0.469
  False: samples = 10, value = [1, 9], entropy = 0.469

Depth 2: split the left child on x1 <= 57.5.
  True:  samples = 7, value = [7, 0], entropy = 0.0
  False: samples = 3, value = [2, 1], entropy = 0.918
Depth 2: split the right child on x2 <= 67.0.
  True:  samples = 3, value = [1, 2], entropy = 0.918
  False: samples = 7, value = [0, 7], entropy = 0.0

Depth 3: split the [2, 1] node on x1 <= 58.75 into [0, 1] and [2, 0], and
split the [1, 2] node on x1 <= 92.0 into [1, 0] and [0, 2]. All leaves are
now pure (entropy = 0.0).

[Scatter plots accompany each step, showing the (x1, x2) plane partitioned
into axis-aligned rectangles by the splits made so far.]
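The node summaries above are in the format scikit-learn produces. A sketch of
how such a tree could be grown and inspected (the data here are randomly
generated stand-ins, not the 20 samples from the slides):

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Placeholder data: 20 samples, two features, two classes.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(20, 2))
y = (X[:, 0] > 65).astype(int)

# Grow the tree greedily, choosing at each node the split with the largest
# information gain (criterion="entropy").
clf = DecisionTreeClassifier(criterion="entropy").fit(X, y)

# Text rendering of the learned thresholds and leaf class counts;
# sklearn.tree.plot_tree would draw the node boxes (entropy / samples /
# value) as shown in the slides.
print(export_text(clf, feature_names=["x1", "x2"], show_weights=True))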
Typical criteria to stop training

• Only leaf nodes remain
• No further features remain to be examined
• Additional splitting fails to reduce the impurity by a specified amount
• A specified maximum tree depth has been reached
• A specified number of leaf nodes has been generated
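For reference, these criteria correspond roughly to scikit-learn's
DecisionTreeClassifier constructor parameters; a sketch with arbitrary
values:

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    criterion="entropy",
    max_depth=3,                 # maximum tree depth
    max_leaf_nodes=8,            # maximum number of leaf nodes
    min_impurity_decrease=0.01,  # required impurity reduction per split
    min_samples_leaf=1,          # smallest allowed leaf
)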

Classifying a new sample
• Follow the tree down until a leaf is reached.
• Classify the sample based on the training samples in that leaf.
• If the leaf is homogeneous, use the category of the
training samples in the leaf.
• If the leaf is heterogeneous, apply one of the following:
• Use the category that occurs most frequently in the
training samples in the leaf.
• Choose a category randomly with the probability for
each category being the fraction of training samples
of that category in that leaf.
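A tiny sketch of the two rules for a heterogeneous leaf (the leaf counts are
hypothetical):

import numpy as np

leaf_counts = np.array([2, 1])               # training samples per class in the leaf
probs = leaf_counts / leaf_counts.sum()      # class fractions in the leaf

majority_class = int(np.argmax(leaf_counts))                 # most frequent category
sampled_class = int(np.random.choice(len(probs), p=probs))   # drawn with those fractions
print(majority_class, probs, sampled_class)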

Advantages of decision trees

• Efficient and scalable


• Handles both discrete and continuous features
• Robust against monotonic input transformations
• Robust against outliers
• Automatically ignores irrelevant features; no
need for feature selection (but feature
engineering may be useful)
• Results are usually interpretable

Overfitting is often an issue

• n binary features can be used to classify samples into up to 2^n classes.
• There are many possible decision trees:
  • For N binary features there are 2^(2^N) possible binary decision trees.
  • If one or more of the features is a rational number, there are an
    infinite number of possible decision trees, and such features can be
    used repeatedly.
• Pruning is one way to reduce overfitting.
Pruning a decision tree

Pruning is a technique in machine learning that reduces the size of decision
trees by removing sections of the tree that provide little power to classify
instances.

Pruning reduces the complexity of the final classifier, and hence improves
predictive accuracy by reducing overfitting.

Pruning algorithms
• Reduced error pruning
• Cost complexity pruning
Reduced Error Pruning
• A fast and efficient bottom-up greedy algorithm.
• Start by testing the nodes nearest the leaves:
  • Consider replacing a node with its most popular class, thereby making it
    a leaf node.
  • If the prediction accuracy is not affected, keep the change under
    consideration; if it is, move on to another node.
• Prune nodes iteratively, always pruning the node that most increases the
  decision tree's accuracy over the set of validation data.
• Continue pruning until further pruning would decrease the accuracy of the
  tree over the validation set.
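A compact sketch of reduced error pruning on a hypothetical tree
representation (nested dicts that store a majority class per node; the
representation, helper names, and validation data are my own, not from the
slides):

import copy

def predict(node, x):
    # Leaves are class labels; internal nodes hold a numeric test and children.
    while isinstance(node, dict):
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node

def accuracy(tree, X_val, y_val):
    return sum(predict(tree, x) == y for x, y in zip(X_val, y_val)) / len(y_val)

def reduced_error_prune(tree, X_val, y_val):
    # Repeatedly replace the internal node whose collapse to its majority
    # class helps validation accuracy the most; stop when every possible
    # collapse would lower the accuracy.
    def paths(node, path=()):
        # bottom-up enumeration of paths ("left"/"right" keys) to internal nodes
        if isinstance(node, dict):
            yield from paths(node["left"], path + ("left",))
            yield from paths(node["right"], path + ("right",))
            yield path

    def collapsed_at(tree, path):
        new = copy.deepcopy(tree)
        if not path:                       # collapsing the root leaves a single leaf
            return new["majority"]
        parent = new
        for key in path[:-1]:
            parent = parent[key]
        parent[path[-1]] = parent[path[-1]]["majority"]
        return new

    while isinstance(tree, dict):
        base = accuracy(tree, X_val, y_val)
        candidates = [collapsed_at(tree, p) for p in paths(tree)]
        scored = [(accuracy(t, X_val, y_val), i) for i, t in enumerate(candidates)]
        best_acc, best_i = max(scored)
        if best_acc < base:
            break                          # any further pruning hurts validation accuracy
        tree = candidates[best_i]
    return tree

Each accepted collapse removes at least one internal node, so the loop is
guaranteed to terminate.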
Cost Complexity Pruning

• Create a series of trees


• Choose one that minimizes a cost function that
includes terms for impurity and complexity.

Cost-complexity cost function

The total cost Cα(T) of tree T is defined as

    Cα(T) = R(T) + α|T|

where
• R(T) is the fraction of cases in the training sample that are misclassified
  by T, the resubstitution error;
• α|T| is the complexity penalty, where |T| is the number of leaves in T.
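As a quick numerical check (using values from the example subtrees tabulated
below), a one-line sketch of this cost:

def total_cost(resubstitution_error, n_leaves, alpha):
    # C_alpha(T) = R(T) + alpha * |T|
    return resubstitution_error + alpha * n_leaves

print(total_cost(0.00, 6, alpha=0.025))  # 0.15, the C_0.025 value for the full tree A below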

Summary of the Algorithm for Cost-Complexity Pruning

A) Choose Tmax, the tree that is to be pruned to the "right size."
B) Compute T1 from Tmax. T1 is the smallest subtree of Tmax that has the same
   resubstitution error as Tmax.
C) Generate the rest of the decreasing sequence of subtrees of Tmax,
       T1 > T2 > ... > {t1},
   where {t1} is the root node of the tree, such that Tk is the smallest
   minimizing subtree for α ∈ [αk, αk+1).
D) Select the final tree from the sequence by choosing the one with the
   lowest error rate on a set of test data.
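scikit-learn implements minimal cost-complexity pruning directly; a sketch of
obtaining the sequence of α values and a pruned tree (the dataset is a
placeholder, and note that scikit-learn measures R(T) with the tree's
impurity criterion rather than the misclassification rate):

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=20, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

# The pruning path returns the effective alphas at which nodes get pruned
# (the alpha_k of step C) and the total impurity of each pruned tree.
path = DecisionTreeClassifier(criterion="entropy").cost_complexity_pruning_path(X, y)
print(path.ccp_alphas, path.impurities)

# Refitting with ccp_alpha set picks the smallest minimizing subtree for that alpha.
pruned = DecisionTreeClassifier(criterion="entropy", ccp_alpha=0.03).fit(X, y)
print(pruned.get_n_leaves())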
Enumerating pruned subtrees of the example tree

Each candidate below is listed with its number of leaves |T| and its
resubstitution error R(T).

A) |T| = 6, R(T) = 0.00 — the full tree grown above (all leaves pure).
B) |T| = 5, R(T) = 0.05 — prune the x1 <= 92.0 split; the [1, 2] node becomes
   a leaf.
C) |T| = 5, R(T) = 0.05 — prune the x1 <= 58.75 split; the [2, 1] node
   becomes a leaf.
D) |T| = 4, R(T) = 0.10 — prune both depth-3 splits.
E) |T| = 4, R(T) = 0.05 — prune the x2 <= 67.0 branch; the [1, 9] node
   becomes a leaf.
F) |T| = 4, R(T) = 0.05 — prune the x1 <= 57.5 branch; the [9, 1] node
   becomes a leaf.
G) |T| = 3, R(T) = 0.10 — keep only the x1 <= 65.75 and x1 <= 57.5 splits.
H) |T| = 3, R(T) = 0.10 — keep only the x1 <= 65.75 and x2 <= 67.0 splits.
I) |T| = 2, R(T) = 0.10 — keep only the root split x1 <= 65.75.
J) |T| = 1, R(T) = 0.50 — the root alone, with no splits.

[Each subtree is drawn in the slides alongside the corresponding partition of
the (x1, x2) plane.]
All subtrees

ID   |T|   R(T)   depth   Cα(T)
A     6    0.00     3     6α
B     5    0.05     3     0.05 + 5α
C     5    0.05     3     0.05 + 5α
D     4    0.10     2     0.10 + 4α
E     4    0.05     3     0.05 + 4α
F     4    0.05     3     0.05 + 4α
G     3    0.10     2     0.10 + 3α
H     3    0.10     2     0.10 + 3α
I     2    0.10     1     0.10 + 2α
J     1    0.50     0     0.50 + α
Cost Complexity Function

[Plot: Cα(T) = R(T) + α|T| versus α (0 to 0.04) for each candidate subtree:
A [6 leaves], B,C [5], D [4], E,F [4], G,H [3], I [2], and J [1].]

[Plot: the same curves zoomed to Cα(T) ≤ 0.25, showing that A has the lowest
cost for small α and I for larger α, with the crossover near α = 0.025.]
Candidate subtrees

ID    |T|   R(T)   depth   Cα(T)        α where T = argmin Cα(T)
A      6    0.00     3     6α           [0, 0.025)
B, C   5    0.05     3     0.05 + 5α
D      4    0.10     2     0.10 + 4α
E, F   4    0.05     3     0.05 + 4α    0.025
G, H   3    0.10     2     0.10 + 3α
I      2    0.10     1     0.10 + 2α    [0.025, ∞)
J      1    0.50     0     0.50 + α
Candidate subtrees

ID    |T|   R(T)   depth   Cα(T)        α where T = argmin Cα(T)   C0.025(T)
A      6    0.00     3     6α           [0, 0.025)                 0.150
B, C   5    0.05     3     0.05 + 5α                               0.175
D      4    0.10     2     0.10 + 4α                               0.200
E, F   4    0.05     3     0.05 + 4α    0.025                      0.150
G, H   3    0.10     2     0.10 + 3α                               0.175
I      2    0.10     1     0.10 + 2α    [0.025, ∞)                 0.150
J      1    0.50     0     0.50 + α                                0.525
Candidate subtrees

The sequence of candidate subtrees is TA > TI.

Choose the one that minimizes the error rate on a set of test data.
Pruned Subtrees

Let T denote a binary tree and |T| the number of leaf nodes of T.
The branch Tj of tree T consists of node j and all of its descendants.
The tree obtained by pruning T at node j is denoted by T − Tj.
A pruned subtree of T is any tree that can be obtained by pruning T at zero
or more nodes. If T′ is a pruned subtree of T, we denote this by T′ ≤ T or
by T ≥ T′.
A non-trivial binary tree has a large number of pruned subtrees. In fact,
the number of pruned subtrees of T is ≈ 1.5028369^|T|.
Number of Pruned Subtrees

[Plot: number of pruned subtrees versus the number of leaves in the full tree
(1 to 24); the count reaches into the tens of thousands.]

[Plot: the same relationship with ln(number of pruned subtrees) on the
vertical axis, showing the exponential growth as a straight line.]
Pruned Subtrees

The number of pruned subtrees of T is ≈ 1.5028369^|T|.

For our sample tree, |T| = 6, and thus 1.5028369^6 ≈ 12.

This is close to the value of 10 that we obtained by enumerating the possible
subtrees.

Cost Complexity Function

The impetus behind cost-complexity pruning is to greatly reduce the number of
pruned subtrees that must be examined.
Let R(T) be the fraction of cases in the training sample that are
misclassified by T. This is referred to as the resubstitution error.
We define the total cost Cα(T) of tree T as
    Cα(T) = R(T) + α|T|.
Thus Cα(T) consists of two terms: the resubstitution error R(T) and a penalty
for the complexity of the tree, α|T|.

Cost Complexity Function …

We denote a tree T to be pruned to the "right size" by Tmax.
If α is fixed, there is a smallest minimizing subtree T(α) of Tmax such that
  1) Cα(T(α)) = min_{T ≤ Tmax} Cα(T)
  2) If Cα(T) = Cα(T(α)), then T(α) ≤ T.
It can be shown that for every value of α > 0 there is a smallest minimizing
subtree T(α). If two subtrees achieve this minimum, they cannot be
incompatible; that is, it cannot happen that neither is a subtree of the
other.

Sequence of Subtrees of Tmax

Although there are an infinite number of possible values of α, there are only
a finite number of subtrees of Tmax.
We can construct a decreasing sequence of subtrees of Tmax,
    T1 > T2 > ... > {t1}
(where {t1} is the root node of the tree), such that Tk is the smallest
minimizing subtree for α ∈ [αk, αk+1).
This result makes it possible to create an efficient algorithm to find the
smallest minimizing subtrees for different values of α.
The first tree in the sequence, T1, is the smallest subtree of Tmax with the
same resubstitution error as Tmax; that is, T1 = T(α = 0).

Computing T1 from Tmax

To compute T1 from Tmax, we find any pair of leaf nodes with a common parent
that can be pruned without increasing the resubstitution error, and we
continue until there are no more such pairs of nodes. The result is a tree
with the same total cost as Tmax at α = 0; since the resulting tree is
smaller, it is preferred.

Algorithm for Computing T1 from Tmax

A) Set T′ = Tmax.
B) Combine pairs of nodes:
   1) Choose any pair of leaves l and r with a common parent t in T′ such
      that R(t) = R(l) + R(r).
   2) Set T′ = T′ − Tt.
   3) Stop when there are no more such pairs.
C) Set T1 = T′.

Computing the Sequence of Subtrees

Let Tt denote the branch of T with root node t. If we were to prune at t, its
contribution to the total cost of T − Tt would become
    Cα({t}) = R(t) + α,
where R(t) = r(t) p(t), r(t) is the resubstitution error at node t, and p(t)
is the proportion of cases that fall into node t.

The contribution of Tt to the total cost of T is
    Cα(Tt) = R(Tt) + α|Tt|,
where R(Tt) is the sum of R(t′) over the leaves t′ of Tt.

T − Tt becomes the better tree when Cα({t}) = Cα(Tt), since at that value of
α the two have the same total cost but T − Tt is the smaller of the two.

Computing the Sequence of Subtrees …

When Cα({t}) = Cα(Tt),
    R(Tt) + α|Tt| = R(t) + α|{t}| = R(t) + α.
Solving for α, we obtain
    α = (R(t) − R(Tt)) / (|Tt| − 1).

Thus, for any node t in T1, if we increase α, then when
    α = (R(t) − R(T1,t)) / (|T1,t| − 1),
the tree obtained by pruning at t becomes better than T1.
We denote this value of α by α̂.
Computing the Sequence of Subtrees …

We compute α̂ for each node in T1 and then select the "weakest link(s)," i.e.,
the nodes for which
    g(t) = (R(t) − R(Tt)) / (|Tt| − 1)
is smallest. We prune T1 at this node (or nodes) to obtain T2. We then repeat
the process to obtain T3, T4, ..., until we have reached {t1}, the root of
the tree, which gives us the decreasing sequence of subtrees of Tmax,
    T1 > T2 > ... > {t1}.

Algorithm for Computing the Sequence of Subtrees

A) Set T1 = T(0), α1 = 0, and k = 1.
B) While Tk > {t1}:
   1) For all non-terminal nodes t ∈ Tk, compute
        g_k(t) = (R(t) − R(T_{k,t})) / (|T_{k,t}| − 1).
   2) Set α_{k+1} = min_t g_k(t).
   3) Visit the nodes in top-down order and prune whenever g_k(t) = α_{k+1},
      to obtain T_{k+1}.
   4) Set k = k + 1.
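A sketch of one round of this weakest-link computation, hard-coding the
internal nodes of the full example tree A with the class counts from the
worked example (the data layout and helper name are mine):

N = 20  # total number of training samples

# Each internal node of tree A, with the class counts of the samples that
# reach it and the class counts of the leaves of its branch T_t.
nodes = {
    "x1 <= 65.75 (root)": ([10, 10], [[7, 0], [0, 1], [2, 0], [1, 0], [0, 2], [0, 7]]),
    "x1 <= 57.5":         ([9, 1],   [[7, 0], [0, 1], [2, 0]]),
    "x2 <= 67.0":         ([1, 9],   [[1, 0], [0, 2], [0, 7]]),
    "x1 <= 58.75":        ([2, 1],   [[0, 1], [2, 0]]),
    "x1 <= 92.0":         ([1, 2],   [[1, 0], [0, 2]]),
}

def R(counts):
    # R(t) = r(t) p(t): misclassified fraction at t times the fraction of cases in t
    return (sum(counts) - max(counts)) / N

for name, (counts, branch_leaves) in nodes.items():
    g = (R(counts) - sum(R(leaf) for leaf in branch_leaves)) / (len(branch_leaves) - 1)
    print(f"{name:20s} g(t) = {g:.3f}")

# The weakest links are the two depth-1 nodes with g(t) = 0.025, so
# alpha_2 = 0.025 and pruning there yields subtree I from the tables above.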

Selection of the Final Tree

The final tree is chosen from the previously constructed decreasing sequence
of subtrees of Tmax,
    T1 > T2 > ... > {t1}.
This is typically done by selecting the tree in the sequence that has the
lowest error rate on a set of test data.
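A sketch of this selection with scikit-learn, scoring one pruned tree per
candidate α on held-out data (the dataset is a placeholder):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Candidate alphas come from the pruning path computed on the training data.
alphas = DecisionTreeClassifier(criterion="entropy") \
    .cost_complexity_pruning_path(X_train, y_train).ccp_alphas

# Fit one pruned tree per alpha and keep the one with the best held-out score.
scores = [(DecisionTreeClassifier(criterion="entropy", ccp_alpha=a)
           .fit(X_train, y_train).score(X_test, y_test), a) for a in alphas]
best_score, best_alpha = max(scores)
print(best_alpha, best_score)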

