Piotr Paszek
Classification
Decision Tree
A Decision Tree is a directed acyclic graph (a tree), where:
– each internal node (nonleaf node) denotes a test on an attribute,
– each branch represents an outcome of the test,
– each leaf node (or terminal node) holds a class label.
The topmost node in a tree is the root node.
The tree is constructed in a top-down, recursive, divide-and-conquer manner:
– At the start, all the training examples are at the root.
– Attributes are categorical (continuous-valued attributes are discretized in advance).
– Examples are partitioned recursively based on selected attributes.
– Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain, entropy).
Conditions for stopping partitioning:
– All samples for a given node belong to the same class.
– There are no remaining attributes for further partitioning (majority voting is then employed to label the leaf).
Algorithm for Decision Tree Induction
Input: D – training set; A – attribute set;
Output: DT – decision tree.
MakeTree(D):
  Partition(D);

Partition(S):
  if all samples from S belong to the same class then return;
  forall a ∈ A do
    calculate split points for the attribute a;
  select the "best" split point;
  divide S into S1 and S2;
  Partition(S1);
  Partition(S2);
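As an illustration only, the recursive procedure above might be sketched in Python. The sample representation and the helper `impurity` (a simple misclassification count, standing in for the heuristic measures discussed later) are my own assumptions, not part of the original algorithm:

```python
from collections import Counter

def make_tree(samples):
    """samples: list of (attributes_dict, class_label) pairs."""
    return partition(samples)

def impurity(samples):
    # Misclassification count: samples outside the majority class.
    labels = [label for _, label in samples]
    return len(labels) - Counter(labels).most_common(1)[0][1]

def partition(samples):
    classes = {label for _, label in samples}
    if len(classes) == 1:                       # stopping condition: pure node
        return {"leaf": classes.pop()}
    best = None
    for attrs, _ in samples:                    # forall a in A: try split points
        for a, v in attrs.items():
            s1 = [s for s in samples if s[0][a] == v]
            s2 = [s for s in samples if s[0][a] != v]
            if not s1 or not s2:
                continue
            score = impurity(s1) + impurity(s2)
            if best is None or score < best[0]:
                best = (score, a, v, s1, s2)
    if best is None:                            # nothing splits S: majority vote
        labels = [label for _, label in samples]
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    _, a, v, s1, s2 = best
    return {"test": (a, v), "yes": partition(s1), "no": partition(s2)}
```

For example, `make_tree([({"egg": 1}, "insect"), ({"egg": 0}, "mammal")])` produces a single test on `egg` with two leaves.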
In general, all three measures return good results, but:
Information gain (ID3, C4.5):
– biased towards multivalued attributes.
Gain ratio (C4.5):
– tends to prefer unbalanced splits in which one partition is much smaller than the other.
Gini index (CART):
– biased towards multivalued attributes,
– has difficulty when the number of classes is large,
– tends to favor tests that result in equal-sized partitions and purity in both partitions.
Information gain
Entropy function (a measure of uncertainty)
The expected information needed to classify a tuple in D is given by

Info(D) = −Σ_{i=1}^{m} p_i · log2(p_i)

where
m – number of decision classes (in D),
p_i – probability that an arbitrary tuple in D belongs to class C_i (estimated by |C_{i,D}|/|D|).
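As a sanity check, the entropy formula can be computed directly; a minimal sketch (the function name `info` is my own):

```python
from math import log2

def info(counts):
    """Info(D) = -sum_i p_i * log2(p_i), given the class counts |C1|, ..., |Cm|."""
    n = sum(counts)
    # terms with a zero count contribute nothing (p log p -> 0)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)
```

For example, `info([2, 3, 1, 3])` gives about 1.891, the value used in the worked example below.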
Conditional Entropy
Information needed (after using attribute A to split D into v partitions S1, S2, . . . , Sv) to classify D:

Info_A(D) = Σ_{j=1}^{v} (|S_j|/|D|) · Info(S_j)

Info_A(D) is the expected information required to classify a tuple from D based on the partitioning by A.
The smaller the expected information (still) required, the greater the purity of the partitions. If splitting D into partitions S1, S2, . . . , Sv creates "pure" partitions (each S_i contains objects belonging to a single class), then Info_A(D) = 0.
Information Gain
Information gained by branching on attribute A:

Gain(A) = Info(D) − Info_A(D)
Gain(A) tells us how much would be gained by branching on A.
It is the expected reduction in the information requirement caused by knowing the value of A.
The attribute A with the highest information gain is chosen as the splitting attribute. This is equivalent to saying that we want to partition on the attribute A that would do the best classification, so that the amount of information still required to finish classifying the tuples is minimal.
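Gain(A) can be computed as a small sketch (the function names and the list-based interface are my own; `labels` holds the class of each tuple and `values` the corresponding values of A):

```python
from math import log2

def info(counts):
    # Info(D) = -sum p_i log2 p_i for a list of class counts
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def gain(labels, values):
    """Gain(A) = Info(D) - Info_A(D) for an attribute given as a value list."""
    def counts(ls):
        return [ls.count(c) for c in set(ls)]
    total = info(counts(labels))
    n = len(labels)
    cond = 0.0
    for v in set(values):                 # one partition per value of A
        part = [l for l, x in zip(labels, values) if x == v]
        cond += len(part) / n * info(counts(part))
    return total - cond
```

On the animal table below, splitting on egg gives a gain of about 0.918.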
Output: DT – decision tree.
MakeTree(D):
  Partition(D);

Partition(S):
  if all samples from S belong to the same class then return;
  forall a ∈ A do
    calculate Gain(a);
  select b – the attribute with the greatest information gain;
  divide S (using b) into S1, . . . , Svb;
  Partition(S1); . . . ; Partition(Svb);
An algorithm that creates a decision tree using information gain
1 We choose the attribute with the greatest information gain (the greatest reduction in entropy), which becomes the root of the decision tree.
2 For each value of the selected attribute, a branch is created in this decision tree. With each branch we associate records with the same value of the selected attribute (partition).
3 Each partition associated with a branch is then partitioned in the same way. Partitioning continues until all the records of a partition belong to the same class.
(name) Egg Environment Predator Vertebrate Type
wasp 1 land 0 0 insect
ladybug 1 land 1 0 insect
pigeon 1 land 0 1 bird
ostrich 1 land 0 1 bird
hawk 1 land 1 1 bird
catfish 1 water 1 1 fish
antelope 0 land 0 1 mammal
dolphin 0 water 1 1 mammal
lion 0 land 1 1 mammal
Information Gain – example
S = {wasp, ladybug, . . . , lion}, |S| = 9
Type ∈ {insect, bird, fish, mammal}, m = 4 (4 classes):
C1 – insect, s1 = 2; C2 – bird, s2 = 3; C3 – fish, s3 = 1; C4 – mammal, s4 = 3.
Info(2, 3, 1, 3) = −(2/9)·log2(2/9) − (1/3)·log2(1/3) − (1/9)·log2(1/9) − (1/3)·log2(1/3) = 1.891

Info_egg:
egg = 1: |C11| = 2, |C21| = 3, |C31| = 1, |C41| = 0, Info(s11, s21, s31, s41) = Info(2, 3, 1, 0) = 1.459
egg = 0: |C10| = 0, |C20| = 0, |C30| = 0, |C40| = 3, Info(s10, s20, s30, s40) = Info(0, 0, 0, 3) = 0.0
Info_egg(S) = 6/9 · Info(2, 3, 1, 0) + 3/9 · Info(0, 0, 0, 3) = 0.973
Gain(egg) = Info(S) − Info_egg(S) = 1.891 − 0.973 = 0.918
Info_environment:
environment = land: s1l = 2, s2l = 3, s3l = 0, s4l = 2, Info(s1l, s2l, s3l, s4l) = Info(2, 3, 0, 2) = 1.557
environment = water: s1w = 0, s2w = 0, s3w = 1, s4w = 1, Info(s1w, s2w, s3w, s4w) = Info(0, 0, 1, 1) = 1.0
Info_environment(S) = 7/9 · Info(2, 3, 0, 2) + 2/9 · Info(0, 0, 1, 1) = 1.433
Gain(environment) = Info(S) − Info_environment(S) = 1.891 − 1.433 = 0.458
Information Gain – example
Info_predator:
predator = 1: s11 = 1, s21 = 1, s31 = 1, s41 = 2, Info(s11, s21, s31, s41) = Info(1, 1, 1, 2) = 1.922
predator = 0: s10 = 1, s20 = 2, s30 = 0, s40 = 1, Info(s10, s20, s30, s40) = Info(1, 2, 0, 1) = 1.5
Info_predator(S) = 5/9 · Info(1, 1, 1, 2) + 4/9 · Info(1, 2, 0, 1) = 1.734
Gain(predator) = Info(S) − Info_predator(S) = 1.891 − 1.734 = 0.157
Info_vertebrate:
vertebrate = 1: s11 = 0, s21 = 3, s31 = 1, s41 = 3, Info(s11, s21, s31, s41) = Info(0, 3, 1, 3) = 1.449
vertebrate = 0: s10 = 2, s20 = 0, s30 = 0, s40 = 0, Info(s10, s20, s30, s40) = Info(2, 0, 0, 0) = 0.0
Info_vertebrate(S) = 7/9 · Info(0, 3, 1, 3) + 2/9 · Info(2, 0, 0, 0) = 1.127
Gain(vertebrate) = Info(S) − Info_vertebrate(S) = 1.891 − 1.127 = 0.764
Because the egg attribute maximizes the information gain, it is selected as the first test attribute (root node).
Information Gain – example
Figure: Decision Tree after first division
From the root (egg attribute), there are two branches corresponding to the values of this attribute.
They join the root with vertices representing partitions S0 and S1. S0 is a pure partition: it contains only mammals (items for which Type = mammal).
Partition S1 will be split further.
Information Gain – example
S1 = {wasp, ladybug, pigeon, ostrich, hawk, catfish}; |S1| = 6
A = {environment, predator, vertebrate}:
C1 – insect, s1 = 2, C2 – bird, s2 = 3, C3 – fish, s3 = 1.
Info(s1, s2, s3) = −(2/6)·log2(2/6) − (1/2)·log2(1/2) − (1/6)·log2(1/6) = 1.459

Info_environment:
environment = land: s1l = 2, s2l = 3, s3l = 0, Info(s1l, s2l, s3l) = Info(2, 3, 0) = 0.971
environment = water: s1w = 0, s2w = 0, s3w = 1, Info(s1w, s2w, s3w) = Info(0, 0, 1) = 0.0
Info_environment(S1) = 5/6 · Info(2, 3, 0) + 1/6 · Info(0, 0, 1) = 0.809
Gain(environment) = Info(S1) − Info_environment(S1) = 0.650
Info_predator:
predator = 1: s11 = 1, s21 = 1, s31 = 1, Info(s11, s21, s31) = Info(1, 1, 1) = 1.585
predator = 0: s10 = 1, s20 = 2, s30 = 0, Info(s10, s20, s30) = Info(1, 2, 0) = 0.918
Info_predator(S1) = 1/2 · Info(1, 1, 1) + 1/2 · Info(1, 2, 0) = 1.252
Gain(predator) = Info(S1) − Info_predator(S1) = 1.459 − 1.252 = 0.207
Information Gain – example
Info_vertebrate:
vertebrate = 1: s11 = 0, s21 = 3, s31 = 1, Info(s11, s21, s31) = Info(0, 3, 1) = 0.811
vertebrate = 0: s10 = 2, s20 = 0, s30 = 0, Info(s10, s20, s30) = Info(2, 0, 0) = 0
Info_vertebrate(S1) = 2/3 · Info(0, 3, 1) + 1/3 · Info(2, 0, 0) = 0.541
Gain(vertebrate) = Info(S1) − Info_vertebrate(S1) = 0.918
The vertebrate attribute maximizes the information gain for S1, so it is selected as the second test attribute.
Partitions S10 and S11 are created.
Partition S10 = {wasp, ladybug} is pure (contains only insects).
Partition S11 is divided further.
Information Gain – example
S11 ={pigeon, ostrich, hawk, catfish}, |S11| = 4, A ={environment, predator}
C1 – bird, s1 = 3, C2 – fish, s2 = 1.

Info(s1, s2) = Info(3, 1) = −(3/4)·log2(3/4) − (1/4)·log2(1/4) = 0.811

Info_environment:
environment = land: s1l = 3, s2l = 0, Info(s1l, s2l) = Info(3, 0) = 0
environment = water: s1w = 0, s2w = 1, Info(s1w, s2w) = Info(0, 1) = 0
Info_environment(S11) = 3/4 · Info(3, 0) + 1/4 · Info(0, 1) = 0
Gain(environment) = Info(S11) − Info_environment(S11) = 0.811
There is no need to calculate the information gain for the predator attribute, because environment already yields zero entropy ("pure" partitions).
So environment is the third test attribute.
It divides S11 into S11l and S11w.
Partition S11l = {pigeon, ostrich, hawk} contains only birds, and partition S11w = {catfish} contains only fish.
Information Gain – example
Figure: Decision tree - final version
Let A be a continuous-valued attribute. We must determine the best split point for A:
– Sort the values of A in increasing order.
– Typically, the midpoint between each pair of adjacent values is considered as a possible split point: (a_i + a_{i+1})/2 is the midpoint between the values a_i and a_{i+1}.
– The point with the minimum expected information requirement for A is selected as the split point for A.
Split:
D1 is the set of tuples in D satisfying A ≤ split_point, and D2 is the set of tuples in D satisfying A > split_point.
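This midpoint search can be sketched as follows (the function names are my own; `info` is the entropy function from earlier):

```python
from math import log2

def info(counts):
    # Info(D) = -sum p_i log2 p_i for a list of class counts
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def best_split_point(values, labels):
    """Try the midpoint of each adjacent pair of sorted values of A and
    return the split point minimizing the expected information Info_A(D)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(n - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue                              # no midpoint between equal values
        mid = (pairs[i][0] + pairs[i + 1][0]) / 2
        d1 = [l for v, l in pairs if v <= mid]    # D1: tuples with A <= split point
        d2 = [l for v, l in pairs if v > mid]     # D2: tuples with A >  split point
        expected = sum(len(d) / n * info([d.count(c) for c in set(d)])
                       for d in (d1, d2))
        if best is None or expected < best[0]:
            best = (expected, mid)
    return best[1] if best else None
```

On the experience/job table used in the gini example later, `best_split_point([3, 2, 1, 2, 0, 3], ["yes", "no", "no", "yes", "no", "no"])` returns 1.5, i.e. the split exp. < 2.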
Gain ratio
Problem with the Information Gain approach:
– Biased towards tests with many outcomes (attributes having a large number of values).
– E.g., an attribute acting as a unique identifier produces a large number of partitions (one tuple per partition); each resulting partition D_j is pure, so Info(D_j) = 0 and the information gain is maximized.
Extension to Information Gain
C4.5, a successor of ID3, uses an extension to information gain known as gain ratio.
It overcomes the bias of information gain by applying a kind of normalization to information gain using a "split information" value.
Split Information value
The split information value represents the potential information generated by splitting the training data set D into v partitions, corresponding to the v outcomes of a test on attribute A:

SplitInfo_A(D) = −Σ_{j=1}^{v} (|D_j|/|D|) · log2(|D_j|/|D|)
High Split Info: partitions have more or less the same size (uniform)
Low Split Info: few partitions hold most of the tuples (peaks)
Gain Ratio
The gain ratio is defined as
GainRatio(A) = Gain(A) / SplitInfo(A)
In the algorithm C4.5 the attribute with the maximum gain ratio is selected as the splitting attribute
Example
Gain(egg) = 0.918
SplitInfo(egg) = −6/9 · log2(6/9) − 3/9 · log2(3/9) = 0.918
GainRatio(egg) = Gain(egg) / SplitInfo(egg) = 1.000
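The example can be reproduced with a short sketch (names are my own; SplitInfo is just the entropy of the partition sizes):

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def gain_ratio(gain_a, partition_sizes):
    """GainRatio(A) = Gain(A) / SplitInfo(A)."""
    return gain_a / entropy(partition_sizes)

# Splitting on egg gives partitions of sizes 6 and 3:
split_info_egg = entropy([6, 3])          # about 0.918
ratio_egg = gain_ratio(0.918, [6, 3])     # about 1.0
```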
The Gini Index (used in CART) measures the impurity of a data partition D
Gini Index (gini(D))
If a data set D contains examples from n classes, the gini index is defined as

gini(D) = 1 − Σ_{j=1}^{n} p_j²

where p_j is the relative frequency of class C_j in D (estimated by |C_{j,D}|/|D|).
Gini Index
Gini Index for a binary split – gini_A(D)
If a data set D is split on A into two subsets D1 and D2, the gini index of the split is defined as

gini_A(D) = (|D1|/|D|) · gini(D1) + (|D2|/|D|) · gini(D2)
The attribute that provides the smallest gini_A(D) is chosen to split the node (this requires enumerating all possible splitting points for each attribute).
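These two definitions translate directly into code; a minimal sketch (function names are my own):

```python
def gini(counts):
    """gini(D) = 1 - sum_j p_j^2, given the class counts in D."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def gini_split(counts1, counts2):
    """gini_A(D) for a binary split of D into partitions D1 and D2."""
    n1, n2 = sum(counts1), sum(counts2)
    n = n1 + n2
    return n1 / n * gini(counts1) + n2 / n * gini(counts2)
```

For instance, `gini_split([0, 2], [2, 2])` reproduces the value 1/3 obtained for the split exp. < 2 in the example below.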
Experience Education Job
(exp.) (edu.) (job)
3 higher yes
2 secondary no
1 higher no
2 higher yes
0 secondary no
3 basic no
Experience – professional experience of the person (# years)
Education – level of education of the person
Job – has the applicant been accepted?
Gini Index – example
gini_exp.: exp. ∈ {0, 1, 2, 3}.
Split point: exp. < 1.
# objects   yes  no
exp. < 1     0    1   gini(exp. < 1) = 0
exp. ≥ 1     2    3   gini(exp. ≥ 1) = 12/25
gini_exp.(exp. < 1) = (1/6) · 0 + (5/6) · (12/25) = 2/5
Split point: exp. < 2.
# objects   yes  no
exp. < 2     0    2   gini(exp. < 2) = 0
exp. ≥ 2     2    2   gini(exp. ≥ 2) = 1/2
gini_exp.(exp. < 2) = (2/6) · 0 + (4/6) · (1/2) = 1/3
Split point: exp. < 3.
# objects   yes  no
exp. < 3     1    3   gini(exp. < 3) = 3/8
exp. ≥ 3     1    1   gini(exp. ≥ 3) = 1/2
gini_exp.(exp. < 3) = (4/6) · (3/8) + (2/6) · (1/2) = 5/12
The smallest gini index for the experience attribute is 1/3, for split point exp. < 2.
Gini Index – example
gini_edu.: edu. ∈ {basic, sec., hig.}.
Split point P1 = {{hig., sec.}, {basic}}:
# objects              yes  no
edu. ∈ {hig., sec.}     2    3   gini(edu. ∈ {hig., sec.}) = 12/25
edu. = basic            0    1   gini(edu. = basic) = 0
gini_edu.(P1) = (5/6) · (12/25) + (1/6) · 0 = 2/5
Split point P2 = {{sec.}, {hig., basic}}:
# objects              yes  no
edu. ∈ {hig., basic}    2    2   gini(edu. ∈ {hig., basic}) = 1/2
edu. = sec.             0    2   gini(edu. = sec.) = 0
gini_edu.(P2) = (2/6) · 0 + (4/6) · (1/2) = 1/3
Split point P3 = {{hig.}, {basic, sec.}}:
# objects              yes  no
edu. ∈ {sec., basic}    0    3   gini(edu. ∈ {sec., basic}) = 0
edu. = hig.             2    1   gini(edu. = hig.) = 4/9
gini_edu.(P3) = (3/6) · (4/9) + (3/6) · 0 = 2/9
The minimum value of the gini index for the education attribute is 2/9, for the split point P3 = {{hig.}, {basic, sec.}}.
This is smaller than the minimum value (1/3) obtained for the experience attribute.
Overall, the best split point is therefore P3 for the education attribute.
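The whole comparison can be verified in a few lines (the row encoding and helper names are my own; here `gini` takes the list of class labels directly):

```python
def gini(labels):
    # gini(D) = 1 - sum p_j^2, computed from a list of class labels
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

# The six applicants: (experience, education, job)
rows = [(3, "higher", "yes"), (2, "secondary", "no"), (1, "higher", "no"),
        (2, "higher", "yes"), (0, "secondary", "no"), (3, "basic", "no")]
jobs = [j for _, _, j in rows]

def gini_subset(selector):
    """Weighted gini of the binary split induced by a row predicate."""
    d1 = [j for r, j in zip(rows, jobs) if selector(r)]
    d2 = [j for r, j in zip(rows, jobs) if not selector(r)]
    return len(d1) / len(rows) * gini(d1) + len(d2) / len(rows) * gini(d2)

# best experience split among exp. < 1, exp. < 2, exp. < 3
best_exp = min((gini_subset(lambda r, t=t: r[0] < t), t) for t in (1, 2, 3))
# education split point P3 = {{higher}, {basic, secondary}}
best_edu = gini_subset(lambda r: r[1] == "higher")
# education's P3 split (2/9) beats experience's best split (1/3)
```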
Gini Index – example
Figure: Decision Tree after used split point P3
gini_exp.: exp. ∈ {1, 2, 3}.
Split point: exp. < 2.
gini(exp. < 2) = 0, gini(exp. ≥ 2) = 0
gini_exp.(exp. < 2) = (1/3) · 0 + (2/3) · 0 = 0
Split point: exp. < 3.
gini(exp. < 3) = 1/2, gini(exp. ≥ 3) = 0
gini_exp.(exp. < 3) = (2/3) · (1/2) + (1/3) · 0 = 1/3
Gini Index – example
Figure: Decision tree - final version
Many branches of the decision tree will reflect anomalies in the training data due to noise or outliers
Poor accuracy for unseen samples
Solution: Pruning
– Remove the least reliable branches
Tree Pruning Approaches
Prepruning
– Halt tree construction early: do not split a node if this would result in the goodness measure falling below a threshold.
– Upon halting, the node becomes a leaf; the leaf may hold the most frequent class among the subset tuples.
– It is difficult to choose an appropriate threshold.
Postpruning
– Remove branches from a fully grown tree, obtaining a sequence of progressively pruned trees.
– A subtree at a given node is pruned by replacing it with a leaf; the leaf is labeled with the most frequent class.
– Use a set of data different from the training data to decide which is the best pruned tree.
– Example: the cost complexity pruning algorithm.
The cost complexity of a tree is a function of the number of leaves and the error rate (the percentage of tuples misclassified by the tree). At each node N, compute:
– the cost complexity of the subtree at N,
– the cost complexity of the subtree at N if it were to be pruned.
If pruning results in a smaller cost, then prune the subtree at N. Use a set of data different from the training data to decide which is the best pruned tree.
Scalability and Decision Tree Induction
Scalability: Classifying data sets with millions of examples and hundreds of attributes with reasonable speed.
Scalable Decision Tree Induction Methods
SLIQ (EDBT'96 – Mehta et al.) – builds an index for each attribute; only the class list and the current attribute list reside in memory
SPRINT (VLDB'96 – J. Shafer et al.) – constructs an attribute-list data structure
PUBLIC (VLDB'98 – Rastogi & Shim) – integrates tree splitting and tree pruning: stops growing the tree earlier
RainForest (VLDB'98 – Gehrke, Ramakrishnan & Ganti) – builds an AVC-list (attribute, value, class label)
BOAT (PODS'99 – Gehrke, Ganti, Ramakrishnan & Loh) – uses bootstrapping to create several small samples
– Decision trees have relatively faster learning speed than other methods.
– They are convertible to simple and easy-to-understand classification rules.
– Information Gain, Gain Ratio, and the Gini Index are the most common methods of attribute selection.
– Tree pruning is necessary to remove unreliable branches.
– Scalability is an issue for large datasets.