Piotr Paszek
Classification
Decision Tree
A Decision Tree is a directed acyclic graph (a tree), where:
– each internal node (nonleaf node) denotes a test on an attribute,
– each branch represents an outcome of the test,
– each leaf node (or terminal node) holds a class label.
The topmost node in a tree is the root node.
The tree is constructed in a top-down, recursive, divide-and-conquer manner:
– At the start, all the training examples are at the root.
– Attributes are categorical (continuous-valued attributes are discretized in advance).
– Examples are partitioned recursively based on selected attributes.
– Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain, entropy).
Conditions for stopping partitioning:
– All samples for a given node belong to the same class.
– There are no remaining attributes for further partitioning (majority voting is then employed to label the leaf).
Algorithm for Decision Tree Induction
Input: D – training set; A – attribute set;
Output: DT – decision tree.
MakeTree(D):
  Partition(D);

Partition(S):
  if all samples from S belong to the same class then return;
  forall a ∈ A do
    calculate split points for the attribute a;
  select the "best" split point;
  divide S into S1 and S2;
  Partition(S1);
  Partition(S2);
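As an illustration only, the recursive procedure above might be sketched in Python. The sample representation and the helper `impurity` (a simple misclassification count, standing in for the heuristic measures discussed later) are my own assumptions, not part of the original algorithm:

```python
from collections import Counter

def make_tree(samples):
    """samples: list of (attributes_dict, class_label) pairs."""
    return partition(samples)

def impurity(samples):
    # Misclassification count: samples outside the majority class.
    labels = [label for _, label in samples]
    return len(labels) - Counter(labels).most_common(1)[0][1]

def partition(samples):
    classes = {label for _, label in samples}
    if len(classes) == 1:                       # stopping condition: pure node
        return {"leaf": classes.pop()}
    best = None
    for attrs, _ in samples:                    # forall a in A: try split points
        for a, v in attrs.items():
            s1 = [s for s in samples if s[0][a] == v]
            s2 = [s for s in samples if s[0][a] != v]
            if not s1 or not s2:
                continue
            score = impurity(s1) + impurity(s2)
            if best is None or score < best[0]:
                best = (score, a, v, s1, s2)
    if best is None:                            # nothing splits S: majority vote
        labels = [label for _, label in samples]
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    _, a, v, s1, s2 = best
    return {"test": (a, v), "yes": partition(s1), "no": partition(s2)}
```

For example, `make_tree([({"egg": 1}, "insect"), ({"egg": 0}, "mammal")])` produces a single test on `egg` with two leaves.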
In general, all three measures return good results, but:
Information gain (ID3, C4.5):
– biased towards multivalued attributes.
Gain ratio (C4.5):
– tends to prefer unbalanced splits in which one partition is much smaller than the other.
Gini index (CART):
– biased towards multivalued attributes,
– has difficulty when the number of classes is large,
– tends to favor tests that result in equal-sized partitions and purity in both partitions.
Information gain
Entropy function (a measure of uncertainty)
The expected information needed to classify a tuple in D is given by

Info(D) = −Σ_{i=1}^{m} p_i · log2(p_i)

where
m – number of decision classes (in D),
p_i – probability that an arbitrary tuple in D belongs to class C_i (estimated by |C_{i,D}|/|D|).
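As a sanity check, the entropy formula can be computed directly; a minimal sketch (the function name `info` is my own):

```python
from math import log2

def info(counts):
    """Info(D) = -sum_i p_i * log2(p_i), given the class counts |C1|, ..., |Cm|."""
    n = sum(counts)
    # terms with a zero count contribute nothing (p log p -> 0)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)
```

For example, `info([2, 3, 1, 3])` gives about 1.891, the value used in the worked example below.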
Conditional Entropy
Information needed (after using attribute A to split D into v partitions S1, S2, . . . , Sv) to classify D:

Info_A(D) = Σ_{j=1}^{v} (|S_j|/|D|) · Info(S_j)

Info_A(D) is the expected information required to classify a tuple from D based on the partitioning by A.
The smaller the expected information (still) required, the greater the purity of the partitions. If splitting D into partitions S1, S2, . . . , Sv creates "pure" partitions (each S_i contains objects belonging to a single class), then Info_A(D) = 0.
Information Gain
Information gained by branching on attribute A:

Gain(A) = Info(D) − Info_A(D)
Gain(A) tells us how much would be gained by branching on A.
It is the expected reduction in the information requirement caused by knowing the value of A.
The attribute A with the highest information gain is chosen as the splitting attribute. This is equivalent to saying that we want to partition on the attribute A that would do the best classification, so that the amount of information still required to finish classifying the tuples is minimal.
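Gain(A) can be computed as a small sketch (the function names and the list-based interface are my own; `labels` holds the class of each tuple and `values` the corresponding values of A):

```python
from math import log2

def info(counts):
    # Info(D) = -sum p_i log2 p_i for a list of class counts
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def gain(labels, values):
    """Gain(A) = Info(D) - Info_A(D) for an attribute given as a value list."""
    def counts(ls):
        return [ls.count(c) for c in set(ls)]
    total = info(counts(labels))
    n = len(labels)
    cond = 0.0
    for v in set(values):                 # one partition per value of A
        part = [l for l, x in zip(labels, values) if x == v]
        cond += len(part) / n * info(counts(part))
    return total - cond
```

On the animal table below, splitting on egg gives a gain of about 0.918.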
Output: DT – decision tree.
MakeTree(D):
  Partition(D);

Partition(S):
  if all samples from S belong to the same class then return;
  forall a ∈ A do
    calculate Gain(a);
  select b – the attribute with the greatest information gain;
  divide S (using b) into S1, . . . , Svb;
  Partition(S1); . . . ; Partition(Svb);
An algorithm that creates a decision tree using information gain
1 We choose the attribute with the greatest information gain (the greatest reduction in entropy), which becomes the root of the decision tree.
2 For each value of the selected attribute, a branch is created in this decision tree. With each branch we associate records with the same value of the selected attribute (partition).
3 Each partition associated with a branch is then partitioned in the same way. Partitioning continues until all the records of a partition belong to the same class.
(name) Egg Environment Predator Vertebrate Type
wasp 1 land 0 0 insect
ladybug 1 land 1 0 insect
pigeon 1 land 0 1 bird
ostrich 1 land 0 1 bird
hawk 1 land 1 1 bird
catfish 1 water 1 1 fish
antelope 0 land 0 1 mammal
dolphin 0 water 1 1 mammal
lion 0 land 1 1 mammal
Information Gain – example
S = {wasp, ladybug, . . . , lion}, |S| = 9
Type ∈ {insect, bird, fish, mammal}, m = 4 (4 classes):
C1 – insect, s1 = 2; C2 – bird, s2 = 3; C3 – fish, s3 = 1; C4 – mammal, s4 = 3.
Info(2, 3, 1, 3) = −(2/9)·log2(2/9) − (1/3)·log2(1/3) − (1/9)·log2(1/9) − (1/3)·log2(1/3) = 1.891

Info_egg:
egg = 1: |C11| = 2, |C21| = 3, |C31| = 1, |C41| = 0, Info(s11, s21, s31, s41) = Info(2, 3, 1, 0) = 1.459
egg = 0: |C10| = 0, |C20| = 0, |C30| = 0, |C40| = 3, Info(s10, s20, s30, s40) = Info(0, 0, 0, 3) = 0.0
Info_egg(S) = 6/9 · Info(2, 3, 1, 0) + 3/9 · Info(0, 0, 0, 3) = 0.973
Gain(egg) = Info(S) − Info_egg(S) = 1.891 − 0.973 = 0.918
Info_environment:
environment = land: s1l = 2, s2l = 3, s3l = 0, s4l = 2, Info(s1l, s2l, s3l, s4l) = Info(2, 3, 0, 2) = 1.557
environment = water: s1w = 0, s2w = 0, s3w = 1, s4w = 1, Info(s1w, s2w, s3w, s4w) = Info(0, 0, 1, 1) = 1.0
Info_environment(S) = 7/9 · Info(2, 3, 0, 2) + 2/9 · Info(0, 0, 1, 1) = 1.433
Gain(environment) = Info(S) − Info_environment(S) = 1.891 − 1.433 = 0.458
Information Gain – example
Info_predator:
predator = 1: s11 = 1, s21 = 1, s31 = 1, s41 = 2, Info(s11, s21, s31, s41) = Info(1, 1, 1, 2) = 1.922
predator = 0: s10 = 1, s20 = 2, s30 = 0, s40 = 1, Info(s10, s20, s30, s40) = Info(1, 2, 0, 1) = 1.5
Info_predator(S) = 5/9 · Info(1, 1, 1, 2) + 4/9 · Info(1, 2, 0, 1) = 1.734
Gain(predator) = Info(S) − Info_predator(S) = 1.891 − 1.734 = 0.157
Info_vertebrate:
vertebrate = 1: s11 = 0, s21 = 3, s31 = 1, s41 = 3, Info(s11, s21, s31, s41) = Info(0, 3, 1, 3) = 1.449
vertebrate = 0: s10 = 2, s20 = 0, s30 = 0, s40 = 0, Info(s10, s20, s30, s40) = Info(2, 0, 0, 0) = 0.0
Info_vertebrate(S) = 7/9 · Info(0, 3, 1, 3) + 2/9 · Info(2, 0, 0, 0) = 1.127
Gain(vertebrate) = Info(S) − Info_vertebrate(S) = 1.891 − 1.127 = 0.764
Because the egg attribute maximizes the information gain, it is selected as the first test attribute (root node).
Information Gain – example
Figure: Decision Tree after first division
From the root (egg attribute), there are two branches corresponding to the values of this attribute.
They join the root with vertices representing partitions S0 and S1. S0 is a pure partition: it contains only mammals (items for which Type = mammal).
Partition S1 will be split further.
Information Gain – example
S1 = {wasp, ladybug, pigeon, ostrich, hawk, catfish}; |S1| = 6
A = {environment, predator, vertebrate}:
C1 – insect, s1 = 2, C2 – bird, s2 = 3, C3 – fish, s3 = 1.
Info(s1, s2, s3) = −(2/6)·log2(2/6) − (1/2)·log2(1/2) − (1/6)·log2(1/6) = 1.459

Info_environment:
environment = land: s1l = 2, s2l = 3, s3l = 0, Info(s1l, s2l, s3l) = Info(2, 3, 0) = 0.971
environment = water: s1w = 0, s2w = 0, s3w = 1, Info(s1w, s2w, s3w) = Info(0, 0, 1) = 0.0
Info_environment(S1) = 5/6 · Info(2, 3, 0) + 1/6 · Info(0, 0, 1) = 0.809
Gain(environment) = Info(S1) − Info_environment(S1) = 0.650
Info_predator:
predator = 1: s11 = 1, s21 = 1, s31 = 1, Info(s11, s21, s31) = Info(1, 1, 1) = 1.585
predator = 0: s10 = 1, s20 = 2, s30 = 0, Info(s10, s20, s30) = Info(1, 2, 0) = 0.918
Info_predator(S1) = 1/2 · Info(1, 1, 1) + 1/2 · Info(1, 2, 0) = 1.252
Gain(predator) = Info(S1) − Info_predator(S1) = 1.459 − 1.252 = 0.207
Information Gain – example
Info_vertebrate:
vertebrate = 1: s11 = 0, s21 = 3, s31 = 1, Info(s11, s21, s31) = Info(0, 3, 1) = 0.811
vertebrate = 0: s10 = 2, s20 = 0, s30 = 0, Info(s10, s20, s30) = Info(2, 0, 0) = 0
Info_vertebrate(S1) = 2/3 · Info(0, 3, 1) + 1/3 · Info(2, 0, 0) = 0.541
Gain(vertebrate) = Info(S1) − Info_vertebrate(S1) = 0.918
The vertebrate attribute maximizes the information gain for S1, so it is selected as the second test attribute.
Partitions S10 and S11 are created.
Partition S10 = {wasp, ladybug} is pure (contains only insects).
Partition S11 is divided further.
Information Gain – example
S11 ={pigeon, ostrich, hawk, catfish}, |S11| = 4, A ={environment, predator}
C1 – bird, s1 = 3, C2 – fish, s2 = 1.

Info(s1, s2) = Info(3, 1) = −(3/4)·log2(3/4) − (1/4)·log2(1/4) = 0.811

Info_environment:
environment = land: s1l = 3, s2l = 0, Info(s1l, s2l) = Info(3, 0) = 0
environment = water: s1w = 0, s2w = 1, Info(s1w, s2w) = Info(0, 1) = 0
Info_environment(S11) = 3/4 · Info(3, 0) + 1/4 · Info(0, 1) = 0
Gain(environment) = Info(S11) − Info_environment(S11) = 0.811
There is no need to calculate the information gain for the predator attribute, because environment already yields zero entropy ("pure" partitions).
So environment is the third test attribute.
It divides S11 into S11l and S11w.
Partition S11l = {pigeon, ostrich, hawk} contains only birds, and partition S11w = {catfish} contains only fish.
Information Gain – example
Figure: Decision tree - final version
Let A be a continuous-valued attribute. We must determine the best split point for A:
– Sort the values of A in increasing order.
– Typically, the midpoint between each pair of adjacent values is considered as a possible split point: (a_i + a_{i+1})/2 is the midpoint between the values a_i and a_{i+1}.
– The point with the minimum expected information requirement for A is selected as the split point for A.
Split:
D1 is the set of tuples in D satisfying A ≤ split_point, and D2 is the set of tuples in D satisfying A > split_point.
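This midpoint search can be sketched as follows (the function names are my own; `info` is the entropy function from earlier):

```python
from math import log2

def info(counts):
    # Info(D) = -sum p_i log2 p_i for a list of class counts
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def best_split_point(values, labels):
    """Try the midpoint of each adjacent pair of sorted values of A and
    return the split point minimizing the expected information Info_A(D)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(n - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue                              # no midpoint between equal values
        mid = (pairs[i][0] + pairs[i + 1][0]) / 2
        d1 = [l for v, l in pairs if v <= mid]    # D1: tuples with A <= split point
        d2 = [l for v, l in pairs if v > mid]     # D2: tuples with A >  split point
        expected = sum(len(d) / n * info([d.count(c) for c in set(d)])
                       for d in (d1, d2))
        if best is None or expected < best[0]:
            best = (expected, mid)
    return best[1] if best else None
```

On the experience/job table used in the gini example later, `best_split_point([3, 2, 1, 2, 0, 3], ["yes", "no", "no", "yes", "no", "no"])` returns 1.5, i.e. the split exp. < 2.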
Gain ratio
Problem with the Information Gain approach:
– Biased towards tests with many outcomes (attributes having a large number of values).
– E.g., an attribute acting as a unique identifier produces a large number of partitions (one tuple per partition); each resulting partition D_j is pure, so Info(D_j) = 0 and the information gain is maximized.
Extension to Information Gain
C4.5, a successor of ID3, uses an extension to information gain known as gain ratio.
It overcomes the bias of information gain by applying a kind of normalization to information gain using a "split information" value.
Split Information value
The split information value represents the potential information generated by splitting the training data set D into v partitions, corresponding to the v outcomes of a test on attribute A:

SplitInfo_A(D) = −Σ_{j=1}^{v} (|D_j|/|D|) · log2(|D_j|/|D|)
High Split Info: partitions have more or less the same size (uniform)
Low Split Info: few partitions hold most of the tuples (peaks)
Gain Ratio
The gain ratio is defined as
GainRatio(A) = Gain(A) / SplitInfo(A)
In the algorithm C4.5 the attribute with the maximum gain ratio is selected as the splitting attribute
Example
Gain(egg) = 0.918
SplitInfo(egg) = −6/9 · log2(6/9) − 3/9 · log2(3/9) = 0.918
GainRatio(egg) = Gain(egg) / SplitInfo(egg) = 1.000
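The example can be reproduced with a short sketch (names are my own; SplitInfo is just the entropy of the partition sizes):

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def gain_ratio(gain_a, partition_sizes):
    """GainRatio(A) = Gain(A) / SplitInfo(A)."""
    return gain_a / entropy(partition_sizes)

# Splitting on egg gives partitions of sizes 6 and 3:
split_info_egg = entropy([6, 3])          # about 0.918
ratio_egg = gain_ratio(0.918, [6, 3])     # about 1.0
```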
The Gini Index (used in CART) measures the impurity of a data partition D
Gini Index (gini(D))
If a data set D contains examples from n classes, the gini index is defined as

gini(D) = 1 − Σ_{j=1}^{n} p_j²

where p_j is the relative frequency of class C_j in D (estimated by |C_{j,D}|/|D|).
Gini Index
Gini Index for a binary split – gini_A(D)
If a data set D is split on A into two subsets D1 and D2, the gini index of the split is defined as

gini_A(D) = (|D1|/|D|) · gini(D1) + (|D2|/|D|) · gini(D2)
The attribute that provides the smallest gini_A(D) is chosen to split the node (this requires enumerating all possible splitting points for each attribute).
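These two definitions translate directly into code; a minimal sketch (function names are my own):

```python
def gini(counts):
    """gini(D) = 1 - sum_j p_j^2, given the class counts in D."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def gini_split(counts1, counts2):
    """gini_A(D) for a binary split of D into partitions D1 and D2."""
    n1, n2 = sum(counts1), sum(counts2)
    n = n1 + n2
    return n1 / n * gini(counts1) + n2 / n * gini(counts2)
```

For instance, `gini_split([0, 2], [2, 2])` reproduces the value 1/3 obtained for the split exp. < 2 in the example below.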
Experience Education Job
(exp.) (edu.) (job)
3 higher yes
2 secondary no
1 higher no
2 higher yes
0 secondary no
3 basic no
Experience – professional experience of the person (# years)
Education – level of education of the person
Job – has the applicant been accepted?
Gini Index – example
gini_exp.: exp. ∈ {0, 1, 2, 3}.
Split point: exp. < 1.
# objects   yes  no
exp. < 1     0    1   gini(exp. < 1) = 0
exp. ≥ 1     2    3   gini(exp. ≥ 1) = 12/25
gini_exp.(exp. < 1) = (1/6) · 0 + (5/6) · (12/25) = 2/5
Split point: exp. < 2.
# objects   yes  no
exp. < 2     0    2   gini(exp. < 2) = 0
exp. ≥ 2     2    2   gini(exp. ≥ 2) = 1/2
gini_exp.(exp. < 2) = (2/6) · 0 + (4/6) · (1/2) = 1/3
Split point: exp. < 3.
# objects   yes  no
exp. < 3     1    3   gini(exp. < 3) = 3/8
exp. ≥ 3     1    1   gini(exp. ≥ 3) = 1/2
gini_exp.(exp. < 3) = (4/6) · (3/8) + (2/6) · (1/2) = 5/12
The smallest gini index for the experience attribute is 1/3, for split point exp. < 2.
Gini Index – example
gini_edu.: edu. ∈ {basic, sec., hig.}.
Split point P1 = {{hig., sec.}, {basic}}:
# objects              yes  no
edu. ∈ {hig., sec.}     2    3   gini(edu. ∈ {hig., sec.}) = 12/25
edu. = basic            0    1   gini(edu. = basic) = 0
gini_edu.(P1) = (5/6) · (12/25) + (1/6) · 0 = 2/5
Split point P2 = {{sec.}, {hig., basic}}:
# objects              yes  no
edu. ∈ {hig., basic}    2    2   gini(edu. ∈ {hig., basic}) = 1/2
edu. = sec.             0    2   gini(edu. = sec.) = 0
gini_edu.(P2) = (2/6) · 0 + (4/6) · (1/2) = 1/3
Split point P3 = {{hig.}, {basic, sec.}}:
# objects              yes  no
edu. ∈ {sec., basic}    0    3   gini(edu. ∈ {sec., basic}) = 0
edu. = hig.             2    1   gini(edu. = hig.) = 4/9
gini_edu.(P3) = (3/6) · (4/9) + (3/6) · 0 = 2/9
The minimum value of the gini index for the education attribute is 2/9, for the split point P3 = {{hig.}, {basic, sec.}}.
This is smaller than the minimum value (1/3) obtained for the experience attribute.
Overall, the best split point is therefore P3 for the education attribute.
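The whole comparison can be verified in a few lines (the row encoding and helper names are my own; here `gini` takes the list of class labels directly):

```python
def gini(labels):
    # gini(D) = 1 - sum p_j^2, computed from a list of class labels
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

# The six applicants: (experience, education, job)
rows = [(3, "higher", "yes"), (2, "secondary", "no"), (1, "higher", "no"),
        (2, "higher", "yes"), (0, "secondary", "no"), (3, "basic", "no")]
jobs = [j for _, _, j in rows]

def gini_subset(selector):
    """Weighted gini of the binary split induced by a row predicate."""
    d1 = [j for r, j in zip(rows, jobs) if selector(r)]
    d2 = [j for r, j in zip(rows, jobs) if not selector(r)]
    return len(d1) / len(rows) * gini(d1) + len(d2) / len(rows) * gini(d2)

# best experience split among exp. < 1, exp. < 2, exp. < 3
best_exp = min((gini_subset(lambda r, t=t: r[0] < t), t) for t in (1, 2, 3))
# education split point P3 = {{higher}, {basic, secondary}}
best_edu = gini_subset(lambda r: r[1] == "higher")
# education's P3 split (2/9) beats experience's best split (1/3)
```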
Gini Index – example
Figure: Decision Tree after used split point P3
gini_exp.: exp. ∈ {1, 2, 3}.
Split point: exp. < 2.
gini(exp. < 2) = 0, gini(exp. ≥ 2) = 0
gini_exp.(exp. < 2) = (1/3) · 0 + (2/3) · 0 = 0
Split point: exp. < 3.
gini(exp. < 3) = 1/2, gini(exp. ≥ 3) = 0
gini_exp.(exp. < 3) = (2/3) · (1/2) + (1/3) · 0 = 1/3
Gini Index – example
Figure: Decision tree - final version
Many branches of the decision tree will reflect anomalies in the training data due to noise or outliers
Poor accuracy for unseen samples
Solution: Pruning
– Remove the least reliable branches
Tree Pruning Approaches
Prepruning
– Halt tree construction early: do not split a node if this would result in the goodness measure falling below a threshold.
– Upon halting, the node becomes a leaf; the leaf may hold the most frequent class among the subset tuples.
– It is difficult to choose an appropriate threshold.
Postpruning
– Remove branches from a fully grown tree, obtaining a sequence of progressively pruned trees.
– A subtree at a given node is pruned by replacing it with a leaf; the leaf is labeled with the most frequent class.
– Use a set of data different from the training data to decide which is the best pruned tree.
– Example: the cost complexity pruning algorithm.
The cost complexity of a tree is a function of the number of leaves and the error rate (the percentage of tuples misclassified by the tree). At each node N, compute:
– the cost complexity of the subtree at N,
– the cost complexity of the subtree at N if it were to be pruned.
If pruning results in a smaller cost, then prune the subtree at N. Use a set of data different from the training data to decide which is the best pruned tree.
Scalability and Decision Tree Induction
Scalability: Classifying data sets with millions of examples and hundreds of attributes with reasonable speed.
Scalable Decision Tree Induction Methods
SLIQ (EDBT'96 – Mehta et al.) – builds an index for each attribute; only the class list and the current attribute list reside in memory
SPRINT (VLDB'96 – J. Shafer et al.) – constructs an attribute-list data structure
PUBLIC (VLDB'98 – Rastogi & Shim) – integrates tree splitting and tree pruning: stops growing the tree earlier
RainForest (VLDB'98 – Gehrke, Ramakrishnan & Ganti) – builds an AVC-list (attribute, value, class label)
BOAT (PODS'99 – Gehrke, Ganti, Ramakrishnan & Loh) – uses bootstrapping to create several small samples
– Decision trees have relatively faster learning speed than other methods.
– They are convertible to simple and easy-to-understand classification rules.
– Information Gain, Gain Ratio, and the Gini Index are the most common methods of attribute selection.
– Tree pruning is necessary to remove unreliable branches.
– Scalability is an issue for large datasets.