Data Mining Methods as an Artificial Intelligence Tool
Agnieszka Nowak-Brzezinska
Decision Trees, k-Nearest Neighbors, and Basket Analysis
Lecture 4
BASKET ANALYSIS
• Data mining (the advanced analysis step of the
"Knowledge Discovery in Databases" process, or KDD), an interdisciplinary subfield of computer science, is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems.
• The overall goal of the data mining process is to extract
information from a data set and transform it into an
understandable structure for further use.
What Is Frequent Pattern Analysis?
• Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
• First proposed by Agrawal, Imielinski, and Swami in the context of frequent itemsets and association rule mining
• Motivation: Finding inherent regularities in data
– What products were often purchased together?— Beer and diapers?!
– What are the subsequent purchases after buying a PC?
– What kinds of DNA are sensitive to this new drug?
– Can we automatically classify web documents?
• Applications
– Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis.
Support and Confidence - Example
• What is the support and confidence of the following rules?
• {Beer} → {Bread}
• {Bread, PeanutButter} → {Jelly} ?
support(X → Y) = support(X ∪ Y)
confidence(X → Y) = support(X ∪ Y) / support(X)
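A minimal sketch of these two formulas in code (Python). The transaction list is assumed purely for illustration, since the original table for the Beer/Bread/PeanutButter example is not reproduced in these notes:

# Hypothetical basket data, for illustration only
transactions = [
    {"Bread", "Jelly", "PeanutButter"},
    {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item of the itemset
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y, transactions):
    # support(X u Y) / support(X)
    return support(set(X) | set(Y), transactions) / support(X, transactions)

print(support({"Beer", "Bread"}, transactions))       # support of {Beer} -> {Bread}
print(confidence({"Beer"}, {"Bread"}, transactions))  # confidence of {Beer} -> {Bread}
print(confidence({"Bread", "PeanutButter"}, {"Jelly"}, transactions))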
Association Rule Mining Problem Definition
• Given a set of transactions T = {t1, t2, …, tn} and two thresholds, minsup and minconf,
• Find all association rules X → Y with support ≥ minsup and confidence ≥ minconf
• i.e., we want rules with high confidence and support
• We call these rules interesting
• We would like to
• Design an efficient algorithm for mining association rules in large data sets
• Develop an effective approach for distinguishing interesting rules from spurious ones
Basic Concepts: Frequent Patterns
• Itemset: a set of one or more items
• k-itemset: X = {x1, …, xk}
• (absolute) support, or, support count of X: Frequency or
occurrence of an itemset X
• (relative) support, s, is the fraction of transactions that contains X (i.e., the probability that a transaction contains X)
• An itemset X is frequent if X’s support is no less than a minsup threshold
[Figure: Venn diagram of customers buying beer, customers buying diapers, and customers buying both]
Transaction database:
Tid | Items bought
10 | Beer, Nuts, Diaper
20 | Beer, Coffee, Diaper
30 | Beer, Diaper, Eggs
40 | Nuts, Eggs, Milk
50 | Nuts, Coffee, Diaper, Eggs, Milk
Basic Concepts: Association Rules
• Find all the rules X → Y with minimum support and confidence
– support, s: probability that a transaction contains X ∪ Y
– confidence, c: conditional probability that a transaction containing X also contains Y
Let minsup = 50%, minconf = 50%, using the transaction table above.
Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
Association rules (many more exist!):
Beer → Diaper (support 60%, confidence 100%)
Diaper → Beer (support 60%, confidence 75%)
Measures of Predictive Ability
Support refers to the percentage of baskets in which the rule was true (both the left- and right-hand-side products were present).
Confidence measures what percentage of baskets that contained the left-hand-side product also contained the right-hand-side product.
Lift measures how many times Confidence is larger
than the expected (baseline) Confidence. A lift
value that is greater than 1 is desirable.
Support and Confidence: An Illustration
Transactions: {A, B, C}, {A, C, D}, {B, C, D}, {A, D, E}, {B, C, E}

Rule      | Support | Confidence | Lift
A → D     | 2/5     | 2/3        | 1.11
C → A     | 2/5     | 2/4        | 0.83
A → C     | 2/5     | 2/3        | 0.83
B & C → D | 1/5     | 1/3        | 0.56
Problem Decomposition
1. Find all sets of items that have minimum support (frequent itemsets)
2. Use the frequent itemsets to generate the
desired rules
Problem Decomposition – Example
Transaction ID Items Bought
1 Shoes, Shirt, Jacket
2 Shoes, Jacket
3 Shoes, Jeans
4 Shirt, Sweatshirt
For min support = 50% = 2 trans, and min confidence = 50%
Frequent Itemset Support
{Shoes} 75%
{Shirt} 50%
{Jacket} 50%
{Shoes, Jacket} 50%
For the rule Shoes → Jacket
• Support = sup({Shoes, Jacket}) = 50%
• Confidence = sup({Shoes, Jacket}) / sup({Shoes}) = 50% / 75% = 66.6%
Jacket → Shoes has 50% support and 100% confidence
The Apriori Algorithm — Example
Database D (min support = 50% = 2 transactions):
TID 100: 1 3 4
TID 200: 2 3 5
TID 300: 1 2 3 5
TID 400: 2 5

Scan D → C1 (candidate 1-itemsets with support counts):
{1}: 2, {2}: 3, {3}: 3, {4}: 1, {5}: 3

L1 (frequent 1-itemsets): {1}: 2, {2}: 3, {3}: 3, {5}: 3

C2 (candidate 2-itemsets): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D → C2 with support counts:
{1 2}: 1, {1 3}: 2, {1 5}: 1, {2 3}: 2, {2 5}: 3, {3 5}: 2

L2 (frequent 2-itemsets): {1 3}: 2, {2 3}: 2, {2 5}: 3, {3 5}: 2

C3 (candidate 3-itemsets): {2 3 5}

Scan D → {2 3 5}: 2

L3 (frequent 3-itemsets): {2 3 5}: 2
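A short sketch of the level-wise Apriori search on the database D above. The candidate generation here simply joins pairs of frequent (k−1)-itemsets and prunes by the Apriori property; it is illustrative rather than an optimized implementation:

from itertools import combinations

# Database D from the example; minimum (absolute) support = 2 transactions
D = [
    {1, 3, 4},     # TID 100
    {2, 3, 5},     # TID 200
    {1, 2, 3, 5},  # TID 300
    {2, 5},        # TID 400
]
min_support = 2

def support_count(itemset):
    return sum(itemset <= t for t in D)

# L1: frequent 1-itemsets
items = sorted({i for t in D for i in t})
Lk = [frozenset([i]) for i in items if support_count(frozenset([i])) >= min_support]
frequent = {s: support_count(s) for s in Lk}

k = 2
while Lk:
    # Generate candidate k-itemsets by joining frequent (k-1)-itemsets
    candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
    # Prune candidates with an infrequent (k-1)-subset, then check support
    Lk = [c for c in candidates
          if all(frozenset(s) in frequent for s in combinations(c, k - 1))
          and support_count(c) >= min_support]
    for c in Lk:
        frequent[c] = support_count(c)
    k += 1

for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)   # reproduces L1, L2 and L3 above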
KNN
Examples of Classification Task
• Predicting tumor cells as benign or malignant
• Classifying credit card transactions as legitimate or fraudulent
• Classifying secondary structures of proteins as alpha-helix, beta-sheet, or random coil
• Categorizing news stories as finance, weather, entertainment, sports, etc.
KNN - Definition
KNN is a simple algorithm that stores all available cases and classifies new
cases based on a similarity measure
KNN – different names
• K-Nearest Neighbors
• Memory-Based Reasoning
• Example-Based Reasoning
• Instance-Based Learning
• Case-Based Reasoning
• Lazy Learning
KNN – Short History
• Nearest neighbors have been used in statistical estimation and pattern recognition since the beginning of the 1970s (as non-parametric techniques).
• People reason by remembering and learn by doing.
• Thinking is reminding, making analogies.
• The k-Nearest Neighbors (kNN) method provides a simple approach to calculating predictions for unknown observations.
• It calculates a prediction by looking at similar observations and uses some function of their response values to make the prediction, such as an average.
• Like all prediction methods, it starts with a training set, but instead of producing a mathematical model it determines the optimal number of similar observations to use in making the prediction.
• During the learning phase, the best number of similar
observations is chosen (k).
+/- of kNN
+:
• Noise: kNN is relatively insensitive to errors or outliers in the data.
• Large sets: kNN can be used with large training sets.
-:
• Speed: kNN can be computationally slow when it
is applied to a new data set since a similar score
must be generated between the observations
presented to the model and every member of the
training set.
• A kNN model uses the k most similar neighbors to the observation to calculate a prediction.
• Where a response variable is continuous, the prediction is the mean of the nearest neighbors.
• Where a response variable is categorical, the
prediction could be presented as a mean or a
voting scheme could be used, that is, select the
most common classification term.
http://people.revoledu.com/kardi/tutorial/KNN/index.html
K Nearest Neighbor (KNN):
• Training set includes classes.
• Examine K items near item to be classified.
• New item placed in class with the most number of close items.
• O(q) for each tuple to be classified. (Here q
is the size of the training set.)
KNN
The test sample (green circle) should be classified either to the first class of blue squares or to the second class of red triangles.
If k = 3 it is assigned to the second class because there are 2 triangles and only 1 square inside the inner circle.
If k = 5 it is assigned to the first class (3 squares vs. 2 triangles inside the outer circle).
Assumptions:
• We have a training set of observations in which each element belongs to one of a given set of classes (Y).
• We have a new observation whose class is unknown, and we want to find it using the kNN algorithm.
K-nearest neighbor algorithm
To calculate the distance from A(2,3) to B(7,8):
D(A,B) = sqrt((7-2)² + (8-3)²) = sqrt(25 + 25) = sqrt(50) = 7.07

[Figure: points A and B plotted in the plane]

• If we have 3 points A(2,3), B(7,8) and C(5,1):
• D(A,B) = sqrt((7-2)² + (8-3)²) = sqrt(25 + 25) = sqrt(50) = 7.07
• D(A,C) = sqrt((5-2)² + (3-1)²) = sqrt(9 + 4) = sqrt(13) = 3.60
• D(B,C) = sqrt((7-5)² + (8-1)²) = sqrt(4 + 49) = sqrt(53) = 7.28

[Figure: points A, B and C plotted in the plane]
K-NN
• Step 1: find k nearest neighbors for a given object
• Step 2: choose the class from the neighbors (choose the class which is more frequent)
[Figure: the same new case classified with k = 3 and with k = 5]
What if we have more dimensions ?
V1 V2 V3 V4 V5
A 0.7 0.8 0.4 0.5 0.2
B 0.6 0.8 0.5 0.4 0.2
C 0.8 0.9 0.7 0.8 0.9
D(A,B) = sqrt((0.7-0.6)² + (0.8-0.8)² + (0.4-0.5)² + (0.5-0.4)² + (0.2-0.2)²) = sqrt(0.01 + 0 + 0.01 + 0.01 + 0) = sqrt(0.03) = 0.17
D(A,C) = sqrt((0.7-0.8)² + (0.8-0.9)² + (0.4-0.7)² + (0.5-0.8)² + (0.2-0.9)²) = sqrt(0.01 + 0.01 + 0.09 + 0.09 + 0.49) = sqrt(0.69) = 0.83
D(B,C) = sqrt((0.6-0.8)² + (0.8-0.9)² + (0.5-0.7)² + (0.4-0.8)² + (0.2-0.9)²) = sqrt(0.04 + 0.01 + 0.04 + 0.16 + 0.49) = sqrt(0.74) = 0.86
We are looking for the smallest distance: the most similar pair is A and B.
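The same calculation expressed as a small sketch in Python (the numbers are the A, B, C rows from the table above); the Euclidean distance extends to any number of dimensions:

from math import sqrt

A = [0.7, 0.8, 0.4, 0.5, 0.2]
B = [0.6, 0.8, 0.5, 0.4, 0.2]
C = [0.8, 0.9, 0.7, 0.8, 0.9]

def euclidean(p, q):
    # Works for any number of variables V1..Vn
    return sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

print(round(euclidean(A, B), 2))  # 0.17 -> A and B are the most similar pair
print(round(euclidean(A, C), 2))  # 0.83
print(round(euclidean(B, C), 2))  # 0.86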
SSE
• To assess the different values of k, the sum of squares of error (SSE) evaluation criterion is used:
SSE = Σ (yi − ŷi)²
• Smaller SSE values indicate that the predictions are closer to the actual values; the SSE criterion is used to assess the quality of each model.
• The Euclidean distance was selected to represent the distance between observations. To find an optimal value for k, different values of k between 2 and 20 were tried (see the sketch below).
• In this example, the value of k with the lowest SSE is 6, and this value is selected for use with the kNN model.
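A hedged sketch of how such a table of SSE values can be produced: for every candidate k, each training observation is predicted from its k nearest neighbours (leaving the observation itself out) and the squared errors are summed. The small data set and helper names below are illustrative only, not the original cars data:

from math import sqrt

# Illustrative data: (descriptor vector, continuous response)
data = [
    ([0.20, 0.10], 10.0), ([0.25, 0.15], 11.0), ([0.80, 0.90], 30.0),
    ([0.75, 0.85], 29.0), ([0.50, 0.50], 20.0), ([0.55, 0.45], 21.0),
    ([0.30, 0.20], 12.0), ([0.70, 0.80], 28.0),
]

def dist(p, q):
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_predict(x, training, k):
    # Mean response of the k nearest training observations
    neighbours = sorted(training, key=lambda obs: dist(x, obs[0]))[:k]
    return sum(y for _, y in neighbours) / k

def sse(k):
    # Leave-one-out sum of squares of error for a given k
    total = 0.0
    for i, (x, y) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        total += (y - knn_predict(x, rest, k)) ** 2
    return total

for k in range(2, 8):
    print(k, round(sse(k), 2))  # choose the k with the smallest SSE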
Observation to be predicted
• To illustrate, a data set of cars will be used and a model built to predict car fuel efficiency (MPG).
• The following variables will be used as descriptors within the model: Cylinders, Displacement, Horsepower, Weight, Acceleration, Model Year and Origin.
Predicting
• Once a value for k has been set in the training phase, the model can now be used to make predictions.
• For example, an observation x has values for the descriptor variables but not for the response. Using the same technique for determining similarity as used in the model building phase, observation x is compared against all observations in the training set.
• A distance is computed between x and each training set observation. The closest k observations are selected and a prediction is made, for example, using the average value
The observation (Dodge Aspen) was presented to the kNN model built to predict car fuel efficiency (MPG). The Dodge Aspen observation was compared to all observations in the training set and a Euclidean distance was computed.
The six observations with the smallest distance scores are selected, as shown in Table. The prediction is the average of these top six observations, that is, 19.5.
• The cross-validated prediction is shown alongside the actual value.
Nearest Neighbor Classification
• Input:
– a set of stored records
– k: the number of nearest neighbors
• Output: the class label of the unknown record
• Method:
– compute the distance between the unknown record and each stored record: d(p, q) = sqrt(Σi (pi − qi)²)
– identify the k nearest neighbors
– determine the class label of the unknown record based on the class labels of the nearest neighbors (i.e., by taking a majority vote)
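A minimal sketch of this procedure (distance to every stored record, then a majority vote over the k closest). The records and labels below are assumed for illustration:

from math import sqrt
from collections import Counter

# Stored records: (feature vector, class label) - assumed example values
training = [
    ([1.0, 1.2], "A"), ([0.9, 1.0], "A"), ([1.1, 0.8], "A"),
    ([3.0, 3.2], "B"), ([3.1, 2.9], "B"), ([2.8, 3.0], "B"),
]

def euclidean(p, q):
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_classify(x, training, k=3):
    # 1. Compute the distance from x to every stored record
    # 2. Identify the k nearest neighbours
    neighbours = sorted(training, key=lambda rec: euclidean(x, rec[0]))[:k]
    # 3. Majority vote over the neighbours' class labels
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

print(knn_classify([1.0, 1.1], training, k=3))  # "A"
print(knn_classify([3.0, 3.0], training, k=5))  # "B"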
K Nearest Neighbors
• K Nearest Neighbors
– Advantage
• Simple
• Powerful
• Requires no training time
– Disadvantage
• Memory intensive
• Classification/estimation is slow
KNN = k nearest neighbors
[Figure: scatter plot of Gene 1 vs. Gene 2 with an unclassified point marked "?"]
KNN is another method for classification. For each point it looks at its k nearest neighbors.
If red = brain tumor and yellow healthy – do I have a brain tumor?
For each point it looks at its k nearest neighbors. For example, with k = 3 the method looks at a point's 3 nearest neighbors to decide how to classify it. If the majority are "Red" it will classify the point as red.
If red = brain tumor and yellow healthy – do I have a brain tumor?
KNN = k nearest neighbors
In the above example – how will the point be classified in KNN with K=1?
KNN - exercise
KNN Classification
[Figure: scatter plot of Loan ($0–$250,000) vs. Age (0–70), with Non-Default and Default classes]
KNN Classification – Distance
Age Loan Default Distance
25 $40,000 N 102000
35 $60,000 N 82000
45 $80,000 N 62000
20 $20,000 N 122000
35 $120,000 N 22000
52 $18,000 N 124000
23 $95,000 Y 47000
40 $62,000 Y 80000
60 $100,000 Y 42000
48 $220,000 Y 78000
33 $150,000 Y 8000
48 $142,000 ?
D = sqrt((x1 − x2)² + (y1 − y2)²)
KNN Classification – Standardized Distance
Age Loan Default Distance
0.125 0.11 N 0.7652
0.375 0.21 N 0.5200
0.625 0.31 N 0.3160
0 0.01 N 0.9245
0.375 0.50 N 0.3428
0.8 0.00 N 0.6220
0.075 0.38 Y 0.6669
0.5 0.22 Y 0.4437
1 0.41 Y 0.3650
0.7 1.00 Y 0.3861
0.325 0.65 Y 0.3771
0.7 0.61 ?
Standardized value: Xs = (X − Min) / (Max − Min)
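The standardized values in the table above can be reproduced with a short min–max rescaling sketch; without it, the raw Loan values (tens of thousands) would completely dominate the raw Age values in the distance:

# Min-max standardization: Xs = (X - Min) / (Max - Min)
ages  = [25, 35, 45, 20, 35, 52, 23, 40, 60, 48, 33]
loans = [40000, 60000, 80000, 20000, 120000, 18000, 95000, 62000, 100000, 220000, 150000]

def standardize(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print([round(a, 3) for a in standardize(ages)])   # 0.125, 0.375, 0.625, 0.0, ...
print([round(l, 2) for l in standardize(loans)])  # 0.11, 0.21, 0.31, 0.01, ...

# The unknown case (Age 48, Loan $142,000) is rescaled with the same Min/Max:
print(round((48 - min(ages)) / (max(ages) - min(ages)), 2))         # 0.7
print(round((142000 - min(loans)) / (max(loans) - min(loans)), 2))  # 0.61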
KNN Regression - Distance
Age Loan House Price Index Distance
25 $40,000 135 102000
35 $60,000 256 82000
45 $80,000 231 62000
20 $20,000 267 122000
35 $120,000 139 22000
52 $18,000 150 124000
23 $95,000 127 47000
40 $62,000 216 80000
60 $100,000 139 42000
48 $220,000 250 78000
33 $150,000 264 8000
48 $142,000 ?
D = sqrt((x1 − x2)² + (y1 − y2)²)
KNN Regression – Standardized Distance
Age Loan House Price Index Distance
0.125 0.11 135 0.7652
0.375 0.21 256 0.5200
0.625 0.31 231 0.3160
0 0.01 267 0.9245
0.375 0.50 139 0.3428
0.8 0.00 150 0.6220
0.075 0.38 127 0.6669
0.5 0.22 216 0.4437
1 0.41 139 0.3650
0.7 1.00 250 0.3861
0.325 0.65 264 0.3771
0.7 0.61 ?
Standardized value: Xs = (X − Min) / (Max − Min)
KNN – Number of Neighbors
• If K=1, select the nearest neighbor
• If K>1,
– For classification select the most frequent neighbor.
– For regression calculate the average of K
neighbors.
Distance – Categorical Variables
D(x, y) = 0 if x = y, and D(x, y) = 1 if x ≠ y
X Y Distance
Male Male 0
Male Female 1
DECISION TREES
Decision trees
Example of a Decision Tree
Training Data:
Tid | Refund | Marital Status | Taxable Income | Cheat
1 | Yes | Single | 125K | No
2 | No | Married | 100K | No
3 | No | Single | 70K | No
4 | Yes | Married | 120K | No
5 | No | Divorced | 95K | Yes
6 | No | Married | 60K | No
7 | Yes | Divorced | 220K | No
8 | No | Single | 85K | Yes
9 | No | Married | 75K | No
10 | No | Single | 90K | Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)
Refund = Yes → NO
Refund = No → MarSt
  MarSt = Married → NO
  MarSt = Single, Divorced → TaxInc
    TaxInc < 80K → NO
    TaxInc > 80K → YES
Another Example of Decision Tree
The same training data (Tid, Refund, Marital Status, Taxable Income, Cheat) as above.

Model: Decision Tree
MarSt = Married → NO
MarSt = Single, Divorced → Refund
  Refund = Yes → NO
  Refund = No → TaxInc
    TaxInc < 80K → NO
    TaxInc > 80K → YES
There could be more than one tree that fits the same data!
Decision Tree Classification Task
Training Set:
Tid | Attrib1 | Attrib2 | Attrib3 | Class
1 | Yes | Large | 125K | No
2 | No | Medium | 100K | No
3 | No | Small | 70K | No
4 | Yes | Medium | 120K | No
5 | No | Large | 95K | Yes
6 | No | Medium | 60K | No
7 | Yes | Large | 220K | No
8 | No | Small | 85K | Yes
9 | No | Medium | 75K | No
10 | No | Small | 90K | Yes

Induction: the Tree Induction algorithm learns a Model (a Decision Tree) from the Training Set.

Test Set:
Tid | Attrib1 | Attrib2 | Attrib3 | Class
11 | No | Small | 55K | ?
12 | Yes | Medium | 80K | ?
13 | Yes | Large | 110K | ?
14 | No | Small | 95K | ?
15 | No | Large | 67K | ?

Deduction: the learned Model is applied to the Test Set to predict the missing Class values.
Apply Model to Test Data
Test Data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Start from the root of the tree and follow the branches that match the test record:
• Refund = No → take the "No" branch to MarSt
• Marital Status = Married → take the "Married" branch to the leaf NO
• Assign Cheat to "No"
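The traversal above can also be written as a few lines of code. This is only a sketch of the tree shown in the slides; the function name is illustrative:

def classify_cheat(refund, marital_status, taxable_income):
    # Decision tree: Refund -> MarSt -> TaxInc
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Single or Divorced
    return "No" if taxable_income < 80_000 else "Yes"

print(classify_cheat("No", "Married", 80_000))  # -> "No", as in the walkthrough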
Decision tree – example:
A small tree with a node "weather" (edges: rainy, sunny) and a node "distance < 20 km" (edges: yes, no), leading to leaves.
Concepts: root, inner node, leaf, edges
Decision tree construction
[Figure: two-dimensional training data (attributes y1, y2) with objects of classes 1 and 2, partitioned by threshold values a1, a2, a3, and the corresponding decision trees with tests such as y2 < a1, y1 < a2, y1 < a3]

Partition for a node on attribute yi:
1. Quantitative data: comparison with some threshold value (a two-way yes/no split, e.g. yi > threshold).
2. Qualitative data: each possible value has to be used (one branch per value yi1, yi2, …, yik).
The partition of a node for qualitative data:
1. For each attribute yi, calculate the value of some given measure.
2. Choose the attribute which is optimal in the sense of the chosen measure.
3. From the given node, create a number of edges equal to the number of values of attribute yi (values yi1, yi2, …, yik leading to child nodes t1, t2, …, tk).

• Decision trees are often generated by hand to precisely and consistently define a decision-making process.
• However, they can also be generated automatically from the data.
• They consist of a series of decision points
based on certain variables
Splitting Criteria -Dividing Observations
• It is common for the split at each level to be a two-way split.
• There are methods that split more than two ways.
• However, care should be taken using these
methods since splitting the set in many ways
early in the construction of the tree may result in
missing interesting relationships that become
exposed as the tree growing process continues.
Any variable type can be split using a two-way split:
• Dichotomous: Variables with two values are the most straightforward to split since each branch represents a specific value. For example, a variable Temperature may have only two values, hot and cold. Observations will be split based on those with hot and those with cold temperature values.
• Nominal: Since nominal values are discrete values with no order, a two-way split is accomplished with one subset comprising the observations that equal a certain value and the other subset comprising those that do not equal that value. For example, a variable Color that can take the values red, green, blue, and black may be split two ways: observations with Color equal to red form one subset, and those not equal to red (i.e., green, blue, and black) form the other.
Ordinal: In the case where a variable's discrete values are ordered, the resulting subsets may be made up of more than one value, as long as the ordering is retained. For example, a variable Quality with possible values low, medium, high, and excellent may be split two ways in three possible ways. For example, observations equaling low or medium in one subset and observations equaling high or excellent in the other subset.
Another example is where low values are in one set and medium, high, and excellent values are in the other set.
Continuous: For variables with continuous values to be split two-ways, a specific cutoff value needs to be determined, where on one side of the split are values less than the cutoff and on the other side of the split are values greater than or equal to the cutoff. For example, a variable Weight which can take any value between 0 and 1,000 with a selected cutoff of 200. The first subset would be those observations where the Weight is below 200 and the other subset would be those observations where the Weight is greater than or equal to 200.
A splitting criterion has two components:
• (1) the variable to split on and
• (2) values of the variable to split on.
To determine the best split, all possible splits of all variables must be considered. Since it is necessary to rank the splits, a score should be calculated for each split.
There are many ways to rank the split.
The following describes two approaches for prioritizing splits, based on whether the response is categorical or continuous.
• The objective for an optimal split is to create subsets that result in observations with a single response value. In this example, there are 20 observations prior to splitting.
• The response variable (Temperature) has two
possible values, hot and cold. Prior to the split,
the response has an even distribution with the
number of observations where the Temperature
equals hot is ten and with the number of
observations where the Temperature equals cold
is also ten.
• Different criteria are considered for splitting these observations which results in different distributions of the response variables for each subset (N2 and N3):
• Split a: Each subset contains ten observations. All ten observations in N2 have hot temperature values, whereas the ten observations in node N3 are all cold.
• Split b: Again each subset (N2 and N3) contains ten observations.
However, in this example there is an even distribution of hot and cold values in each subset.
• Split c: In this case the splitting criterion results in two subsets where node N2 has nine observations (one hot and eight cold) and node N3 has 11 observations (nine hot and two cold).
• Split a is the best split since each node contains observations where the response is one or the other category.
• Split b results in the same even split of hot and cold values (50%
hot, 50% cold) in each of the resulting nodes (N2 and N3) and would not be considered a good split.
• Split c is a good split; however, this split is not as clean as split a since there are values of both hot and cold in both subsets.
• The proportion of hot and cold values is biased, in node N2 towards cold values and in N3 towards hot values. When determining the best splitting criteria, it is important to determine how clean each split is, based on the proportion of the different categories of the response variable (or impurity).
• S is a sample of training examples
• p is the proportion of positive examples in S
• Entropy measures the impurity of S
• Entropy(S) = −p·log₂(p) − (1−p)·log₂(1−p)
misclassification, Gini, and entropy
• There are three primary methods for calculating impurity:
misclassification, Gini, and entropy.
• In scenario 1, all ten observations have value cold whereas in scenario 2, one observation has value hot and nine observations have value cold.
• For each scenario, an entropy score is calculated.
• Cleaner splits result in lower scores.
• In scenario 1 and scenario 11, the split cleanly breaks the
set into observations with only one value. The score for
these scenarios is 0. In scenario 6, the observations are split
evenly across the two values and this is reflected in a score
of 1. In other cases, the score reflects how well the two
values are split.
• In order to determine the best split, we now need to calculate a ranking based on how cleanly each split separates the response data.
• This is calculated on the basis of the impurity before and after the split.
• The formula for this calculation, Gain, is shown below:
Gain = Entropy(parent) − Σ (j = 1..k) [N(vj) / N] · Entropy(vj)
where:
• N is the number of observations in the parent node,
• k is the number of possible resulting (child) nodes,
• N(vj) is the number of observations for each of the j child nodes,
• vj is the set of observations for the jth node.
• It should be noted that the Gain formula can be
used with other impurity methods by replacing
the entropy calculation.
ID3 Algorithm
• The ID3 algorithm is considered a very simple decision tree algorithm (Quinlan, 1986).
• ID3 uses information gain as splitting criteria.
• The growing stops when all instances belong to a single value of target feature or when best information gain is not greater than zero.
• ID3 does not apply any pruning procedures nor
does it handle numeric attributes or missing
values.
C4.5 Algorithm
• C4.5 is an evolution of ID3, presented by the same author (Quinlan, 1993).
• It uses gain ratio as splitting criteria.
• The splitting ceases when the number of instances to be split is below a certain threshold.
• Error–based pruning is performed after the growing phase. C4.5 can handle numeric attributes.
• It can induce from a training set that incorporates
missing values by using corrected gain ratio
criteria as presented above.
Example: Decision Tree for PlayTennis
Example: Data for PlayTennis
Decision Tree for PlayTennis
3.4 The Basic Decision Tree Learning Algorithm
• Main loop:
1. A ← the "best" decision attribute for the next node
2. Assign A as the decision attribute for the node
3. For each value of A, create a new descendant of the node
4. Sort the training examples to the leaf nodes
5. If the training examples are perfectly classified, then STOP; else iterate over the new leaf nodes
• Which attribute is best?
Entropy
S is a sample of training examples
p⊕ is the proportion of positive examples in S
p⊖ is the proportion of negative examples in S
Entropy measures the impurity of S
Entropy(S) ≡ −p⊕ log₂ p⊕ − p⊖ log₂ p⊖
Information Gain
Gain(S, A) = expected reduction in entropy due to sorting on A:
Gain(S, A) ≡ Entropy(S) − Σ (v ∈ Values(A)) [|Sv| / |S|] · Entropy(Sv)
Training Examples
Selecting the Next Attribute(1/2)
Which attribute is the best classifier?
Selecting the Next Attribute(2/2)
Ssunny = {D1,D2,D8,D9,D11}
Gain(Ssunny, Humidity) = .970 − (3/5)·0.0 − (2/5)·0.0 = .970
Gain(Ssunny, Temperature) = .970 − (2/5)·0.0 − (2/5)·1.0 − (1/5)·0.0 = .570
Gain(Ssunny, Wind) = .970 − (2/5)·1.0 − (3/5)·.918 = .019
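These three values can be checked with a short sketch; the (positive, negative) counts per attribute value within Ssunny are taken from the standard PlayTennis table:

from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

s_sunny = [2, 3]  # Ssunny = {D1, D2, D8, D9, D11}: 2 Yes, 3 No

# (yes, no) counts for each attribute value within Ssunny
splits = {
    "Humidity":    [[0, 3], [2, 0]],          # High, Normal
    "Temperature": [[0, 2], [1, 1], [1, 0]],  # Hot, Mild, Cool
    "Wind":        [[1, 2], [1, 1]],          # Weak, Strong
}

n = sum(s_sunny)
for attr, children in splits.items():
    g = entropy(s_sunny) - sum(sum(c) / n * entropy(c) for c in children)
    print(attr, round(g, 3))  # 0.971, 0.571, 0.020 (the slides round to .970, .570, .019)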
Converting A Tree to Rules
IF (Outlook = Sunny) ∧ (Humidity = High) THEN PlayTennis = No
IF (Outlook = Sunny) ∧ (Humidity = Normal) THEN PlayTennis = Yes
….
Factors Affecting Sunburn
Name | Hair | Height | Weight | Lotion | Result
Sarah | blonde | average | light | no | positive
Dana | blonde | tall | average | yes | negative
Alex | brown | short | average | yes | negative
Annie | blonde | short | average | no | positive
Emily | red | average | heavy | no | positive
Peter | brown | tall | heavy | no | negative
John | brown | average | heavy | no | negative
Katie | blonde | short | light | yes | negative
Phase 1: From Data to Tree
Perform average entropy calculations on the complete data set for each of the four attributes:
Hair: b1 = blonde, b2 = red, b3 = brown
Average Entropy = 0.50
Height: b1 = short, b2 = average, b3 = tall
Average Entropy = 0.69
Weight: b1 = light, b2 = average, b3 = heavy
Average Entropy = 0.94
Lotion: b1 = no, b2 = yes
Average Entropy = 0.61
the attribute "hair color" is selected as the first test because it minimizes the entropy.