Data Mining Methods as an Artificial Intelligence Tool
Agnieszka Nowak-Brzezinska
Decision Trees, k-Nearest Neighbors, and Basket Analysis
Lecture 4
BASKET ANALYSIS
• Data mining (the advanced analysis step of the
"Knowledge Discovery in Databases" process, or KDD), an interdisciplinary subfield of computer science, is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems.
• The overall goal of the data mining process is to extract
information from a data set and transform it into an
understandable structure for further use.
What Is Frequent Pattern Analysis?
• Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
• First proposed by Agrawal, Imielinski, and Swami in the context of frequent itemsets and association rule mining
• Motivation: Finding inherent regularities in data
– What products were often purchased together?— Beer and diapers?!
– What are the subsequent purchases after buying a PC?
– What kinds of DNA are sensitive to this new drug?
– Can we automatically classify web documents?
• Applications
– Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis.
Support and Confidence - Example
• What is the support and confidence of the following rules?
• {Beer} → {Bread}
• {Bread, PeanutButter} → {Jelly} ?
support(X → Y) = support(X ∪ Y)
confidence(X → Y) = support(X ∪ Y) / support(X)
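A minimal sketch of these two formulas in code (Python). The transaction list is assumed purely for illustration, since the original table for the Beer/Bread/PeanutButter example is not reproduced in these notes:

# Hypothetical basket data, for illustration only
transactions = [
    {"Bread", "Jelly", "PeanutButter"},
    {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item of the itemset
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y, transactions):
    # support(X u Y) / support(X)
    return support(set(X) | set(Y), transactions) / support(X, transactions)

print(support({"Beer", "Bread"}, transactions))       # support of {Beer} -> {Bread}
print(confidence({"Beer"}, {"Bread"}, transactions))  # confidence of {Beer} -> {Bread}
print(confidence({"Bread", "PeanutButter"}, {"Jelly"}, transactions))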
Association Rule Mining Problem Definition
• Given a set of transactions T = {t1, t2, …, tn} and two thresholds, minsup and minconf,
• Find all association rules X → Y with support ≥ minsup and confidence ≥ minconf
• i.e., we want rules with high confidence and support
• We call these rules interesting
• We would like to
• Design an efficient algorithm for mining association rules in large data sets
• Develop an effective approach for distinguishing interesting rules from spurious ones
Basic Concepts: Frequent Patterns
• Itemset: a set of one or more items
• k-itemset: X = {x1, …, xk}
• (absolute) support, or, support count of X: Frequency or
occurrence of an itemset X
• (relative) support, s, is the fraction of transactions that contains X (i.e., the probability that a transaction contains X)
• An itemset X is frequent if X’s support is no less than a minsup threshold
[Figure: Venn diagram of customers buying beer, customers buying diapers, and customers buying both]
Transaction database:
Tid | Items bought
10 | Beer, Nuts, Diaper
20 | Beer, Coffee, Diaper
30 | Beer, Diaper, Eggs
40 | Nuts, Eggs, Milk
50 | Nuts, Coffee, Diaper, Eggs, Milk
Basic Concepts: Association Rules
• Find all the rules X → Y with minimum support and confidence
– support, s: probability that a transaction contains X ∪ Y
– confidence, c: conditional probability that a transaction containing X also contains Y
Let minsup = 50%, minconf = 50%, using the transaction table above.
Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
Association rules (many more exist!):
Beer → Diaper (support 60%, confidence 100%)
Diaper → Beer (support 60%, confidence 75%)
Measures of Predictive Ability
Support refers to the percentage of baskets in which the rule was true (both the left- and right-hand-side products were present).
Confidence measures what percentage of baskets that contained the left-hand-side product also contained the right-hand-side product.
Lift measures how many times Confidence is larger
than the expected (baseline) Confidence. A lift
value that is greater than 1 is desirable.
Support and Confidence: An Illustration
Transactions: {A, B, C}, {A, C, D}, {B, C, D}, {A, D, E}, {B, C, E}

Rule      | Support | Confidence | Lift
A → D     | 2/5     | 2/3        | 1.11
C → A     | 2/5     | 2/4        | 0.83
A → C     | 2/5     | 2/3        | 0.83
B & C → D | 1/5     | 1/3        | 0.56
Problem Decomposition
1. Find all sets of items that have minimum support (frequent itemsets)
2. Use the frequent itemsets to generate the
desired rules
Problem Decomposition – Example
Transaction ID Items Bought
1 Shoes, Shirt, Jacket
2 Shoes, Jacket
3 Shoes, Jeans
4 Shirt, Sweatshirt
For min support = 50% = 2 trans, and min confidence = 50%
Frequent Itemset Support
{Shoes} 75%
{Shirt} 50%
{Jacket} 50%
{Shoes, Jacket} 50%
For the rule Shoes → Jacket
• Support = sup({Shoes, Jacket}) = 50%
• Confidence = sup({Shoes, Jacket}) / sup({Shoes}) = 50% / 75% = 66.6%
Jacket → Shoes has 50% support and 100% confidence
The Apriori Algorithm — Example
Database D (min support = 50% = 2 transactions):
TID 100: 1 3 4
TID 200: 2 3 5
TID 300: 1 2 3 5
TID 400: 2 5

Scan D → C1 (candidate 1-itemsets with support counts):
{1}: 2, {2}: 3, {3}: 3, {4}: 1, {5}: 3

L1 (frequent 1-itemsets): {1}: 2, {2}: 3, {3}: 3, {5}: 3

C2 (candidate 2-itemsets): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D → C2 with support counts:
{1 2}: 1, {1 3}: 2, {1 5}: 1, {2 3}: 2, {2 5}: 3, {3 5}: 2

L2 (frequent 2-itemsets): {1 3}: 2, {2 3}: 2, {2 5}: 3, {3 5}: 2

C3 (candidate 3-itemsets): {2 3 5}

Scan D → {2 3 5}: 2

L3 (frequent 3-itemsets): {2 3 5}: 2
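A short sketch of the level-wise Apriori search on the database D above. The candidate generation here simply joins pairs of frequent (k−1)-itemsets and prunes by the Apriori property; it is illustrative rather than an optimized implementation:

from itertools import combinations

# Database D from the example; minimum (absolute) support = 2 transactions
D = [
    {1, 3, 4},     # TID 100
    {2, 3, 5},     # TID 200
    {1, 2, 3, 5},  # TID 300
    {2, 5},        # TID 400
]
min_support = 2

def support_count(itemset):
    return sum(itemset <= t for t in D)

# L1: frequent 1-itemsets
items = sorted({i for t in D for i in t})
Lk = [frozenset([i]) for i in items if support_count(frozenset([i])) >= min_support]
frequent = {s: support_count(s) for s in Lk}

k = 2
while Lk:
    # Generate candidate k-itemsets by joining frequent (k-1)-itemsets
    candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
    # Prune candidates with an infrequent (k-1)-subset, then check support
    Lk = [c for c in candidates
          if all(frozenset(s) in frequent for s in combinations(c, k - 1))
          and support_count(c) >= min_support]
    for c in Lk:
        frequent[c] = support_count(c)
    k += 1

for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)   # reproduces L1, L2 and L3 above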
KNN
Examples of Classification Task
• Predicting tumor cells as benign or malignant
• Classifying credit card transactions as legitimate or fraudulent
• Classifying secondary structures of proteins as alpha-helix, beta-sheet, or random coil
• Categorizing news stories as finance, weather, entertainment, sports, etc.
KNN - Definition
KNN is a simple algorithm that stores all available cases and classifies new
cases based on a similarity measure
KNN – different names
• K-Nearest Neighbors
• Memory-Based Reasoning
• Example-Based Reasoning
• Instance-Based Learning
• Case-Based Reasoning
• Lazy Learning
KNN – Short History
• Nearest neighbors have been used in statistical estimation and pattern recognition since the beginning of the 1970s (as non-parametric techniques).
• People reason by remembering and learn by doing.
• Thinking is reminding, making analogies.
• The k-Nearest Neighbors (kNN) method provides a simple approach to calculating predictions for unknown observations.
• It calculates a prediction by looking at similar observations and uses some function of their response values to make the prediction, such as an average.
• Like all prediction methods, it starts with a training set, but instead of producing a mathematical model it determines the optimal number of similar observations to use in making the prediction.
• During the learning phase, the best number of similar
observations is chosen (k).
+/- of kNN
+:
• Noise: kNN is relatively insensitive to errors or outliers in the data.
• Large sets: kNN can be used with large training sets.
-:
• Speed: kNN can be computationally slow when it
is applied to a new data set since a similar score
must be generated between the observations
presented to the model and every member of the
training set.
• A kNN model uses the k most similar neighbors to the observation to calculate a prediction.
• Where a response variable is continuous, the prediction is the mean of the nearest neighbors.
• Where a response variable is categorical, the
prediction could be presented as a mean or a
voting scheme could be used, that is, select the
most common classification term.
http://people.revoledu.com/kardi/tutorial/KNN/index.html
K Nearest Neighbor (KNN):
• Training set includes classes.
• Examine K items near item to be classified.
• New item placed in class with the most number of close items.
• O(q) for each tuple to be classified. (Here q
is the size of the training set.)
KNN
The test sample (green circle) should be classified either to the first class of blue squares or to the second class of red triangles.
If k = 3 it is assigned to the second class because there are 2 triangles and only 1 square inside the inner circle.
If k = 5 it is assigned to the first class (3 squares vs. 2 triangles inside the outer circle).
Assumptions:
• We have a training set of observations in which each element belongs to one of a given set of classes (Y).
• We have a new observation whose class is unknown, and we want to find it using the kNN algorithm.
K-nearest neighbor algorithm
To calculate the distance from A(2,3) to B(7,8):
D(A,B) = sqrt((7-2)² + (8-3)²) = sqrt(25 + 25) = sqrt(50) = 7.07

[Figure: points A and B plotted in the plane]

• If we have 3 points A(2,3), B(7,8) and C(5,1):
• D(A,B) = sqrt((7-2)² + (8-3)²) = sqrt(25 + 25) = sqrt(50) = 7.07
• D(A,C) = sqrt((5-2)² + (3-1)²) = sqrt(9 + 4) = sqrt(13) = 3.60
• D(B,C) = sqrt((7-5)² + (8-1)²) = sqrt(4 + 49) = sqrt(53) = 7.28

[Figure: points A, B and C plotted in the plane]
K-NN
• Step 1: find k nearest neighbors for a given object
• Step 2: choose the class from the neighbors (choose the class which is more frequent)
[Figure: the same new case classified with k = 3 and with k = 5]
What if we have more dimensions ?
V1 V2 V3 V4 V5
A 0.7 0.8 0.4 0.5 0.2
B 0.6 0.8 0.5 0.4 0.2
C 0.8 0.9 0.7 0.8 0.9
D(A,B) = sqrt((0.7-0.6)² + (0.8-0.8)² + (0.4-0.5)² + (0.5-0.4)² + (0.2-0.2)²) = sqrt(0.01 + 0 + 0.01 + 0.01 + 0) = sqrt(0.03) = 0.17
D(A,C) = sqrt((0.7-0.8)² + (0.8-0.9)² + (0.4-0.7)² + (0.5-0.8)² + (0.2-0.9)²) = sqrt(0.01 + 0.01 + 0.09 + 0.09 + 0.49) = sqrt(0.69) = 0.83
D(B,C) = sqrt((0.6-0.8)² + (0.8-0.9)² + (0.5-0.7)² + (0.4-0.8)² + (0.2-0.9)²) = sqrt(0.04 + 0.01 + 0.04 + 0.16 + 0.49) = sqrt(0.74) = 0.86
We are looking for the smallest distance: the most similar pair is A and B.
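The same calculation expressed as a small sketch in Python (the numbers are the A, B, C rows from the table above); the Euclidean distance extends to any number of dimensions:

from math import sqrt

A = [0.7, 0.8, 0.4, 0.5, 0.2]
B = [0.6, 0.8, 0.5, 0.4, 0.2]
C = [0.8, 0.9, 0.7, 0.8, 0.9]

def euclidean(p, q):
    # Works for any number of variables V1..Vn
    return sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

print(round(euclidean(A, B), 2))  # 0.17 -> A and B are the most similar pair
print(round(euclidean(A, C), 2))  # 0.83
print(round(euclidean(B, C), 2))  # 0.86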
SSE
• To assess the different values of k, the sum of squares of error (SSE) evaluation criterion is used:
SSE = Σ (yi − ŷi)²
• Smaller SSE values indicate that the predictions are closer to the actual values; the SSE criterion is used to assess the quality of each model.
• The Euclidean distance was selected to represent the distance between observations. To find an optimal value for k, different values of k between 2 and 20 were tried (see the sketch below).
• In this example, the value of k with the lowest SSE is 6, and this value is selected for use with the kNN model.
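A hedged sketch of how such a table of SSE values can be produced: for every candidate k, each training observation is predicted from its k nearest neighbours (leaving the observation itself out) and the squared errors are summed. The small data set and helper names below are illustrative only, not the original cars data:

from math import sqrt

# Illustrative data: (descriptor vector, continuous response)
data = [
    ([0.20, 0.10], 10.0), ([0.25, 0.15], 11.0), ([0.80, 0.90], 30.0),
    ([0.75, 0.85], 29.0), ([0.50, 0.50], 20.0), ([0.55, 0.45], 21.0),
    ([0.30, 0.20], 12.0), ([0.70, 0.80], 28.0),
]

def dist(p, q):
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_predict(x, training, k):
    # Mean response of the k nearest training observations
    neighbours = sorted(training, key=lambda obs: dist(x, obs[0]))[:k]
    return sum(y for _, y in neighbours) / k

def sse(k):
    # Leave-one-out sum of squares of error for a given k
    total = 0.0
    for i, (x, y) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        total += (y - knn_predict(x, rest, k)) ** 2
    return total

for k in range(2, 8):
    print(k, round(sse(k), 2))  # choose the k with the smallest SSE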
Observation to be predicted
• To illustrate, a data set of cars will be used and a model built to predict car fuel efficiency (MPG).
• The following variables will be used as descriptors within the model: Cylinders, Displacement, Horsepower, Weight, Acceleration, Model Year and Origin.
Predicting
• Once a value for k has been set in the training phase, the model can now be used to make predictions.
• For example, an observation x has values for the descriptor variables but not for the response. Using the same technique for determining similarity as used in the model building phase, observation x is compared against all observations in the training set.
• A distance is computed between x and each training set observation. The closest k observations are selected and a prediction is made, for example, using the average value
The observation (Dodge Aspen) was presented to the kNN model built to predict car fuel efficiency (MPG). The Dodge Aspen observation was compared to all observations in the training set and a Euclidean distance was computed.
The six observations with the smallest distance scores are selected, as shown in Table. The prediction is the average of these top six observations, that is, 19.5.
• The cross-validated prediction is shown alongside the actual value.
Nearest Neighbor Classification
• Input:
– a set of stored records
– k: the number of nearest neighbors
• Output: the class label of the unknown record
• Method:
– compute the distance between the unknown record and each stored record: d(p, q) = sqrt(Σi (pi − qi)²)
– identify the k nearest neighbors
– determine the class label of the unknown record based on the class labels of the nearest neighbors (i.e., by taking a majority vote)
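A minimal sketch of this procedure (distance to every stored record, then a majority vote over the k closest). The records and labels below are assumed for illustration:

from math import sqrt
from collections import Counter

# Stored records: (feature vector, class label) - assumed example values
training = [
    ([1.0, 1.2], "A"), ([0.9, 1.0], "A"), ([1.1, 0.8], "A"),
    ([3.0, 3.2], "B"), ([3.1, 2.9], "B"), ([2.8, 3.0], "B"),
]

def euclidean(p, q):
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_classify(x, training, k=3):
    # 1. Compute the distance from x to every stored record
    # 2. Identify the k nearest neighbours
    neighbours = sorted(training, key=lambda rec: euclidean(x, rec[0]))[:k]
    # 3. Majority vote over the neighbours' class labels
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

print(knn_classify([1.0, 1.1], training, k=3))  # "A"
print(knn_classify([3.0, 3.0], training, k=5))  # "B"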
K Nearest Neighbors
• K Nearest Neighbors
– Advantage
• Simple
• Powerful
• Requires no training time
– Disadvantage
• Memory intensive
• Classification/estimation is slow
KNN = k nearest neighbors
[Figure: scatter plot of Gene 1 vs. Gene 2 with an unclassified point marked "?"]
KNN is another method for classification. For each point it looks at its k nearest neighbors.
If red = brain tumor and yellow healthy – do I have a brain tumor?
For each point it looks at its k nearest neighbors. For example, with k = 3 the method looks at a point's 3 nearest neighbors to decide how to classify it. If the majority are "Red" it will classify the point as red.
If red = brain tumor and yellow healthy – do I have a brain tumor?
KNN = k nearest neighbors
In the above example – how will the point be classified in KNN with K=1?
KNN - exercise
KNN Classification
[Figure: scatter plot of Loan ($0–$250,000) vs. Age (0–70), with Non-Default and Default classes]
KNN Classification – Distance
Age Loan Default Distance
25 $40,000 N 102000
35 $60,000 N 82000
45 $80,000 N 62000
20 $20,000 N 122000
35 $120,000 N 22000
52 $18,000 N 124000
23 $95,000 Y 47000
40 $62,000 Y 80000
60 $100,000 Y 42000
48 $220,000 Y 78000
33 $150,000 Y 8000
48 $142,000 ?
D = sqrt((x1 − x2)² + (y1 − y2)²)
KNN Classification – Standardized Distance
Age Loan Default Distance
0.125 0.11 N 0.7652
0.375 0.21 N 0.5200
0.625 0.31 N 0.3160
0 0.01 N 0.9245
0.375 0.50 N 0.3428
0.8 0.00 N 0.6220
0.075 0.38 Y 0.6669
0.5 0.22 Y 0.4437
1 0.41 Y 0.3650
0.7 1.00 Y 0.3861
0.325 0.65 Y 0.3771
0.7 0.61 ?
Standardized value: Xs = (X − Min) / (Max − Min)
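The standardized values in the table above can be reproduced with a short min–max rescaling sketch; without it, the raw Loan values (tens of thousands) would completely dominate the raw Age values in the distance:

# Min-max standardization: Xs = (X - Min) / (Max - Min)
ages  = [25, 35, 45, 20, 35, 52, 23, 40, 60, 48, 33]
loans = [40000, 60000, 80000, 20000, 120000, 18000, 95000, 62000, 100000, 220000, 150000]

def standardize(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print([round(a, 3) for a in standardize(ages)])   # 0.125, 0.375, 0.625, 0.0, ...
print([round(l, 2) for l in standardize(loans)])  # 0.11, 0.21, 0.31, 0.01, ...

# The unknown case (Age 48, Loan $142,000) is rescaled with the same Min/Max:
print(round((48 - min(ages)) / (max(ages) - min(ages)), 2))         # 0.7
print(round((142000 - min(loans)) / (max(loans) - min(loans)), 2))  # 0.61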
KNN Regression - Distance
Age Loan House Price Index Distance
25 $40,000 135 102000
35 $60,000 256 82000
45 $80,000 231 62000
20 $20,000 267 122000
35 $120,000 139 22000
52 $18,000 150 124000
23 $95,000 127 47000
40 $62,000 216 80000
60 $100,000 139 42000
48 $220,000 250 78000
33 $150,000 264 8000
48 $142,000 ?
D = sqrt((x1 − x2)² + (y1 − y2)²)
KNN Regression – Standardized Distance
Age Loan House Price Index Distance
0.125 0.11 135 0.7652
0.375 0.21 256 0.5200
0.625 0.31 231 0.3160
0 0.01 267 0.9245
0.375 0.50 139 0.3428
0.8 0.00 150 0.6220
0.075 0.38 127 0.6669
0.5 0.22 216 0.4437
1 0.41 139 0.3650
0.7 1.00 250 0.3861
0.325 0.65 264 0.3771
0.7 0.61 ?
Standardized value: Xs = (X − Min) / (Max − Min)
KNN – Number of Neighbors
• If K=1, select the nearest neighbor
• If K>1,
– For classification select the most frequent neighbor.
– For regression calculate the average of K
neighbors.
Distance – Categorical Variables
D(x, y) = 0 if x = y, and D(x, y) = 1 if x ≠ y
X Y Distance
Male Male 0
Male Female 1
DECISION TREES
Decision trees
Example of a Decision Tree
Training Data:
Tid | Refund | Marital Status | Taxable Income | Cheat
1 | Yes | Single | 125K | No
2 | No | Married | 100K | No
3 | No | Single | 70K | No
4 | Yes | Married | 120K | No
5 | No | Divorced | 95K | Yes
6 | No | Married | 60K | No
7 | Yes | Divorced | 220K | No
8 | No | Single | 85K | Yes
9 | No | Married | 75K | No
10 | No | Single | 90K | Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)
Refund = Yes → NO
Refund = No → MarSt
  MarSt = Married → NO
  MarSt = Single, Divorced → TaxInc
    TaxInc < 80K → NO
    TaxInc > 80K → YES
Another Example of Decision Tree
The same training data (Tid, Refund, Marital Status, Taxable Income, Cheat) as above.

Model: Decision Tree
MarSt = Married → NO
MarSt = Single, Divorced → Refund
  Refund = Yes → NO
  Refund = No → TaxInc
    TaxInc < 80K → NO
    TaxInc > 80K → YES
There could be more than one tree that fits the same data!
Decision Tree Classification Task
Training Set:
Tid | Attrib1 | Attrib2 | Attrib3 | Class
1 | Yes | Large | 125K | No
2 | No | Medium | 100K | No
3 | No | Small | 70K | No
4 | Yes | Medium | 120K | No
5 | No | Large | 95K | Yes
6 | No | Medium | 60K | No
7 | Yes | Large | 220K | No
8 | No | Small | 85K | Yes
9 | No | Medium | 75K | No
10 | No | Small | 90K | Yes

Induction: the Tree Induction algorithm learns a Model (a Decision Tree) from the Training Set.

Test Set:
Tid | Attrib1 | Attrib2 | Attrib3 | Class
11 | No | Small | 55K | ?
12 | Yes | Medium | 80K | ?
13 | Yes | Large | 110K | ?
14 | No | Small | 95K | ?
15 | No | Large | 67K | ?

Deduction: the learned Model is applied to the Test Set to predict the missing Class values.
Apply Model to Test Data
Test Data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Start from the root of the tree and follow the branches that match the test record:
• Refund = No → take the "No" branch to MarSt
• Marital Status = Married → take the "Married" branch to the leaf NO
• Assign Cheat to "No"
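The traversal above can also be written as a few lines of code. This is only a sketch of the tree shown in the slides; the function name is illustrative:

def classify_cheat(refund, marital_status, taxable_income):
    # Decision tree: Refund -> MarSt -> TaxInc
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Single or Divorced
    return "No" if taxable_income < 80_000 else "Yes"

print(classify_cheat("No", "Married", 80_000))  # -> "No", as in the walkthrough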
Decision tree – example:
A small tree with a node "weather" (edges: rainy, sunny) and a node "distance < 20 km" (edges: yes, no), leading to leaves.
Concepts: root, inner node, leaf, edges
Decision tree construction
[Figure: two-dimensional training data (attributes y1, y2) with objects of classes 1 and 2, partitioned by threshold values a1, a2, a3, and the corresponding decision trees with tests such as y2 < a1, y1 < a2, y1 < a3]

Partition for a node on attribute yi:
1. Quantitative data: comparison with some threshold value (a two-way yes/no split, e.g. yi > threshold).
2. Qualitative data: each possible value has to be used (one branch per value yi1, yi2, …, yik).
The partition of a node for qualitative data:
1. For each attribute yi, calculate the value of some given measure.
2. Choose the attribute which is optimal in the sense of the chosen measure.
3. From the given node, create a number of edges equal to the number of values of attribute yi (values yi1, yi2, …, yik leading to child nodes t1, t2, …, tk).

• Decision trees are often generated by hand to precisely and consistently define a decision-making process.
• However, they can also be generated automatically from the data.
• They consist of a series of decision points
based on certain variables
Splitting Criteria -Dividing Observations
• It is common for the split at each level to be a two-way split.
• There are methods that split more than two ways.
• However, care should be taken using these
methods since splitting the set in many ways
early in the construction of the tree may result in
missing interesting relationships that become
exposed as the tree growing process continues.
Any variable type can be split using a two-way split:
• Dichotomous: Variables with two values are the most straightforward to split since each branch represents a specific value. For example, a variable Temperature may have only two values, hot and cold. Observations will be split based on those with hot and those with cold temperature values.
• Nominal: Since nominal values are discrete values with no order, a two-way split is accomplished with one subset comprising the observations that equal a certain value and the other subset comprising those that do not equal that value. For example, a variable Color that can take the values red, green, blue, and black may be split two ways: observations with Color equal to red form one subset, and those not equal to red (i.e., green, blue, and black) form the other.
Ordinal: In the case where a variable's discrete values are ordered, the resulting subsets may be made up of more than one value, as long as the ordering is retained. For example, a variable Quality with possible values low, medium, high, and excellent may be split two ways in three possible ways. For example, observations equaling low or medium in one subset and observations equaling high or excellent in the other subset.
Another example is where low values are in one set and medium, high, and excellent values are in the other set.
Continuous: For variables with continuous values to be split two-ways, a specific cutoff value needs to be determined, where on one side of the split are values less than the cutoff and on the other side of the split are values greater than or equal to the cutoff. For example, a variable Weight which can take any value between 0 and 1,000 with a selected cutoff of 200. The first subset would be those observations where the Weight is below 200 and the other subset would be those observations where the Weight is greater than or equal to 200.
A splitting criterion has two components:
• (1) the variable to split on and
• (2) values of the variable to split on.
To determine the best split, all possible splits of all variables must be considered. Since it is necessary to rank the splits, a score should be calculated for each split.
There are many ways to rank the split.
The following describes two approaches for prioritizing splits, based on whether the response is categorical or continuous.
• The objective for an optimal split is to create subsets that result in observations with a single response value. In this example, there are 20 observations prior to splitting.
• The response variable (Temperature) has two
possible values, hot and cold. Prior to the split,
the response has an even distribution with the
number of observations where the Temperature
equals hot is ten and with the number of
observations where the Temperature equals cold
is also ten.
• Different criteria are considered for splitting these observations which results in different distributions of the response variables for each subset (N2 and N3):
• Split a: Each subset contains ten observations. All ten observations in N2 have hot temperature values, whereas the ten observations in node N3 are all cold.
• Split b: Again each subset (N2 and N3) contains ten observations.
However, in this example there is an even distribution of hot and cold values in each subset.
• Split c: In this case the splitting criterion results in two subsets where node N2 has nine observations (one hot and eight cold) and node N3 has 11 observations (nine hot and two cold).
• Split a is the best split since each node contains observations where the response is one or the other category.
• Split b results in the same even split of hot and cold values (50%
hot, 50% cold) in each of the resulting nodes (N2 and N3) and would not be considered a good split.
• Split c is a good split; however, this split is not as clean as split a since there are values of both hot and cold in both subsets.
• The proportion of hot and cold values is biased, in node N2 towards cold values and in N3 towards hot values. When determining the best splitting criteria, it is important to determine how clean each split is, based on the proportion of the different categories of the response variable (or impurity).
• S is a sample of training examples
• p is the proportion of positive examples in S
• Entropy measures the impurity of S
• Entropy(S) = −p·log₂(p) − (1−p)·log₂(1−p)
misclassification, Gini, and entropy
• There are three primary methods for calculating impurity:
misclassification, Gini, and entropy.
• In scenario 1, all ten observations have value cold whereas in scenario 2, one observation has value hot and nine observations have value cold.
• For each scenario, an entropy score is calculated.
• Cleaner splits result in lower scores.
• In scenario 1 and scenario 11, the split cleanly breaks the
set into observations with only one value. The score for
these scenarios is 0. In scenario 6, the observations are split
evenly across the two values and this is reflected in a score
of 1. In other cases, the score reflects how well the two
values are split.
• In order to determine the best split, we now need to calculate a ranking based on how cleanly each split separates the response data.
• This is calculated on the basis of the impurity before and after the split.
• The formula for this calculation, Gain, is shown below:
Gain = Entropy(parent) − Σ (j = 1..k) [N(vj) / N] · Entropy(vj)
where:
• N is the number of observations in the parent node,
• k is the number of possible resulting (child) nodes,
• N(vj) is the number of observations for each of the j child nodes,
• vj is the set of observations for the jth node.
• It should be noted that the Gain formula can be
used with other impurity methods by replacing
the entropy calculation.
ID3 Algorithm
• The ID3 algorithm is considered a very simple decision tree algorithm (Quinlan, 1986).
• ID3 uses information gain as splitting criteria.
• The growing stops when all instances belong to a single value of target feature or when best information gain is not greater than zero.
• ID3 does not apply any pruning procedures nor
does it handle numeric attributes or missing
values.
C4.5 Algorithm
• C4.5 is an evolution of ID3, presented by the same author (Quinlan, 1993).
• It uses gain ratio as splitting criteria.
• The splitting ceases when the number of instances to be split is below a certain threshold.
• Error–based pruning is performed after the growing phase. C4.5 can handle numeric attributes.
• It can induce from a training set that incorporates
missing values by using corrected gain ratio
criteria as presented above.
Example: Decision Tree for PlayTennis
Example: Data for PlayTennis
Decision Tree for PlayTennis
3.4 The Basic Decision Tree Learning Algorithm
• Main loop:
1. A ← the "best" decision attribute for the next node
2. Assign A as the decision attribute for the node
3. For each value of A, create a new descendant of the node
4. Sort the training examples to the leaf nodes
5. If the training examples are perfectly classified, then STOP; else iterate over the new leaf nodes
• Which attribute is best?
Entropy
S is a sample of training examples
p⊕ is the proportion of positive examples in S
p⊖ is the proportion of negative examples in S
Entropy measures the impurity of S
Entropy(S) ≡ −p⊕ log₂ p⊕ − p⊖ log₂ p⊖
Information Gain
Gain(S, A) = expected reduction in entropy due to sorting on A:
Gain(S, A) ≡ Entropy(S) − Σ (v ∈ Values(A)) [|Sv| / |S|] · Entropy(Sv)
Training Examples
Selecting the Next Attribute(1/2)
Which attribute is the best classifier?
Selecting the Next Attribute(2/2)
Ssunny = {D1,D2,D8,D9,D11}
Gain(Ssunny, Humidity) = .970 − (3/5)·0.0 − (2/5)·0.0 = .970
Gain(Ssunny, Temperature) = .970 − (2/5)·0.0 − (2/5)·1.0 − (1/5)·0.0 = .570
Gain(Ssunny, Wind) = .970 − (2/5)·1.0 − (3/5)·.918 = .019
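These three values can be checked with a short sketch; the (positive, negative) counts per attribute value within Ssunny are taken from the standard PlayTennis table:

from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

s_sunny = [2, 3]  # Ssunny = {D1, D2, D8, D9, D11}: 2 Yes, 3 No

# (yes, no) counts for each attribute value within Ssunny
splits = {
    "Humidity":    [[0, 3], [2, 0]],          # High, Normal
    "Temperature": [[0, 2], [1, 1], [1, 0]],  # Hot, Mild, Cool
    "Wind":        [[1, 2], [1, 1]],          # Weak, Strong
}

n = sum(s_sunny)
for attr, children in splits.items():
    g = entropy(s_sunny) - sum(sum(c) / n * entropy(c) for c in children)
    print(attr, round(g, 3))  # 0.971, 0.571, 0.020 (the slides round to .970, .570, .019)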
Converting A Tree to Rules
IF (Outlook = Sunny) ∧ (Humidity = High) THEN PlayTennis = No
IF (Outlook = Sunny) ∧ (Humidity = Normal) THEN PlayTennis = Yes
….
Factors Affecting Sunburn
Name | Hair | Height | Weight | Lotion | Result
Sarah | blonde | average | light | no | positive
Dana | blonde | tall | average | yes | negative
Alex | brown | short | average | yes | negative
Annie | blonde | short | average | no | positive
Emily | red | average | heavy | no | positive
Peter | brown | tall | heavy | no | negative
John | brown | average | heavy | no | negative
Katie | blonde | short | light | yes | negative
Phase 1: From Data to Tree
Perform average entropy calculations on the complete data set for each of the four attributes:
Hair: b1 = blonde, b2 = red, b3 = brown
Average Entropy = 0.50
Height: b1 = short, b2 = average, b3 = tall
Average Entropy = 0.69
Weight: b1 = light, b2 = average, b3 = heavy
Average Entropy = 0.94
Lotion: b1 = no, b2 = yes
Average Entropy = 0.61
the attribute "hair color" is selected as the first test because it minimizes the entropy.