Data mining
Piotr Paszek
Classification
Naive Bayes Classifier
Bayes Classification Methods
What are Bayesian classifiers?
Bayesian classifiers are statistical classifiers based on Bayes' theorem.
They can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.
Bayesian Classifiers
Naive Bayesian Classifiers
Assume that the effect of an attribute value on a given class is independent of the values of the other attributes
Bayesian Belief Networks
Graphical models that allow the representation of dependencies among subsets of attributes
Bayes’ Theorem
Bayes’ Theorem
P(H|X) = P(X|H) · P(H) / P(X)
Let X be a data sample (to classify); its class label is unknown
Let H be a hypothesis that X belongs to class C
Classification is to determine P(H|X) (i.e., the "a posteriori" probability): the probability that the hypothesis holds given the observed data sample X
P(H) ("a priori" probability): the initial probability of the hypothesis
P(X): the probability that the sample data is observed
P(X|H) (conditional probability): the probability of observing the sample X, given that the hypothesis holds
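As a quick numeric sketch (the numbers below are made up for illustration, not taken from these slides), the theorem evaluates directly in Python:

# Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)
# All three inputs are hypothetical values, for illustration only.
p_h = 0.3          # P(H): a priori probability of the hypothesis
p_x_given_h = 0.8  # P(X|H): probability of observing X when H holds
p_x = 0.5          # P(X): probability of observing the sample X at all

p_h_given_x = p_x_given_h * p_h / p_x  # P(H|X): a posteriori probability
print(p_h_given_x)  # 0.48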
Classification using the Bayes Theorem
Let D be a training set of tuples and their associated class labels, where each tuple is represented by an n-attribute vector X = (x1, x2, ..., xn).
Suppose there are m classes C1, C2, ..., Cm. Classification is to derive the maximum a posteriori hypothesis, i.e., the maximal P(Ci|X) for i = 1, 2, ..., m.
From Bayes' theorem: P(Ci|X) = P(X|Ci) · P(Ci) / P(X).
Since P(X) is constant for all classes, only P(X|Ci) · P(Ci) needs to be maximized (and computed).
The classifier predicts that the class label of tuple X is the class Ci if and only if
P(X|Ci) · P(Ci) > P(X|Cj) · P(Cj) for 1 ≤ j ≤ m, j ≠ i.
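A minimal sketch of this decision rule in Python, assuming the values P(X|Ci) and P(Ci) have already been computed (the numbers used below are the ones from the worked example later in these slides):

def predict(likelihoods, priors):
    # Return the index i maximizing P(X|Ci) * P(Ci)
    scores = [l * p for l, p in zip(likelihoods, priors)]
    return max(range(len(scores)), key=lambda i: scores[i])

# P(X|C1), P(X|C2) and P(C1), P(C2) from the example below
print(predict([0.044, 0.019], [0.643, 0.357]))  # 0, i.e. class C1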
Naive Bayes Classifier
Assumption of class-conditional independence
Attributes are conditionally independent given the class
(i.e., there is no dependence relation between attributes):
P(X|Ci) = ∏_{j=1}^{n} P(xj|Ci).
This greatly reduces the computation cost: only the class distributions need to be counted.
If Ak is categorical, P(xk|Ci) is the number of tuples in Ci having value xk for Ak, divided by the number of tuples of Ci in D.
If Ak is continuous-valued, P(xk|Ci) is usually computed from a Gaussian distribution with mean µ and standard deviation σ:
g(x, µ, σ) = (1 / (√(2π) σ)) · e^(−(x − µ)² / (2σ²)).
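The density g translates directly into code; a minimal sketch:

import math

def gaussian(x, mu, sigma):
    # Gaussian density g(x, mu, sigma) used for continuous attributes
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

print(gaussian(0.0, 0.0, 1.0))  # ~0.3989, the standard normal density at its mean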
Naive Bayes Classifier
Maximize P(X|Ci) · P(Ci)
To maximize P(X|Ci) · P(Ci):
– We need to know (compute) the class prior probabilities P(Ci).
If the probabilities are not known, assume that
P(C1) = P(C2) = ... = P(Cm) ⇒ maximize P(X|Ci).
Class prior probabilities can be estimated by P(Ci) = |Ci,D| / |D|.
– Assume class-conditional independence to reduce the computational cost of P(X|Ci):
given X = (x1, ..., xn), P(X|Ci) = ∏_{j=1}^{n} P(xj|Ci).
The probabilities P(x1|Ci), ..., P(xn|Ci) can be estimated from the training tuples.
Naive Bayes Classifier
Estimating P(xk|Ci)
Categorical Attributes
Recall that xk refers to the value of attribute Ak for tuple X.
P(xk|Ci) = |{x ∈ Ci,D : Ak(x) = xk}| / |Ci,D|,
i.e., the number of tuples of class Ci in D having the value xk for Ak, divided by the number of tuples of class Ci in D.
Continuous-Valued Attributes
A continuous-valued attribute is assumed to have a Gaussian (normal) distribution with mean µ and standard deviation σ:
P(xk|Ci) = g(xk, µCi, σCi)
Estimate µCi and σCi, the mean and standard deviation of the values of attribute Ak for the training tuples of class Ci.
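Both estimators in Python; a sketch that assumes the class-Ci training tuples are given as a list of attribute dictionaries (the sample data below is hypothetical):

import math

def p_categorical(tuples_ci, attr, value):
    # |{x in Ci,D : Ak(x) = xk}| / |Ci,D|
    return sum(1 for t in tuples_ci if t[attr] == value) / len(tuples_ci)

def gaussian_params(tuples_ci, attr):
    # mu_Ci and sigma_Ci of attribute Ak over the class-Ci tuples
    values = [t[attr] for t in tuples_ci]
    mu = sum(values) / len(values)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    return mu, sigma

# Hypothetical class-Ci tuples:
ci = [{"age": 25, "student": "yes"}, {"age": 35, "student": "no"}, {"age": 30, "student": "yes"}]
print(p_categorical(ci, "student", "yes"))  # 0.666...
print(gaussian_params(ci, "age"))           # (30.0, ~4.08)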
Naive Bayes Classifier – Example
ID  age     income  student  status   buys computer
 1  ≤30     high    no       single   no
 2  ≤30     high    no       married  no
 3  31..40  high    no       single   yes
 4  >40     medium  no       single   yes
 5  >40     low     yes      single   yes
 6  >40     low     yes      married  no
 7  31..40  low     yes      married  yes
 8  ≤30     medium  no       single   no
 9  ≤30     low     yes      single   yes
10  >40     medium  yes      single   yes
11  ≤30     medium  yes      married  yes
12  31..40  medium  no       married  yes
13  31..40  high    yes      single   yes
14  >40     medium  no       married  no
Han J., Kamber M., Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2006
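For readers who want to reproduce the computations on the following slides, the table can be encoded as a plain Python list (the dictionary keys are arbitrary choices):

# Training set D from the table above, one dict per tuple
D = [
    {"age": "<=30",   "income": "high",   "student": "no",  "status": "single",  "buys": "no"},
    {"age": "<=30",   "income": "high",   "student": "no",  "status": "married", "buys": "no"},
    {"age": "31..40", "income": "high",   "student": "no",  "status": "single",  "buys": "yes"},
    {"age": ">40",    "income": "medium", "student": "no",  "status": "single",  "buys": "yes"},
    {"age": ">40",    "income": "low",    "student": "yes", "status": "single",  "buys": "yes"},
    {"age": ">40",    "income": "low",    "student": "yes", "status": "married", "buys": "no"},
    {"age": "31..40", "income": "low",    "student": "yes", "status": "married", "buys": "yes"},
    {"age": "<=30",   "income": "medium", "student": "no",  "status": "single",  "buys": "no"},
    {"age": "<=30",   "income": "low",    "student": "yes", "status": "single",  "buys": "yes"},
    {"age": ">40",    "income": "medium", "student": "yes", "status": "single",  "buys": "yes"},
    {"age": "<=30",   "income": "medium", "student": "yes", "status": "married", "buys": "yes"},
    {"age": "31..40", "income": "medium", "student": "no",  "status": "married", "buys": "yes"},
    {"age": "31..40", "income": "high",   "student": "yes", "status": "single",  "buys": "yes"},
    {"age": ">40",    "income": "medium", "student": "no",  "status": "married", "buys": "no"},
]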
Naive Bayes Classifier – Example I
Tuple to classify is:
X = (age ≤ 30, income = medium, student = yes, status = single).
Class:
C1 (buys computer = yes), C2 (buys computer = no).
Maximize:
P (X|Ci)· P (Ci), i = 1, 2
Naive Bayes Classifier – Example II
P(C1) = 9/14 = 0.643    P(C2) = 5/14 = 0.357
P(age ≤ 30 | C1) = 2/9 = 0.222    P(age ≤ 30 | C2) = 3/5 = 0.600
P(income = medium | C1) = 4/9 = 0.444    P(income = medium | C2) = 2/5 = 0.400
P(student = yes | C1) = 6/9 = 0.667    P(student = yes | C2) = 1/5 = 0.200
P(status = single | C1) = 6/9 = 0.667    P(status = single | C2) = 2/5 = 0.400
P(X | C1) = 0.222 · 0.444 · 0.667 · 0.667 = 0.044
P(X | C2) = 0.600 · 0.400 · 0.200 · 0.400 = 0.019
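These estimates can be reproduced with a few lines of Python, reusing the list D sketched after the table:

def p_cond(attr, value, cls):
    # P(attr = value | buys = cls), estimated by counting in D
    ci = [t for t in D if t["buys"] == cls]
    return sum(1 for t in ci if t[attr] == value) / len(ci)

x = {"age": "<=30", "income": "medium", "student": "yes", "status": "single"}
for cls in ("yes", "no"):
    p = 1.0
    for attr, value in x.items():
        p *= p_cond(attr, value, cls)   # multiply P(xj|Ci) over all attributes
    print(cls, round(p, 3))  # yes 0.044, no 0.019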
Naive Bayes Classifier – Example III
P(X | C1) = 0.044    P(C1) = 9/14 = 0.643
P(X | C2) = 0.019    P(C2) = 5/14 = 0.357
P(X | C1) · P(C1) = 0.044 · 0.643 = 0.028
P(X | C2) · P(C2) = 0.019 · 0.357 = 0.007
We choose the maximum value of P(X|Ci) · P(Ci), i = 1, 2.
So the naive Bayes classifier will classify X to class:
C1 (buys computer = yes).
Naive Bayes Classifier: Zero-Probability Problem
From the assumption of class-conditional independence it follows that if any single conditional probability P(xj|Ci) equals zero, then the predicted probability P(X|Ci) will be zero.
Example
Suppose a dataset with 1000 tuples: income = low (0), income = medium (990), and income = high (10).
So P(income = low) = 0 and
P(income = low | Ci) = 0 for i = 1, 2,
and for any X such that income(X) = low:
P(X|Ci) = ∏_{j=1}^{n} P(xj|Ci) = 0 for i = 1, 2.
The classifier cannot predict (select) the class label of such a tuple.
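The collapse is easy to see in code; a toy sketch with made-up factors:

# One zero factor, e.g. P(income = low | Ci) = 0, poisons the whole product
factors = [0.0, 0.4, 0.7, 0.2]
p = 1.0
for f in factors:
    p *= f
print(p)  # 0.0: the class scores can no longer be ranked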
Avoiding the Zero-Probability Problem
Naive Bayesian prediction requires each conditional probability to be non-zero; otherwise, the predicted probability will be zero.
Solution: use the Laplacian correction – add 1 to each count
Example
Suppose a dataset with 1000 tuples, income = low (0), income = medium (990), and income = high (10).
After Laplacian correction:
P(income = low) = 1/1003
P(income = medium) = 991/1003
P(income = high) = 11/1003
The “corrected” probability estimates are close to their
“uncorrected” counterparts.
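A sketch of the correction for this example (the helper name is my own):

def laplace(count, total, num_values):
    # Add 1 to each count; the denominator grows by the number of distinct values
    return (count + 1) / (total + num_values)

counts = {"low": 0, "medium": 990, "high": 10}
total = sum(counts.values())  # 1000 tuples
for value, count in counts.items():
    print(value, laplace(count, total, len(counts)))
# low 1/1003, medium 991/1003, high 11/1003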
Naive Bayes Classifier: Comments
Advantages
Easy to implement
Good results obtained in most of the cases
Disadvantages
Assumption of class-conditional independence, which can cause a loss of accuracy
In practice, dependencies exist among variables (e.g., in medical data)
Such dependencies cannot be modeled by a naive Bayes classifier
How to deal with these dependencies?
Bayesian Belief Networks