Data mining
Piotr Paszek
Classification
Naive Bayes Classifier
Bayes Classification Methods
What are Bayesian classifiers?
Bayesian classifiers are statistical classifiers based on Bayes' theorem.
They can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.
Bayesian Classifiers
Naive Bayesian Classifiers
Assume that the effect of an attribute value on a given class is independent of the values of the other attributes
Bayesian Belief Networks
Graphical models that allow the representation of dependencies among subsets of attributes
Bayes’ Theorem
Bayes’ Theorem
P(H|X) = P(X|H) · P(H) / P(X)
Let X be a data sample (to classify); its class label is unknown
Let H be a hypothesis that X belongs to class C
Classification is to determine P(H|X) (i.e., the "a posteriori" probability): the probability that the hypothesis holds given the observed data sample X
P(H) ("a priori" probability): the initial probability of the hypothesis
P(X): the probability that the sample data is observed
P(X|H) (conditional probability): the probability of observing the sample X, given that the hypothesis holds
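As a quick numeric sketch (the numbers below are made up for illustration, not taken from these slides), the theorem evaluates directly in Python:

# Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)
# All three inputs are hypothetical values, for illustration only.
p_h = 0.3          # P(H): a priori probability of the hypothesis
p_x_given_h = 0.8  # P(X|H): probability of observing X when H holds
p_x = 0.5          # P(X): probability of observing the sample X at all

p_h_given_x = p_x_given_h * p_h / p_x  # P(H|X): a posteriori probability
print(p_h_given_x)  # 0.48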
Classification using the Bayes Theorem
Let D be a training set of tuples and their associated class labels, where each tuple is represented by an n-attribute vector X = (x1, x2, ..., xn).
Suppose there are m classes C1, C2, ..., Cm. Classification is to derive the maximum a posteriori hypothesis, i.e., the maximal P(Ci|X) for i = 1, 2, ..., m.
From Bayes' theorem: P(Ci|X) = P(X|Ci) · P(Ci) / P(X).
Since P(X) is constant for all classes, only P(X|Ci) · P(Ci) needs to be maximized (and computed).
The classifier predicts that the class label of tuple X is the class Ci if and only if
P(X|Ci) · P(Ci) > P(X|Cj) · P(Cj) for 1 ≤ j ≤ m, j ≠ i.
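A minimal sketch of this decision rule in Python, assuming the values P(X|Ci) and P(Ci) have already been computed (the numbers used below are the ones from the worked example later in these slides):

def predict(likelihoods, priors):
    # Return the index i maximizing P(X|Ci) * P(Ci)
    scores = [l * p for l, p in zip(likelihoods, priors)]
    return max(range(len(scores)), key=lambda i: scores[i])

# P(X|C1), P(X|C2) and P(C1), P(C2) from the example below
print(predict([0.044, 0.019], [0.643, 0.357]))  # 0, i.e. class C1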
Naive Bayes Classifier
Assumption of class-conditional independence
Attributes are conditionally independent given the class
(i.e., there is no dependence relation between attributes):
P(X|Ci) = ∏_{j=1}^{n} P(xj|Ci).
This greatly reduces the computation cost: only the class distributions need to be counted.
If Ak is categorical, P(xk|Ci) is the number of tuples in Ci having value xk for Ak, divided by the number of tuples of Ci in D.
If Ak is continuous-valued, P(xk|Ci) is usually computed from a Gaussian distribution with mean µ and standard deviation σ:
g(x, µ, σ) = (1 / (√(2π) σ)) · e^(−(x − µ)² / (2σ²)).
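The density g translates directly into code; a minimal sketch:

import math

def gaussian(x, mu, sigma):
    # Gaussian density g(x, mu, sigma) used for continuous attributes
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

print(gaussian(0.0, 0.0, 1.0))  # ~0.3989, the standard normal density at its mean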
Naive Bayes Classifier
Maximize P(X|Ci) · P(Ci)
To maximize P(X|Ci) · P(Ci):
– We need to know (compute) the class prior probabilities P(Ci).
If the probabilities are not known, assume that
P(C1) = P(C2) = ... = P(Cm) ⇒ maximize P(X|Ci).
Class prior probabilities can be estimated by P(Ci) = |Ci,D| / |D|.
– Assume class-conditional independence to reduce the computational cost of P(X|Ci):
given X = (x1, ..., xn), P(X|Ci) = ∏_{j=1}^{n} P(xj|Ci).
The probabilities P(x1|Ci), ..., P(xn|Ci) can be estimated from the training tuples.
Naive Bayes Classifier
Estimating P(xk|Ci)
Categorical Attributes
Recall that xk refers to the value of attribute Ak for tuple X.
P(xk|Ci) = |{x ∈ Ci,D : Ak(x) = xk}| / |Ci,D|,
i.e., the number of tuples of class Ci in D having the value xk for Ak, divided by the number of tuples of class Ci in D.
Continuous-Valued Attributes
A continuous-valued attribute is assumed to have a Gaussian (normal) distribution with mean µ and standard deviation σ:
P(xk|Ci) = g(xk, µCi, σCi)
Estimate µCi and σCi, the mean and standard deviation of the values of attribute Ak for the training tuples of class Ci.
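Both estimators in Python; a sketch that assumes the class-Ci training tuples are given as a list of attribute dictionaries (the sample data below is hypothetical):

import math

def p_categorical(tuples_ci, attr, value):
    # |{x in Ci,D : Ak(x) = xk}| / |Ci,D|
    return sum(1 for t in tuples_ci if t[attr] == value) / len(tuples_ci)

def gaussian_params(tuples_ci, attr):
    # mu_Ci and sigma_Ci of attribute Ak over the class-Ci tuples
    values = [t[attr] for t in tuples_ci]
    mu = sum(values) / len(values)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    return mu, sigma

# Hypothetical class-Ci tuples:
ci = [{"age": 25, "student": "yes"}, {"age": 35, "student": "no"}, {"age": 30, "student": "yes"}]
print(p_categorical(ci, "student", "yes"))  # 0.666...
print(gaussian_params(ci, "age"))           # (30.0, ~4.08)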
Naive Bayes Classifier – Example
ID  age     income  student  status   buys computer
 1  ≤30     high    no       single   no
 2  ≤30     high    no       married  no
 3  31..40  high    no       single   yes
 4  >40     medium  no       single   yes
 5  >40     low     yes      single   yes
 6  >40     low     yes      married  no
 7  31..40  low     yes      married  yes
 8  ≤30     medium  no       single   no
 9  ≤30     low     yes      single   yes
10  >40     medium  yes      single   yes
11  ≤30     medium  yes      married  yes
12  31..40  medium  no       married  yes
13  31..40  high    yes      single   yes
14  >40     medium  no       married  no
Han J., Kamber M., Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2006
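For readers who want to reproduce the computations on the following slides, the table can be encoded as a plain Python list (the dictionary keys are arbitrary choices):

# Training set D from the table above, one dict per tuple
D = [
    {"age": "<=30",   "income": "high",   "student": "no",  "status": "single",  "buys": "no"},
    {"age": "<=30",   "income": "high",   "student": "no",  "status": "married", "buys": "no"},
    {"age": "31..40", "income": "high",   "student": "no",  "status": "single",  "buys": "yes"},
    {"age": ">40",    "income": "medium", "student": "no",  "status": "single",  "buys": "yes"},
    {"age": ">40",    "income": "low",    "student": "yes", "status": "single",  "buys": "yes"},
    {"age": ">40",    "income": "low",    "student": "yes", "status": "married", "buys": "no"},
    {"age": "31..40", "income": "low",    "student": "yes", "status": "married", "buys": "yes"},
    {"age": "<=30",   "income": "medium", "student": "no",  "status": "single",  "buys": "no"},
    {"age": "<=30",   "income": "low",    "student": "yes", "status": "single",  "buys": "yes"},
    {"age": ">40",    "income": "medium", "student": "yes", "status": "single",  "buys": "yes"},
    {"age": "<=30",   "income": "medium", "student": "yes", "status": "married", "buys": "yes"},
    {"age": "31..40", "income": "medium", "student": "no",  "status": "married", "buys": "yes"},
    {"age": "31..40", "income": "high",   "student": "yes", "status": "single",  "buys": "yes"},
    {"age": ">40",    "income": "medium", "student": "no",  "status": "married", "buys": "no"},
]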
Naive Bayes Classifier – Example I
Tuple to classify is:
X = (age ≤ 30, income = medium, student = yes, status = single).
Class:
C1 (buys computer = yes), C2 (buys computer = no).
Maximize:
P (X|Ci)· P (Ci), i = 1, 2
Naive Bayes Classifier – Example II
P(C1) = 9/14 = 0.643    P(C2) = 5/14 = 0.357
P(age ≤ 30 | C1) = 2/9 = 0.222    P(age ≤ 30 | C2) = 3/5 = 0.600
P(income = medium | C1) = 4/9 = 0.444    P(income = medium | C2) = 2/5 = 0.400
P(student = yes | C1) = 6/9 = 0.667    P(student = yes | C2) = 1/5 = 0.200
P(status = single | C1) = 6/9 = 0.667    P(status = single | C2) = 2/5 = 0.400
P(X | C1) = 0.222 · 0.444 · 0.667 · 0.667 = 0.044
P(X | C2) = 0.600 · 0.400 · 0.200 · 0.400 = 0.019
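These estimates can be reproduced with a few lines of Python, reusing the list D sketched after the table:

def p_cond(attr, value, cls):
    # P(attr = value | buys = cls), estimated by counting in D
    ci = [t for t in D if t["buys"] == cls]
    return sum(1 for t in ci if t[attr] == value) / len(ci)

x = {"age": "<=30", "income": "medium", "student": "yes", "status": "single"}
for cls in ("yes", "no"):
    p = 1.0
    for attr, value in x.items():
        p *= p_cond(attr, value, cls)   # multiply P(xj|Ci) over all attributes
    print(cls, round(p, 3))  # yes 0.044, no 0.019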
Naive Bayes Classifier – Example III
P(X | C1) = 0.044    P(C1) = 9/14 = 0.643
P(X | C2) = 0.019    P(C2) = 5/14 = 0.357
P(X | C1) · P(C1) = 0.044 · 0.643 = 0.028
P(X | C2) · P(C2) = 0.019 · 0.357 = 0.007
We choose the maximum value of P(X|Ci) · P(Ci), i = 1, 2.
So the naive Bayes classifier will classify X to class:
C1 (buys computer = yes).
Naive Bayes Classifier: Zero-Probability Problem
From the assumption of class-conditional independence it follows that if any single conditional probability P(xj|Ci) equals zero, then the predicted probability P(X|Ci) will be zero.
Example
Suppose a dataset with 1000 tuples: income = low (0), income = medium (990), and income = high (10).
So P(income = low) = 0 and
P(income = low | Ci) = 0 for i = 1, 2,
and for any X such that income(X) = low:
P(X|Ci) = ∏_{j=1}^{n} P(xj|Ci) = 0 for i = 1, 2.
The classifier cannot predict (select) the class label of such a tuple.
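The collapse is easy to see in code; a toy sketch with made-up factors:

# One zero factor, e.g. P(income = low | Ci) = 0, poisons the whole product
factors = [0.0, 0.4, 0.7, 0.2]
p = 1.0
for f in factors:
    p *= f
print(p)  # 0.0: the class scores can no longer be ranked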
Avoiding the Zero-Probability Problem
Naive Bayesian prediction requires each conditional probability to be non-zero; otherwise, the predicted probability will be zero.
Solution: use the Laplacian correction – add 1 to each count
Example
Suppose a dataset with 1000 tuples, income = low (0), income = medium (990), and income = high (10).
After Laplacian correction:
P(income = low) = 1/1003
P(income = medium) = 991/1003
P(income = high) = 11/1003
The “corrected” probability estimates are close to their
“uncorrected” counterparts.
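A sketch of the correction for this example (the helper name is my own):

def laplace(count, total, num_values):
    # Add 1 to each count; the denominator grows by the number of distinct values
    return (count + 1) / (total + num_values)

counts = {"low": 0, "medium": 990, "high": 10}
total = sum(counts.values())  # 1000 tuples
for value, count in counts.items():
    print(value, laplace(count, total, len(counts)))
# low 1/1003, medium 991/1003, high 11/1003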
Naive Bayes Classifier: Comments
Advantages
Easy to implement
Good results obtained in most of the cases
Disadvantages
Assumption of class-conditional independence, which can cause a loss of accuracy
In practice, dependencies exist among variables (e.g., in medical data)
Such dependencies cannot be modeled by a naive Bayes classifier
How to deal with these dependencies?
Bayesian Belief Networks