Data mining
Piotr Paszek
Classification
k-NN Classifier
Lazy vs. Eager Learning
1 Eager learning (e.g. decision tree)
Given a set of training tuples, constructs a classification model before receiving new (e.g., test) data to classify
Does a lot of work on the training data
Does less work when test tuples are presented
2 Lazy learning (e.g., instance-based learning)
Simply stores training data (or only minor processing) and waits until it is given a test tuple
Does less work on the training data
Does more work when test tuples are presented
Lazy Learner: Instance-Based Methods
Instance-based learning:
Store training examples and delay the processing (lazy evaluation) until a new instance must be classified
Typical approaches
k-nearest neighbor approach (k-NN)
Instances represented as points in a Euclidean space
Case-based reasoning
Uses symbolic representations and knowledge-based inference
k-Nearest Neighbor Classifier
Nearest-neighbor classifiers compare a given test tuple with training tuples that are similar
Training tuples are described by n attributes (n-dimensional space)
Find the k-nearest tuples from the training set to the unknown tuple
k-NN classifies an unknown example with the most common class among its k closest examples (nearest neighbors)
The closeness between tuples is defined in terms of a distance metric (e.g., Euclidean distance)
Distance
Metric
Let d be a two-argument function (e.g., the distance between two objects).
Function d is a metric if:
1 $d(x, y) \geq 0$;
2 $d(x, y) = 0$ if and only if $x = y$;
3 $d(x, y) = d(y, x)$;
4 $d(x, z) \leq d(x, y) + d(y, z)$.
Distance (numeric attributes)
Let $x = [x_1, x_2, \ldots, x_n]$ and $y = [y_1, y_2, \ldots, y_n]$ be two points in the n-dimensional (Euclidean) space.
Euclidean distance
$$d_E(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
Manhattan distance (taxicab metric)
$$d_M(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$
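For illustration, a minimal NumPy sketch of both distances; the example vectors are made up for this slide:

```python
import numpy as np

def euclidean(x, y):
    """d_E(x, y) = sqrt(sum_i (x_i - y_i)^2)"""
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    """d_M(x, y) = sum_i |x_i - y_i|"""
    return np.sum(np.abs(x - y))

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])
print(euclidean(x, y))   # sqrt(9 + 4 + 0) = sqrt(13) ~ 3.606
print(manhattan(x, y))   # 3 + 2 + 0 = 5
```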
Distance (numeric attributes)
Minkowski distance
$$L_q(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^q \right)^{1/q},$$
where q is a positive natural number.
For q = 1 we obtain the Manhattan distance, for q = 2 the Euclidean distance.
Max distance
$$d_\infty(x, y) = \max_{i=1}^{n} |x_i - y_i|$$
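A small sketch of the Minkowski family, showing that q = 1 and q = 2 recover the two previous distances, with the max distance as the limiting case; the example values are illustrative only:

```python
import numpy as np

def minkowski(x, y, q):
    """L_q(x, y) = (sum_i |x_i - y_i|^q)^(1/q), q a positive natural number"""
    return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

def max_distance(x, y):
    """d_inf(x, y) = max_i |x_i - y_i| (limit of L_q as q grows)"""
    return np.max(np.abs(x - y))

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])
print(minkowski(x, y, 1))   # Manhattan: 5.0
print(minkowski(x, y, 2))   # Euclidean: ~3.606
print(max_distance(x, y))   # 3.0
```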
Distance (nominal or categorical attributes)
Let $x = [x_1, x_2, \ldots, x_n]$ and $y = [y_1, y_2, \ldots, y_n]$ be two vectors, where each $x_i$ is a nominal attribute.
$$d(x, y) = \sum_{i=1}^{n} \delta(x_i, y_i),$$
where
$$\delta(x_i, y_i) = \begin{cases} 0 & x_i = y_i \\ 1 & x_i \neq y_i \end{cases}$$
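A minimal sketch of this overlap count for nominal attributes; the attribute values below are made up:

```python
def nominal_distance(x, y):
    """Number of attributes on which the two tuples disagree (0/1 per attribute)."""
    return sum(0 if xi == yi else 1 for xi, yi in zip(x, y))

print(nominal_distance(["red", "small", "round"],
                       ["red", "large", "round"]))   # 1
```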
Normalization
To improve the performance of the k-NN algorithm, a commonly used technique is to normalize the data from the training set.
As a result, all dimensions over which the distance is calculated have the same level of significance.
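The slides do not fix a particular normalization; one common choice is min-max scaling of every attribute to [0, 1], as in this sketch (array names are illustrative):

```python
import numpy as np

def min_max_normalize(X):
    """Rescale each column (attribute) of X to [0, 1] using the training-set min/max."""
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    span = np.where(x_max > x_min, x_max - x_min, 1.0)  # avoid division by zero
    return (X - x_min) / span, x_min, span

X_train = np.array([[1.0, 100.0],
                    [2.0, 300.0],
                    [3.0, 200.0]])
X_norm, x_min, span = min_max_normalize(X_train)
# A test tuple must be rescaled with the same training-set statistics:
x_test = (np.array([2.5, 150.0]) - x_min) / span
```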
k-NN Classifiers
Classification
The unknown tuple is assigned the most common class among its k nearest neighbors
When k = 1 the unknown tuple is assigned the class of the training tuple that is closest to it
The 1-NN scheme has a misclassification probability that is no worse than twice that of the situation where we know the precise probability density of each class (the Bayes error rate)
Prediction
Nearest neighbor classifiers can also be used for prediction
Return a real-valued prediction for a given unknown tuple
The classifier returns the average value of the real-valued labels associated with the k nearest neighbors of the unknown tuple
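A minimal sketch of both uses, classification by majority vote and prediction by averaging, assuming Euclidean distance; the helper name `knn_predict` and the toy data are illustrative:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3, regression=False):
    """Classify (majority vote) or predict (average) from the k nearest neighbors."""
    dists = np.sqrt(np.sum((X_train - x_query) ** 2, axis=1))   # Euclidean distances
    nearest = np.argsort(dists)[:k]                              # indices of the k closest tuples
    if regression:
        return np.mean(y_train[nearest])                         # average of real-valued labels
    return Counter(y_train[nearest]).most_common(1)[0][0]        # most common class

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_class = np.array(["A", "A", "B", "B"])
print(knn_predict(X_train, y_class, np.array([0.2, 0.1]), k=3))  # -> "A"
```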
How to Determine the Value of k
Larger k may lead to better performance
But if we set k too large we may end up looking at samples that are not neighbors (are far away from the query)
We can use a test set (validation) to find the best k
Rule of thumb: k < sqrt(n), where n is the number of training examples
We can use validation to find k
Start with k = 1; use a test set to estimate the error rate of the classifier
Increment k and estimate the error rate for the new k
Choose the k value that gives the minimum error rate
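A sketch of this validation loop, reusing the `knn_predict` helper from the earlier sketch; the helper name and the hold-out split are assumptions for illustration:

```python
def choose_k(X_train, y_train, X_val, y_val, k_values):
    """Return the k with the lowest error rate on a held-out validation set."""
    best_k, best_error = None, float("inf")
    for k in k_values:
        errors = sum(knn_predict(X_train, y_train, x, k=k) != y
                     for x, y in zip(X_val, y_val))
        error_rate = errors / len(y_val)
        if error_rate < best_error:
            best_k, best_error = k, error_rate
    return best_k

# Rule of thumb: only try k up to about sqrt(n)
# k_values = range(1, int(len(X_train) ** 0.5) + 1)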
Shortcomings of k-NN Algorithms
First: no time is required to estimate parameters from training data, but the time to find the nearest neighbors can be prohibitive
Some ideas to overcome this problem:
Reduce the time taken to compute distances by working in a reduced dimension (use principal component analysis, PCA)
Use a sophisticated data structure, such as a tree, to speed up the identification of the nearest neighbors (see the sketch after this list)
Edit the training data to remove redundant observations
e.g., remove observations in the training data that have no effect on the classification because they are surrounded by observations that all belong to the same class
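As one example of a tree-based speed-up, a sketch using a k-d tree from SciPy (`scipy.spatial.cKDTree`); the data here are random and purely illustrative:

```python
import numpy as np
from scipy.spatial import cKDTree  # k-d tree for fast nearest-neighbor queries

rng = np.random.default_rng(0)
X_train = rng.random((10_000, 5))        # 10 000 training tuples, 5 attributes
tree = cKDTree(X_train)                  # built once, before any test tuple arrives

x_query = rng.random(5)
dists, idx = tree.query(x_query, k=3)    # distances and indices of the 3 nearest neighbors
```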
Shortcomings of k-NN Algorithms
Second: “the Curse of Dimensionality”
Let p be the number of dimensions
The expected distance to the nearest neighbor goes up dramatically with p unless the size of the training data set increases exponentially with p
Some ideas to overcome this problem
Reduce the dimensionality of the space of attributes
Select subsets of the predictor variables, or combine them using methods such as principal components, singular value decomposition, and factor analysis (see the sketch below)
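A sketch of the PCA idea using scikit-learn (`sklearn.decomposition.PCA`); the dimensions and data are made up for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train = rng.random((500, 50))          # 500 tuples with 50 attributes

pca = PCA(n_components=10)               # keep the 10 leading principal components
X_reduced = pca.fit_transform(X_train)   # k-NN distances are now computed in 10 dimensions

# A test tuple must be projected with the same transformation:
x_test_reduced = pca.transform(rng.random((1, 50)))
```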
k-NN Classifiers – Summary
Advantages
Can be applied to data from any distribution; for example, the data does not have to be separable by a linear boundary
Very simple and intuitive
Good classification if the number of samples is large enough
Disadvantages
Choosing k may be tricky
Test stage is computationally expensive
No training stage; all the work is done during the test stage
This is actually the opposite of what we want: usually we can afford a long training step, but we want a fast test step
Case-Based Reasoning (CBR)
CBR: Uses a database of problem solutions to solve new problems
Stores symbolic descriptions (tuples or cases)
Applications: customer service, legal rulings
Methodology
Instances are represented by rich symbolic descriptions (e.g., function graphs)
Search for similar cases; multiple retrieved cases may be combined
Tight coupling between case retrieval, knowledge-based reasoning, and problem solving
Challenges
Find a good similarity metric
Indexing is based on a syntactic similarity measure; when this fails, the system backtracks and adapts to additional cases