Data mining
Piotr Paszek
Classification
k-NN Classifier
Lazy vs. Eager Learning
1 Eager learning (e.g. decision tree)
Given a set of training tuples, constructs a classification model before receiving new (e.g., test) data to classify
Does a lot of work on the training data
Does less work when test tuples are presented
2 Lazy learning (e.g., instance-based learning)
Simply stores training data (or only minor processing) and waits until it is given a test tuple
Does less work on the training data
Does more work when test tuples are presented
Lazy Learner: Instance-Based Methods
Instance-based learning:
Store training examples and delay the processing (lazy evaluation) until a new instance must be classified
Typical approaches
k-nearest neighbor approach (k-NN)
Instances represented as points in a Euclidean space
Case-based reasoning
Uses symbolic representations and knowledge-based inference
k-Nearest Neighbor Classifier
Nearest-neighbor classifiers compare a given test tuple with training tuples that are similar
Training tuples are described by n attributes (n-dimensional space)
Find the k-nearest tuples from the training set to the unknown tuple
k-NN classifies an unknown example with the most common class among its k closest examples (nearest neighbors)
The closeness between tuples is defined in terms of a distance metric (e.g., Euclidean distance)
Distance
Metric
Let d be a two-argument function (e.g., the distance between two objects).
Function d is a metric if:
1 $d(x, y) \geq 0$;
2 $d(x, y) = 0$ if and only if $x = y$;
3 $d(x, y) = d(y, x)$;
4 $d(x, z) \leq d(x, y) + d(y, z)$.
Distance (numeric attributes)
Let $x = [x_1, x_2, \ldots, x_n]$ and $y = [y_1, y_2, \ldots, y_n]$ be two points in the n-dimensional (Euclidean) space.
Euclidean distance
$$d_E(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
Manhattan distance (taxicab metric)
$$d_M(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$
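For illustration, a minimal NumPy sketch of both distances; the example vectors are made up for this slide:

```python
import numpy as np

def euclidean(x, y):
    """d_E(x, y) = sqrt(sum_i (x_i - y_i)^2)"""
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    """d_M(x, y) = sum_i |x_i - y_i|"""
    return np.sum(np.abs(x - y))

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])
print(euclidean(x, y))   # sqrt(9 + 4 + 0) = sqrt(13) ~ 3.606
print(manhattan(x, y))   # 3 + 2 + 0 = 5
```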
Distance (numeric attributes)
Minkowski distance
$$L_q(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^q \right)^{1/q},$$
where q is a positive natural number.
For q = 1 we obtain the Manhattan distance, for q = 2 the Euclidean distance.
Max distance
$$d_\infty(x, y) = \max_{i=1}^{n} |x_i - y_i|$$
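A small sketch of the Minkowski family, showing that q = 1 and q = 2 recover the two previous distances, with the max distance as the limiting case; the example values are illustrative only:

```python
import numpy as np

def minkowski(x, y, q):
    """L_q(x, y) = (sum_i |x_i - y_i|^q)^(1/q), q a positive natural number"""
    return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

def max_distance(x, y):
    """d_inf(x, y) = max_i |x_i - y_i| (limit of L_q as q grows)"""
    return np.max(np.abs(x - y))

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])
print(minkowski(x, y, 1))   # Manhattan: 5.0
print(minkowski(x, y, 2))   # Euclidean: ~3.606
print(max_distance(x, y))   # 3.0
```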
Distance (nominal or categorical attributes)
Let $x = [x_1, x_2, \ldots, x_n]$ and $y = [y_1, y_2, \ldots, y_n]$ be two vectors, where each $x_i$ is a nominal attribute.
$$d(x, y) = \sum_{i=1}^{n} \delta(x_i, y_i),$$
where
$$\delta(x_i, y_i) = \begin{cases} 0 & x_i = y_i \\ 1 & x_i \neq y_i \end{cases}$$
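A minimal sketch of this overlap count for nominal attributes; the attribute values below are made up:

```python
def nominal_distance(x, y):
    """Number of attributes on which the two tuples disagree (0/1 per attribute)."""
    return sum(0 if xi == yi else 1 for xi, yi in zip(x, y))

print(nominal_distance(["red", "small", "round"],
                       ["red", "large", "round"]))   # 1
```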
Normalization
To improve the performance of the k-NN algorithm, a commonly used technique is to normalize the data from the training set.
As a result, all dimensions over which the distance is calculated have the same level of significance.
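The slides do not fix a particular normalization; one common choice is min-max scaling of every attribute to [0, 1], as in this sketch (array names are illustrative):

```python
import numpy as np

def min_max_normalize(X):
    """Rescale each column (attribute) of X to [0, 1] using the training-set min/max."""
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    span = np.where(x_max > x_min, x_max - x_min, 1.0)  # avoid division by zero
    return (X - x_min) / span, x_min, span

X_train = np.array([[1.0, 100.0],
                    [2.0, 300.0],
                    [3.0, 200.0]])
X_norm, x_min, span = min_max_normalize(X_train)
# A test tuple must be rescaled with the same training-set statistics:
x_test = (np.array([2.5, 150.0]) - x_min) / span
```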
k-NN Classifiers
Classification
The unknown tuple is assigned the most common class among its k nearest neighbors
When k = 1 the unknown tuple is assigned the class of the training tuple that is closest to it
The 1-NN scheme has a misclassification probability that is no worse than twice that of the situation where we know the precise probability density of each class (the Bayes error rate)
Prediction
Nearest neighbor classifiers can also be used for prediction
Return a real-valued prediction for a given unknown tuple
The classifier returns the average value of the real-valued labels associated with the k nearest neighbors of the unknown tuple
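A minimal sketch of both uses, classification by majority vote and prediction by averaging, assuming Euclidean distance; the helper name `knn_predict` and the toy data are illustrative:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3, regression=False):
    """Classify (majority vote) or predict (average) from the k nearest neighbors."""
    dists = np.sqrt(np.sum((X_train - x_query) ** 2, axis=1))   # Euclidean distances
    nearest = np.argsort(dists)[:k]                              # indices of the k closest tuples
    if regression:
        return np.mean(y_train[nearest])                         # average of real-valued labels
    return Counter(y_train[nearest]).most_common(1)[0][0]        # most common class

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_class = np.array(["A", "A", "B", "B"])
print(knn_predict(X_train, y_class, np.array([0.2, 0.1]), k=3))  # -> "A"
```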
How to Determine the Value of k
Larger k may lead to better performance
But if we set k too large we may end up looking at samples that are not neighbors (are far away from the query)
We can use a test set (validation) to find the best k
Rule of thumb: k < sqrt(n), where n is the number of training examples
We can use validation to find k
Start with k = 1; use a test set to estimate the error rate of the classifier
Increment k and estimate the error rate for the new k
Choose the k value that gives the minimum error rate
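A sketch of this validation loop, reusing the `knn_predict` helper from the earlier sketch; the helper name and the hold-out split are assumptions for illustration:

```python
def choose_k(X_train, y_train, X_val, y_val, k_values):
    """Return the k with the lowest error rate on a held-out validation set."""
    best_k, best_error = None, float("inf")
    for k in k_values:
        errors = sum(knn_predict(X_train, y_train, x, k=k) != y
                     for x, y in zip(X_val, y_val))
        error_rate = errors / len(y_val)
        if error_rate < best_error:
            best_k, best_error = k, error_rate
    return best_k

# Rule of thumb: only try k up to about sqrt(n)
# k_values = range(1, int(len(X_train) ** 0.5) + 1)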
Shortcomings of k-NN Algorithms
First: no time is required to estimate parameters from training data, but the time to find the nearest neighbors can be prohibitive
Some ideas to overcome this problem:
Reduce the time taken to compute distances by working in a reduced dimension (use principal component analysis, PCA)
Use a sophisticated data structure, such as a tree, to speed up the identification of the nearest neighbors (see the sketch after this list)
Edit the training data to remove redundant observations
e.g., remove observations in the training data that have no effect on the classification because they are surrounded by observations that all belong to the same class
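As one example of a tree-based speed-up, a sketch using a k-d tree from SciPy (`scipy.spatial.cKDTree`); the data here are random and purely illustrative:

```python
import numpy as np
from scipy.spatial import cKDTree  # k-d tree for fast nearest-neighbor queries

rng = np.random.default_rng(0)
X_train = rng.random((10_000, 5))        # 10 000 training tuples, 5 attributes
tree = cKDTree(X_train)                  # built once, before any test tuple arrives

x_query = rng.random(5)
dists, idx = tree.query(x_query, k=3)    # distances and indices of the 3 nearest neighbors
```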
Shortcomings of k-NN Algorithms
Second: “the Curse of Dimensionality”
Let p be the number of dimensions
The expected distance to the nearest neighbor goes up dramatically with p unless the size of the training data set increases exponentially with p
Some ideas to overcome this problem
Reduce the dimensionality of the space of attributes
Select subsets of the predictor variables, or combine them using methods such as principal components, singular value decomposition, and factor analysis (see the sketch below)
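A sketch of the PCA idea using scikit-learn (`sklearn.decomposition.PCA`); the dimensions and data are made up for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train = rng.random((500, 50))          # 500 tuples with 50 attributes

pca = PCA(n_components=10)               # keep the 10 leading principal components
X_reduced = pca.fit_transform(X_train)   # k-NN distances are now computed in 10 dimensions

# A test tuple must be projected with the same transformation:
x_test_reduced = pca.transform(rng.random((1, 50)))
```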
k-NN Classifiers – Summary
Advantages
Can be applied to data from any distribution; for example, the data does not have to be separable by a linear boundary
Very simple and intuitive
Good classification if the number of samples is large enough
Disadvantages
Choosing k may be tricky
Test stage is computationally expensive
No training stage; all the work is done during the test stage
This is actually the opposite of what we want: usually we can afford a long training step, but we want a fast test step
Case-Based Reasoning (CBR)
CBR: Uses a database of problem solutions to solve new problems
Stores symbolic descriptions (tuples or cases)
Applications: customer service, legal rulings
Methodology
Instances are represented by rich symbolic descriptions (e.g., function graphs)
Search for similar cases; multiple retrieved cases may be combined
Tight coupling between case retrieval, knowledge-based reasoning, and problem solving
Challenges
Find a good similarity metric
Indexing is based on a syntactic similarity measure; when this fails, the system backtracks and adapts to additional cases