Data mining
Piotr Paszek
Cluster Analysis
Cluster Analysis (Clustering)
Cluster: A collection of data objects
similar (or related) to one another within the same group
dissimilar (or unrelated) to the objects in other groups
Cluster analysis (or clustering, data segmentation, . . . )
Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
Unsupervised learning: no predefined classes
Typical applications
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms
Clustering as a Preprocessing Tool (Utility)
Summarization:
Preprocessing for regression, principal component analysis, classification, and association analysis
Compression:
Image processing: vector quantization
Finding k-nearest Neighbors:
Localizing search to one or a small number of clusters
Outlier detection:
Outliers are often viewed as those far away from any cluster
Quality: What Is Good Clustering?
A good clustering method will produce high-quality clusters:
high intra-class similarity: cohesive within clusters
low inter-class similarity: distinctive between clusters
The quality of a clustering method depends on
the similarity measure used by the method,
its implementation, and
its ability to discover some or all of the hidden patterns
Measure the Quality of Clustering
Dissimilarity/Similarity metric
Similarity is expressed in terms of a distance function, typically metric: d(i, j)
The definitions of distance functions are usually rather different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables
Weights should be associated with different variables based on applications and data semantics
Quality of clustering:
There is usually a separate quality function that measures the goodness of a cluster.
It is hard to define “similar enough” or “good enough”
The answer is typically highly subjective
Considerations for Cluster Analysis
Partitioning criteria
Single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable)
Separation of clusters
Exclusive (e.g., one customer belongs to only one region) vs.
non-exclusive (e.g., one document may belong to more than one class)
Similarity measure
Distance-based (e.g., Euclidean, road network, vector) vs.
connectivity-based (e.g., density or contiguity)
Clustering space
Full space (often when low dimensional) vs. subspaces (often in high-dimensional clustering)
Cluster analysis – Requirements
Scalability
Clustering all the data instead of only on samples
Ability to deal with different types of attributes
Numerical, binary, categorical, ordinal, linked, and mixture of these
Constraint-based clustering
User may give inputs on constraints
Use domain knowledge to determine input parameters
Interpretability and usability
Others
Discovery of clusters with arbitrary shape
Ability to deal with noisy data
Incremental clustering and insensitivity to input order
Basic Clustering Methods I
Partitioning methods
Find mutually exclusive clusters of spherical shape
Distance-based
May use mean or medoid (etc.) to represent cluster center
Effective for small to medium-size data sets
Typical methods: k-means, k-medoids, CLARANS
Hierarchical methods
Clustering is a hierarchical decomposition (i.e., multiple levels)
Cannot correct erroneous merges or splits
May incorporate other techniques like microclustering or consider object “linkages”
Typical methods: Diana, Agnes, BIRCH, Chameleon
Basic Clustering Methods II
Density-based methods
Can find arbitrarily shaped clusters
Clusters are dense regions of objects in space that are separated by low-density regions
Cluster density: Each point must have a minimum number of points within its “neighborhood”
May filter out outliers
Typical methods: DBSCAN, OPTICS, DENCLUE
Grid-based methods
Use a multiresolution grid data structure
Fast processing time (typically independent of the number of data objects, yet dependent on grid size)
Typical methods: STING, WaveCluster, CLIQUE
Partitioning Algorithms: Basic Concept
Suppose a data set D contains n objects in Euclidean space.
Partitioning methods distribute the objects in D into k clusters, $C = \{C_1, \ldots, C_k\}$, that is, $C_i \subset D$ and $C_i \cap C_j = \emptyset$ for $1 \le i, j \le k$, $i \ne j$.
The clusters are formed to optimize an objective partitioning criterion, such as a dissimilarity function based ondistance, so that the objects within a cluster are “similar” to one another and “dissimilar” to objects in other clusters in terms of the data set attributes.
Distance (numeric attributes)
Let $x = [x_1, x_2, \ldots, x_n]$ and $y = [y_1, y_2, \ldots, y_n]$ be two points in n-dimensional Euclidean space.
Euclidean distance
$$d_e(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
Manhattan distance
$$d_m(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$
Distance (numeric attributes)
Minkowski distance
$$d_{m,q}(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^q \right)^{1/q}$$
where q is a positive natural number
Max distance
$$d_{\infty}(x, y) = \max_{i=1}^{n} |x_i - y_i|$$
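For illustration, a minimal Python sketch of the numeric distance functions defined above (the function names `euclidean`, `manhattan`, `minkowski`, and `max_distance` are illustrative, not from the lecture):

```python
import math

def euclidean(x, y):
    # d_e(x, y) = sqrt(sum_i (x_i - y_i)^2)
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def manhattan(x, y):
    # d_m(x, y) = sum_i |x_i - y_i|
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def minkowski(x, y, q):
    # d_{m,q}(x, y) = (sum_i |x_i - y_i|^q)^(1/q), q a positive natural number
    return sum(abs(xi - yi) ** q for xi, yi in zip(x, y)) ** (1.0 / q)

def max_distance(x, y):
    # d_inf(x, y) = max_i |x_i - y_i|
    return max(abs(xi - yi) for xi, yi in zip(x, y))

x, y = [1.0, 2.0, 3.0], [4.0, 0.0, 3.0]
print(euclidean(x, y), manhattan(x, y), minkowski(x, y, 3), max_distance(x, y))
```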
Distance (nominal or categorical attributes)
Let $x = [x_1, x_2, \ldots, x_n]$ and $y = [y_1, y_2, \ldots, y_n]$ be two vectors ($x_i$ is a nominal attribute)
$$d(x, y) = \sum_{i=1}^{n} \delta(x_i, y_i)$$
where
$$\delta(x_i, y_i) = \begin{cases} 0 & x_i = y_i \\ 1 & x_i \ne y_i \end{cases}$$
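A corresponding sketch for this simple-matching distance on nominal attributes (function name illustrative):

```python
def nominal_distance(x, y):
    # d(x, y) = number of positions where x_i != y_i
    return sum(0 if xi == yi else 1 for xi, yi in zip(x, y))

print(nominal_distance(["red", "small", "round"], ["red", "large", "round"]))  # 1
```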
Measure the Quality of Clustering C
Let
$C = \{C_1, \ldots, C_k\}$, $\bigcup_{i=1}^{k} C_i = D$, $C_i \cap C_j = \emptyset$
Within-cluster variation function
$$E(C) = \sum_{i=1}^{k} \sum_{p \in C_i} d^2(p, c_i)$$
where
$p \in D$,
$c_i$ – the centroid or medoid of cluster $C_i$
$d$ – a distance function (e.g. $d_e$, $d_m$)
Partitioning Algorithms: Basic Concept
Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
Global optimal: exhaustively enumerate all partitions
Heuristic methods:
k-means (MacQueen '67):
Each cluster is represented by the center of the cluster
k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw '87):
Each cluster is represented by one of the objects in the cluster
k-means Algorithm
Given k, the k-means algorithm is implemented in four steps:
1 Partition objects into k nonempty subsets
2 Compute seed points as the centroids of the clusters of the current partitioning (the centroid is the center, i.e., mean point, of the cluster)
3 Assign each object to the cluster with the nearest seed point
4 Go back to Step 2, stop when the assignment does not change
k-means Algorithm
Input: k – the number of clusters, D – a data set
Output: $C = \{C_1, \ldots, C_k\}$ – a set of k clusters
(1) arbitrarily choose k objects from D as the initial cluster centers;
(2) repeat
(3) (re)assign each object to the cluster to which the object is the most similar, based on the mean value of the objects in the cluster;
(4) update the cluster means, that is, calculate the mean value of the objects for each cluster;
(5) until no change;
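To make the pseudocode concrete, here is a minimal k-means sketch in Python, assuming Euclidean distance and data given as lists of coordinates (all names are illustrative):

```python
import math
import random

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def kmeans(D, k, max_iter=100):
    # (1) arbitrarily choose k objects from D as the initial cluster centers
    centers = random.sample(D, k)
    for _ in range(max_iter):
        # (3) (re)assign each object to the cluster with the nearest center
        clusters = [[] for _ in range(k)]
        for p in D:
            j = min(range(k), key=lambda i: euclidean(p, centers[i]))
            clusters[j].append(p)
        # (4) update the cluster means
        new_centers = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centers.append([sum(c) / len(cluster) for c in zip(*cluster)])
            else:
                new_centers.append(centers[i])  # keep old center for an empty cluster
        # (5) until no change
        if new_centers == centers:
            break
        centers = new_centers
    return clusters, centers

D = [[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 8.5], [1.0, 0.5]]
clusters, centers = kmeans(D, k=2)
print(centers)
```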
Measure the Quality of Clustering C
Let $C = \{C_1, \ldots, C_k\}$, $\bigcup_{i=1}^{k} C_i = D$, $C_i \cap C_j = \emptyset$
Within-cluster variation function
$$E(C) = \sum_{i=1}^{k} \sum_{p \in C_i} d^2(p, c_i)$$
where
$p \in D$,
$c_i$ – the centroid of cluster $C_i$:
$$c_i = \frac{1}{|C_i|} \sum_{p \in C_i} p$$
$d$ – a distance function (e.g. $d_e$, $d_m$)
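A small sketch computing the centroid and the within-cluster variation E(C) with squared Euclidean distance, for clusters given as lists of points (helper names are illustrative):

```python
def centroid(cluster):
    # c_i = (1/|C_i|) * sum of the points in C_i, computed per coordinate
    return [sum(coords) / len(cluster) for coords in zip(*cluster)]

def within_cluster_variation(clusters):
    # E(C) = sum_i sum_{p in C_i} d^2(p, c_i), with d the Euclidean distance
    total = 0.0
    for cluster in clusters:
        c = centroid(cluster)
        total += sum(sum((pi - ci) ** 2 for pi, ci in zip(p, c)) for p in cluster)
    return total

clusters = [[[1.0, 1.0], [1.5, 2.0]], [[8.0, 8.0], [9.0, 8.5]]]
print(within_cluster_variation(clusters))
```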
k-means Algorithm
Strength:
Efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations
Comment: Often terminates at a local optimum
Weakness
Applicable only to objects in a continuous n-dimensional space
Using the k-modes method for categorical data
In comparison, k-medoids can be applied to a wide range of data
Need to specify k, the number of clusters
Sensitive to noisy data and outliers
k-medoid Algorithm
The k-means algorithm is sensitive to outliers!
Since an object with an extremely large value may substantially distort the distribution of the data
k-medoids:
Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster
k-medoid Algorithm
k-Medoids Clustering: Find representative objects (medoids) in clusters
Partitioning Around Medoids (Kaufmann & Rousseeuw, 1987)
Starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering
PAM works effectively for small data sets, but does not scale well for large data sets (due to the computational complexity)
Efficiency improvement on PAM
CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples
CLARANS (Ng & Han, 1994): Randomized re-sampling
k-medoid Algorithm
Input: k – the number of clusters, D – a data set
Output: $C = \{C_1, \ldots, C_k\}$ – a set of k clusters
(1) arbitrarily choose k objects in D as the initial representative objects or seeds;
(2) repeat
(3) assign each remaining object to the cluster with the nearest representative object;
(4) randomly select a nonrepresentative object, o_random;
(5) compute the total cost S of swapping representative object o_j with o_random;
(6) if S < 0 then swap o_j with o_random to form the new set of k representative objects;
(7) until no change;
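A minimal sketch of the PAM idea from the pseudocode above: try swapping a representative object with a non-representative one and keep the swap if the total cost decreases. Euclidean distance and an exhaustive swap search are assumptions for illustration, not an optimized implementation:

```python
import math
import random

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def total_cost(D, medoids):
    # Sum of distances from each object to its nearest representative object
    return sum(min(euclidean(p, m) for m in medoids) for p in D)

def pam(D, k, max_iter=100):
    medoids = random.sample(D, k)            # (1) initial representative objects
    for _ in range(max_iter):                # (2) repeat
        best_swap, best_delta = None, 0.0
        current = total_cost(D, medoids)
        for j, oj in enumerate(medoids):
            for o_rand in D:                 # (4) candidate non-representative objects
                if o_rand in medoids:
                    continue
                candidate = medoids[:j] + [o_rand] + medoids[j + 1:]
                delta = total_cost(D, candidate) - current   # (5) cost S of the swap
                if delta < best_delta:       # (6) keep the swap only if S < 0
                    best_swap, best_delta = (j, o_rand), delta
        if best_swap is None:                # (7) until no change
            break
        j, o_rand = best_swap
        medoids[j] = o_rand
    return medoids

D = [[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 8.5], [25.0, 80.0]]
print(pam(D, k=2))
```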
PAM Algorithm
Strength:
PAM is more robust than k-means in the presence of noise and outliers because a medoid is less influenced by outliers or other extreme values than a mean
PAM (k-medoids) can be applied to a wide range of data
Weakness
PAM works efficiently for small data sets but does not scale well for large data sets
Computational complexity: O(tk(n − k)²), where n is # objects, k is # clusters, and t is # iterations
To deal with larger data sets, a sampling-based method called CLARA (Clustering LARge Applications) can be used
Hierarchical Methods
A hierarchical clustering method works by grouping data objects into a hierarchy or tree of clusters.
This method can be either agglomerative or divisive, depending on whether the hierarchical decomposition is formed in a bottom-up (merging) or top-down (splitting) fashion.
Representing data objects in the form of a hierarchy is useful for data summarization and visualization.
Use distance matrix as clustering criteria.
This method does not require the number of clusters k as an input, but needs a termination condition
Agglomerative Hierarchical Clustering Method
An agglomerative hierarchical clustering method uses a bottom-up strategy
It typically starts by letting each object form its own cluster and iteratively merges clusters into larger and larger clusters, until all the objects are in a single cluster or certain termination
conditions are satisfied
The single cluster becomes the hierarchy's root
For the merging step, it finds the two clusters that are closest to each other (according to some similarity measure), and combines the two to form one cluster
Because two clusters are merged per iteration, where each cluster contains at least one object, an agglomerative method requires at most n iterations
Divisive Hierarchical Clustering Method
A divisive hierarchical clustering method employs a top-down strategy
It starts by placing all objects in one cluster, which is the hierarchy's root
It then divides the root cluster into several smaller subclusters, and recursively partitions those clusters into smaller ones
The partitioning process continues until each cluster at the lowest level is coherent enough – either containing only one object, or the objects within a cluster are sufficiently similar to each other
In either agglomerative or divisive hierarchical clustering, a user can specify the desired number of clusters as a termination condition
Dendrogram: Shows How Clusters are Merged
Decompose data objects into several levels of nested partitioning (tree of clusters), called a dendrogram.
A clustering of the data objects is obtained by cutting the dendrogram at the desired level: each connected component then forms a cluster.
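For illustration, an agglomerative clustering and a dendrogram cut can be obtained with SciPy (assuming SciPy is available; the data and parameters here are made up):

```python
from scipy.cluster.hierarchy import linkage, fcluster

X = [[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [8.1, 8.2], [4.0, 4.5]]

Z = linkage(X, method='single')                   # agglomerative clustering, single link
labels = fcluster(Z, t=2, criterion='maxclust')   # "cut" the dendrogram into 2 clusters
print(labels)
```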
Distance between Clusters
Whether using a hierarchical clustering method (agglomerative or divisive), a core need is to measure the distance between two clusters, where each cluster is generally a set of objects
Linkage Measures
Single link (minimum distance):
smallest distance between an element in one cluster and an element in the other
$$dist_{min}(C_i, C_j) = \min_{p \in C_i, q \in C_j} \{dist(p, q)\}$$
Complete link (maximum distance):
largest distance between an element in one cluster and an element in the other
$$dist_{max}(C_i, C_j) = \max_{p \in C_i, q \in C_j} \{dist(p, q)\}$$
Distance between Clusters
Linkage Measures
Centroid (mean distance):
distance between the centroids of two clusters
$$dist_{mean}(C_i, C_j) = dist(m_i, m_j)$$
where $m_i$ is the mean for cluster $C_i$
Medoid (medoid distance):
distance between the medoids of two clusters
$$dist_{med}(C_i, C_j) = dist(M_i, M_j)$$
where $M_i$ is the medoid (centrally located object in the cluster) for cluster $C_i$
Average distance:
average distance between an element in one cluster and an element in the other
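A short sketch of the single-link, complete-link, and average-distance measures between two clusters, assuming Euclidean distance between objects (names illustrative):

```python
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_link(Ci, Cj):
    # smallest distance between an element of Ci and an element of Cj
    return min(dist(p, q) for p in Ci for q in Cj)

def complete_link(Ci, Cj):
    # largest distance between an element of Ci and an element of Cj
    return max(dist(p, q) for p in Ci for q in Cj)

def average_link(Ci, Cj):
    # average distance between an element of Ci and an element of Cj
    return sum(dist(p, q) for p in Ci for q in Cj) / (len(Ci) * len(Cj))

Ci = [[0.0, 0.0], [1.0, 0.0]]
Cj = [[4.0, 0.0], [5.0, 1.0]]
print(single_link(Ci, Cj), complete_link(Ci, Cj), average_link(Ci, Cj))
```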
Density-Based Methods - Why?
Partitioning and hierarchical methods are designed to find spherical-shaped clusters.
They have difficulty finding clusters of arbitrary shape, such as S-shaped and oval clusters.
To find clusters of arbitrary shape we can model clusters as dense regions in the data space, separated by sparse regions.
This is the main strategy behind density-based clustering methods
Density-Based Clustering Methods
Clustering based on density (local cluster criterion), such as density-connected points
Major features:
Discover clusters of arbitrary shape
Handle noise
One scan
Need density parameters as termination condition
Several interesting studies:
DBSCAN: Ester, et al. (KDD'96)
OPTICS: Ankerst, et al. (SIGMOD'99)
DENCLUE: Hinneburg & D. Keim (KDD'98)
CLIQUE: Agrawal, et al. (SIGMOD'98) (more grid-based)
Density-Based Clustering: Basic Concepts
Two parameters:
Eps: Maximum radius of the neighbourhood
MinPts: Minimum number of points in an eps-neighbourhood of that point
$$N_{Eps}(p) := \{q \in D \mid dist(p, q) \le Eps\}$$
(Eps-neighbourhood of point p)
Directly density-reachable:
A point p is directly density-reachable from a point q with respect to Eps, MinPts if
$p \in N_{Eps}(q)$
$|N_{Eps}(q)| \ge MinPts$ (core point condition)
Density-Reachable and Density-Connected
Density-reachable
A point p is density-reachable from a point q with respect to Eps, MinPts if there is a chain of points $p_1, \ldots, p_n$, $p_1 = q$, $p_n = p$, such that $p_{i+1}$ is directly density-reachable from $p_i$
Density-Reachable and Density-Connected
Density-connected
A point p is density-connected to a point q with respect to Eps, MinPts if there is a point o such that both p and q are density-reachable from o with respect to Eps, MinPts
DBSCAN: The Algorithm
Arbitrarily select a point p
Retrieve all points density-reachable from p with respect to Eps, MinPts
If p is a core point, a cluster is formed
If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database
Continue the process until all of the points have been processed
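A compact DBSCAN sketch following the steps above, with Eps and MinPts as defined earlier (label conventions and variable names are illustrative):

```python
import math

def dbscan(D, eps, min_pts):
    def neighbors(i):
        # Eps-neighbourhood N_Eps(p): all points within distance eps of D[i]
        return [j for j in range(len(D)) if math.dist(D[i], D[j]) <= eps]

    labels = [None] * len(D)        # None = unvisited, -1 = noise, >= 0 = cluster id
    cluster_id = 0
    for i in range(len(D)):
        if labels[i] is not None:
            continue
        N = neighbors(i)
        if len(N) < min_pts:        # not a core point
            labels[i] = -1          # tentatively mark as noise
            continue
        labels[i] = cluster_id      # core point: start a new cluster
        seeds = [j for j in N if j != i]
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster_id   # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            Nj = neighbors(j)
            if len(Nj) >= min_pts:       # j is itself a core point: expand further
                seeds.extend(Nj)
        cluster_id += 1
    return labels

D = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.1, 8.2), (25.0, 25.0)]
print(dbscan(D, eps=0.5, min_pts=2))
```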
Grid-Based Clustering Method
The clustering methods discussed so far are data-driven – they partition the set of objects and adapt to the distribution of the objects in the embedding space.
Alternatively, a grid-based clustering method takes a
space-driven approach by partitioning the embedding space into cells independent of the distribution of the input objects.
The grid-based clustering approach uses a multiresolution grid data structure. It quantizes the object space into a finite number of cells that form a grid structure on which all of the operations for clustering are performed.
The main advantage of the approach is its fast processing time, which is typically independent of the number of data objects, yet dependent only on the number of cells in each dimension in the quantized space.
Grid-Based Clustering Method
Using a multi-resolution grid data structure
Several interesting methods:
STING (a STatistical INformation Grid approach) by Wang, Yang and Muntz (1997)
WaveCluster by Sheikholeslami, Chatterjee, and Zhang (VLDB'98)
A multi-resolution clustering approach using the wavelet method
CLIQUE: Agrawal, et al. (SIGMOD'98)
Both grid-based and subspace clustering
Determine the Number of Clusters
Empirical method
# of clusters $\approx \sqrt{n/2}$ for a dataset of n points
Elbow method
Use the turning point in the curve of the sum of within-cluster variance w.r.t. the # of clusters
Cross-validation method
Divide a given data set into m parts
Use m − 1 parts to obtain a clustering model
Use the remaining part to test the quality of the clustering
For any k > 0, repeat it m times, compare the overall quality measure w.r.t. different k's, and find the # of clusters that fits the data the best
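A sketch of the elbow method using scikit-learn's KMeans, whose inertia_ attribute is the within-cluster sum of squared distances E(C); one looks for the k where the decrease flattens (the data values here are made up):

```python
# Requires scikit-learn.
from sklearn.cluster import KMeans

X = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
     [8.0, 8.0], [8.1, 8.2], [7.9, 7.8],
     [4.0, 4.0], [4.2, 3.9]]

for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)   # look for the k where the curve bends (the "elbow")
```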
Measuring Clustering Quality
Two methods: extrinsic vs. intrinsic
Extrinsic: supervised, i.e., the ground truth is available
Compare a clustering against the ground truth using a certain clustering quality measure
Ex. BCubed precision and recall metrics
Intrinsic: unsupervised, i.e., the ground truth is unavailable
Evaluate the goodness of a clustering by considering how well the clusters are separated, and how compact the clusters are
Ex. Silhouette coefficient
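As an intrinsic example, a sketch computing the silhouette coefficient with scikit-learn for labels produced by k-means (the data values here are made up):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
     [8.0, 8.0], [8.1, 8.2], [7.9, 7.8]]

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# Silhouette ranges from -1 to 1; values near 1 mean compact, well-separated clusters.
print(silhouette_score(X, labels))
```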
Summary
Cluster analysis groups objects based on their similarity and has wide applications
Measure of similarity can be computed for various types of data
Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods
K-means and K-medoids algorithms are popular partitioning-based clustering algorithms
Birch and Chameleon are interesting hierarchical clustering algorithms, and there are also probabilistic hierarchical clustering algorithms
DBSCAN, OPTICS, and DENCLUE are interesting density-based algorithms
STING and CLIQUE are grid-based methods, where CLIQUE is also a subspace clustering algorithm