Data mining
Piotr Paszek
Cluster Analysis
Cluster Analysis (Clustering)
Cluster: A collection of data objects
similar (or related) to one another within the same group
dissimilar (or unrelated) to the objects in other groups
Cluster analysis (or clustering, data segmentation, . . . )
Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
Unsupervised learning: no predefined classes
Typical applications
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms
Clustering as a Preprocessing Tool (Utility)
Summarization:
Preprocessing for regression, principal component analysis, classification, and association analysis
Compression:
Image processing: vector quantization
Finding k-nearest Neighbors:
Localizing search to one or a small number of clusters
Outlier detection:
Outliers are often viewed as those far away from any cluster
Quality: What Is Good Clustering?
A good clustering method will produce high-quality clusters:
high intra-class similarity: cohesive within clusters
low inter-class similarity: distinctive between clusters
The quality of a clustering method depends on
the similarity measure used by the method,
its implementation, and
its ability to discover some or all of the hidden patterns
Measure the Quality of Clustering
Dissimilarity/Similarity metric
Similarity is expressed in terms of a distance function, typically metric: d(i, j)
The definitions of distance functions are usually rather different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables
Weights should be associated with different variables based on applications and data semantics
Quality of clustering:
There is usually a separate quality function that measures the goodness of a cluster.
It is hard to define “similar enough” or “good enough”
The answer is typically highly subjective
Considerations for Cluster Analysis
Partitioning criteria
Single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable)
Separation of clusters
Exclusive (e.g., one customer belongs to only one region) vs.
non-exclusive (e.g., one document may belong to more than one class)
Similarity measure
Distance-based (e.g., Euclidean, road network, vector) vs.
connectivity-based (e.g., density or contiguity)
Clustering space
Full space (often when low dimensional) vs. subspaces (often in high-dimensional clustering)
Cluster analysis – Requirements
Scalability
Clustering all the data instead of only on samples
Ability to deal with different types of attributes
Numerical, binary, categorical, ordinal, linked, and mixture of these
Constraint-based clustering
User may give inputs on constraints
Use domain knowledge to determine input parameters
Interpretability and usability
Others
Discovery of clusters with arbitrary shape
Ability to deal with noisy data
Incremental clustering and insensitivity to input order
Basic Clustering Methods I
Partitioning methods
Find mutually exclusive clusters of spherical shape
Distance-based
May use mean or medoid (etc.) to represent cluster center
Effective for small to medium-size data sets
Typical methods: k-means, k-medoids, CLARANS
Hierarchical methods
Clustering is a hierarchical decomposition (i.e., multiple levels)
Cannot correct erroneous merges or splits
May incorporate other techniques like microclustering or consider object “linkages”
Typical methods: Diana, Agnes, BIRCH, Chameleon
Basic Clustering Methods II
Density-based methods
Can find arbitrarily shaped clusters
Clusters are dense regions of objects in space that are separated by low-density regions
Cluster density: Each point must have a minimum number of points within its “neighborhood”
May filter out outliers
Typical methods: DBSCAN, OPTICS, DENCLUE
Grid-based methods
Use a multiresolution grid data structure
Fast processing time (typically independent of the number of data objects, yet dependent on grid size)
Typical methods: STING, WaveCluster, CLIQUE
Partitioning Algorithms: Basic Concept
Suppose a data set D contains n objects in Euclidean space.
Partitioning methods distribute the objects in D into k clusters, $C = \{C_1, \ldots, C_k\}$, that is, $C_i \subset D$ and $C_i \cap C_j = \emptyset$ for $1 \le i, j \le k$, $i \ne j$.
The clusters are formed to optimize an objective partitioning criterion, such as a dissimilarity function based ondistance, so that the objects within a cluster are “similar” to one another and “dissimilar” to objects in other clusters in terms of the data set attributes.
Distance (numeric attributes)
Let $x = [x_1, x_2, \ldots, x_n]$ and $y = [y_1, y_2, \ldots, y_n]$ be two points in n-dimensional Euclidean space.
Euclidean distance
$$d_e(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
Manhattan distance
$$d_m(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$
Distance (numeric attributes)
Minkowski distance
$$d_{m,q}(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^q \right)^{1/q}$$
where q is a positive natural number
Max distance
$$d_{\infty}(x, y) = \max_{i=1}^{n} |x_i - y_i|$$
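For illustration, a minimal Python sketch of the numeric distance functions defined above (the function names `euclidean`, `manhattan`, `minkowski`, and `max_distance` are illustrative, not from the lecture):

```python
import math

def euclidean(x, y):
    # d_e(x, y) = sqrt(sum_i (x_i - y_i)^2)
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def manhattan(x, y):
    # d_m(x, y) = sum_i |x_i - y_i|
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def minkowski(x, y, q):
    # d_{m,q}(x, y) = (sum_i |x_i - y_i|^q)^(1/q), q a positive natural number
    return sum(abs(xi - yi) ** q for xi, yi in zip(x, y)) ** (1.0 / q)

def max_distance(x, y):
    # d_inf(x, y) = max_i |x_i - y_i|
    return max(abs(xi - yi) for xi, yi in zip(x, y))

x, y = [1.0, 2.0, 3.0], [4.0, 0.0, 3.0]
print(euclidean(x, y), manhattan(x, y), minkowski(x, y, 3), max_distance(x, y))
```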
Distance (nominal or categorical attributes)
Let $x = [x_1, x_2, \ldots, x_n]$ and $y = [y_1, y_2, \ldots, y_n]$ be two vectors ($x_i$ is a nominal attribute)
$$d(x, y) = \sum_{i=1}^{n} \delta(x_i, y_i)$$
where
$$\delta(x_i, y_i) = \begin{cases} 0 & x_i = y_i \\ 1 & x_i \ne y_i \end{cases}$$
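A corresponding sketch for this simple-matching distance on nominal attributes (function name illustrative):

```python
def nominal_distance(x, y):
    # d(x, y) = number of positions where x_i != y_i
    return sum(0 if xi == yi else 1 for xi, yi in zip(x, y))

print(nominal_distance(["red", "small", "round"], ["red", "large", "round"]))  # 1
```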
Measure the Quality of Clustering C
Let
$C = \{C_1, \ldots, C_k\}$, $\bigcup_{i=1}^{k} C_i = D$, $C_i \cap C_j = \emptyset$
Within-cluster variation function
$$E(C) = \sum_{i=1}^{k} \sum_{p \in C_i} d^2(p, c_i)$$
where
$p \in D$,
$c_i$ – the centroid or medoid of cluster $C_i$
$d$ – a distance function (e.g. $d_e$, $d_m$)
Partitioning Algorithms: Basic Concept
Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
Global optimal: exhaustively enumerate all partitions
Heuristic methods:
k-means (MacQueen '67):
Each cluster is represented by the center of the cluster
k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw '87):
Each cluster is represented by one of the objects in the cluster
k-means Algorithm
Given k, the k-means algorithm is implemented in four steps:
1 Partition objects into k nonempty subsets
2 Compute seed points as the centroids of the clusters of the current partitioning (the centroid is the center, i.e., mean point, of the cluster)
3 Assign each object to the cluster with the nearest seed point
4 Go back to Step 2, stop when the assignment does not change
k-means Algorithm
Input: k – the number of clusters, D – a data set
Output: $C = \{C_1, \ldots, C_k\}$ – a set of k clusters
(1) arbitrarily choose k objects from D as the initial cluster centers;
(2) repeat
(3) (re)assign each object to the cluster to which the object is the most similar, based on the mean value of the objects in the cluster;
(4) update the cluster means, that is, calculate the mean value of the objects for each cluster;
(5) until no change;
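To make the pseudocode concrete, here is a minimal k-means sketch in Python, assuming Euclidean distance and data given as lists of coordinates (all names are illustrative):

```python
import math
import random

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def kmeans(D, k, max_iter=100):
    # (1) arbitrarily choose k objects from D as the initial cluster centers
    centers = random.sample(D, k)
    for _ in range(max_iter):
        # (3) (re)assign each object to the cluster with the nearest center
        clusters = [[] for _ in range(k)]
        for p in D:
            j = min(range(k), key=lambda i: euclidean(p, centers[i]))
            clusters[j].append(p)
        # (4) update the cluster means
        new_centers = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centers.append([sum(c) / len(cluster) for c in zip(*cluster)])
            else:
                new_centers.append(centers[i])  # keep old center for an empty cluster
        # (5) until no change
        if new_centers == centers:
            break
        centers = new_centers
    return clusters, centers

D = [[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 8.5], [1.0, 0.5]]
clusters, centers = kmeans(D, k=2)
print(centers)
```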
Measure the Quality of Clustering C
Let $C = \{C_1, \ldots, C_k\}$, $\bigcup_{i=1}^{k} C_i = D$, $C_i \cap C_j = \emptyset$
Within-cluster variation function
$$E(C) = \sum_{i=1}^{k} \sum_{p \in C_i} d^2(p, c_i)$$
where
$p \in D$,
$c_i$ – the centroid of cluster $C_i$:
$$c_i = \frac{1}{|C_i|} \sum_{p \in C_i} p$$
$d$ – a distance function (e.g. $d_e$, $d_m$)
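A small sketch computing the centroid and the within-cluster variation E(C) with squared Euclidean distance, for clusters given as lists of points (helper names are illustrative):

```python
def centroid(cluster):
    # c_i = (1/|C_i|) * sum of the points in C_i, computed per coordinate
    return [sum(coords) / len(cluster) for coords in zip(*cluster)]

def within_cluster_variation(clusters):
    # E(C) = sum_i sum_{p in C_i} d^2(p, c_i), with d the Euclidean distance
    total = 0.0
    for cluster in clusters:
        c = centroid(cluster)
        total += sum(sum((pi - ci) ** 2 for pi, ci in zip(p, c)) for p in cluster)
    return total

clusters = [[[1.0, 1.0], [1.5, 2.0]], [[8.0, 8.0], [9.0, 8.5]]]
print(within_cluster_variation(clusters))
```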
k-means Algorithm
Strength:
Efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations
Comment: Often terminates at a local optimum
Weakness
Applicable only to objects in a continuous n-dimensional space
Using the k-modes method for categorical data
In comparison, k-medoids can be applied to a wide range of data
Need to specify k, the number of clusters
Sensitive to noisy data and outliers
k-medoid Algorithm
The k-means algorithm is sensitive to outliers!
Since an object with an extremely large value may substantially distort the distribution of the data
k-medoids:
Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster
k-medoid Algorithm
k-Medoids Clustering: Find representative objects (medoids) in clusters
Partitioning Around Medoids (Kaufmann & Rousseeuw, 1987)
Starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering
PAM works effectively for small data sets, but does not scale well for large data sets (due to the computational complexity)
Efficiency improvement on PAM
CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples
CLARANS (Ng & Han, 1994): Randomized re-sampling
k-medoid Algorithm
Input: k – the number of clusters, D – a data set
Output: $C = \{C_1, \ldots, C_k\}$ – a set of k clusters
(1) arbitrarily choose k objects in D as the initial representative objects or seeds;
(2) repeat
(3) assign each remaining object to the cluster with the nearest representative object;
(4) randomly select a nonrepresentative object, o_random;
(5) compute the total cost S of swapping representative object o_j with o_random;
(6) if S < 0 then swap o_j with o_random to form the new set of k representative objects;
(7) until no change;
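A minimal sketch of the PAM idea from the pseudocode above: try swapping a representative object with a non-representative one and keep the swap if the total cost decreases. Euclidean distance and an exhaustive swap search are assumptions for illustration, not an optimized implementation:

```python
import math
import random

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def total_cost(D, medoids):
    # Sum of distances from each object to its nearest representative object
    return sum(min(euclidean(p, m) for m in medoids) for p in D)

def pam(D, k, max_iter=100):
    medoids = random.sample(D, k)            # (1) initial representative objects
    for _ in range(max_iter):                # (2) repeat
        best_swap, best_delta = None, 0.0
        current = total_cost(D, medoids)
        for j, oj in enumerate(medoids):
            for o_rand in D:                 # (4) candidate non-representative objects
                if o_rand in medoids:
                    continue
                candidate = medoids[:j] + [o_rand] + medoids[j + 1:]
                delta = total_cost(D, candidate) - current   # (5) cost S of the swap
                if delta < best_delta:       # (6) keep the swap only if S < 0
                    best_swap, best_delta = (j, o_rand), delta
        if best_swap is None:                # (7) until no change
            break
        j, o_rand = best_swap
        medoids[j] = o_rand
    return medoids

D = [[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 8.5], [25.0, 80.0]]
print(pam(D, k=2))
```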
PAM Algorithm
Strength:
PAM is more robust than k-means in the presence of noise and outliers because a medoid is less influenced by outliers or other extreme values than a mean
PAM (k-medoids) can be applied to a wide range of data
Weakness
PAM works efficiently for small data sets but does not scale well for large data sets
Computational complexity: O(tk(n − k)²), where n is # objects, k is # clusters, and t is # iterations
To deal with larger data sets, a sampling-based method called CLARA (Clustering LARge Applications) can be used
Hierarchical Methods
A hierarchical clustering method works by grouping data objects into a hierarchy or tree of clusters.
This method can be either agglomerative or divisive, depending on whether the hierarchical decomposition is formed in a bottom-up (merging) or top-down (splitting) fashion.
Representing data objects in the form of a hierarchy is useful for data summarization and visualization.
Use distance matrix as clustering criteria.
This method does not require the number of clusters k as an input, but needs a termination condition
Agglomerative Hierarchical Clustering Method
An agglomerative hierarchical clustering method uses a bottom-up strategy
It typically starts by letting each object form its own cluster and iteratively merges clusters into larger and larger clusters, until all the objects are in a single cluster or certain termination
conditions are satisfied
The single cluster becomes the hierarchy's root
For the merging step, it finds the two clusters that are closest to each other (according to some similarity measure), and combines the two to form one cluster
Because two clusters are merged per iteration, where each cluster contains at least one object, an agglomerative method requires at most n iterations
Divisive Hierarchical Clustering Method
A divisive hierarchical clustering method employs a top-down strategy
It starts by placing all objects in one cluster, which is the hierarchy's root
It then divides the root cluster into several smaller subclusters, and recursively partitions those clusters into smaller ones
The partitioning process continues until each cluster at the lowest level is coherent enough – either containing only one object, or the objects within a cluster are sufficiently similar to each other
In either agglomerative or divisive hierarchical clustering, a user can specify the desired number of clusters as a termination condition
Dendrogram: Shows How Clusters are Merged
Decompose data objects into several levels of nested partitioning (tree of clusters), called a dendrogram.
A clustering of the data objects is obtained by cutting the dendrogram at the desired level: each connected component then forms a cluster.
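For illustration, an agglomerative clustering and a dendrogram cut can be obtained with SciPy (assuming SciPy is available; the data and parameters here are made up):

```python
from scipy.cluster.hierarchy import linkage, fcluster

X = [[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [8.1, 8.2], [4.0, 4.5]]

Z = linkage(X, method='single')                   # agglomerative clustering, single link
labels = fcluster(Z, t=2, criterion='maxclust')   # "cut" the dendrogram into 2 clusters
print(labels)
```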
Distance between Clusters
Whether using a hierarchical clustering method (agglomerative or divisive), a core need is to measure the distance between two clusters, where each cluster is generally a set of objects
Linkage Measures
Single link (minimum distance):
smallest distance between an element in one cluster and an element in the other
$$dist_{min}(C_i, C_j) = \min_{p \in C_i, q \in C_j} \{dist(p, q)\}$$
Complete link (maximum distance):
largest distance between an element in one cluster and an element in the other
$$dist_{max}(C_i, C_j) = \max_{p \in C_i, q \in C_j} \{dist(p, q)\}$$
Distance between Clusters
Linkage Measures
Centroid (mean distance):
distance between the centroids of two clusters
$$dist_{mean}(C_i, C_j) = dist(m_i, m_j)$$
where $m_i$ is the mean for cluster $C_i$
Medoid (medoid distance):
distance between the medoids of two clusters
$$dist_{med}(C_i, C_j) = dist(M_i, M_j)$$
where $M_i$ is the medoid (centrally located object in the cluster) for cluster $C_i$
Average distance:
average distance between an element in one cluster and an element in the other
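A short sketch of the single-link, complete-link, and average-distance measures between two clusters, assuming Euclidean distance between objects (names illustrative):

```python
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_link(Ci, Cj):
    # smallest distance between an element of Ci and an element of Cj
    return min(dist(p, q) for p in Ci for q in Cj)

def complete_link(Ci, Cj):
    # largest distance between an element of Ci and an element of Cj
    return max(dist(p, q) for p in Ci for q in Cj)

def average_link(Ci, Cj):
    # average distance between an element of Ci and an element of Cj
    return sum(dist(p, q) for p in Ci for q in Cj) / (len(Ci) * len(Cj))

Ci = [[0.0, 0.0], [1.0, 0.0]]
Cj = [[4.0, 0.0], [5.0, 1.0]]
print(single_link(Ci, Cj), complete_link(Ci, Cj), average_link(Ci, Cj))
```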
Density-Based Methods - Why?
Partitioning and hierarchical methods are designed to find spherical-shaped clusters.
They have difficulty finding clusters of arbitrary shape, such as S-shaped and oval clusters.
To find clusters of arbitrary shape we can model clusters as dense regions in the data space, separated by sparse regions.
This is the main strategy behind density-based clustering methods
Density-Based Clustering Methods
Clustering based on density (local cluster criterion), such as density-connected points
Major features:
Discover clusters of arbitrary shape
Handle noise
One scan
Need density parameters as termination condition
Several interesting studies:
DBSCAN: Ester, et al. (KDD'96)
OPTICS: Ankerst, et al. (SIGMOD'99)
DENCLUE: Hinneburg & D. Keim (KDD'98)
CLIQUE: Agrawal, et al. (SIGMOD'98) (more grid-based)
Density-Based Clustering: Basic Concepts
Two parameters:
Eps: Maximum radius of the neighbourhood
MinPts: Minimum number of points in an eps-neighbourhood of that point
$$N_{Eps}(p) := \{q \in D \mid dist(p, q) \le Eps\}$$
(Eps-neighbourhood of point p)
Directly density-reachable:
A point p is directly density-reachable from a point q with respect to Eps, MinPts if
$p \in N_{Eps}(q)$
$|N_{Eps}(q)| \ge MinPts$ (core point condition)
Density-Reachable and Density-Connected
Density-reachable
A point p is density-reachable from a point q with respect to Eps, MinPts if there is a chain of points $p_1, \ldots, p_n$, $p_1 = q$, $p_n = p$, such that $p_{i+1}$ is directly density-reachable from $p_i$
Density-Reachable and Density-Connected
Density-connected
A point p is density-connected to a point q with respect to Eps, MinPts if there is a point o such that both p and q are density-reachable from o with respect to Eps, MinPts
DBSCAN: The Algorithm
Arbitrarily select a point p
Retrieve all points density-reachable from p with respect to Eps, MinPts
If p is a core point, a cluster is formed
If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database
Continue the process until all of the points have been processed
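A compact DBSCAN sketch following the steps above, with Eps and MinPts as defined earlier (label conventions and variable names are illustrative):

```python
import math

def dbscan(D, eps, min_pts):
    def neighbors(i):
        # Eps-neighbourhood N_Eps(p): all points within distance eps of D[i]
        return [j for j in range(len(D)) if math.dist(D[i], D[j]) <= eps]

    labels = [None] * len(D)        # None = unvisited, -1 = noise, >= 0 = cluster id
    cluster_id = 0
    for i in range(len(D)):
        if labels[i] is not None:
            continue
        N = neighbors(i)
        if len(N) < min_pts:        # not a core point
            labels[i] = -1          # tentatively mark as noise
            continue
        labels[i] = cluster_id      # core point: start a new cluster
        seeds = [j for j in N if j != i]
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster_id   # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            Nj = neighbors(j)
            if len(Nj) >= min_pts:       # j is itself a core point: expand further
                seeds.extend(Nj)
        cluster_id += 1
    return labels

D = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.1, 8.2), (25.0, 25.0)]
print(dbscan(D, eps=0.5, min_pts=2))
```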
Grid-Based Clustering Method
The clustering methods discussed so far are data-driven – they partition the set of objects and adapt to the distribution of the objects in the embedding space.
Alternatively, a grid-based clustering method takes a
space-driven approach by partitioning the embedding space into cells independent of the distribution of the input objects.
The grid-based clustering approach uses a multiresolution grid data structure. It quantizes the object space into a finite number of cells that form a grid structure on which all of the operations for clustering are performed.
The main advantage of the approach is its fast processing time, which is typically independent of the number of data objects, yet dependent only on the number of cells in each dimension in the quantized space.
Grid-Based Clustering Method
Using a multi-resolution grid data structure
Several interesting methods:
STING (a STatistical INformation Grid approach) by Wang, Yang and Muntz (1997)
WaveCluster by Sheikholeslami, Chatterjee, and Zhang (VLDB'98)
A multi-resolution clustering approach using the wavelet method
CLIQUE: Agrawal, et al. (SIGMOD'98)
Both grid-based and subspace clustering
Determine the Number of Clusters
Empirical method
# of clusters $\approx \sqrt{n/2}$ for a dataset of n points
Elbow method
Use the turning point in the curve of the sum of within-cluster variance w.r.t. the # of clusters
Cross-validation method
Divide a given data set into m parts
Use m − 1 parts to obtain a clustering model
Use the remaining part to test the quality of the clustering
For any k > 0, repeat it m times, compare the overall quality measure w.r.t. different k's, and find the # of clusters that fits the data the best
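A sketch of the elbow method using scikit-learn's KMeans, whose inertia_ attribute is the within-cluster sum of squared distances E(C); one looks for the k where the decrease flattens (the data values here are made up):

```python
# Requires scikit-learn.
from sklearn.cluster import KMeans

X = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
     [8.0, 8.0], [8.1, 8.2], [7.9, 7.8],
     [4.0, 4.0], [4.2, 3.9]]

for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)   # look for the k where the curve bends (the "elbow")
```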
Measuring Clustering Quality
Two methods: extrinsic vs. intrinsic
Extrinsic: supervised, i.e., the ground truth is available
Compare a clustering against the ground truth using a certain clustering quality measure
Ex. BCubed precision and recall metrics
Intrinsic: unsupervised, i.e., the ground truth is unavailable
Evaluate the goodness of a clustering by considering how well the clusters are separated, and how compact the clusters are
Ex. Silhouette coefficient
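As an intrinsic example, a sketch computing the silhouette coefficient with scikit-learn for labels produced by k-means (the data values here are made up):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
     [8.0, 8.0], [8.1, 8.2], [7.9, 7.8]]

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# Silhouette ranges from -1 to 1; values near 1 mean compact, well-separated clusters.
print(silhouette_score(X, labels))
```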
Summary
Cluster analysis groups objects based on their similarity and has wide applications
Measure of similarity can be computed for various types of data
Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods
K-means and K-medoids algorithms are popular partitioning-based clustering algorithms
Birch and Chameleon are interesting hierarchical clustering algorithms, and there are also probabilistic hierarchical clustering algorithms
DBSCAN, OPTICS, and DENCLUE are interesting density-based algorithms
STING and CLIQUE are grid-based methods, where CLIQUE is also a subspace clustering algorithm