Marek Walesiak*

CLUSTER ANALYSIS WITH clusterSim COMPUTER PROGRAM AND R ENVIRONMENT

ABSTRACT. The article presents auxiliary functions of the clusterSim package (see Walesiak & Dudek (2006)) and selected functions of the stats, cluster, and ade4 packages, which are applied to solving clustering problems. In addition, examples of procedures for solving different clustering problems are presented. These procedures, which are not available in statistical packages (SPSS, Statistica, SAS), can help solve a broad range of classification problems.

Key words: cluster analysis, R, clusterSim, data analysis.
I. INTRODUCTION
In a typical cluster analysis study seven major steps are distinguished (see Milligan (1996), pp. 342-343): selection of objects and variables, decisions concerning variable normalization, selection of a distance measure, selection of a clustering method, determining the number of clusters, cluster validation, and describing and profiling clusters. The article presents functions of the clusterSim package and selected functions of the stats, cluster, and ade4 packages, which are applied to solving clustering problems.
II. THE PACKAGES AND FUNCTIONS OF THE R COMPUTER PROGRAM IN A TYPICAL CLUSTER ANALYSIS PROCEDURE
Table 1 contains selected packages and functions of the R program applied at each step of a typical cluster analysis study.
* Professor, Chair of Department of Econometrics and Computer Science, Wrocław University of Economics.
Table 1
The packages and functions of the R computer program in a typical cluster analysis study

No. | Step in a typical cluster analysis study    | Selected packages          | Functions
1   | Selection of objects and variables          | clusterSim                 | HINoV.Mod
2   | Decisions concerning variable normalization | clusterSim                 | data.Normalization
3   | Selection of a distance measure             | clusterSim, stats, ade4    | dist.BC, dist.GDM, dist.SM; dist; dist.binary
4   | Selection of clustering method              | cluster, stats, clusterSim | agnes, diana, pam; kmeans, hclust; initial.Centers
5   | Determining the number of clusters          | clusterSim                 | index.G1, index.G2, index.G3, index.S, index.KL, index.H, index.Gap
6   | Cluster validation                          | clusterSim                 | replication.Mod
7   | Describing and profiling clusters           | clusterSim                 | cluster.Description
Source: own presentation.
Step 1. Selection of objects and variables. Carmone, Kara, and Maxwell (1999) proposed the Heuristic Identification of Noisy Variables (HINoV) method, based on k-means cluster analysis run on each variable and the corrected Rand index computed for each resulting pair of partitions. The HINoV algorithm can identify noisy variables in a data set and yield better cluster recovery. As a result of this algorithm, we receive the contribution of each variable to the cluster structure. The clusterSim package contains an extended version of the HINoV method for nonmetric data:

HINoV.Mod(x, type="metric", s=2, u, distance=NULL, method="kmeans", Index="cRAND")

where:
x - data matrix;
type - "metric" or "nonmetric";
s - for metric data (1 - ratio; 2 - interval or mixed);
u - number of clusters (for metric data);
distance - NULL for kmeans and nonmetric data; for ratio data ("d1" - Manhattan, "d2" - Euclidean, "d3" - Chebyshev (maximum), "d4" - squared Euclidean, "d5" - GDM1, "d6" - Canberra, "d7" - Bray & Curtis); for interval and mixed data ("d1", "d2", "d3", "d4", "d5");
method - classification method: "kmeans" (default), "single", "complete", "average", "mcquitty", "median", "centroid", "ward", "pam" (NULL for nonmetric data);
Index - "cRAND" - corrected Rand index, "RAND" - Rand index.
Step 2. Decisions concerning variable normalization. Function

data.Normalization(x, type="n0")

calculates normalized data using one of the variable normalization formulas n0-n11 for data matrix x (n0 - without normalization, n1 - standardization, n2 - Weber standardization, n3 - unitization, n4 - unitization with zero minimum, n5 - normalization in the range [-1; 1], n6-n11 - quotient transformations with different bases); for details see Walesiak (2006).
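For metric data the most common choice is "n1" (standardization). Its effect can be reproduced with base R alone; a minimal sketch, assuming a small hypothetical data matrix (the values below are illustrative, not from the article):

```r
# Standardization ("n1"): subtract the column mean and divide by the column
# standard deviation; base R's scale() performs the same transformation as
# data.Normalization(x, type = "n1").
x <- matrix(c(1, 2, 3, 4,
              10, 20, 30, 40), ncol = 2)  # hypothetical data matrix
z <- scale(x, center = TRUE, scale = TRUE)
colMeans(z)        # each column now has mean 0
apply(z, 2, sd)    # and standard deviation 1
```

After this transformation every variable contributes to distance calculations on the same scale, which is why normalization precedes the choice of a distance measure.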
Step 3. Selection of a distance measure. The packages clusterSim, stats, and ade4 contain distance measures for metric and nonmetric data (see Table 2).
Table 2
Distance measures for metric and nonmetric data

Package: clusterSim
dist.GDM(x, method="GDM1") - calculates the Generalized Distance Measure for variables measured on a metric scale (GDM1) or an ordinal scale (GDM2)
dist.BC(x) - calculates the Bray-Curtis distance measure for ratio data
dist.SM(x) - calculates the Sokal-Michener distance measure for nominal variables

Package: stats
dist(x, method="euclidean", p=2)
x - data matrix or "dist" object
method - distance measure: "euclidean", "maximum", "manhattan", "canberra", "binary", "minkowski"
p - the power of the Minkowski distance

Package: ade4
dist.binary(df, method=NULL)
df - a data frame with positive or zero values; used with as.matrix(1*(df>0))
method - an integer between 1 and 10 (distance measure d = sqrt(1 - s)):
1 = Jaccard, 2 = Sokal & Michener, 3 = Sokal & Sneath (1), 4 = Rogers & Tanimoto, 5 = Czekanowski, 6 = Gower & Legendre (1), 7 = Ochiai, 8 = Sokal & Sneath (2), 9 = Phi of Pearson, 10 = Gower & Legendre (2)
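The stats::dist function from Table 2 can be tried on a two-row toy matrix, where the distances are easy to verify by hand (the matrix itself is hypothetical):

```r
# Distances between the rows of a small data matrix using stats::dist.
# The rows are the points (0, 0) and (3, 4).
x <- matrix(c(0, 0,
              3, 4), ncol = 2, byrow = TRUE)
d_euc <- dist(x, method = "euclidean")  # sqrt(3^2 + 4^2) = 5
d_man <- dist(x, method = "manhattan")  # |3| + |4| = 7
```

The object returned is of class "dist" (a lower-triangular distance structure), which is exactly the input form expected by hclust, agnes, diana, and pam.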
Step 4. Selection of clustering method. The most frequently applied clustering methods are available in the packages stats (hclust - hierarchical agglomerative methods; kmeans - k-means method) and cluster (pam - partitioning around medoids; agnes - hierarchical agglomerative methods; diana - hierarchical divisive method). Example syntax of the kmeans function for clustering data:

kmeans(x, centers, iter.max=10, nstart=1, algorithm=c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"))

where: x - data matrix; centers - either the number of clusters or a set of initial cluster centers; iter.max - the maximum number of iterations allowed; nstart - if centers is a number, how many random sets should be chosen; algorithm - applied algorithm.
Function initial.Centers(x, k) of the clusterSim package calculates initial cluster centers for the k-means algorithm (x - data matrix, k - number of initial cluster centers).
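Passing an explicit matrix of starting centers to kmeans (for instance rows selected by initial.Centers) makes the partition reproducible, because no random starts are drawn. A minimal sketch with hand-picked centers and simulated data (both hypothetical, not the article's data set):

```r
# k-means with explicit initial cluster centers: when `centers` is a matrix
# rather than a count, kmeans starts from exactly those points.
set.seed(1)
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),   # 10 objects near (0, 0)
           matrix(rnorm(20, mean = 5), ncol = 2))   # 10 objects near (5, 5)
centers <- rbind(c(0, 0),
                 c(5, 5))                           # hand-picked starting centers
cl <- kmeans(x, centers = centers)
table(cl$cluster)   # counts of objects per cluster
```

The same idea underlies the use of z[initial.Centers(z, u), ] in Example 2 below: deterministic starts make the index values comparable across u.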
Step 5. Determining the number of clusters. The clusterSim package contains seven cluster quality indices used to determine the number of clusters in a data set (Calinski & Harabasz, Baker & Hubert, Hubert & Levine, Silhouette, Krzanowski & Lai, Hartigan, gap). For example, function index.H(x, clall) calculates the Hartigan index for data matrix x and clall, two vectors of integers indicating the cluster to which each object is allocated in the partitions of n objects into u and u+1 clusters (for details and other indices see Walesiak (2007)).
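The Hartigan index compares the within-cluster sums of squares of the partitions into u and u+1 clusters: H(u) = (W(u)/W(u+1) - 1)(n - u - 1), where a small value of H(u) suggests that u clusters already suffice. A base-R sketch of this idea (a simplified stand-in, not the clusterSim implementation; the hartigan helper and the simulated data are hypothetical):

```r
# Sketch of the Hartigan index H(u) = (W(u)/W(u+1) - 1) * (n - u - 1),
# where W(u) is the total within-cluster sum of squares for u clusters.
hartigan <- function(x, u) {
  n  <- nrow(x)
  W1 <- kmeans(x, centers = u,     nstart = 10)$tot.withinss
  W2 <- kmeans(x, centers = u + 1, nstart = 10)$tot.withinss
  (W1 / W2 - 1) * (n - u - 1)
}
set.seed(2)
x <- rbind(matrix(rnorm(30, mean = 0), ncol = 2),
           matrix(rnorm(30, mean = 8), ncol = 2))  # two well-separated groups
hartigan(x, 2)   # comparatively low: two clusters explain the structure
```

This is the rule applied in Example 2, where the estimated number of clusters is the smallest u with H(u) below a threshold.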
Step 6. Cluster validation. In replication analysis (see Breckenridge (2000)) we compare the classification results of two random samples drawn from a data set. The level of agreement between the two partitions (mean corrected Rand index) reflects the stability of the clustering in the data. The clusterSim package contains the replication.Mod function:

replication.Mod(x, v="m", u=2, centrotypes="centroids", normalization=NULL, distance=NULL, method="kmeans", S=10, fixedAsample=NULL)

where: x - data matrix; v - type of data: metric ("r" - ratio, "i" - interval, "m" - mixed), nonmetric ("o" - ordinal, "n" - multistate nominal, "b" - binary); u - number of clusters; centrotypes - "centroids", "medoids"; normalization - normalization formula n1-n11 (see step 2); distance - NULL for "kmeans", distance measure (see step 3); method - classification method (see step 4); S - number of simulations; fixedAsample - if NULL, sample A is generated randomly, otherwise this parameter contains object numbers arbitrarily assigned to sample A.
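The agreement measure behind this step, the corrected (adjusted) Rand index, can be written in a few lines of base R; the sketch below is a minimal illustration of the formula, not the package's own code:

```r
# Corrected (adjusted) Rand index between two partitions cl1 and cl2:
# counts agreeing object pairs in a contingency table and adjusts the
# Rand index for agreement expected by chance.
cRand <- function(cl1, cl2) {
  tab    <- table(cl1, cl2)
  n      <- sum(tab)
  sum_ij <- sum(choose(tab, 2))           # pairs together in both partitions
  sum_i  <- sum(choose(rowSums(tab), 2))  # pairs together in partition 1
  sum_j  <- sum(choose(colSums(tab), 2))  # pairs together in partition 2
  exp_ij <- sum_i * sum_j / choose(n, 2)  # chance-expected agreement
  (sum_ij - exp_ij) / ((sum_i + sum_j) / 2 - exp_ij)
}
cRand(c(1, 1, 2, 2), c(1, 1, 2, 2))  # identical partitions give 1
cRand(c(1, 1, 2, 2), c(2, 2, 1, 1))  # relabelling the clusters: still 1
```

Values near 1 indicate that the two partitions agree almost perfectly, which is how the w$cRand output of Example 3 should be read.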
Step 7. Describing and profiling clusters. Function cluster.Description(x, cl) of the clusterSim package calculates descriptive statistics separately for each cluster and variable in classification cl: arithmetic mean and standard deviation, median and median absolute deviation, and mode.
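The core of this step, per-cluster summary statistics for every variable, can be sketched with base R's aggregate; the data frame and membership vector below are hypothetical, and only mean and standard deviation are shown:

```r
# Per-cluster descriptive statistics, analogous in spirit to
# cluster.Description: mean and sd of each variable within each cluster.
x  <- data.frame(v1 = c(1, 2, 3, 10, 11, 12),
                 v2 = c(5, 6, 7, 50, 60, 70))   # hypothetical variables
cl <- c(1, 1, 1, 2, 2, 2)                       # cluster membership vector
means <- aggregate(x, by = list(cluster = cl), FUN = mean)
sds   <- aggregate(x, by = list(cluster = cl), FUN = sd)
means   # cluster 1: v1 = 2, v2 = 6; cluster 2: v1 = 11, v2 = 60
```

Comparing these profiles across clusters is what turns a raw partition into an interpretable segmentation.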
III. THE EXAMPLE PROCEDURES WITH SELECTED FUNCTIONS OF R PACKAGES
The 75 observations were generated from the standard two-dimensional spherical normal distribution into five clusters of size 15 each, with means:

μ1 = [0, 0]', μ2 = [0, 10]', μ3 = [5, 5]', μ4 = [10, 0]', μ5 = [10, 10]'

and covariance matrices Σ1 = Σ2 = Σ3 = Σ4 = Σ5 = [1 0; 0 1]. In addition, three noisy variables were included in the model to obscure the underlying clustering structure to be recovered. The 75 observations for these variables were generated with means μ = [5, 5, 5]' and covariance matrix Σ = [5 2 6; 2 1 -5; 6 -5 2]. Finally, the data were standardized via formula "n1". To help isolate noisy variables, the HINoV.Mod procedure was applied (see Example 1).
Example 1
> library(cluster)
> library(clusterSim)
> x <- read.csv2("C:/Data_75x5.csv", header=TRUE, strip.white=TRUE, row.names=1)
> x <- as.matrix(x)
> z <- data.Normalization(x, type="n1")
> z <- as.data.frame(z)
> r1 <- HINoV.Mod(z, type="metric", s=2, 5, method="kmeans", Index="cRAND")
> options(OutDec=",")
> plot(r1$stopri[,2], type="p", pch=0, xlab="Number of variable", ylab="topri", xaxt="n")
> axis(1, at=c(1:max(r1$stopri[,1])), labels=r1$stopri[,1])
The result of this procedure is shown in Figure 1. Based on the scree diagram (Figure 1), three noisy variables (v_3, v_4, and v_5) were eliminated via the HINoV method.
Figure 1. Scree diagram
Source: own research.

In the procedure of Example 2 the following assumptions are taken into account:
- for clustering 75 objects in two-dimensional space (file data_75x2.csv) the k-means method was applied;
- the estimated number of clusters is the smallest u ∈ [2; 10] such that H(u) < 10;
- the write.table function allows saving the results to files: the values of the index H(u), a vector of integers indicating the cluster to which each object is allocated ("cluster"), the matrix of cluster centers ("centers"), the within-cluster sum of squares for each cluster ("withinss"), and the number of objects in each cluster ("size").
Example 2 (first six instructions from Example 1).
> min_u <- 2
> max_u <- 10
> min <- 0
> results <- matrix(0, max_u - min_u + 1, 2)
> results[,1] <- min_u:max_u
> find <- FALSE
> for (u in min_u:max_u)
> {
>   cl1 <- kmeans(z, z[initial.Centers(z, u),])
>   cl2 <- kmeans(z, z[initial.Centers(z, u + 1),])
>   clall <- cbind(cl1$cluster, cl2$cluster)
>   results[u - min_u + 1, 2] <- H <- index.H(z, clall)
>   if ((results[u - min_u + 1, 2] < 10) && (!find))
>   {
>     lk <- u
>     min <- H
>     clopt <- cl1
>     find <- TRUE
>   }
> }
> if (find)
> {
>   print(paste("minimal u for H<10 equals", lk, "for H =", min))
> } else
> {
>   print("Classification not found")
> }
> write.table(results, file="C:/H_results.csv", sep=";", dec=",", row.names=TRUE, col.names=FALSE)
> write.table(clopt$cluster, file="C:/cluster.csv", sep=";", dec=",", row.names=TRUE, col.names=FALSE)
> write.table(clopt$centers, file="C:/centers.csv", sep=";", dec=",", row.names=TRUE, col.names=FALSE)
> write.table(clopt$withinss, file="C:/withinss.csv", sep=";", dec=",", row.names=TRUE, col.names=FALSE)
> write.table(clopt$size, file="C:/size.csv", sep=";", dec=",", row.names=TRUE, col.names=FALSE)
> plot(results, type="p", pch=0, xlab="u", ylab="H", xaxt="n")
> abline(h=10, untf=FALSE)
> axis(1, c(min_u:max_u))
The results of this procedure are the following:
> [1] "minimal u for H<10 equals 5 for H = 5,10784236355176"

Figure 2. Graphical presentation of Hartigan H index
In Example 3, the stability of the clustering in the data was assessed by replication analysis (function replication.Mod from the clusterSim package).
Example 3.
> library(clusterSim)
> x <- read.csv2("C:/Data_75x2.csv", header=TRUE, strip.white=TRUE, row.names=1)
> x <- as.matrix(x)
> x <- as.data.frame(x)
> options(OutDec = ",")
> w <- replication.Mod(x, v="m", u=5, centrotypes="centroids", normalization="n1", method="kmeans", S=10, fixedAsample=NULL)
> print(w$cRand)
The result of this procedure is the following:
> [1] 0,9794591
The high level of agreement between the two partitions reflects the stability of the clustering in the data.
IV. SUMMARY
In the article, selected packages of the R environment applied in the seven major steps of a cluster analysis study were presented. The selected functions of the clusterSim, stats, cluster, and ade4 packages, which are applied to solving clustering problems, were characterized. Additionally, examples of procedures for solving different clustering problems, which are not available in commercial statistical packages, were presented.
REFERENCES

Breckenridge J.N. (2000), Validating cluster analysis: consistent replication and symmetry, "Multivariate Behavioral Research", 35 (2), 261-285.
Carmone F.J., Kara A., Maxwell S. (1999), HINoV: a new method to improve market segment definition by identifying noisy variables, "Journal of Marketing Research", November, vol. 36, 501-509.
Milligan G.W. (1996), Clustering validation: results and implications for applied analyses, in: P. Arabie, L.J. Hubert, G. de Soete (Eds.), Clustering and classification, World Scientific, Singapore, 341-375.
R Development Core Team (2007), R: A language and environment for statistical computing, R Foundation for Statistical Computing, Vienna, URL http://www.R-project.org.
Walesiak M. (2006), Uogólniona miara odległości w statystycznej analizie wielowymiarowej, Wydanie drugie rozszerzone, Wydawnictwo AE we Wrocławiu.
Walesiak M. (2007), Wybrane zagadnienia klasyfikacji obiektów z wykorzystaniem programu komputerowego clusterSim dla środowiska R, Prace Naukowe AE we Wrocławiu, nr 1169, 46-56.
Walesiak M., Dudek A. (2006), Symulacyjna optymalizacja wyboru procedury klasyfikacyjnej dla danego typu danych - oprogramowanie komputerowe i wyniki badań, Prace Naukowe AE we Wrocławiu, nr 1126, 120-129.
Marek Walesiak

CLUSTER ANALYSIS ISSUES USING THE clusterSim COMPUTER PROGRAM AND THE R ENVIRONMENT

The article characterizes the auxiliary functions of the clusterSim package and selected functions of the stats, cluster, and ade4 packages used for cluster analysis problems. In addition, example procedures employing the analysed functions are presented, which help a potential user carry out many classification tasks not available in basic statistical packages (e.g. SPSS, Statistica, SAS).