• Nie Znaleziono Wyników

Cluster analysis with clusterSim computer program and R environment

N/A
N/A
Protected

Academic year: 2021

Share "Cluster analysis with clusterSim computer program and R environment"

Copied!
9
0
0

Pełen tekst

(1)

M a re k Walesiak

CLUSTER ANALYSIS WITH c l u s t e r S i m

COMPUTER PROGRAM AND R ENVIRONMENT

A B S T R A C T . T he article presents auxiliary functions o f c l u s t e r S i m package (see W alesiak & D udek (2 0 0 6 )) and selected functions o f packages s t a t s , c l u s t e r , and a d e 4 , w hich are applied to solvin g clustering problem s. In addition, the exam p les o f the procedures for so lv in g different clustering problem s are presented. T h ese proce­ dures, w h ich are not available in statistical packages (S P S S , Statistica, S A S ), can help so lv in g a broad range o f classification problem s.

K ey words: cluster analysis, R, c l u s t e r S i m , data analysis.

I. INTRODU CTIO N

In a typical cluster analysis study seven major steps are distinguished (see Milligan (1996), 342-343): selection o f objects and variables, decisions concern­ ing variable normalization, selection o f a distance measure, selection o f cluster­ ing method, determining the number o f clusters, cluster validation, describing and profiling clusters. The article presents functions o f c l u s t e r S i m package and selected functions o f packages s t a t s , c l u s t e r , and a d e 4 , which are applied to solving clustering problems.

II. T H E PA CK A G ES AND FUNCTIONS O F R C O M PU T E R PR O G R A M IN A T Y PIC A L C LU STER ANALYSIS PR O C E D U R E

Table 1 contains selected packages and functions o f R program applied on each step o f typical cluster analysis study.

* Professor, Chair o f Department o f Econometrics and Computer Science, Wrocław Univer­ sity o f Economics.

(2)

Table I The packages and functions o f R computer program in a typical cluster analysis study

No. Steps in a typical cluster analysis study

Selected

packages Functions

1 Selection o f objects and variables c l u s t e r S i m H I N o v . M o d 2 Decisions conccming variable

normalization c l u s t e r S i m d a t a . N o r m a l i z a t i o n

3 Selection o f a distance measure

c l u s t e r S i m s t a t s a d e 4 d i s t . BC, d i s t . GDM, d i s t .S M d i s t d i s t . b i n a r y

4 Selection o f clustering method

c l u s t e r s t a t s c l u s t e r S i m a g n e s , d i a n a , p a m k m e a n s , h c l u s t i n i t i a l . C e n t e r s

5 Determining the number

o f clusters c l u s t e r S i m

i n d e x . G l , i n d e x . G 2 , i n ­ d e x . G3, i n d e x . S, i n d e x . K L , i n d e x . H , i n d e x . G a p 6 Cluster validation c l u s t e r S i m r e p l i c a t i o n . M o d 7 Describing and profiling clusters c l u s t e r S i m c l u s t e r .D e s c r i p t i o n

Source: own presentation.

Step 1. Selection o f objects and variables. Carmone, Kara, and Maxwell (1999) proposed the Heuristic Identification o f Noisy Variables (HINoV) method based on /с-means cluster analysis on each variable and corrected Rand index for each resulting pair o f partitions. The HINoV algorithm can identify noisy vari­ ables in a data set and yield better cluster recovery. As a result o f this algorithm, we receive the contribution o f each variable to cluster structure. Package c l u s t e r S i m contains extended version o f HINoV method for nonmetric data: H IN o V .M o d (x , t y p e = " m e t r i c " , s = 2 , u , d i s t a n c e = N U L L ,

m e t h o d = " k m e a n s " , In d e x = " c R A N D " ) where:

x - data matrix;

s - for metric data (1 - ratio; 2 - interval or mixed); u - number o f clusters (for metric data);

d i s t a n c e - NULL for k m e a n s and nonmetric data, for ratio data ( " d l" - Manhattan, "d2" - Euclidean, "d3" - Chebychev (maximum), "d4" - squared Euclidean, "d5" - GDM1, "d6" - Canbena, "d7" - Bray & Cur­ tis), for interval and mixed data ( " d l" , "d2", "d3", "d4", "d5");

(3)

m e th o d - classification method: "km eans" (default) , " s i n g l e " , "com ­ p l e t e " , " a v e r a g e " , " m c q u it t y " , "m e d ia n ", " c e n t r o i d " , "W ard", "pam" (NULL for nonmetric data);

In d e x -"c R A N D " - corrected Rand index, "RAND" - Rand index.

Step 2. Decisions concerning variable nonnalization. Function

d a t a . N o r m a l i z a t i o n ( x , t y p e = " n 0 " ) calculates normalization data using the formula o f variable normalization nO - n i l for data matrix x (nO — without normalization, n l - standardization, n2 - Weber standardization, n3 - unitization, n4 - unitization with zero minimum, n5 - normalization with range [-1; 1], n 6 - n l l - quotient transformations with different base) - details see Walesiak (2006).

Step 3. Selection o f a distance measure. The packages c l u s t e S i m , s t a t s and a d e 4 contain distance measures for metric and nonmetric data (see 1 able 2).

Table 2 Distance measures for metric and nonmetric data

Package Syntax

c l u s t e r S i m d i s t . G D M ( x , m e t h o d = " G D M l ") - function calculates Generalized Distance Measure for variables measured on metric scale (GDM1) or or­ dinal scale ( G D M 2)

d i s t . BC (x ) - function calculates the Bray-Curtis distance measure for ratio data

d i s t . S M ( x ) - function calculates the Sokal-Michener distance measure for nominal variables

s t a t s d i s t f x , m e t h o d = " e u c l i d e a n " , p = 2) x data matrix or " d is t " object

m e t h o d distance measure: " e u c l i d e a n " , "maximum", " m a n h a tta n " , " C a n b e r r a " , " b i n a r y " , " m i n -

k o w s k i ”

p the power for the Minkowski distance

a d e 4 d i s t . b i n a r y ( d f , m e th o d = N U L L )

d f a data frame with positive or zero values. Used with a s , m a t r i x ( l * ( d f >0))

m e t h o d an imeger between 1 and 10 (distance measure d = V l - s ):

1 = Jaccard, 2 = Sokal & Michener, 3 = Sokal & Sneath (1), 4 = Rogers & Tanimoto, 5 = Czekanowski, 6 = Gower & Legendre (1), 7 = Ochiai, 8 = Sokal & Sneath (2), 9 = Phi o f Pearson, 10 = Gower & Legendre (2)

(4)

Step 4. Selection o f clustering method. The most frequently applied cluster­ ing methods are available in packages s t a t s ( h c l u s t - hierarchical ag- glomerative methods; k m e a n s - Л-means method) and c l u s t e r (pam - parti­ tioning around medoids; a g n e s - hierarchical agglomerative methods; d i a n a - hierarchical divisive method). Example syntax for function k m e a n s for clus­ tering data:

k m e a n s ( x , c e n t e r s , i t e r . m a x = 1 0 , n s t a r t = 1 , a l g o ­ r i t h m = с ( " H a r t ig a n - W o n g " , " L l o y d " , " F o r g y " , "M a c - Q u e e n " ) )

where: x - data matrix; c e n t e r s - either the number o f clusters or a set o f ini­ tial cluster centers; i t e r .m ax - the maximum number o f iterations al­ lowed; n s t a r t - if centers is a number, how many random sets should be chosen?; a l g o r i t h m - applied algorithm.

Function i n i t i a l . C e n te r s ( x , k ) o f c l u s t e r S i m package calcu­ lates initial cluster centers for Л-means algorithm (x - data matrix, к - number o f initial cluster centers).

Step 5. Determining the number o f clusters. Package c l u s t e r S i m contains seven cluster quality indices necessary in determination o f the number of clus­ ters in a data set (Calinski & Harabasz, Baker & Hubert, Hubert & Levine, Sil­ houette, Krzanowski & Lai, Hartigan, gap). For example function i n d e x . H ( x , c l a l l ) calculates Hartigan index for data matrix x and two vectors o f in­ tegers с l a l l indicating the cluster to which each object is allocated in partition of n objects into u, and u + 1 clusters (details and others indices see Walesiak (2007)).

Step 6. Cluster validation. In replication analysis (see Breckenridge (2000)) we compare the results o f classification of two random samples obtained from a data set. The level o f agreement between the two partitions (mean corrected Rand index) reflects the stability of the clustering in the data. Package

c l u s t e r S i m contains r e p l i c a t i o n . Mod function:

r e p l i c a t i o n . M o d ( x , v = " m " , u = 2 , c e n t r o - t y p e s = " c e n t r o i d s " ,

n o r m a liz a t io n = N U L L , d is ta n c e = N U L L , m e th o d = " k m e a n s " ,

S =10, fix e d A s a m p le = N U L L )

where: x - data matrix, v - type o f data: metric ( " r " - ratio, " i " - interval, "m" - mixed), nonmetric ("o " - ordinal, "n " - multistate nominal, "b " - b i­ nary), u - number o f clusters, c e n t r o t y p e s - " c e n t r o i d s " , "me­ d o id s " ; n o r m a l i z a t i o n - normalization fonnula n l - n l l (see stage 2); d i s t a n c e - NULL for "km eans", distance measure (see stage 3); m e th o d - classification method (see stage 4); S - number o f

(5)

Simula-tions; f i x e d A s a m p l e - if NULL A sample is generated randomly, oth­ erwise this parameter contains object numbers arbitrarily assigned to A sample.

Step 7. Describing and profiling clusters. Function c l u s t e r . D e s c r i p t i o n ( x , c l ) o f c l u s t e r S i m package calculates descriptive statistics separately for each cluster and variable in classification c l : arithmetic mean and standard deviation, median and median absolute deviation, mode.

III. T IIE EX A M PLE PROCEDURES W IT H SEL EC TED FU NCTIO NS O F R PACKAGES

The 75 observations were generated from standard two-dimensional spheri­ cal normal distribution into five clusters o f size 15 each with means: / / , = [ 0 O f , / / 2 = [ 0 1 0 f , / / 3 = [5 S j , / / 4 = [lO o f , / / 5 = [1 0 1 0 j ,

f l 0 ]

and covariance matrices: Z, = E , = Z, = X. = = . I n addition, three

1 2 3 4 5 Q 1

noisy variables are included in the model to obscure the underlying clustering structure to be recovered. 75 observations for these variables were generated

'5 2 6

with means and covariance matrix: /j = [5 5 l , S j , S = 2 1 - 5

6 - 5 2

Finally, the data were standardized via formula , , n l” . To help isolate noisy vari­ ables H IN o V . Mod procedure was applied (see example 1).

Example 1 > l i b r a r y ( c l u s t e r ) > l i b r a r y ( c l u s t e r S i m ) > x < - r e a d . c s v 2 ( "C: / D a t a _ 7 5 x 5 . c s v " , h e a d e r = T R U E , s t r i p . w h i t e = T R U E , r o w . n a m e s = l ) > x < ~ a s . m a t r i x ( x ) > z < - d a t a . N o r m a l i z a t i o n ( x , t y p e = " n l " ) > z < - a s . d a t a . f r a m e ( z ) > r l < - H I N o V . M o d ( z , t y p e = " m e t r i c " , s = 2 , 5 , m e t h o d = " k m e a n s",I n d e x = “c R A N D " ) > o p t i o n s ( O u t D e c = > p l o t ( r l $ s t o p r i [ , 2 ] , t y p e = " p " , p c h = 0 , x l a b = " N u m b e r o f v a r i a b l e ", y l a b = " t o p r i " , x a x t = " n ")

(6)

> a x i s ( 1 , a t = c ( 1 : m a x ( r l $ s t o p r i [ , 1 ] ) ) ,

l a b e l s = r l $ s t o p r i [ , 1 ] )

The result o f this procedure is shown in Figure 1.

Based on scree diagram (Figure 1) three noisy variables v_3, v_4, and v_5 were eliminated via HINoV method.

In procedure of example 2 the following assumptions is taken into account:

- for clustering o f 75 objects in two-dimensional space (file

d a t a _ 7 5 x 2 , c s v ) the Л-means method was applied,

- the estimated number of clusters is the smallest м е [ 2 ;1 () ] such that

H ( u ) < 1 0 ,

Number of variable

Figure 1. Scree diagram Source: own research.

- w r i t e , t a b l e function allow to save results in files: values o f index

H ( u ) , a vector o f integers indicating the cluster to which each object is allo­

cated („cluster”), a matrix o f cluster centers („centers”), the within-cluster sum o f squares for each cluster(„withinss”), the number o f objects in each cluster („size”).

Example 2 (first six instructions from example 1). > m i n _ u = 2

> m a x _ u = 1 0 > m i n < - 0

(7)

>

r e s u l t s [ , 1 ] <- min_u:max_u > f i n d <- FALSE > f o r (u i n mi n_ u:max_u) > { > e l l <- kmeans ( z , z [ i n i t i a l . C e n t e r s (z , u ) , ] ) > c l 2 <- kmeans ( z , z [ i n i t i a l . C e n t e r s (z , u + 1 ) , ] ) > c l a l l < - c b i n d ( c l l $ c l u s t e r , c l 2 $ c l u s t e r ) > r e s u l t s [ u - m i n _ u + l

,

2 ] <- H <- i n d e x . H ( z , c l a l l ) > i f ( ( r e s u l t s [ u - m i n _ u + l , 2 ] < 1 0 ) &&( ! f i nd ) ) > ( > l k < - u > min<-H > c l o p t < - c l l > find<-TRUE > ) > } > i f ( f i n d ) > { > p r i n t ( p a s t e ( "minimal и f o r H<=10 e q u a l s " , l k , " f o r H =", m i n ) ) > } e l s e > { > p r i n t ( " C l a s s i f i c a t i o n n o t f i n d " ) > } > w r i t e . t a b l e ( r e s u l t s , f i l e = " C : / H _ r e s u l t s . c s v " , s e p = " ; " , d e c = “, " , row.names=TRUE, c o l . names=FALSE) > w r i t e . t a b l e ( c l o p t $ c l u s t e r , f i l e = " C : / c l u s t e r . c s v " , s e p = " ; " , d e c = " , " , row. names=TRUE, c o l .names=FALSE) > w r i t e . t a b l e ( c l o p t $ c e n t e r s , f i l e = " C : / c e n t e r s . c s v " , s e p = ";", d e c

= ", ",

row. names=TRUE, c o l . names=FALSE) > w r i t e . t a b l e ( c l o p t $ w i t h i n s s , f i l e = " C : / w i t h i n s s . c s v " , s e p = " ; " , d e c = ", ", row. names=TRUE, c o l . names=FALSE) > w r i t e . t a b l e ( c l o p t $ s i z e , f i l e = " C : / s i z e . c s v " , s e p = " ; " , d e c = " , " , r ow. names=TRUE, c o l . names=FALSE)

> p l o t ( r e s u l t s , t y p e = " p " , p ch=0, x l a b = " u " , y l a b =" H " , x a x t = " n ")

> a b l i n e ( h = 1 0 , u n t f = FALSE) > a x i s ( 1 , с (mi n_ u:max_u))

(8)

The results o f this procedure are following: > [ 1 ] "minimal и f o r H<=10 e q u a l s 5 f o r H = 5 , 1 0 7 8 4 2 3 6 3 5 5 1 7 6 " s 8 S

Figure 2. Graphical presentation o f Hartigan H index

In example 3, the stability of the clustering in the data was done by replica­ tion analysis (function r e p l i c a t i o n . Mod from c l u s t e r S i m package).

Example 3. > l i b r a r y ( c l u s t e r S i m ) > x < - r e a d . c s v 2 ( " C : / D a t a _ 7 5 x 2 . c s v " , h e a d e r = T R U E , s t r i p . w h i t e = T R U E , r o w . n a m e s = l ) > x < - a s . m a t r i x ( x ) > x < - a s . d a t a . f r a m e ( x ) > o p t i o n s ( Out Dec = ", ") > w < - r e p l i c a t i o n . M o d ( x , v - " m " , u = 5 , c e n t r o - t y p e s = "c e n t r o i d s ", n o r m a l i z a t i o n = " n l ", m e t h o d - " k m e a n s ", S=10, f i xe dA s am p le = NU LL ) > p r i n t ( w $ c R a n d )

The result o f this procedure is following:

> [ 1 ] 0 , 9 7 9 4 5 9 1

The high level o f agreement between the two partitions reflects the stability o f the clustering in the data.

(9)

IV. SUMM ARY

In articlc, selected packages o f R environment applied in seven major steps of cluster analysis study were presented. The selected functions of packages c l u s t e r S i m , s t a t s , c l u s t e r , and a d e 4 , which are applied to solving clustering problems, were characterized. Additionally, the examples o f the pro­ cedures for solving different clustering problems are presented which are not available in commercial statistical packages.

RE FE R E N C E S

B reckenridge J.N. (2 0 0 0 ), V a lid a tin g c lu s te r a n a ly s is: c o n s is te n t r e p lic a tio n a n d s y m m e ­

try, “M ultivariate B ehavioral R esearch”, 35 (2 ), 2 6 1 -2 8 5

C arm one F.J., Kara A ., M axw ell S. (1 9 9 9 ), H IN o V : a n e w m e th o d to im p ro v e m a rk e t

se g m e n t d e fin itio n b y id e n tify in g n o is y v a ria b le s, “Journal o f M arketing R esearch” ,

N ovem b er, v ol. 3 6 , 5 0 1 -5 0 9 .

M illigan G .W . (1 9 9 6 ), C lu ste rin g v a lid a tio n : r e s u lts a n d im p lic a tio n s f o r a p p lie d a n a ly ­

ses, W: P. A rabie, L.J. Hubert, G. de S oete (E d s.), C lu ste rin g a n d c la ssific a tio n ,

W orld S cien tific, Singapore, 3 4 1 -3 7 5 .

R D ev elo p m en t C ore T eam (2 0 0 7 ), R : A la n g u a g e a n d e n v ir o n m e n t f o r s ta tis tic a l c o m ­

p u tin g . R Foundation for Statistical C om puting, V ienna, U R L h ttp ://w w w .

R -project.org.

W alesiak M. (2 0 0 6 ), U o g ó ln io n a m ia ra o d le g ło śc i w s ta ty s ty c z n e j a n a liz ie w ie lo w y m ia ­

ro w ej. W ydanie drugie rozszerzone. W ydaw nictw o A E w e W rocław iu.

W alesiak M. (2 0 0 7 ), W yb ra n e za g a d n ie n ia k la s y fik a c ji o b ie k tó w z w y k o rzy s ta n ie m p r o ­

g r a m u k o m p u te r o w e g o c l u s t e r S i m d la śr o d o w is k a R , Prace N au k ow e A E w e

W rocław iu, nr 1 1 6 9 ,4 6 - 5 6 .

W alesiak M ., D ud ek A. (2 0 0 6 ), S y m u la c y jn a o p ty m a liza c ja w y b o ru p r o c e d u r y k la s y fi­

k a c y jn e j d la d a n e g o ty p u d a n y c h - o p r o g r a m o w a n ie k o m p u te r o w e i w y n ik i b a d a ń ,

Prace N au k ow e A E w e W rocław iu nr 1126, 1 2 0 -1 2 9 .

M a r e k W a lesia k

ZA G A D N IEN IA A NALIZY SK U PIEŃ Z W Y K O R Z Y ST A N IE M PR O G R A M U K O M P U T E R O W E G O c l u s t e r S i m I ŚR O D O W ISK A R

W artykule scharakteryzow ano funkcje p o m o cn icze pakietu c l u s t e r S i m oraz w ybrane funkcje p ak ietów s t a t s , c l u s t e r i a d e 4 słu żące zagadnieniu an alizy sku­ pień. Ponadto zaprezentow ano przykładow e procedury, w yk orzystu jące analizow ane funkcje, ułatw iające potencjalnem u u żytk ow n ik ow i realizację w ielu zagadnień k la sy fi­ kacyjnych niedostępnych w p odstaw ow ych pakietach statystycznych (np. SP S S , Statistica, SA S).

Cytaty

Powiązane dokumenty

Figure 1 shows the relationship between the values of adjusted Rand in- dices averaged over fifty replications and models 2-10 with the HINoV-selected variables (¯b) and values

• Open source software and dataset of 179 Python projects (including annotated pull requests) [ 20 ], which can be used to construct a developer network, and estimate and validate

ANNA JAB O SKA: The image of parochial education in the end of the 17th century at Gniezno archdeaconry in the light of the visitation of Stanis aw Lipski .... ANNA

• Find the closest (most similar) pair of clusters and merge them into a single cluster, so that now you have one less

Dział ten wydaje się najważniejszy i najciekawszy, zarówno z punktu widze­ nia praktyki Kongregacji Soboru, jak również jeśli chodzi o informacje dotyczące spraw prowadzonym

Trzeba m i uwolnić się od państw a i od szkieletu ludzkiego. Szabla została sprzedana. niem iecki lu tn ik m istrz Klotz. Podniecała m nie niekłam ana namiętność

Within an ERANET-ERACOBUILD project, this study investigates the opportunities and barriers to establish a “one stop shop” with an integrated supply side, to counteract the

voices speak volumes’ and that as an audience member ‘you need to participate with all your senses.’ The two reviews which end this journal are similarly attuned to the