ACTA UNIVERSITATIS LODZIENSIS FOLIA OECONOMICA 216, 2008

Marek Walesiak*

CLUSTER ANALYSIS WITH clusterSim COMPUTER PROGRAM AND R ENVIRONMENT

ABSTRACT. The article presents auxiliary functions of the clusterSim package (see Walesiak & Dudek (2006)) and selected functions of the packages stats, cluster, and ade4, which are applied to solving clustering problems. In addition, examples of procedures for solving different clustering problems are presented. These procedures, which are not available in statistical packages (SPSS, Statistica, SAS), can help in solving a broad range of classification problems.

Key words: cluster analysis, R, clusterSim, data analysis.

I. INTRODUCTION

In a typical cluster analysis study seven major steps are distinguished (see Milligan (1996), 342-343): selection of objects and variables, decisions concerning variable normalization, selection of a distance measure, selection of a clustering method, determining the number of clusters, cluster validation, and describing and profiling clusters. The article presents functions of the clusterSim package and selected functions of the packages stats, cluster, and ade4, which are applied to solving clustering problems.

II. THE PACKAGES AND FUNCTIONS OF R COMPUTER PROGRAM IN A TYPICAL CLUSTER ANALYSIS PROCEDURE

Table 1 contains selected packages and functions of the R program applied at each step of a typical cluster analysis study.

* Professor, Chair of the Department of Econometrics and Computer Science, Wrocław University of Economics.


Table 1. The packages and functions of R computer program in a typical cluster analysis study

No  Steps in a typical cluster analysis study     Selected packages  Functions
1   Selection of objects and variables            clusterSim         HINoV.Mod
2   Decisions concerning variable normalization   clusterSim         data.Normalization
3   Selection of a distance measure               clusterSim         dist.BC, dist.GDM, dist.SM
                                                  stats              dist
                                                  ade4               dist.binary
4   Selection of clustering method                cluster            agnes, diana, pam
                                                  stats              kmeans, hclust
                                                  clusterSim         initial.Centers
5   Determining the number of clusters            clusterSim         index.G1, index.G2, index.G3, index.S,
                                                                     index.KL, index.H, index.Gap
6   Cluster validation                            clusterSim         replication.Mod
7   Describing and profiling clusters             clusterSim         cluster.Description

Source: own presentation.

Step 1. Selection of objects and variables. Carmone, Kara, and Maxwell (1999) proposed the Heuristic Identification of Noisy Variables (HINoV) method, based on k-means cluster analysis of each variable and the corrected Rand index for each resulting pair of partitions. The HINoV algorithm can identify noisy variables in a data set and yield better cluster recovery. As a result of this algorithm, we receive the contribution of each variable to the cluster structure. Package clusterSim contains an extended version of the HINoV method for nonmetric data:

HINoV.Mod(x, type="metric", s=2, u, distance=NULL, method="kmeans", Index="cRAND")

where:

x - data matrix;

s - for metric data (1 - ratio; 2 - interval or mixed);

u - number of clusters (for metric data);

distance - NULL for kmeans and nonmetric data; for ratio data ("d1" - Manhattan, "d2" - Euclidean, "d3" - Chebychev (maximum), "d4" - squared Euclidean, "d5" - GDM1, "d6" - Canberra, "d7" - Bray & Curtis); for interval and mixed data ("d1", "d2", "d3", "d4", "d5");


method - classification method: "kmeans" (default), "single", "complete", "average", "mcquitty", "median", "centroid", "ward", "pam" (NULL for nonmetric data);

Index - "cRAND" - corrected Rand index, "RAND" - Rand index.
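The idea behind HINoV for metric data can be sketched in base R. This is only an illustration under stated assumptions, not the clusterSim implementation: the helper names adj_rand and hinov_sketch are hypothetical, the data are simulated, and each variable is clustered on its own with stats::kmeans. The per-variable sum of corrected Rand indices plays the role of the "topri" contribution.

```r
# Corrected (adjusted) Rand index between two partitions, base R only.
adj_rand <- function(a, b) {
  tab <- table(a, b)
  n <- sum(tab)
  sum_ij <- sum(choose(tab, 2))
  sum_a <- sum(choose(rowSums(tab), 2))
  sum_b <- sum(choose(colSums(tab), 2))
  expected <- sum_a * sum_b / choose(n, 2)
  max_index <- (sum_a + sum_b) / 2
  (sum_ij - expected) / (max_index - expected)
}

# Hypothetical helper: cluster each variable separately into u clusters,
# then sum the corrected Rand indices of its partition with all others.
hinov_sketch <- function(x, u, nstart = 10) {
  x <- as.matrix(x)
  m <- ncol(x)
  parts <- lapply(seq_len(m), function(j) {
    kmeans(x[, j], centers = u, nstart = nstart)$cluster
  })
  r <- matrix(0, m, m)
  for (j in seq_len(m)) {
    for (l in seq_len(m)) {
      if (j != l) r[j, l] <- adj_rand(parts[[j]], parts[[l]])
    }
  }
  rowSums(r)  # contribution of each variable to the cluster structure
}

# Simulated data: two informative variables and one pure-noise variable.
set.seed(1)
g <- rep(1:3, each = 20)
x <- cbind(v1 = 5 * g + rnorm(60), v2 = 5 * g + rnorm(60), noise = rnorm(60))
topri <- hinov_sketch(x, u = 3)
print(topri)
```

Variables whose contribution is clearly lower than the rest are candidates for removal, which is what the scree diagram in the worked example is used for.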

Step 2. Decisions concerning variable normalization. Function data.Normalization(x, type="n0") normalizes the data matrix x using one of the variable normalization formulas n0 - n11 (n0 - without normalization, n1 - standardization, n2 - Weber standardization, n3 - unitization, n4 - unitization with zero minimum, n5 - normalization in range [-1; 1], n6-n11 - quotient transformations with different bases); for details see Walesiak (2006).
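To make three of these formulas concrete, the following base R sketch applies them column by column. It assumes the usual definitions (n1: subtract the mean and divide by the standard deviation; n3: subtract the mean and divide by the range; n4: subtract the minimum and divide by the range); the toy matrix and the object names are illustrative, and clusterSim itself is not called.

```r
# Toy data matrix with two variables on very different scales.
x <- cbind(a = c(1, 2, 3, 4, 10), b = c(100, 150, 200, 250, 300))

# n1: standardization, (x - mean) / sd
n1 <- apply(x, 2, function(v) (v - mean(v)) / sd(v))

# n3: unitization, (x - mean) / range
n3 <- apply(x, 2, function(v) (v - mean(v)) / diff(range(v)))

# n4: unitization with zero minimum, (x - min) / range
n4 <- apply(x, 2, function(v) (v - min(v)) / diff(range(v)))

print(round(n1, 3))
print(round(n4, 3))
```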

Step 3. Selection of a distance measure. The packages clusterSim, stats and ade4 contain distance measures for metric and nonmetric data (see Table 2).

Table 2. Distance measures for metric and nonmetric data

Package     Syntax
clusterSim  dist.GDM(x, method="GDM1") - function calculates the Generalized Distance Measure for variables measured on a metric scale (GDM1) or an ordinal scale (GDM2)
            dist.BC(x) - function calculates the Bray-Curtis distance measure for ratio data
            dist.SM(x) - function calculates the Sokal-Michener distance measure for nominal variables
stats       dist(x, method="euclidean", p=2)
              x - data matrix or "dist" object
              method - distance measure: "euclidean", "maximum", "manhattan", "canberra", "binary", "minkowski"
              p - the power for the Minkowski distance
ade4        dist.binary(df, method=NULL)
              df - a data frame with positive or zero values. Used with as.matrix(1*(df>0))
              method - an integer between 1 and 10 (distance measure d = sqrt(1 - s), where s is a similarity coefficient):
                1 = Jaccard, 2 = Sokal & Michener, 3 = Sokal & Sneath (1), 4 = Rogers & Tanimoto, 5 = Czekanowski, 6 = Gower & Legendre (1), 7 = Ochiai, 8 = Sokal & Sneath (2), 9 = Phi of Pearson, 10 = Gower & Legendre (2)

Source: own presentation.
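As a small illustration of the stats entry in Table 2, the sketch below computes three of the listed measures with dist on a toy matrix. The data and object names are made up for the example; only base R is used.

```r
# Three objects described by two metric variables.
x <- rbind(a = c(1, 2), b = c(4, 6), c = c(1, 0))

d_euc <- dist(x, method = "euclidean")
d_man <- dist(x, method = "manhattan")
d_max <- dist(x, method = "maximum")

# A "dist" object stores the lower triangle; as.matrix() gives the full matrix.
print(as.matrix(d_euc))
print(as.matrix(d_man))
```

Objects a and b differ by 3 and 4 on the two variables, so the three measures give 5, 7 and 4 respectively for that pair.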


Step 4. Selection of clustering method. The most frequently applied clustering methods are available in the packages stats (hclust - hierarchical agglomerative methods; kmeans - k-means method) and cluster (pam - partitioning around medoids; agnes - hierarchical agglomerative methods; diana - hierarchical divisive method). Example syntax for the function kmeans for clustering data:

kmeans(x, centers, iter.max = 10, nstart = 1, algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"))

where: x - data matrix; centers - either the number of clusters or a set of initial cluster centers; iter.max - the maximum number of iterations allowed; nstart - if centers is a number, how many random sets should be chosen; algorithm - applied algorithm.

Function initial.Centers(x, k) of the clusterSim package calculates initial cluster centers for the k-means algorithm (x - data matrix, k - number of initial cluster centers).
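The two ways of supplying centers to kmeans, and a hierarchical alternative, can be compared on simulated data. The sketch below uses only the stats package; the data, the seed, and the choice of rows 1 and 21 as starting centers are assumptions made for the example (in the paper's procedure the starting rows would come from initial.Centers).

```r
set.seed(123)
# Two well-separated groups of 20 points in two dimensions.
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 6), ncol = 2))

# centers given as a number: 5 random starts, best solution kept.
km <- kmeans(x, centers = 2, iter.max = 10, nstart = 5)

# centers given as a matrix of initial cluster centers (here: two data rows).
km_init <- kmeans(x, centers = x[c(1, 21), ])

# Hierarchical agglomerative clustering, cut into two clusters.
hc <- cutree(hclust(dist(x), method = "average"), k = 2)

print(table(kmeans = km$cluster, hclust = hc))
```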

Step 5. Determining the number of clusters. Package clusterSim contains seven cluster quality indices used in determining the number of clusters in a data set (Calinski & Harabasz, Baker & Hubert, Hubert & Levine, Silhouette, Krzanowski & Lai, Hartigan, gap). For example, the function index.H(x, clall) calculates the Hartigan index for data matrix x and two vectors of integers clall indicating the cluster to which each object is allocated in a partition of n objects into u and u + 1 clusters (for details and other indices see Walesiak (2007)).
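A base R sketch of the Hartigan index may help here. It assumes the usual definition H(u) = (W_u / W_{u+1} - 1)(n - u - 1), where W_u is the total within-cluster sum of squares for the partition into u clusters and n is the number of objects. The helper hartigan_h is hypothetical, takes the two k-means fits from stats::kmeans with random starts, and is not the clusterSim function index.H.

```r
# Hypothetical helper: Hartigan index from two k-means partitions (u and u + 1).
hartigan_h <- function(x, u, nstart = 20) {
  n <- nrow(x)
  w_u  <- kmeans(x, centers = u, nstart = nstart)$tot.withinss
  w_u1 <- kmeans(x, centers = u + 1, nstart = nstart)$tot.withinss
  (w_u / w_u1 - 1) * (n - u - 1)
}

set.seed(7)
# Two well-separated groups of 20 points each.
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 6), ncol = 2))

h1 <- hartigan_h(x, u = 1)  # going from 1 to 2 clusters
h2 <- hartigan_h(x, u = 2)  # going from 2 to 3 clusters
print(c(H1 = h1, H2 = h2))
```

A large value of H(u) means that adding one more cluster still reduces the within-cluster sum of squares substantially; the rule used later in the paper picks the smallest u for which H(u) does not exceed 10.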

Step 6. Cluster validation. In replication analysis (see Breckenridge (2000)) we compare the results of classification of two random samples obtained from a data set. The level of agreement between the two partitions (mean corrected Rand index) reflects the stability of the clustering in the data. Package clusterSim contains the replication.Mod function:

replication.Mod(x, v="m", u=2, centrotypes="centroids", normalization=NULL, distance=NULL, method="kmeans", S=10, fixedAsample=NULL)

where: x - data matrix; v - type of data: metric ("r" - ratio, "i" - interval, "m" - mixed), nonmetric ("o" - ordinal, "n" - multistate nominal, "b" - binary); u - number of clusters; centrotypes - "centroids", "medoids"; normalization - normalization formula n1-n11 (see step 2); distance - NULL for "kmeans", otherwise a distance measure (see step 3); method - classification method (see step 4); S - number of simulations; fixedAsample - if NULL, the A sample is generated randomly, otherwise this parameter contains object numbers arbitrarily assigned to the A sample.
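The logic of replication analysis can be sketched in base R as follows. This is a simplified illustration, not the replication.Mod implementation: it assumes k-means with centroids, splits the data randomly into halves A and B, clusters both, assigns the B objects to the nearest A centroid, and compares the two partitions of B with the corrected Rand index. All helper names (adj_rand, nearest_center, replication_sketch) and the simulated data are hypothetical.

```r
# Corrected (adjusted) Rand index between two partitions.
adj_rand <- function(a, b) {
  tab <- table(a, b)
  n <- sum(tab)
  sum_ij <- sum(choose(tab, 2))
  sum_a <- sum(choose(rowSums(tab), 2))
  sum_b <- sum(choose(colSums(tab), 2))
  expected <- sum_a * sum_b / choose(n, 2)
  (sum_ij - expected) / ((sum_a + sum_b) / 2 - expected)
}

# Index of the nearest center (squared Euclidean distance) for each row of pts.
nearest_center <- function(pts, centers) {
  d <- sapply(seq_len(nrow(centers)), function(k) {
    rowSums(sweep(pts, 2, centers[k, ])^2)
  })
  max.col(-d, ties.method = "first")
}

# Mean corrected Rand index over S random A/B splits.
replication_sketch <- function(x, u, S = 10, nstart = 10) {
  x <- as.matrix(x)
  n <- nrow(x)
  crand <- numeric(S)
  for (s in seq_len(S)) {
    a <- sample(n, floor(n / 2))
    xa <- x[a, , drop = FALSE]
    xb <- x[-a, , drop = FALSE]
    fit_a <- kmeans(xa, centers = u, nstart = nstart)
    fit_b <- kmeans(xb, centers = u, nstart = nstart)
    pred_b <- nearest_center(xb, fit_a$centers)
    crand[s] <- adj_rand(fit_b$cluster, pred_b)
  }
  mean(crand)
}

set.seed(42)
centers <- rbind(c(0, 0), c(8, 0), c(0, 8))
x <- centers[rep(1:3, each = 30), ] + matrix(rnorm(180), ncol = 2)
stability <- replication_sketch(x, u = 3, S = 10)
print(stability)
```

A mean corrected Rand index close to 1 indicates that the cluster structure found in one half of the data is reproduced in the other half.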

Step 7. Describing and profiling clusters. Function cluster.Description(x, cl) of the clusterSim package calculates descriptive statistics separately for each cluster and variable in classification cl: arithmetic mean and standard deviation, median and median absolute deviation, mode.
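Similar per-cluster summaries can be obtained in base R with aggregate. The sketch below is an illustration on made-up data, not a reimplementation of cluster.Description; note that stats::mad scales the median absolute deviation by the constant 1.4826 by default, which may differ from the convention used in clusterSim.

```r
# Toy data: six objects, two variables, two clusters.
x <- data.frame(v1 = c(1, 2, 3, 10, 11, 12), v2 = c(5, 5, 6, 1, 2, 3))
cl <- c(1, 1, 1, 2, 2, 2)

# Apply a summary function to every variable within every cluster.
describe <- function(f) aggregate(x, by = list(cluster = cl), FUN = f)

cl_mean   <- describe(mean)
cl_sd     <- describe(sd)
cl_median <- describe(median)
cl_mad    <- describe(mad)

print(cl_mean)
print(cl_median)
```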

III. THE EXAMPLE PROCEDURES WITH SELECTED FUNCTIONS OF R PACKAGES

The 75 observations were generated from a standard two-dimensional spherical normal distribution into five clusters of size 15 each, with means $\mu_1 = [0\ 0]^T$, $\mu_2 = [0\ 10]^T$, $\mu_3 = [5\ 5]^T$, $\mu_4 = [10\ 0]^T$, $\mu_5 = [10\ 10]^T$, and covariance matrices $\Sigma_1 = \Sigma_2 = \Sigma_3 = \Sigma_4 = \Sigma_5 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$.

In addition, three noisy variables are included in the model to obscure the underlying clustering structure to be recovered. The 75 observations for these variables were generated with means and covariance matrix: $\mu = [5\ 5\ 7.5]^T$, $\Sigma = \begin{bmatrix} 5 & 2 & 6 \\ 2 & 1 & -5 \\ 6 & -5 & 2 \end{bmatrix}$.

Finally, the data were standardized via formula "n1". To help isolate noisy variables, the HINoV.Mod procedure was applied (see example 1).
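The cluster part of such a data set can be simulated in base R along the following lines. This is a sketch under assumptions: the seed, the object names, and the variable labels v_1 and v_2 are illustrative, the three noisy variables are left out, and scale() is used as a stand-in for the "n1" standardization.

```r
set.seed(2008)

# Cluster means from the text: five clusters in two dimensions.
means <- rbind(c(0, 0), c(0, 10), c(5, 5), c(10, 0), c(10, 10))
true_cl <- rep(1:5, each = 15)

# Spherical normal noise with unit variance around each cluster mean.
x <- means[true_cl, ] + matrix(rnorm(75 * 2), ncol = 2)
colnames(x) <- c("v_1", "v_2")

# Standardization: subtract the column mean, divide by the column sd.
z <- scale(x)

print(dim(z))
print(round(colMeans(z), 10))
```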

Example 1

> library(cluster)

> library(clusterSim)

> x <- read.csv2("C:/Data_75x5.csv", header=TRUE, strip.white=TRUE, row.names=1)

> x <- as.matrix(x)

> z <- data.Normalization(x, type="n1")

> z <- as.data.frame(z)

> r1 <- HINoV.Mod(z, type="metric", s=2, u=5, method="kmeans", Index="cRAND")

> options(OutDec=",")

> plot(r1$stopri[,2], type="p", pch=0, xlab="Number of variable", ylab="topri", xaxt="n")


> axis(1, at=c(1:max(r1$stopri[,1])), labels=r1$stopri[,1])

The result of this procedure is shown in Figure 1.

Based on the scree diagram (Figure 1), three noisy variables v_3, v_4, and v_5 were eliminated via the HINoV method.

In the procedure of example 2 the following assumptions are taken into account:

- for clustering of 75 objects in two-dimensional space (file Data_75x2.csv) the k-means method was applied,

- the estimated number of clusters is the smallest u ∈ [2, 10] such that H(u) ≤ 10,

[Figure 1: scatter plot; x-axis: Number of variable, y-axis: topri]

Figure 1. Scree diagram
Source: own research.

- the write.table function saves the results in files: values of index H(u), a vector of integers indicating the cluster to which each object is allocated ("cluster"), a matrix of cluster centers ("centers"), the within-cluster sum of squares for each cluster ("withinss"), and the number of objects in each cluster ("size").

Example 2 (first six instructions from example 1).

> min_u=2

> max_u=10

> min <- 0

> results <- array(0, c(max_u-min_u+1, 2))


> results[,1] <- min_u:max_u

> find <- FALSE

> for (u in min_u:max_u)

> {

> cl1 <- kmeans(z, z[initial.Centers(z, u),])

> cl2 <- kmeans(z, z[initial.Centers(z, u+1),])

> clall <- cbind(cl1$cluster, cl2$cluster)

> results[u-min_u+1, 2] <- H <- index.H(z, clall)

> if ((results[u-min_u+1, 2]<10) && (!find))

> {

> lk <- u

> min <- H

> clopt <- cl1

> find <- TRUE

> }

> }

> if (find)

> {

> print(paste("minimal u for H<=10 equals", lk, "for H =", min))

> } else

> {

> print("Classification not find")

> }

> write.table(results, file="C:/H_results.csv", sep=";", dec=",", row.names=TRUE, col.names=FALSE)

> write.table(clopt$cluster, file="C:/cluster.csv", sep=";", dec=",", row.names=TRUE, col.names=FALSE)

> write.table(clopt$centers, file="C:/centers.csv", sep=";", dec=",", row.names=TRUE, col.names=FALSE)

> write.table(clopt$withinss, file="C:/withinss.csv", sep=";", dec=",", row.names=TRUE, col.names=FALSE)

> write.table(clopt$size, file="C:/size.csv", sep=";", dec=",", row.names=TRUE, col.names=FALSE)

> plot(results, type="p", pch=0, xlab="u", ylab="H", xaxt="n")

> abline(h=10, untf=FALSE)

> axis(1, c(min_u:max_u))


The results of this procedure are as follows:

> [1] "minimal u for H<=10 equals 5 for H = 5,10784236355176"

[Figure 2: scatter plot with a horizontal line at H = 10; x-axis: u, y-axis: H]

Figure 2. Graphical presentation of Hartigan H index

In example 3, the stability of the clustering in the data was assessed by replication analysis (function replication.Mod from the clusterSim package).

Example 3.

> library(clusterSim)

> x <- read.csv2("C:/Data_75x2.csv", header=TRUE, strip.white=TRUE, row.names=1)

> x <- as.matrix(x)

> x <- as.data.frame(x)

> options(OutDec=",")

> w <- replication.Mod(x, v="m", u=5, centrotypes="centroids", normalization="n1", method="kmeans", S=10, fixedAsample=NULL)

> print(w$cRand)

The result of this procedure is as follows:

> [1] 0,9794591

The high level of agreement between the two partitions reflects the stability of the clustering in the data.


IV. SUMMARY

In the article, selected packages of the R environment applied in seven major steps of a cluster analysis study were presented. The selected functions of the packages clusterSim, stats, cluster, and ade4, which are applied to solving clustering problems, were characterized. Additionally, examples of procedures for solving different clustering problems, which are not available in commercial statistical packages, were presented.

REFERENCES

Breckenridge J.N. (2000), Validating cluster analysis: consistent replication and symmetry, "Multivariate Behavioral Research", 35 (2), 261-285.

Carmone F.J., Kara A., Maxwell S. (1999), HINoV: a new method to improve market segment definition by identifying noisy variables, "Journal of Marketing Research", November, vol. 36, 501-509.

Milligan G.W. (1996), Clustering validation: results and implications for applied analyses, In: P. Arabie, L.J. Hubert, G. de Soete (Eds.), Clustering and classification, World Scientific, Singapore, 341-375.

R Development Core Team (2007), R: A language and environment for statistical computing, R Foundation for Statistical Computing, Vienna, URL http://www.R-project.org.

Walesiak M. (2006), Uogólniona miara odległości w statystycznej analizie wielowymiarowej. Wydanie drugie rozszerzone. Wydawnictwo AE we Wrocławiu.

Walesiak M. (2007), Wybrane zagadnienia klasyfikacji obiektów z wykorzystaniem programu komputerowego clusterSim dla środowiska R, Prace Naukowe AE we Wrocławiu, nr 1169, 46-56.

Walesiak M., Dudek A. (2006), Symulacyjna optymalizacja wyboru procedury klasyfikacyjnej dla danego typu danych - oprogramowanie komputerowe i wyniki badań, Prace Naukowe AE we Wrocławiu, nr 1126, 120-129.

Marek Walesiak

ISSUES IN CLUSTER ANALYSIS WITH THE clusterSim COMPUTER PROGRAM AND THE R ENVIRONMENT

The article characterizes auxiliary functions of the clusterSim package and selected functions of the packages stats, cluster and ade4 that serve cluster analysis. In addition, example procedures using the analysed functions are presented; they help a potential user carry out many classification tasks that are not available in basic statistical packages (e.g. SPSS, Statistica, SAS).
