
The Choice of Variable Normalization Method in Cluster Analysis

Andrzej DUDEK
Wroclaw University of Economics and Business, Wroclaw, Poland
ORCID: 0000-0002-4943-8703
andrzej.dudek@ue.wroc.pl

Marek WALESIAK
Wroclaw University of Economics and Business, Wroclaw, Poland
ORCID: 0000-0003-0922-2323
marek.walesiak@ue.wroc.pl

Abstract

One of the stages in cluster analysis, carried out on the basis of metric data (interval, ratio), is the choice of variable normalization method. This paper presents two proposed procedures (for clustering algorithms based on a distance matrix and on a data matrix) which allow for the isolation of groups of normalization methods that lead to similar clustering results. The proposal can reduce the problem of choosing the normalization method in cluster analysis. The results are illustrated via a simulation study and an empirical example using the clusterSim package and the R program.

Keywords: Normalization of Variables, Cluster Analysis, Real Estate Market, clusterSim

Introduction

The normalization of variables in statistical multivariate analysis is carried out when the variables describing the analyzed objects are measured on metric scales (interval or ratio). The characteristics of measurement scales are discussed e.g. in Stevens (1946) and Walesiak (2011), pp. 13-16. The purpose of normalization is to achieve the comparability of variables.

The comparison of normalization methods can be analyzed from the perspective of a particular statistical method applied in multivariate analysis. When methods of cluster analysis, multidimensional scaling, or linear ordering based on aggregate measures are used in empirical studies, metric variables must be brought to comparability through normalization transformations. Other methods of statistical multivariate analysis do not require a prior normalization transformation.

For the purposes of cluster analysis, Milligan and Cooper (1988) conducted simulation studies discussing the influence of the choice of variable normalization method on recovering class structure (6 normalization methods from Table 1 were used: n1, n4, n6, n7, n8, n10). Similar studies, based on real data sets, were carried out by Schaffer and Green (1996).

The presented article follows a different approach. The study covers 18 variable normalization methods available in the data.Normalization function of the clusterSim package (Walesiak and Dudek (2019)). An overview of variable normalization methods is presented in Walesiak (2014). Two research procedures are suggested, which allow for the isolation of groups of normalization methods that lead to similar clustering results. The presented proposals can reduce the problem of choosing the normalization method in cluster analysis.


Cluster Analysis for Metric Data – General Procedure

The general procedure used in cluster analysis of the set of objects is as follows (see e.g. Milligan (1996), pp. 342-343; Walesiak (2004)):

a) based on metric data and a distance matrix:

P1 → P2 → P3 → P4 → P5 → P6 → P7 → P8 → P9 → P10,   (1)

where:
P1 – choice of the research problem in cluster analysis,
P2 – selection of objects to cluster,
P3 – selection of variables,
P4 – collecting data and construction of the data matrix [x_ij] (x_ij – value of the j-th variable for the i-th object),
P5 – choice of the variable normalization method ([z_ij], z_ij – normalized value of the j-th variable for the i-th object),
P6 – selection of the distance measure and construction of the distance matrix [d_ik],
P7 – selection of the clustering method,
P8 – determining the number of clusters,
P9 – validation of clustering results,
P10 – interpretation and profiling of clusters.

b) based on metric data and a normalized data matrix (the distance matrix stage P6 is omitted):

P1 → P2 → P3 → P4 → P5 → P7 → P8 → P9 → P10.   (2)

The basic cluster analysis methods based on a distance matrix include: agglomerative and divisive hierarchical methods, the k-medoids method (pam) and the spectral clustering method; the k-means method belongs to the data matrix based methods. The characteristics of the above-mentioned methods are provided e.g. in Anderberg (1973), Kaufman and Rousseeuw (1990), Gordon (1999), Ng, Jordan and Weiss (2002) and Everitt et al. (2011). These methods are available in the following packages: cluster (Maechler et al. (2019)) – functions agnes, diana and pam; stats (R Core Team (2019)) – functions kmeans and hclust; clusterSim (Walesiak and Dudek (2019)) – function speccl; kernlab (Karatzoglou, Smola and Hornik (2019)) – function specc. A sketch of these calls follows.
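As a quick illustration (a sketch only; the iris data, k = 3 and the linkage choices are assumptions made here for demonstration, not taken from the paper), the following calls produce partitions with each of the listed methods:

library(cluster)   # agnes, diana, pam
library(kernlab)   # specc
set.seed(123)
x <- scale(as.matrix(iris[, 1:4]))   # example data, standardized (method n1)
d <- dist(x)                         # Euclidean distance matrix
cl.agnes  <- cutree(as.hclust(agnes(d, method = "average")), k = 3)  # agglomerative
cl.diana  <- cutree(as.hclust(diana(d)), k = 3)                      # divisive
cl.pam    <- pam(d, k = 3)$clustering                                # k-medoids
cl.hclust <- cutree(hclust(d, method = "complete"), k = 3)           # agglomerative
cl.kmeans <- kmeans(x, centers = 3)$cluster                          # data-matrix based
cl.specc  <- specc(x, centers = 3)                                   # spectral clustering
# clusterSim::speccl provides a spectral clustering variant as well.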

Normalization of variables

The choice of the variable normalization method remains one of the cluster analysis stages carried out on metric data. The purpose of variable normalization is to make variables comparable by stripping the measurement results of their units and unifying their orders of magnitude. The first aim is unequivocal and constitutes the sine qua non condition of normalization. The second aim is not unequivocal, hence it allows various solutions. The unification of orders of magnitude is obtained e.g. by unifying the variability of all variables as measured by the standard deviation (the median absolute deviation for positional measures) or by ensuring range stability for the normalized values of variables. In general terms, the unification of orders of magnitude is obtained by introducing a uniformly specified zero value for all variables (the shift parameter A_j in formula (3)), followed by rescaling the variable values (the scale parameter B_j in formula (3)).

Table 1 presents normalization methods that take the form of the linear transformation (see e.g. Jajuga and Walesiak (2000), pp. 106-107; Zelias (2002), p. 792):

z_ij = b_j · x_ij + a_j = (x_ij − A_j) / B_j   (B_j > 0),   (3)

where: x_ij – value of the j-th variable for the i-th object, z_ij – normalized value of the j-th variable for the i-th object, A_j – shift parameter to an arbitrary zero for the j-th variable, B_j – scale parameter for the j-th variable, a_j = −A_j/B_j, b_j = 1/B_j – parameters for the j-th variable presented in Table 1.

Column 1 in Table 1 presents the type of normalization method adopted in the data.Normalization function of the clusterSim package (Walesiak and Dudek (2019)):

data.Normalization(x, type = "n0", normalization = "column")

where: x – vector, matrix or dataset, type – type of normalization (see column 1 in Table 1), "n0" – without normalization, "column" – normalization by variable, "row" – normalization by object.

Table 1: Normalization methods

Type | Method of normalization | b_j | a_j | Scale before normalization | Scale after normalization
n1 | Standardization | 1/s_j | −x̄_j/s_j | ratio or interval | interval
n2 | Positional standardization | 1/mad_j | −med_j/mad_j | ratio or interval | interval
n3 | Unitization | 1/r_j | −x̄_j/r_j | ratio or interval | interval
n3a | Positional unitization | 1/r_j | −med_j/r_j | ratio or interval | interval
n4 | Unitization with zero minimum | 1/r_j | −min_i{x_ij}/r_j | ratio or interval | interval
n5 | Normalization in range [−1; 1] | 1/max_i|x_ij − x̄_j| | −x̄_j/max_i|x_ij − x̄_j| | ratio or interval | interval
n5a | Positional normalization in range [−1; 1] | 1/max_i|x_ij − med_j| | −med_j/max_i|x_ij − med_j| | ratio or interval | interval
n6 | Quotient transformation | 1/s_j | 0 | ratio | ratio
n6a | Quotient transformation | 1/mad_j | 0 | ratio | ratio
n7 | Quotient transformation | 1/r_j | 0 | ratio | ratio
n8 | Quotient transformation | 1/max_i{x_ij} | 0 | ratio | ratio
n9 | Quotient transformation | 1/x̄_j | 0 | ratio | ratio
n9a | Quotient transformation | 1/med_j | 0 | ratio | ratio
n10 | Quotient transformation | 1/Σ_{i=1}^n x_ij | 0 | ratio | ratio
n11 | Quotient transformation | 1/√(Σ_{i=1}^n x_ij²) | 0 | ratio | ratio
n12 | Normalization | 1/√(Σ_{i=1}^n (x_ij − x̄_j)²) | −x̄_j/√(Σ_{i=1}^n (x_ij − x̄_j)²) | ratio or interval | interval
n12a | Positional normalization | 1/√(Σ_{i=1}^n (x_ij − med_j)²) | −med_j/√(Σ_{i=1}^n (x_ij − med_j)²) | ratio or interval | interval
n13 | Normalization with zero being the central point | 1/(r_j/2) | −m_j/(r_j/2) | ratio or interval | interval

x̄_j – mean of the j-th variable, s_j – standard deviation of the j-th variable, r_j – range of the j-th variable, m_j = [max_i{x_ij} + min_i{x_ij}]/2 – mid-range of the j-th variable, med_j = med_i(x_ij) – median of the j-th variable, mad_j = mad_i(x_ij) – median absolute deviation of the j-th variable.

Source: authors' compilation based on the studies by: Anderberg (1973); Borys (1978); Grabinski (1992), pp. 35-38; Jajuga (1981); Jajuga and Walesiak (2000); Milligan and Cooper (1988); Mlodak (2006); Rybaczuk (2002), p. 147; Walesiak (2002), p. 19; Walesiak (2014), pp. 364-365; Walesiak (2018).

Table 1 presents the normalization formulas by variables. Analogous formulas can be formulated for normalization by objects. Normalization by objects is justified when all variables are expressed in the same unit of measurement, as is the case e.g. in structural studies. Further discussion refers to normalization by variables.

All the discussed normalization methods, being linear transformations of each variable (separately), retain the skewness and kurtosis of the distribution of variables. In addition, for each pair of variables, all normalization methods retain the value of the Pearson product-moment correlation coefficient (see Jajuga and Walesiak (2000), p. 111).
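This invariance is easy to verify numerically; the sketch below (with simulated data assumed purely for illustration) checks that standardization (n1) leaves both the correlation matrix and the skewness of each variable unchanged:

library(e1071)  # skewness()
set.seed(1)
x <- cbind(rnorm(100, 50, 10), rexp(100, 0.1))  # two variables on different scales
z <- scale(x)                                   # n1: (x - mean)/sd, a linear transformation
all.equal(cor(x), cor(z))                                 # TRUE: correlations preserved
all.equal(apply(x, 2, skewness), apply(z, 2, skewness))   # TRUE: skewness preserved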

Research procedure allowing the isolation of the groups of normalization methods leading to similar clustering results of the set of objects

In cluster analysis carried out on the basis of a distance matrix, the research procedure which allows isolating the groups of variable normalization methods that lead to similar clustering results of the set of objects covers the following steps:

1. The stages P1 → P2 → P3 → P4 → P5 are carried out in accordance with the general cluster analysis procedure (1). All acceptable methods presented in Table 1 are used in the normalization of variables (18 normalization methods are possible for ratio variables, whereas for interval variables – 10 normalization methods).

Table 2: Distance measures for metric (interval, ratio) data

Name | Distance d_ik | Range | Allowed normalization | Package (function)
Minkowski (p ≥ 1) | (Σ_{j=1}^m |z_ij − z_kj|^p)^{1/p} | [0; ∞) | n1-n13 | stats (dist)
Manhattan (p = 1) | Σ_{j=1}^m |z_ij − z_kj| | [0; ∞) | n1-n13 | stats (dist)
Euclidean (p = 2) | √(Σ_{j=1}^m (z_ij − z_kj)²) | [0; ∞) | n1-n13 | stats (dist)
Chebyshev (maximum) (p → ∞) | max_j |z_ij − z_kj| | [0; ∞) | n1-n13 | stats (dist)
GDM1; Walesiak (2002); Jajuga, Walesiak and Bak (2003) | 1/2 − [Σ_{j=1}^m a_ikj·b_kij + Σ_{j=1}^m Σ_{l=1; l≠i,k}^n a_ilj·b_klj] / [2·(Σ_{j=1}^m Σ_{l=1}^n a_ilj² · Σ_{j=1}^m Σ_{l=1}^n b_klj²)^{1/2}], where a_irj = z_ij − z_rj for r = k, l and b_krj = z_kj − z_rj for r = i, l | [0; 1] | n1-n13 | clusterSim (dist.GDM)
Bray-Curtis; Bray and Curtis (1957); Cormack (1971), p. 367 | Σ_{j=1}^m |z_ij − z_kj| / Σ_{j=1}^m (z_ij + z_kj) | [0; 1] | n6-n11 | clusterSim (dist.BC)
Canberra; Lance and Williams (1966) | Σ_{j=1}^m |x_ij − x_kj| / (x_ij + x_kj) = Σ_{j=1}^m |z_ij − z_kj| / (z_ij + z_kj) | [0; 1] | n6-n11 | stats (dist)

i, k, l = 1, …, n – object number, n – number of objects; j = 1, …, m – variable number, m – number of variables; z_ij (z_kj, z_lj) – normalized value of the j-th variable for the i-th (k-th, l-th) object.

Source: authors’ compilation.

2. For all normalized data matrices Z_u, the distances between objects are calculated (see e.g. the distance measures in Table 2) and grouped in distance matrices D_u (u – number of the normalization method). 18 distance matrices are obtained for ratio variables and 10 distance matrices for interval variables. The Canberra distance measure does not depend on the scale parameter B_j (Pawelek (2008), p. 94); therefore the acceptable normalization methods n6-n11 do not change this distance value, and the Canberra distance will not be discussed further in the article.

3. Cluster analysis is carried out for each distance matrix D_u, clustering into 2 to V classes (the maximum V is n − 1). One of many classification methods based on a distance matrix can be applied in this case (see e.g. Everitt et al. (2011)). Next, the compatibility of the clustering results obtained for different normalization methods is compared in pairs for the same number of clusters using the adjusted Rand index (Hubert and Arabie (1985), p. 198):

AR_us^v = [Σ_{r=1}^v Σ_{t=1}^v C(n_rt, 2) − Σ_{r=1}^v C(n_r·, 2) · Σ_{t=1}^v C(n_·t, 2) / C(n, 2)] / {(1/2)·[Σ_{r=1}^v C(n_r·, 2) + Σ_{t=1}^v C(n_·t, 2)] − Σ_{r=1}^v C(n_r·, 2) · Σ_{t=1}^v C(n_·t, 2) / C(n, 2)}   (4)

where: u, s – numbers of normalization methods,
v = 2, …, V – number of classes, n – number of classified objects,
C(a, 2) = a(a − 1)/2 – binomial coefficient,
P_u^v, P_s^v – clusterings of the set of objects into v classes for the u-th and s-th normalization methods, r = 1, …, v – class number in clustering P_u^v, t = 1, …, v – class number in clustering P_s^v, n_rt – number of objects belonging simultaneously to class r in P_u^v and class t in P_s^v,
n_r· – number of objects in class r of clustering P_u^v, n_·t – number of objects in class t of clustering P_s^v.

The adjusted Rand index AR_us^v takes values in the interval (−∞; 1]. The comparison results are averaged over the clustering results from 2 to V classes:

AR̄_us = Σ_{v=2}^V AR_us^v / (V − 1)   (5)

The adjusted Rand index is available in the classAgreement function of the e1071 package (Meyer et al. (2019)).
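A toy sketch of its use (the two partitions below are invented for illustration): classAgreement takes the contingency table of two partitions and returns the Hubert-Arabie index in the crand component.

library(e1071)
p.u <- c(1, 1, 2, 2, 3, 3)  # partition of 6 objects under normalization u
p.s <- c(1, 1, 2, 3, 3, 3)  # partition of the same objects under normalization s
classAgreement(table(p.u, p.s))$crand  # adjusted Rand index (4)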

4. For the cluster analysis purposes, the averaged values of the adjusted Rand index are transformed into distances:

d_us = 1 − AR̄_us   (6)

Cluster analysis (using the same classification method as in step 3) is carried out on the basis of the distance matrix [d_us], which allows isolating the groups of variable normalization methods that lead to similar clustering results. Due to the possibility of presenting the classification results for normalization methods graphically in dendrogram form, agglomerative hierarchical methods are used in the article (the hclust function of the stats package). A compact sketch of steps 1-4 is given below.
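The sketch assumes example choices (Euclidean distance and the pam method); the function name proc.dist and its default method list are assumptions made here, not names from the paper.

library(clusterSim)  # data.Normalization
library(cluster)     # pam
library(e1071)       # classAgreement

proc.dist <- function(x, metnor = c("n1", "n2", "n3", "n5", "n5a",
                                    "n8", "n9", "n9a", "n11", "n12a"), Vmax = 10) {
  mn <- length(metnor)
  # steps 1-3: for each normalization, cluster the distance matrix into 2..Vmax classes
  parts <- lapply(metnor, function(m) {
    d <- dist(data.Normalization(x, type = m))  # Euclidean distances (Table 2)
    sapply(2:Vmax, function(v) pam(d, k = v)$clustering)
  })
  # step 3: pairwise adjusted Rand (4), averaged over v = 2..Vmax (5)
  AR <- matrix(1, mn, mn, dimnames = list(metnor, metnor))
  for (i in 1:(mn - 1)) for (j in (i + 1):mn) {
    AR[i, j] <- AR[j, i] <- mean(sapply(1:(Vmax - 1), function(k)
      classAgreement(table(parts[[i]][, k], parts[[j]][, k]))$crand))
  }
  # step 4: distances (6) and the dendrogram of normalization methods
  hclust(as.dist(1 - AR), method = "complete")
}
# usage: plot(proc.dist(as.matrix(x)))  # x - a numeric data matrix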


The observations regarding normalization methods for the GDM1 distance measure and the Bray-Curtis distance are presented in Table 3.

Table 3: Groups of normalization methods resulting in identical distances in the distance matrix determined using the GDM1 distance and the Bray-Curtis distance

Group | GDM1 distance | Bray-Curtis distance
A | n1, n6, n12 | –
B | n2, n6a | –
C | n3, n3a, n4, n7, n13 | –
D | n9, n10 | n9, n10

Source: authors’ compilation.

The identical distance matrices for the groups of methods A, B, C and D result from the fact that the GDM1 measure does not depend on the shift parameter A_j applied in the normalization methods. Moreover, multiplying the normalized values by a constant changes neither the GDM1 nor the Bray-Curtis distance.

It was demonstrated in Pawelek (2008), p. 94 that the values of the Minkowski distance measures (Manhattan, Euclidean, Chebyshev) do not depend on the shift parameter A_j applied in the normalization methods. Therefore, identical distance matrices are obtained for the groups of normalization methods presented in Table 4 (a numerical check is sketched after the table).

Table 4: The groups of normalization methods resulting in identical distance matrices for Minkowski distances

Group | N1 | N2
A | n1, n6 | n1, n6, n12*
B | n2, n6a | n2, n6a
C | n3, n3a, n4, n7 | n3, n3a, n4, n7, n13*
D | – | n9, n10*

* – for this normalization method the distance matrix is multiplied by a constant.
N2 – after dividing the distances in each distance matrix by the maximum value.

Source: authors’ compilation.
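The shift-invariance behind Table 4 can be checked numerically; in this sketch (simulated data assumed for illustration), n1 and n6 share the scale parameter 1/s_j and differ only in the shift parameter, so their Euclidean distance matrices coincide:

library(clusterSim)
set.seed(7)
x  <- matrix(rexp(60, 0.2), ncol = 3)
d1 <- dist(data.Normalization(x, type = "n1"))  # (x - mean)/sd
d6 <- dist(data.Normalization(x, type = "n6"))  # x/sd
all.equal(as.vector(d1), as.vector(d6))  # TRUE: the shift does not affect the distances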

In cluster analysis carried out on the basis of the normalized data matrix, the research procedure which allows isolating the groups of variable normalization methods that lead to similar clustering results of the set of objects covers the following steps:

1. The stages P1 → P2 → P3 → P4 → P5 are carried out in accordance with the general cluster analysis procedure (2). All acceptable methods presented in Table 1 are used in the normalization of variables (18 normalization methods are possible for ratio variables, whereas for interval variables – 10 normalization methods).

2. For each normalized data matrix Z_u, cluster analysis is carried out (e.g. using the k-means method) into 2 to V classes (the maximum V is n − 1). Next, the compatibility of the classification results obtained for different normalization methods is compared in pairs using the adjusted Rand index AR_us^v according to formula (4). The comparison results are averaged over the clustering results from 2 to V classes in accordance with (5).

3. For the cluster analysis purposes, the averaged values of the adjusted Rand index are transformed into distances according to formula (6). Cluster analysis (using classification methods based on a distance matrix) is conducted on the basis of the distance matrix [d_us], which allows isolating the groups of variable normalization methods leading to similar clustering results of the set of objects. Due to the possibility of presenting the classification results for normalization methods graphically in dendrogram form, agglomerative hierarchical methods are used in the article.

For the k-means classification method, identical clustering results of the set of objects are obtained for the groups of normalization methods presented in Table 5.

Table 5: Groups of normalization methods leading to identical clustering results of the set of objects for the k-means method

Group | Normalization methods
A | n1, n6, n12
B | n2, n6a
C | n3, n3a, n4, n7, n13
D | n9, n10

Source: authors’ compilation.

In the case of the k-means method, the identical clustering results of the set of objects for the groups of normalization methods A, B, C and D result from applying the criterion of minimizing the trace of the within-class covariance matrix (see formula 5.9 in Everitt et al. (2011), p. 114).
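A numerical sketch of this claim for group A (the toy data and the common starting centers are assumptions made here): with the same initial centers, k-means returns the same partition for n1, n6 and n12, since these normalizations differ only by a per-variable shift and/or a common rescaling.

library(clusterSim)
library(e1071)
set.seed(42)
x   <- matrix(rnorm(200, 10, 3), ncol = 2)
idx <- sample(nrow(x), 3)  # common initial centers, chosen as row indices
part <- lapply(c("n1", "n6", "n12"), function(m) {
  z <- as.matrix(data.Normalization(x, type = m))
  kmeans(z, centers = z[idx, ])$cluster
})
classAgreement(table(part[[1]], part[[2]]))$crand  # 1: identical partitions
classAgreement(table(part[[1]], part[[3]]))$crand  # 1: identical partitions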

An application to Boston data

The data set covering the real estate market in Boston is accessible at the UCI Machine Learning Repository: Housing Data Set. Moreover, the data set is available in the MASS package (Ripley (2019)). The set covers 14 variables; the study uses 12 metric variables. The omitted variables are the second, ZN (proportion of residential land zoned for lots over 25,000 sq. ft.), and the fourth, CHAS (Charles River dummy variable, 1 if tract bounds river; 0 otherwise):

CRIM – per capita crime rate by town,
INDUS – proportion of non-retail business acres per town,
NOX – nitric oxides concentration (parts per 10 million),
RM – average number of rooms per dwelling,
AGE – proportion of owner-occupied units built prior to 1940,
DIS – weighted distances to five Boston employment centers,
RAD – index of accessibility to radial highways,
TAX – full-value property-tax rate per $10,000,
PTRATIO – pupil-teacher ratio by town,
B – 1000(Bk − 0.63)², where Bk is the proportion of blacks by town,
LSTAT – % lower status of the population,
MEDV – median value of owner-occupied homes in $1000's.

The research procedure from part 4 was used in the article (cluster analysis carried out on the normalized data matrix using the k-means method – see the script in the Appendix), allowing the isolation of the groups of variable normalization methods leading to similar clustering results of the set of objects.

The measurement of variables on a ratio scale allows the application of all normalization methods (hence the study covered 18 methods). Since the groups of normalization methods A, B, C and D give identical clustering results of the set of objects for the k-means method (see Table 5), further analysis covers the first method of each indicated group (n1, n2, n3, n9), as well as the other methods (n5, n5a, n8, n9a, n11, n12a).

The groups of variable normalization methods leading to similar clustering results of the set of objects are presented by means of a dendrogram in Figure 1.


Fig. 1: The dendrogram of normalization method similarity in cluster analysis of 506 real estate objects in Boston using the k-means method

Source: authors’ compilation using R program.

Five groups of normalization methods, which lead to similar clustering results of the set of objects, were isolated on the basis of the dendrogram (normalization methods giving identical cluster analysis results are given in brackets):

group 1 (3 methods): (n2, n6a), n9a
group 2 (3 methods): (n9, n10), n11
group 3 (4 methods): (n1, n6, n12), n12a
group 4 (1 method): n8
group 5 (7 methods): (n3, n3a, n4, n7, n13), n5, n5a

The indices for determining the relevant number of clusters in a data set, provided in the packages NbClust (Charrad et al. (2014); Charrad et al. (2015)) and clusterSim (Walesiak and Dudek (2019)), can be used in the selection of the number of clusters.
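A usage sketch (the data object x, the normalization choice and the range 2-10 are assumptions for illustration): NbClust evaluates many internal indices and reports the number of clusters preferred by each.

library(NbClust)
library(clusterSim)
z  <- as.matrix(data.Normalization(x, type = "n1"))  # x: the analyzed data matrix, e.g. Boston
nb <- NbClust(z, distance = "euclidean", min.nc = 2, max.nc = 10, method = "kmeans")
nb$Best.nc  # the number of clusters suggested by each index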

In the analyzed case, significant differences between cluster analysis results occur for normalization methods from different groups. The presented proposal allows reducing the problem of the variable normalization method choice.

The results of simulation analyses

Data sets are generated under nine different scenarios (see Table 6) using the R program and the packages: clusterSim (Walesiak and Dudek (2019)), clusterGeneration (Qiu and Joe (2015)), sn (Azzalini (2019)) and stats (R Core Team (2019)). The simulation models differ in the number of clusters, the number of objects, the number of outliers, the number of variables, the number of noisy (irrelevant) variables, and the density and shape of clusters (data are generated from different distributions).

The research procedure discussed in section 4 (cluster analysis carried out on the normalized data matrix using the k-means method) was used in the simulation analyses for each model from Table 6:

1. The normalization of variables was carried out using 10 methods: n1, n2, n3, n5, n5a, n8, n9, n9a, n11, n12a.


2. For each normalized data matrix Z_u (u – the number of the normalization method), cluster analysis is carried out using the k-means method into 2 to V clusters (V = √n). Next, the compatibility of the clustering results obtained for different normalization methods is compared in pairs using the adjusted Rand index AR_us^v according to formula (4). The comparison results are averaged over the clustering results from 2 to V clusters in accordance with (5).

3. For the cluster analysis purposes, the averaged values of the adjusted Rand index are transformed into distances according to formula (6).

4. Steps 1-3 are repeated 20 times (twenty realizations were generated from each model). The resulting twenty distance matrices [d_us] are averaged into [d̄_us].

5. Cluster analysis (using the group average method) is carried out on the basis of the distance matrix [d̄_us], which allows for the isolation of the groups of variable normalization methods that lead to similar clustering results. A sketch of this replication loop is given below.
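In this sketch the model choice, the sample sizes and the helper dist.AR – assumed to implement steps 1-3 above and return the matrix of distances (6) for one data set – are all hypothetical:

library(clusterSim)
R <- 20                         # twenty realizations per model
dbar <- 0
for (r in 1:R) {
  gen <- cluster.Gen(model = 2, numObjects = 40, numNoisyVar = 1)  # one simulated data set
  dbar <- dbar + dist.AR(gen$data) / R   # dist.AR: hypothetical helper for steps 1-3
}
hc <- hclust(as.dist(dbar), method = "average")  # step 5: group average method
plot(hc)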


Table 6: Characteristics of data sets used in simulations

No. | Name | Distribution | nc | no | o | v | nv | Parameters | R package (function)
1 | sM | Multivariate normal | 5 | (40, 40, 40, 40, 40) | 0 | 2 | 1 | Cluster centroids: (10; 10), (10; 20), (15; 15), (20; 10), (20; 20); Σ = [0.5, 0.3; 0.2, 0.5] | clusterSim (cluster.Gen)
2 | eM | Multivariate normal | 3 | (50, 50, 50) | 10 | 3 | 1 | Cluster centroids: (1.5; 6; −3), (3; 12; −6), (4.5; 18; −9); Σ: σ_jj = 1 (1 ≤ j ≤ 3), σ_12 = σ_13 = −0.9, σ_23 = 0.9 | clusterSim (cluster.Gen)
3 | fM | Multivariate normal | 4 | (35, 40, 45, 50) | 20 | 2 | 1 | Cluster centroids: (10; 10), (15; 19), (25; 10), (15; 1); Σ_1 = [1, 0; 0, 1], Σ_2 = [1, −1; −1, 1], Σ_3 = [1, 0.33; 0.33, 1], Σ_4 = [1, 0.22; 0.22, 1] | clusterSim (cluster.Gen)
4 | bDS | Multivariate normal | 4 | (20, 30, 40, 65) | 15 | 3 | 1 | Degree of separation = 0.93 (Qiu and Joe (2006)) | clusterGeneration (genRandomClust)
5 | sDS | Multivariate normal | 5 | (35, 35, 35, 35, 35) | 0 | 5 | 0 | Degree of separation = 0.36 (Qiu and Joe (2006)) | clusterGeneration (genRandomClust)
6 | sT | Multivariate skew-t distribution | 3 | (100, 100, 100) | 0 | 2 | 0 | Ω (ω_jj = 1, ω_jl = 0.5), α = 0.5, 6 degrees of freedom | sn (rmst)
7 | sC | Skew-Cauchy distribution | 3 | (100, 100, 100) | 0 | 2 | 0 | Ω (ω_jj = 1, ω_jl = 0.3), α = 0.2, 7 degrees of freedom | sn (rmsc)
8 | u | Uniform distribution | 3 | (100, 100, 100) | 0 | 3 | 0 | Uniform distribution in the ranges (−3, −5, −6) ÷ (−2, −1, −4); (0, −3, 1) ÷ (2, 6, 3); (3, 2.5, 2) ÷ (7, 4.5, 8) | stats (runif)
9 | eC | Mixed distributions | 11 | (15, 15, 15, 15, 160, 160, 15, 15, 15, 40, 40) | 0 | 2 | 0 | Four clusters from the multivariate normal distribution, two "bulls-eye" clusters, three elongated clusters from the multivariate normal distribution, two "worms" clusters | clusterSim (cluster.Gen, shapes.bulls.eye, shapes.worms)

nc – the number of clusters, no – the number of objects in consecutive clusters, o – the number of outliers, v – the number of variables, nv – the number of noisy (irrelevant) variables, Σ – covariance matrix, Ω – a symmetric positive-definite matrix of dimension v × v, α – a numeric vector of length v which regulates the slant of the density. Models: sM – simpleMultivariate, eM – elongatedMultivariate, fM – fancyMultivariate, bDS – bigDegreeOfSeparation, sDS – smallDegreeOfSeparation, sT – skewT, sC – skewCauchy, u – uniff, eC – elevenClusters.

Source: authors’ compilation.


The groups of variable normalization methods leading to similar clustering results of the set of objects are presented by dendrograms in Figures 2 and 3 (for data sets generated via the nine models from Table 6 and for the average model).

Fig. 2: Dendrograms of variable normalization methods leading to similar clustering results for data sets generated via models 1-6 from Table 6

Source: authors’ compilation using R program


Fig. 3: Dendrograms of variable normalization methods leading to similar clustering results for data sets generated via models 7-9 from Table 6 and the average model

Source: authors’ compilation using R program

The following conclusions can be drawn from the results presented in Figures 2 and 3:

− based on the dendrogram for the "all" data set, we can isolate three groups of normalization methods which lead to similar clustering results of the set of objects: (n1, n3, n5, n5a, n12a); (n8, n9, n9a, n11); (n2),
− for the data sets generated from the nine models in Table 6, normalization methods n1 and n12a give similar clustering results of the set of objects,
− for the data sets generated from the nine models in Table 6, normalization methods n3, n5 and n5a give similar clustering results of the set of objects,
− for normalization method n2, significantly different cluster analysis results are obtained.

Final Remarks

The article presents two proposed research procedures which allow reducing the problem of the variable normalization method choice in cluster analysis carried out on the basis of a distance matrix or a normalized data matrix.


The normalization methods which give identical distance matrices, and thus identical clustering results of the set of objects, were identified for the GDM1 and Bray-Curtis distance measures. Analogous observations were presented for the Minkowski distance measures (Manhattan, Euclidean, Chebyshev).

The groups of normalization methods leading to identical clustering results of the set of objects were identified for the k-means method.

In practice, without the proposed procedures, when choosing the normalization method of variables in cluster analysis for metric data, one faces a list of 18 methods (see Table 1). The considerations included in Tables 3, 4 and 5 reduce this number to 10 normalization methods.

The choice nevertheless remains arbitrary and difficult to justify. The proposed approach does not solve the problem completely, but it can narrow down the choice of the normalization method. De facto, we can choose as many normalization methods as there are separate groups of them (normalization methods located in the same group give identical or similar clustering results).

The proposed reduction of the set of normalization methods can substantially decrease the complexity of simulation studies by eliminating redundant paths. It also explains why various normalization methods sometimes give identical results.

The research results were illustrated by a simulation study and an empirical example using the proposed R program scripts.

Appendix

Script (Boston data with 506 objects – cluster analysis based on metric data and normalized data matrix)

library(clusterSim)
library(MASS)                            # the Boston data are shipped with MASS
data(Boston)
x <- Boston[, c(-2, -4)]                 # drop the ZN (2nd) and CHAS (4th) variables
n <- nrow(x)

# normalization methods under comparison: the first method of each of the
# groups A-D from Table 5 plus the remaining methods
metnor <- c("n1", "n2", "n3", "n5", "n5a", "n8", "n9", "n9a", "n11", "n12a")
mn <- length(metnor)
maxNumberOfClasses <- 30

resultsClusters <- array(0, c(mn, maxNumberOfClasses - 1, n))
resultsRand <- array(0, c(mn, mn))

# k-means partitions into 2..30 classes for each normalization method;
# initial.Centers (clusterSim) fixes the starting centers, so the partitions
# obtained for different normalizations are comparable
for (i in 1:mn) {
  nn <- as.matrix(data.Normalization(x, type = metnor[i]))
  for (j in 2:maxNumberOfClasses) {
    clusters <- kmeans(nn, nn[initial.Centers(nn, j), ])$cluster
    resultsClusters[i, j - 1, ] <- clusters
  }
}

# pairwise adjusted Rand indices (formula (4)) averaged over 2..30 classes (formula (5))
for (i in 1:mn) {
  for (j in 1:mn) {
    avgRand <- 0
    for (k in 2:maxNumberOfClasses) {
      avgRand <- avgRand + comparing.Partitions(cl1 = resultsClusters[i, k - 1, ],
                                                cl2 = resultsClusters[j, k - 1, ],
                                                type = "crand")
    }
    resultsRand[i, j] <- avgRand / (maxNumberOfClasses - 1)
  }
}

print("Matrix of average values of the adjusted Rand index")
rownames(resultsRand) <- metnor
colnames(resultsRand) <- metnor
print(round(resultsRand, 3))

# distances (6) and the dendrogram of normalization methods (Fig. 1)
hc <- hclust(as.dist(1 - resultsRand), method = "complete")
plot(hc, labels = metnor, sub = NULL, ann = FALSE, axes = FALSE, main = NULL)
title(xlab = "Type of normalization method", ylab = "Height", font.main = 1)
axis(2)

Acknowledgments

The project is financed by the Ministry of Science and Higher Education in Poland under the program

“Regional Initiative of Excellence” 2019-2022, project number 015/RID/2018/19, total funding amount 10,721,040 PLN.

References

• Anderberg, M.R. (1973), Cluster Analysis for Applications, Academic Press, New York, San Francisco, London.

• Azzalini, A. (2019), sn: The Skew-Normal and Related Distributions such as the Skew-t, R package version 1.5-4, http://CRAN.R-project.org/package=sn.

• Borys, T. (1978), 'Metody normowania cech w statystycznych badaniach porownawczych [Methods of Characteristics Normalization in Statistical Comparative Studies]', Przeglad Statystyczny, 25 (2), 227-239.

• Bray, J.R. and Curtis, J.T. (1957), 'An Ordination of the Upland Forest Communities of Southern Wisconsin', Ecological Monographs, 27 (4), 325-349.

• Charrad, M., Ghazzali, N., Boiteau, V. and Niknafs, A. (2014), 'NbClust: An R Package for Determining the Relevant Number of Clusters in a Data Set', Journal of Statistical Software, 61 (6), 2-36, http://www.jstatsoft.org/article/view/v061i06.

• Charrad, M., Ghazzali, N., Boiteau, V. and Niknafs, A. (2015), NbClust Package for Determining the Best Number of Clusters, R package version 3.0, http://CRAN.R-project.org/package=NbClust.

• Cormack, R.R. (1971), 'A Review of Classification', Journal of the Royal Statistical Society. Series A, 134 (3), 321-367.

• Everitt, B.S., Landau, S., Leese, M. and Stahl, D. (2011), Cluster Analysis, John Wiley & Sons, Chichester.

• Gordon, A.D. (1999), Classification, Chapman and Hall/CRC, London.

• Grabinski, T. (1992), Metody taksonometrii [Taxonometric Methods], Wydawnictwo Akademii Ekonomicznej w Krakowie, Krakow.

• Hubert, L. and Arabie, P. (1985), 'Comparing Partitions', Journal of Classification, 2 (1), 193-218.

• Jajuga, K. (1981), Metody analizy wielowymiarowej w ilosciowych badaniach przestrzennych [Methods of Multidimensional Analysis in Spatial Research of Quantitative Data], Doctoral thesis, Akademia Ekonomiczna we Wroclawiu, Wroclaw.

• Jajuga, K. and Walesiak, M. (2000), Standardisation of Data Set under Different Measurement Scales, Decker, R. and Gaul, W. (eds.), Classification and Information Processing at the Turn of the Millennium, 105-112, Springer-Verlag, Berlin, Heidelberg.

• Jajuga, K., Walesiak, M. and Bak, A. (2003), On the General Distance Measure, Schwaiger, M. and Opitz, O. (eds.), Exploratory Data Analysis in Empirical Research, 104-109, Springer-Verlag, Berlin, Heidelberg.

• Karatzoglou, A., Smola, A. and Hornik, K. (2019), kernlab: Kernel-Based Machine Learning Lab, R package version 0.9-29, http://CRAN.R-project.org/package=kernlab.

• Kaufman, L. and Rousseeuw, P.J. (1990), Finding Groups in Data: an Introduction to Cluster Analysis, John Wiley & Sons, New York.

• Lance, G.N. and Williams, W.T. (1966), 'Computer Programs for Hierarchical Polythetic Classification ("Similarity Analyses")', The Computer Journal, 9 (1), 60-64.

• Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M. and Hornik, K. (2019), cluster: Cluster Analysis Basics and Extensions, R package version 2.1.0, http://CRAN.R-project.org/package=cluster.

• Meyer, D., Dimitriadou, E., Hornik, K., Weingessel, A. and Leisch, F. (2019), e1071: Misc Functions of the Department of Statistics, Probability Theory Group, R package version 1.7-3, http://CRAN.R-project.org/package=e1071.

• Milligan, G.W. and Cooper, M.C. (1988), 'A Study of Standardization of Variables in Cluster Analysis', Journal of Classification, 5 (2), 181-204.

• Milligan, G.W. (1996), Clustering Validation: Results and Implications for Applied Analyses, Arabie, P., Hubert, L.J. and de Soete, G. (eds.), Clustering and Classification, 341-375, World Scientific, Singapore.

• Mlodak, A. (2006), Analiza taksonomiczna w statystyce regionalnej [Taxonomic Analysis in Regional Statistics], Difin, Warszawa.

• Ng, A., Jordan, M. and Weiss, Y. (2002), On Spectral Clustering: Analysis and an Algorithm, Dietterich, T., Becker, S. and Ghahramani, Z. (eds.), Advances in Neural Information Processing Systems 14, 849-856, MIT Press.

• Pawelek, B. (2008), Metody normalizacji zmiennych w badaniach porownawczych zlozonych zjawisk ekonomicznych [Normalisation of Variables Methods in Comparative Research on Complex Economic Phenomena], Wydawnictwo Uniwersytetu Ekonomicznego w Krakowie, Krakow.

• Qiu, W. and Joe, H. (2006), 'Generation of Random Clusters with Specified Degree of Separation', Journal of Classification, 23 (2), 315-334, DOI 10.1007/s00357-006-0018-y.

• Qiu, W. and Joe, H. (2015), clusterGeneration: Random Cluster Generation (with Specified Degree of Separation), R package version 1.3.4, http://CRAN.R-project.org/package=clusterGeneration.

• R Core Team (2019), R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, http://www.R-project.org.

• Ripley, B. (2019), MASS: Support Functions and Datasets for Venables and Ripley's MASS, R package version 7.3-51.4, http://CRAN.R-project.org/package=MASS.

• Rybaczuk, M. (2002), 'Graficzna prezentacja struktury danych wielowymiarowych [Graphical Presentation of Multidimensional Data Structure]', Prace Naukowe Akademii Ekonomicznej we Wroclawiu, 942, 146-153.

• Schaffer, C.M. and Green, P.E. (1996), 'An Empirical Comparison of Variable Standardization Methods in Cluster Analysis', Multivariate Behavioral Research, 31 (2), 149-167.

• Stevens, S.S. (1946), 'On the Theory of Scales of Measurement', Science, 103 (2684), 677-680.

• Walesiak, M. (2002), Uogolniona miara odleglosci w statystycznej analizie wielowymiarowej [The Generalized Distance Measure in Multivariate Statistical Analysis], Wydawnictwo Akademii Ekonomicznej we Wroclawiu, Wroclaw.

• Walesiak, M. (2004), 'Problemy decyzyjne w procesie klasyfikacji zbioru obiektow [Decision Problems in a Cluster Analysis Procedure]', Prace Naukowe Akademii Ekonomicznej we Wroclawiu, 1010, 52-71.

• Walesiak, M. (2011), Uogolniona miara odleglosci GDM w statystycznej analizie wielowymiarowej z wykorzystaniem programu R [The Generalized Distance Measure GDM in Multivariate Statistical Analysis with R], Wydawnictwo Uniwersytetu Ekonomicznego we Wroclawiu, Wroclaw.

• Walesiak, M. (2014), 'Przeglad formul normalizacji wartosci zmiennych oraz ich wlasnosci w statystycznej analizie wielowymiarowej [Data Normalization in Multivariate Data Analysis. An Overview and Properties]', Przeglad Statystyczny, 61 (4), 363-372.

• Walesiak, M. (2018), 'The Choice of Normalization Method and Rankings of the Set of Objects Based on Composite Indicator Values', Statistics in Transition new series, 19 (4), 693-710, DOI 10.21307/stattrans-2018-036.

• Walesiak, M. and Dudek, A. (2019), clusterSim: Searching for Optimal Clustering Procedure for a Data Set, R package version 0.48-2, http://CRAN.R-project.org/package=clusterSim.

• Zelias, A. (2002), 'Some Notes on the Selection of Normalisation of Diagnostic Variables', Statistics in Transition, 5 (5), 787-802.
