
ACTA UNIVERSITATIS LODZIENSIS

FOLIA OECONOMICA 141, 1997

Agnieszka Rossa*

A MONTE CARLO INVESTIGATION OF TWO DISTANCE MEASURES BETWEEN STATISTICAL POPULATIONS AND THEIR APPLICATION TO CLUSTER ANALYSIS

Abstract. The paper presents a simulation study of one of the well-known hierarchical cluster analysis methods applied to classifying statistical populations. In particular, the problem of clustering univariate normal populations is studied. Two measures of the distance between statistical populations are considered: the Mahalanobis distance, which is defined for normally distributed populations under the assumption that the covariance matrices are equal, and the Kullback-Leibler divergence (the so-called Generalized Mahalanobis Distance), whose use extends to populations of any distribution.

The simulation study concerns a set of 15 univariate normal populations whose variances are changed in successive steps. The aim is to study the robustness of the nearest neighbour method to departures from the variance equality assumption when the Mahalanobis distance formula is applied. The differences between two cluster families, obtained for the same set of populations but with different distance matrices, are studied. The distance between the two final cluster families is measured by means of the Marczewski-Steinhaus distance.

Key words: hierarchical cluster analysis methods, robustness of the nearest neighbour method, the Mahalanobis distance, the Kullback-Leibler divergence, the Marczewski-Steinhaus distance measure.

1. THE BASIC NOTIONS

Let $n$ multivariate statistical populations $\Pi_1, \Pi_2, \ldots, \Pi_n$ be given, distributed according to density functions $f_1, f_2, \ldots, f_n$, respectively. The starting point of hierarchical cluster analysis procedures is the construction of a distance matrix $D$ whose elements express the distances between each pair of populations $\Pi_i$ and $\Pi_j$ ($i, j = 1, 2, \ldots, n$):

$$D = [d_{ij}], \qquad i = 1, 2, \ldots, n; \; j = 1, 2, \ldots, n.$$

One of the most popular distance formulae applied to measuring distances between statistical populations is the Mahalanobis distance measure (Mahalanobis 1936), defined under the assumption that the populations are normally distributed and have a common covariance matrix, $\Pi_i \sim N(\mu_i, \Sigma)$ for $i = 1, 2, \ldots, n$. The Mahalanobis distance between two populations $\Pi_i$ and $\Pi_j$ takes the form

$$\Delta(i, j) = (\mu_i - \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j) \qquad (1.1)$$

Kullback and Leibler (1951) introduced a distance measure between statistical populations called “divergence”. The Kullback-Leibler divergence is a more general measure than the Mahalanobis distance: its use is not limited to the case of normal populations with equal covariance matrices.

Let two multivariate populations $\Pi_i$ and $\Pi_j$ have the respective probability densities $f_i(x_1, x_2, \ldots, x_k)$ and $f_j(x_1, x_2, \ldots, x_k)$ which are equivalent, i.e.

$$\int_A f_i(x_1, x_2, \ldots, x_k)\,dx = 0 \iff \int_A f_j(x_1, x_2, \ldots, x_k)\,dx = 0$$

for any $A \in \mathcal{B}(R^k)$. Then the divergence between $\Pi_i$ and $\Pi_j$ was defined by Kullback and Leibler in the following form

$$J(i, j) = \int_{R^k} \bigl(f_i(x) - f_j(x)\bigr) \log \frac{f_i(x)}{f_j(x)}\,dx \qquad (1.2)$$

where $x = (x_1, x_2, \ldots, x_k)$.

In the case of normal populations (when $\Pi_i \sim N(\mu_i, \Sigma_i)$ and $\Pi_j \sim N(\mu_j, \Sigma_j)$), the divergence formula becomes

$$J(i, j) = \tfrac{1}{2} \operatorname{tr}\bigl[(\Sigma_i - \Sigma_j)(\Sigma_j^{-1} - \Sigma_i^{-1})\bigr] + \tfrac{1}{2} (\mu_i - \mu_j)^T (\Sigma_i^{-1} + \Sigma_j^{-1}) (\mu_i - \mu_j) \qquad (1.3)$$

and in the particular case when $\Sigma_i = \Sigma_j = \Sigma$, the divergence $J(i, j)$ has the form

$$J(i, j) = (\mu_i - \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j) = \Delta(i, j) \qquad (1.4)$$

One can easily see from equality (1.4) that the Mahalanobis distance is a special case of the Kullback-Leibler divergence for normal populations with a common covariance matrix.
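The relationship between (1.3) and (1.4) can be checked numerically in the univariate case, where the covariance matrices reduce to scalar variances. The following is a minimal sketch, not code from the paper; the function names `mahalanobis_1d` and `kl_divergence_1d` are introduced here purely for illustration.

```python
def mahalanobis_1d(mu_i, mu_j, var):
    """Formula (1.1) for univariate populations with a common variance."""
    return (mu_i - mu_j) ** 2 / var


def kl_divergence_1d(mu_i, var_i, mu_j, var_j):
    """Formula (1.3) for univariate normal populations (scalar variances)."""
    # Trace term: 1/2 * (var_i - var_j) * (1/var_j - 1/var_i)
    trace_term = 0.5 * (var_i - var_j) * (1.0 / var_j - 1.0 / var_i)
    # Mean term: 1/2 * (mu_i - mu_j)^2 * (1/var_i + 1/var_j)
    mean_term = 0.5 * (mu_i - mu_j) ** 2 * (1.0 / var_i + 1.0 / var_j)
    return trace_term + mean_term


# With equal variances the trace term vanishes and (1.3) reduces to (1.4).
same_var = kl_divergence_1d(5.0, 4.0, 1.0, 4.0)
print(same_var, mahalanobis_1d(5.0, 1.0, 4.0))
```

When the two variances differ, the trace term is strictly positive, so the divergence no longer depends on the means alone.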

2. DEPARTURE FROM THE COVARIANCE EQUALITY ASSUMPTION - A SIMULATION STUDY

The Mahalanobis distance is often used in cluster analysis for measuring distances between statistical populations. However, its use is limited to the case of normal populations with equal covariance matrices. In practice, the Mahalanobis distance is employed even when these assumptions are not satisfied. The following question arises: how much does disregarding the above-mentioned assumptions affect the final results of clustering? Do they deviate much from the correct results or not? Let us consider the following example.

Example

Let 4 univariate normal populations $\Pi_1, \Pi_2, \Pi_3, \Pi_4$ be given

$$\Pi_1 \sim N(5, 1), \quad \Pi_2 \sim N(1, 5), \quad \Pi_3 \sim N(6, 9), \quad \Pi_4 \sim N(0, 15)$$

It can easily be seen that the given standard deviations (the second parameter of each population) differ greatly from one another. Thus, the variance equality assumption is not satisfied. In that case the Mahalanobis distance formula provides a wrong distance matrix for the given set of objects $\Pi_1, \Pi_2, \Pi_3, \Pi_4$. In spite of this, let us disregard the above-mentioned assumption and evaluate the distance matrix by means of the Mahalanobis distance formula. For this purpose it is necessary to adopt a variance value which would be common to all the populations $\Pi_1, \Pi_2, \Pi_3, \Pi_4$. In practice such a common variance is evaluated as the mean of all the variances. The common variance becomes

$$\sigma^2 = \tfrac{1}{4}(\sigma_1^2 + \sigma_2^2 + \sigma_3^2 + \sigma_4^2) = \tfrac{1}{4}(1 + 25 + 81 + 225) = 83.$$

According to the Mahalanobis distance formula (1.1), adjusted to the univariate case, we obtain the following distances

| | $\Pi_1$ | $\Pi_2$ | $\Pi_3$ |
|---|---|---|---|
| $\Pi_2$ | 0.193 | | |
| $\Pi_3$ | 0.012 | 0.301 | |
| $\Pi_4$ | 0.301 | 0.012 | 0.434 |
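The distances in the table above can be reproduced with a few lines of Python. This is a minimal sketch assuming, as the computed common variance of 83 implies, that the second parameter of each population is its standard deviation; the variable names are illustrative only.

```python
from itertools import combinations

# Example populations: means and standard deviations of Pi_1 .. Pi_4.
means = [5.0, 1.0, 6.0, 0.0]
std_devs = [1.0, 5.0, 9.0, 15.0]

# Common variance taken as the mean of the individual variances.
pooled_var = sum(s ** 2 for s in std_devs) / len(std_devs)

# Univariate form of (1.1): squared mean difference over the common variance.
mahalanobis = {
    (i + 1, j + 1): (means[i] - means[j]) ** 2 / pooled_var
    for i, j in combinations(range(len(means)), 2)
}

print("common variance:", pooled_var)
for pair, dist in sorted(mahalanobis.items()):
    print(pair, round(dist, 3))
```

Because the common variance is the same for every pair, the ordering of these distances depends only on the differences between the means.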

Now, using one of the well-known hierarchical cluster analysis methods (e.g. the nearest neighbour method), the following family of clusters is obtained

$$A = \{(\Pi_1, \Pi_3), (\Pi_2, \Pi_4), (\Pi_1, \Pi_2, \Pi_3, \Pi_4)\} \qquad (2.1)$$

The results can also be presented graphically in the form of “a tree with a root”.

The family $A$ in (2.1) and its graphical representation (see Fig. 1) are the final results of clustering.

Figure 1. Graphical representation of the cluster family $A$ (leaf order: $\Pi_1, \Pi_3, \Pi_2, \Pi_4$)

We cannot forget, however, that the results may deviate considerably from the correct ones, because the assumption concerning the equality of the population variances is not satisfied. In this example it seems more reasonable to use the Kullback-Leibler divergence formula (1.3), derived for normally distributed populations with unequal covariance matrices. The correct distances calculated for the same set of objects by means of the Kullback-Leibler formula (1.3), adjusted to the univariate case, are the following

| | $\Pi_1$ | $\Pi_2$ | $\Pi_3$ |
|---|---|---|---|
| $\Pi_2$ | 19.840 | | |
| $\Pi_3$ | 40.012 | 1.429 | |
| $\Pi_4$ | 124.058 | 3.578 | 0.871 |

This leads to the following family of clusters

$$A' = \{(\Pi_3, \Pi_4), (\Pi_2, \Pi_3, \Pi_4), (\Pi_1, \Pi_2, \Pi_3, \Pi_4)\} \qquad (2.2)$$

represented in Fig. 2.
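The Kullback-Leibler distances for the example can be recomputed from the univariate form of (1.3). The sketch below is illustrative only: it assumes the second parameter of each population is a standard deviation, and the helper name `kl_divergence_1d` is not taken from the paper.

```python
from itertools import combinations

# (mean, standard deviation) of the example populations Pi_1 .. Pi_4.
params = {1: (5.0, 1.0), 2: (1.0, 5.0), 3: (6.0, 9.0), 4: (0.0, 15.0)}


def kl_divergence_1d(mu_i, var_i, mu_j, var_j):
    """Univariate form of (1.3): trace term plus mean term."""
    trace_term = 0.5 * (var_i - var_j) * (1.0 / var_j - 1.0 / var_i)
    mean_term = 0.5 * (mu_i - mu_j) ** 2 * (1.0 / var_i + 1.0 / var_j)
    return trace_term + mean_term


divergence = {}
for i, j in combinations(sorted(params), 2):
    (mu_i, sd_i), (mu_j, sd_j) = params[i], params[j]
    divergence[(i, j)] = kl_divergence_1d(mu_i, sd_i ** 2, mu_j, sd_j ** 2)

for pair, value in sorted(divergence.items()):
    print(pair, round(value, 3))
```

Here the smallest divergence is between $\Pi_3$ and $\Pi_4$, whereas the Mahalanobis distance with a common variance ranks that pair as the most distant.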

Figure 2. Graphical representation of the cluster family $A'$ (leaf order: $\Pi_3, \Pi_4, \Pi_2, \Pi_1$)
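The two cluster families can be recovered from the two distance tables with a small nearest neighbour (single linkage) routine. This is a hedged sketch rather than the procedure actually used in the study: the function `nearest_neighbour_clusters` is hypothetical, ties are broken by taking the first minimal pair, and the distances are copied from the tables of the example.

```python
def nearest_neighbour_clusters(labels, dist):
    """Single-linkage agglomeration; returns the merged clusters in merge order."""
    clusters = [frozenset([x]) for x in labels]
    family = []

    def linkage(a, b):
        # Nearest neighbour rule: smallest distance between any two members.
        return min(dist[tuple(sorted((x, y)))] for x in a for y in b)

    while len(clusters) > 1:
        pairs = [(p, q) for k, p in enumerate(clusters) for q in clusters[k + 1:]]
        a, b = min(pairs, key=lambda pq: linkage(*pq))
        merged = a | b
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
        family.append(merged)
    return family


mahalanobis = {(1, 2): 0.193, (1, 3): 0.012, (1, 4): 0.301,
               (2, 3): 0.301, (2, 4): 0.012, (3, 4): 0.434}
kullback_leibler = {(1, 2): 19.840, (1, 3): 40.012, (1, 4): 124.058,
                    (2, 3): 1.429, (2, 4): 3.578, (3, 4): 0.871}

family_a = nearest_neighbour_clusters([1, 2, 3, 4], mahalanobis)
family_a_prime = nearest_neighbour_clusters([1, 2, 3, 4], kullback_leibler)
print([sorted(c) for c in family_a])
print([sorted(c) for c in family_a_prime])
```

Each family contains $n - 1 = 3$ clusters, one per merge step, which is the form required by the Marczewski-Steinhaus distance.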

We can see that the last results differ from the previous ones (see Fig. 1). The cluster family $A$ differs from the cluster family $A'$, although both families were obtained for the same set of objects $\Pi_1, \Pi_2, \Pi_3, \Pi_4$. The important question is how much they differ. To answer it, we need a measure that expresses the degree of similarity between the families $A$ and $A'$.

For this purpose we applied the so-called Marczewski-Steinhaus distance measure, defined for two families of subsets of the same set.

The Marczewski-Steinhaus distance measure

The Marczewski-Steinhaus distance measure is defined for two families of subsets of the same set (Marczewski and Steinhaus 1958, Karoński and Palka 1977). Let us denote by $F_i$ the $i$-th cluster of the family $A$ and by $E_i$ the $i$-th cluster of the family $A'$. The distance between the families $A$ and $A'$ takes the form

$$d(A, A') = \frac{1}{n-1} \min_{p \in P} \sum_{i=1}^{n-1} \frac{\operatorname{card}(F_i - E_{p(i)}) + \operatorname{card}(E_{p(i)} - F_i)}{\operatorname{card}(F_i \cup E_{p(i)})} \qquad (2.3)$$

and $d(A, A') \in \langle 0, 1 \rangle$, where $p$ is a permutation of the first $n - 1$ integers and $P$ is the set of all such permutations.
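Formula (2.3) translates directly into a brute-force search over permutations. The sketch below is one possible reading of the definition, with clusters represented as Python `frozenset` objects and the hypothetical function name `marczewski_steinhaus`; it is practical only for small families, because the number of permutations grows factorially.

```python
from itertools import permutations


def marczewski_steinhaus(family_a, family_b):
    """Brute-force evaluation of (2.3) for two equally sized families of non-empty sets."""
    if len(family_a) != len(family_b):
        raise ValueError("both families must contain the same number of clusters")

    def term(f, e):
        # card(F - E) + card(E - F) is the size of the symmetric difference.
        return len(f ^ e) / len(f | e)

    best = min(
        sum(term(f, e) for f, e in zip(family_a, perm))
        for perm in permutations(family_b)
    )
    return best / len(family_a)


same = [frozenset({1, 2}), frozenset({1, 2, 3})]
print(marczewski_steinhaus(same, same))  # identical families

disjoint_a = [frozenset({1}), frozenset({2})]
disjoint_b = [frozenset({3}), frozenset({4})]
print(marczewski_steinhaus(disjoint_a, disjoint_b))  # no common elements
```

The two calls illustrate the bounds of the measure: identical families give 0 and families whose clusters share no elements give 1.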

Let us evaluate the Marczewski-Steinhaus distance $d(A, A')$ for the two families of subsets (2.1) and (2.2) given in the example. The first family is the following

$$A = \{(\Pi_1, \Pi_3), (\Pi_2, \Pi_4), (\Pi_1, \Pi_2, \Pi_3, \Pi_4)\}$$

and the second one is

$$A' = \{(\Pi_3, \Pi_4), (\Pi_2, \Pi_3, \Pi_4), (\Pi_1, \Pi_2, \Pi_3, \Pi_4)\}$$

Now we consider the 6 permutations of the subsets of the family $A'$. Thus

$$A'_{p_1} = \{(\Pi_3, \Pi_4), (\Pi_2, \Pi_3, \Pi_4), (\Pi_1, \Pi_2, \Pi_3, \Pi_4)\}$$
$$A'_{p_2} = \{(\Pi_3, \Pi_4), (\Pi_1, \Pi_2, \Pi_3, \Pi_4), (\Pi_2, \Pi_3, \Pi_4)\}$$
$$A'_{p_3} = \{(\Pi_2, \Pi_3, \Pi_4), (\Pi_3, \Pi_4), (\Pi_1, \Pi_2, \Pi_3, \Pi_4)\}$$
$$A'_{p_4} = \{(\Pi_2, \Pi_3, \Pi_4), (\Pi_1, \Pi_2, \Pi_3, \Pi_4), (\Pi_3, \Pi_4)\}$$
$$A'_{p_5} = \{(\Pi_1, \Pi_2, \Pi_3, \Pi_4), (\Pi_2, \Pi_3, \Pi_4), (\Pi_3, \Pi_4)\}$$
$$A'_{p_6} = \{(\Pi_1, \Pi_2, \Pi_3, \Pi_4), (\Pi_3, \Pi_4), (\Pi_2, \Pi_3, \Pi_4)\}$$

According to formula (2.3) we obtain the following scheme of calculations for the first permutation (see Tab. 1).

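Table 1

Scheme of intermediate calculations of the Marczewski-Steinhaus distance for two families of clusters: the family $A$ given in (2.1) and the first permutation $A'_{p_1}$ of the family $A'$ given in (2.2).

| Quantity | $i = 1$ | $i = 2$ | $i = 3$ |
|---|---|---|---|
| $F_i$ | $(\Pi_1, \Pi_3)$ | $(\Pi_2, \Pi_4)$ | $(\Pi_1, \Pi_2, \Pi_3, \Pi_4)$ |
| $E_{p_1(i)}$ | $(\Pi_3, \Pi_4)$ | $(\Pi_2, \Pi_3, \Pi_4)$ | $(\Pi_1, \Pi_2, \Pi_3, \Pi_4)$ |
| $F_i - E_{p_1(i)}$ | $\Pi_1$ | $\emptyset$ | $\emptyset$ |
| $c_1 = \operatorname{card}(F_i - E_{p_1(i)})$ | 1 | 0 | 0 |
| $E_{p_1(i)} - F_i$ | $\Pi_4$ | $\Pi_3$ | $\emptyset$ |
| $c_2 = \operatorname{card}(E_{p_1(i)} - F_i)$ | 1 | 1 | 0 |
| $F_i \cup E_{p_1(i)}$ | $(\Pi_1, \Pi_3, \Pi_4)$ | $(\Pi_2, \Pi_3, \Pi_4)$ | $(\Pi_1, \Pi_2, \Pi_3, \Pi_4)$ |
| $c_3 = \operatorname{card}(F_i \cup E_{p_1(i)})$ | 3 | 3 | 4 |
| $(c_1 + c_2) / c_3$ | 2/3 | 1/3 | 0 |

We obtain that

$$S_1 = \sum_{i=1}^{3} \frac{\operatorname{card}(F_i - E_{p_1(i)}) + \operatorname{card}(E_{p_1(i)} - F_i)}{\operatorname{card}(F_i \cup E_{p_1(i)})} = \frac{2}{3} + \frac{1}{3} + 0 = 1$$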

Continuing the calculations for all the permutations of the clusters of the family $A'$ we obtain

$$S_2 = \frac{2}{3} + \frac{2}{4} + \frac{1}{4} = \frac{17}{12}, \qquad S_3 = \frac{3}{4} + \frac{2}{3} + 0 = \frac{17}{12}, \qquad S_4 = \frac{3}{4} + \frac{2}{4} + \frac{2}{4} = \frac{7}{4},$$

$$S_5 = \frac{2}{4} + \frac{1}{3} + \frac{2}{4} = \frac{4}{3}, \qquad S_6 = \frac{2}{4} + \frac{2}{3} + \frac{1}{4} = \frac{17}{12}.$$

Finally, the Marczewski-Steinhaus distance has the value

$$d(A, A') = \frac{1}{3} \min\left\{1, \frac{17}{12}, \frac{17}{12}, \frac{7}{4}, \frac{4}{3}, \frac{17}{12}\right\} = \frac{1}{3} \approx 0.33.$$

Thus, the distance between the family $A$ and the family $A'$, or between the trees $G$ and $G'$, is equal to 0.33. It follows from the analysed example that the results of cluster analysis based on the Mahalanobis distance can deviate considerably from the correct results if the assumption concerning variance equality is not satisfied.
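The six sums and the final value can be verified with exact rational arithmetic. This is a small illustrative check, not part of the original study; populations are represented by their indices, and the sums are printed in sorted order because `itertools.permutations` enumerates the permutations in a different order from the one used above.

```python
from fractions import Fraction
from itertools import permutations

# Cluster families (2.1) and (2.2), with populations represented by their indices.
A = [frozenset({1, 3}), frozenset({2, 4}), frozenset({1, 2, 3, 4})]
A_prime = [frozenset({3, 4}), frozenset({2, 3, 4}), frozenset({1, 2, 3, 4})]

# One sum per permutation of A'; len(f ^ e) is card(F - E) + card(E - F).
sums = [
    sum(Fraction(len(f ^ e), len(f | e)) for f, e in zip(A, perm))
    for perm in permutations(A_prime)
]

distance = min(sums) / len(A)
print(sorted(sums))
print(distance, float(distance))
```

The minimum is attained by the identity permutation, which pairs each cluster of $A$ with the cluster of $A'$ at the same position.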


Simulation study

In this section we present the results of a computer simulation study performed in the same way as in the example, but for a larger number of univariate populations. Let us assume that all populations are normally distributed with the following expected values

$$m_1 = 4.86, \quad m_2 = 4.87, \quad m_3 = 4.88, \quad m_4 = 4.89, \quad m_5 = 4.90,$$
$$m_6 = 4.91, \quad m_7 = 4.92, \quad m_8 = 4.93, \quad m_9 = 4.94, \quad m_{10} = 4.95,$$
$$m_{11} = 4.96, \quad m_{12} = 4.97, \quad m_{13} = 4.98, \quad m_{14} = 4.99, \quad m_{15} = 5.00$$

The aim is to study the sensitivity of cluster analysis methods (with the Mahalanobis distance matrix applied) to departures from the variance equality assumption. The results of such an investigation for the nearest neighbour method are presented in Tab. 2.

Table 2

The Marczewski-Steinhaus distance values expressing robustness of the nearest neighbour method to departure from the variance equality assumption

| The variances of the populations | | | | | | | The Marczewski-Steinhaus distance |
|---|---|---|---|---|---|---|---|
| 4 | 4 | 4 | 4 | 4 | 4 | 4 | 0.00 |
| 6 | 4 | 4 | 4 | 4 | 4 | 4 | 0.14 |
| 8 | 6 | 4 | 4 | 4 | 4 | 4 | 0.24 |
| 8 | 8 | 4 | 4 | 4 | 4 | 4 | 0.24 |
| 8 | 6 | 6 | 4 | 4 | 4 | 4 | 0.24 |
| 8 | 8 | 6 | 4 | 4 | 4 | 4 | 0.28 |
| 8 | 8 | 8 | 4 | 4 | 4 | 4 | 0.28 |
| 8 | 8 | 6 | 6 | 4 | 4 | 4 | 0.31 |
| 8 | 8 | 8 | 6 | 6 | 4 | 4 | 0.29 |
| 6 | 6 | 8 | 8 | 8 | 4 | 4 | 0.34 |

The numbers in the last column of Tab. 2 represent the values of the Marczewski-Steinhaus distance between two families of clusters. Both families were obtained for the same set of populations by means of the nearest neighbour method but with different distance matrices. In the first case the Mahalanobis distance formula (1.1) was applied under the assumption that all population variances are equal. In the second case the Kullback-Leibler distance formula (1.3) was used (the so-called Generalized Mahalanobis Distance), whose use extends to normal populations with differing covariance matrices.

3. FINAL REMARKS

The simulation results lead to the conclusion that the nearest neighbour method based on the Mahalanobis distance measure is not robust to departures from the variance equality assumption.

REFERENCES

Beran R. (1977): Minimum Hellinger distance estimates for parametric models, The Annals of Stat., Vol. 5, p. 445-463.

Bhattacharyya A. (1943): On a measure of divergence between two statistical populations defined by their probability distributions, Bull. Calcutta Math. Soc., Vol. 35, p. 99-109.

Everitt B. S. (1979): A Monte Carlo investigation of the robustness of Hotelling's one- and two-sample T² tests, Journal Amer. Stat. Assoc., Vol. 74, p. 48-51.

Holloway L. S., Dunn O. J. (1967): The robustness of Hotelling's T², Journal Amer. Stat. Assoc., Vol. 62, p. 124-136.

Hopkins J. W., Clay P. P. F. (1963): Some empirical distributions of bivariate T² and homoscedasticity criterion M under unequal variance and leptokurtosis, Journal Amer. Stat. Assoc., Vol. 58, p. 1048-1053.

Johnson M. E., Wang C., Ramberg J. S. (1979): Robustness of Fisher's linear discriminant function to departures from normality, Informal Report LA-8068-MS, Los Alamos Scientific Laboratory, University of California, October 1979.

Kobayashi H. M. (1970): Distance measures and asymptotic relative efficiency, IEEE Trans. Inform. Theory, Vol. IT-16, p. 288-291.

Kailath T. (1967): The divergence and Bhattacharyya distance measures in signal selection, IEEE Trans. Communication Technology, Vol. COM-15, p. 52-60.

Karoński M., Palka A. (1977): On Marczewski-Steinhaus type distance between hypergraphs, „Zastosowania Matematyki”, XVI, 1, p. 47-57.

Koichi I. (1969): On the effect of heteroscedasticity and non-normality upon some multivariate test procedures, Multivariate Analysis, Vol. 2, Academic Press, New York, p. 87-120.

Kullback S., Leibler R. A. (1951): On information and sufficiency, Annals of Math. Stat., Vol. 22, p. 79-86.

Kullback S. (1952): An application of information theory to multivariate analysis, Annals of Math. Stat., Vol. 23, p. 88-102.

Mahalanobis P. C. (1936): On the generalized distance in statistics, Proc. Nat. Inst. Sci. India, Vol. 12, p. 49-55.

Marczewski E., Steinhaus H. (1958): On a certain distance of sets and the corresponding


Agnieszka Rossa

DISTANCE MEASURES BETWEEN STATISTICAL POPULATIONS AND THEIR APPLICATION IN CLUSTER ANALYSIS - A MONTE CARLO STUDY

The paper presents the results of a simulation study of one of the methods of hierarchical grouping of statistical populations, namely the nearest neighbour method. The starting point is the construction of a matrix of distances between objects (here, between statistical populations). The aim of the paper was to examine the robustness of this agglomerative method to departures from the assumptions that condition the use of a given distance measure. Two distance measures were considered: the Mahalanobis distance, defined for normal populations with equal covariance matrices, and the Kullback-Leibler distance, which generalizes the Mahalanobis distance to populations of arbitrary distributions. The main emphasis was placed on examining the robustness of the method to departures from the assumption of equal covariance matrices. The simulation study was carried out for a predetermined set of 15 univariate normal populations whose variances were changed in successive steps. The aim of the study was to determine the degree of difference between the cluster families obtained for a given set of populations but with a different distance matrix. The Marczewski-Steinhaus distance was used as the measure of the degree of difference between the cluster families obtained.
