A Monte Carlo investigation of two distance measures between statistical populations and their application to cluster analysis


ACTA UNIVERSITATIS LODZIENSIS

FOLIA OECONOMICA 141, 1997

Agnieszka Rossa*

A MONTE CARLO INVESTIGATION OF TWO DISTANCE MEASURES BETWEEN STATISTICAL POPULATIONS AND THEIR APPLICATION TO CLUSTER ANALYSIS

Abstract. The paper deals with a simulation study of one of the well-known hierarchical cluster analysis methods applied to classifying statistical populations. In particular, the problem of clustering univariate normal populations is studied. Two measures of the distance between statistical populations are considered: the Mahalanobis distance, which is defined for normally distributed populations under the assumption that the covariance matrices are equal, and the Kullback-Leibler divergence (the so-called Generalized Mahalanobis Distance), the use of which extends to populations of any distribution.

The simulation study is concerned with a set of 15 univariate normal populations, the variances of which are changed during successive steps. The aim is to study the robustness of the nearest neighbour method to departure from the variance equality assumption when the Mahalanobis distance formula is applied. The differences between two cluster families, obtained for the same set of populations but with different distance matrices applied, are studied. The distance between the two final cluster families is measured by means of the Marczewski-Steinhaus distance.

Key words: hierarchical cluster analysis methods, robustness of the nearest neighbour method, the Mahalanobis distance, the Kullback-Leibler divergence, the Marczewski-Steinhaus distance measure.

1. THE BASIC NOTIONS

Let n multivariate statistical populations $\Pi_1, \Pi_2, \ldots, \Pi_n$ be given, distributed according to density functions $f_1, f_2, \ldots, f_n$, respectively. The starting point of hierarchical cluster analysis procedures is the construction of a distance matrix D, the elements of which express the distances between each pair of populations $\Pi_i$ and $\Pi_j$ ($i, j = 1, 2, \ldots, n$):

$$D = [d_{ij}], \quad i = 1, 2, \ldots, n; \; j = 1, 2, \ldots, n.$$

One of the most popular distance formulae applied to measuring distances between statistical populations is the Mahalanobis distance measure (Mahalanobis 1936), defined under the assumption that the populations are normally distributed and have a common covariance matrix: $\Pi_i \sim N(\mu_i, \Sigma)$ for $i = 1, 2, \ldots, n$. The Mahalanobis distance between two populations $\Pi_i$ and $\Pi_j$ takes the form

$$\Delta(i, j) = (\mu_i - \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j) \qquad (1.1)$$

Kullback and Leibler (1951) introduced a distance measure between statistical populations called "divergence". The Kullback-Leibler divergence is a more general measure than the Mahalanobis distance: its use is not limited to the case of normal populations with equal covariance matrices.

Let two multivariate populations $\Pi_i$ and $\Pi_j$ have the respective probability densities $f_i(x_1, x_2, \ldots, x_k)$ and $f_j(x_1, x_2, \ldots, x_k)$ which are equivalent, i.e.

$$\int_A f_i(x_1, x_2, \ldots, x_k)\,dx = 0 \iff \int_A f_j(x_1, x_2, \ldots, x_k)\,dx = 0$$

for any $A \in \mathcal{B}(R^k)$. Then the divergence between $\Pi_i$ and $\Pi_j$ was defined by Kullback and Leibler in the following form

$$J(i, j) = \int_{R^k} \big(f_i(x) - f_j(x)\big) \log\frac{f_i(x)}{f_j(x)}\,dx \qquad (1.2)$$

where $x = (x_1, x_2, \ldots, x_k)$.

In the case of normal populations (when $\Pi_i \sim N(\mu_i, \Sigma_i)$ and $\Pi_j \sim N(\mu_j, \Sigma_j)$), the divergence formula becomes

$$J(i, j) = \tfrac{1}{2}\,\mathrm{tr}\big[(\Sigma_i - \Sigma_j)(\Sigma_j^{-1} - \Sigma_i^{-1})\big] + \tfrac{1}{2}(\mu_i - \mu_j)^T (\Sigma_i^{-1} + \Sigma_j^{-1})(\mu_i - \mu_j) \qquad (1.3)$$

and in the particular case, when $\Sigma_i = \Sigma_j = \Sigma$, the divergence $J(i, j)$ has the form

$$J(i, j) = (\mu_i - \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j) = \Delta(i, j) \qquad (1.4)$$

One can easily see from the equality (1.4) that the Mahalanobis distance is a special case of the Kullback-Leibler divergence for normal populations with a common covariance matrix.
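The step from (1.3) to (1.4) can be spelled out briefly. The lines below are only a check that uses the symbols already defined: with a common covariance matrix the trace term vanishes and the two inverse matrices in the second term add up to $2\Sigma^{-1}$.

```latex
\begin{align*}
\Sigma_i = \Sigma_j = \Sigma
  &\;\Longrightarrow\;
  \tfrac{1}{2}\,\mathrm{tr}\big[(\Sigma - \Sigma)(\Sigma^{-1} - \Sigma^{-1})\big] = 0, \\
J(i, j)
  &= \tfrac{1}{2}(\mu_i - \mu_j)^T \big(2\Sigma^{-1}\big)(\mu_i - \mu_j)
   = (\mu_i - \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j)
   = \Delta(i, j).
\end{align*}
```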


2. DEPARTURE FROM THE COVARIANCE EQUALITY ASSUMPTION - A SIMULATION STUDY

The Mahalanobis distance is often used in cluster analysis for measuring distances between statistical populations. However, its use is limited to the case of normal populations with equal covariance matrices. In practice, the Mahalanobis distance is employed even when these assumptions are not satisfied. The following question arises: how much does disregarding the above-mentioned assumptions affect the final results of clustering? Do they deviate much from the correct results or not? Let us consider the following example.

Example

Let 4 univariate normal populations $\Pi_1, \Pi_2, \Pi_3, \Pi_4$ be given, the second parameter being the standard deviation:

$\Pi_1 \sim N(5, 1)$, $\Pi_2 \sim N(1, 5)$, $\Pi_3 \sim N(6, 9)$, $\Pi_4 \sim N(0, 15)$

It can easily be seen that the given standard deviations differ considerably from one another. Thus, the variance equality assumption is not satisfied. In that case the Mahalanobis distance formula provides us with the wrong distance matrix for the given set of objects $\Pi_1, \Pi_2, \Pi_3, \Pi_4$. In spite of this, let us disregard the above-mentioned assumption and try to evaluate the distance matrix by means of the Mahalanobis distance formula. For this purpose it is necessary to adopt a variance value which would be common for all the populations $\Pi_1, \Pi_2, \Pi_3, \Pi_4$. In practice such a common variance is evaluated as the mean of all variances. The common variance becomes

$$\sigma^2 = \tfrac{1}{4}(\sigma_1^2 + \sigma_2^2 + \sigma_3^2 + \sigma_4^2) = 83.$$

According to the Mahalanobis distance formula (1.1), adjusted to the univariate case, we obtain the following distances

             $\Pi_1$     $\Pi_2$     $\Pi_3$
$\Pi_2$      0.193
$\Pi_3$      0.012       0.301
$\Pi_4$      0.301       0.012       0.434
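The figures in this table can be reproduced with a few lines of Python. This is a minimal sketch, assuming (as the common variance of 83 implies) that the second parameter of each $N(\cdot, \cdot)$ is a standard deviation; the names `populations` and `mahalanobis_1d` are illustrative and do not come from the paper.

```python
# Univariate Mahalanobis distances for the four example populations.
# Assumption: each pair is (mean, standard deviation).
populations = {1: (5, 1), 2: (1, 5), 3: (6, 9), 4: (0, 15)}

# Common variance taken as the mean of the individual variances.
pooled_variance = sum(sd ** 2 for _, sd in populations.values()) / len(populations)

def mahalanobis_1d(i, j):
    """Formula (1.1) for k = 1: squared mean difference over the common variance."""
    return (populations[i][0] - populations[j][0]) ** 2 / pooled_variance

# Lower triangle of the distance matrix, rounded as in the table.
distances = {(i, j): round(mahalanobis_1d(i, j), 3)
             for i in populations for j in populations if i > j}

print(pooled_variance)   # 83.0
print(distances)
```

In the univariate case formula (1.1) reduces to $(\mu_i - \mu_j)^2 / \sigma^2$, which is all the function computes.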

Now, using one of the well-known hierarchical cluster analysis methods (e.g. the nearest neighbour method), the following family of clusters is obtained

$$A = \{(\Pi_1, \Pi_3), (\Pi_2, \Pi_4), (\Pi_1, \Pi_2, \Pi_3, \Pi_4)\} \qquad (2.1)$$

The results can also be presented graphically in the form of "a tree with a root".

The family A in (2.1) and its graphical representation (see Fig. 1) are the final results of clustering.
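The nearest neighbour (single linkage) agglomeration that leads to the family A can be sketched as follows. The input is the Mahalanobis distance table of this example; the function name `nearest_neighbour` and the rule that ties are broken by taking the first minimal pair are assumptions made for illustration only.

```python
from itertools import combinations

# Mahalanobis distances of the example, keyed by unordered pairs of population indices.
dist = {
    frozenset({1, 2}): 0.193, frozenset({1, 3}): 0.012, frozenset({2, 3}): 0.301,
    frozenset({1, 4}): 0.301, frozenset({2, 4}): 0.012, frozenset({3, 4}): 0.434,
}

def nearest_neighbour(items, dist):
    """Single-linkage agglomeration; returns the merged clusters in merge order."""
    clusters = [frozenset({x}) for x in items]
    family = []
    while len(clusters) > 1:
        # Distance between two clusters = smallest distance between their members.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: min(dist[frozenset({x, y})] for x in pair[0] for y in pair[1]),
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        family.append(a | b)
    return family

family_A = nearest_neighbour([1, 2, 3, 4], dist)
print([sorted(c) for c in family_A])   # [[1, 3], [2, 4], [1, 2, 3, 4]]
```

The two pairs at distance 0.012 are merged first, and the resulting two clusters are then joined at 0.193, the smallest distance between their members.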

Figure 1. Graphical representation of the cluster family A (leaves, left to right: $\Pi_1$, $\Pi_3$, $\Pi_2$, $\Pi_4$)

We cannot forget, however, that the results may deviate considerably from the correct ones, because the assumption concerning the equality of the population variances is not satisfied. In this example it seems more reasonable to use the Kullback-Leibler divergence formula (1.3), derived for normally distributed populations with unequal covariance matrices. The correct distances calculated for the same set of objects by means of the Kullback-Leibler formula (1.3), adjusted to the univariate case, are the following

             $\Pi_1$     $\Pi_2$     $\Pi_3$
$\Pi_2$      19.840
$\Pi_3$      40.012      1.429
$\Pi_4$      124.058     3.578       0.871

This leads to the following family of clusters

$$A' = \{(\Pi_3, \Pi_4), (\Pi_2, \Pi_3, \Pi_4), (\Pi_1, \Pi_2, \Pi_3, \Pi_4)\} \qquad (2.2)$$

represented in Fig. 2.

Figure 2. Graphical representation of the cluster family A' (leaves, left to right: $\Pi_3$, $\Pi_4$, $\Pi_2$, $\Pi_1$)
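The Kullback-Leibler distances of this example can be recomputed from the univariate form of (1.3), in which the trace term becomes $\tfrac{1}{2}(\sigma_i^2 - \sigma_j^2)(1/\sigma_j^2 - 1/\sigma_i^2)$. The sketch below again assumes that the second parameter of each population is a standard deviation; the function name `divergence_1d` is illustrative.

```python
# Univariate form of (1.3):
#   J = 1/2 (v_i - v_j)(1/v_j - 1/v_i) + 1/2 (m_i - m_j)^2 (1/v_i + 1/v_j),
# where v_i, v_j are variances. Assumption: each pair is (mean, standard deviation).
populations = {1: (5, 1), 2: (1, 5), 3: (6, 9), 4: (0, 15)}

def divergence_1d(i, j):
    (mi, sdi), (mj, sdj) = populations[i], populations[j]
    vi, vj = sdi ** 2, sdj ** 2
    variance_term = 0.5 * (vi - vj) * (1 / vj - 1 / vi)      # zero when variances agree
    mean_term = 0.5 * (mi - mj) ** 2 * (1 / vi + 1 / vj)
    return variance_term + mean_term

# Lower triangle of the divergence matrix, rounded to three decimals.
divergences = {(i, j): round(divergence_1d(i, j), 3)
               for i in populations for j in populations if i > j}
print(divergences)
```

The variance term dominates for pairs whose variances differ strongly (for example $\Pi_1$ and $\Pi_4$), which is why the ordering of the distances changes compared with the Mahalanobis table.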

We can see that the last results differ from the previous ones (see Fig. 1). The cluster family A differs from the cluster family A', although both of these families were obtained for the same set of objects $\Pi_1, \Pi_2, \Pi_3, \Pi_4$. The important question is how much they differ and how to measure the similarity between the families A and A'. To answer it, we need a measure which expresses the degree of similarity between the families A and A'.

For this purpose we applied the so-called Marczewski-Steinhaus distance measure, defined for two families of subsets of the same set.

The Marczewski-Steinhaus distance measure

The Marczewski-Steinhaus distance measure is defined for two families of subsets of the same set (Marczewski and Steinhaus 1958, Karoński and Palka 1977). Let us denote by $F_i$ the i-th cluster of the family A and by $E_i$ the i-th cluster of the family A'. The distance between the families A and A' takes the form

$$d(A, A') = \frac{1}{n-1} \min_{p \in P} \sum_{i=1}^{n-1} \frac{\mathrm{card}(F_i - E_{p(i)}) + \mathrm{card}(E_{p(i)} - F_i)}{\mathrm{card}(F_i \cup E_{p(i)})} \qquad (2.3)$$

and $d(A, A') \in \langle 0, 1 \rangle$, where p is a permutation of the first n - 1 integers and P is the set of all such permutations.

Let us evaluate the Marczewski-Steinhaus distance d(A, A') for the two families of subsets (2.1) and (2.2) given in the example. The first family is the following

$$A = \{(\Pi_1, \Pi_3), (\Pi_2, \Pi_4), (\Pi_1, \Pi_2, \Pi_3, \Pi_4)\}$$

and the second one is

$$A' = \{(\Pi_3, \Pi_4), (\Pi_2, \Pi_3, \Pi_4), (\Pi_1, \Pi_2, \Pi_3, \Pi_4)\}$$

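Now we consider the 6 permutations of the subsets of the family A'. Thus

$A'_{p_1} = \{(\Pi_3, \Pi_4), (\Pi_2, \Pi_3, \Pi_4), (\Pi_1, \Pi_2, \Pi_3, \Pi_4)\}$
$A'_{p_2} = \{(\Pi_3, \Pi_4), (\Pi_1, \Pi_2, \Pi_3, \Pi_4), (\Pi_2, \Pi_3, \Pi_4)\}$
$A'_{p_3} = \{(\Pi_2, \Pi_3, \Pi_4), (\Pi_3, \Pi_4), (\Pi_1, \Pi_2, \Pi_3, \Pi_4)\}$
$A'_{p_4} = \{(\Pi_2, \Pi_3, \Pi_4), (\Pi_1, \Pi_2, \Pi_3, \Pi_4), (\Pi_3, \Pi_4)\}$
$A'_{p_5} = \{(\Pi_1, \Pi_2, \Pi_3, \Pi_4), (\Pi_2, \Pi_3, \Pi_4), (\Pi_3, \Pi_4)\}$
$A'_{p_6} = \{(\Pi_1, \Pi_2, \Pi_3, \Pi_4), (\Pi_3, \Pi_4), (\Pi_2, \Pi_3, \Pi_4)\}$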

According to formula (2.3), we obtain the following scheme of calculations for the first permutation (see Tab. 1).


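Table 1

Scheme of intermediate calculations of the Marczewski-Steinhaus distance for two families of clusters: the family A given in (2.1) and the first permutation $A'_{p_1}$ of the family A' given in (2.2)

                                                    i = 1                       i = 2                       i = 3
$F_i$                                               $(\Pi_1, \Pi_3)$            $(\Pi_2, \Pi_4)$            $(\Pi_1, \Pi_2, \Pi_3, \Pi_4)$
$E_{p_1(i)}$                                        $(\Pi_3, \Pi_4)$            $(\Pi_2, \Pi_3, \Pi_4)$     $(\Pi_1, \Pi_2, \Pi_3, \Pi_4)$
$F_i - E_{p_1(i)}$                                  $\Pi_1$                     $\emptyset$                 $\emptyset$
$c_1 = \mathrm{card}(F_i - E_{p_1(i)})$             1                           0                           0
$E_{p_1(i)} - F_i$                                  $\Pi_4$                     $\Pi_3$                     $\emptyset$
$c_2 = \mathrm{card}(E_{p_1(i)} - F_i)$             1                           1                           0
$F_i \cup E_{p_1(i)}$                               $(\Pi_1, \Pi_3, \Pi_4)$     $(\Pi_2, \Pi_3, \Pi_4)$     $(\Pi_1, \Pi_2, \Pi_3, \Pi_4)$
$c_3 = \mathrm{card}(F_i \cup E_{p_1(i)})$          3                           3                           4
$(c_1 + c_2)/c_3$                                   2/3                         1/3                         0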

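We obtain that

$$S_1 = \sum_{i=1}^{3} \frac{\mathrm{card}(F_i - E_{p_1(i)}) + \mathrm{card}(E_{p_1(i)} - F_i)}{\mathrm{card}(F_i \cup E_{p_1(i)})} = \frac{2}{3} + \frac{1}{3} + 0 = 1.$$

Continuing the calculations for all the permutations of the clusters of the family A' we obtain

$$S_2 = \frac{2}{3} + \frac{2}{4} + \frac{1}{4} = \frac{17}{12}, \qquad S_5 = \frac{2}{4} + \frac{1}{3} + \frac{2}{4} = \frac{4}{3},$$

$$S_3 = \frac{3}{4} + \frac{2}{3} + 0 = \frac{17}{12}, \qquad S_6 = \frac{2}{4} + \frac{2}{3} + \frac{1}{4} = \frac{17}{12},$$

$$S_4 = \frac{3}{4} + \frac{2}{4} + \frac{2}{4} = \frac{7}{4}.$$

Finally, the Marczewski-Steinhaus distance has the value

$$d(A, A') = \frac{1}{3} \min\left\{1, \frac{17}{12}, \frac{17}{12}, \frac{7}{4}, \frac{4}{3}, \frac{17}{12}\right\} = \frac{1}{3} \approx 0.33.$$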

Thus, the distance between the family A and the family A', or between the trees G and G', is equal to 0.33. It follows from the analyzed example that the results of cluster analysis based on the Mahalanobis distance can deviate considerably from the correct results if the assumption concerning the variance equality is not satisfied.
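The hand calculation of d(A, A') can be checked with a direct, brute-force reading of formula (2.3). The sketch below uses exact fractions; the names `permutation_sums` and `ms_distance` are illustrative, and the enumeration of all orderings is only practical for small families such as the two in the example.

```python
from fractions import Fraction
from itertools import permutations

# Families (2.1) and (2.2), with populations denoted by their indices.
A = [frozenset({1, 3}), frozenset({2, 4}), frozenset({1, 2, 3, 4})]
A_prime = [frozenset({3, 4}), frozenset({2, 3, 4}), frozenset({1, 2, 3, 4})]

def permutation_sums(fam_f, fam_e):
    """One sum S per ordering of fam_e, following the summand of formula (2.3)."""
    return [
        sum(Fraction(len(f - e) + len(e - f), len(f | e)) for f, e in zip(fam_f, order))
        for order in permutations(fam_e)
    ]

def ms_distance(fam_f, fam_e):
    """Smallest sum over all orderings, divided by the number of clusters."""
    return min(permutation_sums(fam_f, fam_e)) / len(fam_f)

# Note: itertools may list the orderings in a different sequence than p_1, ..., p_6.
print(sorted(permutation_sums(A, A_prime)))
print(ms_distance(A, A_prime))   # 1/3
```

Sorting the six sums gives 1, 4/3, 17/12 (three times) and 7/4, in agreement with the values $S_1, \ldots, S_6$ obtained by hand.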


Simulation study

In this section we present the results of a computer simulation study performed in the same way as described in the example but for a larger number of univariate populations. Let us assume that all populations are normally distributed with the expected values as follows

$m_1 = 4.86$     $m_6 = 4.91$      $m_{11} = 4.96$
$m_2 = 4.87$     $m_7 = 4.92$      $m_{12} = 4.97$
$m_3 = 4.88$     $m_8 = 4.93$      $m_{13} = 4.98$
$m_4 = 4.89$     $m_9 = 4.94$      $m_{14} = 4.99$
$m_5 = 4.90$     $m_{10} = 4.95$   $m_{15} = 5.00$

The aim is to study the sensitivity of cluster analysis methods (with the Mahalanobis distance matrix applied) to departure from the variance equality assumption. The results of such an investigation for the nearest neighbour method are presented in Tab. 2.

Table 2

The Marczewski-Steinhaus distance values expressing robustness of the nearest neighbour method to departure from the variance equality assumption

The variances of the populations     The Marczewski-Steinhaus distance
4   4   4   4   4   4   4            0.00
6   4   4   4   4   4   4            0.14
8   6   4   4   4   4   4            0.24
8   8   4   4   4   4   4            0.24
8   6   6   4   4   4   4            0.24
8   8   6   4   4   4   4            0.28
8   8   8   4   4   4   4            0.28
8   8   6   6   4   4   4            0.31
8   8   8   6   6   4   4            0.29
6   6   8   8   8   4   4            0.34

The numbers in the second column of Tab. 2 represent the values of the Marczewski-Steinhaus distance between two families of clusters. Both families were obtained for the same set of populations by means of the nearest neighbour method but with different distance matrices applied. In the first case the Mahalanobis distance formula (1.1) was applied under the assumption that all population variances are equal. In the second case the Kullback-Leibler distance formula (1.3) was used (the so-called Generalized Mahalanobis Distance), the use of which extends to normal populations with various covariance matrices.
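One step of such a comparison might be organised as in the sketch below. It is not the author's program, and several choices are assumptions: all function names are hypothetical, ties in the nearest neighbour method are broken by cluster position, and a full list of 15 variances is passed in because the text does not say how the seven variance columns of Tab. 2 map onto the 15 populations. Since 14 clusters would require 14! orderings, the minimum in (2.3) is found here with the Hungarian algorithm for the assignment problem, a technique swapped in for illustration.

```python
from math import inf

MEANS = [round(4.86 + 0.01 * k, 2) for k in range(15)]   # m_1, ..., m_15 from the text

def mahalanobis_matrix(means, variances):
    """Formula (1.1), univariate, with the mean variance as the common variance."""
    pooled = sum(variances) / len(variances)
    return [[(a - b) ** 2 / pooled for b in means] for a in means]

def divergence_matrix(means, variances):
    """Formula (1.3), univariate."""
    pairs = list(zip(means, variances))
    return [[0.5 * (vi - vj) * (1 / vj - 1 / vi) + 0.5 * (mi - mj) ** 2 * (1 / vi + 1 / vj)
             for mj, vj in pairs] for mi, vi in pairs]

def nearest_neighbour(dist):
    """Single linkage; returns the n - 1 merged clusters as frozensets of indices."""
    clusters = [frozenset({i}) for i in range(len(dist))]
    family = []
    while len(clusters) > 1:
        _, a, b = min((min(dist[i][j] for i in p for j in q), ia, ib)
                      for ia, p in enumerate(clusters)
                      for ib, q in enumerate(clusters) if ia < ib)
        merged = clusters[a] | clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
        family.append(merged)
    return family

def min_assignment(cost):
    """Minimum total cost of a one-to-one row/column matching (Hungarian algorithm)."""
    n = len(cost)
    u, v = [0.0] * (n + 1), [0.0] * (n + 1)
    match, way = [0] * (n + 1), [0] * (n + 1)   # match[j] = row assigned to column j
    for i in range(1, n + 1):
        match[0], j0 = i, 0
        minv, used = [inf] * (n + 1), [False] * (n + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = match[j0], inf, 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(n + 1):
                if used[j]:
                    u[match[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if match[j0] == 0:
                break
        while j0:                                # flip the augmenting path
            prev = way[j0]
            match[j0] = match[prev]
            j0 = prev
    return sum(cost[match[j] - 1][j - 1] for j in range(1, n + 1))

def ms_distance(fam_f, fam_e):
    """Formula (2.3); the minimum over permutations is solved as an assignment problem."""
    cost = [[len(f ^ e) / len(f | e) for e in fam_e] for f in fam_f]
    return min_assignment(cost) / len(fam_f)

def simulation_step(means, variances):
    fam_m = nearest_neighbour(mahalanobis_matrix(means, variances))
    fam_kl = nearest_neighbour(divergence_matrix(means, variances))
    return ms_distance(fam_m, fam_kl)

print(simulation_step(MEANS, [4.0] * 15))                 # equal variances: 0.0
print(simulation_step([5, 1, 6, 0], [1, 25, 81, 225]))    # the Example: about 0.33
```

With all variances equal to 4 the two distance matrices coincide, so the distance is 0, in line with the first row of Tab. 2. The other rows are not reproduced here because the variance assignment is not specified in the text.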

3. FINAL REMARKS

The simulation results lead to the conclusion that the nearest neighbour method based on the Mahalanobis distance measure is not robust to departure from the variance equality assumption.

REFERENCES

Beran R. (1977): Minimum Hellinger distance estimates for parametric models, The Annals of Stat., Vol. 5, p. 445-463.

Bhattacharyya A. (1943): On a measure of divergence between two statistical populations defined by their probability distributions, Bull. Calcutta Math. Soc., Vol. 35, p. 99-109.

Everitt B. S. (1979): A Monte Carlo investigation of the robustness of Hotelling's one- and two-sample T² tests, Journal Amer. Stat. Assoc., Vol. 74, p. 48-51.

Holloway L. S., Dunn O. J. (1967): The robustness of Hotelling's T², Journal Amer. Stat. Assoc., Vol. 62, p. 124-136.

Hopkins J. W., Clay P. P. F. (1963): Some empirical distributions of bivariate T² and homoscedasticity criterion M under unequal variance and leptokurtosis, Journal Amer. Stat. Assoc., Vol. 58, p. 1048-1053.

Johnson M. E., Wang C., Ramberg J. S. (1979): Robustness of Fisher's linear discriminant function to departures from normality, Informal Report LA-8068-MS, Los Alamos Scientific Laboratory, University of California, October 1979.

Kobayashi H. M. (1970): Distance measures and asymptotic relative efficiency, IEEE Trans. Inform. Theory, Vol. IT-16, p. 288-291.

Kailath T. (1967): The divergence and Bhattacharyya distance measures in signal selection, IEEE Trans. Communication Technology, Vol. COM-15, p. 52-60.

Karoński M., Palka A. (1977): On Marczewski-Steinhaus type distance between hypergraphs, "Zastosowania Matematyki", XVI, 1, p. 47-57.

Koichi I. (1969): On the effect of heteroscedasticity and non-normality upon some multivariate test procedures, Multivariate Analysis, Vol. 2, Academic Press, New York, p. 87-120.

Kullback S., Leibler R. A. (1951): On information and sufficiency, Annals of Math. Stat., Vol. 22, p. 79-86.

Kullback S. (1952): An application of information theory to multivariate analysis, Annals of Math. Stat., Vol. 23, p. 88-102.

Mahalanobis P. C. (1936): On the generalized distance in statistics, Proc. Nat. Inst. Sci. India, Vol. 12, p. 49-55.

Marczewski E., Steinhaus H. (1958): On a certain distance of sets and the corresponding


Agnieszka Rossa

DISTANCE MEASURES BETWEEN STATISTICAL POPULATIONS AND THEIR APPLICATION IN CLUSTER ANALYSIS - A MONTE CARLO STUDY

The paper presents the results of a simulation study of one of the methods of hierarchical grouping of statistical populations, namely the nearest neighbour method. The starting point is the construction of a matrix of distances between objects (here, between statistical populations). The aim of the paper was to examine the robustness of this agglomerative method to departure from the assumptions under which a given distance measure may be applied. Two distance measures were considered in the study: the Mahalanobis distance, defined for normal populations with equal covariance matrices, and the Kullback-Leibler distance, which generalizes the Mahalanobis distance to populations of arbitrary distributions. The main emphasis was placed on examining the robustness of the method to departure from the assumption of equal covariance matrices. The simulation study was carried out for a predetermined set of 15 univariate normal populations whose variances were changed in successive steps. The aim of the study was to determine the degree of difference between the families of clusters obtained for a given set of populations but with a different distance matrix. The Marczewski-Steinhaus distance was used as the measure of the degree of difference between the resulting families of clusters.