Estimation of Population Averages on the Basis of a Vector of Cluster Means


ACTA UNIVERSITATIS LODZIENSIS

FOLIA OECONOMICA 175, 2004

Janusz Wywiał*

ESTIMATION OF POPULATION AVERAGES ON THE BASIS OF A VECTOR OF CLUSTER MEANS**

Abstract. The estimation of a vector of mean values is considered. The vector estimator consists of simple cluster sample means. It is assumed that a population of a fixed size is divided into mutually disjoint clusters, each of the same size. The variance-covariance matrix of the vector estimator is derived. It is a function of the homogeneity matrix of a multidimensional variable, which describes the within-cluster spread of the multidimensional variable under research. The accuracy of estimation is measured by means of the standard deviations of particular sample cluster means as well as by means of the trace, the determinant or the maximal eigenvalue of the variance-covariance matrix of the vector estimator. The accuracy of the vector of simple sample cluster means is compared with the accuracy of the vector of simple sample means. The accuracy of the vector of simple sample cluster means increases when the degree of within-cluster spread of the distribution of the multidimensional variable increases. Hence, the population should be divided into clusters such that the within-cluster spread is as large as possible.

Key words: cluster sample, vector estimation, clustering methods, generalised variance, relative efficiency, homogeneity coefficient of multidimensional variable, eigenvalue of variance-covariance matrix.

1. THE BASIC PROPERTIES OF THE VECTOR OF CLUSTER MEANS

A fixed population of size $N$ is denoted by $\Omega$. It is convenient to treat the population as a subset of the natural numbers: $\Omega = \{1, 2, \ldots, N\}$. Let us assume that the population $\Omega$ is divided into $G$ mutually disjoint clusters $\Omega_p$ ($p = 1, \ldots, G$) such that $\bigcup_{p=1}^{G} \Omega_p = \Omega$. If each cluster is of the same size, denoted by $M$, the population $\Omega$ is of size $N = GM$. Let $S$ be a cluster sample of size $g$. The random sample $S$ is drawn according to the following design:

$$P_g(S) = \binom{G}{g}^{-1}.$$

* Prof., Department of Statistics, University of Economics, Katowice, e-mail: wywial@ae.katowice.pl.

** The research was supported by the grant number 1 H 02B 015 10 from the Polish Scientific Research Committee.

The $k$-th ($k = 1, \ldots, N$) outcome of the $i$-th ($i = 1, \ldots, m$) variable is denoted by $y_{ki}$. The sum of the observations of the $i$-th variable in the $p$-th cluster is as follows:

$$z_{pi} = \sum_{k \in \Omega_p} y_{ki}.$$

The mean value of the $i$-th variable in the $p$-th cluster is:

$$\bar{y}_{ip} = \frac{1}{M} z_{pi}.$$

The mean value of the $i$-th variable per cluster is:

$$\bar{z}_i = \frac{1}{G} \sum_{p=1}^{G} z_{pi}.$$

The population mean of the $i$-th variable takes the following form:

$$\bar{y}_i = \frac{1}{M} \bar{z}_i = \frac{1}{G} \sum_{p=1}^{G} \bar{y}_{ip}.$$

The variance-covariance matrix is denoted by $\mathbf{C} = [\mathrm{cov}(y_i, y_j)]$, where:

$$\mathrm{cov}(y_i, y_j) = \frac{1}{N-1} \sum_{p=1}^{G} \sum_{k \in \Omega_p} (y_{ki} - \bar{y}_i)(y_{kj} - \bar{y}_j).$$

The variance-covariance matrix of cluster sums is denoted by $\mathbf{C}(z) = [\mathrm{cov}(z_i, z_j)]$, where:

$$\mathrm{cov}(z_i, z_j) = \frac{1}{G-1} \sum_{p=1}^{G} (z_{pi} - \bar{z}_i)(z_{pj} - \bar{z}_j).$$

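The estimator of the vector $\bar{\mathbf{y}} = [\bar{y}_1 \ldots \bar{y}_m]$ is defined as the vector $\bar{\mathbf{y}}_{gS} = [\bar{y}_{1gS} \ldots \bar{y}_{mgS}]$, where:

$$\bar{y}_{igS} = \frac{1}{gM} \sum_{p \in S} \sum_{k \in \Omega_p} y_{ki} = \frac{1}{gM} \sum_{p \in S} z_{pi}. \tag{1}$$

The vector $\bar{\mathbf{y}}_{gS}$ is an unbiased estimator of the mean vector $\bar{\mathbf{y}}$.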

The covariance of the estimators $\bar{y}_{igS}$, $\bar{y}_{jgS}$ ($i \neq j = 1, \ldots, m$) can be derived similarly to the variance of $\bar{y}_{igS}$ ($i = 1, \ldots, m$), see e.g. W. G. Cochran (1963) or C. E. Särndal, B. Swenson, J. Wretman (1992):

$$\mathrm{cov}(\bar{y}_{igS}, \bar{y}_{jgS}) = \frac{G-g}{GgM^2}\, \mathrm{cov}(z_i, z_j). \tag{2}$$

The variance-covariance matrix of $\bar{\mathbf{y}}_{gS}$ can be written in the following way:

$$\mathbf{V}(\bar{\mathbf{y}}_{gS}, P_g) = \frac{G-g}{GgM^2}\, \mathbf{C}(z), \tag{3}$$

where $\mathbf{C}(z) = [\mathrm{cov}(z_i, z_j)]$.

An unbiased estimator of the covariance is obtained by substituting the following statistic for the parameter $\mathrm{cov}(z_i, z_j)$:

$$\mathrm{cov}_S(z_i, z_j) = \frac{1}{g-1} \sum_{p \in S} (z_{pi} - \bar{z}_{iS})(z_{pj} - \bar{z}_{jS}),$$

where $\bar{z}_{iS} = \frac{1}{g} \sum_{p \in S} z_{pi}$ is the sample mean of the cluster sums.
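As a numerical illustration of the properties stated in this section, the following Python sketch enumerates every cluster sample of a small artificial population and compares the exact moments of the estimator with the population mean and with equation (3). It is only a sketch under stated assumptions: the sizes `G`, `M`, `m`, `g`, the random data and all variable names are invented for the example and do not come from the paper.

```python
from itertools import combinations

import numpy as np

# Hypothetical small population: G clusters of M elements, m variables, g sampled clusters.
G, M, m, g = 6, 3, 2, 3
rng = np.random.default_rng(0)
Y = rng.normal(size=(G, M, m))            # Y[p, k, i]: element k of cluster p, variable i

z = Y.sum(axis=1)                         # cluster sums z_{pi}, shape (G, m)
pop_mean = Y.reshape(G * M, m).mean(axis=0)
C_z = np.cov(z, rowvar=False)             # matrix C(z), divisor G - 1

# Enumerate all equally likely cluster samples of size g.
samples = [list(s) for s in combinations(range(G), g)]
estimates = np.array([z[s].sum(axis=0) / (g * M) for s in samples])      # equation (1)
sample_covs = np.array([np.cov(z[s], rowvar=False) for s in samples])    # cov_S, divisor g - 1

expected_estimate = estimates.mean(axis=0)
deviations = estimates - pop_mean
V_enumerated = deviations.T @ deviations / len(samples)
V_formula = (G - g) / (G * g * M**2) * C_z                               # equation (3)

print(np.allclose(expected_estimate, pop_mean))      # unbiasedness of the vector estimator
print(np.allclose(V_enumerated, V_formula))          # agreement with equation (3)
print(np.allclose(sample_covs.mean(axis=0), C_z))    # unbiasedness of cov_S
```

Full enumeration is feasible only for very small `G`; for realistic populations a Monte Carlo draw of cluster samples would replace the `combinations` loop.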

2. HOMOGENEITY COEFFICIENT OF MULTIDIMENSIONAL VARIABLE

Let $\mathbf{C}_b = [\mathrm{cov}_b(y_i, y_j)]$ be the between-cluster matrix of variances and covariances, where:

$$\mathrm{cov}_b(y_i, y_j) = \frac{1}{G-1} \sum_{p=1}^{G} (\bar{y}_{ip} - \bar{y}_i)(\bar{y}_{jp} - \bar{y}_j).$$

The within-cluster matrix of variances and covariances is denoted by $\mathbf{C}_w = [\mathrm{cov}_w(y_i, y_j)]$, where:

$$\mathrm{cov}_w(y_i, y_j) = \frac{1}{G(M-1)} \sum_{p=1}^{G} \sum_{k \in \Omega_p} (y_{ki} - \bar{y}_{ip})(y_{kj} - \bar{y}_{jp}).$$


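Similarly to the one-dimensional case (see e.g. Cochran 1963, p. 243), the variance-covariance matrix $\mathbf{C}$ can be decomposed in the following way:

$$(N-1)\,\mathbf{C} = (G-1) M\, \mathbf{C}_b + (N-G)\, \mathbf{C}_w. \tag{4}$$

The matrix $\mathbf{C}(z)$ can be rewritten as follows:

$$\mathbf{C}(z) = M^2\, \mathbf{C}_b. \tag{5}$$

This expression and equation (4) lead to the following results:

$$\mathbf{C}(z) = \frac{M}{G-1} \big( (N-1)\,\mathbf{C} - (N-G)\,\mathbf{C}_w \big),$$

$$\mathbf{C}(z) = M\, \mathbf{C} \left( \mathbf{I} + \frac{N-G}{G-1}\, \boldsymbol{\Delta} \right), \tag{6}$$

where:

$$\boldsymbol{\Delta} = \mathbf{I} - \mathbf{C}^{-1} \mathbf{C}_w. \tag{7}$$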
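The identities (4)-(7) are purely algebraic, so they can be checked on any data set with equal-sized clusters. The Python sketch below does this for random data; the array layout `Y[p, k, i]`, the sizes and the variable names are assumptions made for the illustration only.

```python
import numpy as np

# Hypothetical population: G clusters, M elements per cluster, m variables.
G, M, m = 6, 4, 3
N = G * M
rng = np.random.default_rng(1)
Y = rng.normal(size=(G, M, m))                     # Y[p, k, i]

C = np.cov(Y.reshape(N, m), rowvar=False)          # matrix C, divisor N - 1
cluster_means = Y.mean(axis=1)                     # cluster means, shape (G, m)
C_b = np.cov(cluster_means, rowvar=False)          # between-cluster matrix, divisor G - 1
resid = (Y - cluster_means[:, None, :]).reshape(N, m)
C_w = resid.T @ resid / (G * (M - 1))              # within-cluster matrix
C_z = np.cov(Y.sum(axis=1), rowvar=False)          # matrix C(z) of cluster sums

Delta = np.eye(m) - np.linalg.solve(C, C_w)        # equation (7)

lhs_4 = (N - 1) * C
rhs_4 = (G - 1) * M * C_b + (N - G) * C_w          # equation (4)
rhs_5 = M**2 * C_b                                 # equation (5)
rhs_6 = M * C @ (np.eye(m) + (N - G) / (G - 1) * Delta)   # equation (6)

print(np.allclose(lhs_4, rhs_4), np.allclose(C_z, rhs_5), np.allclose(C_z, rhs_6))
```

`np.linalg.solve(C, C_w)` is used instead of an explicit inverse; it requires the matrix `C` to be non-singular, which is also the assumption under which the matrix in (7) is defined.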

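In the case of a one-dimensional variable $y_i$, when $\mathbf{C}$ reduces to the variance $\mathrm{var}(y_i)$ and $\mathbf{C}_w$ is the within-cluster variance $\mathrm{var}_w(y_i)$, the matrix $\boldsymbol{\Delta}$ reduces to the homogeneity coefficient (see Särndal, Swenson, Wretman 1992, p. 130):

$$\delta(y_i) = 1 - \frac{\mathrm{var}_w(y_i)}{\mathrm{var}(y_i)}, \tag{8}$$

where:

$$\mathrm{var}(y_i) = \frac{1}{N-1} \sum_{p=1}^{G} \sum_{k \in \Omega_p} (y_{ki} - \bar{y}_i)^2, \qquad \bar{y}_i = \frac{1}{N} \sum_{p=1}^{G} \sum_{k \in \Omega_p} y_{ki}, \tag{9}$$

$$\mathrm{var}_w(y_i) = \frac{1}{G(M-1)} \sum_{p=1}^{G} \sum_{k \in \Omega_p} (y_{ki} - \bar{y}_{ip})^2, \qquad \bar{y}_{ip} = \frac{1}{M} \sum_{k \in \Omega_p} y_{ki}. \tag{10}$$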

Thus, the matrix $\boldsymbol{\Delta}$ can be treated as a generalization of the homogeneity coefficient $\delta$. That is why the matrix $\boldsymbol{\Delta}$ can be called the homogeneity matrix of a multidimensional variable.
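To make the role of the homogeneity matrix more concrete, the following Python sketch computes the coefficients of formula (8) and the eigenvalues of the homogeneity matrix for two different partitions of the same artificial data: clusters of consecutive elements and "systematic" clusters that interleave the elements. The data, the helper `homogeneity` and the partitions are assumptions introduced for this example. The eigenvalues are obtained from a symmetric matrix that shares its eigenvalues with $\mathbf{C}^{-1}\mathbf{C}_w$, a numerical choice made here and not a step of the paper.

```python
import numpy as np

def homogeneity(Y):
    """Return (delta, lam) for data Y[p, k, i]: the coefficients delta(y_i) of
    formula (8) and the eigenvalues of the homogeneity matrix I - C^{-1} C_w."""
    G, M, m = Y.shape
    N = G * M
    C = np.cov(Y.reshape(N, m), rowvar=False)                   # divisor N - 1
    resid = (Y - Y.mean(axis=1, keepdims=True)).reshape(N, m)
    C_w = resid.T @ resid / (G * (M - 1))
    delta = 1 - np.diag(C_w) / np.diag(C)
    # With C = L L^T, the matrix C^{-1} C_w has the eigenvalues of L^{-1} C_w L^{-T}.
    L_inv = np.linalg.inv(np.linalg.cholesky(C))
    lam = 1 - np.linalg.eigvalsh(L_inv @ C_w @ L_inv.T)
    return delta, lam

G, M = 6, 4
N = G * M
t = np.arange(N, dtype=float)
data = np.column_stack([t, (t - 11.5) ** 2 / 10])        # two made-up variables

blocks = data.reshape(G, M, 2)                           # cluster p: consecutive elements
systematic = data.reshape(M, G, 2).transpose(1, 0, 2)    # cluster p: elements p, p+G, p+2G, ...

delta_blocks, lam_blocks = homogeneity(blocks)
delta_syst, lam_syst = homogeneity(systematic)
print(delta_blocks, lam_blocks)
print(delta_syst, lam_syst)
```

For the first variable the consecutive clusters are internally homogeneous, so its coefficient is close to one, while the systematic clusters each span the whole range of values and give a negative coefficient.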


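Theorem 1. If the variance-covariance matrix $\mathbf{C}$ is non-singular, then the eigenvalues $\lambda_i$ ($i = 1, \ldots, m$) of the matrix $\boldsymbol{\Delta}$ fulfill the following inequalities:

$$-\frac{G-1}{N-G} \le \lambda_i \le 1, \quad \text{for each } i = 1, \ldots, m. \tag{11}$$

Proof. The characteristic equation for the matrix $\boldsymbol{\Delta}$ can be transformed as follows:

$$|\boldsymbol{\Delta} - \lambda \mathbf{I}| = 0,$$

$$|\mathbf{I} - \mathbf{C}^{-1}\mathbf{C}_w - \lambda \mathbf{I}| = 0, \tag{12}$$

$$|\mathbf{C}^{-1}\mathbf{C}_w - \kappa \mathbf{I}| = 0, \tag{13}$$

where $\kappa = 1 - \lambda$. Since the matrix $\mathbf{C}^{-1}\mathbf{C}_w$ has the same eigenvalues as the symmetric positive semi-definite matrix $\mathbf{C}^{-1/2}\mathbf{C}_w\mathbf{C}^{-1/2}$, its eigenvalues satisfy $\kappa_i \ge 0$ for each $i = 1, \ldots, m$. Hence, the eigenvalues of the matrix $\boldsymbol{\Delta}$ satisfy $\lambda_i \le 1$ for each $i = 1, \ldots, m$.

Since the matrix $\mathbf{C}_b$ is positive semi-definite, equation (4) leads to the matrix

$$\boldsymbol{\Lambda}_1 = (N-1)\,\mathbf{C} - (N-G)\,\mathbf{C}_w,$$

which is positive semi-definite, too. Because the matrix $\mathbf{C}$ is positive definite, the following matrix is positive semi-definite: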

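$$\boldsymbol{\Lambda}_2 = \frac{1}{N-G}\, \mathbf{C}^{-1} \boldsymbol{\Lambda}_1 = \frac{N-1}{N-G}\, \mathbf{I} - \mathbf{C}^{-1}\mathbf{C}_w.$$

After simple algebraic transformations we have:

$$\boldsymbol{\Delta} = \boldsymbol{\Lambda}_2 - \frac{G-1}{N-G}\, \mathbf{I}.$$

Let us do the following transformations:

$$|\boldsymbol{\Delta} - \lambda \mathbf{I}| = 0,$$

$$\left| \boldsymbol{\Lambda}_2 - \frac{G-1}{N-G}\, \mathbf{I} - \lambda \mathbf{I} \right| = 0, \tag{14}$$

$$|\boldsymbol{\Lambda}_2 - \xi \mathbf{I}| = 0, \tag{15}$$

where:

$$\xi = \lambda + \frac{G-1}{N-G}. \tag{16}$$

Since the matrix $\boldsymbol{\Lambda}_2$ is positive semi-definite, the eigenvalues satisfy $\xi_i \ge 0$ for each $i = 1, \ldots, m$. Hence, on the basis of expression (16) we have:

$$\lambda_i \ge -\frac{G-1}{N-G} \quad \text{for } i = 1, \ldots, m.$$

This completes the proof.

We can say that the within-cluster spread of the observations of a multidimensional variable is less than their population spread if the matrix $\boldsymbol{\Delta}$ is positive definite. When $\boldsymbol{\Delta}$ is negative definite, we say that the population spread of the values of a multidimensional variable is less than the within-cluster spread.

3. ACCURACY OF A CLUSTER SAMPLE MEAN VECTOR IN RELATION TO SIMPLE SAMPLE MEAN VECTOR

Let $\bar{\mathbf{y}}_S$ be the vector of means from a simple random sample of size $n$, selected without replacement from a population of size $N$. Its variance-covariance matrix is of the following form:

$$\mathbf{V}(\bar{\mathbf{y}}_S, P_S) = \frac{N-n}{Nn}\, \mathbf{C}, \tag{17}$$

where $P_S$ denotes the simple random sampling design. In the comparison below the simple sample is taken to contain as many elements as the cluster sample, $n = gM$, so that $\frac{N-n}{Nn} = \frac{G-g}{GgM}$.

On the basis of equations (3) and (6) we have:

$$\mathbf{V}(\bar{\mathbf{y}}_{gS}, P_g) = \frac{G-g}{GgM}\, \mathbf{C} \left( \mathbf{I} + \frac{N-G}{G-1}\, \boldsymbol{\Delta} \right) = \mathbf{V}(\bar{\mathbf{y}}_S, P_S) \left( \mathbf{I} + \frac{N-G}{G-1}\, \boldsymbol{\Delta} \right). \tag{19}$$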


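Hence, for $n = gM$:

$$\mathbf{V}(\bar{\mathbf{y}}_{gS}, P_g) - \mathbf{V}(\bar{\mathbf{y}}_S, P_S) = \frac{(G-g)(N-G)}{GgM(G-1)}\, \mathbf{C} \boldsymbol{\Delta} \tag{20}$$

or

$$\mathbf{V}(\bar{\mathbf{y}}_{gS}, P_g) - \mathbf{V}(\bar{\mathbf{y}}_S, P_S) = \frac{(G-g)(N-G)}{GgM(G-1)}\, (\mathbf{C} - \mathbf{C}_w). \tag{21}$$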

This leads to the following property:

Theorem 2. If the matrix $(\mathbf{C} - \mathbf{C}_w)$ is non-positive definite (non-negative definite), then the strategy $(\bar{\mathbf{y}}_{gS}, P_g)$ is not worse (not better) than the strategy $(\bar{\mathbf{y}}_S, P_S)$. In particular, if the matrix $\mathbf{C}$ is non-singular and the matrix $\boldsymbol{\Delta}$ is non-positive definite (non-negative definite), then the strategy $(\bar{\mathbf{y}}_{gS}, P_g)$ is not worse (not better) than the strategy $(\bar{\mathbf{y}}_S, P_S)$.

Hence, the strategy $(\bar{\mathbf{y}}_{gS}, P_g)$ is not worse than the strategy $(\bar{\mathbf{y}}_S, P_S)$ if the within-cluster spread of a multidimensional variable, represented by the matrix $\mathbf{C}_w$, is larger than its population spread, represented by the matrix $\mathbf{C}$.
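The two directions of Theorem 2 can be illustrated with two extreme artificial populations, sketched below in Python. In the first, every cluster repeats the same block of values, so all cluster means coincide and the within-cluster spread exceeds the population spread. In the second, every cluster is internally constant, so there is no within-cluster spread at all. The helper `strategy_variances`, the sizes and the data are assumptions of this example, and the simple sample is assumed to contain $n = gM$ elements.

```python
import numpy as np

def strategy_variances(Y, g):
    """Return (V_cluster, V_simple) for data Y[p, k, i] and g sampled clusters."""
    G, M, m = Y.shape
    N, n = G * M, g * M
    C = np.cov(Y.reshape(N, m), rowvar=False)        # population matrix C
    C_z = np.cov(Y.sum(axis=1), rowvar=False)        # matrix C(z) of cluster sums
    V_cluster = (G - g) / (G * g * M**2) * C_z       # equation (3)
    V_simple = (N - n) / (N * n) * C                 # simple sample of n = gM elements
    return V_cluster, V_simple

rng = np.random.default_rng(3)
G, M, m, g = 5, 4, 2, 2

# Every cluster repeats the same block of M observations.
Y_identical = np.tile(rng.normal(size=(M, m)), (G, 1, 1))
# Every cluster consists of M copies of a single observation.
Y_constant = np.repeat(rng.normal(size=(G, 1, m)), M, axis=1)

eig_diff = {}
for name, Y in [("identical", Y_identical), ("constant", Y_constant)]:
    V_cluster, V_simple = strategy_variances(Y, g)
    # Eigenvalues of the symmetric difference V_cluster - V_simple.
    eig_diff[name] = np.linalg.eigvalsh(V_cluster - V_simple)
    print(name, eig_diff[name])
```

All eigenvalues of the difference are negative in the first population (the cluster strategy is better) and positive in the second (the simple sample is better), which matches the two cases of the theorem.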

Let us denote the variance of a strategy, the determinant, the trace and the maximal eigenvalue of a variance-covariance matrix of a vector strategy by $D^2(\cdot, \cdot)$, $\det(\cdot, \cdot)$, $\mathrm{tr}(\cdot, \cdot)$ and $\lambda_1(\cdot, \cdot)$, respectively. The relative efficiency coefficients are defined as follows:

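$$e_{0i} = \frac{D^2(\bar{y}_{igS}, P_g)}{D^2(\bar{y}_{iS}, P_S)} = 1 + \frac{N-G}{G-1}\, \delta(y_i), \quad i = 1, \ldots, m, \tag{22}$$

where $\delta(y_i)$ is given by formulas (8)-(10),

$$e_1 = \frac{\det \mathbf{V}(\bar{\mathbf{y}}_{gS}, P_g)}{\det \mathbf{V}(\bar{\mathbf{y}}_S, P_S)} = \det\left( \mathbf{I} + \frac{N-G}{G-1}\, \boldsymbol{\Delta} \right), \tag{23}$$

$$e_2 = \frac{\mathrm{tr}\, \mathbf{V}(\bar{\mathbf{y}}_{gS}, P_g)}{\mathrm{tr}\, \mathbf{V}(\bar{\mathbf{y}}_S, P_S)} = 1 + \frac{N-G}{G-1}\, \bar{\delta}, \tag{24}$$

where:

$$\bar{\delta} = \sum_{i=1}^{m} \delta(y_i)\, a_i, \qquad a_i = \frac{\mathrm{var}(y_i)}{\sum_{j=1}^{m} \mathrm{var}(y_j)},$$

$$e_3 = \frac{\lambda_1\big(\mathbf{V}(\bar{\mathbf{y}}_{gS}, P_g)\big)}{\lambda_1\big(\mathbf{V}(\bar{\mathbf{y}}_S, P_S)\big)}. \tag{25}$$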
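The relative efficiency coefficients can be computed directly from the two variance-covariance matrices and compared with the closed forms in (22)-(24). The Python sketch below does so for random data, again assuming a simple sample of $n = gM$ elements; the sizes, the data and the variable names are illustrative assumptions.

```python
import numpy as np

# Hypothetical population and sample sizes.
G, M, m, g = 8, 5, 3, 3
N, n = G * M, g * M
rng = np.random.default_rng(2)
Y = rng.normal(size=(G, M, m))                       # Y[p, k, i]

C = np.cov(Y.reshape(N, m), rowvar=False)
resid = (Y - Y.mean(axis=1, keepdims=True)).reshape(N, m)
C_w = resid.T @ resid / (G * (M - 1))
Delta = np.eye(m) - np.linalg.solve(C, C_w)          # homogeneity matrix, equation (7)
delta = 1 - np.diag(C_w) / np.diag(C)                # coefficients of formula (8)
k = (N - G) / (G - 1)

V_cluster = (G - g) / (G * g * M**2) * np.cov(Y.sum(axis=1), rowvar=False)  # equation (3)
V_simple = (N - n) / (N * n) * C                                            # equation (17)

# Coefficients computed from their definitions as ratios.
e0 = np.diag(V_cluster) / np.diag(V_simple)
e1 = np.linalg.det(V_cluster) / np.linalg.det(V_simple)
e2 = np.trace(V_cluster) / np.trace(V_simple)
e3 = np.linalg.eigvalsh(V_cluster)[-1] / np.linalg.eigvalsh(V_simple)[-1]

# Closed forms (22)-(24) for comparison.
a = np.diag(C) / np.trace(C)
e0_closed = 1 + k * delta
e1_closed = np.linalg.det(np.eye(m) + k * Delta)
e2_closed = 1 + k * (delta @ a)

print(e0, e0_closed)
print(e1, e1_closed)
print(e2, e2_closed)
print(e3)
```

No closed form for $e_3$ is given in the text, so it is only evaluated from its definition (25).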

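Theorem 3. If the matrix $\mathbf{C}$ is positive definite and the matrix $\boldsymbol{\Delta}$ is non-positive (non-negative) definite, then $e_k \le 1$ ($e_k \ge 1$) for $k = 1, 2, 3$ and $e_{0i} \le 1$ ($e_{0i} \ge 1$) for $i = 1, \ldots, m$. In particular, if the matrix $\boldsymbol{\Delta}$ is negative (positive) definite, then $e_k < 1$ ($e_k > 1$) for $k = 1, 2, 3$, $e_{0i} \le 1$ ($e_{0i} \ge 1$) for $i = 1, \ldots, m$, and $e_{0j} < 1$ ($e_{0j} > 1$) for at least one index $j = 1, \ldots, m$.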

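C. R. Rao (1982, p. 89) showed that if $\mathbf{B}$ is positive definite and $(\mathbf{A} - \mathbf{B})$ is non-negative definite, then $\det(\mathbf{A}) \ge \det(\mathbf{B})$. This and expression (7) lead to the inequality $e_1 \le 1$. The properties of the trace of a sum of matrices lead to the inequality $e_2 \le 1$. If the matrix $\boldsymbol{\Delta}$ is non-positive definite, the matrix $(\mathbf{C} - \mathbf{C}_w)$ is non-positive definite, too. Let $\lambda_1(\mathbf{A})$ be the maximal eigenvalue of a matrix $\mathbf{A}$. Hence:

$$\lambda_1(\mathbf{C}) = \max_{\alpha^T \alpha = 1} \{ \alpha^T \mathbf{C} \alpha \}, \qquad \lambda_1(\mathbf{C}_w) = \max_{\beta^T \beta = 1} \{ \beta^T \mathbf{C}_w \beta \}.$$

If $(\mathbf{C} - \mathbf{C}_w)$ is non-negative definite, then for all non-zero vectors $\gamma$:

$$\gamma^T \mathbf{C} \gamma - \gamma^T \mathbf{C}_w \gamma \ge 0. \tag{26}$$

Hence, for the maximizing vectors $\alpha$ and $\beta$:

$$\alpha^T \mathbf{C} \alpha - \alpha^T \mathbf{C}_w \alpha = \lambda_1(\mathbf{C}) - \alpha^T \mathbf{C}_w \alpha \ge 0,$$

$$\beta^T \mathbf{C} \beta - \beta^T \mathbf{C}_w \beta = \beta^T \mathbf{C} \beta - \lambda_1(\mathbf{C}_w) \ge 0,$$

$$\lambda_1(\mathbf{C}) - \lambda_1(\mathbf{C}_w) \ge \beta^T \mathbf{C} \beta - \lambda_1(\mathbf{C}_w) \ge 0,$$

$$\lambda_1(\mathbf{C}) \ge \lambda_1(\mathbf{C}_w).$$

Applied to the matrices $\mathbf{V}(\bar{\mathbf{y}}_S, P_S)$ and $\mathbf{V}(\bar{\mathbf{y}}_{gS}, P_g)$, whose difference is proportional to $(\mathbf{C}_w - \mathbf{C})$, the same argument leads to the inequality $e_3 \le 1$ when $(\mathbf{C} - \mathbf{C}_w)$ is non-positive definite. Inequality (26) lets us derive the inequalities $e_{0i} \le 1$, $i = 1, \ldots, m$, when we assume that the elements of the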


The strategy $(\bar{\mathbf{y}}_{gS}, P_g)$ can be better than the strategy $(\bar{\mathbf{y}}_S, P_S)$ if the matrix $(\mathbf{C} - \mathbf{C}_w)$ is negative definite. This means that the within-cluster spread of the values of the multidimensional variable under research should be larger than the population spread of the observations of those variables.

REFERENCES

Cochran W. G. (1963), Sampling Techniques, John Wiley, New York.

Rao C. R. (1982), Modele liniowe statystyki matematycznej, PWN, Warszawa.

Särndal C. E., Swenson B., Wretman J. (1992), Model Assisted Survey Sampling, Springer-Verlag, New York-Berlin-Heidelberg-London-Paris-Tokyo-Hong Kong-Barcelona-Budapest.

Janusz Wywiał

ESTIMATION OF POPULATION MEAN VALUES ON THE BASIS OF A VECTOR OF CLUSTER SAMPLE MEANS

It is assumed that a finite and fixed population is divided into disjoint groups (clusters) of equal size. A vector of means is determined on the basis of a simple cluster sample and provides estimates of the vector of population means. The variance-covariance matrix of the vector of cluster sample means is derived. It depends on the matrix of within-cluster homogeneity of the distribution of a multidimensional variable. The precision of estimation is assessed by means of the variances of the individual cluster sample means, the trace, the determinant or the maximal eigenvalue of the variance-covariance matrix. The precision of the vector of cluster sample means is compared with the precision of the vector of means from a simple sample. It turns out that the vector of cluster sample means is more precise than the vector of means from a simple sample when the degree of within-cluster variation of the values of the variables is sufficiently large.