
Some Remarks on Statistical Inference for Complex Samples


ACTA UNIVERSITATIS LODZIENSIS

FOLIA OECONOMICA 194, 2005

Czesław Domański*


Abstract

Classical theory of statistical inference provides methods of estimation and of hypothesis verification for simple samples (observations are stochastically independent and have the same distribution). Because of the costs and effectiveness of research, however, we use complex samples. Observations in these samples are stochastically dependent and have different distributions.

The paper presents problems of estimation and of verification of hypotheses about the consistency of distributions for complex samples.

Key words: complex samples, estimation, goodness of fit tests.

I. INTRODUCTION

Classical theory of statistical inference provides methods of estimating unknown distribution parameters, of estimating the form of the function which defines the distribution, and of verifying hypotheses on the grounds of simple samples, that is, samples in which observations are stochastically independent and have the same probability distribution. In general, however, we use complex samples because of the costs and efficiency of research. The results of observations in these samples are realizations of stochastically dependent variates with various distributions. In representative research we distinguish, among others, the following schemes:

- dependent sampling (without replacement) with various selection probabilities,

- stratified sampling,

- cluster sampling,

- multistage sampling.

For example, sampling without replacement eliminates the stochastic independence of observations, stratification differentiates the selection probabilities of sample elements, and multistage sampling affects the diversification of distributions.

This paper deals with problems connected with estimation, especially the adaptation of central limit theorem methods to complex samples, and with the verification of goodness of fit for complex samples.

II. LIMIT THEOREMS

The representative method deals with procedures of sampling from finite populations and of estimating, on the grounds of the obtained samples, unknown parameters of these populations. Since populations are finite, samples must also be finite. What is more, if $N$ denotes the size of the general population and $n$ the sample size, then it is reasonable to consider only situations in which $n < N$ (cases with $n = N$ are not of interest to the sampling method). Economic and organizational considerations force statisticians to replace simple samples (a simple sample means that each observation has the same distribution as the investigated variable in the population) with complex samples. This fact makes it impossible to apply the classical limit theorems directly to complex samples.

In the case of sampling without replacement the condition $n < N$ must be satisfied. That is why we cannot use limit theorems known from probability calculus, in which it is assumed that $n \to \infty$. We mention here the Lindeberg-Feller theorem, see Fisz (1967).

Theorem 1 (Lindeberg-Feller). Let $\{Y_k\}$ $(k = 1, 2, \ldots)$ be a sequence of independent random variables. Let $\mu_k$ and $\tau_k > 0$ denote the expected value and the standard deviation of $Y_k$, respectively, and $G_k(y)$ its distribution function. Let

$$C_n^2 = \sum_{k=1}^{n} \tau_k^2, \qquad Z_n = \frac{1}{C_n} \sum_{k=1}^{n} (Y_k - \mu_k),$$

and let $F_n(z)$ be the distribution function of $Z_n$ and $\Phi(z)$ the standard normal distribution function. A necessary and sufficient condition for

$$\lim_{n \to \infty} \max_{1 \le k \le n} \frac{\tau_k}{C_n} = 0, \qquad \lim_{n \to \infty} F_n(z) = \Phi(z) \tag{1}$$

is the following relation for any $\varepsilon > 0$:

$$\lim_{n \to \infty} \frac{1}{C_n^2} \sum_{k=1}^{n} \int_{|y - \mu_k| > \varepsilon C_n} (y - \mu_k)^2 \, dG_k(y) = 0. \tag{2}$$

Instead of the convergence stated in (1) we will use the following notation:

$$Z_n \xrightarrow{d} N(0, 1). \tag{3}$$

In the sampling scheme with probabilities proportional to the value of the characteristic $Y$ with replacement, we can use the Lindeberg-Feller theorem even though the general population is finite. On the grounds of this theorem it can be proved, see Bracha (1998), that the Hansen-Hurwitz estimator is asymptotically normal as $n \to \infty$. Let us note that in this scheme $n$ can be arbitrarily large and, in addition, the variables $Y_i$ and $Y_{i'}$ $(i \ne i')$ are independent, because sampling is with replacement.

When $n < N$ and sampling is without replacement, the variates are dependent, and as a consequence we cannot use the Lindeberg-Feller theorem for the estimator of the population mean.

Madow (1948) was the first to try to solve the difficulties arising from the assumption $n < N$ and from the interdependence of the variables. Instead of a given population $U$ he considered a sequence of populations $\{U_\nu\}$, generated by multiple replication of the elements of $U$, under the assumption that both the sizes of these populations and the sizes of the samples drawn from them tend to infinity, that is, as $\nu \to \infty$, $N_\nu \to \infty$, $n_\nu \to \infty$ and […].

Hájek (1960) reformulated Madow's theorem as follows; see also Erdős and Rényi (1959).

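Theorem 2 (Lindeberg-Feller-Hájek). Let there be given a sequence of populations $\{U_\nu\}_{\nu=1}^{\infty}$ of sizes $N_\nu$, the corresponding sequence $\{Y_\nu\}_{\nu=1}^{\infty}$, where $Y_\nu = (Y_{\nu 1}, \ldots, Y_{\nu N_\nu})^T$, and also the sequence of data $\{d_\nu\}_{\nu=1}^{\infty}$, where

$$d_\nu = \{y_{\nu 1}, \ldots, y_{\nu n_\nu}\}, \tag{4}$$

together with the corresponding sequences of general terms

$$\bar{Y}_\nu = \frac{1}{N_\nu} \sum_{j=1}^{N_\nu} Y_{\nu j}, \tag{5}$$

$$S_\nu^2 = \frac{1}{N_\nu - 1} \sum_{j=1}^{N_\nu} (Y_{\nu j} - \bar{Y}_\nu)^2, \tag{6}$$

$$\bar{y}_\nu = \frac{1}{n_\nu} \sum_{j=1}^{n_\nu} y_{\nu j}. \tag{7}$$

For $\varepsilon > 0$ let $U_{\nu\varepsilon}$ be the subset of $U_\nu$ whose elements satisfy the condition

$$|Y_{\nu j} - \bar{Y}_\nu| > \varepsilon \sqrt{n_\nu \left(1 - \frac{n_\nu}{N_\nu}\right)} \, S_\nu, \tag{8}$$

and let $n_\nu \to \infty$, $N_\nu - n_\nu \to \infty$ as $\nu \to \infty$. A necessary and sufficient condition for

$$Z_\nu = \frac{\bar{y}_\nu - \bar{Y}_\nu}{D(\bar{y}_\nu)} \xrightarrow{d} N(0, 1), \qquad D^2(\bar{y}_\nu) = \left(1 - \frac{n_\nu}{N_\nu}\right) \frac{S_\nu^2}{n_\nu}, \tag{9}$$

is the relation

$$\lim_{\nu \to \infty} \frac{\sum_{j \in U_{\nu\varepsilon}} (Y_{\nu j} - \bar{Y}_\nu)^2}{\sum_{j \in U_\nu} (Y_{\nu j} - \bar{Y}_\nu)^2} = 0 \tag{10}$$

(this relation is called the Lindeberg-Hájek condition).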

Scott and Wu (1981) proved further properties of estimators in the case of simple sampling without replacement.

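Theorem 3 (Scott-Wu). If the following condition is satisfied

$$\lim_{\nu \to \infty} \left(1 - \frac{n_\nu}{N_\nu}\right) \frac{S_\nu^2}{n_\nu} = 0, \tag{11}$$

then for any $\varepsilon > 0$

$$\lim_{\nu \to \infty} P\{|\bar{y}_\nu - \bar{Y}_\nu| < \varepsilon\} = 1. \tag{12}$$

Often instead of equality (12) we use the following notation:

$$(\bar{y}_\nu - \bar{Y}_\nu) \xrightarrow{P} 0.$$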

Theorem 3 shows that the mean of a sample drawn according to the simple sampling without replacement scheme is a consistent estimator of the population mean $\bar{Y}$.
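The consistency statement rests on a standard fact: under simple random sampling without replacement the sample mean is unbiased and its variance equals $(1 - n/N) S^2 / n$, where $S^2$ is the population variance with divisor $N - 1$. The following minimal sketch checks this by enumerating all samples from a tiny population; the population values and the function names are illustrative assumptions, not taken from the paper.

```python
from itertools import combinations
from statistics import mean, variance

def exact_variance_of_sample_mean(population, n):
    """Variance of the sample mean over all equally likely samples of size n
    drawn without replacement (full enumeration, feasible only for tiny N)."""
    means = [mean(sample) for sample in combinations(population, n)]
    centre = mean(means)
    return sum((m - centre) ** 2 for m in means) / len(means)

def fpc_variance(population, n):
    """Closed form (1 - n/N) * S^2 / n, with S^2 using the N - 1 divisor."""
    N = len(population)
    return (1 - n / N) * variance(population) / n

# Hypothetical population of N = 6 values, samples of size n = 3.
population = [2.0, 3.0, 5.0, 7.0, 11.0, 13.0]
exact = exact_variance_of_sample_mean(population, 3)
formula = fpc_variance(population, 3)
mean_of_means = mean(mean(s) for s in combinations(population, 3))
```

Because the variance shrinks as $n$ grows, the sample mean concentrates around the population mean, which is what the theorem expresses in the limit.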

Hájek (1964) proposed the rejective sampling procedure for the scheme of sampling without replacement with probabilities proportional to the value of the characteristic. In this scheme, for given selection probabilities $p_j$, quantities $\alpha_j$ are determined as functions of the $p_j$ so that they fulfil the condition

$$\sum_{j=1}^{N} \alpha_j = 1.$$

Next, units are sampled with replacement, with selection probabilities in each draw proportional to $\alpha_j$. If the sample contains $n$ different units, then the sample is accepted. If some units repeat, the whole sample is rejected and a new sample is drawn.
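The accept/reject step described above can be sketched as follows. This is only an illustration under stated assumptions: the draw probabilities `alpha` are taken as given (how they are derived from the $p_j$ is not shown), and the function name and example values are hypothetical.

```python
import random

def rejective_sample(alpha, n, rng):
    """Draw n units with replacement with probabilities alpha and accept the
    sample only if all n units are distinct; otherwise reject and redraw."""
    units = list(range(len(alpha)))
    while True:
        draw = rng.choices(units, weights=alpha, k=n)
        if len(set(draw)) == n:      # no unit repeats: accept
            return sorted(draw)
        # some unit repeated: the whole sample is rejected and redrawn

# Hypothetical draw probabilities for a population of N = 4 units.
alpha = [0.1, 0.2, 0.3, 0.4]
rng = random.Random(0)
sample = rejective_sample(alpha, 2, rng)
```

The loop terminates with probability one whenever at least $n$ units have positive probability, but the expected number of redraws grows quickly as $n$ approaches $N$.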

Rosén (1972) proved a central limit theorem for the Horvitz-Thompson estimator under successive sampling with varying probabilities without replacement. Such analyses have been carried out by many researchers by means of both analytic and simulation methods, see for example Bracha (1990; 1998). On the grounds of the results of this research, the authors recommend great caution in drawing conclusions about the agreement of the distributions of the considered estimators with the normal distribution.

III. $\chi^2$ TEST OF GOODNESS OF FIT FOR COMPLEX SAMPLES

Let the variate $X$ take values belonging to $k$ $(k \ge 2)$ disjoint intervals. Let $p_i$ denote the probability that $X$ takes a value from the $i$-th interval, where $p_i > 0$ for $i = 1, \ldots, k$ and $\sum_{i=1}^{k} p_i = 1$. On the grounds of a simple sample the hypothesis

$$H_0: p = p_0$$

must be verified against the alternative hypothesis

$$H_1: p \ne p_0,$$

where $p = [p_i]_{i=1,\ldots,k-1}$ and $p_0 = [p_{0i}]_{i=1,\ldots,k-1}$ is the $(k-1)$-dimensional vector of hypothetical probabilities corresponding to $p$. To verify hypothesis $H_0$ it is proposed to use the statistic in matrix form, see for example Rao (1982):

$$\chi^2 = n (\hat{p} - p_0)^T P_0^{-1} (\hat{p} - p_0), \tag{13}$$

where

$$P_0 = \operatorname{diag}(p_0) - p_0 p_0^T, \qquad \hat{p} = [\hat{p}_i]_{i=1,\ldots,k-1}, \tag{14}$$

and $\hat{p}_i$ is an unbiased estimator of $p_i$.

Under the assumption that hypothesis $H_0$ is true, the statistic given by formula (13) has an asymptotic $\chi^2$ distribution with $k - 1$ degrees of freedom.
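For a simple sample the matrix form (13) coincides with the familiar Pearson statistic $\sum_{i=1}^{k} (n_i - n p_{0i})^2 / (n p_{0i})$. The sketch below computes both for hypothetical class counts and confirms that they agree; the small linear solver, the function names and the data are illustrative assumptions.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    m = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, m):
            factor = M[r][col] / M[col][col]
            for c in range(col, m + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * m
    for i in reversed(range(m)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, m))
        x[i] = (M[i][m] - tail) / M[i][i]
    return x

def matrix_chi2(counts, p0):
    """Statistic (13): n (p_hat - p0)^T P0^{-1} (p_hat - p0) on the first k-1 classes."""
    n, k = sum(counts), len(counts)
    d = [counts[i] / n - p0[i] for i in range(k - 1)]
    P0 = [[(p0[i] if i == j else 0.0) - p0[i] * p0[j] for j in range(k - 1)]
          for i in range(k - 1)]
    z = solve(P0, d)                       # z = P0^{-1} (p_hat - p0)
    return n * sum(di * zi for di, zi in zip(d, z))

def pearson_chi2(counts, p0):
    """Classical Pearson statistic summed over all k classes."""
    n = sum(counts)
    return sum((c - n * p) ** 2 / (n * p) for c, p in zip(counts, p0))

# Hypothetical counts for k = 4 classes, n = 100, equal hypothetical probabilities.
counts = [18, 30, 27, 25]
p0 = [0.25, 0.25, 0.25, 0.25]
stat_matrix = matrix_chi2(counts, p0)
stat_pearson = pearson_chi2(counts, p0)
```

With these counts both functions give 3.12, below the 5% critical value 7.815 of the $\chi^2$ distribution with 3 degrees of freedom, so $H_0$ would not be rejected.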

For complex samples Holt et al. (1980) proposed a modification of the $\chi^2$ goodness of fit statistic, which has the following form:

$$\chi^2_M = \frac{\chi^2}{\hat{\lambda}}, \tag{15}$$

where

$$\hat{\lambda} = \frac{n}{k - 1} \sum_{i=1}^{k} \frac{D^2(\hat{p}_i)}{p_{0i}}, \tag{16}$$

and $D^2(\hat{p}_i)$ denote the variance estimators of the investigated characteristic which are appropriate for the particular sampling scheme. If hypothesis $H_0$ is true, statistic (15) has approximately a $\chi^2$ distribution with $k - 1$ degrees of freedom. We reject hypothesis $H_0$ at the significance level $\alpha$ if the inequality $\chi^2_M \ge \chi^2_\alpha$ holds.
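A minimal sketch of the modified statistic follows. It assumes a correction factor of the mean design effect type, $\hat{\lambda} = \frac{n}{k-1} \sum_{i=1}^{k} D^2(\hat{p}_i) / p_{0i}$, and takes the variance estimates as inputs; the function names and all numbers, in particular the inflation factor 1.5, are hypothetical. Under this assumption $\hat{\lambda} = 1$ for a simple sample, where $D^2(\hat{p}_i) = p_{0i}(1 - p_{0i})/n$, so the classic statistic is recovered.

```python
def correction_factor(var_p_hat, p0, n):
    """Assumed correction factor: n / (k - 1) * sum_i D^2(p_hat_i) / p0_i."""
    k = len(p0)
    return n / (k - 1) * sum(v / p for v, p in zip(var_p_hat, p0))

def modified_chi2(chi2, lam):
    """Modified statistic: classic chi-square divided by the correction factor."""
    return chi2 / lam

p0 = [0.25, 0.25, 0.25, 0.25]
n = 100

# Simple (multinomial) sample: D^2(p_hat_i) = p0_i (1 - p0_i) / n.
simple_var = [p * (1 - p) / n for p in p0]
lam_simple = correction_factor(simple_var, p0, n)

# Hypothetical complex design whose variances are 1.5 times larger.
complex_var = [1.5 * v for v in simple_var]
lam_complex = correction_factor(complex_var, p0, n)

chi2_classic = 3.12                      # hypothetical classic statistic
chi2_modified = modified_chi2(chi2_classic, lam_complex)
```

A factor above one deflates the statistic, which is the direction needed when the classic test rejects too often.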

In the case when $k = 2$, we verify hypothesis $H_0: p = p_0$ against the alternative hypothesis $H_1: p \ne p_0$ by means of the statistic, see for example Bracha (1998):

$$\chi^2 = \frac{(\hat{p} - p_0)^2}{D^2(\hat{p})}, \tag{17}$$

where $\hat{p}$ is an estimator of $p$.

If hypothesis $H_0$ is true, statistic (17) has, for large values of $n$, a distribution close to the $\chi^2$ distribution with one degree of freedom.
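As a worked example of the two-class statistic, the sketch below evaluates $(\hat{p} - p_0)^2 / D^2(\hat{p})$ for hypothetical numbers. The variance used here is the simple-sample value $p_0 (1 - p_0)/n$; for a complex sample it would be replaced by an estimator suited to the sampling scheme. The function name and data are assumptions made for illustration.

```python
def two_class_statistic(p_hat, p0, var_p_hat):
    """Statistic for k = 2: squared deviation of p_hat from p0 over its variance."""
    return (p_hat - p0) ** 2 / var_p_hat

# Hypothetical data: n = 200 observations, estimated proportion 0.56, H0: p = 0.5.
n, p_hat, p0 = 200, 0.56, 0.5
var_simple = p0 * (1 - p0) / n           # variance under H0 for a simple sample
stat = two_class_statistic(p_hat, p0, var_simple)

CRITICAL_5_PERCENT_1_DF = 3.841          # chi-square critical value, 1 degree of freedom
reject = stat >= CRITICAL_5_PERCENT_1_DF
```

Here the statistic equals 2.88, so $H_0$ is not rejected at the 5% level.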

We carried out several Monte Carlo experiments for complex samples, investigating the sizes of the $\chi^2$ test and of its modification $\chi^2_M$. In the first experiment we compared the sizes of the investigated tests for complex samples (sampling without replacement) drawn from finite populations with a normal distribution with given parameters, for $N$ = 1000, 2000, 10 000. On the grounds of the drawn samples we verified the simple hypothesis $H_0$ that the sample comes from a population with a normal distribution, by means of the classic $\chi^2$ test and the modified $\chi^2_M$ test, which takes the effect of the sampling scheme into account. The investigation was made for a dozen or so variants of the division of sample results into classes; for example, for $N$ = 1000 the number of classes was $k$ = 4, 5, 6, 8, 10, 12, 15, 20, chosen according to the sample size so that the conditions of convergence of the $\chi^2$ statistic to the $\chi^2$ distribution are fulfilled, see Domański (1990). Each variant was run with $q$ = 10 000 repetitions.

Table 1 presents the sizes of the considered tests for three significance levels $\alpha$ = 0.10, 0.05, 0.01 and for the number of degrees of freedom df = 7 for $N$ = 1000, 2000 and df = 14 for $N$ = 10 000. Table 2, in turn, presents the sizes of the considered tests for $N$ = 1000 depending on the number of degrees of freedom (df = 2, 4, 6).
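The design of such an experiment can be sketched as follows for the classic test only. This is a simplified illustration, not the authors' procedure: the population is assumed to be a realization of $N(0, 1)$, the classes are taken as equiprobable under $H_0$, the number of repetitions is reduced, and the function names and seed are hypothetical.

```python
import random
from statistics import NormalDist

def pearson_chi2(counts, p0):
    """Classical Pearson goodness of fit statistic."""
    n = sum(counts)
    return sum((c - n * p) ** 2 / (n * p) for c, p in zip(counts, p0))

def empirical_size(N, n, k, critical_value, reps, seed):
    """Share of rejections of a true H0 when samples of size n are drawn
    without replacement from one finite population of N normal values."""
    rng = random.Random(seed)
    population = [rng.gauss(0.0, 1.0) for _ in range(N)]
    # Class boundaries: quantiles of N(0, 1), so each class has probability 1/k.
    cuts = [NormalDist().inv_cdf(i / k) for i in range(1, k)]
    p0 = [1.0 / k] * k
    rejections = 0
    for _ in range(reps):
        sample = rng.sample(population, n)          # sampling without replacement
        counts = [0] * k
        for x in sample:
            counts[sum(x > c for c in cuts)] += 1   # index of the class containing x
        if pearson_chi2(counts, p0) >= critical_value:
            rejections += 1
    return rejections / reps

# k = 4 classes, i.e. 3 degrees of freedom; 7.815 is the 5% critical value.
size = empirical_size(N=1000, n=50, k=4, critical_value=7.815, reps=2000, seed=1)
```

The returned value is an estimate of the actual size of the test, to be compared with the nominal level 0.05; the modified test would be handled the same way after dividing the statistic by a correction factor.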

IV. CONCLUSIONS

1. The size of the classic $\chi^2$ test for $N$ = 1000 exceeds the assumed significance levels in all cases, whereas the size of the modified test $\chi^2_M$ does not exceed the assumed significance levels $\alpha$ = 0.10 and $\alpha$ = 0.05, and generally also $\alpha$ = 0.01. We obtained similar results for $N$ = 10 000 (see Table 1).

Table 1. Comparison of the size of the $\chi^2$ goodness of fit test with the modified test $\chi^2_M$ for complex samples drawn from finite populations with a normal distribution, for $N$ = 1000, 2000, 10 000 (df = $k - 1$ degrees of freedom)

| $N$ (df) | Sample size $n$ | $\alpha$ = 0.10: $\chi^2$ | $\alpha$ = 0.10: $\chi^2_M$ | $\alpha$ = 0.05: $\chi^2$ | $\alpha$ = 0.05: $\chi^2_M$ | $\alpha$ = 0.01: $\chi^2$ | $\alpha$ = 0.01: $\chi^2_M$ |
|---|---|---|---|---|---|---|---|
| 1000 (df = 7) | 40 | 0.136 | 0.091 | 0.071 | 0.046 | 0.020 | 0.010 |
| 1000 (df = 7) | 50 | 0.128 | 0.086 | 0.071 | 0.046 | 0.023 | 0.012 |
| 1000 (df = 7) | 60 | 0.123 | 0.085 | 0.068 | 0.047 | 0.018 | 0.008 |
| 1000 (df = 7) | 70 | 0.111 | 0.078 | 0.067 | 0.040 | 0.020 | 0.013 |
| 1000 (df = 7) | 80 | 0.105 | 0.081 | 0.064 | 0.042 | 0.017 | 0.013 |
| 1000 (df = 7) | 90 | 0.116 | 0.090 | 0.057 | 0.041 | 0.014 | 0.009 |
| 1000 (df = 7) | 100 | 0.106 | 0.092 | 0.062 | 0.053 | 0.014 | 0.010 |
| 1000 (df = 7) | 120 | 0.111 | 0.090 | 0.059 | 0.048 | 0.016 | 0.008 |
| 2000 (df = 7) | 50 | 0.128 | 0.089 | 0.064 | 0.042 | 0.021 | 0.012 |
| 2000 (df = 7) | 100 | 0.102 | 0.071 | 0.057 | 0.034 | 0.015 | 0.009 |
| 2000 (df = 7) | 150 | 0.097 | 0.082 | 0.054 | 0.043 | 0.010 | 0.007 |
| 2000 (df = 7) | 200 | 0.076 | 0.057 | 0.033 | 0.025 | 0.005 | 0.005 |
| 2000 (df = 7) | 300 | 0.083 | 0.087 | 0.051 | 0.052 | 0.009 | 0.010 |
| 10 000 (df = 14) | 200 | 0.134 | 0.104 | 0.081 | 0.062 | 0.024 | 0.012 |
| 10 000 (df = 14) | 300 | 0.116 | 0.088 | 0.068 | 0.053 | 0.021 | 0.014 |
| 10 000 (df = 14) | 400 | 0.125 | 0.106 | 0.075 | 0.063 | 0.010 | 0.007 |
| 10 000 (df = 14) | 500 | 0.103 | 0.097 | 0.049 | 0.046 | 0.014 | 0.009 |

Source: Own calculations.

2. As the number of degrees of freedom increases, the size of the classic $\chi^2$ test in general departs more and more from the assumed significance level, whereas the size of the modified test $\chi^2_M$ approaches the assumed significance level more and more closely (see Table 2).

Table 2. Comparison of the size of the $\chi^2$ goodness of fit test with the modified test $\chi^2_M$ for complex samples drawn from finite populations with a normal distribution, for $N$ = 1000, depending on the number of degrees of freedom df = 2, 4, 6

| df | Sample size $n$ | $\alpha$ = 0.10: $\chi^2$ | $\alpha$ = 0.10: $\chi^2_M$ | $\alpha$ = 0.05: $\chi^2$ | $\alpha$ = 0.05: $\chi^2_M$ | $\alpha$ = 0.01: $\chi^2$ | $\alpha$ = 0.01: $\chi^2_M$ |
|---|---|---|---|---|---|---|---|
| 2 | 10 | 0.1145 | 0.0593 | 0.0670 | 0.0318 | 0.0192 | 0.0093 |
| 2 | 15 | 0.1095 | 0.0490 | 0.0612 | 0.0259 | 0.0160 | 0.0070 |
| 2 | 20 | 0.1026 | 0.0497 | 0.0540 | 0.0221 | 0.0147 | 0.0064 |
| 2 | 30 | 0.0979 | 0.0427 | 0.0502 | 0.0202 | 0.0119 | 0.0046 |
| 2 | 40 | 0.0871 | 0.0375 | 0.0441 | 0.0188 | 0.0104 | 0.0040 |
| 2 | 50 | 0.0857 | 0.0379 | 0.0425 | 0.0178 | 0.0084 | 0.0036 |
| 2 | 100 | 0.0722 | 0.0367 | 0.0319 | 0.0163 | 0.0058 | 0.0027 |
| 4 | 15 | 0.1398 | 0.0831 | 0.0813 | 0.0452 | 0.0263 | 0.0128 |
| 4 | 20 | 0.1272 | 0.0756 | 0.0737 | 0.0406 | 0.0224 | 0.0105 |
| 4 | 30 | 0.1204 | 0.0730 | 0.0699 | 0.0368 | 0.0215 | 0.0096 |
| 4 | 40 | 0.1237 | 0.0768 | 0.0682 | 0.0384 | 0.0208 | 0.0101 |
| 4 | 50 | 0.1163 | 0.0715 | 0.0651 | 0.0408 | 0.0187 | 0.0105 |
| 4 | 100 | 0.1056 | 0.0727 | 0.0533 | 0.0367 | 0.0139 | 0.0082 |
| 6 | 20 | 0.1516 | 0.0982 | 0.0906 | 0.0533 | 0.0331 | 0.0167 |
| 6 | 30 | 0.1405 | 0.0930 | 0.0851 | 0.0506 | 0.0285 | 0.0143 |
| 6 | 40 | 0.1327 | 0.0883 | 0.0779 | 0.0492 | 0.0249 | 0.0138 |
| 6 | 50 | 0.1213 | 0.0857 | 0.0727 | 0.0463 | 0.0222 | 0.0119 |
| 6 | 100 | 0.1161 | 0.0934 | 0.0625 | 0.0492 | 0.0160 | 0.0122 |

Source: Own calculations.

Summing up, it has to be emphasised that, at this stage, for complex samples (dependent sampling) the classic $\chi^2$ test of goodness of fit in general gives unreliable indications for hypothesis verification in the cases considered. Most often in sampling without replacement the actual probability of the error of the first type considerably exceeds the assumed significance level $\alpha$.

From the experience gathered so far it follows that the considered test should be investigated further for both simple and complex samples. Therefore, some postulates of the many authors who refer to rules of applying the $\chi^2$ test should be verified.

REFERENCES

Bracha Cz. (1990), Wybrane problemy wnioskowania statystycznego na podstawie prób nieprostych, ZBSE GUS, PAN, Warszawa.

Bracha Cz. (1998), Metoda reprezentacyjna w badaniach opinii publicznej i marketingu, EFEKT, Warszawa.

Domański Cz. (1990), Testy statystyczne, PWE, Warszawa.

Erdős P., Rényi A. (1959), On the central limit theorem for samples from a finite population, Publications of the Mathematical Institute of the Hungarian Academy of Sciences, 4, 49-57.

Fisz M. (1967), Rachunek prawdopodobieństwa i statystyka matematyczna, PWN, Warszawa.

Hájek J. (1960), Limiting distributions in simple random sampling from a finite population, Publications of the Mathematical Institute of the Hungarian Academy of Sciences, 5, 361-374.

Hájek J. (1964), Asymptotic theory of rejective sampling with varying probabilities from a finite […]

Holt D., Scott A. J., Ewings P. D. (1980), Chi-squared tests with survey data, Journal of the Royal Statistical Society, Ser. A, 143, 303-320.

Madow W. G. (1948), On the limiting distributions of estimates based on samples from finite populations, Annals of Mathematical Statistics, 19, 535-545.

Rao C. R. (1982), Modele liniowe statystyki matematycznej, PWN, Warszawa.

Rosén B. (1972), Asymptotic theory for successive sampling with varying probabilities without replacement: I and II, Annals of Mathematical Statistics, 43, 373-397 and 748-776.

Scott A. J., Wu C. F. (1981), On the asymptotic distributions of ratio and regression estimators, Journal of the American Statistical Association, 76, 98-102.

Czesław Domański

REMARKS ON STATISTICAL INFERENCE FOR COMPLEX SAMPLES

Summary

Classical theory of statistical inference provides us with methods of estimating unknown distribution parameters, of estimating the form of the function defining the distribution, and of verifying hypotheses on the basis of simple samples, that is, samples in which observations are independent and have the same probability distribution. In general, however, because of the costs and efficiency of research, we use non-simple or complex samples. The results of observations in these samples are realizations of stochastically dependent random variables with different distributions. In representative research we distinguish, among others, the following schemes: dependent sampling (without replacement), sampling with varying selection probabilities, stratified, cluster and multistage sampling. For example, sampling without replacement eliminates the stochastic independence of observations, stratification differentiates the selection probabilities of sample elements, and multistage sampling affects the heterogeneity of distributions.

The subject of this paper is the set of problems connected with estimation (methods of adapting the central limit theorem to complex samples) and with the verification of hypotheses about the consistency of distributions for complex samples.
