
Verification of Hypotheses Concerning Parameters of the Regression Model for Complex Samples




ACTA UNIVERSITATIS LODZIENSIS

FOLIA OECONOMICA 206, 2007

Czesław Domański*

VERIFICATION OF HYPOTHESES CONCERNING PARAMETERS OF THE REGRESSION MODEL FOR COMPLEX SAMPLES

Abstract. The paper considers the linear regression function $y = x\beta + e$, where $\beta$ is a vector of unknown parameters and $e$ is the residual component. In the case of complex samples, the test statistics need to be modified. The results of a simulation study show that the hypothesis $H_0: \beta = \beta_0$ should be verified by means of a modified F test.

Key words: complex samples, F test, testing, design effect, $\chi^2$ test.

1. INTRODUCTION

Social research is one of the fastest developing areas of application of statistical methods. Nowadays every researcher analyzing opinion poll results relies on statistical tools, most often in the form of computer programs, to summarize and present the results. Researchers believe that results calculated by these programs and interpreted according to the rules they were taught are a sound basis for drawing conclusions and setting research hypotheses. Quite frequently, when interpreting results of sample surveys, researchers forget that the results are burdened with random sampling errors (and possibly with non-random errors as well, which are very often caused by the researchers themselves). Thus, for example, they interpret differences in the proportions of support for political parties as if all voters, and not only a sample, had been surveyed. Methods of statistical inference that involve dependence analysis are used more and more often. Still, one of the most popular statistical tools among social researchers is the $\chi^2$ test, which is used for determining the statistical significance of relations between qualitative variables presented in contingency tables. The expression "the relation is statistically significant" is one of the most popular phrases in sociological analyses of survey results. Both those who know the concept of confidence intervals or tests of proportions and those who use the $\chi^2$ test are concerned about whether the conditions of their applicability are satisfied. However, they seldom realize that popular computer programs calculate standard errors and test results under the assumption that the sample was drawn by simple random sampling. Consequently, independence hypotheses are rejected and variables are declared interdependent in cases where, under proper methods of analysis, the relation would be considered statistically insignificant.

Sociologists involved in designing surveys usually know that their samples are complex, drawn by multistage cluster schemes with stratification. Weighting, at least in a basic form, is very common, especially in research agencies. Users of data sets from popular surveys such as the Polish General Opinion Poll, or of data released by the Public Opinion Research Center, also encounter weights that compensate for unequal inclusion probabilities in the adopted sampling scheme. These weights include post-stratification on a few demographic characteristics (survey reports seldom mention weighting intended to compensate for the failure to measure part of the drawn sample). However, even if we use weights properly and reduce the bias of parameter estimates due to non-random error, we seldom note that weighting a sample, even one initially drawn by simple sampling, changes the variance of the estimators and therefore leads to different standard error estimates and different results of statistical tests. The influence of the adopted sampling scheme (clustering, stratification, unequal inclusion probabilities) and of sample weighting on statistical inference is sometimes recognized, but this does not lead to the use of adequate statistical techniques and methods. Until now, this has most often been a result of the limitations of statistical software. Even though specialized tools existed, such as SUDAAN, WesVar, the complex samples analysis module in STATA (a program which, unfortunately, was not very popular in Poland) or even free tools such as CLUSTERS, the limited availability and knowledge of this kind of software made using it very difficult. As a result, the belief that "what we commonly do in the area of data analysis we do poorly" has become quite widespread.

2. PROBLEM FORMULATION

In sample surveys, unknown population parameters are in most cases estimated not from simple samples but from complex ones. This observation presumably led L. Kish (1965) to introduce a measure of divergence between the simple sample and the complex one, called the "design effect". The measure is defined as follows:

$$\operatorname{deff}(t) = \frac{D^2(t)}{D^2(\bar{y}')} \tag{1}$$

where $D^2(\bar{y}')$ is the variance of the mean of a simple sample drawn with replacement (lpzz), and $D^2(t)$ is the variance of the estimator $t$ of the mean $\bar{Y}$ of the characteristic $Y$ under the given sampling scheme (we assume that the statistics $\bar{y}'$ and $t$ are constructed from samples of the same size $n$). The value $\operatorname{deff}(t)$ makes it possible to compare the variances of various estimators of the mean $\bar{Y}$ (for example ratio or regression estimators), constructed from samples drawn according to various sampling schemes, with the variance of the sample mean under sampling with replacement. If $\operatorname{deff}(t)$ does not differ much from 1, the sample can be treated as a simple one and we can apply, for example, classic significance tests to verify the hypothesis $H_0: \bar{Y} = \bar{Y}_0$, where $\bar{Y}_0$ is a hypothetical value of the mean of the characteristic $Y$, or the hypothesis $H_0: \bar{Y}_1 = \bar{Y}_2$, where $\bar{Y}_1$ ($\bar{Y}_2$) is the mean in the first (second) population. If $\operatorname{deff}(t) < 1$, the actual size of the test is smaller than the assumed one. If $\operatorname{deff}(t) > 1$, the actual probability of a type I error is bigger than the assumed significance level.

In sample surveys, samples are most often drawn according to very complex schemes, for example two-stage sampling with stratification of the first-stage units (jps). At the same time, within strata the jps are drawn without replacement with probabilities proportional to the value of an additional characteristic $X$ (lppxbz), while at the second stage simple sampling without replacement (lpbz) is used. Under such schemes $\operatorname{deff}(t)$ can reach a value as high as 8 (Kish 1965). It turns out that the more homogeneous the jps and the bigger the average sampling fraction at the second stage, the bigger $\operatorname{deff}(t)$. Under such circumstances the real test size can even exceed 0.50 for an assumed level of 0.05 (Bracha 1998).

Testing the null hypothesis by means of the classic test, on the assumption that the sample is simple and that no modification of the test statistic is needed, then becomes pointless. The influence of the sampling design on inference based on complex samples is described by the effective sample size, defined by L. Kish (1965) by means of the following formula:

$$n_e = \frac{n}{\operatorname{deff}(t)} \tag{2}$$

where $n$ is the real sample size.

Cz. Bracha (2003) presented several estimators of deff, which make it possible to carry out statistical inference at the assumed confidence coefficient or significance level.
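As a quick numeric illustration of Kish's design effect (a ratio of two variances) and of the effective sample size (the real sample size divided by the design effect), here is a minimal Python sketch. The function names and the variance values are hypothetical and chosen only for illustration; they do not come from the paper.

```python
def design_effect(var_complex: float, var_srs: float) -> float:
    """Ratio of the estimator variance under the complex scheme to the
    variance of the sample mean under simple sampling with replacement."""
    if var_srs <= 0:
        raise ValueError("var_srs must be positive")
    return var_complex / var_srs


def effective_sample_size(n: int, deff: float) -> float:
    """Effective sample size: the real sample size divided by deff."""
    if deff <= 0:
        raise ValueError("deff must be positive")
    return n / deff


# Hypothetical variances (not taken from the paper): the complex design
# doubles the variance, so 1000 interviews carry the information of 500.
deff = design_effect(var_complex=0.8, var_srs=0.4)
n_e = effective_sample_size(n=1000, deff=deff)
print(f"deff = {deff:.2f}, effective sample size = {n_e:.0f}")
```

A design effect above 1 shrinks the effective sample size, which is why classic tests that use the nominal size n become too liberal.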

3. VERIFICATION OF HYPOTHESES CONCERNING PARAMETERS IN THE REGRESSION MODEL

Let an $N$-element parent population $U = \{1, 2, \ldots, N\}$ be given, in which we can observe the random characteristics $Y, X_1, X_2, \ldots, X_k$. The characteristic $Y$ will be called the explained variable, whereas the characteristics $X_1, X_2, \ldots, X_k$ will be called the explanatory variables. Values of these characteristics will be denoted by $Y_j, X_{j1}, \ldots, X_{jk}$.

We will investigate the linear regression function

$$Y = \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_k X_k + e \tag{3}$$

where $\beta_1, \ldots, \beta_k$ are unknown parameters and $e$ is the residual component.

The sample counterpart of model (3) is

$$y = x\beta + e \tag{4}$$

where $y$ is the $n \times 1$ vector of observations, $x$ is the $n \times k$ matrix of explanatory variables and $e$ is the vector of random components.

For a sample drawn by simple sampling with replacement, the estimator of the vector $\beta$ is taken to be

$$b = (x^T x)^{-1} x^T y \tag{5}$$
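The least squares estimator of formula (5) can be sketched in a few lines of NumPy. The data below are simulated and purely hypothetical (sample size, coefficients and noise level are assumptions made for the illustration). The normal equations are solved directly, without forming the inverse explicitly, and the result is compared with NumPy's own least squares routine.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulated sample: n observations, k explanatory variables.
n, k = 50, 3
x = rng.normal(size=(n, k))
beta = np.array([1.0, -2.0, 0.5])          # assumed "true" parameters
y = x @ beta + rng.normal(scale=0.1, size=n)

# b = (x^T x)^{-1} x^T y, obtained by solving the normal equations.
b = np.linalg.solve(x.T @ x, x.T @ y)

# Cross-check against NumPy's least squares solver.
b_lstsq, *_ = np.linalg.lstsq(x, y, rcond=None)

print("b       =", b)
print("b_lstsq =", b_lstsq)
```

Solving the linear system is numerically preferable to inverting the matrix, although both give the same estimator in exact arithmetic.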

Hypotheses concerning the vector $\beta$ are verified under the assumption that the random component vector $e$ has an $n$-dimensional normal distribution with a zero vector of expected values and a covariance matrix $\Sigma$ that depends on the sampling scheme applied. It turns out that, despite the population being finite, this assumption can be made only if $n$ and $N$ are big enough (Bracha 1998).

The hypotheses we are interested in can be written as the following general linear hypothesis (Theil 1979):

$$H_0: A\beta = d \quad \text{against} \quad H_1: A\beta \neq d \tag{6}$$

where $A = [a_{ij}]_{q \times k}$ is a given matrix of rank $q$ ($q \le k$) and $d = [d_i]_{q \times 1}$ is a given vector.

Below we present two special cases of hypothesis (6). We obtain the first one by assuming that $A = I$ and $d = \beta_0$. Then

$$H_0: \beta = \beta_0 \quad \text{against} \quad H_1: \beta \neq \beta_0 \tag{7}$$

In the second case the matrix $A$ is the first unit vector (the vector in which 1 comes first and the remaining coordinates equal 0), whereas the vector $d$ reduces to the scalar $\beta_{10}$:

$$H_0: \beta_1 = \beta_{10} \quad \text{against} \quad H_1: \beta_1 \neq \beta_{10} \tag{8}$$

It follows from these considerations that the form of the test statistic depends on the sampling scheme.

If the sample was drawn according to the scheme with replacement, then to verify $H_0$ of the form (6) one should apply the F test based on the statistic (Rao 1982)

$$F = \frac{(Ab - d)^T [A (x^T x)^{-1} A^T]^{-1} (Ab - d) / q}{(y - \hat{y})^T (y - \hat{y}) / (n - k)} \tag{9}$$

where $\hat{y} = xb$. Assuming that hypothesis (6) is true, that the random component is normal, $e \sim N(0, \sigma^2 I)$, and that the matrix $x$ is fixed, the statistic defined by formula (9) has the Snedecor F distribution with $q$ and $n - k$ degrees of freedom.
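The classic F statistic for the general linear hypothesis can be sketched as follows. This is a minimal illustration under the assumptions stated in the text (sampling with replacement, fixed matrix x); the function name `general_linear_f` and the simulated data are hypothetical.

```python
import numpy as np


def general_linear_f(x, y, A, d):
    """Classic F statistic for H0: A beta = d.

    Returns the statistic together with its degrees of freedom (q, n - k).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.atleast_2d(np.asarray(A, dtype=float))
    d = np.atleast_1d(np.asarray(d, dtype=float))
    n, k = x.shape
    q = A.shape[0]

    xtx_inv = np.linalg.inv(x.T @ x)
    b = xtx_inv @ x.T @ y                 # least squares estimator
    resid = y - x @ b                     # y minus fitted values
    diff = A @ b - d

    numerator = diff @ np.linalg.solve(A @ xtx_inv @ A.T, diff) / q
    denominator = resid @ resid / (n - k)
    return numerator / denominator, q, n - k


# Hypothetical data generated under H0: beta = beta0.
rng = np.random.default_rng(1)
n, k = 40, 3
x = rng.normal(size=(n, k))
beta0 = np.array([1.0, 0.0, -1.0])
y = x @ beta0 + rng.normal(size=n)

# Special case A = I, d = beta0 (all parameters tested jointly).
F_all, q_all, df_all = general_linear_f(x, y, np.eye(k), beta0)

# Special case with A equal to the first unit vector (one parameter).
F_one, q_one, df_one = general_linear_f(x, y, [[1.0, 0.0, 0.0]], [beta0[0]])

print(F_all, (q_all, df_all))
print(F_one, (q_one, df_one))
```

The two calls correspond to the two special cases discussed in the text: the joint hypothesis on the whole parameter vector and the hypothesis on a single coefficient.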

When verifying the hypothesis $H_0$ defined by formula (7), we obtain a simpler form of statistic (9):

$$F = \frac{\|xb - x\beta_0\|^2 / k}{\|y - xb\|^2 / (n - k)} = \frac{e^T x (x^T x)^{-1} x^T e / k}{e^T [I - x (x^T x)^{-1} x^T] e / (n - k)} \tag{10}$$

where the second equality holds when $\beta = \beta_0$. If the null hypothesis (7) is true, the statistic defined by formula (10) has the distribution $F(k, n - k)$.

Now we turn to the hypothesis defined by formula (8). Let us notice that the statistic F defined by formula (9) takes the simple form

$$F = \frac{(b_1 - \beta_{10})^2}{s^2 z_{11}} \tag{11}$$

where $s^2 = (y - xb)^T (y - xb) / (n - k)$ and $z_{11}$ is the first diagonal element of the matrix $(x^T x)^{-1}$.

The statistic defined by formula (11), assuming that the hypothesis of the form (8) is true, has the distribution $F(1, n - k)$. In practice, we frequently verify the hypotheses

$$H_0: \beta_1 = \beta_{10} \quad \text{against} \quad H_1: \beta_1 > \beta_{10} \quad \text{or} \quad H_1: \beta_1 < \beta_{10} \tag{12}$$

instead of the hypothesis of the form (8).

The hypotheses defined by (12) are verified by means of the test based on the statistic

$$t = \frac{b_1 - \beta_{10}}{s \sqrt{z_{11}}} \tag{13}$$

which, assuming that $H_0$ is true, has Student's t distribution with $n - k$ degrees of freedom.

If the sample is not drawn according to the scheme with replacement, the covariance matrix of the random components $e$ is not a scalar matrix. Consequently, the numerator and the denominator of formula (10) do not have the distributions $\chi^2(k)$ and $\chi^2(n - k)$, respectively (Rao 1982). In the case of a complex sample, one should therefore not use the statistic given by formula (10) to verify hypothesis (7).

Let us now consider a symmetric and positive definite matrix $V$. There exists a non-singular matrix $C$ for which the following conditions are satisfied:

$$C V C^T = I \quad \text{and} \quad C^T C = V^{-1} \tag{14}$$

If both sides of formula (4) are multiplied from the left by $C$ (Goldberger 1972), we obtain

$$Cy = Cx\beta + Ce \tag{15}$$

and we adopt the notation

$$\tilde{y} = Cy, \quad \tilde{x} = Cx \quad \text{and} \quad \tilde{e} = Ce \tag{16}$$
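One way to construct a matrix C satisfying both conditions, C V C^T = I and C^T C = V^{-1}, is to take the inverse of the Cholesky factor of V. The sketch below shows this construction; it is only one possible choice of C and is not claimed to be the one used by the author. The block-diagonal matrix V (clusters of equally correlated units) is a hypothetical example.

```python
import numpy as np


def whitening_matrix(V):
    """Return C = L^{-1}, where V = L L^T is the Cholesky factorization.

    Then C V C^T = I and C^T C = V^{-1}.
    """
    L = np.linalg.cholesky(V)      # lower triangular factor of V
    return np.linalg.inv(L)


# Hypothetical covariance structure: 3 clusters of 2 units each,
# with correlation rho between units of the same cluster.
rho = 0.5
block = np.array([[1.0, rho],
                  [rho, 1.0]])
V = np.kron(np.eye(3), block)

C = whitening_matrix(V)

print(np.round(C @ V @ C.T, 10))   # numerically the identity matrix
```

Multiplying y and x by such a C turns the correlated sample into one whose random components are uncorrelated with equal variances.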

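Let us notice that if the random component has the distribution $e \sim N(0, \sigma^2 V)$, then $\tilde{e} \sim N(0, \sigma^2 I)$, since $C V C^T = I$. The sample $[\tilde{y} : \tilde{x}]$ can therefore be treated as a simple one. If we apply the least squares method to the sample $[\tilde{y} : \tilde{x}]$, we obtain the estimator

$$\tilde{b} = (\tilde{x}^T \tilde{x})^{-1} \tilde{x}^T \tilde{y} = (x^T V^{-1} x)^{-1} x^T V^{-1} y \tag{17}$$

of $\beta$, identical with the generalized least squares estimator applied to the sample $[y : x]$. To verify hypothesis (7) we must apply the statistic

$$\tilde{F} = \frac{(\tilde{b} - \beta_0)^T x^T V^{-1} x (\tilde{b} - \beta_0) / k}{(y - x\tilde{b})^T V^{-1} (y - x\tilde{b}) / (n - k)} = \frac{e^T V^{-1} x (x^T V^{-1} x)^{-1} x^T V^{-1} e / k}{e^T [V^{-1} - V^{-1} x (x^T V^{-1} x)^{-1} x^T V^{-1}] e / (n - k)} \tag{18}$$

where the second equality holds when $\beta = \beta_0$.

In the case of a two-stage scheme it is easier to write statistic (18) in the following form:

$$\tilde{F} = \frac{\tilde{e}^T C x (x^T V^{-1} x)^{-1} x^T C^T \tilde{e} / k}{\tilde{e}^T [I - C x (x^T V^{-1} x)^{-1} x^T C^T] \tilde{e} / (n - k)} \tag{19}$$

The matrices

$$Q = C x (x^T V^{-1} x)^{-1} x^T C^T \tag{20}$$

and

$$T = I - C x (x^T V^{-1} x)^{-1} x^T C^T \tag{21}$$

are idempotent matrices of ranks $k$ and $n - k$, respectively. What is more, $QT = 0$, which proves that the quadratic forms occurring in the numerator and the denominator of formula (19) are stochastically independent.

In practice we do not know the matrix $V$, and that is why we estimate it from the sample. Therefore, to verify $H_0$ from formula (7) one should use the following statistic:

$$\hat{F} = \frac{(\hat{b} - \beta_0)^T x^T \hat{V}^{-1} x (\hat{b} - \beta_0) / k}{(y - x\hat{b})^T \hat{V}^{-1} (y - x\hat{b}) / (n - k)} \tag{22}$$

where $\hat{V}$ is the estimate of $V$ and $\hat{b}$ is the estimator (17) computed with $\hat{V}$ in place of $V$. Assuming that $H_0$ is true, that the sample is big and that the matrix $x$ is fixed, statistic (22) has approximately the distribution $F(k, n - k)$.

We can now ask whether, in the case of two-stage sampling, the following statistic can be applied instead of statistic (22):

$$F = \frac{(b - \beta_0)^T x^T x (b - \beta_0) / k}{(y - xb)^T (y - xb) / (n - k)} = \frac{e^T x (x^T x)^{-1} x^T e / k}{e^T [I - x (x^T x)^{-1} x^T] e / (n - k)} \tag{23}$$

Formula (23) has the same form as the statistic from formula (10); compared with (22), $\hat{b}$ is replaced with the ordinary least squares estimator $b$ and $\hat{V}$ with the identity matrix.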
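The generalized least squares estimator and the corresponding F statistic for H0: beta = beta0 can be sketched as follows, assuming that the matrix V is known. In practice V has to be estimated, which this sketch does not attempt. The function name `gls_f_statistic`, the equicorrelated cluster structure and all numeric values are hypothetical.

```python
import numpy as np


def gls_f_statistic(x, y, V, beta0):
    """GLS estimator and F statistic for H0: beta = beta0, with V known."""
    n, k = x.shape
    V_inv = np.linalg.inv(V)
    xtVx = x.T @ V_inv @ x
    b_gls = np.linalg.solve(xtVx, x.T @ V_inv @ y)   # (x'V^-1 x)^-1 x'V^-1 y

    diff = b_gls - beta0
    resid = y - x @ b_gls
    numerator = diff @ xtVx @ diff / k
    denominator = resid @ V_inv @ resid / (n - k)
    return b_gls, numerator / denominator


# Hypothetical two-stage-like structure: 10 clusters of 4 units, with
# correlation rho between units belonging to the same cluster.
rng = np.random.default_rng(2)
m, n_clusters, k = 4, 10, 2
n = m * n_clusters
rho = 0.4
block = (1 - rho) * np.eye(m) + rho * np.ones((m, m))
V = np.kron(np.eye(n_clusters), block)

x = np.column_stack([np.ones(n), rng.normal(size=n)])
beta0 = np.array([1.0, 2.0])
e = np.linalg.cholesky(V) @ rng.normal(size=n)       # e ~ N(0, V)
y = x @ beta0 + e

b_gls, F_gls = gls_f_statistic(x, y, V, beta0)
print("GLS estimate:", b_gls)
print("F statistic :", F_gls)
```

The same numbers are obtained by applying ordinary least squares and the classic F statistic to the transformed sample Cy, Cx, which is the equivalence the text relies on.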

C. F. Wu, D. Holt and D. J. Holmes (1988) suggested another modification of the test statistic defined by formula (9). If an $n$-element sample is drawn according to the scheme with replacement, then $D^2(b) = \sigma^2 (x^T x)^{-1}$ holds, assuming that the matrix $x$ is fixed and $b$ is defined by formula (5). For other sampling schemes we have

$$D^2(b) = \sigma^2 (x^T x)^{-1} (x^T V x) (x^T x)^{-1} = \sigma^2 (x^T x)^{-1} D \tag{24}$$

where

$$D = (x^T V x)(x^T x)^{-1} \tag{25}$$

The matrix $D$ given by formula (25) is called the misspecification effect matrix (meff) (Scott, Holt 1982). Assuming that at least one of the two matrices $x^T x$ or $D$ is diagonal, C. F. Wu, D. Holt and D. J. Holmes (1988) applied the following statistic to verify hypothesis (7):

$$F_m = \ldots \tag{26}$$

where $F$ is defined by formula (10), the matrix $V$ is defined by formula (14), and

$$\operatorname{tr}(QV) = \operatorname{tr}(D) = \sum_{i=1}^{k} \frac{D^2(b_i)}{D^2(b_i \mid \rho = 0)} \tag{27}$$

while $Q = x (x^T x)^{-1} x^T$.

When the null hypothesis is true and the sample is big enough, statistic (26) has approximately the distribution $F(k, n - k)$. If the matrix $V$ is the identity matrix, as in the case of the scheme with replacement, then the matrix $D$ from formula (25) is also the identity matrix and the statistic $F_m$ is identical with the statistic $F$.

In the case of complex schemes (for example two-stage), the matrix $V$ is unknown and therefore must be estimated from the sample. To this end, a two-stage procedure should be used. If we do not want to improve the efficiency of the estimation of the vector $\beta$ by applying the generalized least squares method instead of the least squares method, then applying the test statistic defined by formula (26) does not seem to be less laborious than using the statistic $F$ from formula (19). It can be concluded that the estimation of the matrix $V$ is necessary in both cases. Moreover, in the second case there is no restriction on the form of the matrix $x^T x$ (it is diagonal when, for example, we consider the one-way analysis of variance model).
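The misspecification effect matrix D = (x^T V x)(x^T x)^{-1} and the trace identity tr(QV) = tr(D), with Q = x(x^T x)^{-1}x^T, can be checked numerically. The sketch below assumes a known matrix V with a hypothetical equicorrelated cluster structure; the function names are illustrative and not taken from the cited papers.

```python
import numpy as np


def meff_matrix(x, V):
    """Misspecification effect matrix D = (x^T V x)(x^T x)^{-1}."""
    return (x.T @ V @ x) @ np.linalg.inv(x.T @ x)


def ols_covariance(x, V, sigma2=1.0):
    """Covariance of the least squares estimator when Cov(e) = sigma2 * V:
    sigma2 * (x^T x)^{-1} D."""
    return sigma2 * np.linalg.inv(x.T @ x) @ meff_matrix(x, V)


# Hypothetical design: 8 clusters of 5 units, within-cluster correlation rho.
rng = np.random.default_rng(3)
m, n_clusters, k = 5, 8, 2
n = m * n_clusters
rho = 0.3
block = (1 - rho) * np.eye(m) + rho * np.ones((m, m))
V = np.kron(np.eye(n_clusters), block)
x = np.column_stack([np.ones(n), rng.normal(size=n)])

D = meff_matrix(x, V)
Q = x @ np.linalg.inv(x.T @ x) @ x.T
trace_qv = np.trace(Q @ V)

print("tr(D)  =", np.trace(D))
print("tr(QV) =", trace_qv)
```

With V equal to the identity matrix the function returns the identity, in line with the remark that the modified and the classic statistic then coincide.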

4. FINAL REMARKS

The considerations presented suggest that, when verifying the hypotheses described, one should take into account the fact that the sample is complex. However, the formulas presented do not show explicitly that including a particular sampling scheme improves the test size significantly. That is why, to obtain at least an approximate answer to the problem, a simulation study based on three-dimensional populations was conducted. Five 1000-element populations were generated. In the two-stage scheme, the sizes of the first-stage and second-stage units (m = 10, 15, 20) were varied. Every variant considered in the experiment was repeated 300 times for sample sizes n = 50, 100, 150 (see Tab. 1). Apart from the two-stage sampling scheme, the study also included simple sampling of elements as well as sampling without replacement. The study gives good grounds to state that, when verifying parameters of the regression model on the basis of complex samples, it is necessary to apply modified tests that include the so-called sampling scheme effect (deff(t)).

Table 1. Size of the F test for α = 0.05

n      F (classic)   F (modified)
50     0.281         0.066
100    0.192         0.061
150    0.168         0.051
200    0.126         0.049

Source: own calculations.

In particular, the size of the F test at the level α = 0.05, when the classic form of the statistic was applied, exceeded even 28%.
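The inflation of the test size can be reproduced qualitatively with a small Monte Carlo experiment. The sketch below is not the simulation design used in the paper: the number of clusters, the cluster size, the intra-cluster correlation, the cluster-level covariate and the number of replications are all assumptions made for the illustration. With two tested parameters, the critical value of the F(2, ν) distribution has a closed form, so no statistical library is needed.

```python
import numpy as np


def classic_f(x, y, beta0):
    """Classic F statistic for H0: beta = beta0 (simple-sample formula)."""
    n, k = x.shape
    b = np.linalg.solve(x.T @ x, x.T @ y)
    fit_diff = x @ (b - beta0)
    resid = y - x @ b
    return (fit_diff @ fit_diff / k) / (resid @ resid / (n - k))


def f_crit_2df(alpha, nu):
    """Upper-alpha quantile of F(2, nu), using P(F > f) = (1 + 2f/nu)^(-nu/2)."""
    return (nu / 2.0) * (alpha ** (-2.0 / nu) - 1.0)


def empirical_size(rho, n_clusters=5, m=10, reps=2000, alpha=0.05, seed=0):
    """Share of rejections of a true H0 when errors within a cluster
    have correlation rho (rho = 0 gives independent errors)."""
    rng = np.random.default_rng(seed)
    n = n_clusters * m
    beta0 = np.array([1.0, 2.0])
    crit = f_crit_2df(alpha, n - 2)
    rejections = 0
    for _ in range(reps):
        z = np.repeat(rng.normal(size=n_clusters), m)   # cluster-level covariate
        x = np.column_stack([np.ones(n), z])
        u = np.repeat(rng.normal(scale=np.sqrt(rho), size=n_clusters), m)
        v = rng.normal(scale=np.sqrt(1.0 - rho), size=n)
        y = x @ beta0 + u + v                            # H0 is true
        rejections += classic_f(x, y, beta0) > crit
    return rejections / reps


size_iid = empirical_size(rho=0.0)
size_clustered = empirical_size(rho=0.5)
print("empirical size, independent errors:", size_iid)
print("empirical size, clustered errors  :", size_clustered)
```

Under independent errors the empirical size should stay close to the nominal level, while under clustered errors the classic statistic rejects a true hypothesis far more often, which is the phenomenon reported in Table 1.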


REFERENCES

Bracha Cz. (1998), Modele reprezentacyjne w badaniu opinii publicznej i marketingu, EFEKT, Warszawa.

Bracha Cz. (2003), Kilka uwag o szacowaniu efektu schematu losowania, [in:] Metoda reprezentacyjna w badaniach ekonomiczno-społecznych, Wydawnictwo Akademii Ekonomicznej w Katowicach, Katowice, 13-26.

Goldberger A. S. (1972), Teoria ekonometrii, PWE, Warszawa.

Holt D., Smith T. M. F. (1979), Post-stratification, J. Roy. Statist. Soc. Ser. A, 142, 33-46.

Kish L. (1965), Survey Sampling, J. Wiley, New York.

Rao C. R. (1982), Modele liniowe statystyki matematycznej, PWN, Warszawa.

Scott A. J., Holt D. (1982), The effect of two-stage sampling on ordinary least squares methods, JASA, 848-854.

Theil H. (1979), Zasady ekonometrii, PWN, Warszawa.

Wu C. F., Holt D., Holmes D. J. (1988), The effect of two-stage sampling on the F statistic, JASA, 150-159.

Czesław Domański

VERIFICATION OF HYPOTHESES CONCERNING PARAMETERS OF THE REGRESSION MODEL FOR COMPLEX SAMPLES

The problem of estimating the parameters of a regression function on the basis of complex samples has been studied for over twenty-five years. The subject of this study is the linear regression function in matrix form $y = x\beta + e$, where $\beta$ is a vector of unknown parameters and $e$ is the residual component.

In the case of complex samples, the test statistic should be modified to take into account the so-called sampling scheme effect.

The paper presents the results of simulation studies, which indicate that the hypothesis $H_0: \beta = \beta_0$ needs to be verified by means of a modified F test.
