
On Application of Logistic Regression to Mean Value Estimation in Two-Phase Sampling for Nonresponse


Academic year: 2021


ACTA UNIVERSITATIS LODZIENSIS

FOLIA OECONOMICA 194, 2005

Wojciech Gamrot*

ON APPLICATION OF LOGISTIC REGRESSION TO MEAN VALUE ESTIMATION IN TWO-PHASE SAMPLING FOR NONRESPONSE

Abstract

The phenomenon of nonresponse in a sample survey usually leads to bias in estimates of population parameters. One of the techniques applied as a countermeasure for nonresponse is based on two-phase (or double) sampling. Usually a linear combination of the mean value estimates obtained in both phases of the survey is used as an estimate of the population mean of the characteristic under study. In this paper alternative estimators for the two-phase sampling scheme are considered, which use estimates of response probabilities obtained from a logistic regression model. The results of a Monte Carlo simulation study comparing the properties of these estimators are presented. The simulations used data from the Polish 1996 Agricultural Census.

Key words: stochastic nonresponse, response probabilities, logistic regression, mean value estimation.

I. INTRODUCTION - TWO-PHASE SAMPLING

Let us assume that the mean value $\bar{Y}$ of some characteristic $Y$ in the population $U$ of size $N$ is to be estimated, and that to accomplish this a simple random sample $s$ of size $n$ is drawn without replacement from $U$, according to the sampling design $p(\cdot)$ given by the formula:

$$p(s) = \binom{N}{n}^{-1} \qquad (1)$$

We admit, as in the paper of Cassel et al. (1983), that the nonresponse mechanism is stochastic. This means that each i-th population unit has some unknown probability $p_i$ of responding if it is included in the sample. Hence, the phenomenon of nonresponse may be treated as an additional phase of sample selection, governed by some unknown probability distribution $q(s_2 \mid s)$. Särndal et al. (1992) call the probability distribution $q(s_2 \mid s)$ the response distribution.

Depending on the response probabilities, during an attempt to collect data some units respond and some do not. Hence, the sample $s$ can be divided into two sets $s_1$ and $s_2$, of sizes $0 \le n_1 \le n$ and $0 \le n_2 \le n$, containing responding and non-responding units respectively. Consequently, we have $s_1 \cup s_2 = s$, $s_1 \cap s_2 = \emptyset$ and $n_1 + n_2 = n$. The sizes $n_1$ and $n_2$ are random variables whose distributions depend on the unknown response probabilities. As shown by Lessler and Kalsbeek (1992), estimates of the population mean based only on observations from the set $s_1$ will be biased. To reduce the bias, the subsampling scheme proposed by Hansen and Hurwitz (1946) may be applied. According to this scheme, a second phase of the survey is performed. In this second phase, a simple subsample $u$ of size $n_u = c n_2$ (where $0 < c < 1$ is a constant fixed in advance) is selected without replacement from among the $n_2$ units of the set $s_2$. The probability of selecting a particular subsample may be expressed as:

$$P(u \mid s_2) = \binom{n_2}{n_u}^{-1} \qquad (2)$$

All units included in the subsample are then re-contacted, and we assume that appropriate effort is made to obtain data for all subsampled units. Hence, under this procedure the probabilities $p_i$ apply only to the first phase of the survey.
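The two-phase selection procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `two_phase_sample`, the list `p` of response probabilities, and the rounding of $c n_2$ to an integer subsample size are assumptions made here, not details given in the paper.

```python
import random

def two_phase_sample(N, n, c, p, rng):
    """Draw s, split it into respondents s1 and nonrespondents s2, subsample u from s2."""
    # First phase: simple random sample without replacement from U = {0, ..., N-1}.
    s = rng.sample(range(N), n)
    # Stochastic nonresponse: unit i responds with probability p[i].
    s1 = [i for i in s if rng.random() < p[i]]
    responded = set(s1)
    s2 = [i for i in s if i not in responded]
    # Second phase: simple subsample of size n_u = c * n_2
    # (rounded to an integer here, which is an assumption of this sketch).
    n_u = round(c * len(s2))
    u = rng.sample(s2, n_u)
    return s1, s2, u

rng = random.Random(0)
p = [0.8] * 100                      # hypothetical response probabilities
s1, s2, u = two_phase_sample(N=100, n=20, c=0.5, p=p, rng=rng)
```

Passing an explicit `random.Random` instance keeps the draw reproducible, which is convenient when the procedure is repeated many times in a simulation.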

II. CLASSICAL MEAN VALUE ESTIMATOR IN TWO-PHASE SAMPLING

As shown by Särndal et al. (1992), for the sampling scheme described above and stochastic nonresponse, the following statistic is an unbiased estimator of the population mean:

$$\bar{y}_S = w_1 \bar{y}_1 + w_2 \bar{y}_u \qquad (3)$$

where $\bar{y}_1 = \frac{1}{n_1}\sum_{i \in s_1} y_i$, $\bar{y}_2 = \frac{1}{n_2}\sum_{i \in s_2} y_i$ and $\bar{y}_u = \frac{1}{n_u}\sum_{i \in u} y_i$ are the mean values of the characteristic under study in the sets $s_1$, $s_2$ and $u$ respectively, whereas $w_1 = \frac{n_1}{n}$ and $w_2 = \frac{n_2}{n}$ denote the fractions of the sets $s_1$ and $s_2$ in the initial sample $s$. Let us also define:

$$S^2 = \frac{1}{N-1}\sum_{i \in U} (y_i - \bar{Y})^2 \quad \text{and} \quad S_2^2 = \frac{1}{n_2 - 1}\sum_{i \in s_2} (y_i - \bar{y}_2)^2.$$

The variance of $\bar{y}_S$ depends on the unknown response probabilities, and can be expressed as:

$$V(\bar{y}_S) = \frac{N-n}{Nn} S^2 + \frac{1-c}{cn} E_p E_q\left(w_2 S_2^2\right) \qquad (4)$$

where $E_p$ denotes expectation with respect to the first-phase sampling design, and $E_q$ denotes expectation with respect to the response distribution. In particular, when nonresponse is deterministic, the population $U$ may be divided into two strata $U_1$ and $U_2$ such that $U_1$ contains respondents and $U_2$ contains nonrespondents. Consequently, the variance may be expressed as:

$$V(\bar{y}_S) = \frac{N-n}{Nn} S^2 + \frac{1-c}{cn} W_2 S_{U_2}^2 \qquad (5)$$

where $W_2$ is the population nonrespondent fraction and $S_{U_2}^2$ is the variance of the characteristic under study in the nonrespondent stratum $U_2$. In further study, the estimator $\bar{y}_S$ will be denoted by the symbol S.
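As a small worked illustration of the estimator (3), the sketch below computes $w_1 \bar{y}_1 + w_2 \bar{y}_u$ from index lists. The function names and the toy data are hypothetical; the code assumes that the subsample `u` is non-empty whenever `s2` is non-empty.

```python
def mean(values):
    return sum(values) / len(values)

def estimator_S(y, s1, s2, u):
    """Classical two-phase estimator: w1 * mean(y over s1) + w2 * mean(y over u)."""
    n1, n2 = len(s1), len(s2)
    n = n1 + n2
    total = 0.0
    if n1 > 0:
        total += (n1 / n) * mean([y[i] for i in s1])
    if n2 > 0:
        # Assumes u is non-empty whenever s2 is non-empty.
        total += (n2 / n) * mean([y[i] for i in u])
    return total

# Toy data: 2 respondents, 4 nonrespondents, 2 of the latter subsampled.
y = [1, 2, 3, 4, 5, 6]
print(estimator_S(y, s1=[0, 1], s2=[2, 3, 4, 5], u=[2, 5]))   # 3.5
```

Here $w_1 = 1/3$, $\bar{y}_1 = 1.5$, $w_2 = 2/3$ and $\bar{y}_u = 4.5$, so the estimate is $0.5 + 3.0 = 3.5$.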

III. ALTERNATIVE MEAN VALUE ESTIMATOR USING AUXILIARY INFORMATION

The estimator (3) is constructed as a linear combination of the mean value estimates obtained in both phases of the survey. In the case of deterministic nonresponse, the weights of this combination may be treated as estimates of the respondent and nonrespondent stratum fractions $W_1$ and $W_2$. However, there are other possible ways to construct these weights on the basis of available auxiliary information, which lead to the estimator:

$$\bar{y}_L = a \bar{y}_1 + (1 - a) \bar{y}_u \qquad (6)$$

In particular, Wywiał (2001a) suggests assessing the parameter $a$ as:

$$a = \frac{1}{N}\sum_{i \in U} \hat{p}_i \qquad (7)$$

where $\hat{p}_i$ is the estimate of the individual response probability $p_i$ for the i-th unit. The estimates are obtained by assuming that for any population unit the probability of response is given by the following function of auxiliary variables:

$$p_i = \frac{1}{1 + \exp(-\mathbf{x}_i \boldsymbol{\beta}^T)} \qquad (8)$$

where $\boldsymbol{\beta} = [\beta_0 \ldots \beta_k]$ denotes the vector of unknown parameters and $\mathbf{x}_i = [x_{i0} \ldots x_{ik}]$ denotes the vector of auxiliary variables corresponding to the i-th population unit. We will assume that $x_{i0} = 1$ for $i = 1 \ldots N$, which means that $\beta_0$ may be treated as an intercept. Assume that $J_i = 1$ if the i-th unit responds and $J_i = 0$ if it does not. The parameters $\boldsymbol{\beta}$ can be estimated on the basis of the response behaviour observed in the initial sample $s$ by maximizing the likelihood function (see Chow, 1995):

$$L(\boldsymbol{\beta}) = \prod_{i \in s} p_i^{J_i} (1 - p_i)^{1 - J_i} \qquad (9)$$

This is equivalent to the maximization of the log-likelihood:

$$\ln L(\boldsymbol{\beta}) = \sum_{i \in s} \left[ J_i \ln p_i + (1 - J_i) \ln(1 - p_i) \right] \qquad (10)$$

Setting the partial derivatives of this expression with respect to the parameters $\beta_j$ equal to zero, we obtain a system of nonlinear equations whose solution $\hat{\boldsymbol{\beta}}$ is treated as an estimate of the parameter vector $\boldsymbol{\beta}$ (see Theil, 1979). The solution $\hat{\boldsymbol{\beta}}$ may be found by using iterative methods, discussed e.g. by Minka (2001). In particular, the gradient projection method proposed by J.B. Rosen (1969) may be used. Finally, substituting $\hat{\boldsymbol{\beta}}$ for $\boldsymbol{\beta}$ in the formula (8), it is possible to compute the estimates $\hat{p}_i$ of the individual response probabilities, the parameter $a$, and the mean value estimate $\bar{y}_L$. In general, the estimator $\bar{y}_L$ is biased, but we may expect its variance to be significantly lower than the variance of $\bar{y}_S$. In further study the estimator $\bar{y}_L$ will be denoted by the symbol L.

In this paper a modified version of this estimator, based on rounded (discretized) response probabilities $p_i'$, is also proposed. The discretization is achieved by transforming the estimates of response probabilities according to the formula:

$$p_i' = \begin{cases} 1 & \text{for } \hat{p}_i > 0.5 \\ 0 & \text{for } \hat{p}_i < 0.5 \end{cases} \qquad (11)$$

and then applying the transformed values instead of $\hat{p}_i$ in expressions (7) and (6). Such an approach is in some sense justified when the nonresponse mechanism is expected to be deterministic, or approximately deterministic. In such a case, the proposed procedure resembles the application of a discrimination method to divide the population into clusters of respondents and nonrespondents and to assess their population proportions, as proposed by Wywiał (2001b). In further study the modified version of the estimator $\bar{y}_L$ will be denoted by the symbol R.

It is worth noting that assumptions similar to the one given by (8), making use of the multivariate logistic curve, are often considered in the context of nonresponse modelling (see Cassel et al., 1983; Ekholm and Laaksonen, 1991; or Gao et al., 2000), but they typically constitute the basis for the construction of weighting adjustments in single-phase sampling.
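The estimation steps of this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the paper: the log-likelihood is maximised here with plain Newton-Raphson iterations instead of the gradient projection method, the parameter $a$ is taken to be the population mean of the estimated response probabilities, a probability exactly equal to 0.5 is rounded down, and all function names are hypothetical.

```python
import math

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    m = len(b)
    M = [row[:] + [b[r]] for r, row in enumerate(A)]
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, m):
            f = M[r][col] / M[col][col]
            for k in range(col, m + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * m
    for r in range(m - 1, -1, -1):
        x[r] = (M[r][m] - sum(M[r][k] * x[k] for k in range(r + 1, m))) / M[r][r]
    return x

def fit_logistic(X, J, iterations=25):
    """Newton-Raphson maximisation of the log-likelihood; each row of X starts with 1."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(iterations):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]   # negative Hessian of the log-likelihood
        for row, j in zip(X, J):
            p = sigmoid(sum(bv * xv for bv, xv in zip(beta, row)))
            for r in range(k):
                grad[r] += (j - p) * row[r]
                for q in range(k):
                    hess[r][q] += p * (1.0 - p) * row[r] * row[q]
        step = solve(hess, grad)
        beta = [bv + d for bv, d in zip(beta, step)]
    return beta

def estimator_L(y, s1, u, p_hat, rounded=False):
    """a * mean(s1) + (1 - a) * mean(u); rounded=True gives the discretized variant R."""
    # Assumption: a is the mean estimated response probability over the whole population.
    probs = [float(p > 0.5) if rounded else p for p in p_hat]
    a = sum(probs) / len(probs)
    mean_s1 = sum(y[i] for i in s1) / len(s1)
    mean_u = sum(y[i] for i in u) / len(u)
    return a * mean_s1 + (1.0 - a) * mean_u
```

The Newton-Raphson loop has no safeguard against perfectly separated data, in which case the maximum likelihood estimate does not exist; a production implementation would need a convergence check and a fallback.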

IV. ANOTHER ESTIMATOR INVOLVING ESTIMATES OF RESPONSE PROBABILITIES - AN EXTENSION OF THE SINGLE PHASE WEIGHTING ADJUSTMENT

The typical way in which estimates of response probabilities are used in single-phase sampling is to construct individual weights for each observation. As indicated by Särndal et al. (1992) and Bethlehem (1988), for an arbitrary sampling design the weight for the i-th unit is usually set to $1/(\pi_i \hat{p}_i)$, where $\pi_i$ is the inclusion probability associated with this unit. In the case of simple random sampling without replacement, the first-order inclusion probability is equal to $n/N$, so the mean value estimator takes the form:

$$\bar{y} = \frac{1}{N}\sum_{i \in s_1} \frac{N}{n} \frac{y_i}{\hat{p}_i} = \frac{1}{n}\sum_{i \in s_1} \frac{y_i}{\hat{p}_i} \qquad (12)$$

The use of the estimates $\hat{p}_i$ instead of the exact response probabilities introduces some bias, but we may hope that this bias is modest if the response probabilities are estimated with sufficient accuracy. However, this estimator cannot be used if any of the estimates $\hat{p}_i$ is equal to zero. We propose a way to overcome this obstacle using the two-phase sampling procedure described above. Let us note that the (conditional) response probability for any i-th unit included in the first-phase sample $s$ may be expressed as:

$$\rho_i = p_i + c(1 - p_i) \qquad (13)$$

where $p_i$ is the probability of this unit responding at the first phase and $c(1 - p_i)$ represents the probability of this unit not responding at the first phase, but being included in the subsample and consequently responding at the second phase. This allows the estimator (12) to be rewritten in the form:

$$\bar{y}_W = \frac{1}{n}\sum_{i \in s_1 \cup u} \frac{y_i}{\hat{p}_i + c(1 - \hat{p}_i)} \qquad (14)$$

It is easy to notice that the expression in the denominator is always positive, provided that $c > 0$. Let us also stress that sampling units from both the set $s_1$ and the subsample $u$ are used in the computation of the mean value estimate. For this estimator the logistic regression model will again be used as a means of estimating response probabilities. In further study the estimator $\bar{y}_W$ will be denoted by the symbol W.
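A brief sketch of the weighting estimator W follows. It assumes the form $\frac{1}{n}\sum_{i \in s_1 \cup u} y_i / (\hat{p}_i + c(1 - \hat{p}_i))$; the function name and the toy data are hypothetical. The example deliberately includes a unit with an estimated response probability of zero, which the single-phase weighting estimator could not handle.

```python
def estimator_W(y, s1, u, n, c, p_hat):
    """Weighted mean over s1 and u, with weights 1 / (p_i + c * (1 - p_i))."""
    total = 0.0
    for i in list(s1) + list(u):
        # The denominator stays positive for c > 0 even when p_hat[i] == 0.
        total += y[i] / (p_hat[i] + c * (1.0 - p_hat[i]))
    return total / n

y = [2.0, 4.0, 6.0]
p_hat = [1.0, 0.5, 0.0]          # hypothetical estimated response probabilities
w = estimator_W(y, s1=[0], u=[2], n=3, c=0.5, p_hat=p_hat)
# Unit 0 contributes 2 / 1.0 = 2, unit 2 contributes 6 / 0.5 = 12, so w = 14 / 3.
```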

V. COMPARISON OF ESTIMATORS BY MEANS OF MONTE CARLO SIMULATION

A simulation study was performed to compare the accuracy of the four estimators presented above. The data obtained from the Polish Agricultural Census of 1996 for certain municipalities of the Dąbrowa Tarnowska district represented the population under study during the simulations. A total of 2422 units were used in the simulation. The variable under study, denoted by $Y$, was the total sales of the farm in the year 1995. The auxiliary variables were the farm area (in acres) $X_1$, the number of pigs on the farm $X_2$, and the number of cattle on the farm $X_3$. The Pearson linear correlation coefficients between these variables are shown in the following table:

Table 1. Correlation coefficients between the variable under study and the auxiliary variables

        Y      X1     X2     X3
Y       1      0.63   0.52   0.50
X1      0.63   1      0.58   0.67
X2      0.52   0.58   1      0.62
X3      0.50   0.67   0.62   1

For every population unit the response probabilities were generated according to the model given by expression (8) and a predefined parameter vector $\boldsymbol{\beta}$. The experiments were carried out by repeatedly drawing simple random samples without replacement from the population. To represent the stochastic nonresponse mechanism, for each unit included in any sample an independent random trial was executed with the probability of success equal to this unit's response probability. A unit was assumed to respond if the outcome of the trial was a success and was treated as a nonrespondent otherwise. For the resulting set of nonrespondents a simple subsample of size equal to 30% of the number of first-phase nonrespondents was drawn without replacement, and all four estimators, denoted by the letters S, L, R and W, were computed. On the basis of the computed estimates, the mean square error of each estimator was evaluated.
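The structure of such a simulation can be sketched as below. The census data are not reproduced here, so the sketch uses a synthetic population, and for brevity it compares only the classical estimator S with the naive mean of first-phase respondents; the population, the response model coefficients, and the rule of drawing at least one subsampled unit are all assumptions of this illustration, not settings from the paper.

```python
import math
import random

def simulate_mse(reps=2000, N=500, n=50, c=0.3, seed=1):
    """Monte Carlo MSE of estimator S and of the naive respondent mean."""
    rng = random.Random(seed)
    x = list(range(N))                                   # synthetic auxiliary variable
    y = [xi + rng.gauss(0.0, 20.0) for xi in x]          # synthetic study variable
    # Response probability increases with x, following a logistic curve.
    p = [1.0 / (1.0 + math.exp(-(-1.0 + 0.01 * xi))) for xi in x]
    true_mean = sum(y) / N
    sq_err = {"S": 0.0, "naive": 0.0}
    done = 0
    for _ in range(reps):
        s = rng.sample(range(N), n)
        s1 = [i for i in s if rng.random() < p[i]]
        responded = set(s1)
        s2 = [i for i in s if i not in responded]
        if not s1 or not s2:
            continue                                     # skip degenerate samples
        u = rng.sample(s2, max(1, round(c * len(s2))))   # at least one subsampled unit
        mean_s1 = sum(y[i] for i in s1) / len(s1)
        mean_u = sum(y[i] for i in u) / len(u)
        est_S = (len(s1) * mean_s1 + len(s2) * mean_u) / n
        sq_err["S"] += (est_S - true_mean) ** 2
        sq_err["naive"] += (mean_s1 - true_mean) ** 2
        done += 1
    return {name: total / done for name, total in sq_err.items()}

mse = simulate_mse()
relative_accuracy = mse["naive"] / mse["S"]   # analogous to the MSE ratios reported below
```

Other estimators can be added to the loop in the same way, each accumulating its own squared error against the known population mean.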

Experiment 1. In the first experiment only one auxiliary variable was used. Response probabilities were generated for an arbitrarily chosen parameter vector $\boldsymbol{\beta} = [-4, 0.003]$. Consequently, the average response probability in the population was 0.89. Simulations were executed for the sample sizes n = 40, 80, ..., 200. For every value of n a total of 10 000 samples were drawn from the population. The relative accuracy (the ratio of the MSE of any estimator to the MSE of the standard estimator S) is shown on Graph 1. Each point on the graph therefore corresponds to 10 000 computed estimates.

[Graph 1: MSE(L)/MSE(S), MSE(R)/MSE(S) and MSE(W)/MSE(S) on a vertical axis from 0 to 1, plotted against the initial sample size n = 40, 80, 120, 160, 200]

Graph 1. The dependence between the initial sample size n and the relative accuracy of the estimators for a single auxiliary variable

As can be seen on the graph, for every value of n for which the simulations were executed, each of the estimators L, R and W had a lower MSE than the standard two-phase estimator S. For small initial sample sizes the estimator R was the best (in terms of MSE). However, the MSE of this estimator grew rapidly, and for larger samples the estimators L and W were more accurate than R. The results suggest that for a large enough sample size the MSE of the estimator R may exceed the MSE of the standard estimator S. The relative accuracy of the strategies L and W was rather stable and approximately equal to 80%. In most cases, the strategy W had a lower MSE than the strategy L. It is worth noting that for small sample sizes the estimator R behaves well, despite the non-deterministic character of the response mechanism.

Experiment 2. In the second experiment three auxiliary variables $X_1$, $X_2$ and $X_3$ were used. Response probabilities were generated for an arbitrarily chosen parameter vector $\boldsymbol{\beta} = [-6, 0.002, 0.146, 0.348]$, with the values $\beta_1 \ldots \beta_3$ inversely proportional to the mean values of the corresponding auxiliary variables. Consequently, the average response probability in the population was 0.85. Simulations were executed for the sample sizes n = 40, 80, ..., 200. For every value of n a total of 10 000 samples were drawn from the population. The relative accuracy (the ratio of the MSE of any estimator to the MSE of the standard estimator S) is shown on Graph 2. Each point on the graph therefore corresponds to 10 000 computed estimates.

In this experiment, again each of the estimators L, R and W had a lower MSE than the standard two-phase estimator S, for every value of n for which the simulations were executed. The lowest MSE was observed for the estimator R, and the highest for the estimator W.

[Graph 2: MSE(L)/MSE(S), MSE(R)/MSE(S) and MSE(W)/MSE(S) on a vertical axis from 0 to 1, plotted against the initial sample size n]

Graph 2. The dependence between the initial sample size n and the relative accuracy of the estimators for three auxiliary variables

It is worth noting that the addition of two auxiliary variables did not significantly influence the relative accuracy of the estimator L, but the behavior of the two other estimators changed. The MSE of the estimator R grows more slowly with the increase of the initial sample size, whereas the MSE of the estimator W is stable, but greater than it was for only one auxiliary variable, probably because of the weaker correlation between the variables $X_2$ and $X_3$ and the variable under study.

VI. SUMMARY

In general, the estimators considered in this paper are biased. For the price of bias, lower values of MSE may be achieved. It should be stressed, however, that in order to improve the MSE suitable auxiliary information is needed. In the case of the estimators L and R, auxiliary characteristics have to be observed for all population units to compute the estimates. For the estimator W, auxiliary characteristics need only be observed for the units included in the initial sample, so this estimator may be applied in situations where the estimators R and L are not applicable due to a lack of data on auxiliary characteristics in the whole population. Moreover, the simulation results presented here are based on the assumption that the functional form of the response mechanism is known. In practice, such knowledge usually comes from sources external to the survey, and it may be inaccurate or simply false. In such a case a model misspecification error occurs, and as a result the MSE of the estimators may be greater. When the functional form of the response mechanism is unknown, some nonparametric methods may also be used to estimate the response probabilities.

REFERENCES

Bethlehem J.G. (1988), Reduction of non-response bias through regression estimation, Journal of Official Statistics, 4, 251-260.

Cassel C.M., Särndal C.E., Wretman J.H. (1983), Some uses of statistical models in connection with the nonresponse problem, [in:] Incomplete Data in Sample Surveys, eds. W.G. Madow, I. Olkin, Academic Press, New York.

Chow G.C. (1995), Ekonometria, Wyd. Nauk. PWN, Warszawa.

Ekholm A., Laaksonen S. (1991), Weighting via response modelling in the Finnish household budget survey, Journal of Official Statistics, 7, 3, 325-338.

Gao S., Hui S.L., Hall K.S., Hendrie H.C. (2000), Estimating disease prevalence from two-phase surveys with non-response at the second phase, Statistics in Medicine, No 19, 2101-2114.

Hansen M.H., Hurwitz W.N. (1946), The problem of nonresponse in sample surveys, Journal

Lessler J.T., Kalsbeek W.D. (1992), Nonsampling Error in Surveys, John Wiley & Sons, New York.

Minka T.P. (2001), Algorithms for maximum likelihood logistic regression, technical report, Carnegie Mellon University, http://www.stat.cmu.edu/tr/tr758/tr758.pdf

Rosen J.B. (1969), The gradient projection method for nonlinear programming, Journal of Society for Industrial and Applied Mathematics, 8, 1, 181-217.

Särndal C.E., Swensson B., Wretman J.H. (1992), Model Assisted Survey Sampling, Springer Verlag, New York.

Theil H. (1979), Zasady ekonometrii, PWN, Warszawa.

Wywiał J. (2001a), Estimation of population mean on the basis of non-simple sample when non-response error is present, Statistics in Transition, 5, 3, 443-450.

Wywiał J. (2001b), On estimation of population mean in the case when nonrespondents are present, Taksonomia - Klasyfikacja danych, teoria i zastosowania, No 8, Prace Naukowe AE Wrocław.

Wojciech Gamrot

ON THE APPLICATION OF LOGISTIC REGRESSION TO MEAN VALUE ESTIMATION USING A TWO-PHASE SAMPLING SCHEME IN THE PRESENCE OF NONRESPONSE

Summary

Incomplete observation of the characteristic under study in a statistical survey usually leads to bias in the resulting estimate of the population parameter of interest. One of the techniques used to counteract this phenomenon relies on a two-phase sampling scheme. A linear combination of the mean value estimates obtained in the first and second stage (phase) of the survey is usually used as the estimator of the population mean. This paper attempts to examine the properties of alternative estimation strategies that use the two-phase sampling scheme and incorporate, in the construction of the estimator, estimates of the probabilities of obtaining a response from individual population units, obtained using a logistic regression model. The properties of these strategies were compared by means of computer simulation, using data collected in selected municipalities of the Dąbrowa Tarnowska district during the 1996 agricultural census.
