On Application of Logistic Regression to Mean Value Estimation in Two-Phase Sampling for Nonresponse


ACTA UNIVERSITATIS LODZIENSIS

FOLIA OECONOMICA 194, 2005

Wojciech Gamrot*


Abstract

The phenomenon of nonresponse in a sample survey usually leads to bias in estimates of population parameters. One of the techniques applied as a countermeasure for nonresponse is based on two-phase (or double) sampling. Usually a linear combination of mean value estimates obtained in both phases of the survey is used as an estimate of the population mean value of the characteristic under study. In this paper alternative estimators for the two-phase sampling scheme, using estimates of response probabilities obtained on the basis of a logistic regression model, are considered. The results of a Monte Carlo simulation study comparing the properties of these estimators are presented. In the simulations, data from the Polish 1996 Agricultural Census were used.

Key words: stochastic nonresponse, response probabilities, logistic regression, mean value estimation.

I. INTRODUCTION - TWO-PHASE SAMPLING

Let us assume that the mean value $\bar{Y}$ of some characteristic $Y$ in the population $U$ of size $N$ is to be estimated, and that to accomplish this a simple random sample $s$ of size $n$ is drawn without replacement from $U$, according to the sampling design $p(\cdot)$ given by the formula:

$$p(s) = \binom{N}{n}^{-1} \qquad (1)$$

We admit, as in the paper of Cassel et al. (1983), that the nonresponse mechanism is a stochastic one. This means that each i-th population unit has some unknown probability $p_i$ of responding if it is included in the sample. Hence, the phenomenon of nonresponse may be treated as an additional phase of sample selection, governed by some unknown probability distribution $q(s_2 \mid s)$. Särndal et al. (1992) call the probability distribution $q(s_2 \mid s)$ the response distribution.

Depending on the response probabilities, during an attempt to collect data some units respond and some do not. Hence, the sample $s$ can be divided into two sets $s_1$ and $s_2$, of sizes $0 \le n_1 \le n$ and $0 \le n_2 \le n$, containing responding and non-responding units respectively. Consequently, we have $s_1 \cup s_2 = s$, $s_1 \cap s_2 = \emptyset$ and $n_1 + n_2 = n$. The sizes $n_1$ and $n_2$ are random variables whose distributions depend on the unknown response probabilities. As shown by Lessler and Kalsbeek (1992), estimates of the population mean based only on observations from the set $s_1$ will be biased. To reduce the bias, a subsampling scheme proposed by Hansen and Hurwitz (1946) may be applied. According to this scheme, a second phase of the survey is performed. In this second phase, a simple subsample $u$ of size $n_u = c n_2$ (where $0 < c < 1$ is a constant fixed in advance) is selected without replacement from among the $n_2$ units of the set $s_2$. The probability of selecting a particular subsample may be expressed as:

$$p(u \mid s_2) = \binom{n_2}{n_u}^{-1} \qquad (2)$$

All units included in the subsample are then re-contacted, and we assume that appropriate effort is made to obtain data for all subsampled units. Hence, under this procedure the probabilities $p_i$ apply only to the first phase of the survey.
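The selection procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `draw_two_phase_sample` is hypothetical, units are identified by indices 0..N-1, and the subsample size $c n_2$ is rounded up to an integer (the text does not say how non-integer sizes are handled).

```python
import math
import random

def draw_two_phase_sample(N, n, p, c, rng):
    """Return (s1, s2, u): respondents, nonrespondents and the subsample.

    p[i] is the response probability of unit i; c is the subsampling fraction.
    """
    s = rng.sample(range(N), n)                  # first phase: simple random sample without replacement
    s1 = [i for i in s if rng.random() < p[i]]   # each sampled unit responds with probability p[i]
    responded = set(s1)
    s2 = [i for i in s if i not in responded]    # first-phase nonrespondents
    n_u = math.ceil(c * len(s2))                 # assumption: c * n2 rounded up to an integer
    u = rng.sample(s2, n_u)                      # second phase: simple subsample of s2
    return s1, s2, u

rng = random.Random(2005)                        # arbitrary seed, for reproducibility only
p = [0.85] * 500                                 # hypothetical constant response probabilities
s1, s2, u = draw_two_phase_sample(N=500, n=40, p=p, c=0.3, rng=rng)
```

Rounding up guarantees that the subsample is non-empty whenever at least one unit fails to respond, which keeps the second-phase mean defined.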

II. CLASSICAL MEAN VALUE ESTIMATOR IN TWO-PHASE SAMPLING

As shown by Särndal et al. (1992), for the sampling scheme described above and stochastic nonresponse, the following statistic is an unbiased estimator of the population mean:

$$\bar{y}_S = w_1 \bar{y}_1 + w_2 \bar{y}_u \qquad (3)$$

where $\bar{y}_1 = \frac{1}{n_1} \sum_{i \in s_1} y_i$, $\bar{y}_2 = \frac{1}{n_2} \sum_{i \in s_2} y_i$ and $\bar{y}_u = \frac{1}{n_u} \sum_{i \in u} y_i$ are the mean values of the characteristic under study in the sets $s_1$, $s_2$ and $u$, whereas $w_1 = \frac{n_1}{n}$ and $w_2 = \frac{n_2}{n}$ denote the fractions of the sets $s_1$ and $s_2$ in the initial sample $s$. Let us also define: $S^2 = \frac{1}{N - 1} \sum_{i \in U} (y_i - \bar{Y})^2$ and $S_2^2 = \frac{1}{n_2 - 1} \sum_{i \in s_2} (y_i - \bar{y}_2)^2$.

The variance of $\bar{y}_S$ depends on the unknown response probabilities, and can be expressed as:

$$V(\bar{y}_S) = \frac{N - n}{N n} S^2 + E_p E_q \left( \frac{1 - c}{c} \cdot \frac{n_2}{n^2} S_2^2 \right) \qquad (4)$$

where $E_p$ denotes expectation with respect to the first stage sampling design, and $E_q$ denotes expectation with respect to the response distribution. In particular, when deterministic nonresponse appears, the population $U$ may be divided into two strata $U_1$ and $U_2$ such that $U_1$ contains respondents and $U_2$ contains nonrespondents. Consequently, the variance may be expressed as:

$$V(\bar{y}_S) = \frac{N - n}{N n} S^2 + \frac{1 - c}{c} \cdot \frac{W_2}{n} S_{U_2}^2 \qquad (5)$$

where $W_2$ is the population nonrespondent fraction and $S_{U_2}^2$ is the variance of the characteristic under study in the nonrespondent stratum $U_2$. In further study, the estimator $\bar{y}_S$ will be denoted by the symbol S.
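As a small illustration of the classical estimator (3), the sketch below combines the respondent mean and the subsample mean with the weights $n_1/n$ and $n_2/n$. The function name `estimator_S` and the convention of returning a zero contribution for an empty set are assumptions made for this example only.

```python
def estimator_S(y, s1, s2, u):
    """Classical two-phase estimate: w1 * mean(y over s1) + w2 * mean(y over u).

    y maps unit index -> observed value; s1, s2, u are lists of unit indices.
    Assumes u is non-empty whenever s2 is non-empty.
    """
    n1, n2 = len(s1), len(s2)
    n = n1 + n2
    mean_1 = sum(y[i] for i in s1) / n1 if n1 else 0.0   # an empty set contributes nothing
    mean_u = sum(y[i] for i in u) / len(u) if u else 0.0
    return (n1 / n) * mean_1 + (n2 / n) * mean_u

# Hypothetical data: six sampled units, three respondents, two subsampled nonrespondents.
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
estimate = estimator_S(y, s1=[0, 1, 2], s2=[3, 4, 5], u=[3, 5])
```

Note that only the values of units in $s_1$ and $u$ are used; the set $s_2$ enters through its size alone.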

III. ALTERNATIVE MEAN VALUE ESTIMATOR USING AUXILIARY INFORMATION

The estimator (3) is constructed as a linear combination of mean value estimates obtained in both phases of the survey. In the case of deterministic nonresponse, the weights of this combination may be treated as estimates of the respondent and nonrespondent stratum fractions $W_1$ and $W_2$. However, there are other possible ways to construct these weights on the basis of available auxiliary information, which lead to the estimator:

$$\bar{y}_L = a \bar{y}_1 + (1 - a) \bar{y}_u \qquad (6)$$

In particular, Wywiał (2001a) suggests assessing the parameter $a$ as:

$$a = \frac{1}{N} \sum_{i \in U} \hat{p}_i \qquad (7)$$

where $\hat{p}_i$ is the estimate of the individual response probability $p_i$ for the i-th unit. The estimates are obtained by assuming that for any population unit the probability of response is given by the following function of auxiliary variables:

$$p_i = \left[ 1 + \exp\left( -\sum_{j=0}^{k} \beta_j x_{ij} \right) \right]^{-1} \qquad (8)$$

where $\beta = [\beta_0, \ldots, \beta_k]$ denotes the vector of unknown parameters and $x_i = [x_{i0}, \ldots, x_{ik}]$ denotes the vector of auxiliary variables corresponding to the i-th population unit. We will assume that $x_{i0} = 1$ for $i = 1, \ldots, N$, which means that $\beta_0$ may be treated as an intercept. Let $J_i = 1$ if the i-th unit responds and $J_i = 0$ if it does not. The parameters $\beta$ can be estimated on the basis of the response behaviour observed in the initial sample $s$ by maximizing the likelihood function (see Chow, 1995):

$$L(\beta) = \prod_{i \in s} p_i^{J_i} (1 - p_i)^{1 - J_i} \qquad (9)$$

This is equivalent to the maximization of the log-likelihood:

$$\ln L(\beta) = \sum_{i \in s} \left[ J_i \ln p_i + (1 - J_i) \ln (1 - p_i) \right] \qquad (10)$$

Setting the partial derivatives of this expression with respect to the parameters $\beta_j$ equal to zero, we obtain a system of nonlinear equations whose solution $\hat{\beta}$ is treated as an estimate of the parameter vector $\beta$ (see Theil, 1979). The solution $\hat{\beta}$ may be found by using iterative methods, discussed e.g. by Minka (2001). In particular, the gradient projection method proposed by J.B. Rosen (1969) may be used. Finally, substituting $\hat{\beta}$ for $\beta$ in formula (8), it is possible to compute the estimates $\hat{p}_i$ of individual response probabilities, the parameter $a$, and the mean value estimate $\bar{y}_L$. In general, the estimator $\bar{y}_L$ is biased, but we may expect its variance to be significantly lower than the variance of $\bar{y}_S$. In further study the estimator $\bar{y}_L$ will be denoted by the symbol L.

In this paper a modified version of this estimator, based on rounded (discretized) response probabilities $\hat{p}_i'$, is also proposed. The discretization is achieved by transforming the estimates of response probabilities according to the formula:

$$\hat{p}_i' = \begin{cases} 1 & \text{for } \hat{p}_i > 0.5 \\ 0 & \text{for } \hat{p}_i \le 0.5 \end{cases} \qquad (11)$$

and then applying the transformed values instead of $\hat{p}_i$ in expressions (7) and (6). Such an approach is in some sense justified when the nonresponse mechanism is expected to be deterministic or approximately deterministic. In such a case, the proposed procedure resembles the application of a discrimination method to divide the population into clusters of respondents and nonrespondents and to assess their population proportions, as proposed by Wywiał (2001b). In further study the modified version of the estimator $\bar{y}_L$ will be denoted by the symbol R.
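To make the estimation of the response model concrete, the following sketch maximizes the logistic log-likelihood with plain Newton-Raphson iterations. This is a swapped-in technique chosen for brevity, not the gradient projection method cited in the text; the function names (`response_prob`, `solve`, `fit_logistic`), the zero starting point and the fixed iteration count are all assumptions of this example. It also assumes the data are not perfectly separable, since otherwise the maximum does not exist and the iterations diverge.

```python
import math

def response_prob(x, beta):
    """Logistic response probability for one unit; x[0] is assumed to equal 1."""
    z = sum(b * v for b, v in zip(beta, x))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)                      # numerically stable branch for negative z
    return e / (1.0 + e)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    k = len(b)
    M = [row[:] + [b[r]] for r, row in enumerate(A)]   # augmented matrix
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, k):
            f = M[r][col] / M[col][col]
            for j in range(col, k + 1):
                M[r][j] -= f * M[col][j]
    x = [0.0] * k
    for r in reversed(range(k)):
        x[r] = (M[r][k] - sum(M[r][j] * x[j] for j in range(r + 1, k))) / M[r][r]
    return x

def fit_logistic(X, J, iterations=25):
    """Newton-Raphson maximization of the log-likelihood.

    X: list of auxiliary vectors (first element 1); J: list of 0/1 response indicators.
    """
    k = len(X[0])
    m = len(X)
    beta = [0.0] * k
    for _ in range(iterations):
        p = [response_prob(x, beta) for x in X]
        # Gradient of the log-likelihood: sum of (J_i - p_i) * x_i
        grad = [sum((J[i] - p[i]) * X[i][a] for i in range(m)) for a in range(k)]
        # Negative Hessian: sum of p_i * (1 - p_i) * x_i x_i'
        info = [[sum(p[i] * (1 - p[i]) * X[i][a] * X[i][b] for i in range(m))
                 for b in range(k)] for a in range(k)]
        step = solve(info, grad)
        beta = [b + s for b, s in zip(beta, step)]
    return beta

# Hypothetical sample: one binary auxiliary variable, 1 of 4 and 3 of 4 units responding.
X = [[1, 0]] * 4 + [[1, 1]] * 4
J = [1, 0, 0, 0, 1, 1, 1, 0]
beta_hat = fit_logistic(X, J)
p_hat = [response_prob(x, beta_hat) for x in X]
```

In this toy sample the fitted probabilities reproduce the observed response rates in the two groups, which is a convenient sanity check for any implementation.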

It is worth noting that assumptions similar to the one given by (8), making use of a multivariate logistic curve, are often considered in the context of nonresponse modelling (see Cassel et al., 1983; Ekholm and Laaksonen, 1991; or Gao et al., 2000), but they typically constitute the basis for the construction of weighting adjustments in single-phase sampling.
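The estimators L and R of this section can be illustrated as follows. The sketch assumes that the parameter $a$ is taken as the average of the estimated response probabilities over all population units, which agrees with the remark in the Summary that L and R require auxiliary data for the whole population; the function names `weight_a` and `estimator_L` are hypothetical.

```python
def weight_a(p_hat_population, discretize=False):
    """Weight a: population average of estimated response probabilities (assumed form).

    With discretize=True each probability is first rounded to 1 (if > 0.5) or 0.
    """
    if discretize:
        p_hat_population = [1.0 if p > 0.5 else 0.0 for p in p_hat_population]
    return sum(p_hat_population) / len(p_hat_population)

def estimator_L(mean_1, mean_u, p_hat_population, discretize=False):
    """a * (respondent mean) + (1 - a) * (subsample mean); discretize=True gives estimator R."""
    a = weight_a(p_hat_population, discretize)
    return a * mean_1 + (1.0 - a) * mean_u

# Hypothetical inputs: phase means and fitted probabilities for a population of four units.
p_hat_population = [0.9, 0.6, 0.6, 0.2]
value_L = estimator_L(2.0, 5.0, p_hat_population)
value_R = estimator_L(2.0, 5.0, p_hat_population, discretize=True)
```

With these inputs the rounded probabilities place more weight on the respondent mean than the unrounded ones, because three of the four fitted probabilities exceed 0.5.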

IV. ANOTHER ESTIMATOR INVOLVING ESTIMATES OF RESPONSE PROBABILITIES - AN EXTENSION OF THE SINGLE-PHASE WEIGHTING ADJUSTMENT

The typical way the estimates of response probabilities are used in single-phase sampling is to construct individual weights for each observation. As indicated by Särndal et al. (1992) and Bethlehem (1988), for an arbitrary sampling design the weight for the i-th unit is usually set to $1 / (\pi_i p_i)$, where $\pi_i$ is the inclusion probability associated with this unit. In the case of simple random sampling without replacement, the first-order inclusion probability is equal to $n / N$, so the mean value estimator takes the form:

$$\bar{y} = \frac{1}{N} \sum_{i \in s_1} \frac{y_i}{\frac{n}{N} \hat{p}_i} \qquad (12)$$

The use of estimates $\hat{p}_i$ instead of exact response probabilities introduces some bias, but we may hope that this bias is modest if the response probabilities are estimated with sufficient accuracy. However, this estimator cannot be used if any of the estimates $\hat{p}_i$ is equal to zero. We propose a way to overcome this obstacle using the two-phase sampling procedure described above. Let us note that the (conditional) response probability for any i-th unit included in the first-phase sample $s$ may be expressed as:

$$P(i \in s_1 \cup u \mid i \in s) = p_i + c (1 - p_i) \qquad (13)$$

where $p_i$ is the probability of this unit responding at the first phase and $c (1 - p_i)$ represents the probability of this unit not responding at the first phase, but being included in the subsample and consequently responding at the second phase. This allows the estimator (12) to be rewritten in the form:

$$\bar{y}_W = \frac{1}{N} \sum_{i \in s_1 \cup u} \frac{y_i}{\frac{n}{N} \left( \hat{p}_i + c (1 - \hat{p}_i) \right)} \qquad (14)$$

It is easy to notice that the expression in the denominator is always positive, provided that $c > 0$. Let us also stress that sampling units from both the set $s_1$ and the subsample $u$ are used in the computation of the mean value estimate. For this estimator the logistic regression model will again be used as a means of estimating response probabilities. In further study the estimator $\bar{y}_W$ will be denoted by the symbol W.
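A minimal sketch of the weighted two-phase estimator follows. It weights every observed unit, from the respondent set and from the subsample, by the inverse of its estimated overall response probability $\hat{p}_i + c(1 - \hat{p}_i)$ and divides by the initial sample size $n$, since the factors $1/N$ and $n/N$ reduce to $1/n$. The function name `estimator_W` and the data below are hypothetical.

```python
def estimator_W(y, s1, u, n, p_hat, c):
    """Weighted two-phase estimate of the population mean.

    y, p_hat: values and estimated first-phase response probabilities by unit index;
    s1, u: respondents and subsampled nonrespondents; n: initial sample size;
    c: subsampling fraction (must be positive).
    """
    total = 0.0
    for i in list(s1) + list(u):
        overall_prob = p_hat[i] + c * (1.0 - p_hat[i])   # positive whenever c > 0
        total += y[i] / overall_prob
    return total / n

# Hypothetical data: two respondents and one subsampled unit out of an initial sample of 4.
y = [10.0, 20.0, 30.0]
value_W = estimator_W(y, s1=[0, 1], u=[2], n=4, p_hat=[0.5, 0.5, 0.5], c=0.5)
# Even zero estimated probabilities do not break the computation:
value_W_zero = estimator_W(y, s1=[0, 1], u=[2], n=4, p_hat=[0.0, 0.0, 0.0], c=0.5)
```

The second call shows the point made in the text: with $c > 0$ the denominator never vanishes, so estimates $\hat{p}_i = 0$ are admissible.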

V. COMPARISON OF ESTIMATORS BY MEANS OF MONTE CARLO SIMULATION

A simulation study was performed to compare the accuracy of the four estimators presented above. Data from the 1996 Polish Agricultural Census for certain municipalities of the Dąbrowa Tarnowska district represented the population under study. A total of 2422 units were used in the simulation. The variable under study, denoted by $Y$, was the total sales of the farm in the year 1995. The auxiliary variables were the farm area (in acres) - $X_1$, the number of pigs on the farm - $X_2$, and the number of cattle on the farm - $X_3$. The Pearson linear correlation coefficients between these variables are shown in the following table:

Table 1. Correlation coefficients between the variable under study and auxiliary variables

        Y      X_1    X_2    X_3
Y       1      0.63   0.52   0.50
X_1     0.63   1      0.58   0.67
X_2     0.52   0.58   1      0.62
X_3     0.50   0.67   0.62   1

For every population unit the response probabilities were generated according to the model given by expression (8) and a predefined parameter vector $\beta$. The experiments were carried out by repeatedly drawing simple random samples without replacement from the population. To represent the stochastic nonresponse mechanism, for each unit included in any sample an independent random trial was executed with the probability of success equal to this unit's response probability. A unit was assumed to respond if the outcome of the trial was a success and was treated as a nonrespondent otherwise. From the resulting set of nonrespondents a simple subsample of size equal to 30% of the number of first-phase nonrespondents was drawn without replacement, and all four estimators, denoted by the letters S, L, R and W, were computed. On the basis of the computed estimates, the mean square error of each estimator was evaluated.
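The simulation loop described above can be sketched as follows. The example is deliberately small and runs on synthetic data rather than the census data; it compares the classical estimator S with the naive respondent mean only, and every name in it (`simulate_mse`, `est_S`, `est_naive`), as well as the rounding of the subsample size upwards, is an assumption made for illustration.

```python
import math
import random

def simulate_mse(y, p, n, c, estimators, replications, seed=0):
    """Return {name: MSE} evaluated over repeated two-phase samples."""
    rng = random.Random(seed)
    N = len(y)
    true_mean = sum(y) / N
    sq_err = {name: 0.0 for name in estimators}
    for _ in range(replications):
        s = rng.sample(range(N), n)                    # first-phase sample
        s1 = [i for i in s if rng.random() < p[i]]     # independent response trials
        in_s1 = set(s1)
        s2 = [i for i in s if i not in in_s1]
        u = rng.sample(s2, math.ceil(c * len(s2)))     # subsample of nonrespondents
        for name, est in estimators.items():
            sq_err[name] += (est(y, s1, s2, u) - true_mean) ** 2
    return {name: total / replications for name, total in sq_err.items()}

def est_S(y, s1, s2, u):
    """Classical two-phase estimator (3)."""
    n = len(s1) + len(s2)
    part_1 = sum(y[i] for i in s1) / n                 # equals w1 * mean over s1
    part_u = (len(s2) / n) * (sum(y[i] for i in u) / len(u)) if u else 0.0
    return part_1 + part_u

def est_naive(y, s1, s2, u):
    """Respondent mean, ignoring nonresponse (biased when p depends on y)."""
    return sum(y[i] for i in s1) / len(s1) if s1 else 0.0

# Synthetic population: response probability increases with the study variable.
y = [float(i) for i in range(200)]
p = [0.3 + 0.6 * i / 199 for i in range(200)]
mse = simulate_mse(y, p, n=40, c=0.3,
                   estimators={"S": est_S, "naive": est_naive},
                   replications=500)
relative_accuracy = mse["naive"] / mse["S"]
```

Further estimators can be compared by adding entries to the `estimators` dictionary, as long as each one accepts the same four arguments.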

Experiment 1. In the first experiment only one auxiliary variable was used. Response probabilities were generated for an arbitrarily chosen parameter vector $\beta = [-4, 0.003]$. Consequently, the average response probability in the population was 0.89. Simulations were executed for the sample sizes n = 40, 80, ..., 200. For every value of n a total of 10 000 samples were drawn from the population. The relative accuracy (the ratio of the MSE of any estimator to the MSE of the standard estimator S) is shown in Graph 1. Each point on the graph therefore corresponds to 10 000 computed estimates.

[Graph 1: MSE(L)/MSE(S), MSE(R)/MSE(S) and MSE(W)/MSE(S) plotted against the initial sample size n = 40, 80, ..., 200; vertical axis from 0 to 1]

Graph 1. The dependence between the initial sample size n and relative accuracy of the estimators for a single auxiliary variable

As can be seen on the graph, for every value of n for which the simulations were executed, each of the estimators L, R and W had lower MSE than the standard two-phase estimator S. For small initial sample sizes the estimator R was the best (in terms of MSE). However, the MSE of this estimator grew rapidly, and for larger samples the estimators L and W were more accurate than R. The results suggest that for a large enough sample size the MSE of estimator R may exceed the MSE of the standard estimator S. The relative accuracy of the strategies L and W was rather stable and approximately equal to 80%. In most cases, the strategy W had lower MSE than the strategy L. It is worth noting that for small sample sizes the estimator R behaves well, despite the non-deterministic character of the response mechanism.

Experiment 2. In the second experiment three auxiliary variables $X_1$, $X_2$ and $X_3$ were used. Response probabilities were generated for an arbitrarily chosen parameter vector $\beta = [-6, 0.002, 0.146, 0.348]$, with the values $\beta_1, \ldots, \beta_3$ inversely proportional to the mean values of the corresponding auxiliary variables. Consequently, the average response probability in the population was 0.85. Simulations were executed for the sample sizes n = 40, 80, ..., 200. For every value of n a total of 10 000 samples were drawn from the population. The relative accuracy (the ratio of the MSE of any estimator to the MSE of the standard estimator S) is shown in Graph 2. Each point on the graph therefore corresponds to 10 000 computed estimates.

In this experiment, again each of the estimators L, R and W had lower MSE than the standard two-phase estimator S, for every value of n for which the simulations were executed. The lowest MSE was observed for the estimator R, and the highest for the estimator W.

[Graph 2: MSE(L)/MSE(S), MSE(R)/MSE(S) and MSE(W)/MSE(S) plotted against the initial sample size n; vertical axis from 0 to 1]

Graph 2. The dependence between the initial sample size n and relative accuracy of the estimators for three auxiliary variables

It is worth noting that the addition of two auxiliary variables did not significantly influence the relative accuracy of the estimator L, but the behaviour of the two other estimators changed. The MSE of the estimator R grows more slowly with the increase of the initial sample size, whereas the MSE of the estimator W is stable but greater than it was for only one auxiliary variable, probably because of the weaker correlation between the variables $X_2$ and $X_3$ and the variable under study.

VI. SUMMARY

In general, the estimators considered in this paper are biased. At the price of bias, lower values of MSE may be achieved. It should be stressed, however, that in order to improve the MSE suitable auxiliary information is needed. In the case of estimators L and R, auxiliary characteristics have to be observed for all population units to compute the estimates. For the estimator W, auxiliary characteristics need only be observed for the units included in the initial sample, so this estimator may be applied in situations where the estimators R and L are not applicable due to lack of data on auxiliary characteristics in the whole population. Moreover, the simulation results presented here are based on the assumption that the functional form of the response mechanism is known. In practice, such knowledge usually comes from sources external to the survey, and it may be inaccurate or simply false. In such a case a model misspecification error occurs, and as a result the MSE of the estimators may be greater. When the functional form of the response mechanism is unknown, some nonparametric methods may also be used to estimate the response probabilities.

REFERENCES

Bethlehem J.G. (1988), Reduction of non-response bias through regression estimation, Journal of Official Statistics, 4, 251-260.

Cassel C.M., Särndal C.E., Wretman J.H. (1983), Some uses of statistical models in connection with the nonresponse problem, [in:] Incomplete Data in Sample Surveys, eds. W.G. Madow, I. Olkin, Academic Press, New York.

Chow G.C. (1995), Ekonometria, Wyd. Nauk. PWN, Warszawa.

Ekholm A., Laaksonen S. (1991), Weighting via response modelling in the Finnish household budget survey, Journal of Official Statistics, 7, 3, 325-338.

Gao S., Hui S.L., Hall K.S., Hendrie H.C. (2000), Estimating disease prevalence from two-phase surveys with non-response at the second phase, Statistics in Medicine, No 19, 2101-2114.

Hansen M.H., Hurwitz W.N. (1946), The problem of nonresponse in sample surveys, Journal of the American Statistical Association, 41, 517-529.

Lessler J.T., Kalsbeek W.D. (1992), Nonsampling Error in Surveys, John Wiley & Sons, New York.

Minka T.P. (2001), Algorithms for maximum likelihood logistic regression, technical report, Carnegie Mellon University, http://www.stat.cmu.edu/tr/tr758/tr758.pdf

Rosen J.B. (1969), The gradient projection method for nonlinear programming, Journal of Society for Industrial and Applied Mathematics, 8, 1, 181-217.

Särndal C.E., Swensson B., Wretman J.H. (1992), Model Assisted Survey Sampling, Springer Verlag, New York.

Theil H. (1979), Zasady ekonometrii, PWN, Warszawa.

Wywiał J. (2001a), Estimation of population mean on the basis of non-simple sample when non-response error is present, Statistics in Transition, 5, 3, 443-450.

Wywiał J. (2001b), On estimation of population mean in the case when nonrespondents are present, Taksonomia - Klasyfikacja danych, teoria i zastosowania, No 8, Prace Naukowe AE Wrocław.

Wojciech Gamrot

ON THE APPLICATION OF LOGISTIC REGRESSION TO MEAN VALUE ESTIMATION USING A TWO-PHASE SAMPLING SCHEME IN THE CASE OF NONRESPONSE

Summary

The occurrence of incomplete observations of the characteristic under study in a statistical survey usually leads to bias in the resulting estimate of the population parameter. One of the techniques applied to counteract this phenomenon relies on a two-phase sampling scheme. A linear combination of the mean value estimates obtained in the first and second stage (phase) of the survey is usually used as the estimator of the population mean. This paper attempts to examine the properties of alternative estimation strategies that use the two-phase sampling scheme and incorporate into the construction of the estimator the estimates of the probabilities of obtaining a response from individual population units, obtained with a logistic regression model. The properties of these strategies were compared by computer simulation, using data collected in selected municipalities of the Dąbrowa Tarnowska district during the 1996 agricultural census.