A Two-phase Sampling Strategy for Estimating Multiple Mean Values in the Presence of Nonresponse



ACTA UNIVERSITATIS LODZIENSIS
FOLIA OECONOMICA 175, 2004

Wojciech Gamrot*

Abstract. The phenomenon of nonresponse in sample surveys usually results in biased estimates of population characteristics. One of the means of dealing with nonresponse is the subsampling technique. It relies on re-contacting a subset of nonrespondents using more expensive and more efficient tools (e.g. direct interview) than those used in the first attempt to collect data. This makes it possible to increase the response rate and to obtain unbiased estimates of population characteristics. In this paper, the problem of establishing the sample and subsample sizes that minimize the expected cost of the survey, while achieving the desired precision of multiple mean value estimates, is considered. An algorithm is proposed that establishes the optimum initial sample and subsample sizes for the two-phase sampling strategy.

Key words: nonresponse, callbacks, subsampling, two-phase sampling.

1. INTRODUCTION - TWO-PHASE SAMPLING

One of the measures taken to deal with nonresponse in surveys is the callback technique. It relies on re-contacting nonrespondents and makes it possible to increase the response rate and, under some conditions, to obtain unbiased estimates of population characteristics. To reduce the survey cost, usually only a small fraction of nonrespondents from the initial sample is questioned in the next phases of the survey. The subset of re-contacted nonrespondents is called a subsample. Obviously, this technique is a special case of multiphase sampling. In this paper a two-phase sampling strategy for estimating multiple mean values of population characteristics is considered. Nonresponse is assumed to be deterministic.

Let us assume that a population $\Omega$ of size $N$ is divided into two nonoverlapping strata $\Omega_1$ and $\Omega_2$ whose sizes are equal to $N_1$ and $N_2$ respectively. All the units belonging to stratum $\Omega_1$ would respond to all the questions, if contacted. All the units from stratum $\Omega_2$ would not respond to any question¹. Let us also denote $W_1 = N_1/N$, $W_2 = N_2/N$. The aim of the survey is to estimate the mean values of $k$ population characteristics. In the first phase of the survey a simple random sample $s$ of size $n$ is drawn without replacement from the population $\Omega$, according to the sampling design:

$$P_1(s) = \binom{N}{n}^{-1} \qquad (1)$$

The sample $s$ is partitioned into two disjoint sets $s_1 \subset \Omega_1$ and $s_2 \subset \Omega_2$ of sizes $0 \le n_{s_1} \le n$ and $0 \le n_{s_2} \le n$ respectively, with $s_1 \cup s_2 = s$, $s_1 \cap s_2 = \emptyset$ and $n_{s_1} + n_{s_2} = n$. The sizes $n_{s_1}$ and $n_{s_2}$ are random variables having the following hypergeometric distribution:

$$P(n_{s_1} = m,\; n_{s_2} = n - m) = \frac{\binom{N_1}{m}\binom{N_2}{n-m}}{\binom{N}{n}} \qquad (2)$$

where

$$\max\{0, n - N_2\} \le n_{s_1} \le \min\{n, N_1\}, \qquad \max\{0, n - N_1\} \le n_{s_2} \le \min\{n, N_2\}.$$

All the sampling units from the set $s_1$ would respond to questions concerning all the characteristics, and all the sampling units from the set $s_2$ would not respond to any question. In the second phase of the survey a subsample $u$ of size $n_u$ is drawn from among the $n_{s_2}$ units of the set $s_2$ with the probability of selection equal to:

$$P_2(u \mid s_2) = \binom{n_{s_2}}{n_u}^{-1} \qquad (3)$$
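As a quick numerical illustration of the first-phase design, the sketch below evaluates the hypergeometric probabilities of $n_{s_1}$ for a small hypothetical population and checks that they sum to one and that $E(n_{s_1}) = nW_1$. The sizes 60, 40 and 10 and the function name `pmf_ns1` are illustrative assumptions, not values or names taken from the paper.

```python
from math import comb

def pmf_ns1(m, N1, N2, n):
    """P(n_s1 = m) when n units are drawn without replacement from N1 + N2 units."""
    if m < max(0, n - N2) or m > min(n, N1):
        return 0.0  # outside the support of the hypergeometric distribution
    return comb(N1, m) * comb(N2, n - m) / comb(N1 + N2, n)

# Hypothetical sizes (assumption): 60 respondents, 40 nonrespondents, sample of 10.
N1, N2, n = 60, 40, 10
support = range(max(0, n - N2), min(n, N1) + 1)
probs = [pmf_ns1(m, N1, N2, n) for m in support]

total = sum(probs)                                      # should be 1
mean_ns1 = sum(m * p for m, p in zip(support, probs))   # should be n * W1
print(total, mean_ns1, n * N1 / (N1 + N2))
```

The identity $E(n_{s_1}) = nW_1$ checked here is the one used later when the expected cost is approximated.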

It is assumed that all units in the subsample $u$ respond in the second contact attempt, so the data are collected for $n_{s_1} + n_u$ units and, according to W. G. Cochran (1977), the following statistic is an unbiased estimator of the mean value of the i-th population characteristic:

$$\bar{x}_i = \frac{n_{s_1}}{n}\,\bar{x}_{s_1 i} + \frac{n_{s_2}}{n}\,\bar{x}_{u i} \qquad (4)$$

where:

$\bar{x}_{s_1 i}$ - the mean of the characteristic under study in the set $s_1$,

$\bar{x}_{u i}$ - the mean of the characteristic under study in the subsample $u$.

For fixed $n$ and $n_{s_2}$ the variance of the estimator described above is given by the expression²:

$$V(\bar{x}_i) = \frac{N - n}{N n}\,S_i^2 + \frac{n_{s_2}(n_{s_2} - n_u)}{n^2 n_u}\,S_{2i}^2 \qquad (5)$$

where:

$S_i^2$ - the variance of the i-th characteristic in the population $\Omega$,

$S_{2i}^2$ - the variance of the i-th characteristic in the stratum $\Omega_2$.
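The estimator and its conditional variance can be sketched in a few lines. The code below assumes the usual weighted form of the two-phase estimator, $(n_{s_1}\bar{x}_{s_1} + n_{s_2}\bar{x}_u)/n$, and a conditional variance of the form $\frac{N-n}{Nn}S_i^2 + \frac{n_{s_2}(n_{s_2}-n_u)}{n^2 n_u}S_{2i}^2$, which is the form consistent with expression (8) later in the paper. The function names and the numbers in the usage lines are hypothetical.

```python
def two_phase_mean(x_s1, x_u, n_s2):
    """Weighted mean of first-phase respondents (x_s1) and the subsample (x_u).

    n_s2 is the number of first-phase nonrespondents, of whom len(x_u) were
    re-contacted.
    """
    n_s1 = len(x_s1)
    n = n_s1 + n_s2
    mean_s1 = sum(x_s1) / n_s1
    mean_u = sum(x_u) / len(x_u)
    return (n_s1 * mean_s1 + n_s2 * mean_u) / n

def conditional_variance(N, n, n_s2, n_u, S2, S2_2):
    """Variance of the estimator for fixed n and n_s2 (assumed form, see lead-in)."""
    srs_part = (N - n) / (N * n) * S2                       # ordinary SRS term
    subsampling_part = n_s2 * (n_s2 - n_u) / (n**2 * n_u) * S2_2
    return srs_part + subsampling_part

# Hypothetical data: 3 respondents, 2 nonrespondents, 1 of them re-contacted.
estimate = two_phase_mean([1.0, 2.0, 3.0], [10.0], n_s2=2)
# With a full callback (n_u == n_s2) the subsampling term vanishes.
v_full = conditional_variance(1000, 100, 40, 40, 25.0, 30.0)
v_half = conditional_variance(1000, 100, 40, 20, 25.0, 30.0)
print(estimate, v_full, v_half)
```

Re-contacting every nonrespondent reduces the variance to that of a simple random sample, while any smaller subsample adds the second term.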

Assume that the cost of observing all the population characteristics is given by the expression:

$$K = C_0 n + C_1 n_{s_1} + C_2 n_u \qquad (6)$$

where:

$C_0$ - per-unit cost of making the first contact attempt,

$C_1$ - per-unit cost of processing data obtained during the first contact attempt,

$C_2$ - per-unit cost of getting and processing data from the second stratum.

The quantities $n_{s_1}$, $n_{s_2}$ and $n_u$ are random variables, so we will consider the expected value of the cost, given by the expression³:

$$E(K) = C_0 n + C_1 E(n_{s_1}) + C_2 E(n_u) \qquad (7)$$

Let us assume that the desired precision $V_{0i}$ should be achieved for the i-th population characteristic⁴. For fixed $n$ and $n_{s_2}$, the minimum subsample size needed to obtain a variance $V(\bar{x}_i)$ not exceeding $V_{0i}$ is given by:

$$n_{ui} = \frac{N S_{2i}^2 n_{s_2}^2}{n^2 (N V_{0i} + S_i^2) - N n S_i^2 + N S_{2i}^2 n_{s_2}} = \frac{n_{s_2}^2}{n (n a_i - b_i) + n_{s_2}} = \frac{n_{s_2}^2}{n \gamma_i(n) + n_{s_2}} \qquad (8)$$

where:

$$a_i = \frac{N V_{0i} + S_i^2}{N S_{2i}^2} \qquad (9)$$

$$b_i = \frac{S_i^2}{S_{2i}^2} \qquad (10)$$

$$\gamma_i(n) = n a_i - b_i \qquad (11)$$

and the initial sample size satisfies the condition:

$$n \ge n_i^*, \quad \text{where} \quad n_i^* = \frac{N S_i^2}{N V_{0i} + S_i^2} \qquad (12)$$

¹ See e.g. W. G. Cochran (1977) or C. E. Särndal, B. Swensson, J. Wretman (1992) for a proof.

³ To simplify the notation, the symbol $E(\cdot)$ is used to denote the expectation over both sampling designs $P_1$ and $P_2$, so it is equivalent to $E_1(E_2(\cdot))$.

⁴ At this moment we assume that the variances of all other characteristics are not limited. However, all the characteristics are observed and the cost is still given by expression (6).
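Expressions (8)-(12) translate directly into code. The sketch below computes $a_i$, $b_i$ and $n_i^*$ for the first characteristic of the example data ($N = 100\,000$, $S_i^2 = 1400$, $S_{2i}^2 = 1800$, $V_{0i} = 1$) and then the minimum subsample size for a hypothetical first-phase outcome ($n = 2000$, $n_{s_2} = 800$, both assumptions made only for illustration). As a sanity check, the resulting $n_{ui}$ is substituted back into the assumed conditional variance $\frac{N-n}{Nn}S_i^2 + \frac{n_{s_2}(n_{s_2}-n_u)}{n^2 n_u}S_{2i}^2$, which should return $V_{0i}$. Function names are illustrative.

```python
def coefficients(N, S2, S2_2, V0):
    """Return (a_i, b_i, n_i_star) from expressions (9), (10) and (12)."""
    a = (N * V0 + S2) / (N * S2_2)
    b = S2 / S2_2
    n_star = N * S2 / (N * V0 + S2)
    return a, b, n_star

def min_subsample_size(n, n_s2, a, b):
    """Expression (8): smallest n_u meeting the variance limit (real-valued)."""
    gamma = n * a - b            # expression (11); nonnegative when n >= n_star
    return n_s2**2 / (n * gamma + n_s2)

N, S2, S2_2, V0 = 100_000, 1400.0, 1800.0, 1.0   # characteristic 1 of the example
a, b, n_star = coefficients(N, S2, S2_2, V0)

n, n_s2 = 2000, 800                              # hypothetical first-phase outcome
n_u = min_subsample_size(n, n_s2, a, b)

# Substitute n_u back into the assumed variance formula; it should equal V0.
achieved = (N - n) / (N * n) * S2 + n_s2 * (n_s2 - n_u) / (n**2 * n_u) * S2_2
print(n_star, n_u, achieved)
```

In practice $n_u$ would be rounded up to an integer; the sketch keeps it real-valued, as the derivation does.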

For fixed $n$ and $n_{s_2}$, the cost given by (6) grows with the subsample size $n_u$. Thus, given the variance limit $V_{0i}$, it is minimized when the subsample size is established by using expression (8). Assuming that the subsample size is given by (8), and that $E(n_{s_1}) = nW_1$, the expected cost of achieving the desired precision for the i-th characteristic is approximately⁵ equal to:

$$E(K) = f_{(i)}(n) = n C_0 + n C_1 W_1 + \frac{C_2\, n W_2^2}{\gamma_i(n) + W_2} \qquad (13)$$

and it reaches its minimum for the initial sample size equal to:

$$n_{(i)} = n_i^* \left( W_2 \frac{S_{2i}^2}{S_i^2} (\omega_i - 1) + 1 \right) \qquad (14)$$

where $n_i^*$ is given by (12) and

$$\omega_i = \sqrt{\frac{C_2 (S_i^2 - W_2 S_{2i}^2)}{S_{2i}^2 (C_0 + C_1 W_1)}}$$


⁵ Under the assumption that $W_2$ is a sufficiently accurate estimate of the nonrespondent fraction ($n_{s_2}/n$) in the sample.
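The approximate expected cost (13) and its minimizer can be checked numerically. The sketch below uses the closed form obtained by setting the derivative of (13) to zero, $n = n_i^*\,(W_2 \frac{S_{2i}^2}{S_i^2}(\omega_i - 1) + 1)$ with $\omega_i = \sqrt{C_2(S_i^2 - W_2 S_{2i}^2) / (S_{2i}^2 (C_0 + C_1 W_1))}$, and compares it with neighbouring sample sizes. The data are those of characteristic 1 in the example. The nonrespondent fraction $W_2 = 0.4$ is an assumption: it is not stated in the text, but it is consistent with the figures reported in Table 2. Function names are illustrative.

```python
from math import sqrt

def expected_cost(n, a, b, W2, C0, C1, C2):
    """Expression (13): approximate expected cost for one characteristic."""
    gamma = n * a - b
    return n * (C0 + C1 * (1.0 - W2)) + C2 * n * W2**2 / (gamma + W2)

def optimum_n(N, S2, S2_2, V0, W2, C0, C1, C2):
    """Closed-form minimizer of (13), ignoring any bounds on n."""
    n_star = N * S2 / (N * V0 + S2)
    omega = sqrt(C2 * (S2 - W2 * S2_2) / (S2_2 * (C0 + C1 * (1.0 - W2))))
    return n_star * (W2 * S2_2 / S2 * (omega - 1.0) + 1.0)

N, S2, S2_2, V0 = 100_000, 1400.0, 1800.0, 1.0   # characteristic 1
C0, C1, C2 = 0.1, 0.4, 4.0
W2 = 0.4                                          # assumed, see lead-in

a = (N * V0 + S2) / (N * S2_2)
b = S2 / S2_2
n_opt = optimum_n(N, S2, S2_2, V0, W2, C0, C1, C2)
cost_opt = expected_cost(n_opt, a, b, W2, C0, C1, C2)

# The cost one unit to either side should not be lower than at n_opt.
cost_left = expected_cost(n_opt - 1.0, a, b, W2, C0, C1, C2)
cost_right = expected_cost(n_opt + 1.0, a, b, W2, C0, C1, C2)
print(n_opt, cost_opt, cost_left, cost_right)
```

Under the stated assumption the optimum and its cost come out close to the values 2167.548 and 2382.019 reported for the third iteration in Table 2.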



For fixed $n$ and $n_{s_2}$, the minimum subsample size needed to achieve the desired precision levels for all $k$ characteristics is equal to:

$$n_u = \max_{i=1,\dots,k} \{ n_{ui} \} \qquad (17)$$

so it is equal to the quantity $n_{uj}$, where $j$ is the number of the characteristic for which the expression $\gamma_j(n)$ takes the minimum value. Since the result of comparing the expressions $\gamma_i(n)$ does not depend on $n_{s_2}$, we may write

$$E(K) = f_{(j)}(n), \quad \text{where} \quad \gamma_j(n) = \min_{i=1,\dots,k} \{ \gamma_i(n) \} \qquad (18)$$
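The equivalence between the largest required subsample and the smallest $\gamma_i(n)$ follows because $n_{ui} = n_{s_2}^2 / (n\gamma_i(n) + n_{s_2})$ decreases as $\gamma_i(n)$ grows. The sketch below illustrates this for characteristics 1, 5 and 6 of the example data, with a hypothetical first-phase outcome ($n = 1600$, $n_{s_2} = 640$, assumed only for illustration). Function names are illustrative.

```python
def gamma(n, a, b):
    """Expression (11)."""
    return n * a - b

def subsample_size(n, n_s2, a, b):
    """Expression (8) written in terms of gamma."""
    return n_s2**2 / (n * gamma(n, a, b) + n_s2)

N = 100_000
# (S_i^2, S_2i^2, V_0i) for characteristics 1, 5 and 6 of the example.
data = [(1400.0, 1800.0, 1.0), (1450.0, 1550.0, 1.0), (3000.0, 1800.0, 2.0)]
coeffs = [((N * v + s) / (N * s2), s / s2) for s, s2, v in data]

n, n_s2 = 1600, 640                       # hypothetical first-phase outcome
gammas = [gamma(n, a, b) for a, b in coeffs]
sizes = [subsample_size(n, n_s2, a, b) for a, b in coeffs]

j = min(range(len(coeffs)), key=lambda i: gammas[i])   # binding characteristic
n_u = max(sizes)                                        # expression (17)
print(j, n_u, sizes)
```

The index `j` does not change if `n_s2` is changed, which is what allows the expected cost to be written as a function of $n$ alone.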

The expected cost of estimating all $k$ characteristics is equal to the expected cost of estimating the one characteristic for which $\gamma_i(n)$ is minimal. Let us formulate the problem of establishing the initial sample size $n$ that minimizes the expected cost without violating the precision limits $V_{0i}$ for all $k$ characteristics. This can be expressed as:

$$\begin{cases} E(K) \to \min \\ V(\bar{x}_i) \le V_{0i} \quad \text{for } i = 1, \dots, k \\ 2 \le n \le N \end{cases} \qquad (19)$$

To solve the problem stated above, we determine intervals of the initial sample size⁶ $n$, inside which $\gamma_i(n)$ yields the minimum value for the same characteristic. Since $\gamma_i(n)$ is a linear function of $n$, the bounds of the intervals can easily be established by finding the values of $n$ for which some of the expressions $\gamma_i(n)$ are equal to each other. In each interval corresponding to some j-th characteristic we find the value $n_x$ minimizing the cost⁷, under the assumption that only the variance of this single characteristic is limited by $V_{0j}$. By comparing the values of the minimum expected cost evaluated in each interval we choose the optimum initial sample size and the corresponding value of the expected cost $E(K)$.

⁶ At this point we assume that the initial sample size $n$ is a real number, not an integer.

⁷ The optimum value is obtained by evaluating expression (14). If the result of the evaluation falls outside the appropriate interval, it is assumed to be equal to the corresponding bound of this interval, because $f_{(j)}(n)$ is a convex function of $n$.


The optimum initial sample size may be evaluated according to the algorithm presented below. The variables $m_0$ and $m_1$ are used to denote the lower and upper bound of the current interval.

1. Assign the value $n^*_{\max} = \max_i \{ n_i^* \}$ to the variable $m_0$.

2. Denote by the index $j$ the characteristic for which the expression $\gamma_i(m_0)$ yields the minimum value. If there exists more than one characteristic having this property, the one with the minimal $a_i$ value should be chosen. The characteristics for which $\gamma_i(m_0) \ge \gamma_j(m_0)$ and $a_i \ge a_j$ may be eliminated from further consideration.

3. If all the characteristics except the j-th one were eliminated, assume $m_1 = N$ and go to step 5.

4. For each characteristic not yet eliminated, evaluate the expression:

$$n_i = \frac{b_i - b_j}{a_i - a_j} \qquad (20)$$

Assume $m_1 = \min_i \{ n_i \}$. If $m_1 > N$, assume $m_1 = N$.

5. Evaluate the initial sample size $n_x$ minimizing the expected cost, under the assumption that only the variance of the j-th characteristic is limited:

$$n_x = n_{(j)} = n_j^* \left( W_2 \frac{S_{2j}^2}{S_j^2} (\omega_j - 1) + 1 \right) \qquad (21)$$

6. If $n_x < m_0$, assume $n_x = m_0$.

7. If $n_x > m_1$, assume $n_x = m_1$.

8. Evaluate the minimum expected cost corresponding to $n_x$, according to the expression:

$$E(K \mid n_x) = n_x (C_0 + C_1 W_1) + \frac{N n_x W_2^2 S_{2j}^2 C_2}{n_x (N V_{0j} + S_j^2) - N S_j^2 + N W_2 S_{2j}^2} \qquad (22)$$

9. In the first iteration, store the values $n_x$ and $E(K \mid n_x)$ obtained in steps 7-8. In successive iterations, compare the cost evaluated in step 8 with the previously stored value of the cost and, if it is lower, store the values $n_x$ and $E(K \mid n_x)$ obtained in steps 7-8.

10. Eliminate the j-th characteristic from further consideration. If all the characteristics were eliminated, terminate execution: the stored values of $n_x$ and $E(K \mid n_x)$ constitute the solution. Otherwise assign the value of $m_1$ to $m_0$ and go to step 2.
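The ten steps above can be sketched as a single Python function. This is one reading of the procedure under stated assumptions rather than a definitive implementation: the elimination test in step 2 is taken as non-strict ($\ge$), step 9 keeps the lower of the two costs, the unconstrained optimum in step 5 uses $\omega_j = \sqrt{C_2(S_j^2 - W_2 S_{2j}^2) / (S_{2j}^2 (C_0 + C_1 W_1))}$, and $W_2 = 0.4$ is assumed because the text does not state it (that value is consistent with the figures in Table 2). Function and variable names are illustrative.

```python
from math import sqrt

def optimum_initial_sample_size(N, W2, C0, C1, C2, S2, S22, V0):
    """Return ((n_x, cost), log) with one log row (m0, m1, n_x, cost) per iteration."""
    k = len(S2)
    unit = C0 + C1 * (1.0 - W2)                  # C0 + C1 * W1
    a = [(N * V0[i] + S2[i]) / (N * S22[i]) for i in range(k)]
    b = [S2[i] / S22[i] for i in range(k)]
    n_star = [N * S2[i] / (N * V0[i] + S2[i]) for i in range(k)]

    def cost(n, j):                              # expression (22), written via gamma
        return n * unit + C2 * n * W2**2 / (n * a[j] - b[j] + W2)

    def free_optimum(j):                         # expression (21)
        omega = sqrt(C2 * (S2[j] - W2 * S22[j]) / (S22[j] * unit))
        return n_star[j] * (W2 * S22[j] / S2[j] * (omega - 1.0) + 1.0)

    active = set(range(k))
    m0 = max(n_star)                             # step 1
    best, log = None, []
    while active:
        # Step 2: binding characteristic at m0 (ties broken by smaller a_i).
        j = min(active, key=lambda i: (m0 * a[i] - b[i], a[i]))
        g_j = m0 * a[j] - b[j]
        active = {i for i in active
                  if i == j or not (m0 * a[i] - b[i] >= g_j and a[i] >= a[j])}
        # Steps 3-4: upper bound of the interval, capped at N.
        crossings = [(b[i] - b[j]) / (a[i] - a[j]) for i in active if i != j]
        m1 = min(crossings + [float(N)])
        # Steps 5-7: optimum for characteristic j, clipped to [m0, m1].
        n_x = min(max(free_optimum(j), m0), m1)
        c = cost(n_x, j)                         # step 8
        log.append((m0, m1, n_x, c))
        if best is None or c < best[1]:          # step 9 (keep the lower cost)
            best = (n_x, c)
        active.discard(j)                        # step 10
        m0 = m1
    return best, log

# Example data as printed in Table 1; W2 = 0.4 is an assumption (see lead-in).
S2 = [1400, 500, 950, 400, 1450, 3000, 1000, 325, 1500]
S22 = [1800, 1160, 1700, 700, 1550, 1800, 1340, 250, 500]
V0 = [1, 0.5, 0.8, 2, 1, 2, 1.2, 0.3, 4]
best, log = optimum_initial_sample_size(100_000, 0.4, 0.1, 0.4, 4.0, S2, S22, V0)
for row in log:
    print(["%.3f" % value for value in row])
print(best)
```

Under these assumptions the run takes five iterations and selects an initial sample size close to 2167.548 with an expected cost close to 2382.019, in line with Table 2.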

The number of iterations to execute is not greater than the number $k$ of characteristics under study. In every iteration the expression (20) is evaluated in step 4 for each of the characteristics not yet eliminated, so if we assume that evaluation of this expression is the dominating operation, the computational complexity of the algorithm is of the order $O(k^2)$.


In a population of size $N = 100\,000$, values of 9 characteristics are observed. Per-unit costs are $C_0 = 0.1$, $C_1 = 0.4$, $C_2 = 4$ respectively. Table 1 shows the variances $S_i^2$, $S_{2i}^2$, the desired precisions $V_{0i}$, and the minimum initial sample sizes $n_i^*$ corresponding to each characteristic.

Table 1. Example data

Characteristic    S_i^2    S_2i^2    V_0i    n_i^*
1                 1400     1800      1       1380.671
2                 500      1160      0.5     990.099
3                 950      1700      0.8     1173.564
4                 400      700       2       199.6008
5                 1450     1550      1       1429.276
6                 3000     1800      2       1477.833
7                 1000     1340      1.2     826.4463
8                 325      250       0.3     1283.317
9                 1500     500       4       373.599

Table 2 shows the interval bounds $m_0$ and $m_1$ evaluated in successive iterations, the optimum initial sample sizes for each iteration and the corresponding expected costs.


Table 2. Computation results

Iteration    m_0         m_1         n_x         E(K)
1            1477.833    1544.986    1544.986    2603.747
2            1544.986    1729.56     1729.56     2443.612
3            1729.56     2512.186    2167.548    2382.019
4            2512.186    3129.657    2512.186    2403.948
5            3129.657    100000      3129.657    2568.449


The lowest value of the cost, $E(K) = 2382.019$, was achieved in the third interval, for $n_x = 2167.548$. As $n_x$ is not an integer, the result was rounded to 2168. The expected cost for the rounded size is equal to 2382.020, so it does not differ significantly from the cost obtained for the unrounded initial sample size.
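The effect of rounding can be checked with a few lines of arithmetic. The sketch below evaluates the approximate expected cost $n(C_0 + C_1 W_1) + C_2 n W_2^2 / (n a_j - b_j + W_2)$ for characteristic 1, which is the binding one in the third interval, at the unrounded and rounded sample sizes. As before, $W_2 = 0.4$ is an assumption not stated in the text, and the function name is illustrative.

```python
def cost_char1(n, W2=0.4, C0=0.1, C1=0.4, C2=4.0, N=100_000,
               S2=1400.0, S22=1800.0, V0=1.0):
    """Approximate expected cost when characteristic 1 is binding (W2 assumed)."""
    a = (N * V0 + S2) / (N * S22)
    b = S2 / S22
    return n * (C0 + C1 * (1.0 - W2)) + C2 * n * W2**2 / (n * a - b + W2)

cost_unrounded = cost_char1(2167.548)
cost_rounded = cost_char1(2168)
difference = cost_rounded - cost_unrounded
print(cost_unrounded, cost_rounded, difference)
```

Because the cost function is flat near its minimum, moving less than half a unit away from the optimum changes the expected cost by far less than one cost unit.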


Wojciech Gamrot


In statistical surveys, data often cannot be obtained from some of the surveyed units. This leads to biased estimates of the population parameter under study. One of the techniques used to counteract this phenomenon is repeating the survey (callback) in the group of population units from which no data were obtained. A frequently encountered solution, which makes it possible to limit the survey cost, is to draw only a certain subset of these units for a renewed contact attempt. This paper considers a strategy based on a single repetition of the survey, for the simultaneous estimation of the mean values of many characteristics in the population. An iterative algorithm is proposed that makes it possible to establish the optimum sample and subsample sizes and that has polynomial computational complexity.


