On Some Robust Against Outliers Predictor of the Total Value in Small Domain


ACTA UNIVERSITATIS LODZIENSIS

FOLIA OECONOMICA 175, 2004

Janusz Wywiał*, Tomasz Żądło**

ON SOME ROBUST AGAINST OUTLIERS PREDICTOR OF THE TOTAL VALUE IN SMALL DOMAIN

Abstract. The problem of predicting the total value in a domain under a simple regression superpopulation model (with one auxiliary variable and no intercept) is considered, together with the problem of estimating the regression function's parameter robustly against outliers. The presented robust estimator is the median of the gradients of all straight lines, each determined by the origin and one of the n points (x, y), where n is the sample size, y is the variable of interest and x is the auxiliary variable. This estimator is a simplified form of the estimator presented by H. Theil (1979). The equation of the mean square error of the robust predictor based on this estimator is derived under asymptotic assumptions. The best linear unbiased (BLU) predictor for the considered superpopulation model is presented and the equation of its mean square error is derived. The accuracy of the two predictors is compared under the assumption that the variables of interest are normally distributed.

Key words: small area statistics, model approach, robust estimation.

1. INTRODUCTION

In survey sampling, including small area statistics, two approaches are considered: the random approach and the model approach. There is also the problem of robust estimation, which is extremely important in practical sample surveys, especially those supported by the model approach. The reason is that under the model approach the statistician must assume some superpopulation model and estimate its parameters. In this paper, a robust predictor of the total value in a domain is proposed and compared with the BLU predictor for the superpopulation model presented below. Robustness is considered in the context of the presence of outliers.

* Prof., Department of Statistics, University of Economics, Katowice.
** MA, Department of Statistics, University of Economics, Katowice.


2. SUPERPOPULATION MODEL

The following considerations are based on a simple regression model assumed for the entire population. It is assumed that the values of the auxiliary variable are known for all elements of the population. With regard to the distribution $\xi$ describing the superpopulation model it is assumed that $Y_1, \ldots, Y_N$ are independent and

$$Y_i = \mu_i + \varepsilon_i, \quad \mu_i = E_\xi(Y_i) = \beta x_i, \quad E_\xi(\varepsilon_i) = 0, \quad \sigma_i^2 = D^2_\xi(Y_i) = D^2_\xi(\varepsilon_i) = \sigma^2 v(x_i),$$

where $\beta$ and $\sigma^2$ are unknown and $x_1, \ldots, x_N$ are known, for every $i$ ($i = 1, \ldots, N$). In the following considerations it will additionally be assumed that $v(x_i) = x_i^2$.
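The model above can be illustrated with a short simulation. This is only a sketch: the model fixes the first two moments of $Y_i$, so the normal errors used here are an extra assumption, and the function name `simulate_population` and its parameters are hypothetical.

```python
import random


def simulate_population(x, beta, sigma, seed=0):
    """Draw Y_i = beta * x_i + eps_i with E(eps_i) = 0 and D^2(eps_i) = sigma^2 * x_i^2.

    Assumption (not part of the model itself): eps_i is normally distributed.
    """
    rng = random.Random(seed)
    # The standard deviation of eps_i is sigma * |x_i|, i.e. v(x_i) = x_i^2.
    return [beta * xi + rng.gauss(0.0, sigma * abs(xi)) for xi in x]


if __name__ == "__main__":
    x = [1.0, 2.0, 4.0, 8.0]
    y = simulate_population(x, beta=3.0, sigma=0.5)
    # Under the model the ratios y_i / x_i all have mean beta and variance sigma^2.
    print([yi / xi for xi, yi in zip(x, y)])
```

Because the error standard deviation grows proportionally to $x_i$, the ratios $y_i / x_i$ share a common mean and variance, which is the property the construction of the predictors relies on.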

The considerations are conducted for any sample design. It is assumed that the sample $S$ is drawn from the entire population by sample design $P(S)$ with first order inclusion probabilities $\pi_i$, where $i = 1, \ldots, N$. For any sample $S$ of size $n$ drawn from the population $\Omega$ of size $N$, $\Omega = S \cup \bar{S}$, where $\bar{S}$ denotes the elements of the population which were not drawn to the sample. Let $S_d = S \cap \Omega_d$, where the d-th domain is denoted by $\Omega_d$. The size of $S_d$ equals $n_d$ (a random variable) and the size of $\Omega_d$ equals $N_d$. The set of elements of the population which belong to the d-th domain $\Omega_d$ can be written as $\Omega_d = S_d \cup \bar{S}_d$, where $\bar{S}_d$ denotes the elements of the d-th domain which were not drawn to the sample.

3. CONSTRUCTION OF PREDICTOR

In the following considerations the notation presented below will be used. The gradients of all straight lines, each determined by the origin and one of the n points (x, y), where n is the sample size, y is the value of the variable of interest and x is the value of the auxiliary variable, are considered:

$$h_i = \frac{y_i}{x_i}, \tag{1}$$

where $i = 1, \ldots, n$.

Based on the assumed superpopulation model it is known that:

$$E_\xi(h_i) = \beta. \tag{2}$$


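Let us discuss two predictors of the total value in the domain:

$$T_{1S_d} = n_d \bar{Y}_{S_d} + b_{1S} \sum_{i \in \bar{S}_d} x_i, \tag{3}$$

$$T_{2S_d} = n_d \bar{Y}_{S_d} + b_{2S} \sum_{i \in \bar{S}_d} x_i, \tag{4}$$

where $n_d \bar{Y}_{S_d} = \sum_{i \in S_d} Y_i$ and:

$$b_{1S} = \frac{1}{n} \sum_{i \in S} h_i,$$

$$b_{2S} = \mathrm{Me}\{h_i\}. \tag{5}$$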

The estimator $b_{2S}$ is the median of the sequence of random variables $\{h_1, \ldots, h_n\}$. It is a particular form of the estimator considered by H. Theil (1979). Let us note that the estimator $b_{2S}$ is robust against possible outliers. From the theorem presented by R. M. Royall (1976) it is known that the statistic $T_{1S_d}$ is the BLU predictor for the superpopulation model assumed in Section 2. This means that it is a $\xi$-unbiased predictor of the total value in the small area, $Y_d = \sum_{i \in \Omega_d} Y_i$, and that it minimises the $\xi$-variance for the assumed superpopulation model.
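The two predictors can be sketched in a few lines of code. This is an illustration only: the function name `predict_domain_total` and its arguments are hypothetical, the slope is estimated from the ratios of the whole sample, and the BLU slope is taken to be the arithmetic mean of the ratios $y_i / x_i$, which is what weighted least squares gives when $v(x_i) = x_i^2$.

```python
from statistics import mean, median


def predict_domain_total(y_sampled_domain, x_sample, y_sample, x_rest_domain, robust=True):
    """Predict the domain total as (observed domain sum) + slope * (sum of unobserved x).

    robust=True  -> slope is the median of the ratios y_i / x_i (robust predictor)
    robust=False -> slope is the mean of the ratios (BLU slope when v(x) = x^2)
    """
    ratios = [yi / xi for xi, yi in zip(x_sample, y_sample)]
    slope = median(ratios) if robust else mean(ratios)
    return sum(y_sampled_domain) + slope * sum(x_rest_domain)


if __name__ == "__main__":
    # Hypothetical sample with one gross outlier in the last observation.
    x_sample = [1.0, 2.0, 3.0, 4.0, 5.0]
    y_sample = [2.0, 4.0, 6.0, 8.0, 100.0]
    y_sampled_domain = [2.0, 4.0]   # sampled elements that belong to the domain
    x_rest_domain = [10.0, 20.0]    # auxiliary values of the non-sampled domain elements
    print(predict_domain_total(y_sampled_domain, x_sample, y_sample, x_rest_domain, robust=True))
    print(predict_domain_total(y_sampled_domain, x_sample, y_sample, x_rest_domain, robust=False))
```

In this toy data set the single outlier pulls the mean of the ratios far from 2, while their median stays at 2, so the two predicted totals differ substantially.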

Let us additionally assume that the random variables $\varepsilon_i$, $i = 1, \ldots, N$, have continuous distributions with different variances. Hence, from equation (2) it is known that $\{h_1, \ldots, h_n\}$ is a sequence of independent random variables with the same distribution, given by the density function $f(\cdot)$, and therefore with the same expected values and variances. Finally, from known results on the distributions of sample quantiles (e.g. Fisz 1967) it follows that $b_{2S}$ is a consistent estimator of $\beta$ and that for large sample size the statistic $b_{2S}$ is well approximated by a normal distribution with the following parameters:

$$E_\xi(b_{2S}) \approx \beta, \qquad D^2_\xi(b_{2S}) \approx \frac{1}{4 n f^2(\beta)}.$$
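The large-sample behaviour of the median can be checked by simulation. The sketch below assumes, purely for illustration, that the ratios $h_i$ are normally distributed with mean $\beta$ and variance $\sigma^2$, in which case the asymptotic variance of the sample median is $\pi \sigma^2 / (2n)$. The function names and parameter values are hypothetical.

```python
import math
import random
from statistics import median, pvariance


def asymptotic_median_variance(n, sigma):
    """Asymptotic variance of the sample median of n i.i.d. N(beta, sigma^2) values."""
    return math.pi * sigma ** 2 / (2 * n)


def simulated_median_variance(n, beta, sigma, reps=4000, seed=1):
    """Monte Carlo variance of the median of n normal ratios h_i (illustrative assumption)."""
    rng = random.Random(seed)
    medians = [
        median([rng.gauss(beta, sigma) for _ in range(n)])
        for _ in range(reps)
    ]
    return pvariance(medians)


if __name__ == "__main__":
    n, beta, sigma = 101, 2.0, 1.5
    print("simulated :", simulated_median_variance(n, beta, sigma))
    print("asymptotic:", asymptotic_median_variance(n, sigma))
```

For moderate sample sizes the two values should be close but not identical, since the formula is only an asymptotic approximation and the simulation carries Monte Carlo error.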

First, the mean square error of the statistic $T_{1S_d}$ (given by equation (3)) will be analysed assuming that the superpopulation model GR is true. From Royall's theorem (Royall 1976) it follows that


$$E_\xi E_p (T_{1S_d} - Y_d)^2 = \frac{\sigma^2}{n} E_p \left( \sum_{i \in \bar{S}_d} x_i \right)^2 + \sigma^2 E_p \left( \sum_{i \in \bar{S}_d} x_i^2 \right). \tag{6}$$

Second, the mean square error of the statistic $T_{2S_d}$ (given by equation (4)) will be analysed assuming that the superpopulation model GR is true. Let us notice that, because of the parameters of the asymptotic distribution of $b_{2S}$, for large sample size $E_\xi(b_{2S}) \approx \beta$. Hence it is easy to prove that the predictor $T_{2S_d}$ is an approximately $\xi$-unbiased predictor of the total value in the small domain:

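$$
\begin{aligned}
E_\xi (T_{2S_d} - Y_d) &= E_\xi \left( \sum_{i \in S_d} Y_i + b_{2S} \sum_{i \in \bar{S}_d} x_i - \sum_{i \in \Omega_d} Y_i \right) = E_\xi \left( \sum_{i \in S_d} Y_i + b_{2S} \sum_{i \in \bar{S}_d} x_i - \sum_{i \in S_d} Y_i - \sum_{i \in \bar{S}_d} Y_i \right) \\
&= E_\xi \left( b_{2S} \sum_{i \in \bar{S}_d} x_i - \sum_{i \in \bar{S}_d} Y_i \right) = E_\xi(b_{2S}) \sum_{i \in \bar{S}_d} x_i - E_\xi \left( \sum_{i \in \bar{S}_d} Y_i \right) \approx \beta \sum_{i \in \bar{S}_d} x_i - \beta \sum_{i \in \bar{S}_d} x_i = 0.
\end{aligned}
\tag{7}
$$

The mean square error of the predictor $T_{2S_d}$ for a noninformative sample design is as follows: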

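$$
\begin{aligned}
E_\xi E_p (T_{2S_d} - Y_d)^2 &= E_p E_\xi \left( \sum_{i \in S_d} Y_i + b_{2S} \sum_{i \in \bar{S}_d} x_i - \sum_{i \in \Omega_d} Y_i \right)^2 = E_p E_\xi \left( b_{2S} \sum_{i \in \bar{S}_d} x_i - \sum_{i \in \bar{S}_d} Y_i \right)^2 \\
&= E_p E_\xi \left( b_{2S} \sum_{i \in \bar{S}_d} x_i - \beta \sum_{i \in \bar{S}_d} x_i - \sum_{i \in \bar{S}_d} Y_i + \beta \sum_{i \in \bar{S}_d} x_i \right)^2 \\
&= E_p E_\xi \left[ \left( b_{2S} \sum_{i \in \bar{S}_d} x_i - \beta \sum_{i \in \bar{S}_d} x_i \right)^2 + \left( \sum_{i \in \bar{S}_d} Y_i - \beta \sum_{i \in \bar{S}_d} x_i \right)^2 - 2 \left( b_{2S} \sum_{i \in \bar{S}_d} x_i - \beta \sum_{i \in \bar{S}_d} x_i \right) \left( \sum_{i \in \bar{S}_d} Y_i - \beta \sum_{i \in \bar{S}_d} x_i \right) \right].
\end{aligned}
$$

Because $b_{2S}$ and $\sum_{i \in \bar{S}_d} Y_i$ are independent random variables, we obtain:

$$E_\xi \left[ \left( b_{2S} \sum_{i \in \bar{S}_d} x_i - \beta \sum_{i \in \bar{S}_d} x_i \right) \left( \sum_{i \in \bar{S}_d} Y_i - \beta \sum_{i \in \bar{S}_d} x_i \right) \right] = E_\xi \left( b_{2S} \sum_{i \in \bar{S}_d} x_i - \beta \sum_{i \in \bar{S}_d} x_i \right) E_\xi \left( \sum_{i \in \bar{S}_d} Y_i - \beta \sum_{i \in \bar{S}_d} x_i \right) = 0$$

and then:

$$
\begin{aligned}
E_\xi E_p (T_{2S_d} - Y_d)^2 &= E_p E_\xi \left[ \left( b_{2S} \sum_{i \in \bar{S}_d} x_i - \beta \sum_{i \in \bar{S}_d} x_i \right)^2 + \left( \sum_{i \in \bar{S}_d} Y_i - \beta \sum_{i \in \bar{S}_d} x_i \right)^2 \right] \\
&= E_p E_\xi \left[ \left( \sum_{i \in \bar{S}_d} x_i \right)^2 (b_{2S} - \beta)^2 + \left( \sum_{i \in \bar{S}_d} Y_i - \sum_{i \in \bar{S}_d} \beta x_i \right)^2 \right] \\
&= E_p \left[ \left( \sum_{i \in \bar{S}_d} x_i \right)^2 MSE_\xi(b_{2S}) + \sum_{i \in \bar{S}_d} D^2_\xi(Y_i) \right] \\
&= E_p \left( \sum_{i \in \bar{S}_d} x_i \right)^2 MSE_\xi(b_{2S}) + \sigma^2 E_p \left( \sum_{i \in \bar{S}_d} x_i^2 \right),
\end{aligned}
$$

where $MSE_\xi(b_{2S}) = D^2_\xi(b_{2S}) + (E_\xi(b_{2S}) - \beta)^2 \approx D^2_\xi(b_{2S})$.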

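In particular, if the random variables $Y_i$ have normal distributions, then $h_i$ is normal with expected value $\beta$ and variance $\sigma^2$, so that $f(\beta) = 1 / (\sigma \sqrt{2\pi})$ and

$$D^2_\xi(b_{2S}) \approx \frac{\pi \sigma^2}{2n}.$$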

Let us compare the mean square errors of both predictors assuming that the $Y_i$ have normal distributions. One should remember that outliers can occur in the sample because of disturbances in the distribution or errors connected with data editing. This situation can imply a significant bias of the estimates when a non-robust predictor is used. But there is also the question of the size of the difference between the mean square errors of the robust predictor and the BLU predictor. It equals:

$$E_\xi E_p (T_{2S_d} - Y_d)^2 - E_\xi E_p (T_{1S_d} - Y_d)^2 \approx \left( \frac{\pi}{2} - 1 \right) \frac{\sigma^2}{n} E_p \left( \sum_{i \in \bar{S}_d} x_i \right)^2. \tag{8}$$

This difference is positive. Therefore the BLU predictor is more accurate than the robust one. The difference is of order $O(n^{-1})$, which means that it decreases as the sample size increases. Hence in large samples the accuracy of both predictors is similar, but $T_{2S_d}$ is additionally robust. It should be noticed that the difference (8) is the smaller, the smaller is the $p$-expected value of the total value of the auxiliary variable for the non-sampled elements of the small domain. The proposed predictor should give good results for a sample design proportional to the total value of the auxiliary variable, implemented e.g. by the Lahiri sampling scheme.
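The comparison can be made concrete with a small numerical sketch. It evaluates both mean square errors conditionally on one realised sample (the expectation $E_p$ over the sample design is dropped, which is a simplifying assumption), uses the asymptotic variance $\pi \sigma^2 / (2n)$ of the median under normality, and uses hypothetical function names and input values.

```python
import math


def mse_blu(sigma2, n, x_rest_domain):
    """Conditional MSE of the BLU predictor: sigma^2/n * (sum x)^2 + sigma^2 * sum x^2."""
    total = sum(x_rest_domain)
    return sigma2 / n * total ** 2 + sigma2 * sum(v * v for v in x_rest_domain)


def mse_robust_normal(sigma2, n, x_rest_domain):
    """Approximate conditional MSE of the median-based predictor under normality."""
    total = sum(x_rest_domain)
    return math.pi * sigma2 / (2 * n) * total ** 2 + sigma2 * sum(v * v for v in x_rest_domain)


def mse_difference(sigma2, n, x_rest_domain):
    """Robust MSE minus BLU MSE; equals (pi/2 - 1) * sigma^2 / n * (sum x)^2."""
    return mse_robust_normal(sigma2, n, x_rest_domain) - mse_blu(sigma2, n, x_rest_domain)


if __name__ == "__main__":
    x_rest_domain = [1.0, 2.0, 3.0]  # hypothetical non-sampled auxiliary values
    for n in (10, 100, 1000):
        print(n, mse_difference(4.0, n, x_rest_domain))
```

The printed differences shrink in proportion to $1/n$, which illustrates why the loss of accuracy paid for robustness becomes negligible in large samples.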


5. CONCLUSION

Summing up, we would like to state that by analogy one can obtain similar results assuming different superpopulation models for domains or strata, but one cannot forget that the asymptotic conditions must be met in order to use the distribution parameters of the discussed regression coefficient. Although the applicability of the predictor may seem limited, its strong advantages must be underlined. It should be stressed that robust predictors different from the one presented here can be found in the book by R. Valliant et al. (2000). Their idea is based on excluding information on outliers for estimation purposes. This approach requires a subjective assessment of which sampled elements are outliers. The construction of the predictor proposed in this paper does not require such a subjective decision.

REFERENCES

Cassel C. M., Särndal C. E., Wretman J. H. (1977), Foundations of Inference in Survey Sampling, John Wiley & Sons, New York-London-Sydney-Toronto.

Fisz M. (1967), Rachunek prawdopodobieństwa i statystyka matematyczna [in Polish], PWN, Warszawa.

Royall R. M. (1976), The Linear Least Squares Prediction Approach to Two-Stage Sampling, "Journal of the American Statistical Association", 71, 657-664.

Theil H. (1979), Zasady ekonometrii [in Polish], PWN, Warszawa.

Valliant R., Dorfman A. H., Royall R. M. (2000), Finite Population Sampling and Inference. A Prediction Approach, John Wiley & Sons, New York-Chichester-Weinheim-Brisbane-Singapore-Toronto.

Janusz Wywiał, Tomasz Żądło

ON A CERTAIN OUTLIER-ROBUST PREDICTOR OF THE TOTAL VALUE IN A SMALL AREA

The problem of predicting the total value in a domain is considered under a simple regression superpopulation model (a regression model with one explanatory variable and no intercept). The problem of estimating the regression function's parameter robustly against outliers is taken up. The presented estimator of the regression parameter is the median of the slopes of all straight lines passing through the origin and one of the n points (x, y), where n denotes the sample size, x the auxiliary variable and y the variable under study. This estimator is a simplified form of the estimator presented in H. Theil (1979). Under asymptotic assumptions, the authors derive the formula for the mean square prediction error of the considered robust predictor. The BLU predictor for the assumed superpopulation model is also presented, together with its mean square prediction error. The accuracy of both predictors is compared under the assumption that the studied random variables are normally distributed.