Janusz L. Wywiał
Uniwersytet Ekonomiczny w Katowicach
ON THE LIMIT DISTRIBUTION
OF THE HORVITZ-THOMPSON STATISTIC UNDER THE POISSON SAMPLING DESIGN
Introduction
Let $U_N$ be a fixed population of size $N$, where $N = 2, 3, \ldots$ The elements of the population are identified, so the population can be represented by the set $U_N = \{1, \ldots, N\}$. The observation of a variable under study is denoted by $y_{k,N}$, $k = 1, \ldots, N$, $N = 2, 3, \ldots$, so the vector $\mathbf{y}_N = [y_{1,N}\ y_{2,N} \ldots y_{N,N}]$ is attached to the set $U_N$. In particular, when we assume that $U_N \subset U_{N+1}$, the observations of the variable in the population can be represented more simply by the vector $\mathbf{y}_{N+1} = [\mathbf{y}_N\ y_{N+1}]$, where $\mathbf{y}_N = [y_1\ y_2 \ldots y_N]$. A more particular case is as follows. Let $(y_i, w_i N_t)$ mean that the value $y_i$ is replicated $w_i N_t$ times in the population, where $\sum_{i=1}^{k} w_i = 1$ and $0 < w_i < 1$ for $i = 1, \ldots, k$, $t = 1, 2, \ldots$, and the vector $\mathbf{y}_k = [y_1\ y_2 \ldots y_k]$ is fixed. The size $N_t$ of the population $U_{N_t}$ is determined in such a way that $w_i N_t$ is an integer for all $i = 1, \ldots, k$. So,
$$\mathbf{y}_{N_t} = [(y_1, w_1 N_t)\ (y_2, w_2 N_t) \ldots (y_k, w_k N_t)].$$
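For instance, with $k = 3$ values and shares $w = (0.2, 0.3, 0.5)$, a population of size $N_t = 20$ contains $y_1$ four times, $y_2$ six times and $y_3$ ten times. A minimal sketch of this construction (numpy assumed; all numerical values are illustrative):

```python
import numpy as np

# Distinct values y_1, ..., y_k and their shares w_1, ..., w_k (summing to 1).
y_k = np.array([1.0, 2.0, 5.0])
w = np.array([0.2, 0.3, 0.5])

# N_t must make every w_i * N_t an integer, e.g. N_t = 10, 20, 30, ...
N_t = 20
counts = np.rint(w * N_t).astype(int)  # replication counts w_i * N_t
assert counts.sum() == N_t             # checks that N_t is compatible with w

# The population vector y_{N_t}: each y_i replicated w_i * N_t times.
y_Nt = np.repeat(y_k, counts)
```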
We assume that all elements of the population can be selected for the sample with different probabilities. The $k$-th population element, $k \in U_N$, is selected for the sample with inclusion probability $0 < \pi_{k,N} < 1$, $k = 1, \ldots, N$. More precisely, let $\mathbf{S}_N = [S_{1,N} \ldots S_{N,N}]$ be a vector of independent binary random variables with
$$P(S_{k,N} = 1) = \pi_{k,N} = 1 - P(S_{k,N} = 0). \qquad (1)$$
So, $\mathbf{s}_N = [s_{1,N} \ldots s_{N,N}]$ is a realization of the random sample $\mathbf{S}_N$. The probability distribution of the random sample $\mathbf{S}_N$ is known as the Poisson sampling design [see, e.g., Tillé 2006]:
$$P(\mathbf{S}_N = \mathbf{s}_N) = \prod_{k=1}^{N} \pi_{k,N}^{s_{k,N}} \left(1 - \pi_{k,N}\right)^{1 - s_{k,N}}.$$
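Because the indicators $S_{k,N}$ are independent, drawing a Poisson sample reduces to $N$ independent Bernoulli experiments. A minimal sketch (numpy assumed; the function names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(seed=123)

def poisson_sample(pi):
    """Draw s_N = [s_1,N ... s_N,N]: independent Bernoulli(pi_k,N) indicators."""
    return (rng.random(len(pi)) < pi).astype(int)

def design_probability(s, pi):
    """P(S_N = s_N) = prod_k pi_k^{s_k} * (1 - pi_k)^{1 - s_k}."""
    return float(np.prod(np.where(s == 1, pi, 1.0 - pi)))
```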
The population total $\tilde{y} = \sum_{k \in U_N} y_{k,N}$ can be estimated on the basis of the Horvitz-Thompson statistic [Horvitz and Thompson 1952]:
$$y_{HTS_N} = \sum_{k \in U_N} \frac{y_{k,N}\, S_{k,N}}{\pi_{k,N}}.$$
It is well known that $E(y_{HTS_N}) = \tilde{y}$ if $\pi_{k,N} > 0$ for all $k = 1, \ldots, N$. Because $P(S_{k,N} = 1, S_{h,N} = 1) = \pi_{k,N}\pi_{h,N}$ for all $k = 1, \ldots, N$, $h = 1, \ldots, N$ and $k \ne h$, the variance of the statistic $y_{HTS_N}$ is:
$$V(y_{HTS_N}) = \sum_{k \in U_N} \frac{y_{k,N}^2}{\pi_{k,N}} \left(1 - \pi_{k,N}\right). \qquad (2)$$
Its unbiased estimator is:
$$\hat{V}_S(y_{HTS_N}) = \sum_{k \in U_N} \frac{y_{k,N}^2\, S_{k,N}}{\pi_{k,N}^2} \left(1 - \pi_{k,N}\right). \qquad (3)$$
Let
$$b_{3,N} = \frac{1}{N}\sum_{k \in U_N} |y_{k,N}|^3, \qquad v_{2,N} = \frac{1}{N}\sum_{k \in U_N} y_{k,N}^2, \qquad v_{4,N} = \frac{1}{N}\sum_{k \in U_N} y_{k,N}^4.$$
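The statistic, its variance (2), the estimator (3) and the above moments translate directly into code. A sketch in the same notation (numpy assumed):

```python
import numpy as np

def ht_total(y, s, pi):
    """Horvitz-Thompson statistic: sum_k y_k,N * S_k,N / pi_k,N."""
    return np.sum(y * s / pi)

def ht_variance(y, pi):
    """Variance (2): sum_k y_k,N^2 * (1 - pi_k,N) / pi_k,N."""
    return np.sum(y**2 * (1.0 - pi) / pi)

def ht_variance_est(y, s, pi):
    """Unbiased estimator (3): the sum effectively runs over sampled k only."""
    return np.sum(y**2 * s * (1.0 - pi) / pi**2)

def moments(y):
    """Population moments b_{3,N}, v_{2,N}, v_{4,N}."""
    N = len(y)
    return (np.abs(y)**3).sum() / N, (y**2).sum() / N, (y**4).sum() / N
```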
The original version and proof of Lapunov's [1901] theorem, which is slightly less general, can be found in the monograph by Fisz [1963]. On the basis of the books by Billingsley [2009] and Jakubowski and Sztencel [2004], the following more general version of the theorem is presented.
Theorem 1. Let $Z_{k,N}$, $k = 1, \ldots, N$, $N = 1, 2, \ldots$, be a sequence of independent random variables such that for some $\delta > 0$
$$\beta_N = \frac{B_N}{C_N^{(2+\delta)/2}} \to 0 \quad \text{if } N \to \infty, \qquad (4)$$
where:
$$B_N = \sum_{k=1}^{N} E\left|Z_{k,N} - EZ_{k,N}\right|^{2+\delta}, \qquad C_N = \sum_{k=1}^{N} V(Z_{k,N}). \qquad (5)$$
Under this Lapunov condition, the random variable
$$Z_N = \frac{\sum_{k=1}^{N} \left(Z_{k,N} - EZ_{k,N}\right)}{\sqrt{C_N}}$$
converges in distribution to the standard normal distribution as $N \to \infty$.
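In particular, for $\delta = 1$ condition (4) compares the sum of the third absolute central moments with $C_N^{3/2}$. A minimal numerical sketch of this check (numpy assumed):

```python
import numpy as np

def lyapunov_ratio(third_abs_central_moments, variances):
    """beta_N = B_N / C_N^{3/2}, i.e. condition (4)-(5) with delta = 1."""
    B_N = np.sum(third_abs_central_moments)
    C_N = np.sum(variances)
    return B_N / C_N**1.5
```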
Hájek [1964] considered the limit distribution of the following statistic:
$$H_N = y_{HTS_N} - \tilde{y} - r\sum_{k=1}^{N}\left(S_k - \pi_k\right),$$
where:
$$r = \frac{\sum_{k=1}^{N} y_k \left(1 - \pi_k\right)}{\sum_{k=1}^{N} \pi_k \left(1 - \pi_k\right)}.$$
He proved that the probability distribution of the statistic $H_N$ tends to the normal distribution because it fulfils the well-known Lindeberg condition.
In the next section, the limit theorem for the estimator $y_{HTS_N}$ will be considered.
1. Limit theorem
Firstly, let us formulate the following statistics and the theorem.
$$T_N = \frac{y_{HTS_N} - \tilde{y}}{\sqrt{V(y_{HTS_N})}}, \qquad \hat{T}_N = \frac{y_{HTS_N} - \tilde{y}}{\sqrt{\hat{V}_S(y_{HTS_N})}}. \qquad (6)$$
We say that $\pi_{k,N} = O(N^{-\alpha})$ for a given $0 \le \alpha < 1$ if there exist $a_1$ and $a_0$ such that $0 < a_1 \le a_0 < 1$ and
$$0 < a_1 N^{-\alpha} \le \max_{k=1,\ldots,N}\left\{\pi_{k,N}\right\} \le a_0 N^{-\alpha} < 1 \quad \text{for } N = 1, 2, \ldots \qquad (7)$$
or, equivalently,
$$0 < a_1 \le \max_{k=1,\ldots,N}\left\{N^{\alpha}\pi_{k,N}\right\} \le a_0 < 1 \quad \text{for } N = 1, 2, \ldots$$
In particular, if $\alpha = 0$,
$$0 < a_1 \le \max_{k=1,\ldots,N}\left\{\pi_{k,N}\right\} \le a_0 < 1.$$
Moreover, $\pi_{k,N}^{-1} = O(N^{\alpha})$, because there exist $c_1$ and $c_0$ with $1 < c_1 \le c_0 < \infty$ (one can take $c_1 = a_0^{-1}$ and $c_0 = a_1^{-1}$) such that
$$1 < c_1 N^{\alpha} \le \max_{k=1,\ldots,N}\left\{\pi_{k,N}^{-1}\right\} \le c_0 N^{\alpha} \quad \text{for } N = 1, 2, \ldots \qquad (8)$$
Similarly, $\pi_{k,N}^{-1} - 1 = O(N^{\alpha})$, because there exist $d_1$ and $d_0$ with $0 < d_1 \le c_1 - N^{-\alpha} < c_0 - N^{-\alpha} \le d_0$ such that
$$0 < d_1 N^{\alpha} \le \max_{k=1,\ldots,N}\left\{\pi_{k,N}^{-1} - 1\right\} \le d_0 N^{\alpha} \quad \text{for } N = 1, 2, \ldots \qquad (9)$$
Next, $O(N^{\alpha})O(N^{\gamma}) = O(N^{\alpha+\gamma})$, because for all $\alpha \ge 0$ and $\gamma \ge 0$, the inequalities $0 < d_1 N^{\alpha} \le u_N \le d_0 N^{\alpha}$ and $0 < g_1 N^{\gamma} \le t_N \le g_0 N^{\gamma}$ imply the following one:
$$0 < e_1 N^{\alpha+\gamma} \le d_1 g_1 N^{\alpha+\gamma} \le u_N t_N \le d_0 g_0 N^{\alpha+\gamma} \le e_0 N^{\alpha+\gamma}. \qquad (10)$$
Finally, $O(N^{\alpha})O(N^{-\gamma}) = O(N^{\alpha-\gamma})$, because for all $\alpha \ge 0$ and $\gamma \ge 0$, the same inequalities imply the following one:
$$0 < l_1 N^{\alpha-\gamma} \le \frac{d_1}{g_0} N^{\alpha-\gamma} \le \frac{u_N}{t_N} \le \frac{d_0}{g_1} N^{\alpha-\gamma} \le l_0 N^{\alpha-\gamma}. \qquad (11)$$
Theorem 2. Let $\pi_{k,N} = O(N^{-\alpha})$ for a given $0 \le \alpha < 1$, and let $0 < v_0 \le v_{2,N} \le v_2 < \infty$ and $b_{3,N} \le b_3 < \infty$ for $N = 1, 2, \ldots$ When $N \to \infty$, then $T_N \xrightarrow{d} T \sim N(0,1)$. When additionally $v_{4,N} \le v_4 < \infty$, then $\hat{T}_N \xrightarrow{d} T \sim N(0,1)$.
Proof: On the basis of Theorem 1, it is sufficient to consider $\delta = 1$. The Horvitz-Thompson statistic can be rewritten in the following way:
$$y_{HTS_N} = \sum_{k=1}^{N} Z_{k,N}, \quad \text{where } Z_{k,N} = \frac{y_{k,N}\, S_{k,N}}{\pi_{k,N}}, \quad k = 1, \ldots, N.$$
On the basis of the expression (1), we have:
$$P(Z_{k,N} = z_{k,N}) = \begin{cases} \pi_{k,N} & \text{if } z_{k,N} = \dfrac{y_{k,N}}{\pi_{k,N}}, \\[4pt] 1 - \pi_{k,N} & \text{if } z_{k,N} = 0, \end{cases} \qquad (12)$$
and
$$E(Z_{k,N}) = y_{k,N}, \qquad E(Z_{k,N}^2) = \frac{y_{k,N}^2}{\pi_{k,N}}, \qquad V(Z_{k,N}) = y_{k,N}^2\left(\frac{1}{\pi_{k,N}} - 1\right), \quad k = 1, \ldots, N.$$
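These formulas follow directly from the two-point distribution (12). A quick Monte Carlo check (a sketch; the values of $y_{k,N}$ and $\pi_{k,N}$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
y_k, pi_k = 3.0, 0.25

# Draws of Z_k,N: y_k / pi_k with probability pi_k, and 0 otherwise.
z = np.where(rng.random(200_000) < pi_k, y_k / pi_k, 0.0)

print(z.mean(), y_k)                          # E(Z) = y_k = 3.0
print(z.var(), y_k**2 * (1.0 / pi_k - 1.0))   # V(Z) = 9 * 3 = 27.0
```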
Moreover, since $(1 - \pi_{k,N})^2 + \pi_{k,N}^2 \le 1$,
$$E\left|Z_{k,N} - EZ_{k,N}\right|^3 = E\left|\frac{y_{k,N}\, S_{k,N}}{\pi_{k,N}} - y_{k,N}\right|^3 = |y_{k,N}|^3\left(\pi_{k,N}\frac{(1-\pi_{k,N})^3}{\pi_{k,N}^3} + (1-\pi_{k,N})\right) =$$
$$= |y_{k,N}|^3\, \frac{(1-\pi_{k,N})\left((1-\pi_{k,N})^2 + \pi_{k,N}^2\right)}{\pi_{k,N}^2} \le |y_{k,N}|^3\, \frac{1-\pi_{k,N}}{\pi_{k,N}^2}.$$
This and the expression (5) for $\delta = 1$ lead to the following:
$$B_N = \sum_{k=1}^{N} |y_{k,N}|^3\, \frac{(1-\pi_{k,N})\left((1-\pi_{k,N})^2 + \pi_{k,N}^2\right)}{\pi_{k,N}^2},$$
$$C_N = \sum_{k=1}^{N} y_{k,N}^2\left(\frac{1}{\pi_{k,N}} - 1\right) = V(y_{HTS_N}).$$
On the basis of the expressions (7)-(11) we have:
$$\beta_N = \frac{B_N}{C_N^{3/2}} \le \frac{\sum_{k=1}^{N} |y_{k,N}|^3\, \pi_{k,N}^{-2}\left(1-\pi_{k,N}\right)}{\left(\sum_{k=1}^{N} y_{k,N}^2\left(\pi_{k,N}^{-1}-1\right)\right)^{3/2}} \le \frac{O(N^{2\alpha}) \sum_{k=1}^{N} |y_{k,N}|^3}{\left(O(N^{\alpha})\sum_{k=1}^{N} y_{k,N}^2\right)^{3/2}} = \frac{O(N^{2\alpha+1})\, b_{3,N}}{O\!\left(N^{3(\alpha+1)/2}\right) v_{2,N}^{3/2}} = O\!\left(N^{(\alpha-1)/2}\right)\frac{b_{3,N}}{v_{2,N}^{3/2}}.$$
Since $b_{3,N} \le b_3 < \infty$, $v_{2,N} \ge v_0 > 0$ and $0 \le \alpha < 1$, it is easy to see that $\beta_N \to 0$ when $N \to \infty$. This and Theorem 1 lead to the conclusion that $T_N \xrightarrow{d} T \sim N(0,1)$.
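The convergence just proved can be illustrated by simulation: for a bounded variable under study and constant inclusion probabilities (the $\alpha = 0$ case of (7)), the simulated values of $T_N$ are approximately standard normal. A self-contained sketch (all numerical settings are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
N = 5_000
y = rng.uniform(1.0, 4.0, size=N)   # bounded y keeps v_{2,N} and b_{3,N} bounded
pi = np.full(N, 0.3)                # constant pi: the alpha = 0 case of (7)

y_total = y.sum()
v_true = np.sum(y**2 * (1.0 - pi) / pi)   # variance (2)

t_vals = []
for _ in range(2_000):
    s = (rng.random(N) < pi).astype(int)   # Poisson sample
    y_hts = np.sum(y * s / pi)             # Horvitz-Thompson statistic
    t_vals.append((y_hts - y_total) / np.sqrt(v_true))

t_vals = np.asarray(t_vals)
print(t_vals.mean(), t_vals.std())   # should be close to 0 and 1
```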
In order to prove the second part of the theorem, we first show that
$$R_N = \frac{\hat{V}_S(y_{HTS_N})}{V(y_{HTS_N})}$$
converges in probability to 1. The expression (3) leads to the variance of the estimator of the variance of the Horvitz-Thompson statistic:
$$V\!\left(\hat{V}_S(y_{HTS_N})\right) = V\!\left(\sum_{k=1}^{N} \frac{y_{k,N}^2\left(1-\pi_{k,N}\right)}{\pi_{k,N}^2}\, S_{k,N}\right) = \sum_{k \in U_N} \frac{y_{k,N}^4\left(1-\pi_{k,N}\right)^2}{\pi_{k,N}^4}\, V(S_{k,N}) =$$
$$= \sum_{k \in U_N} \frac{y_{k,N}^4\left(1-\pi_{k,N}\right)^3}{\pi_{k,N}^3} = O(N^{3\alpha})\sum_{k \in U_N} y_{k,N}^4 = O\!\left(N^{3\alpha+1}\right) v_{4,N}.$$
Hence, on the basis of the expression (2), we have:
$$V(R_N) = \frac{V\!\left(\hat{V}_S(y_{HTS_N})\right)}{\left(V(y_{HTS_N})\right)^2} = \frac{\sum_{k \in U_N} \frac{y_{k,N}^4\left(1-\pi_{k,N}\right)^3}{\pi_{k,N}^3}}{\left(\sum_{k \in U_N} y_{k,N}^2\, \frac{1-\pi_{k,N}}{\pi_{k,N}}\right)^{2}} \le \frac{O(N^{3\alpha})\sum_{k \in U_N} y_{k,N}^4}{\left(O(N^{\alpha})\sum_{k \in U_N} y_{k,N}^2\right)^{2}} = \frac{O\!\left(N^{3\alpha+1}\right) v_{4,N}}{O\!\left(N^{2(\alpha+1)}\right) v_{2,N}^2} = O(N^{\alpha-1})\frac{v_{4,N}}{v_{2,N}^2}.$$
Hence, $V(R_N) = O(N^{\alpha-1}) \to 0$ when $N \to \infty$ and $0 \le \alpha < 1$. Since $E(R_N) = 1$ by the unbiasedness of (3), this and the well-known Tchebyshev inequality lead to the conclusion that $R_N = \hat{V}_S(y_{HTS_N})/V(y_{HTS_N})$ converges in probability to 1 (in short: $R_N \xrightarrow{P} 1$) if $v_0 > 0$, $v_4 < \infty$ and $N \to \infty$. Let us note that $\hat{T}_N\sqrt{R_N} = T_N$.
Hence, when $N \to \infty$, then $T_N \xrightarrow{d} T \sim N(0,1)$ and $R_N \xrightarrow{P} 1$. So, this and the well-known Slutsky lemma, see e.g. Van der Vaart [2007], let us conclude that $\hat{T}_N \xrightarrow{d} T \sim N(0,1)$. So, the proof of Theorem 2 has been completed.
2. Applications
The Poisson sampling design is frequently used to model non-response. In this case, $\pi_{k,N}$ is the probability that the $k$-th population element will respond. The Poisson sampling design can also be treated as a model of Internet research; in this case, $\pi_{k,N}$ is the probability that the $k$-th Internet user will respond. Moreover, the Poisson sampling design can be considered in audit sampling. Let us note that, in the cases mentioned, the probabilities $\pi_{k,N}$, $k = 1, \ldots, N$, $N = 2, \ldots$, are usually defined as follows:
$$\pi_{k,N} = \frac{n\, x_{k,N}}{\sum_{i=1}^{N} x_{i,N}},$$
where $n$ is the expected sample size and $x_{k,N}$ is a value of a positive auxiliary variable $x$ observed in the whole population. Let us assume that $0 < a \le x_{k,N} \le b < \infty$ and $n = wN$ for all $k = 1, \ldots, N$, $N = 2, \ldots$, where $0 < w < \frac{a}{b} \le 1$. So, in this case, the first assumption of Theorem 2 is fulfilled (with $\alpha = 0$), because
$$0 < a_1 = \frac{wa}{b} = \frac{na}{bN} \le \pi_{k,N} \le \frac{nb}{aN} = \frac{wb}{a} = a_0 < 1.$$
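A sketch of this construction of the inclusion probabilities (the bounds $a = 1$, $b = 3$ and the fraction $w = 0.1$ are illustrative and satisfy $w < a/b$):

```python
import numpy as np

rng = np.random.default_rng(1)
N, w = 1_000, 0.1                    # expected sample size n = w * N = 100
x = rng.uniform(1.0, 3.0, size=N)    # positive auxiliary variable, a <= x <= b

pi = w * N * x / x.sum()             # pi_k,N = n * x_k,N / sum_i x_i,N
assert np.all((0 < pi) & (pi < 1))   # guaranteed because w < a / b
print(pi.sum())                      # equals n: the expected sample size
```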
Theorem 2 lets us construct the confidence interval for the population total estimated by means of the Poisson-Horvitz-Thompson strategy. Let $\gamma$ be the confidence level and let $u_\gamma$ be the quantile such that $\Phi(u_\gamma) = \frac{1+\gamma}{2}$, where $\Phi(u)$ is the distribution function of the standard normal variable. When $N$ is sufficiently large, the confidence interval for the population total is determined by the expression:
$$P\left(y_{HTS_N} - u_\gamma\sqrt{\hat{V}_S(y_{HTS_N})} < \tilde{y} < y_{HTS_N} + u_\gamma\sqrt{\hat{V}_S(y_{HTS_N})}\right) \approx \gamma.$$
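A sketch of the resulting interval (scipy's normal quantile plays the role of $u_\gamma$; the helper follows the notation above, and $y$ needs to be observed only where $S_{k,N} = 1$):

```python
import numpy as np
from scipy import stats

def ht_confidence_interval(y, s, pi, gamma=0.95):
    """Asymptotic confidence interval for the population total."""
    u_gamma = stats.norm.ppf((1.0 + gamma) / 2.0)        # Phi(u) = (1 + gamma) / 2
    y_hts = np.sum(y * s / pi)                           # Horvitz-Thompson estimate
    se = np.sqrt(np.sum(y**2 * s * (1.0 - pi) / pi**2))  # square root of (3)
    return y_hts - u_gamma * se, y_hts + u_gamma * se
```

The interval is asymptotic: by Theorem 2 its coverage approaches $\gamma$ as $N \to \infty$ under the stated assumptions.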
It is possible to test a hypothesis on the population total. The hypothesis $H_0: \tilde{y} = \tilde{y}_0$ can be tested on the basis of the statistic defined by the expression (6) when $N$ is sufficiently large. Finally, let us note that if the Lapunov condition is fulfilled, then the Lindeberg condition is fulfilled, too [see, e.g., Billingsley 2009]. Hence, if the assumptions of Theorem 2 above are fulfilled, the assumptions of Hájek's theorem are fulfilled as well. Moreover, it seems that in our case the assumptions of Theorem 2 are more simply verified than the Lindeberg condition.
Acknowledgements
The research was supported by grant number N N111 434137 from the Ministry of Science and Higher Education.
Literature
Billingsley P. (2009): Prawdopodobieństwo i miara (Probability and Measure). Wydaw- nictwo Naukowe PWN, Warszawa.
Fisz M. (1963): Probability Theory and Mathematical Statistics. Wiley and Sons, New York.
Hájek J. (1964): Asymptotic Theory of Rejective Sampling with Varying Probabilities from a Finite Population. "The Annals of Mathematical Statistics", Vol. 35, No. 4.
Horvitz D.G., Thompson D.J. (1952): A Generalization of Sampling without Replacement from a Finite Universe. "Journal of the American Statistical Association", No. 47.
Jakubowski J., Sztencel R. (2004): Wstęp do teorii prawdopodobieństwa (Introduction to Probability Theory). SCRIPT, Warszawa.
Lapunov A.M. (1901): Nouvelle forme du théorème sur la limite de probabilité. "Mém. Acad. Sci. St. Pétersbourg", No. 12.
Tillé Y. (2006): Sampling Algorithms. Springer, New York.
Van der Vaart A.W. (2007): Asymptotic Statistics. Cambridge University Press, Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore, São Paulo.
ON THE LIMIT DISTRIBUTION OF THE HORVITZ-THOMPSON STATISTIC FOR A SAMPLE SELECTED ACCORDING TO THE POISSON SAMPLING DESIGN
Summary
In the paper, the limit probability distribution of the well-known Horvitz-Thompson (HT) statistic is derived on the basis of the Lapunov central limit theorem. It turns out that if the inclusion probabilities of the particular population elements, determined by the Poisson sampling design, fulfil certain assumptions and the population size increases without bound, then the distribution of the standardized HT statistic tends to the standard normal distribution. The same result is obtained, under an additional assumption imposed on the inclusion probabilities, when the standard deviation in the standardized form of the HT statistic is replaced by the square root of the unbiased estimator of the variance.
The results of the paper find applications, e.g., in certain types of survey research, in particular Internet surveys, which use statistical inference, i.e., interval estimation or testing of statistical hypotheses.