Janusz L. Wywiał
Uniwersytet Ekonomiczny w Katowicach
ON THE LIMIT DISTRIBUTION
OF THE HORVITZ-THOMPSON STATISTIC UNDER THE POISSON SAMPLING DESIGN
Introduction
Let $U_N$ be a fixed population of size $N$, where $N = 2, 3, \ldots$ The elements of the population are identified, so the population can be represented by the set $U_N = \{1, \ldots, N\}$. The observation of a variable under study is denoted by $y_{k,N}$, $k = 1, \ldots, N$, $N = 2, 3, \ldots$, so the vector $\mathbf{y}_N = [y_{1,N}\ y_{2,N} \ldots y_{N,N}]$ is attached to the set $U_N$. In particular, when we assume that $U_N \subset U_{N+1}$, the observations of the variable in the population can be represented more simply by the vector $\mathbf{y}_{N+1} = [\mathbf{y}_N\ y_{N+1}]$, where $\mathbf{y}_N = [y_1\ y_2 \ldots y_N]$. A more particular case is as follows. Let $(y_i, w_i N_t)$ mean that the value $y_i$ is replicated $w_i N_t$ times in the population, where $\sum_{i=1}^{k} w_i = 1$ and $0 < w_i < 1$ for $i = 1, \ldots, k$, $t = 1, 2, \ldots$, and the vector $\mathbf{y}_k = [y_1\ y_2 \ldots y_k]$ is fixed. The size $N_t$ of the population $U_{N_t}$ is determined in such a way that $w_i N_t$ is an integer for all $i = 1, \ldots, k$. So,
$$\mathbf{y}_{N_t} = [(y_1, w_1 N_t)\ (y_2, w_2 N_t) \ldots (y_k, w_k N_t)].$$
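For instance, with $k = 3$ values and shares $w = (0.2, 0.3, 0.5)$, a population of size $N_t = 20$ contains $y_1$ four times, $y_2$ six times and $y_3$ ten times. A minimal sketch of this construction (numpy assumed; all numerical values are illustrative):

```python
import numpy as np

# Distinct values y_1, ..., y_k and their shares w_1, ..., w_k (summing to 1).
y_k = np.array([1.0, 2.0, 5.0])
w = np.array([0.2, 0.3, 0.5])

# N_t must make every w_i * N_t an integer, e.g. N_t = 10, 20, 30, ...
N_t = 20
counts = np.rint(w * N_t).astype(int)  # replication counts w_i * N_t
assert counts.sum() == N_t             # checks that N_t is compatible with w

# The population vector y_{N_t}: each y_i replicated w_i * N_t times.
y_Nt = np.repeat(y_k, counts)
```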
We assume that all elements of the population can be selected for the sample with different probabilities. The $k$-th population element, $k \in U_N$, is selected for the sample with inclusion probability $0 < \pi_{k,N} < 1$, $k = 1, \ldots, N$. More precisely, let $\mathbf{S}_N = [S_{1,N} \ldots S_{N,N}]$ be a vector of independent binary random variables with
$$P(S_{k,N} = 1) = \pi_{k,N} = 1 - P(S_{k,N} = 0). \qquad (1)$$
So, $\mathbf{s}_N = [s_{1,N} \ldots s_{N,N}]$ is a realization of the random sample $\mathbf{S}_N$. The probability distribution of the random sample $\mathbf{S}_N$ is known as the Poisson sampling design [see, e.g., Tillé 2006]:
$$P(\mathbf{S}_N = \mathbf{s}_N) = \prod_{k=1}^{N} \pi_{k,N}^{s_{k,N}} \left(1 - \pi_{k,N}\right)^{1 - s_{k,N}}.$$
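Because the indicators $S_{k,N}$ are independent, drawing a Poisson sample reduces to $N$ independent Bernoulli experiments. A minimal sketch (numpy assumed; the function names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(seed=123)

def poisson_sample(pi):
    """Draw s_N = [s_1,N ... s_N,N]: independent Bernoulli(pi_k,N) indicators."""
    return (rng.random(len(pi)) < pi).astype(int)

def design_probability(s, pi):
    """P(S_N = s_N) = prod_k pi_k^{s_k} * (1 - pi_k)^{1 - s_k}."""
    return float(np.prod(np.where(s == 1, pi, 1.0 - pi)))
```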
The population total $\tilde{y} = \sum_{k \in U_N} y_{k,N}$ can be estimated on the basis of the Horvitz-Thompson statistic [Horvitz and Thompson 1952]:
$$y_{HTS_N} = \sum_{k \in U_N} \frac{y_{k,N}\, S_{k,N}}{\pi_{k,N}}.$$
It is well known that $E(y_{HTS_N}) = \tilde{y}$ if $\pi_{k,N} > 0$ for all $k = 1, \ldots, N$. Because $P(S_{k,N} = 1, S_{h,N} = 1) = \pi_{k,N}\pi_{h,N}$ for all $k = 1, \ldots, N$, $h = 1, \ldots, N$ and $k \ne h$, the variance of the statistic $y_{HTS_N}$ is:
$$V(y_{HTS_N}) = \sum_{k \in U_N} \frac{y_{k,N}^2}{\pi_{k,N}} \left(1 - \pi_{k,N}\right). \qquad (2)$$
Its unbiased estimator is:
$$\hat{V}_S(y_{HTS_N}) = \sum_{k \in U_N} \frac{y_{k,N}^2\, S_{k,N}}{\pi_{k,N}^2} \left(1 - \pi_{k,N}\right). \qquad (3)$$
Let
$$b_{3,N} = \frac{1}{N}\sum_{k \in U_N} |y_{k,N}|^3, \qquad v_{2,N} = \frac{1}{N}\sum_{k \in U_N} y_{k,N}^2, \qquad v_{4,N} = \frac{1}{N}\sum_{k \in U_N} y_{k,N}^4.$$
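The statistic, its variance (2), the estimator (3) and the above moments translate directly into code. A sketch in the same notation (numpy assumed):

```python
import numpy as np

def ht_total(y, s, pi):
    """Horvitz-Thompson statistic: sum_k y_k,N * S_k,N / pi_k,N."""
    return np.sum(y * s / pi)

def ht_variance(y, pi):
    """Variance (2): sum_k y_k,N^2 * (1 - pi_k,N) / pi_k,N."""
    return np.sum(y**2 * (1.0 - pi) / pi)

def ht_variance_est(y, s, pi):
    """Unbiased estimator (3): the sum effectively runs over sampled k only."""
    return np.sum(y**2 * s * (1.0 - pi) / pi**2)

def moments(y):
    """Population moments b_{3,N}, v_{2,N}, v_{4,N}."""
    N = len(y)
    return (np.abs(y)**3).sum() / N, (y**2).sum() / N, (y**4).sum() / N
```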
The original version and proof of Lapunov's [1901] theorem, which is slightly less general, can be found in the monograph by Fisz [1963]. On the basis of the books by Billingsley [2009] and Jakubowski and Sztencel [2004], the following more general version of the theorem is presented.
Theorem 1. Let $Z_{k,N}$, $k = 1, \ldots, N$, $N = 1, 2, \ldots$, be a sequence of independent random variables such that for some $\delta > 0$
$$\beta_N = \frac{B_N}{C_N^{(2+\delta)/2}} \to 0 \quad \text{if } N \to \infty, \qquad (4)$$
where:
$$B_N = \sum_{k=1}^{N} E\left|Z_{k,N} - EZ_{k,N}\right|^{2+\delta}, \qquad C_N = \sum_{k=1}^{N} V(Z_{k,N}). \qquad (5)$$
Under this Lapunov condition, the random variable
$$Z_N = \frac{\sum_{k=1}^{N} \left(Z_{k,N} - EZ_{k,N}\right)}{\sqrt{C_N}}$$
converges in distribution to the standard normal distribution as $N \to \infty$.
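In particular, for $\delta = 1$ condition (4) compares the sum of the third absolute central moments with $C_N^{3/2}$. A minimal numerical sketch of this check (numpy assumed):

```python
import numpy as np

def lyapunov_ratio(third_abs_central_moments, variances):
    """beta_N = B_N / C_N^{3/2}, i.e. condition (4)-(5) with delta = 1."""
    B_N = np.sum(third_abs_central_moments)
    C_N = np.sum(variances)
    return B_N / C_N**1.5
```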
Hájek [1964] considered the limit distribution of the following statistic:
$$H_N = y_{HTS_N} - \tilde{y} - r\sum_{k=1}^{N}\left(S_k - \pi_k\right),$$
where:
$$r = \frac{\sum_{k=1}^{N} y_k \left(1 - \pi_k\right)}{\sum_{k=1}^{N} \pi_k \left(1 - \pi_k\right)}.$$
He proved that the probability distribution of the statistic $H_N$ tends to the normal distribution because it fulfils the well-known Lindeberg condition.
In the next section, the limit theorem for the estimator $y_{HTS_N}$ will be considered.
1. Limit theorem
Firstly, let us formulate the following statistics and the theorem.
$$T_N = \frac{y_{HTS_N} - \tilde{y}}{\sqrt{V(y_{HTS_N})}}, \qquad \hat{T}_N = \frac{y_{HTS_N} - \tilde{y}}{\sqrt{\hat{V}_S(y_{HTS_N})}}. \qquad (6)$$
We say that $\pi_{k,N} = O(N^{-\alpha})$ for a given $0 \le \alpha < 1$ if there exist $a_1$ and $a_0$ such that $0 < a_1 \le a_0 < 1$ and
$$0 < a_1 N^{-\alpha} \le \max_{k=1,\ldots,N}\left\{\pi_{k,N}\right\} \le a_0 N^{-\alpha} < 1 \quad \text{for } N = 1, 2, \ldots \qquad (7)$$
or, equivalently,
$$0 < a_1 \le \max_{k=1,\ldots,N}\left\{N^{\alpha}\pi_{k,N}\right\} \le a_0 < 1 \quad \text{for } N = 1, 2, \ldots$$
In particular, if $\alpha = 0$,
$$0 < a_1 \le \max_{k=1,\ldots,N}\left\{\pi_{k,N}\right\} \le a_0 < 1.$$
Moreover, $\pi_{k,N}^{-1} = O(N^{\alpha})$, because there exist $c_1$ and $c_0$ with $1 < c_1 \le c_0 < \infty$ (one can take $c_1 = a_0^{-1}$ and $c_0 = a_1^{-1}$) such that
$$1 < c_1 N^{\alpha} \le \max_{k=1,\ldots,N}\left\{\pi_{k,N}^{-1}\right\} \le c_0 N^{\alpha} \quad \text{for } N = 1, 2, \ldots \qquad (8)$$
Similarly, $\pi_{k,N}^{-1} - 1 = O(N^{\alpha})$, because there exist $d_1$ and $d_0$ with $0 < d_1 \le c_1 - N^{-\alpha} < c_0 - N^{-\alpha} \le d_0$ such that
$$0 < d_1 N^{\alpha} \le \max_{k=1,\ldots,N}\left\{\pi_{k,N}^{-1} - 1\right\} \le d_0 N^{\alpha} \quad \text{for } N = 1, 2, \ldots \qquad (9)$$
Next, $O(N^{\alpha})O(N^{\gamma}) = O(N^{\alpha+\gamma})$, because for all $\alpha \ge 0$ and $\gamma \ge 0$, the inequalities $0 < d_1 N^{\alpha} \le u_N \le d_0 N^{\alpha}$ and $0 < g_1 N^{\gamma} \le t_N \le g_0 N^{\gamma}$ imply the following one:
$$0 < e_1 N^{\alpha+\gamma} \le d_1 g_1 N^{\alpha+\gamma} \le u_N t_N \le d_0 g_0 N^{\alpha+\gamma} \le e_0 N^{\alpha+\gamma}. \qquad (10)$$
Finally, $O(N^{\alpha})O(N^{-\gamma}) = O(N^{\alpha-\gamma})$, because for all $\alpha \ge 0$ and $\gamma \ge 0$, the same inequalities imply the following one:
$$0 < l_1 N^{\alpha-\gamma} \le \frac{d_1}{g_0} N^{\alpha-\gamma} \le \frac{u_N}{t_N} \le \frac{d_0}{g_1} N^{\alpha-\gamma} \le l_0 N^{\alpha-\gamma}. \qquad (11)$$
Theorem 2. Let $\pi_{k,N} = O(N^{-\alpha})$ for a given $0 \le \alpha < 1$, and let $0 < v_0 \le v_{2,N} \le v_2 < \infty$ and $b_{3,N} \le b_3 < \infty$ for $N = 1, 2, \ldots$ When $N \to \infty$, then $T_N \xrightarrow{d} T \sim N(0,1)$. When additionally $v_{4,N} \le v_4 < \infty$, then $\hat{T}_N \xrightarrow{d} T \sim N(0,1)$.
Proof: On the basis of Theorem 1, it is sufficient to consider $\delta = 1$. The Horvitz-Thompson statistic can be rewritten in the following way:
$$y_{HTS_N} = \sum_{k=1}^{N} Z_{k,N}, \quad \text{where } Z_{k,N} = \frac{y_{k,N}\, S_{k,N}}{\pi_{k,N}}, \quad k = 1, \ldots, N.$$
On the basis of the expression (1), we have:
$$P(Z_{k,N} = z_{k,N}) = \begin{cases} \pi_{k,N} & \text{if } z_{k,N} = \dfrac{y_{k,N}}{\pi_{k,N}}, \\[4pt] 1 - \pi_{k,N} & \text{if } z_{k,N} = 0, \end{cases} \qquad (12)$$
and
$$E(Z_{k,N}) = y_{k,N}, \qquad E(Z_{k,N}^2) = \frac{y_{k,N}^2}{\pi_{k,N}}, \qquad V(Z_{k,N}) = y_{k,N}^2\left(\frac{1}{\pi_{k,N}} - 1\right), \quad k = 1, \ldots, N.$$
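These formulas follow directly from the two-point distribution (12). A quick Monte Carlo check (a sketch; the values of $y_{k,N}$ and $\pi_{k,N}$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
y_k, pi_k = 3.0, 0.25

# Draws of Z_k,N: y_k / pi_k with probability pi_k, and 0 otherwise.
z = np.where(rng.random(200_000) < pi_k, y_k / pi_k, 0.0)

print(z.mean(), y_k)                          # E(Z) = y_k = 3.0
print(z.var(), y_k**2 * (1.0 / pi_k - 1.0))   # V(Z) = 9 * 3 = 27.0
```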
Moreover, since $(1 - \pi_{k,N})^2 + \pi_{k,N}^2 \le 1$,
$$E\left|Z_{k,N} - EZ_{k,N}\right|^3 = E\left|\frac{y_{k,N}\, S_{k,N}}{\pi_{k,N}} - y_{k,N}\right|^3 = |y_{k,N}|^3\left(\pi_{k,N}\frac{(1-\pi_{k,N})^3}{\pi_{k,N}^3} + (1-\pi_{k,N})\right) =$$
$$= |y_{k,N}|^3\, \frac{(1-\pi_{k,N})\left((1-\pi_{k,N})^2 + \pi_{k,N}^2\right)}{\pi_{k,N}^2} \le |y_{k,N}|^3\, \frac{1-\pi_{k,N}}{\pi_{k,N}^2}.$$
This and the expression (5) for $\delta = 1$ lead to the following:
$$B_N = \sum_{k=1}^{N} |y_{k,N}|^3\, \frac{(1-\pi_{k,N})\left((1-\pi_{k,N})^2 + \pi_{k,N}^2\right)}{\pi_{k,N}^2},$$
$$C_N = \sum_{k=1}^{N} y_{k,N}^2\left(\frac{1}{\pi_{k,N}} - 1\right) = V(y_{HTS_N}).$$
On the basis of the expressions (7)-(11) we have:
$$\beta_N = \frac{B_N}{C_N^{3/2}} \le \frac{\sum_{k=1}^{N} |y_{k,N}|^3\, \pi_{k,N}^{-2}\left(1-\pi_{k,N}\right)}{\left(\sum_{k=1}^{N} y_{k,N}^2\left(\pi_{k,N}^{-1}-1\right)\right)^{3/2}} \le \frac{O(N^{2\alpha}) \sum_{k=1}^{N} |y_{k,N}|^3}{\left(O(N^{\alpha})\sum_{k=1}^{N} y_{k,N}^2\right)^{3/2}} = \frac{O(N^{2\alpha+1})\, b_{3,N}}{O\!\left(N^{3(\alpha+1)/2}\right) v_{2,N}^{3/2}} = O\!\left(N^{(\alpha-1)/2}\right)\frac{b_{3,N}}{v_{2,N}^{3/2}}.$$
Since $b_{3,N} \le b_3 < \infty$, $v_{2,N} \ge v_0 > 0$ and $0 \le \alpha < 1$, it is easy to see that $\beta_N \to 0$ when $N \to \infty$. This and Theorem 1 lead to the conclusion that $T_N \xrightarrow{d} T \sim N(0,1)$.
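The convergence just proved can be illustrated by simulation: for a bounded variable under study and constant inclusion probabilities (the $\alpha = 0$ case of (7)), the simulated values of $T_N$ are approximately standard normal. A self-contained sketch (all numerical settings are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
N = 5_000
y = rng.uniform(1.0, 4.0, size=N)   # bounded y keeps v_{2,N} and b_{3,N} bounded
pi = np.full(N, 0.3)                # constant pi: the alpha = 0 case of (7)

y_total = y.sum()
v_true = np.sum(y**2 * (1.0 - pi) / pi)   # variance (2)

t_vals = []
for _ in range(2_000):
    s = (rng.random(N) < pi).astype(int)   # Poisson sample
    y_hts = np.sum(y * s / pi)             # Horvitz-Thompson statistic
    t_vals.append((y_hts - y_total) / np.sqrt(v_true))

t_vals = np.asarray(t_vals)
print(t_vals.mean(), t_vals.std())   # should be close to 0 and 1
```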
In order to prove the second part of the theorem, we first show that
$$R_N = \frac{\hat{V}_S(y_{HTS_N})}{V(y_{HTS_N})}$$
converges in probability to 1. The expression (3) leads to the variance of the estimator of the variance of the Horvitz-Thompson statistic:
$$V\!\left(\hat{V}_S(y_{HTS_N})\right) = V\!\left(\sum_{k=1}^{N} \frac{y_{k,N}^2\left(1-\pi_{k,N}\right)}{\pi_{k,N}^2}\, S_{k,N}\right) = \sum_{k \in U_N} \frac{y_{k,N}^4\left(1-\pi_{k,N}\right)^2}{\pi_{k,N}^4}\, V(S_{k,N}) =$$
$$= \sum_{k \in U_N} \frac{y_{k,N}^4\left(1-\pi_{k,N}\right)^3}{\pi_{k,N}^3} = O(N^{3\alpha})\sum_{k \in U_N} y_{k,N}^4 = O\!\left(N^{3\alpha+1}\right) v_{4,N}.$$
Hence, on the basis of the expression (2), we have:
$$V(R_N) = \frac{V\!\left(\hat{V}_S(y_{HTS_N})\right)}{\left(V(y_{HTS_N})\right)^2} = \frac{\sum_{k \in U_N} \frac{y_{k,N}^4\left(1-\pi_{k,N}\right)^3}{\pi_{k,N}^3}}{\left(\sum_{k \in U_N} y_{k,N}^2\, \frac{1-\pi_{k,N}}{\pi_{k,N}}\right)^{2}} \le \frac{O(N^{3\alpha})\sum_{k \in U_N} y_{k,N}^4}{\left(O(N^{\alpha})\sum_{k \in U_N} y_{k,N}^2\right)^{2}} = \frac{O\!\left(N^{3\alpha+1}\right) v_{4,N}}{O\!\left(N^{2(\alpha+1)}\right) v_{2,N}^2} = O(N^{\alpha-1})\frac{v_{4,N}}{v_{2,N}^2}.$$
Hence, $V(R_N) = O(N^{\alpha-1}) \to 0$ when $N \to \infty$ and $0 \le \alpha < 1$. Since $E(R_N) = 1$ by the unbiasedness of (3), this and the well-known Tchebyshev inequality lead to the conclusion that $R_N = \hat{V}_S(y_{HTS_N})/V(y_{HTS_N})$ converges in probability to 1 (in short: $R_N \xrightarrow{P} 1$) if $v_0 > 0$, $v_4 < \infty$ and $N \to \infty$. Let us note that $\hat{T}_N\sqrt{R_N} = T_N$.
Hence, when $N \to \infty$, then $T_N \xrightarrow{d} T \sim N(0,1)$ and $R_N \xrightarrow{P} 1$. So, this and the well-known Slutsky lemma, see e.g. Van der Vaart [2007], let us conclude that $\hat{T}_N \xrightarrow{d} T \sim N(0,1)$. So, the proof of Theorem 2 has been completed.
2. Applications
The Poisson sampling design is frequently used to model non-response. In this case, $\pi_{k,N}$ is the probability that the $k$-th population element will respond. The Poisson sampling design can also be treated as a model of Internet research; in this case, $\pi_{k,N}$ is the probability that the $k$-th Internet user will respond. Moreover, the Poisson sampling design can be considered in audit sampling. Let us note that, in the cases mentioned, the probabilities $\pi_{k,N}$, $k = 1, \ldots, N$, $N = 2, \ldots$, are usually defined as follows:
$$\pi_{k,N} = \frac{n\, x_{k,N}}{\sum_{i=1}^{N} x_{i,N}},$$
where $n$ is the expected sample size and $x_{k,N}$ is a value of a positive auxiliary variable $x$ observed in the whole population. Let us assume that $0 < a \le x_{k,N} \le b < \infty$ and $n = wN$ for all $k = 1, \ldots, N$, $N = 2, \ldots$, where $0 < w < \frac{a}{b} \le 1$. So, in this case, the first assumption of Theorem 2 is fulfilled (with $\alpha = 0$), because
$$0 < a_1 = \frac{wa}{b} = \frac{na}{bN} \le \pi_{k,N} \le \frac{nb}{aN} = \frac{wb}{a} = a_0 < 1.$$
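A sketch of this construction of the inclusion probabilities (the bounds $a = 1$, $b = 3$ and the fraction $w = 0.1$ are illustrative and satisfy $w < a/b$):

```python
import numpy as np

rng = np.random.default_rng(1)
N, w = 1_000, 0.1                    # expected sample size n = w * N = 100
x = rng.uniform(1.0, 3.0, size=N)    # positive auxiliary variable, a <= x <= b

pi = w * N * x / x.sum()             # pi_k,N = n * x_k,N / sum_i x_i,N
assert np.all((0 < pi) & (pi < 1))   # guaranteed because w < a / b
print(pi.sum())                      # equals n: the expected sample size
```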
Theorem 2 lets us construct the confidence interval for the population total estimated by means of the Poisson-Horvitz-Thompson strategy. Let $\gamma$ be the confidence level and let $u_\gamma$ be the quantile such that $\Phi(u_\gamma) = \frac{1+\gamma}{2}$, where $\Phi(u)$ is the distribution function of the standard normal variable. When $N$ is sufficiently large, the confidence interval for the population total is determined by the expression:
$$P\left(y_{HTS_N} - u_\gamma\sqrt{\hat{V}_S(y_{HTS_N})} < \tilde{y} < y_{HTS_N} + u_\gamma\sqrt{\hat{V}_S(y_{HTS_N})}\right) \approx \gamma.$$
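A sketch of the resulting interval (scipy's normal quantile plays the role of $u_\gamma$; the helper follows the notation above, and $y$ needs to be observed only where $S_{k,N} = 1$):

```python
import numpy as np
from scipy import stats

def ht_confidence_interval(y, s, pi, gamma=0.95):
    """Asymptotic confidence interval for the population total."""
    u_gamma = stats.norm.ppf((1.0 + gamma) / 2.0)        # Phi(u) = (1 + gamma) / 2
    y_hts = np.sum(y * s / pi)                           # Horvitz-Thompson estimate
    se = np.sqrt(np.sum(y**2 * s * (1.0 - pi) / pi**2))  # square root of (3)
    return y_hts - u_gamma * se, y_hts + u_gamma * se
```

The interval is asymptotic: by Theorem 2 its coverage approaches $\gamma$ as $N \to \infty$ under the stated assumptions.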
It is possible to test a hypothesis on the population total. The hypothesis $H_0: \tilde{y} = \tilde{y}_0$ can be tested on the basis of the statistic defined by the expression (6) when $N$ is sufficiently large. Finally, let us note that if the Lapunov condition is fulfilled, then the Lindeberg condition is fulfilled, too [see, e.g., Billingsley 2009]. Hence, if the assumptions of Theorem 2 above are fulfilled, the assumptions of Hájek's theorem are fulfilled as well. Moreover, it seems that in our case the assumptions of Theorem 2 are more simply verified than the Lindeberg condition.
Acknowledgements
The research was supported by grant number N N111 434137 from the Ministry of Science and Higher Education.
Literature
Billingsley P. (2009): Prawdopodobieństwo i miara (Probability and Measure). Wydaw- nictwo Naukowe PWN, Warszawa.
Fisz M. (1963): Probability Theory and Mathematical Statistics. Wiley and Sons, New York.
Hájek J. (1964): Asymptotic Theory of Rejective Sampling with Varying Probabilities from a Finite Population. "The Annals of Mathematical Statistics", Vol. 35, No. 4.
Horvitz D.G., Thompson D.J. (1952): A Generalization of Sampling without Replacement from a Finite Universe. "Journal of the American Statistical Association", No. 47.
Jakubowski J., Sztencel R. (2004): Wstęp do teorii prawdopodobieństwa (Introduction to Probability Theory). SCRIPT, Warszawa.
Lapunov A.M. (1901): Nouvelle forme du théorème sur la limite de probabilité. "Mém. Acad. Sci. St. Pétersbourg", No. 12.
Tillé Y. (2006): Sampling Algorithms. Springer, New York.
Van der Vaart A.W. (2007): Asymptotic Statistics. Cambridge University Press, Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore, São Paulo.
ON THE LIMIT DISTRIBUTION OF THE HORVITZ-THOMPSON STATISTIC FOR A SAMPLE SELECTED ACCORDING TO THE POISSON SAMPLING DESIGN
Summary
In the paper, the limit probability distribution of the well-known Horvitz-Thompson (HT) statistic is derived on the basis of the Lapunov central limit theorem. It turns out that if the inclusion probabilities of the particular population elements, determined by the Poisson sampling design, fulfil certain assumptions and the population size increases without bound, then the distribution of the standardized HT statistic tends to the standard normal distribution. The same result is obtained, under an additional assumption imposed on the inclusion probabilities, when the standard deviation in the standardized form of the HT statistic is replaced by the square root of the unbiased estimator of the variance.
The results of the paper find applications, e.g., in certain types of survey research, in particular Internet surveys, which use statistical inference, i.e., interval estimation or testing of statistical hypotheses.