FOLIA OECONOMICA 3(314)2015 http://dx.doi.org/10.18778/0208-6018.314.03
Oleksii Doronin*, Rostislav Maiboroda**

GEE ESTIMATORS IN MIXTURE MODEL WITH VARYING CONCENTRATIONS
Abstract. We discuss a semiparametric mixture model in which some components are parameterized by a common Euclidean parameter while the others are fully unknown. We introduce a GEE (generalized estimating equations) approach and an adaptive GEE-based approach to parameter estimation. The derived estimators are consistent and asymptotically normal, and they are optimized in terms of their dispersion matrices. The proposed techniques are tested on simulated samples.
Keywords: mixture model, semiparametric estimation, GEE.
1. INTRODUCTION
The cumulative distribution function (CDF) of one observation in a mixture model is a linear combination of the component CDFs $F_1,\dots,F_M$ with mixing probabilities $p^1,\dots,p^M$, $\sum_{m=1}^{M} p^m = 1$:
$$F(x) = \sum_{m=1}^{M} p^m F_m(x).$$
Here $F_m$ is called the CDF of the $m$-th mixture component, and $p^m$ the concentration of that component. In a mixture model with varying concentrations the concentrations depend on the observation index $j$:
$$F_{\xi_j}(x) = \sum_{m=1}^{M} p_j^m F_m(x), \quad j = 1,\dots,N,$$
where $\xi_1,\dots,\xi_N$ are independent observations. We consider the case when some parametric model is known for the first $K$ components: $F_m(x) = F_m(x;t)$, $m = 1,\dots,K$. The parameter $t$ is assumed to be Euclidean: $t \in \Theta \subseteq \mathbb{R}^d$. The true value of $t$ we designate as $\vartheta$ and assume that it is unknown. The CDFs of the last $M-K$ mixture components are assumed to
* Ph.D. student, Department of Probability Theory, Statistics and Actuarial Mathematics, Mechanics and Mathematics Faculty, Taras Shevchenko National University of Kyiv.
** Ph.D., Department of Probability Theory, Statistics and Actuarial Mathematics, Mechanics and Mathematics Faculty, Taras Shevchenko National University of Kyiv.
be fully unknown. We also assume that the concentrations $p_j^m$ are known. Our goal is to estimate $\vartheta$. To do this, we derive consistent and asymptotically normal estimators and optimize them in terms of their dispersion matrices.
2. NONPARAMETRIC ESTIMATE FOR DISTRIBUTION FUNCTION
The CDF of the $m$-th component can be estimated by the weighted empirical distribution function
$$\hat F_m(x) := \frac{1}{N}\sum_{j=1}^{N} a_j^m\,\mathbb{1}\{\xi_j \le x\}.$$
The weights $a^m = (a_j^m)_{j=1,\dots,N}$ are taken as the solution of a minimax problem: they minimize the maximal variance of unbiased estimates of $F_m(x)$ over all possible CDFs $F_m$. The solution is
$$a^m = \mathbf{p}\,\Gamma^{-1} e_m, \qquad \Gamma := \frac{1}{N}\mathbf{p}^T\mathbf{p} \in \mathbb{R}^{M\times M}, \qquad \mathbf{p} := (p_j^m)_{j=1,\dots,N,\;m=1,\dots,M}, \qquad e_m := (\mathbb{1}\{i=m\})_{i=1,\dots,M}.$$
See Maiboroda and Sugakova (2008) for details.
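As a minimal numerical sketch (the toy mixture, the seed, and all names here are our own illustration, not from the paper), the minimax weights and the weighted empirical distribution function can be computed as follows:

```python
import numpy as np

def minimax_weights(P):
    """Minimax weights from Section 2: a^m = p Gamma^{-1} e_m with
    Gamma = (1/N) p^T p; column m of the result holds (a_j^m)_{j=1..N}."""
    N = P.shape[0]
    return P @ np.linalg.inv(P.T @ P / N)

def weighted_ecdf(xi, a, x):
    """F_hat_m(x) = (1/N) sum_j a_j^m 1{xi_j <= x}, for one or several x."""
    return np.mean(a[:, None] * (xi[:, None] <= np.atleast_1d(x)), axis=0)

# toy two-component example with hypothetical concentrations and components
rng = np.random.default_rng(0)
N, M = 2000, 2
P = rng.uniform(size=(N, M))
P /= P.sum(axis=1, keepdims=True)            # rows sum to one
comp = (rng.uniform(size=N) < P[:, 1]).astype(int)
xi = rng.normal(loc=3.0 * comp, scale=1.0)   # component 1: N(0,1); component 2: N(3,1)
A = minimax_weights(P)
# defining property of the weights: (1/N) sum_j a_j^m p_j^k = 1{m = k}
print(np.round(P.T @ A / N, 6))
# weighted ECDF of the first component at its true median
print(weighted_ecdf(xi, A[:, 0], 0.0))       # close to 0.5
```

The printed identity $\frac{1}{N}\mathbf{p}^T a^m = e_m$ is exactly what makes $\hat F_m$ unbiased for $F_m$.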
Note that the weights $a_j^m$ can be negative, so $\hat F_m$ need not be a proper CDF. Thus, $\hat F_m$ can be improved by introducing the improved empirical distribution function (see Maiboroda and Kubaichuk (2005)):
$$\tilde F_m(x) := \min\Big(1,\ \max_{y \le x}\hat F_m(y)\Big).$$

3. GEE ESTIMATE
Consider some set of measurable functions $g_1(\cdot;t),\dots,g_K(\cdot;t):\ \mathbb{R}\times\Theta \to \mathbb{R}^d$. The theoretical moment $\bar g_k(t) := \int g_k(x;t)\,F_k(dx)$ may be estimated by the weighted empirical moment
$$\hat g_k(t) := \frac{1}{N}\sum_{j=1}^{N} a_j^k\,g_k(\xi_j;t).$$
Define the joint weighted empirical moment as
$$\hat g(t) := \sum_{k=1}^{K}\hat g_k(t).$$
Definition. A GEE estimator $\hat\vartheta$ is a measurable function of the sample $\xi_1,\dots,\xi_N$ such that $\hat g(\hat\vartheta) = 0$. In what follows we assume that $P[\exists\,t : \hat g(t) = 0] \to 1$ as $N \to \infty$.
Example. Moment estimators can be represented as GEE estimators. Let $h_1,\dots,h_K$ be a set of estimating functions. Denote the theoretical moment of $h_k$ by
$$H_k(t) := \int h_k(x)\,F_k(dx;t), \quad k = 1,\dots,K,$$
and define the estimating functions as
$$g_k(x;t) := h_k(x) - H_k(t), \quad k = 1,\dots,K.$$
Then the GEE estimator can be represented as $\hat\vartheta = H^{-1}\big(\sum_{k=1}^{K}\hat h_k\big)$, where $\hat h_k := \int h_k(x)\,\hat F_k(dx)$, $H(t) := \sum_{k=1}^{K} H_k(t)$, and $H^{-1}$ is the inverse function of $H$. An analogous improved moment estimate, with $\tilde h_k := \int h_k(x)\,\tilde F_k(dx)$, can be introduced. Consistency of the moment estimators is shown in Theorem 3.1 of Doronin (2014a).
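To make the example concrete, here is a hedged sketch of the moment estimator for a hypothetical two-component mixture in which only the first component is parametric. With $h_1(x) = x$ we get $H_1(t) = t$, so $H$ is the identity and no numerical inversion is needed:

```python
import numpy as np

# Sketch of the moment (GEE) estimator: component 1 is N(theta, 1) with
# unknown theta; component 2 is fully unknown (here Exp(1), known only to
# the simulator). With h_1(x) = x, H is the identity, so
# theta_hat = H^{-1}(h_hat_1) = (1/N) sum_j a_j^1 xi_j.
rng = np.random.default_rng(1)
N = 5000
P = rng.uniform(size=(N, 2))
P /= P.sum(axis=1, keepdims=True)              # concentrations p_j^m
comp = (rng.uniform(size=N) < P[:, 1]).astype(int)
theta_true = 2.0
xi = np.where(comp == 0,
              rng.normal(theta_true, 1.0, N),  # parametric component
              rng.exponential(1.0, N))         # nonparametric component
A = P @ np.linalg.inv(P.T @ P / N)             # minimax weights; column m = a^m
theta_hat = np.mean(A[:, 0] * xi)              # weighted empirical moment h_hat_1
print(theta_hat)                               # close to theta_true = 2.0
```

The weights satisfy $\frac{1}{N}\sum_j a_j^1 p_j^k = \mathbb{1}\{k=1\}$, which is why the weighted first moment is unbiased for the mean of the first component despite the contamination by the unknown component.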
4. ASYMPTOTICS OF GEE ESTIMATOR
Assume that the CDFs $F_1,\dots,F_M$ are absolutely continuous with respect to some sigma-finite measure $\mu$ on the space of observations. Denote the densities of the components' distributions by
$$f_k(x;t) := \frac{dF_k(x;t)}{d\mu(x)},\ k = 1,\dots,K, \qquad f_k(x) := \frac{dF_k(x)}{d\mu(x)},\ k = K+1,\dots,M.$$
Introduce the matrix of estimating functions
$$G(x;t) := \big(g_1(x;t),\dots,g_K(x;t)\big) \in \mathbb{R}^{d\times K}.$$
The expectation of $G(x)$ under the $m$-th component we designate as
$$\bar G_m := \int G(x)\,F_m(dx), \quad m = 1,\dots,M.$$
Introduce the following notation:
$$\alpha^{(r,s)} := \big(\alpha^{(r,s)}_{k,l}\big)_{k,l=1}^{K}, \qquad \alpha^{(r,s)}_{k,l} := \lim_{N\to\infty}\frac{1}{N}\sum_{j=1}^{N} a_j^k a_j^l p_j^r p_j^s, \qquad r,s = 1,\dots,M,$$
$$\beta^{(m)} := \big(\beta^{(m)}_{k,l}\big)_{k,l=1}^{K}, \qquad \beta^{(m)}_{k,l} := \lim_{N\to\infty}\frac{1}{N}\sum_{j=1}^{N} a_j^k a_j^l p_j^m, \qquad m = 1,\dots,M,$$
$$R(x) := \sum_{m=1}^{M}\beta^{(m)} f_m(x) \in \mathbb{R}^{K\times K},$$
$$Z := \int G(x)\,R(x)\,G^T(x)\,\mu(dx) - \sum_{r,s=1}^{M}\bar G_r\,\alpha^{(r,s)}\,\bar G_s^T \in \mathbb{R}^{d\times d},$$
$$V := \sum_{k=1}^{K}\int \frac{\partial g_k(x;t)}{\partial t^T}\bigg|_{t=\vartheta} F_k(dx) \in \mathbb{R}^{d\times d}.$$
Theorem 4.1 (Theorem 3.4 from Doronin (2014a)). Let $\hat\vartheta$ be a GEE estimator in the setting introduced above, and let $U$ be some open neighborhood of the true parameter value $\vartheta$. Assume the following:
(i) $\hat\vartheta$ converges in probability to $\vartheta$ as $N \to \infty$.
(ii) The derivatives $g_k'(x;t) := \partial g_k(x;t)/\partial t^T$ exist and are integrable (i.e. $E_t[\|g_k'(\eta_m;t)\|] < \infty$) for $t \in U$, where $E_t$ denotes expectation under the condition that the true parameter value is $t$, and the $\eta_m$ are formal random variables with distributions $F_m$.
(iii) The functions $\bar g_k^{(m)}(t) := E[g_k(\eta_m;t)]$ are continuous on $U$.
(iv) $E[\sup_{t\in U}\|g_k'(\eta_m;t)\|] < \infty$.
(v) The limit matrix $\lim_{N\to\infty}\frac{1}{N}\mathbf{p}^T\mathbf{p}$ exists and is nonsingular.
(vi) The matrices $\alpha^{(r,s)}$ and $\beta^{(m)}$ exist.
(vii) The matrix $V$ is nonsingular.
(viii) The GEE is unbiased, i.e. $\sum_{k=1}^{K} E_t[g_k(\eta_k;t)] = 0$ for $t \in U$.
Then $\sqrt{N}(\hat\vartheta - \vartheta)$ converges in distribution to the Gaussian law with zero mean and covariance matrix $V^{-1} Z (V^{-1})^T$.
5. LOWER BOUND OF DISPERSION MATRIX FOR GEE ESTIMATOR
Assume that the matrix $Z$ and a nonsingular matrix $V$ exist. Without loss of generality we can assume that two conditions for the GEE estimator are fulfilled:
(i1) $\int g_k(x;\vartheta)\,F_k(dx) = 0$, $k = 1,\dots,K$ (unbiasedness);
(i2) $V = I$ (normalization: replacing the estimating functions $g_k$ by $V^{-1}g_k$ changes neither the estimator nor its asymptotic covariance $V^{-1}Z(V^{-1})^T$, which under this normalization equals $Z$).
Consider the minimization problem for the dispersion matrix $Z$ in the Loewner ordering (i.e. $A \preceq B$ if $B - A$ is nonnegative definite) over all $g_k(x;\cdot)$ satisfying conditions (i1), (i2). Thus, we have to minimize $c^T Z c$ for all $c \in \mathbb{R}^d$. The solution of this problem is a set of estimating functions $g_k^*(x;\cdot)$ which gives the lower bound $Z^*$ of the dispersion matrix $Z$ (see Theorem 4.1 from Doronin (2014a)).
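The Loewner ordering used here can be checked numerically through the eigenvalues of the difference matrix; a small sketch with hypothetical matrices:

```python
import numpy as np

def loewner_leq(A, B, tol=1e-10):
    """A <= B in the Loewner ordering iff B - A is nonnegative definite,
    i.e. all eigenvalues of the (symmetric) difference are >= 0."""
    return bool(np.all(np.linalg.eigvalsh(B - A) >= -tol))

Z_star = np.array([[1.0, 0.2], [0.2, 0.5]])      # hypothetical lower bound
Z = Z_star + np.array([[0.3, 0.1], [0.1, 0.4]])  # some attainable dispersion
print(loewner_leq(Z_star, Z))   # True:  Z_star <= Z
print(loewner_leq(Z, Z_star))   # False: the ordering is not reversed
```

Minimizing $Z$ in this partial ordering is equivalent to minimizing $c^T Z c$, the asymptotic variance of every linear functional $c^T\hat\vartheta$, simultaneously for all $c$.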
6. ADAPTIVE ESTIMATE
Unfortunately, the optimal estimating functions $g_k^*(x;\cdot)$, which attain the lower bound of the dispersion matrix, cannot be used in practice. The first reason is that they depend on the unknown densities $f_k(x)$, $k = 1,\dots,K$. The second is the difficulty of solving the GEE in the general case. Therefore, we consider an adaptive approach.
Each function $g_k(x;t)$ can be approximated as $g_k(x;t) \approx B_k u_k(x;t)$, where $B_k \in \mathbb{R}^{d\times L_k}$ is a matrix of coefficients to be found, and $u_k(x;t) \in \mathbb{R}^{L_k}$ is a vector of some predefined basis functions (e.g. B-splines). Under conditions (i1), (i2) the equation $\sum_{k=1}^{K}\hat g_k(\hat\vartheta) = 0$ can be approximated as
$$\sum_{k=1}^{K}\hat g_k(\vartheta) + (\hat\vartheta - \vartheta) \approx \sum_{k=1}^{K} B_k\hat u_k(\vartheta) + (\hat\vartheta - \vartheta) = 0,$$
where $\hat u_k(t) := \frac{1}{N}\sum_{j=1}^{N} a_j^k u_k(\xi_j;t)$. The solution of this approximated equation is $\hat\vartheta = \vartheta - \sum_{k=1}^{K} B_k\hat u_k(\vartheta)$. Thus, one can start with some consistent pilot estimate $\tilde\vartheta$ and define the adaptive estimate as
$$\hat\vartheta := \tilde\vartheta - \sum_{k=1}^{K}\tilde B_k\hat u_k(\tilde\vartheta).$$
Consistency and asymptotic normality of the introduced adaptive estimate are shown in Lemma 3.3 of Doronin (2014b).
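A hedged sketch of the one-step structure $\hat\vartheta = \tilde\vartheta - (\text{correction})$ behind the adaptive estimate, with the B-spline basis replaced by simple polynomial estimating functions for a hypothetical two-parameter Gaussian component (this illustrates only the one-step correction from a pilot estimate, not the optimal choice of the matrices $B_k$):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 10000
P = rng.uniform(size=(N, 2))
P /= P.sum(axis=1, keepdims=True)
comp = (rng.uniform(size=N) < P[:, 1]).astype(int)
m_true, s2_true = 1.0, 0.25
xi = np.where(comp == 0,
              rng.normal(m_true, np.sqrt(s2_true), N),  # N(m, s2), both unknown
              rng.exponential(1.0, N))                  # fully unknown component
a = (P @ np.linalg.inv(P.T @ P / N))[:, 0]              # minimax weights a^1

def g_hat(t):
    # weighted empirical moments of g(x; t) = (x - m, x^2 - (s2 + m^2))
    m, s2 = t
    return np.array([np.mean(a * xi) - m,
                     np.mean(a * xi ** 2) - (s2 + m ** 2)])

def V(t):
    # Jacobian of the limiting moment function with respect to t
    m, _ = t
    return np.array([[-1.0, 0.0], [-2.0 * m, -1.0]])

theta_pilot = np.array([0.8, 0.4])   # some rough pilot estimate
# one Newton-type correction step of the GEE from the pilot
theta_hat = theta_pilot - np.linalg.solve(V(theta_pilot), g_hat(theta_pilot))
print(theta_hat)                     # roughly (m_true, s2_true)
```

One correction step suffices here because the estimating equation is only mildly nonlinear near the truth; the paper's adaptive estimator has the same one-step form, with $\hat u_k$ in place of these hand-picked moments.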
7. NUMERICAL RESULTS
We chose a three-component mixture model for simulation. All components are taken Gaussian, with parameters $(m_k, \sigma_k)$. The first two components are parameterized with $\vartheta = (m_1, m_2, \sigma)^T$ (different means, common standard deviation $\sigma = \sigma_1 = \sigma_2$); the distribution of the third component is assumed to be fully unknown. The concentrations were generated as pseudo-random values by the formula $p_j^m := s_j^m/(s_j^1 + s_j^2 + s_j^3)$, where the $s_j^m$ are drawn from the uniform distribution on $[0,1]$. Series of samples with sizes 50, 100, 250, 500, 750, 1000, 2000 and 5000 were simulated, 2000 samples in each series. The vectors of basis functions $u_k(x;t)$ for the adaptive estimate were chosen as sets of uniform cubic B-splines with knots at the points $m_k + i\sigma_k$, $i = -5,\dots,5$, where $m_k$ and $\sigma_k$ are the mean and standard deviation of the $k$-th component, respectively. The matrices $B_k$ were chosen to minimize the dispersion matrix. Results are shown in Figure 1.
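The simulation design above can be sketched as follows (the parameter values, the Student-t third component, and the seed are our own illustrative assumptions; the estimators shown are the simple weighted moment estimates, not the full adaptive procedure):

```python
import numpy as np

rng = np.random.default_rng(3)
N, M = 5000, 3
s = rng.uniform(size=(N, M))
P = s / s.sum(axis=1, keepdims=True)   # p_j^m = s_j^m / (s_j^1 + s_j^2 + s_j^3)
# hypothetical parameter values for the two parametric Gaussian components
m1, m2, sigma = -1.0, 1.0, 0.5
u = rng.uniform(size=N)
comp = (u[:, None] > np.cumsum(P, axis=1)).sum(axis=1)   # component labels
xi = np.where(comp == 0, rng.normal(m1, sigma, N),
     np.where(comp == 1, rng.normal(m2, sigma, N),
              rng.standard_t(5, N)))   # third component treated as fully unknown
A = P @ np.linalg.inv(P.T @ P / N)     # minimax weights
mu1 = np.array([np.mean(A[:, k] * xi) for k in range(2)])       # first moments
mu2 = np.array([np.mean(A[:, k] * xi ** 2) for k in range(2)])  # second moments
sigma2_hat = np.mean(mu2 - mu1 ** 2)   # pooled, using the common-sigma assumption
print(mu1, np.sqrt(sigma2_hat))        # roughly (m1, m2) and sigma
```

Repeating this over a grid of sample sizes and 2000 replications per size reproduces the kind of MSE curves reported in Figure 1.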
CONCLUSIONS
The mixture model with varying concentrations is considered. Several estimators for this model are introduced (moment, GEE, adaptive). The proposed estimators are consistent and asymptotically normal under some conditions. The performance of the moment and adaptive estimators is compared on simulated samples. The dispersion of the introduced estimators converges to its theoretical asymptotic value for samples of 1000 or more observations.
REFERENCES
Doronin O. (2012), Robust Estimates for Mixtures with Gaussian Component, "Bulletin of Taras Shevchenko National University of Kyiv. Series: Physics & Mathematics", vol. 1, p. 18–23 (in Ukrainian).
Doronin O. (2014a), Lower bound of dispersion matrix for semiparametric estimation in mixture model, "Theory of Probability and Mathematical Statistics", no. 90, p. 64–76.
Doronin O. (2014b), Adaptive estimation in semiparametric model of mixture with varying concentrations, "Theory of Probability and Mathematical Statistics", no. 91, p. 27–38.
Maiboroda R., Kubaichuk O. (2005), Improved estimators for moments constructed from observations of a mixture, "Theory of Probability and Mathematical Statistics", no. 70, p. 83–92.
Maiboroda R., Sugakova O. (2008), Estimation and classification by observations from mixtures, Kyiv University Publishers, Kyiv (in Ukrainian).
Maiboroda R., Sugakova O., Doronin A. (2013), Generalized estimating equations for mixtures with varying concentrations, "The Canadian Journal of Statistics", vol. 41, no. 2, p. 217–236.
Figure 1. Dispersion of estimates: MSE and RobVar of $\hat m_1$, $\hat m_2$ and $\hat\sigma$ against the sample size $N$. Here MSE is the mean squared error of the parameter estimate multiplied by the number of observations; RobVar is a robust estimate of MSE obtained through the interquartile range of the parameter estimate. The symbol ■ indicates the moment estimates (lower line for improved and upper line for unimproved), and ▲ the adaptive estimates. White symbols indicate theoretical dispersion; ○ indicates the lower bound.
Source: plots generated by Wolfram Mathematica using our own script.