FOLIA OECONOMICA 3(314)2015 http://dx.doi.org/10.18778/0208-6018.314.03
Oleksii Doronin*, Rostislav Maiboroda**

GEE ESTIMATORS IN MIXTURE MODEL WITH VARYING CONCENTRATIONS
Abstract. We discuss a semiparametric mixture model in which some components are parameterized by a common Euclidean parameter while the others are fully unknown. We introduce a GEE (generalized estimating equations) approach and an adaptive GEE-based approach to parameter estimation. The derived estimators are consistent and asymptotically normal, and they are optimized in terms of their dispersion matrices. The proposed techniques are tested on simulated samples.
Keywords: mixture model, semiparametric estimation, GEE.
1. INTRODUCTION
The cumulative distribution function (CDF) of one observation in a mixture model is a linear combination of the component CDFs $F_1,\dots,F_M$ with mixing probabilities $p^1,\dots,p^M$, $\sum_{m=1}^{M} p^m = 1$:
$$F(x) = \sum_{m=1}^{M} p^m F_m(x).$$
Here $F_m$ is called the CDF of the $m$-th mixture component, and $p^m$ the concentration of that component. In a mixture model with varying concentrations the concentrations depend on the observation index $j$:
$$F_{\xi_j}(x) = \sum_{m=1}^{M} p_j^m F_m(x), \quad j = 1,\dots,N,$$
where $\xi_1,\dots,\xi_N$ are independent observations. We consider the case when some parametric model is known for the first $K$ components: $F_m(x) = F_m(x;t)$, $m = 1,\dots,K$. The parameter $t$ is assumed to be Euclidean: $t \in \Theta \subseteq \mathbb{R}^d$. The true value of $t$ we designate as $\vartheta$ and assume that it is unknown. The CDFs of the last $M-K$ mixture components are assumed to
* Ph.D. student, Department of Probability Theory, Statistics and Actuarial Mathematics, Mechanics and Mathematics Faculty, Taras Shevchenko National University of Kyiv.
** Ph.D., Department of Probability Theory, Statistics and Actuarial Mathematics, Mechanics and Mathematics Faculty, Taras Shevchenko National University of Kyiv.
be fully unknown. We also assume that the concentrations $p_j^m$ are known. Our goal is to estimate $\vartheta$. To do this, we derive consistent and asymptotically normal estimators and optimize them in terms of their dispersion matrices.
2. NONPARAMETRIC ESTIMATE FOR DISTRIBUTION FUNCTION
The CDF of the $m$-th component can be estimated by the weighted empirical distribution function
$$\hat F_m(x) := \frac{1}{N}\sum_{j=1}^{N} a_j^m\,\mathbb{1}\{\xi_j \le x\}.$$
The weights $a^m = (a_j^m)_{j=1,\dots,N}$ are taken as the solution of a minimax problem: they minimize the maximal variance of unbiased estimates of $F_m(x)$ over all possible CDFs $F_m$. The solution is
$$a^m = \mathbf{p}\,\Gamma^{-1} e_m, \qquad \Gamma := \frac{1}{N}\mathbf{p}^T\mathbf{p} \in \mathbb{R}^{M\times M}, \qquad \mathbf{p} := (p_j^m)_{j=1,\dots,N,\;m=1,\dots,M}, \qquad e_m := (\mathbb{1}\{i=m\})_{i=1,\dots,M}.$$
See Maiboroda and Sugakova (2008) for details.
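As a minimal numerical sketch (the toy mixture, the seed, and all names here are our own illustration, not from the paper), the minimax weights and the weighted empirical distribution function can be computed as follows:

```python
import numpy as np

def minimax_weights(P):
    """Minimax weights from Section 2: a^m = p Gamma^{-1} e_m with
    Gamma = (1/N) p^T p; column m of the result holds (a_j^m)_{j=1..N}."""
    N = P.shape[0]
    return P @ np.linalg.inv(P.T @ P / N)

def weighted_ecdf(xi, a, x):
    """F_hat_m(x) = (1/N) sum_j a_j^m 1{xi_j <= x}, for one or several x."""
    return np.mean(a[:, None] * (xi[:, None] <= np.atleast_1d(x)), axis=0)

# toy two-component example with hypothetical concentrations and components
rng = np.random.default_rng(0)
N, M = 2000, 2
P = rng.uniform(size=(N, M))
P /= P.sum(axis=1, keepdims=True)            # rows sum to one
comp = (rng.uniform(size=N) < P[:, 1]).astype(int)
xi = rng.normal(loc=3.0 * comp, scale=1.0)   # component 1: N(0,1); component 2: N(3,1)
A = minimax_weights(P)
# defining property of the weights: (1/N) sum_j a_j^m p_j^k = 1{m = k}
print(np.round(P.T @ A / N, 6))
# weighted ECDF of the first component at its true median
print(weighted_ecdf(xi, A[:, 0], 0.0))       # close to 0.5
```

The printed identity $\frac{1}{N}\mathbf{p}^T a^m = e_m$ is exactly what makes $\hat F_m$ unbiased for $F_m$.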
Note that the weights $a_j^m$ can be negative, so $\hat F_m$ need not be a proper CDF. Thus, $\hat F_m$ can be improved by introducing the improved empirical distribution function (see Maiboroda and Kubaichuk (2005)):
$$\tilde F_m(x) := \min\Big(1,\ \max_{y \le x}\hat F_m(y)\Big).$$

3. GEE ESTIMATE
Consider some set of measurable functions $g_1(\cdot;t),\dots,g_K(\cdot;t):\ \mathbb{R}\times\Theta \to \mathbb{R}^d$. The theoretical moment $\bar g_k(t) := \int g_k(x;t)\,F_k(dx)$ may be estimated by the weighted empirical moment
$$\hat g_k(t) := \frac{1}{N}\sum_{j=1}^{N} a_j^k\,g_k(\xi_j;t).$$
Define the joint weighted empirical moment as
$$\hat g(t) := \sum_{k=1}^{K}\hat g_k(t).$$
Definition. A GEE estimator $\hat\vartheta$ is a measurable function of the sample $\xi_1,\dots,\xi_N$ such that $\hat g(\hat\vartheta) = 0$. In what follows we assume that $P[\exists\,t : \hat g(t) = 0] \to 1$ as $N \to \infty$.
Example. Moment estimators can be represented as GEE estimators. Let $h_1,\dots,h_K$ be a set of estimating functions. Denote the theoretical moment of $h_k$ by
$$H_k(t) := \int h_k(x)\,F_k(dx;t), \quad k = 1,\dots,K,$$
and define the estimating functions as
$$g_k(x;t) := h_k(x) - H_k(t), \quad k = 1,\dots,K.$$
Then the GEE estimator can be represented as $\hat\vartheta = H^{-1}\big(\sum_{k=1}^{K}\hat h_k\big)$, where $\hat h_k := \int h_k(x)\,\hat F_k(dx)$, $H(t) := \sum_{k=1}^{K} H_k(t)$, and $H^{-1}$ is the inverse function of $H$. An analogous improved moment estimate, with $\tilde h_k := \int h_k(x)\,\tilde F_k(dx)$, can be introduced. Consistency of the moment estimators is shown in Theorem 3.1 of Doronin (2014a).
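To make the example concrete, here is a hedged sketch of the moment estimator for a hypothetical two-component mixture in which only the first component is parametric. With $h_1(x) = x$ we get $H_1(t) = t$, so $H$ is the identity and no numerical inversion is needed:

```python
import numpy as np

# Sketch of the moment (GEE) estimator: component 1 is N(theta, 1) with
# unknown theta; component 2 is fully unknown (here Exp(1), known only to
# the simulator). With h_1(x) = x, H is the identity, so
# theta_hat = H^{-1}(h_hat_1) = (1/N) sum_j a_j^1 xi_j.
rng = np.random.default_rng(1)
N = 5000
P = rng.uniform(size=(N, 2))
P /= P.sum(axis=1, keepdims=True)              # concentrations p_j^m
comp = (rng.uniform(size=N) < P[:, 1]).astype(int)
theta_true = 2.0
xi = np.where(comp == 0,
              rng.normal(theta_true, 1.0, N),  # parametric component
              rng.exponential(1.0, N))         # nonparametric component
A = P @ np.linalg.inv(P.T @ P / N)             # minimax weights; column m = a^m
theta_hat = np.mean(A[:, 0] * xi)              # weighted empirical moment h_hat_1
print(theta_hat)                               # close to theta_true = 2.0
```

The weights satisfy $\frac{1}{N}\sum_j a_j^1 p_j^k = \mathbb{1}\{k=1\}$, which is why the weighted first moment is unbiased for the mean of the first component despite the contamination by the unknown component.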
4. ASYMPTOTICS OF GEE ESTIMATOR
Assume that the CDFs $F_1,\dots,F_M$ are absolutely continuous with respect to some sigma-finite measure $\mu$ on the space of observations. Denote the densities of the components' distributions by
$$f_k(x;t) := \frac{dF_k(x;t)}{d\mu(x)},\ k = 1,\dots,K, \qquad f_k(x) := \frac{dF_k(x)}{d\mu(x)},\ k = K+1,\dots,M.$$
Introduce the matrix of estimating functions
$$G(x;t) := \big(g_1(x;t),\dots,g_K(x;t)\big) \in \mathbb{R}^{d\times K}.$$
The expectation of $G(x)$ under the $m$-th component we designate as
$$\bar G_m := \int G(x)\,F_m(dx), \quad m = 1,\dots,M.$$
Introduce the following notation:
$$\alpha^{(r,s)} := \big(\alpha^{(r,s)}_{k,l}\big)_{k,l=1}^{K}, \qquad \alpha^{(r,s)}_{k,l} := \lim_{N\to\infty}\frac{1}{N}\sum_{j=1}^{N} a_j^k a_j^l p_j^r p_j^s, \qquad r,s = 1,\dots,M,$$
$$\beta^{(m)} := \big(\beta^{(m)}_{k,l}\big)_{k,l=1}^{K}, \qquad \beta^{(m)}_{k,l} := \lim_{N\to\infty}\frac{1}{N}\sum_{j=1}^{N} a_j^k a_j^l p_j^m, \qquad m = 1,\dots,M,$$
$$R(x) := \sum_{m=1}^{M}\beta^{(m)} f_m(x) \in \mathbb{R}^{K\times K},$$
$$Z := \int G(x)\,R(x)\,G^T(x)\,\mu(dx) - \sum_{r,s=1}^{M}\bar G_r\,\alpha^{(r,s)}\,\bar G_s^T \in \mathbb{R}^{d\times d},$$
$$V := \sum_{k=1}^{K}\int \frac{\partial g_k(x;t)}{\partial t^T}\bigg|_{t=\vartheta} F_k(dx) \in \mathbb{R}^{d\times d}.$$
Theorem 4.1 (Theorem 3.4 from Doronin (2014a)). Let $\hat\vartheta$ be a GEE estimator in the setting introduced above, and let $U$ be some open neighborhood of the true parameter value $\vartheta$. Assume the following:
(i) $\hat\vartheta$ converges in probability to $\vartheta$ as $N \to \infty$.
(ii) The derivatives $g_k'(x;t) := \partial g_k(x;t)/\partial t^T$ exist and are integrable (i.e. $E_t[\|g_k'(\eta_m;t)\|] < \infty$) for $t \in U$, where $E_t$ denotes expectation under the condition that the true parameter value is $t$, and the $\eta_m$ are formal random variables with distributions $F_m$.
(iii) The functions $\bar g_k^{(m)}(t) := E[g_k(\eta_m;t)]$ are continuous on $U$.
(iv) $E[\sup_{t\in U}\|g_k'(\eta_m;t)\|] < \infty$.
(v) The limit matrix $\lim_{N\to\infty}\frac{1}{N}\mathbf{p}^T\mathbf{p}$ exists and is nonsingular.
(vi) The matrices $\alpha^{(r,s)}$ and $\beta^{(m)}$ exist.
(vii) The matrix $V$ is nonsingular.
(viii) The GEE is unbiased, i.e. $\sum_{k=1}^{K} E_t[g_k(\eta_k;t)] = 0$ for $t \in U$.
Then $\sqrt{N}(\hat\vartheta - \vartheta)$ converges in distribution to the Gaussian law with zero mean and covariance matrix $V^{-1} Z (V^{-1})^T$.
5. LOWER BOUND OF DISPERSION MATRIX FOR GEE ESTIMATOR
Assume that the matrix $Z$ and a nonsingular matrix $V$ exist. Without loss of generality we can assume that two conditions for the GEE estimator are fulfilled:
(i1) $\int g_k(x;\vartheta)\,F_k(dx) = 0$, $k = 1,\dots,K$ (unbiasedness);
(i2) $V = I$ (normalization: replacing the estimating functions $g_k$ by $V^{-1}g_k$ changes neither the estimator nor its asymptotic covariance $V^{-1}Z(V^{-1})^T$, which under this normalization equals $Z$).
Consider the minimization problem for the dispersion matrix $Z$ in the Loewner ordering (i.e. $A \preceq B$ if $B - A$ is nonnegative definite) over all $g_k(x;\cdot)$ satisfying conditions (i1), (i2). Thus, we have to minimize $c^T Z c$ for all $c \in \mathbb{R}^d$. The solution of this problem is a set of estimating functions $g_k^*(x;\cdot)$ which gives the lower bound $Z^*$ of the dispersion matrix $Z$ (see Theorem 4.1 from Doronin (2014a)).
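The Loewner ordering used here can be checked numerically through the eigenvalues of the difference matrix; a small sketch with hypothetical matrices:

```python
import numpy as np

def loewner_leq(A, B, tol=1e-10):
    """A <= B in the Loewner ordering iff B - A is nonnegative definite,
    i.e. all eigenvalues of the (symmetric) difference are >= 0."""
    return bool(np.all(np.linalg.eigvalsh(B - A) >= -tol))

Z_star = np.array([[1.0, 0.2], [0.2, 0.5]])      # hypothetical lower bound
Z = Z_star + np.array([[0.3, 0.1], [0.1, 0.4]])  # some attainable dispersion
print(loewner_leq(Z_star, Z))   # True:  Z_star <= Z
print(loewner_leq(Z, Z_star))   # False: the ordering is not reversed
```

Minimizing $Z$ in this partial ordering is equivalent to minimizing $c^T Z c$, the asymptotic variance of every linear functional $c^T\hat\vartheta$, simultaneously for all $c$.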
6. ADAPTIVE ESTIMATE
Unfortunately, the optimal estimating functions $g_k^*(x;\cdot)$, which attain the lower bound of the dispersion matrix, cannot be used in practice. The first reason is that they depend on the unknown densities $f_k(x)$, $k = 1,\dots,K$. The second is the difficulty of solving the GEE in the general case. Therefore, we consider an adaptive approach.
Each function $g_k(x;t)$ can be approximated as $g_k(x;t) \approx B_k u_k(x;t)$, where $B_k \in \mathbb{R}^{d\times L_k}$ is a matrix of coefficients to be found, and $u_k(x;t) \in \mathbb{R}^{L_k}$ is a vector of some predefined basis functions (e.g. B-splines). Under conditions (i1), (i2) the equation $\sum_{k=1}^{K}\hat g_k(\hat\vartheta) = 0$ can be approximated as
$$\sum_{k=1}^{K}\hat g_k(\vartheta) + (\hat\vartheta - \vartheta) \approx \sum_{k=1}^{K} B_k\hat u_k(\vartheta) + (\hat\vartheta - \vartheta) = 0,$$
where $\hat u_k(t) := \frac{1}{N}\sum_{j=1}^{N} a_j^k u_k(\xi_j;t)$. The solution of this approximated equation is $\hat\vartheta = \vartheta - \sum_{k=1}^{K} B_k\hat u_k(\vartheta)$. Thus, one can start with some consistent pilot estimate $\tilde\vartheta$ and define the adaptive estimate as
$$\hat\vartheta := \tilde\vartheta - \sum_{k=1}^{K}\tilde B_k\hat u_k(\tilde\vartheta).$$
Consistency and asymptotic normality of the introduced adaptive estimate are shown in Lemma 3.3 of Doronin (2014b).
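A hedged sketch of the one-step structure $\hat\vartheta = \tilde\vartheta - (\text{correction})$ behind the adaptive estimate, with the B-spline basis replaced by simple polynomial estimating functions for a hypothetical two-parameter Gaussian component (this illustrates only the one-step correction from a pilot estimate, not the optimal choice of the matrices $B_k$):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 10000
P = rng.uniform(size=(N, 2))
P /= P.sum(axis=1, keepdims=True)
comp = (rng.uniform(size=N) < P[:, 1]).astype(int)
m_true, s2_true = 1.0, 0.25
xi = np.where(comp == 0,
              rng.normal(m_true, np.sqrt(s2_true), N),  # N(m, s2), both unknown
              rng.exponential(1.0, N))                  # fully unknown component
a = (P @ np.linalg.inv(P.T @ P / N))[:, 0]              # minimax weights a^1

def g_hat(t):
    # weighted empirical moments of g(x; t) = (x - m, x^2 - (s2 + m^2))
    m, s2 = t
    return np.array([np.mean(a * xi) - m,
                     np.mean(a * xi ** 2) - (s2 + m ** 2)])

def V(t):
    # Jacobian of the limiting moment function with respect to t
    m, _ = t
    return np.array([[-1.0, 0.0], [-2.0 * m, -1.0]])

theta_pilot = np.array([0.8, 0.4])   # some rough pilot estimate
# one Newton-type correction step of the GEE from the pilot
theta_hat = theta_pilot - np.linalg.solve(V(theta_pilot), g_hat(theta_pilot))
print(theta_hat)                     # roughly (m_true, s2_true)
```

One correction step suffices here because the estimating equation is only mildly nonlinear near the truth; the paper's adaptive estimator has the same one-step form, with $\hat u_k$ in place of these hand-picked moments.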
7. NUMERICAL RESULTS
We chose a three-component mixture model for simulation. All components are taken Gaussian, with parameters $(m_k, \sigma_k)$. The first two components are parameterized with $\vartheta = (m_1, m_2, \sigma)^T$ (different means, common standard deviation $\sigma = \sigma_1 = \sigma_2$); the distribution of the third component is assumed to be fully unknown. The concentrations were generated as pseudo-random values by the formula $p_j^m := s_j^m/(s_j^1 + s_j^2 + s_j^3)$, where the $s_j^m$ are drawn from the uniform distribution on $[0,1]$. Series of samples with sizes 50, 100, 250, 500, 750, 1000, 2000 and 5000 were simulated, 2000 samples in each series. The vectors of basis functions $u_k(x;t)$ for the adaptive estimate were chosen as sets of uniform cubic B-splines with knots at the points $m_k + i\sigma_k$, $i = -5,\dots,5$, where $m_k$ and $\sigma_k$ are the mean and standard deviation of the $k$-th component, respectively. The matrices $B_k$ were chosen to minimize the dispersion matrix. Results are shown in Figure 1.
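The simulation design above can be sketched as follows (the parameter values, the Student-t third component, and the seed are our own illustrative assumptions; the estimators shown are the simple weighted moment estimates, not the full adaptive procedure):

```python
import numpy as np

rng = np.random.default_rng(3)
N, M = 5000, 3
s = rng.uniform(size=(N, M))
P = s / s.sum(axis=1, keepdims=True)   # p_j^m = s_j^m / (s_j^1 + s_j^2 + s_j^3)
# hypothetical parameter values for the two parametric Gaussian components
m1, m2, sigma = -1.0, 1.0, 0.5
u = rng.uniform(size=N)
comp = (u[:, None] > np.cumsum(P, axis=1)).sum(axis=1)   # component labels
xi = np.where(comp == 0, rng.normal(m1, sigma, N),
     np.where(comp == 1, rng.normal(m2, sigma, N),
              rng.standard_t(5, N)))   # third component treated as fully unknown
A = P @ np.linalg.inv(P.T @ P / N)     # minimax weights
mu1 = np.array([np.mean(A[:, k] * xi) for k in range(2)])       # first moments
mu2 = np.array([np.mean(A[:, k] * xi ** 2) for k in range(2)])  # second moments
sigma2_hat = np.mean(mu2 - mu1 ** 2)   # pooled, using the common-sigma assumption
print(mu1, np.sqrt(sigma2_hat))        # roughly (m1, m2) and sigma
```

Repeating this over a grid of sample sizes and 2000 replications per size reproduces the kind of MSE curves reported in Figure 1.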
CONCLUSIONS
The mixture model with varying concentrations is considered. Several estimators for this model are introduced (moment, GEE, adaptive). The proposed estimators are consistent and asymptotically normal under some conditions. The performance of the moment and adaptive estimators is compared on simulated samples. The dispersion of the introduced estimators converges to its theoretical asymptotic value for samples of 1000 or more observations.
REFERENCES
Doronin O. (2012), Robust Estimates for Mixtures with Gaussian Component, "Bulletin of Taras Shevchenko National University of Kyiv. Series: Physics & Mathematics", vol. 1, p. 18–23 (in Ukrainian).
Doronin O. (2014a), Lower bound of dispersion matrix for semiparametric estimation in mixture model, "Theory of Probability and Mathematical Statistics", no. 90, p. 64–76.
Doronin O. (2014b), Adaptive estimation in semiparametric model of mixture with varying concentrations, "Theory of Probability and Mathematical Statistics", no. 91, p. 27–38.
Maiboroda R., Kubaichuk O. (2005), Improved estimators for moments constructed from observations of a mixture, "Theory of Probability and Mathematical Statistics", no. 70, p. 83–92.
Maiboroda R., Sugakova O. (2008), Estimation and classification by observations from mixtures, Kyiv University Publishers, Kyiv (in Ukrainian).
Maiboroda R., Sugakova O., Doronin A. (2013), Generalized estimating equations for mixtures with varying concentrations, "The Canadian Journal of Statistics", vol. 41, no. 2, p. 217–236.
Figure 1. Dispersion of estimates: MSE and RobVar of $\hat m_1$, $\hat m_2$ and $\hat\sigma$ against the sample size $N$. Here MSE is the mean squared error of the parameter estimate multiplied by the number of observations; RobVar is a robust estimate of MSE obtained through the interquartile range of the parameter estimate. The symbol ■ indicates the moment estimates (lower line for improved and upper line for unimproved), and ▲ the adaptive estimates. White symbols indicate theoretical dispersion; ○ indicates the lower bound.
Source: plots generated by Wolfram Mathematica using our own script.