
ACTA UNIVERSITATIS LODZIENSIS
FOLIA OECONOMICA 228, 2009

Ewa Witek*

ON AN IMPROVEMENT OF THE MODEL-BASED CLUSTERING METHOD

Abstract. An improvement of the model-based clustering (MBC) method in the case when the EM algorithm fails as a result of singularities is the basic aim of this paper. Replacement of the maximum likelihood (MLE) estimator by a maximum a posteriori (MAP) estimator, also found by the EM algorithm, is proposed. Models with different numbers of components are compared using a modified version of BIC, where the likelihood is evaluated at the MAP instead of the MLE. A highly dispersed proper conjugate prior is shown to avoid singularities, but when these are not present it gives similar results to the standard MBC method.

Key words: model-based clustering (MBC), Gaussian mixture models, EM algorithm, MLE, MAP, BIC, conjugate prior.

I. MODEL-BASED CLUSTERING

In model-based clustering, individual clusters are described by multivariate normal distributions, where the class labels, parameters and proportions are unknown. The data $x_i = (x_i^1, \ldots, x_i^m)$ are assumed to be generated by a mixture with density:

$$f(x_i) = \sum_{s=1}^{u} \tau_s f_s(x_i \mid \theta_s), \qquad (1)$$

where $f_s(x_i \mid \theta_s)$ is a probability distribution with parameters $\theta_s$, and $\tau_s$ is the probability of belonging to the $s$th component. The parameters of the model are usually estimated by maximum likelihood using the Expectation-Maximization (EM) algorithm (Dempster et al. [1977]). Each EM iteration consists of two steps

* Ph.D. student, Department of Statistics, The Karol Adamiecki University of Economics, Katowice.




- an E-step and an M-step. Given an initial guess for the cluster means $\mu_s$, covariances $\Sigma_s$ and proportions $\tau_s$, the E-step calculates the conditional probability that object $i$ belongs to the $s$th component:

$$\hat{z}_{is} = \frac{\hat{\tau}_s f_s(x_i \mid \hat{\theta}_s)}{\sum_{r=1}^{u} \hat{\tau}_r f_r(x_i \mid \hat{\theta}_r)}. \qquad (2)$$

The maximization step (M-step) consists of estimating the parameters from the data and the conditional probabilities $\hat{z}_{is}$. The E- and M-steps iterate until convergence. Finally, each object is classified into the class for which it has the highest conditional, or posterior, probability. The results of the EM algorithm are highly dependent on the initial values; model-based hierarchical clustering can be a good source of starting values (Banfield and Raftery [1993]; Dasgupta and Raftery [1998]).
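For concreteness, the two steps can be written out directly. The sketch below implements one EM pass for a multivariate Gaussian mixture; it is an illustration rather than the paper's code, the function name em_step is hypothetical, and dmvnorm comes from the mvtnorm package.

```r
# One EM iteration for a Gaussian mixture; x is an n x m data matrix,
# tau a vector of proportions, mu and sigma lists of means/covariances.
library(mvtnorm)

em_step <- function(x, tau, mu, sigma) {
  u <- length(tau)
  # E-step: conditional probability that object i belongs to component s
  dens <- sapply(1:u, function(s) tau[s] * dmvnorm(x, mean = mu[[s]], sigma = sigma[[s]]))
  z <- dens / rowSums(dens)
  # M-step: re-estimate proportions, means and covariances from z
  for (s in 1:u) {
    w <- z[, s]
    tau[s]     <- mean(w)
    mu[[s]]    <- colSums(w * x) / sum(w)
    xc         <- sweep(x, 2, mu[[s]])
    sigma[[s]] <- crossprod(xc * sqrt(w)) / sum(w)
  }
  list(z = z, tau = tau, mu = mu, sigma = sigma)
}
```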

In order to select the optimal clustering model, several measures have been proposed (McLachlan and Peel [2000]). In several applications, the BIC approximation to the Bayes factor (Schwarz [1978]) has performed quite well (Dasgupta and Raftery [1998]; Fraley and Raftery [1998], [2002]). The BIC has the form:

$$\mathrm{BIC}_s = 2 \log p(x \mid \hat{\theta}_s, M_s) - v_s \log(n), \qquad (3)$$

where $\log p(x \mid \hat{\theta}_s, M_s)$ is the maximized loglikelihood for the model and the data, $v_s$ is the number of parameters to be estimated in the model $M_s$, and $n$ is the number of observations in the data.
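As a quick illustration, (3) is a one-line function; loglik, v and n are the quantities just defined (the names are illustrative):

```r
# BIC as in (3): twice the maximized loglikelihood minus a
# complexity penalty of v free parameters times log(n)
bic <- function(loglik, v, n) 2 * loglik - v * log(n)
```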

The strategy for model selection that has been found to be effective in mixture estimation and clustering is given below; an mclust sketch in R follows the list:

1. Determine a maximum number of clusters, $u$ (as small as possible), and a set of mixture models to consider.

2. Estimate parameters via EM for each parameterization and each number of components up to $u$. The conditional probabilities corresponding to a classification from model-based hierarchical clustering are good choices for initial values.

3. Compute the BIC for the mixture model with the optimal parameters from EM for two or more clusters. This results in a matrix of BIC values corresponding to each possible combination of parameterization and number of clusters.

4. Plot all of the BIC values. A decisive first local maximum indicates strong evidence for a model (parameterization and number of clusters).
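In R, the mclust package carries out this whole strategy in one call. A minimal sketch, assuming dat is a numeric data matrix (the variable name is illustrative):

```r
library(mclust)

# Steps 1-3: BIC for every parameterization and 1 to 9 components;
# EM is initialized from model-based hierarchical clustering
bicvals <- mclustBIC(dat, G = 1:9)

# Step 4: plot the BIC curves and look for a decisive first local maximum
plot(bicvals)
summary(bicvals)   # the top models according to BIC
```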


II. LIMITATIONS OF THE EM ALGORITHM

The EM algorithm for clustering has a number of limitations. First, the rate of convergence can be very slow. This does not appear to be a problem in practice for well-separated mixtures when the algorithm is started with reasonable values. Second, the number of conditional probabilities associated with each observation is equal to the number of components in the mixture, so the EM algorithm for clustering may not be practical for models with very large numbers of components. Finally, EM breaks down when the covariance matrix corresponding to one or more components becomes ill-conditioned (singular or nearly singular). In general it cannot proceed if clusters contain only a few observations or if the observations they contain are nearly collinear. If EM for a model having a certain number of components is applied to a mixture in which there are actually fewer groups, it may fail due to ill-conditioning.

III. BAYESIAN REGULARIZATION FOR MULTIVARIATE NORMAL MIXTURES

Fraley and Raftery (2005) proposed replacing the MLE with the maximum a posteriori (MAP) estimate from a Bayesian analysis to eliminate convergence failures of the EM algorithm. They proposed a prior distribution on the parameters that eliminates failure due to singularity, while having little effect on stable results obtainable without a prior. The Bayesian predictive density for the data is assumed to be of the form

$$\mathcal{L}(x \mid \Theta) = \mathcal{L}_{mix}(x \mid \tau, \mu, \Sigma)\, P(\tau, \mu, \Sigma \mid \Theta),$$

where $\mathcal{L}_{mix}$ is the mixture likelihood:

$$\mathcal{L}_{mix}(x \mid \tau, \mu, \Sigma) = \prod_{i=1}^{n} \sum_{s=1}^{u} \tau_s \, |2\pi\Sigma_s|^{-\frac{1}{2}} \exp\left\{-\frac{1}{2}(x_i - \mu_s)^T \Sigma_s^{-1}(x_i - \mu_s)\right\}, \qquad (4)$$

and $P$ is a prior distribution on the parameters $\tau_s$, $\mu_s$ and $\Sigma_s$. Fraley and Raftery (2005) proposed to find the posterior mode, or MAP (maximum a


posteriori) estimate, rather than a maximum likelihood estimate, for the mixture parameters. They used BIC for model selection, but in a modified form: the first term on the right-hand side of (3), equal to twice the maximized log-likelihood, is replaced by twice the log-likelihood evaluated at the MAP, or posterior mode.

For multivariate data, a normal prior on the mean (conditional on the covariance matrix) has the form:

$$P(\mu \mid \Sigma) \propto |\Sigma|^{-\frac{1}{2}} \exp\left\{-\frac{\kappa_p}{2}(\mu - \mu_p)^T \Sigma^{-1}(\mu - \mu_p)\right\}, \qquad (5)$$

and an inverse Wishart prior on the covariance matrix:

$$P(\Sigma) \propto |\Sigma|^{-\frac{\nu_p + m + 1}{2}} \exp\left\{-\frac{1}{2}\,\mathrm{tr}\!\left(\Lambda_p \Sigma^{-1}\right)\right\}. \qquad (6)$$

The hyperparameters $\mu_p$, $\kappa_p$ and $\nu_p$ are called the mean, shrinkage and degrees of freedom, respectively, of the prior distribution. The hyperparameter $\Lambda_p$, which is a matrix, is called the scale of the inverse Wishart prior. The prior defined in this way, normal on the mean and inverse Wishart on the covariance, is a conjugate prior for the multivariate normal distribution. Under this prior, the posterior means of the mean vector and covariance matrix are:

$$\hat{\mu} = \frac{n\bar{x} + \kappa_p \mu_p}{\kappa_p + n}, \qquad \hat{\Sigma} = \frac{\Lambda_p + \dfrac{\kappa_p n}{\kappa_p + n}(\bar{x} - \mu_p)(\bar{x} - \mu_p)^T + \sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})^T}{\nu_p + m + 2}. \qquad (7)$$

The normal inverted Wishart prior and its conjugacy to the multivariate normal are discussed in e.g. Gelman et al. (1995) and Schafer (1997).
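A small sketch of the updates in (7), assuming x is an n × m matrix of the observations currently assigned to one group and the hyperparameter names follow the text (the function name map_update is hypothetical):

```r
# Posterior means of mu and Sigma under the normal inverse Wishart
# prior in (5)-(7); a sketch, not a package function
map_update <- function(x, mu_p, kappa_p, nu_p, Lambda_p) {
  n <- nrow(x); m <- ncol(x)
  xbar <- colMeans(x)
  S    <- crossprod(sweep(x, 2, xbar))   # sum of squared deviations about xbar
  mu_hat    <- (n * xbar + kappa_p * mu_p) / (kappa_p + n)
  d         <- xbar - mu_p
  Sigma_hat <- (Lambda_p + (kappa_p * n / (kappa_p + n)) * tcrossprod(d) + S) /
               (nu_p + m + 2)
  list(mu = mu_hat, Sigma = Sigma_hat)
}
```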

Fraley and Raftery (2005) proposed the following choices for the prior hyperparameters $(\mu_p, \kappa_p, \nu_p, \Lambda_p, \sigma_p^2)$ for multivariate mixtures:

$\mu_p$ is the mean of the data; $\kappa_p = 0.01$.


The posterior mean $\dfrac{n_s \bar{x}_s + \kappa_p \mu_p}{\kappa_p + n_s}$ can be viewed as adding $\kappa_p$ observations with value $\mu_p$ to each group in the data. The value was determined by experimentation. Values close to and bigger than 1 caused large perturbations in the cases where there were no missing BIC values without the prior; $\kappa_p = 0.01$ resulted in BIC curves that appeared to be smooth extensions of their counterparts without the prior.

$$\nu_p = m + 2. \qquad (8)$$

The marginal prior distribution for $\mu$ is multivariate $t$, centered at $\mu_p$, with $\nu_p - m + 1$ degrees of freedom. The mean of this distribution is $\mu_p$ provided that $\nu_p > m$, and it has a finite covariance matrix provided $\nu_p > m + 1$ (Schafer [1997]).

$\sigma_p^2 = \dfrac{\mathrm{tr}(S)/m}{G^{2/m}}$ (for spherical and diagonal models): the average of the diagonal elements of the empirical covariance matrix of the data $S$, divided by the number of components $G$ raised to the power $2/m$.

$\Lambda_p = \dfrac{S}{G^{2/m}}$ (for ellipsoidal models): the empirical covariance matrix of the data divided by the square of the number of components raised to the power $1/m$.
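Putting these defaults together, they can be computed directly from the data. A sketch, with dat the data matrix and G the number of components (both names illustrative); in mclust the same defaults are supplied by priorControl() via defaultPrior:

```r
# Default hyperparameters proposed by Fraley and Raftery (2005)
m <- ncol(dat)
S <- cov(dat)
mu_p     <- colMeans(dat)                    # prior mean: the mean of the data
kappa_p  <- 0.01                             # shrinkage
nu_p     <- m + 2                            # degrees of freedom, as in (8)
sigma2_p <- (sum(diag(S)) / m) / G^(2 / m)   # scale, spherical/diagonal models
Lambda_p <- S / G^(2 / m)                    # scale, ellipsoidal models
```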

IV. EXAMPLE

The data were generated by the cluster.Gen function (clusterSim package of R). Three elongated clusters contain two-dimensional data. The numbers of observations in the classes are 13, 10 and 13. The observations are independently drawn from bivariate normal distributions with means $(0; 0)$, $(1.5; 7)$, $(3; 14)$ and covariance matrices:

$$\Sigma_1 = \begin{bmatrix} 1 & -0.9 \\ -0.9 & 1 \end{bmatrix}, \quad \Sigma_2 = \begin{bmatrix} 1.5 & 0 \\ 0 & 1.5 \end{bmatrix}, \quad \Sigma_3 = \begin{bmatrix} 1 & -0.9 \\ -0.9 & 1 \end{bmatrix}.$$

Functions of the mclust package of R were used to implement Bayesian regularization for the mixture models.
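A minimal sketch of the comparison reported below; since the exact cluster.Gen call is not reproduced in the paper, the data are drawn directly with MASS::mvrnorm using the means and covariances listed above:

```r
library(MASS)     # for mvrnorm
library(mclust)

set.seed(1)
S1  <- matrix(c(1, -0.9, -0.9, 1), 2, 2)
S2  <- diag(1.5, 2)
dat <- rbind(mvrnorm(13, c(0, 0),   S1),
             mvrnorm(10, c(1.5, 7), S2),
             mvrnorm(13, c(3, 14),  S1))

bic_mle <- mclustBIC(dat)                          # standard BIC based on the MLE
bic_map <- mclustBIC(dat, prior = priorControl())  # BIC with the conjugate prior
plot(bic_mle)
plot(bic_map)
```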

For the analyzed dataset, the model and classification chosen according to BIC without the prior is the VII model with four components, when the known number of components is three. The standard BIC values based on the MLE are not available for six models (VII, VEI, EVI, VVI, VEV, VVV)


with five or more mixture components. For those numbers of components the models fail to converge without the prior, because one of the covariances becomes singular as the EM iterations progress, as shown in Figure 1a). The hierarchical clustering result based on the unconstrained model used for initialization assigns a single observation to one of the groups in those cases. The Bayesian regularization allows identification of a group with a single member while allowing the covariance matrix to vary between clusters, which is not possible without the prior. The BIC with the prior peaks at the three-group classification for the EII model, so the EII model with three components is chosen according to BIC with the prior. In this case failures due to singularity are eliminated for almost all models and the right number of clusters is selected.

[Figure 1a): BIC curves without the prior, by number of components, for models EII, VII, EEE, EEI, EEV, VEI, VEV, EVI, VVI, VVV]

Source: Own research.

Figure 1. BIC values: a) without the prior; b) with the prior

[Figure 1b): BIC curves with the prior, by number of components, for the same models]

V. CONCLUSIONS

We have shown an improvement of model-based clustering that avoids the singularities which can arise in estimation using the EM algorithm. The method involves a proper conjugate prior and uses the EM algorithm to find the MAP estimator. For model selection it uses a version of BIC that is modified by replacing the maximized likelihood with the likelihood evaluated at the MAP.


REFERENCES

Banfield J.D., Raftery A.E. (1993), Model-based Gaussian and non-Gaussian clustering, „Biometrics", 49, 803-821.

Biernacki C., Celeux G., Govaert G., Langrognet F. (2006), Model-based cluster and discriminant analysis with the MIXMOD software, „Computational Statistics and Data Analysis", 51, 587-600.

Dasgupta A., Raftery A.E. (1998), Detecting features in spatial point processes with clutter via model-based clustering, „Journal of the American Statistical Association", 93, 294-302.

Dempster A.P., Laird N.M., Rubin D.B. (1977), Maximum likelihood for incomplete data via the EM algorithm (with discussion), „Journal of the Royal Statistical Society", ser. B, 39, 1-38.

Fraley C., Raftery A.E. (1998), How many clusters? Which clustering method? Answers via model-based cluster analysis, „The Computer Journal", 41, 577-588.

Fraley C., Raftery A.E. (2002), Model-based clustering, discriminant analysis, and density estimation, „Journal of the American Statistical Association", 97, 611-631.

Fraley C., Raftery A.E. (2005), Bayesian regularization for normal mixture estimation and model-based clustering, Technical Report 486, Department of Statistics, University of Washington.

Fraley C., Raftery A.E. (2006), MCLUST Version 3: An R package for normal mixture modeling and model-based clustering, 1-50.

Gelman A., Carlin J.B., Stern H.S., Rubin D.B. (1995), Bayesian data analysis, Chapman and Hall, London.

McLachlan G.J., Peel D. (2000), Finite mixture models, Wiley, New York.

Schafer J.L. (1997), Analysis of incomplete multivariate data by simulation, Chapman and Hall, London.

Schwarz G. (1978), Estimating the dimension of a model, „The Annals of Statistics", 6, 461-464.

Ewa Witek

ON A MODIFICATION OF THE MIXTURE MODEL-BASED CLUSTERING METHOD (Summary)

The article presents a modification of the mixture model-based clustering method for the case when it becomes impossible to estimate the model parameters with the EM algorithm. When the number of objects assigned to a class is smaller than the number of variables describing those objects, the model parameters cannot be estimated. To avoid this situation, the maximum likelihood estimators are replaced by maximum a posteriori estimators. The model with the best parameterization and the appropriate number of classes is then selected using a modified BIC statistic.
