 # High dimensional Sparse Gaussian Graphical Mixture Model

This paper considers the problem of networks reconstruction from heterogeneous data using a Gaussian Graphical Mixture Model (GGMM). It is well known that parameter estimation in this context is challenging due to large numbers of variables coupled with the degeneracy of the likelihood. We propose as a solution a penalized maximum likelihood technique by imposing an l_1 penalty on the precision matrix. Our approach shrinks the parameters thereby resulting in better identifiability and variable selection. We use the Expectation Maximization (EM) algorithm which involves the graphical LASSO to estimate the mixing coefficients and the precision matrices. We show that under certain regularity conditions the Penalized Maximum Likelihood (PML) estimates are consistent. We demonstrate the performance of the PML estimator through simulations and we show the utility of our method for high dimensional data analysis in a genomic application.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Networks reconstruction has become an attractive paradigm of genomic science. Suppose we have data originate from different densities such as , ,… , where

is a multivariate normal distribution with mean vector

and variance covariance matrix

and s are the mixture proportions. The question we ask ourselves is what is the underlining networks from which the data come from? Statistical methods for analyzing such data are subject to active research currently (Agakov et al., 2012). Gaussian graphical Model (GGM) are a way to model such data.

A Gaussian graphical Model for a random vector is a pair where is an undirected graph and is the model comprising all multivariate normal distributions whose inverse covariance matrix or precision matrix entries satisfies . The conditional independence relationship among nodes are captured in . Consequently, the problem of selecting the graph is equivalent to estimating the off-diagonal zero-pattern of the concentration matrix. Further details on these models as well interpretation of the conditional independency on the graph can be found in (Lauritzen, 1996)

In genomics, often there is heterogeneity in the data. We observe that in broad range of real world application ranging from finance to system biology, structural dependencies between the variables are rarely homogeneous i.e our population of individuals may come from different clusters or mixture components without any information about their cluster membership. One challenge is, given only the sample measurement and with sparsity constraint, to recover the underlying networks.

Mixture distributions are often used to model heterogeneous data or observations supposed to have come from one of

different components. Under Gaussian mixtures, each component is suitably modelled by a family of Gaussian probability density. This paper deals with the problem of structural learning in reconstructing the underlying graphical networks (using a graphical Gaussian model) from a data supposed to have come from a mixture of Gaussian distributions.

We consider model-based clustering (McLachlan et al., 2002) and assume that the data come from a finite mixture model where each component represents a cluster. A large literature exists in normal mixture models; (Lo et al., 2001; Bozdogan, 1983)

. Our focus here is on a high dimensional data setting where we present an algorithm based on a regularized expectation maximization using Gaussian Mixture Model (GMM). We assume that our data

is generated through a latent generative mixture components. We aim to group the data into a few clusters and identify which observations are from which Gaussian components.

A natural way for parameter estimation in GMMs is via a maximum likelihood estimation. However some performance degradation is encountered owing to the identifiability of the likelihood and the high dimensional setting. To overcome these problems, Banfield and Raftery (1993)

proposed a parameter reduction technique by re-parameterizing the covariance matrix through eigenvalue decomposition. In doing so, some parameters are shared across clusters. As a result of a continuous increasing number of dimensions, this approach can not totally alleviate the

phenomena. Recently proposals to overcome the high dimensionality problem involve estimating sparse precision matrix. Among these proposals is the penalized log likelihood technique of Friedman et al. (2008), an regularization approach which encourages many of the entries of the precision matrix to be . Our method is based on this idea. The penalty promotes sparsity. We provide sufficient conditions for consistency of the penalized MLE.

Closely related to our work is that of Pan and Shen (2007) where variable selection is considered in model-based clustering. They considered GMM and penalize only the mean vectors and seeking to estimate sparse mean vectors. They assumed a common diagonal covariance matrix for all clusters. This work was later extended to (Zhou et al., 2009) where a new approach to penalized model-based clustering was considered but this time with unconstrained covariance matrices. However not much has been said about the consistency of the resulting estimators. Another recent work in this field is the work by Agakov et al. (2012) that learn structures of sparse high dimension latent variables with application to mixtures.

In this article, we propose a penalized likelihood approach in the context of Gaussian Graphical mixture model, which constraints the cluster distributions to be sparse. The parameters in the cluster distributions are estimated by incorporating an existing graphical lasso method for covariance estimation into an EM algorithm. In effect, we view each cluster as an instance of a particular GGM. Therefore we aim at not only identify the population of individuals cluster membership but also the dependencies among the variables in each subgroup. Additionally, we assess how well the resultant graphs obtained through Glasso relate to the true graphs and we provide consistency results of the estimates. Throughout this paper, we assume , the number of components of mixture models is known.

The reminder of this article is organized as follows: We introduce the model, set up the PMLE approach and summarize the main result in connection with the consistency of the Glasso estimator in section 2. We then proceed with the inference procedure through a penalized version of the EM algorithm in section 3. In section 4 we present some simulations and an example of applications to illustrate our results. We conclude with a brief discussion and future work in section 5.

## 2 Penalized maximum likelihood estimation

In this section we introduce our model-based clustering with GGM, then we derive the penalized likelihood upon which statistical inference via the EM algorithm is based and prove consistency of the PMLE.

### 2.1 The Mixture model

The model consists of assuming that a variable

, describing which component an individual originates, is a multinomial random variable with parameters

denoting the mixture proportions or the mixing coefficients with , , and is known. In essence

 P(Zi=k)=πk

In our mixture model, we suppose that some vector-valued random variables are a random sample from the mixture components. We model each subpopulation separately by assuming a GGM where . In this paper we assume that , . In practice, this means that the data is assumed to be normalized by subtracting the mean. Since is dependent on , we say that represents the class that produced and we know fully if we know which class falls. The density of each can be written as

 fγ(yi)=K∑k=1πkφk(yi|Θk) (1)

where denotes the density of Gaussian distribution with mean and inverse covariance covariance matrix ; represents the “incomplete” mixture data density of the sample i.e . We introduce the parameter set of mixture namely

 Ω={{Θk}Kk=1|Θk≻0,k=1,...,K}

; indicates that is positive-definite matrix, and

 J={{πk}Kk=1|πk>0,k=1,...,K}

where

 Γ=Ω×J (2)

denotes the parameter space with the true parameter defined as .

In order to characterize the mixture model and estimate its parameters thereby recovering the underlying graphical structure from the data (seen as mixture of multivariate densities), several approaches may be considered. These approaches include graphical methods, methods of moments, minimum-distance methods, maximum likelihood

(Ruan et al., 2011; Zhou et al., 2009) and Bayesian methods (Bernardo, 2003; Biernacki et al., 2000). In our case we adopt the ML method.

### 2.2 The penalized model-based likelihood

We can now write the likelihood of the incomplete data density as

 Ly(γ)=n∏i=1[K∑k=1πkφk(yi|Θ−1k)]

whose log-likelihood function is given by

 ly(γ)=n∑i=1logfγ(yi) (3)

The goal is to maximize the log-likelihood in (3) with respect to . Unfortunately, a unique global maximum likelihood estimate does not exist because of the permutation symmetries of the mixture subpopulation; (Day, 1969; Surajit and Bruce, 2005). Also the likelihood function of normal mixture models is not a bounded function on as was put forward by Kiefer and Wolfowitz (1956). On the question of consistency of the MLE, Chanda (1954), Cramer (1946) focus on local ML estimation and mathematically investigate the existence of a consistent sequence of local maximizers. These results are mainly based on Wald’s technique (Wald, 1949) . Redner (1981) later extended these results to establish the consistency of the MLE for mixture distributions with restrained or compact parameter spaces. It was proved that the MLE exists and it is globally consistent in a compact subset of that contains ; i.e

 given^γn|ly(^γn)=maxγ∈^Γly(γ),^γn→γ0 in prob. for n→∞

In addition to the degenerate nature of the likelihood Kiefer and Wolfowitz (1956) on the set , the“high dimensional, low sample size setting” - where the number of observations is smaller that the number of nodes or features - is another complication. Estimating the parameters in the GGMM by maximizing criterion (3) is a complex one. The penalized likelihood-based method (Friedman et al., 2008; Yuan and Lin, 2007) is a promising approach to counter the degeneracy of while keeping the parameter space unaltered. However, to make the PMLE work, one has to solve the problem of what kind of penalty functions are eligible. We opt for a penalty function that guarantees consistency and also prevents the likelihood from degenerating under the multivariate mixture model. We assume that the penalty function satisfies:

 lim|Θk|→∞P(Θk)|Θk|n=0∀k∈{1,2,...,K}∀n (4)

where denotes determinant of , .

This results in placing an penalty on the entries of the concentration matrix so that the resulting estimate is sparse and zeroes in this matrix correspond to conditional independency between the nodes similar to (Nicolai et al., 2006). Numerous advantages result from this approach. First of all, the corresponding penalized likelihood is bounded and the penalized likelihood function does not degenerate in any point of the closure of parameter space and therefore the existence of the penalized maximum likelihood estimator is guaranteed. Next, in the context of GGM, penalizing the precision matrix results in better estimate and sparse models are more interpretable and often preferred in application.

We define the penalized log-likelihood as:

 lpy(γ)=ly(γ)−λK∑k=1||Θk||1 (5)

where is a user-defined tuning parameter that regulates the sparsity level, ,

is the number of mixing components assumed fixed. The hyperparameters

and determine the complexity of the model. The corresponding PMLE are defined as

 ^γλ=argmaxγlpy(γ) (6)

Our method penalizes all the entries of the precision matrix including the diagonal elements. We do this in order to avoid the likelihood to degenerate. To see this, consider a special case of a model consisting of two univariate normal mixtures . By letting with other parameters remaining constant, the likelihood tends to infinity for values of ; i.e the likelihood degenerates due to mixture formulation whereby a single observation mixture component with a decreasing variance on top of the observation explodes the likelihood. For that matter an penalty which does not penalize the diagonal elements tend to result in a degenerate ML estimator especially when .

### 2.3 Consistency

At this stage we want to characterize the solution obtained by maximizing (5). The general theorem concerning the consistency of the MLE (Redner, 1980; Wald, 1949) can be extended to cover our type of penalized MLE. This is because if a likelihood function which yields a strong consistent estimate over a compact set is given, then our penalty would not alter the consistency properties. Consistency of the PMLE is given in theorem 1. The latter uses results in (Wald, 1949) under the classical MLE over a compact set. The MEL version of theorem 1 can be found in (Redner, 1980). We define first a set of conditions upon which theorem 1 holds.

1. Let the parameter space be compact set, and denotes the quotient topological space obtained from and suppose that is any compact subset containing .

2. Let be the closed ball of radius about Then for any positive real number , let:

 fγ(y,r)=supη∈Br(γ)fγ(y,η);f∗γ(y,r)=max[1,fγ(y,r)]

Then for each and for sufficiently small

 ∫logf∗γ(y,r)fγ0(y,r)<∞
3.  ∫|logfγ0(y)|fγ0(y)<∞
4.  ∫|logfi(y,γi)|fj(y,γj)<∞forγi∈Γiandγj∈Γj
###### Theorem 1

Suppose that the mixing distribution satisfy conditions (C1-4). Define . Suppose that is bounded away from zero, it follows that for a fixed , the penalized likelihood solution is consistent in the quotient topological space i.e

 limn→∞P(|^γλn−γ0|>ϵ)=0
###### Proof 1

Let the PMLE and MLE be defined by

 ^γλn=argmaxγlpn(γ)

and

 ^γn=argmaxl(γ)

where

 lpn(γ)=l(γ)−λnK∑k=1||Θk||l1∀k∈{1,...,K}

Then

we have

 P(|^γλn−γ0|>ϵ) = P(|^γλn−^γn+^γn−γ0|>ϵ) ≤ P(|^γλn−^γn|>ϵ/2)+P(|^γn−γ0|>ϵ/2)

Considering the second inequality on the RHS of (LABEL:ineq), we can write that

 limn→∞P(|^γn−γ0|>ϵ/2)=0

This follows from theorem 5 of Redner (1981). Therefore it is sufficient to prove that

 limn→∞P(|^γλn−^γn|>ϵ/2)=0

Suppose is bounded below by a function under the following assumptions:

1. There exists a neighborhood of such that is continuously differentiable wit respect to parameters in

2. converges (pointwise) to as

We define

 ^γλn,L=argmaxγlpn,L(γ)

Then the followings hold:

 ∀δ>0∃n1∈Ns.t.∀n>n1|^γλn,L−^γn|<δ

Let be a function such that , then

 ^γgn=argmaxγgn(γ)

satisfies

 ∀δ>0∃n2∈Ns.t.∀n>n2|^γgn−^γn,L|<δ

Take , then we can write

 ∀δ>0∃n3∈Ns.t.∀n>n3|^γλn−^γn,L|<δ

Suppose and , then

 P(|^γλn−^γn|>ϵ2) ≤ P(|^γλn−^γn,L|>ϵ4)+P(|^γn,L−^γn|>ϵ4) (8) = 0a.sn→∞

## 3 Penalized EM algorithm

In order to maximize the penalized likelihood function (5) we consider a penalized version of the EM algorithm of Dempster et al. (1977) . To do that we first augment our data with so that the complete data associated with our model now becomes and an EM algorithm iteratively maximizes, instead of the penalized observed log-likelihood in (5), the quantity , the conditional expectation of the penalized log-likelihood of the augmented data and is the current value at iteration .

Suppose i.e is the density of the augmented data . Now the penalized log-likelihood of the augmented data can be written as

 lpc(γ) = ln[hci(γ)]−λK∑k=1||Θk||l1 lpc(γ) = n∑i=1lnπk+lnϕk(yi|Θ−1k)−λK∑k=1||Θk||l1 (9) = n∑i=1K∑k=11{Zi=k}[lnπk+lnϕk(yi|Θ−1k)]−λnK∑k=1||Θk||l1

Note the indicator function simply says that if you knew which component the observation came from, we would simply use its corresponding for the likelihood. For illustration purpose, suppose we have observations and we are certain that the first two were generated by the Gaussian density and the last came from , then we write the full log-likelihood as follows:

 lYZ(Θ)=lY1(Θ2)+lY2(Θ2)+lY3(Θ1) (10)

### 3.1 The E-step

From (9), We compute the quantity as follows

 Q(γ|γ(t)) = EZi[lYZ(γ)−λn||Θ||1|y;γ(t)] (11) = n∑i=1K∑k=1[lnϕk(yi|Θ−1k)+lnπk]EZi[1{Zi=k}|yi;γ(t)]−λn||Θk||1 = n∑i=1K∑k=1[lnϕk(yi|Θ−1k)+lnπk]P(Zi=k|yi;γ(t))−λn||Θk||1 = n∑i=1K∑k=1[lnϕk(yi|Θ−1k)+lnπk]ω(t)ik−λn||Θk||1

The E-step actually consists of calculating , the probabilities (condition on the data and ) that ’s originate from component . It can also be seen as the responsibility that component takes for explaining the observation

and it tells us for which group an individual actually belongs. This is the soft K-mean clustering. Using Bayes theorem, we have:

 ω(t)ik = P(Zi=k|yi,γ(t)) = P(yi|Zi=k;γ(t))P(Zi=k)∑Kl=1P(yi|Zi=l;γ(t))P(Zi=l) = ϕ(t)k(y|Θ−1k)π(t)k∑Kl=1ϕ(t)l(yi|Θ−1k)π(t)l

### 3.2 The M-step

The M-step for our mixture model can be split in to two parts, the maximization related to and the maximization related to .

1. M-step for :

For the maximization over we make use of the constraint that i.e and . It turns out that there is an explicit form for . Let . Then

 ∂Q∂πk0=n∑i=1⎡⎢⎣ω(t)ik0πk0−ω(t)iK1−∑K−1k=1πk⎤⎥⎦ (13)

Setting , yields the following:

 ω(t).k0K−1∑k=1πk+πk0ω(t).K=ω(t).k0 (14)

It can be shown that a unique solution to (14) is

 π(t+1)k0 = ω(t).k0/n (15) = n∑i=1ω(t)ik0/n
2. M-step for :

Next, to maximize over , we only need the term that depends on . The first thing we do here is to try to formulate the maximization problem for a mixture component to be similar to that for Gaussian graphical modeling with the aim of applying graphical lasso method. Now from , for a specific cluster , the term that depends on the cluster specific covariance matrix is given by

 Q(Θk0) = n∑i=1ω(t)ik0lnϕk0(yi|Θ−1k0)−λn||Θk0||1 (16) = n∑i=1ω(t)ik0[12ln|Θk0|−12y′iΘk0yi]−λn||Θk0||1 = n∑i=1ω(t)ik02ln|Θk0|−12tr(n∑i=1ω(t)ik0(yiy′i)Θk0)−λn||Θk0||1 = ω(t).k02⎡⎢⎣ln|Θk0|−tr(~Sk0Θk0)−2λnω(t).k0||Θk0||1⎤⎥⎦ = ω(t).k02[ln|Θk0|−tr(~Sk0Θk0)−~λn||Θk0||1]

where

 ~Sk0=∑ni=1ω(t)ik0(yiy′i)ω(t).k0 (17)

is the weighted empirical covariance matrix, and

 ^Θk0=argmaxΘ{ln|Θk0|−tr(~Sk0Θk0)−~λn||Θk0||1} (18)

subject to the constraint that is positive definite with .

Therefore the maximization of consists of running the graphical lasso procedure (Friedman et al., 2008) for each cluster where each observation for gets a weight and the sampling covariance matrix is transformed to a weighted sampling covariance. This is a major innovation in our work where we formulate the Gaussian mixture modelling problem as a Gaussian Graphical modeling framework.

## 4 Simulation and Real-data Exampple

We generate data from a two components mixture and consider two different schemes based on . We study the consistency properties of the PMLE by allowing the sample size to grow. We subsequently applied our method to well known data “Scor’ and “CellSignal” data.

### 4.1 Simulation

We investigate the consistency properties of the PMLE using our penalized EM algorithm described in section 3. We simulate data from two-component multivariate normal mixture models with probability (true mixture proportion) equals and inverse covariance matrix built according to the following schemes.

 Θ1(i,j)=⎧⎪⎨⎪⎩1ifi=j−0.4,if|i−j|=10,elsewhere (19)
 Θ2(i,j)=⎧⎪⎨⎪⎩1ifi=j−0.4,if|i−j|=20,elsewhere (20)

The corresponding graphical model structures are depicted in Figure 1. For a fixed , we consider two schemes one with and the other with , each with increasing sample sizes, to examine the consistency of the PMLEs. In all cases, parameter estimation is achieved by maximizing the likelihood function via our penalized EM-algorithm and a model selection is performed inside the algorithm based on Extended Bayesian Information Criterion (EBIC), (Chen and Chen, 2008). The results of our penalized EM-algorithm approach are compared based on the two different schemes corresponding to different values of .

Due to the effect of label switching, we are not able to assign each parameter estimate correctly to the right class. As a result, the estimates will be interchangeably represented. We compute the Absolute Deviation (AD) of the mixture proportions, and compare the Frobenuis norm of the difference between the true and estimated precision matrices for each cluster. In addition we compute the

score, True positive (TP), False positive (FP), Precision and Recall for the PMLE.

Example 1. We considered the simulated two-component multivariate normal mixture models above and choose sequence of values of such that . On experimental basis we set . The performances of the penalized EM-algorithm corresponding to different sample sizes are presented in Table 1.

The results show that as the sample size increases, the AD (for the mixture proportions) and the Frobenuis norms (for the precision matrices) decrease indicating the consistency of the PMLEs. At , the AD for the mixture proportion is almost , indicating that our method has recovered precisely the true mixture distribution. Based on the EBIC criterion, we reported also the score, the True Positive (TP), the False Positive (FP), the Precision and the Recall of the PMLE. We recorded an overall improvement in the score as increases.

Example 2. In this example, we again choose the same two-component multivariate Gaussian mixture model. In contrast to the model used in example 1, we have fixed the tuning parameter such that ; remain unchanged. The performances of the penalized EM-algorithm corresponding to different sample sizes are presented in Table 2. We again observe a decrease in both the Frobenuis norm and the AD as increases even though we suffer from a deficiency in the AD of for the case . However the AD is almost at . We note that this penalty decreases to faster and as result tends to produce full graph as can be seen in the higher value recorded for false positive. Figure 2: Graphical model of the 2 group of students Figure 3: Graphical model of the CellSignal data based on first 500 cells

Comparing the 2 examples, we observe that the choice of plays a strong hand in parameter and graph selection consistency of the resultant networks. The consistency properties of the PMLEs was achieved in both cases but our results indicates that the overall performance of the asymptotic behavior of is more satisfactory. Even though both penalty decrease to as increases, decreases slower resulting in a relatively sparser networks as compared to .

### 4.2 Real-data Examples

#### 4.2.1 Open/Closed Book Examination Data

As a simple example of a data set to which mixture models may be applied, we consider the “scor” data. This data can be found in the “bootstrap” package in R; type help(“scor”) in R for more details.

This is a data on students who took examinations in subjects namely mechanics, vectors, algebra, analysis, statistics. Some where with open book and others with closed book. Mechanics and vectors were with closed book.

We fit a two-mixture component to the data with a strong indication that there are two-groups of students each with similar subjects interest. We applied our PMLE algorithm to the data with based on scheme 1. The pattern of interaction among the two groups were depicted in Figure 2. The network differences as well as similarities are also shown. The results indicates that of students have similar subjects interest while falls in other group of interest. In one group, we observe no interactions between mechanics and analysis nor statistics and vectors while in the other group there are interactions.

#### 4.2.2 Analysis of cell signalling data

We consider the application of our method on the flow cytometry dataset (cell signalling data) of Sachs et al. (2005). The data set contains flow cytometry of proteins measured on cells; from which we selected the first cells. The CellSignal data were collected after a series of stimulatory cues and inhibitory interventions with cell reactions stopped at min after stimulation by fixation, to profile the effects of each condition on the intracellular signaling networks. Each independent sample in the data set is made up of quantitative amounts of each of the phosphorylated molecules, simultaneously measured from single cells.

We again fit a two-mixture component to the data. The result of applying our PMLE algorithm to the data set using the first scheme is shown Figure 3. The result indicates that of the observation falls in one component whiles falls in the other cluster. We also display the differences and similarities in the two components. The following proteins interaction were seen to be present in each of the two components: (), (), (), () to mention but few. Differences in the interaction occur among the following proteins: (), (), (); see Figure 3 for details.

## 5 Conclusion

We have developed a penalized likelihood estimator for Gaussian graphical mixture models. We impose an penalty on the precision matrix with extra condition preventing the likelihood not to degenerate. The estimates were efficiently computed through a penalized version of the EM-algorithm. By taking advantage of the recent development in Gaussian graphical models, we have implemented our method with the use of the graphical lasso algorithm. We have provided consistency properties for the penalized maximum likelihood estimator in Gaussian Graphical Mixture Model. Our results indicate a better performance in parameter consistency as well as in graph selection consistency for . Another interesting situation is when the order , the number of mixture components in the model is unknown. This is a more practical problem than the one we have discussed and probably involves simultaneous model selection. Thus we intend to continue our research in this direction and present the results in a future paper.

## References

• Agakov et al. (2012) Agakov, F.V., Orchard, P., Storkey, A.J., 2012. Discriminative mixtures of sparse latent fields for risk management.

Journal of Machine Learning Research - Proceedings Track 22, 10–18.

• Banfield and Raftery (1993) Banfield, J.D., Raftery, A.E., 1993. Model-based gaussian and non-gaussian clustering. Biometrics 49, pp. 803–821.
• Bernardo (2003) Bernardo, J., 2003.

Bayesian clustering with variable and transformation selections, in: Bayesian Statistics 7: Proceedings of the Seventh Valencia International Meeting, Oxford University Press, USA. p. 249.

• Biernacki et al. (2000) Biernacki, C., Celeux, G., Govaert, G., 2000. Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans. Pattern Anal. Mach. Intell. 22, 719–725.
• Bozdogan (1983) Bozdogan, H., 1983. Testing the number of components in a normal mixture .
• Chanda (1954) Chanda, K.C., 1954. A note on the consistency and maxima of the roots of likelihood equations. Biometrika 41, pp. 56–61.
• Chen and Chen (2008) Chen, J., Chen, Z., 2008. Extended bayesian information criteria for model selection with large model spaces. Biometrika 95, 759–771.
• Cramer (1946) Cramer, H., 1946. Mathematical methods of statistics. Princeton University Press.
• Day (1969) Day, N.E., 1969. Estimating the components of a mixture of normal distributions. Biometrika 56, pp. 463–474.
• Dempster et al. (1977) Dempster, A.P., Laird, N.M., Rubin, D.B., 1977. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society. Series B (Methodological) 39, pp. 1–38.
• Friedman et al. (2008) Friedman, J., Hastie, T., Tibshirani, R., 2008. Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9, 432–441.
• Kiefer and Wolfowitz (1956) Kiefer, J., Wolfowitz, J., 1956. Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. The Annals of Mathematical Statistics 27, pp. 887–906.
• Lauritzen (1996) Lauritzen, Steffen, L., 1996. Graphical models. volume volume 17 of Oxford Statistical Science Series.
• Lo et al. (2001) Lo, Y., Mendell, N.R., Rubin, D.B., 2001. Determining the number of component clusters in the standard multivariate normal mixture model using model-selection criteria. Biometrika 88, 767–778.
• McLachlan et al. (2002) McLachlan, G.J., Bean, R.W., Peel, D., 2002. A mixture model-based approach to the clustering of microarray expression data. Bioinformatics 18, 413–422.
• Nicolai et al. (2006) Nicolai, M., Peter, B., Eth, Z., 2006. High dimensional graphs and variable selection with the lasso. Annals of Statistics 34, 1436–1462.
• Pan and Shen (2007) Pan, W., Shen, X., 2007. Penalized model-based clustering with application to variable selection. J. Mach. Learn. Res. 8, 1145–1164.
• Redner (1980) Redner, R., 1980. Maximum Likelihood Estimation for Mixture Models. JSC (Series), Lyndon B. Johnson Space Center, NASA.
• Redner (1981) Redner, R., 1981. Note on the consistency of the maximum likelihood estimate for nonidentifiable distributions. The Annals of Statistics 9, pp. 225–228.
• Ruan et al. (2011) Ruan, L., Yuan, M., Zou, H., 2011. Regularized parameter estimation in high-dimensional gaussian mixture models. Neural Comput. 23, 1605–1622.
• Sachs et al. (2005) Sachs, K., Perez, O., Pe’er, D., Lauffenburger, D.A., Nolan, G.P., 2005. Causal protein-signaling networks derived from multiparameter single-cell data. Science 308, 523–529.
• Surajit and Bruce (2005) Surajit, R., Bruce, G., 2005. The topography of multivariate normal mixtures. Annals of Statistics 33, pp. 2042–2065.
• Wald (1949) Wald, 1949. Note on the consistency of the maximum likelihood estimate. The Annals of Mathematical Statistics 20, pp. 595–601.
• Yuan and Lin (2007) Yuan, M., Lin, Y., 2007. Model selection and estimation in the gaussian graphical model. Biometrika 94, 19–35.
• Zhou et al. (2009) Zhou, H., Pan, W., Shen, X., 2009. Penalized model-based clustering with unconstrained covariance matrices. Electron J Sta 3, 1473 1496.