Global Convergence of Least Squares EM for Demixing Two Log-Concave Densities

This work studies the location estimation problem for a mixture of two rotation invariant log-concave densities. We demonstrate that Least Squares EM, a variant of the EM algorithm, converges to the true location parameter from a randomly initialized point. We establish the explicit convergence rates and sample complexity bounds, revealing their dependence on the signal-to-noise ratio and the tail property of the log-concave distribution. Moreover, we show that this global convergence property is robust under model mis-specification. Our analysis generalizes previous techniques for proving the convergence results of Gaussian mixtures, and highlights that an angle-decreasing property is sufficient for establishing global convergence for Least Squares EM.

1 Introduction

One important problem in statistics and machine learning is to learn a finite mixture of distributions [20, 26]. In the parametric setting where the functional form of the distribution is known, this problem is to estimate parameters (e.g., mean and covariance) that specify the distribution of each mixture component. The parameter estimation problem for mixture models is inherently nonconvex, posing challenges for both computation and analysis. While many algorithms have been proposed, rigorous performance guarantees are often elusive. One exception is the Gaussian Mixture Model (GMM), for which much theoretical progress has been made in recent years. The goal of this paper is to study algorithmic guarantees for a much broader class of mixture models, namely log-concave distributions. This class includes many common distributions (familiar examples include the Gaussian, Laplace, Gamma, and Logistic distributions [3]) and is interesting from both modelling and theoretical perspectives [2, 3, 6, 14, 28, 25].

We focus on the Expectation Maximization (EM) algorithm [12], which is one of the most popular methods for estimating mixture models. Understanding the convergence property of EM is highly non-trivial due to the non-convexity of the negative log-likelihood function. The work in [4] developed a general framework for establishing local convergence to the true parameter. Proving global convergence of EM is more challenging, even in the simplest setting with a mixture of two Gaussians (2GMM). The recent work in [11, 30] considered balanced 2GMM with known covariance matrix and showed for the first time that EM converges to the true location parameter using random initialization. Subsequent work established global convergence results for a mixture of two truncated Gaussians [21], two linear regressions (2MLR) [19, 18], and two one-dimensional Laplace distributions [5].

All the above results (with the exception of [5]) rely on the explicit density form and specific properties of the Gaussian distribution. In particular, under the Gaussian distribution, the M-step in the EM algorithm has a closed-form expression, which allows a straightforward analysis of the convergence behavior of the algorithm. However, for general log-concave distributions, the M-step no longer admits a closed-form solution, which poses significant challenges for analysis. To address this difficulty, we consider a modification of the standard EM algorithm, Least Squares EM (LS-EM), for learning the location parameter of a mixture of two log-concave distributions. The LS-EM algorithm admits a simple, explicit update rule in the M-step.

As the main result of this paper, we show that for a mixture of two rotation invariant log-concave distributions, LS-EM converges to the true location parameter from a randomly initialized point. Moreover, we provide explicit convergence rates and sample complexity bounds, which depend on the signal-to-noise ratio as well as the tail property of the distribution. As the functional form of the true density may be unknown, we further establish a robustness property of LS-EM when using a mis-specified density. As a special case, we show that using a Gaussian distribution, LS-EM globally converges to a solution close to the true parameter whenever the variance of the true log-concave density is moderate.

Technical Contributions

We generalize the sensitivity analysis in [11] to a broad class of log-concave distributions. In the process, we demonstrate that log-concavity and rotation invariance of the distribution are the only properties required to guarantee the global convergence of LS-EM. Moreover, our analysis highlights the fundamental role of an angle-decreasing property in establishing the convergence of LS-EM to the true location parameter in the high-dimensional setting. Note that contraction in the $\ell_2$ distance, upon which the previous convergence results were built, no longer holds globally for general log-concave mixtures.

Organization

In Section 2, we formulate the parameter estimation problem for a mixture of log-concave distributions and review related work. In Section 3, we delineate the Least Squares EM algorithm and elucidate its connection with classical EM. Analysis of the global convergence of LS-EM is provided in Section 4 under the population setting. Finite-sample results are presented in Section 5, with Section 6 dedicated to the model mis-specification setting. The paper concludes with a discussion of future directions in Section 7. Some details of the proofs are deferred to the Appendix.

Notations

We use and to denote scalars and vectors, respectively; and to denote scalar and vector random variables, respectively. The -th coordinate of (or ) is (or ), and the -th data point is denoted by or . The Euclidean norm in $\mathbb{R}^d$ is $\|\cdot\|_2$. For two vectors , we use to denote the angle between them, and to denote their inner product. Finally, $I_d$ is the $d$-by-$d$ identity matrix.

2 Problem Setup

In this section, we set up the model for a mixture of log-concave distributions, and discuss the corresponding location estimation problem in the context of existing work.

2.1 Data Generating Model

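Let $\mathcal{F}$ be a class of rotation invariant log-concave densities in $\mathbb{R}^d$ defined as follows:

$$\mathcal{F} = \Big\{ f :\ f(x) = \frac{1}{C_g}\exp\big(-g(\|x\|_2)\big),\ g \text{ is convex and strictly increasing on } [0,\infty),\ \int f(x)\,dx = 1,\ \int x_i^2 f(x)\,dx = 1,\ \forall i \in [d] \Big\}. \tag{1}$$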

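Without loss of generality, we assume $g(0)=0$. (Note that $x \mapsto g(\|x\|_2)$ is a convex function, as it is the composition of a convex function and a convex increasing function.) The normalization constant $C_g$ can be computed explicitly by $C_g = d\,v_d \int_0^\infty r^{d-1} e^{-g(r)}\,dr$, where $v_d$ is the volume of a unit ball in $\mathbb{R}^d$. It can be verified that each $f \in \mathcal{F}$ has mean $0$ and covariance matrix $I_d$. For each $f \in \mathcal{F}$, we may generate a location-scale family consisting of the densities $f_{\beta,\sigma}(x) := \frac{1}{\sigma^d} f\big(\frac{x-\beta}{\sigma}\big)$, which has mean $\beta$ and covariance matrix $\sigma^2 I_d$.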

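We assume that each data point $X_i$ is sampled independently from the distribution $\mathcal{D}(\beta^*,\sigma)$, defined as a balanced mixture of two densities from the above log-concave location-scale family:

$$\mathcal{D}(\beta^*,\sigma) := \tfrac{1}{2} f_{\beta^*,\sigma} + \tfrac{1}{2} f_{-\beta^*,\sigma}. \tag{2}$$

It is often useful to view this mixture model as an equivalent latent variable model: independently for each $i$, an unobserved label $Z_i \in \{1,2\}$ is first generated according to

$$\mathbb{P}(Z_i = 1) = \mathbb{P}(Z_i = 2) = \tfrac{1}{2},$$

and then the data point $X_i$ is sampled from the corresponding mixture component, i.e., from $f_{\beta^*,\sigma}$ if $Z_i = 1$ and from $f_{-\beta^*,\sigma}$ otherwise.

Since $\{f_{\beta,\sigma}\}$ is a location-scale family, the above generative process can be equivalently written as

$$X_i = \begin{cases} \beta^* + \sigma E_i, & \text{if } Z_i = 1, \\ -\beta^* + \sigma E_i, & \text{if } Z_i = 2, \end{cases}$$

where $E_i \sim f$ can be viewed as the additive noise. This equivalent representation motivates us to define the signal-to-noise ratio (SNR)

$$\eta := \frac{\|\beta^*\|_2}{\sigma}, \tag{3}$$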

which is used throughout this paper.
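The latent-variable view above translates directly into a sampler. The following is a minimal one-dimensional sketch, assuming unit-variance Laplace noise as the base density; the function name `sample_mixture` and all numerical values are illustrative choices, not taken from the paper.

```python
import numpy as np

def sample_mixture(n, beta_star, sigma, rng):
    """Draw n one-dimensional points X_i = +/- beta_star + sigma * E_i."""
    labels = rng.integers(1, 3, size=n)            # Z_i in {1, 2}, each with probability 1/2
    signs = np.where(labels == 1, 1.0, -1.0)       # +beta_star if Z_i = 1, -beta_star otherwise
    # Assumption: Laplace noise; scale 1/sqrt(2) gives variance 2 * scale^2 = 1.
    noise = rng.laplace(loc=0.0, scale=1.0 / np.sqrt(2.0), size=n)
    return signs * beta_star + sigma * noise, labels

rng = np.random.default_rng(0)
beta_star, sigma = 2.0, 0.5
x, z = sample_mixture(200_000, beta_star, sigma, rng)
snr = abs(beta_star) / sigma                       # eta as defined in (3)
print(snr, x.mean(), x.var())
```

Because the mixture is balanced and symmetric, the sample mean should be near zero and the sample variance near $\|\beta^*\|_2^2 + \sigma^2$.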

Examples:

Below are several familiar examples of one-dimensional log-concave distributions from :

1. Polynomial distributions: with . When , it corresponds to the Gaussian distribution. When , it corresponds to the Laplace distribution.

2. Logistic distribution: .

These distributions can be generalized to higher dimensional scenarios by replacing with . In Appendix B, we provide a review of some elementary properties of log-concave distributions.
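As a rough numerical companion to these examples, the sketch below writes down one possible choice of $g$ for the Gaussian, Laplace and Logistic cases and checks the unit-variance constraint from (1) by a Riemann sum. The scalings used here are the standard unit-variance parametrizations and are an assumption; they are not necessarily the exact expressions used in the paper, and the helper `variance_of` is hypothetical.

```python
import numpy as np

def g_gaussian(r):
    return 0.5 * r**2                      # density proportional to exp(-x^2 / 2)

def g_laplace(r):
    return np.sqrt(2.0) * r                # Laplace with scale 1/sqrt(2)

def g_logistic(r):
    s = np.sqrt(3.0) / np.pi               # logistic scale with variance s^2 * pi^2 / 3 = 1
    return r / s + 2.0 * np.log1p(np.exp(-r / s)) - 2.0 * np.log(2.0)

def variance_of(g, half_width=40.0, num=400_001):
    """Variance of the 1-D density proportional to exp(-g(|x|)), via a Riemann sum."""
    x = np.linspace(-half_width, half_width, num)
    dx = x[1] - x[0]
    w = np.exp(-g(np.abs(x)))
    w /= w.sum() * dx                      # normalize so the density integrates to one
    return float(np.sum(x**2 * w) * dx)

for name, g in [("gaussian", g_gaussian), ("laplace", g_laplace), ("logistic", g_logistic)]:
    print(name, variance_of(g))
```

Each function is convex and increasing on $[0,\infty)$ and vanishes at zero, so the constant offset is absorbed into the normalization constant.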

2.2 Location Estimation and the EM Algorithm

We assume that $\sigma$ is known, and our goal is to estimate the location parameter $\beta^*$ from data sampled i.i.d. from the mixture distribution $\mathcal{D}(\beta^*,\sigma)$ as defined in (2). We first consider this problem for a given log-concave family for which the base density $f$ (equivalently, $g$) is known. The case with an unknown $f$ is discussed in Section 6.

Since the negative log-likelihood function of the mixture (2) is nonconvex, computing the standard MLE for $\beta^*$ involves a nonconvex optimization problem. EM is a popular iterative method for computing the MLE, consisting of an expectation (E) step and a maximization (M) step. In a standard implementation of EM, the E-step computes the conditional distribution of the labels under the current estimate of $\beta^*$, and the M-step computes a new estimate by maximizing the conditional log-likelihood based on the distribution obtained in the E-step. The LS-EM algorithm we consider, described in Section 3 to follow, is a variant of the standard EM algorithm with a modified M-step.

2.3 Convergence of EM and Related Work

Despite the popularity and empirical success of the EM algorithm, our understanding of its theoretical properties is far from complete. Due to the nonconvexity of negative log-likelihood functions, EM is only guaranteed to converge to a stationary point in general [29]. Quantitative convergence results only began to emerge in recent years. The work [4] proposed a general framework for establishing the local convergence of EM when initialized near the true parameter, with applications to 2GMM, 2MLR, and regression with missing covariates. Extensions to multiple components are considered in [31].

Beyond local convergence, it is known that the likelihood function of GMM may have bad local optima when there are more than two components, and EM fails to find the true parameter without a careful initialization [16]. Analysis of the global convergence of EM has hence been focused on the two-component setting, as is done in this paper. The work in [11, 30] showed that EM converges from a random initialization for 2GMM. Subsequent work in [19, 18, 21] established similar results in other settings, most of which involve Gaussian models. An exception is [5], which proved the global convergence of EM for a mixture of two Laplace distributions and derived an explicit convergence rate, but only in the one-dimensional population (infinite sample) setting. We also note that the work [9] studied convergence properties of Lloyd’s k-means algorithm (a close relative of EM) for Gaussian mixtures. In general, properties of EM for mixtures of other distributions are much less understood, which is the problem we target in this paper.

The log-concave family we consider is a natural and flexible generalization of Gaussian. This family includes many common distributions, and has broad applications in economics [2, 3], reliability theory [6] and sampling analysis [14]; see [28, 25] for a further review. Existing work on estimating log-concave distributions and mixtures has mostly considered the non/semi-parametric setting [28, 10, 17, 23, 10, 13]; these methods are flexible but typically more computationally and data intensive than the parametric approach we consider. Other approaches to learning general mixtures include spectral methods [1, 24] and tensor methods [15, 8], and the EM algorithm is often applied to the output of these methods.

3 The Least Squares EM Algorithm

As mentioned, the M-step in the standard EM involves maximizing the conditional log-likelihood. For GMM, the M-step is equivalent to solving a least-squares problem. For a mixture of log-concave distributions, however, the M-step amounts to solving a convex optimization problem that does not admit a closed-form solution in general. This introduces complexity for both computation and analysis.

We instead consider Least Squares EM (LS-EM), a variant of EM that solves a least-squares problem in the M-step even for non-Gaussian mixtures. To elucidate the algorithmic property, we first consider LS-EM in the population setting, where we have access to an infinite number of data points sampled from the mixture distribution $\mathcal{D}(\beta^*,\sigma)$. The finite sample version is discussed in Section 5.

Each iteration of the population LS-EM algorithm consists of the following two steps:

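• E-step: Compute the conditional probabilities of the label $Z \in \{1,2\}$ given the current location estimate $\beta$:

$$p^1_{\beta,\sigma}(X) := \frac{f_{\beta,\sigma}(X)}{f_{\beta,\sigma}(X) + f_{-\beta,\sigma}(X)}, \qquad p^2_{\beta,\sigma}(X) := \frac{f_{-\beta,\sigma}(X)}{f_{\beta,\sigma}(X) + f_{-\beta,\sigma}(X)}. \tag{4}$$

• Least-squares M-step: Update the location estimate via weighted least squares:

$$\begin{aligned} \beta^+ &= \operatorname*{argmin}_{b}\ \mathbb{E}_{X \sim \mathcal{D}(\beta^*,\sigma)}\Big[ p^1_{\beta,\sigma}(X)\,\|X - b\|_2^2 + p^2_{\beta,\sigma}(X)\,\|X + b\|_2^2 \Big] \\ &= \mathbb{E}_{X \sim \mathcal{D}(\beta^*,\sigma)}\, X \tanh\Big( \tfrac{1}{2}\, g\big(\tfrac{1}{\sigma}\|X + \beta\|_2\big) - \tfrac{1}{2}\, g\big(\tfrac{1}{\sigma}\|X - \beta\|_2\big) \Big) =: M(\beta^*,\beta). \end{aligned} \tag{5}$$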

In (5), we minimize the sum of squared distances of $X$ to each component’s location, weighted by the conditional probability of $X$ belonging to that component. One may interpret LS-EM as a soft version of the K-means algorithm: instead of assigning each $X$ exclusively to one of the components, we assign a corresponding probability computed using the log-concave density.
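The closed form in (5) follows from a short calculation, sketched here under the assumption that the gradient may be taken inside the expectation and that $f_{\beta,\sigma}(x) \propto \exp\big(-g(\|x-\beta\|_2/\sigma)\big)$. The symbols $u := g(\|X-\beta\|_2/\sigma)$ and $v := g(\|X+\beta\|_2/\sigma)$ are introduced only for this sketch, and $p^1, p^2$ abbreviate the weights in (4).

```latex
\begin{align*}
0 &= \nabla_b\, \mathbb{E}\big[\, p^1 \|X-b\|_2^2 + p^2 \|X+b\|_2^2 \,\big]
   = 2\,\mathbb{E}\big[\, (p^1+p^2)\, b - (p^1-p^2)\, X \,\big] \\
  &\Longrightarrow\quad b = \mathbb{E}\big[\, (p^1-p^2)\, X \,\big]
   \qquad \text{(using } p^1+p^2 = 1\text{)}, \\
p^1 - p^2 &= \frac{e^{-u} - e^{-v}}{e^{-u} + e^{-v}}
   = \tanh\!\Big(\frac{v-u}{2}\Big)
   = \tanh\!\Big( \tfrac12\, g\big(\tfrac{1}{\sigma}\|X+\beta\|_2\big) - \tfrac12\, g\big(\tfrac{1}{\sigma}\|X-\beta\|_2\big) \Big).
\end{align*}
```

The normalization constants of the two component densities cancel in the ratio, which is why only $g$ appears in the update.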

3.1 Connection to Standard EM

In contrast to LS-EM, the M-step in the standard EM algorithm involves maximizing the weighted log-likelihood function (or minimizing the weighted negative log-likelihood function):

$$\text{Standard M-step:}\quad \operatorname*{argmax}_{b}\ Q(b \mid \beta) := \mathbb{E}_{X \sim \mathcal{D}(\beta^*,\sigma)}\Big[ p^1_{\beta,\sigma}(X) \log f_{b,\sigma}(X) + p^2_{\beta,\sigma}(X) \log f_{-b,\sigma}(X) \Big]. \tag{6}$$

The standard EM iteration, consisting of (4) and (6), corresponds to a minorization-maximization procedure for finding the MLE under the statistical setting (2). In particular, the function $Q(\cdot \mid \beta)$ above is a lower bound of the (marginal) log-likelihood function of (2), and the standard M-step (6) finds the maximizer of this lower bound. In general, this maximization can only be solved approximately. For example, the “gradient EM” algorithm considered in [4] performs one gradient ascent step on the $Q$ function.

The least-squares M-step (5) admits an explicit update. Moreover, it may also be viewed as an approximation to the standard M-step (6), as we observe numerically (see Appendix H.1) that the LS-EM update satisfies

$$Q(\beta^+ \mid \beta) > Q(\beta \mid \beta) \quad \text{if } \beta \neq \beta^*. \tag{7}$$

This observation indicates that the least-squares M-step finds an improved solution (compared to the previous iterate $\beta$) for the function $Q(\cdot \mid \beta)$.

4 Analysis of Least Squares EM

In this section, we analyze the convergence behavior of the LS-EM update (5) in the population setting. We first consider the one dimensional case ($d=1$) in Section 4.1 and establish the global convergence of LS-EM, extending the techniques in [11] for 2GMM to log-concave mixtures. In Section 4.2, we prove global convergence in the multi-dimensional case ($d>1$). In this setting, the LS-EM update is not contractive in $\ell_2$, so the analysis requires the new ingredient of an angle decreasing property.

For convenience, we introduce the shorthand $F_{\beta,\sigma}(X) := g\big(\tfrac{1}{\sigma}\|X+\beta\|_2\big) - g\big(\tfrac{1}{\sigma}\|X-\beta\|_2\big)$; when $\sigma = 1$, we simply write $F_{\beta}(X)$. Since the integrand in (5) is an even function of $X$, the update (5) can be simplified to an equivalent form by integrating over one component of the mixture:

$$\beta^+ = M(\beta^*,\beta) = \mathbb{E}_{X \sim f_{\beta^*,\sigma}}\, X \tanh\big( 0.5\, F_{\beta,\sigma}(X) \big). \tag{8}$$

Throughout the section, we refer to the technical conditions permitting the interchange of differentiation and integration as the regularity condition. This condition is usually satisfied by log-concave distributions — a detailed discussion is provided in Appendix E.
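To make the update concrete, here is a minimal Monte Carlo sketch of one LS-EM step, with the population expectation replaced by an average over samples from the mixture. It assumes the tanh form of the update in (5) with $F_{\beta,\sigma}(X) = g(\|X+\beta\|_2/\sigma) - g(\|X-\beta\|_2/\sigma)$, and uses a unit-variance Laplace base density in one dimension purely as an illustration; the names `ls_em_step` and `g_laplace` are hypothetical.

```python
import numpy as np

def g_laplace(r):
    return np.sqrt(2.0) * r                    # assumed unit-variance Laplace: f(x) ~ exp(-sqrt(2)|x|)

def ls_em_step(x, beta, sigma, g):
    """Sample average of X * tanh(0.5 * F_{beta,sigma}(X)); x has shape (n, d)."""
    r_plus = np.linalg.norm(x + beta, axis=1) / sigma
    r_minus = np.linalg.norm(x - beta, axis=1) / sigma
    weights = np.tanh(0.5 * (g(r_plus) - g(r_minus)))   # equals p^1 - p^2 from the E-step
    return (x * weights[:, None]).mean(axis=0)

rng = np.random.default_rng(1)
n, beta_star, sigma = 200_000, np.array([2.0]), 1.0
signs = rng.choice([-1.0, 1.0], size=(n, 1))             # balanced labels
noise = rng.laplace(scale=1.0 / np.sqrt(2.0), size=(n, 1))
x = signs * beta_star + sigma * noise

beta = np.array([0.1])                                   # arbitrary non-zero start
for _ in range(30):
    beta = ls_em_step(x, beta, sigma, g_laplace)
print(beta)
```

With a large sample the iterates should settle close to $\beta^*$, up to a sampling error that shrinks with $n$.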

4.1 One Dimensional Case (d=1)

For one dimensional log-concave mixtures, the behavior of LS-EM is similar to that of the EM algorithm for 2GMM: there exist only 3 fixed points, $0$, $\beta^*$, and $-\beta^*$, among which $0$ is non-attractive. Consequently, LS-EM converges to the true parameter ($\beta^*$ or $-\beta^*$) from any non-zero initial solution $\beta_0$. This is established in the following theorem.

Theorem 4.1 (Global Convergence, 1D).

Suppose that $f \in \mathcal{F}$ satisfies the regularity condition. The LS-EM update (5), $\beta \mapsto M(\beta^*,\beta)$, has exactly three fixed points: $0$, $\beta^*$ and $-\beta^*$. Moreover, the following one-step bound holds:

$$\big| M(\beta^*,\beta) - \operatorname{sign}(\beta\beta^*)\,\beta^* \big| \le \kappa(\beta^*,\beta,\sigma) \cdot \big| \beta - \operatorname{sign}(\beta\beta^*)\,\beta^* \big|,$$

where the contraction factor

$$\kappa(\beta^*,\beta,\sigma) := \mathbb{E}_{X \sim f_{\min(|\beta|,|\beta^*|),\sigma}} \Big[ 1 - \tanh\big( 0.5\, F_{\min(|\beta|,|\beta^*|),\sigma}(X) \big) \Big]$$

satisfies $\kappa(\beta^*,\beta,\sigma) < 1$ when $\beta \neq 0$.

We prove this theorem in Appendix C.1. The crucial property used in the proof is the self-consistency of the LS-EM update (5), namely $M(\beta,\beta) = \beta$ for all $\beta$. This property allows us to extend the sensitivity analysis technique for 2GMM to general log-concave distributions.
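The contraction factor can be evaluated numerically. The sketch below does so for the Gaussian case only, assuming $g(r) = r^2/2$ so that $F_{b,\sigma}(x) = 2bx/\sigma^2$, and approximating the expectation under $N(b,\sigma^2)$ by a Riemann sum; the function `kappa_gaussian` is a hypothetical helper, not part of the paper.

```python
import numpy as np

def kappa_gaussian(beta, beta_star, sigma, num=200_001):
    """kappa = E_{X ~ N(b, sigma^2)}[1 - tanh(0.5 * F_{b,sigma}(X))], b = min(|beta|, |beta_star|)."""
    b = min(abs(beta), abs(beta_star))
    x = np.linspace(b - 12.0 * sigma, b + 12.0 * sigma, num)   # +/- 12 standard deviations
    dx = x[1] - x[0]
    density = np.exp(-0.5 * ((x - b) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    f_val = 2.0 * b * x / sigma**2             # F_{b,sigma}(x) for g(r) = r^2 / 2
    return float(np.sum((1.0 - np.tanh(0.5 * f_val)) * density) * dx)

k_small = kappa_gaussian(0.5, 2.0, 1.0)        # iterate still far below beta_star
k_large = kappa_gaussian(2.0, 2.0, 1.0)        # iterate at beta_star
print(k_small, k_large)
```

In this example the factor is strictly below one and shrinks as $\min(|\beta|,|\beta^*|)$ grows, which is the qualitative behavior the theorem describes.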

It can be further shown that the contraction factor becomes smaller as the iterate approaches the true (see Lemma C.2). We thus obtain the following corollary on global convergence at a geometric rate. Without loss of generality, we assume .

Corollary 4.2 (t-step Convergence Rate, 1D).

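Suppose that $f \in \mathcal{F}$ satisfies the regularity condition. Let $\beta_t$ denote the output of LS-EM after $t$ iterations, starting from $\beta_0 \neq 0$. The following holds:

$$\big| \beta_t - \operatorname{sign}(\beta_0\beta^*)\,\beta^* \big| \le \kappa(\beta^*,\beta_0,\sigma)^t \cdot \big| \beta_0 - \operatorname{sign}(\beta_0\beta^*)\,\beta^* \big|.$$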

If is in or , running LS-EM for iterations outputs a solution in . In addition, if is in , running LS-EM for iterations outputs an -close estimate of , where is a constant depending only on and the SNR .

Special cases

We provide explicit convergence rates for mixtures of some common log-concave distributions. Again, we assume and without loss of generality, and set .

• Gaussian: and

• Laplace: and

• Logistic: and .

See Appendix C.2 for the proofs of the above results. Note that the convergence rate depends on the signal-to-noise ratio as well as the asymptotic growth rate of the log-density function . In the above examples, , where for Laplace and Logistic distributions, and for Gaussian distribution.

4.2 High Dimensional Case (d>1)

Extension to higher dimensions is more challenging for log-concave mixtures than for Gaussian mixtures. Unlike Gaussian, a log-concave distribution with diagonal covariance may not have independent coordinates. A more severe challenge arises because LS-EM is not contractive in distance to the true parameter for general log-concave mixtures. This phenomenon, proved in the lemma below, stands in sharp contrast to the Gaussian mixture problem.

Lemma 4.3 (Non-contraction in ℓ2).

Consider a log-concave density of the form with . When , is the only fixed point of LS-EM in the direction orthogonal to . When , there exists a fixed point other than in the orthogonal direction. Consequently, when , there exists such that .

We prove Lemma 4.3 in Appendix D.3. The lemma shows that it is fundamentally impossible to prove global convergence of LS-EM solely based on distance, which was the approach taken in [11] for Gaussian mixtures.

Despite the above challenges, we show affirmatively that LS-EM converges globally to $\beta^*$ for mixtures of rotation-invariant log-concave distributions, as long as the initial iterate is not orthogonal to $\beta^*$ (a measure zero set).

As the first step, we use rotation invariance to show that the LS-EM iterates stay in a two-dimensional space. This is done in the following lemma, with proof in Appendix D.1.

Lemma 4.4 (LS-EM is 2-Dimensional).

The LS-EM update satisfies: . Moreover, if or , then .

We next establish the asymptotic global convergence property of LS-EM.

Theorem 4.5 (Global Convergence, d-Dimensional).

Suppose that satisfies the regularity condition. The LS-EM algorithm converges to from any randomly initialized point that is not orthogonal to .

We prove the theorem using a sensitivity analysis that shows decrease in angle rather than in distance to the true parameter. The proof does not depend on the explicit form of the density, but only log-concavity and rotation invariance. We sketch the main ideas of proof below, deferring the details to Appendix D.2.

Proof Sketch.

Let $\beta_0$ be the initial point that is not orthogonal to $\beta^*$. Without loss of generality, we assume $\langle \beta_0, \beta^* \rangle > 0$. Consequently, all the future iterates satisfy $\langle \beta_t, \beta^* \rangle > 0$ (see Lemma D.3).

If $\beta_0$ is in the span of $\beta^*$ (i.e., is parallel to $\beta^*$), Lemma D.2 ensures that the iterates remain in the direction of $\beta^*$ and converge to $\beta^*$. On the other hand, if $\beta_0$ is not in the span of $\beta^*$, we make use of the following two key properties of the LS-EM update $M(\beta^*,\beta)$:

1. Angle Decreasing Property (Lemma D.1): Whenever $\angle(\beta,\beta^*) \in (0, \pi/2)$, the LS-EM update strictly decreases the iterate’s angle toward $\beta^*$, i.e., $\angle(M(\beta^*,\beta),\beta^*) < \angle(\beta,\beta^*)$;

2. Local Contraction Region (Corollary D.6): there is a local region around $\beta^*$ such that if any iterate falls in that region, all the future iterates remain in that region.

Since the sequence of LS-EM iterates is bounded, it must have accumulation points. Using the angle decreasing property and the continuity of $M(\beta^*,\beta)$ in the second variable $\beta$, we show that all the accumulation points must be in the direction of $\beta^*$. In view of the dynamics of the 1-dimensional case (Theorem 4.1), we can further show that the set of accumulation points must fall into one of the following three possibilities: $\{0\}$, $\{\beta^*\}$, or $\{0, \beta^*\}$. Below we argue by contradiction that $\{0\}$ and $\{0, \beta^*\}$ are impossible.

• If $\{0\}$ is the set of accumulation points, the sequence of non-zero iterates would converge to $0$ and stay in a neighborhood of $0$ after some time $T$; in this case, Lemma D.7 states that the norm of the iterates is bounded away from zero in the limit and hence they cannot converge to $0$.

• If $\{0, \beta^*\}$ is the set of accumulation points, then there is at least one iterate in the local region of $\beta^*$; by the local contraction region property above, all the future iterates remain close to $\beta^*$. Therefore, $0$ cannot be another accumulation point.

At last, we conclude that $\beta^*$ is the only accumulation point, to which LS-EM converges. ∎
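The angle-based argument can be observed numerically. The sketch below runs sample-based LS-EM in $d = 3$ and records the angle between the iterate and $\beta^*$. It assumes a rotation invariant base density proportional to $\exp(-c\|x\|_2)$ with $c = \sqrt{d+1}$, which gives unit variance per coordinate (the radius then follows a Gamma distribution with shape $d$ and scale $1/c$); this density, the starting point, and all names are illustrative choices rather than the paper's experiment.

```python
import numpy as np

def angle(u, v):
    cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def ls_em_step(x, beta, sigma, g):
    r_plus = np.linalg.norm(x + beta, axis=1) / sigma
    r_minus = np.linalg.norm(x - beta, axis=1) / sigma
    w = np.tanh(0.5 * (g(r_plus) - g(r_minus)))
    return (x * w[:, None]).mean(axis=0)

d, n, sigma = 3, 100_000, 1.0
c = np.sqrt(d + 1.0)                         # makes each coordinate of the noise unit variance

def g(r):
    return c * r                             # assumed base density: f(x) ~ exp(-c * ||x||_2)

rng = np.random.default_rng(2)
directions = rng.standard_normal((n, d))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)   # uniform on the sphere
radii = rng.gamma(shape=d, scale=1.0 / c, size=(n, 1))            # radial law ~ r^(d-1) exp(-c r)
beta_star = np.array([2.0, 0.0, 0.0])
signs = rng.choice([-1.0, 1.0], size=(n, 1))
x = signs * beta_star + sigma * radii * directions

beta = np.array([0.5, 1.0, 0.0])             # roughly 63 degrees away from beta_star
angles = [angle(beta, beta_star)]
for _ in range(100):
    beta = ls_em_step(x, beta, sigma, g)
    angles.append(angle(beta, beta_star))
print(angles[:5], angles[-1], beta)
```

With finite samples the angle is only approximately monotone near the limit, so the sketch is meant to illustrate the trend rather than verify Lemma D.1.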

5 Finite Sample Analysis

In this section, we consider the finite sample scenario, where we are given $n$ data points sampled i.i.d. from $\mathcal{D}(\beta^*,\sigma)$. Using the equivalent expression (8) for the population LS-EM update, and replacing the expectation with the sample average, we obtain the finite-sample LS-EM update:

$$\tilde{\beta}^+ = \frac{1}{n} \sum_{i=1}^{n} X_i \tanh\big( 0.5\, F_{\beta,\sigma}(X_i) \big), \quad \text{where } X_i \overset{\text{i.i.d.}}{\sim} f_{\beta^*,\sigma}. \tag{9}$$

This expression is for analytic purposes only. To actually implement LS-EM, we use samples from the mixture distribution $\mathcal{D}(\beta^*,\sigma)$, which is equivalent to (9).

One approach to extend the population results (in Section 4) to this case is by coupling the population update with the finite-sample update . To this end, we make use of the fact that log-concave distributions are automatically sub-exponential (see Lemma F.2), so the random variables are i.i.d. sub-exponential for each coordinate . Therefore, the concentration bound holds, and we expect that the convergence properties of the population LS-EM carry over to the finite-sample case, modulo a statistical error of .
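The equivalence between averaging over samples from one component and averaging over samples from the mixture rests on the summand $x \mapsto x \tanh(0.5\,F_{\beta,\sigma}(x))$ being even, so a point drawn from the component at $-\beta^*$ contributes exactly as its mirror image would. A minimal one-dimensional check, assuming a unit-variance logistic $g$ and arbitrary illustrative values of $\beta$ and $\sigma$:

```python
import numpy as np

def g_logistic(r):
    s = np.sqrt(3.0) / np.pi                 # assumed unit-variance logistic scale
    return r / s + 2.0 * np.log1p(np.exp(-r / s)) - 2.0 * np.log(2.0)

def summand(x, beta, sigma, g):
    """x * tanh(0.5 * F_{beta,sigma}(x)) in one dimension."""
    f_val = g(np.abs(x + beta) / sigma) - g(np.abs(x - beta) / sigma)
    return x * np.tanh(0.5 * f_val)

x = np.linspace(-6.0, 6.0, 1001)
even_gap = np.max(np.abs(summand(x, 0.7, 1.3, g_logistic) - summand(-x, 0.7, 1.3, g_logistic)))
print(even_gap)
```

The printed gap should be zero up to floating-point rounding, since $F_{\beta,\sigma}(-x) = -F_{\beta,\sigma}(x)$.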

The above argument is made precise in the following proposition for the one-dimensional case, which is proved in Appendix F.1.

Proposition 5.1 (1-d Finite Sample).

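Suppose the density function $f$ satisfies the regularity condition. With $\beta$ being the current estimate, the finite-sample LS-EM update (9) satisfies the following bound with probability at least $1-\delta$:

$$|\tilde{\beta}^+ - \beta^*| \le \kappa(\beta^*,\beta,\sigma) \cdot |\beta - \beta^*| + (\beta^* + C_f \sigma) \cdot O\Big( \sqrt{\tfrac{1}{n} \log \tfrac{1}{\delta}} \Big), \tag{10}$$

where $\kappa(\beta^*,\beta,\sigma)$ is the contraction factor defined in Theorem 4.1 and $C_f$ is the Orlicz norm (i.e., the sub-exponential parameter) of a random variable with density $f$.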

Using Proposition 5.1, we further deduce the global convergence of LS-EM in the finite sample case, which parallels the population result in Corollary 4.2. We develop this result assuming sample splitting, i.e., each iteration uses a fresh, independent set of samples. This assumption is standard in finite-sample analysis of EM [4, 31, 11, 30, 19, 18]. In this setting, we establish the following quantitative convergence guarantee for LS-EM initialized at any non-zero $\beta_0$.

Without loss of generality, let . The convergence has two stages. In the first stage, the LS-EM iterates enter a local neighborhood around , regardless of whether is close to or far from . This is the content of the result below.

Proposition 5.2 (First Stage: Escape from 0 and ∞).

Suppose the initial point is either close to (e.g, ) or far away from (e.g, ). After iterations, with fresh samples per iteration, LS-EM outputs a solution with probability at least .

Within this local neighborhood, the LS-EM iterates converge to geometrically, up to a statistical error determined by the sample size. This second stage convergence result is given below.

Proposition 5.3 (Second Stage: Local Convergence).

The following holds for any . Suppose . After iterations, with fresh samples per iteration, LS-EM outputs a solution satisfying with probability at least .

We prove Propositions 5.2 and 5.3 in Appendix F.2.
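The sample-splitting scheme used in these results can be sketched as follows: the data are cut into disjoint batches and each iteration consumes one fresh batch. The example assumes a one-dimensional mixture with unit-variance logistic noise; the batch sizes, iteration count and starting point are arbitrary illustrative values, not the quantities appearing in Propositions 5.2 and 5.3.

```python
import numpy as np

def g_logistic(r):
    s = np.sqrt(3.0) / np.pi                  # assumed unit-variance logistic scale
    return r / s + 2.0 * np.log1p(np.exp(-r / s)) - 2.0 * np.log(2.0)

def ls_em_step_1d(x, beta, sigma, g):
    f_val = g(np.abs(x + beta) / sigma) - g(np.abs(x - beta) / sigma)
    return float(np.mean(x * np.tanh(0.5 * f_val)))

def ls_em_sample_splitting(x, beta0, sigma, g, num_iters):
    """Run LS-EM where iteration t only sees the t-th disjoint batch of x."""
    batches = np.array_split(x, num_iters)
    beta, path = beta0, [beta0]
    for batch in batches:
        beta = ls_em_step_1d(batch, beta, sigma, g)
        path.append(beta)
    return beta, path

rng = np.random.default_rng(3)
beta_star, sigma, num_iters, n_per_iter = 2.0, 1.0, 30, 20_000
n = num_iters * n_per_iter
signs = rng.choice([-1.0, 1.0], size=n)
noise = rng.logistic(loc=0.0, scale=np.sqrt(3.0) / np.pi, size=n)
x = signs * beta_star + sigma * noise
beta_hat, path = ls_em_sample_splitting(x, 0.05, sigma, g_logistic, num_iters)
print(path[:5], beta_hat)
```

Starting close to zero, the first few iterates move away from the origin, after which the path hovers around $\beta^*$ at the level of the per-batch sampling noise.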

Next, we parse the above results in the special cases of Gaussian, Laplace and Logistic, assuming that for simplicity. Accordingly, . In Section 4.1 we showed that , where is the growth rate of the log density . Consequently, the first stage requires iterations with samples per iteration, and the second stage requires iterations with samples per iteration. It is seen that we have better iteration and sample complexities with a larger (larger separation between the components) and a larger (lighter tail of the components).

In contrast, in the low SNR regime with , the sample complexity actually becomes worse for a larger (lighter tails). Indeed, low SNR means that two components are close in location when . If their tails are lighter, then it becomes more likely that the mixture density has a unique mode at 0 instead of two modes at . In this case, the mixture problem becomes harder as it is more difficult to distinguish between the two components.

In the higher dimensional setting, we can similarly show coupling in (i.e., bounding ) via sub-exponential concentration. However, extending the convergence results above to is more subtle, due to the issue of non-contraction (see Lemma 4.3). Addressing this issue would require coupling in a different metric (e.g., in angle—see [19, 30]); we leave this to future work.

6 Robustness Under Model Mis-specification

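In practice, it is sometimes difficult to know a priori the exact parametric form of a log-concave distribution that generates the data. This motivates us to consider the following scenario: the data is from the mixture in (2) with a true log-concave distribution $f \in \mathcal{F}$ and unknown location parameter $\beta^*$, but we run LS-EM assuming some other log-concave distribution $\hat{f} \in \mathcal{F}$ with $\hat{f}(x) \propto \exp(-\hat{g}(\|x\|_2))$. Using the same symmetry argument as in deriving (8), we obtain the following expression for the mis-specified LS-EM update in the population case:

$$\hat{\beta}^+ = \hat{M}(\beta^*,\beta) := \mathbb{E}_{X \sim f_{\beta^*,\sigma}}\, X \tanh\big( 0.5\, \hat{F}_{\beta,\sigma}(X) \big), \tag{11}$$

where $\hat{F}_{\beta,\sigma}(X) := \hat{g}\big(\tfrac{1}{\sigma}\|X+\beta\|_2\big) - \hat{g}\big(\tfrac{1}{\sigma}\|X-\beta\|_2\big)$.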

Multiple properties of the LS-EM update are preserved in the mis-specification setting. In particular, using the same approach as in Lemma 4.4 and Lemma D.1, we can show that the mis-specified LS-EM update is also a two dimensional object and satisfies the same strict angle decreasing property . Therefore, to study the convergence behavior of mis-specified LS-EM, it suffices to understand the one-dimensional case (i.e., along the direction).

We provide results focusing on the setting in which is Gaussian, that is, we fit a Gaussian mixture to a true mixture of log concave distributions. In this setting, we can show that mis-specified LS-EM has only 3 fixed points (Lemma G.1). Moreover, we can bound the distance between and the true , thereby establishing the following convergence result:

Proposition 6.1 (Fit with 2GMM).

Under the above one dimensional setting with Gaussian , the following holds for some absolute constant : If , then the LS-EM algorithm with a non-zero initialization point converges to a solution satisfying and

$$\big| \bar{\beta} - \operatorname{sign}(\beta_0\beta^*)\,\beta^* \big| \le 10\sigma.$$

We prove this proposition in Appendix G.1. The proposition establishes the robustness of LS-EM: even in the mis-specified setting, LS-EM still converges globally. Moreover, when the SNR is high (i.e., small noise level $\sigma$), the final estimation error is small and scales linearly with $\sigma$.
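A small simulation illustrates the mis-specified setting: the data come from a Laplace mixture, while the update assumes a Gaussian density, for which $g(r) = r^2/2$ and the weight reduces to $\tanh(\beta x/\sigma^2)$. The SNR of 4 used here is an arbitrary illustrative choice, since the constant in Proposition 6.1 is not specified in this section.

```python
import numpy as np

def gaussian_ls_em_step(x, beta, sigma):
    # With g(r) = r^2 / 2, F_{beta,sigma}(x) = 2 * beta * x / sigma^2, so 0.5 * F = beta * x / sigma^2.
    return float(np.mean(x * np.tanh(beta * x / sigma**2)))

rng = np.random.default_rng(4)
n, beta_star, sigma = 200_000, 2.0, 0.5
signs = rng.choice([-1.0, 1.0], size=n)
noise = rng.laplace(scale=1.0 / np.sqrt(2.0), size=n)    # true noise: unit-variance Laplace
x = signs * beta_star + sigma * noise

beta = 0.1                                               # arbitrary non-zero start
for _ in range(50):
    beta = gaussian_ls_em_step(x, beta, sigma)
error = abs(beta - beta_star)
print(beta, error, 10 * sigma)
```

In this run the limit point is not exactly $\beta^*$ even with many samples, because the Gaussian weights do not match the Laplace posterior, but the gap stays well inside the $10\sigma$ bound.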

7 Conclusion

In this paper, we have established the global convergence of the Least Squares EM algorithm for a mixture of two log-concave densities. Beyond log-concavity, rotation invariance is the only property required for this theoretical guarantee. An immediate future direction is to establish quantitative global convergence guarantees in high dimensions for both the population and finite sample cases, which would require generalizing the angle convergence property in [19] to log-concave distributions. It is also of interest to relax the rotation invariance assumption (as many interesting log-concave distributions are skewed) and to consider mixtures with multiple components.

Acknowledgement

W. Qian and Y. Chen are partially supported by NSF CRII award 1657420 and grant 1704828. Y. Zhang is supported by NSF award 1740822.

References

• [1] Dimitris Achlioptas and Frank McSherry. On spectral learning of mixtures of distributions. In International Conference on Computational Learning Theory, pages 458–469. Springer, 2005.
• [2] Mark Yuying An. Log-concave probability distributions: Theory and statistical testing. Duke University Dept of Economics Working Paper, (95-03), 1997.
• [3] Mark Bagnoli and Ted Bergstrom. Log-concave probability and its applications. Economic theory, 26(2):445–469, 2005.
• [4] Sivaraman Balakrishnan, Martin J. Wainwright, and Bin Yu. Statistical guarantees for the EM algorithm: From population to sample-based analysis. The Annals of Statistics, 45(1):77–120, 2017.
• [5] Babak Barazandeh and Meisam Razaviyayn. On the behavior of the expectation-maximization algorithm for mixture models. In 2018 IEEE Global Conference on Signal and Information Processing (GlobalSIP), pages 61–65. IEEE, 2018.
• [6] Richard E. Barlow and Frank Proschan. Statistical theory of reliability and life testing: probability models. Technical report, Florida State Univ Tallahassee, 1975.
• [7] Patrick Billingsley. Probability and measure. John Wiley & Sons, 2008.
• [8] Arun Tejasvi Chaganty and Percy Liang. Spectral experts for estimating mixtures of linear regressions. In International Conference on Machine Learning, pages 1040–1048, 2013.
• [9] Kamalika Chaudhuri, Sanjoy Dasgupta, and Andrea Vattani. Learning mixtures of gaussians using the k-means algorithm. arXiv preprint arXiv:0912.0086, 2009.
• [10] Madeleine Cule and Richard Samworth. Theoretical properties of the log-concave maximum likelihood estimator of a multidimensional density. Electronic Journal of Statistics, 4:254–270, 2010.
• [11] Constantinos Daskalakis, Christos Tzamos, and Manolis Zampetakis. Ten steps of EM suffice for mixtures of two Gaussians. arXiv preprint arXiv:1609.00368, 2016.
• [12] Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22, 1977.
• [13] Ilias Diakonikolas, Anastasios Sidiropoulos, and Alistair Stewart. A polynomial time algorithm for maximum likelihood estimation of multivariate log-concave densities. arXiv preprint arXiv:1812.05524, 2018.
• [14] Walter R. Gilks and Pascal Wild. Adaptive rejection sampling for Gibbs sampling. Journal of the Royal Statistical Society: Series C (Applied Statistics), 41(2):337–348, 1992.
• [15] Daniel Hsu and Sham M. Kakade. Learning mixtures of spherical Gaussians: moment methods and spectral decompositions. In Proceedings of the 4th conference on Innovations in Theoretical Computer Science, pages 11–20. ACM, 2013.
• [16] Chi Jin, Yuchen Zhang, Sivaraman Balakrishnan, Martin J. Wainwright, and Michael I. Jordan. Local maxima in the likelihood of Gaussian mixture models: Structural results and algorithmic consequences. In Advances in neural information processing systems, pages 4116–4124, 2016.
• [17] Geurt Jongbloed. The iterative convex minorant algorithm for nonparametric estimation. Journal of Computational and Graphical Statistics, 7(3):310–321, 1998.
• [18] Jason M. Klusowski, Dana Yang, and W. D. Brinda. Estimating the coefficients of a mixture of two linear regressions by expectation maximization. IEEE Transactions on Information Theory, 2019.
• [19] Jeongyeol Kwon, Wei Qian, Constantine Caramanis, Yudong Chen, and Damek Davis. Global Convergence of EM Algorithm for Mixtures of Two Component Linear Regression. arXiv preprint arXiv:1810.05752, 2018.
• [20] Bruce G. Lindsay. Mixture models: theory, geometry and applications. In NSF-CBMS regional conference series in probability and statistics, pages i–163. JSTOR, 1995.
• [21] Sai Ganesh Nagarajan and Ioannis Panageas. On the convergence of EM for truncated mixtures of two Gaussians. arXiv preprint arXiv:1902.06958, 2019.
• [22] Nathan Ross. Fundamentals of Stein’s method. Probability Surveys, 8:210–293, 2011.
• [23] Kaspar Rufibach. Log-concave density estimation and bump hunting for IID observations. PhD thesis, publisher not identified, 2006.
• [24] Sanjeev Arora and Ravi Kannan. Learning mixtures of arbitrary Gaussians. In Proceedings of the thirty-third annual ACM symposium on Theory of computing, pages 247–257. ACM, 2001.
• [25] Adrien Saumard and Jon A. Wellner. Log-concavity and strong log-concavity: a review. Statistics surveys, 8:45, 2014.
• [26] D. Michael Titterington, Adrian F. M. Smith, and Udi E. Makov. Statistical analysis of finite mixture distributions. Wiley, 1985.
• [27] Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge University Press, 2018.
• [28] Guenther Walther. Detecting the presence of mixing with multiscale maximum likelihood. Journal of the American Statistical Association, 97(458):508–513, 2002.
• [29] C. F. Jeff Wu. On the convergence properties of the EM algorithm. The Annals of Statistics, 11(1):95–103, 1983.
• [30] Ji Xu, Daniel J. Hsu, and Arian Maleki. Global analysis of expectation maximization for mixtures of two Gaussians. In Advances in Neural Information Processing Systems, pages 2676–2684, 2016.
• [31] Bowei Yan, Mingzhang Yin, and Purnamrita Sarkar. Convergence of gradient EM on multi-component mixture of Gaussians. In Advances in Neural Information Processing Systems, pages 6956–6966, 2017.

Appendix A Additional Notations for Appendix

We use to denote the unit vector of , and to denote a vector orthogonal to . is the -th standard basis vector.

Appendix B Elementary Properties of Log-concave Distributions

A function $f$ is log-concave if it satisfies:

 $$f(\alpha x+(1-\alpha)y) \ge f(x)^{\alpha} f(y)^{1-\alpha},$$

for every $x, y$ and every $\alpha \in [0,1]$. Equivalently, $\log f$ is a concave function. We consider log-concave distribution which further satisfies: . The following is a classical result for log-concave distributions, which says that the log-concavity property is preserved by marginalization and convolution.

Theorem B.1.

All marginals as well as the density function of a log-concave distribution are log-concave. The convolution of two log-concave distributions is again a log-concave distribution.
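The defining inequality can be spot-checked numerically. The sketch below is only an illustration, not part of the paper's argument: it samples random triples $(x, y, \alpha)$ and tests the inequality in log form for three standard one-dimensional log-concave densities, with the Cauchy density as a known non-log-concave contrast. The helper name `violates_log_concavity`, the sampling range, and the choice of densities are assumptions made here; unnormalized log-densities are used because the inequality does not depend on the normalizing constant.

```python
import numpy as np

def violates_log_concavity(log_f, n_trials=20000, seed=0, tol=1e-9):
    """Return True if some sampled (x, y, a) breaks
    log f(a*x + (1-a)*y) >= a*log f(x) + (1-a)*log f(y)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-10.0, 10.0, n_trials)
    y = rng.uniform(-10.0, 10.0, n_trials)
    a = rng.uniform(0.0, 1.0, n_trials)
    lhs = log_f(a * x + (1.0 - a) * y)
    rhs = a * log_f(x) + (1.0 - a) * log_f(y)
    return bool(np.any(lhs < rhs - tol))

# Unnormalized log-densities (illustrative choices, not from the paper).
log_gauss = lambda x: -0.5 * x**2
log_laplace = lambda x: -np.abs(x)
log_logistic = lambda x: -x - 2.0 * np.log1p(np.exp(-x))
log_cauchy = lambda x: -np.log1p(x**2)   # heavy-tailed, not log-concave

for name, log_f in [("gauss", log_gauss), ("laplace", log_laplace),
                    ("logistic", log_logistic), ("cauchy", log_cauchy)]:
    print(name, "violation found:", violates_log_concavity(log_f))
```

A random search of this kind can only refute log-concavity, never prove it; a run with no violation is merely consistent with the property.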

A log-concave distribution on $\mathbb{R}$ has the following monotone likelihood ratio property:

Proposition B.2.

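A density function $f$ on $\mathbb{R}$ is log-concave if and only if the translation family $\{f(\cdot-\theta):\theta\in\mathbb{R}\}$ has a monotone likelihood ratio: for every $\theta_1<\theta_2$, the ratio $f(x-\theta_2)/f(x-\theta_1)$ is a monotone nondecreasing function of $x$.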
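As a rough numerical illustration of the monotone likelihood ratio property, the sketch below evaluates the log of the ratio $f(x-\theta_2)/f(x-\theta_1)$ on a grid for $\theta_1<\theta_2$ and checks that it never decreases. The symbols $\theta_1,\theta_2$, the helper name `has_monotone_likelihood_ratio`, the grid, and the example densities are assumptions made for this illustration. Working with log-densities avoids dividing by very small numbers in the tails.

```python
import numpy as np

def has_monotone_likelihood_ratio(log_f, theta1, theta2, tol=1e-10):
    """Check on a grid that log f(x - theta2) - log f(x - theta1)
    is nondecreasing in x (requires theta1 < theta2)."""
    x = np.linspace(-20.0, 20.0, 4001)
    log_ratio = log_f(x - theta2) - log_f(x - theta1)
    return bool(np.all(np.diff(log_ratio) >= -tol))

log_gauss = lambda x: -0.5 * x**2      # log-concave
log_laplace = lambda x: -np.abs(x)     # log-concave
log_cauchy = lambda x: -np.log1p(x**2) # not log-concave

print(has_monotone_likelihood_ratio(log_gauss, 0.0, 1.0))
print(has_monotone_likelihood_ratio(log_laplace, 0.0, 1.0))
print(has_monotone_likelihood_ratio(log_cauchy, 0.0, 1.0))
```

For the Cauchy density the ratio tends to 1 in both tails while dipping and peaking in between, so it cannot be monotone, which matches the "only if" direction of the proposition.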

Furthermore, a log-concave distribution has finite moments of all orders.

Lemma B.3.

For a rotation invariant log-concave density: , all the moments exist.

Proof.

It suffices to show that . By the rotation invariant property, we need to show:

 $$\int |x_1|^k\left(\int f(x)\,dx_2\ldots dx_d\right)dx_1<\infty. \tag{12}$$

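Note that $\int f(x)\,dx_2\ldots dx_d$ is the marginal density of $x_1$, and is thus log-concave by Theorem B.1. The problem is now further reduced to showing that a one-dimensional symmetric log-concave distribution, with density proportional to $\exp(-g(x))$ for a convex function $g$, has finite moments. By the convexity of $g$,

 $$g(x)\ge g(x_0)+\partial g(x_0)(x-x_0), \tag{13}$$

for some $x_0$ and $\partial g(x_0)$. In particular, we have shown that there exist $a>0$ and $b$ such that $g(x)\ge b+a(x-x_0)$. Therefore,

 $$\begin{aligned}
\int_x |x|^k\exp(-g(x))\,dx &= 2\int_{x\ge 0}x^k\exp(-g(x))\,dx \\
&\le 2\int_{x\ge 0}x^k\exp(-b-a(x-x_0))\,dx \\
&= 2\exp(-b+ax_0)\int_{x\ge 0}x^k\exp(-ax)\,dx<\infty.
\end{aligned}$$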

We conclude that all the moments exist. ∎

We refer the reader to [25] and [28] for a detailed review of other properties of log-concave distributions.

Appendix C Analysis for d=1

In this section, we prove the convergence results for $d=1$. Specifically, the proof of Theorem 4.1 is presented in Section C.1, and in Section C.2 we discuss the convergence rate for some explicit examples of log-concave distributions.

C.1 Proof of Theorem 4.1

We recall the shorthand notation:

 $$F_{\beta,\sigma}(x)=g\left(\tfrac{1}{\sigma}|x+\beta|\right)-g\left(\tfrac{1}{\sigma}|x-\beta|\right).$$

When $\sigma=1$, we abbreviate $F_{\beta,\sigma}$ as $F_\beta$. The statement to be proved is Theorem 4.1 of the main text.

Proof.

Without loss of generality, . When , one verifies that , therefore is a trivial fixed point. Without loss of generality, we assume (in the case for , replace with .) By scaling , and , we can further assume that in the following analysis.

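We first establish the consistency property of the LS-EM update: $M(\beta,\beta)=\beta$ for all $\beta$. This follows from the algebra:

 $$\begin{aligned}
M(\beta,\beta) &= \int_x \tfrac{1}{2}\big(f(x-\beta)+f(x+\beta)\big)\,x\left[\frac{f(x-\beta)-f(x+\beta)}{f(x-\beta)+f(x+\beta)}\right]dx \\
&= \tfrac{1}{2}\int_x x\big(f(x-\beta)-f(x+\beta)\big)\,dx \\
&= \tfrac{1}{2}\int_x (x-\beta)f(x-\beta)\,dx-\tfrac{1}{2}\int_x (x+\beta)f(x+\beta)\,dx+\beta \\
&= \beta,
\end{aligned}$$

where the last step holds since $xf(x)$ is an odd function. Consequently, the integrals $\int_x (x-\beta)f(x-\beta)\,dx$ and $\int_x (x+\beta)f(x+\beta)\,dx$ vanish.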
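The consistency property can be checked by numerical integration. The sketch below is a hedged illustration under assumptions made here: $\sigma=1$, a density proportional to $\exp(-g(|x|))$ in one dimension, a uniform grid with the trapezoidal rule, and two example choices of $g$ (Gaussian $g(t)=t^2/2$ and Laplace $g(t)=t$). The function names are hypothetical. It uses the single-component form $M(z,\beta)=\mathbb{E}_{X\sim f_z}X\tanh(0.5F_\beta(X))$ that appears later in this proof.

```python
import numpy as np

X = np.linspace(-50.0, 50.0, 200001)   # uniform integration grid
H = X[1] - X[0]

def integrate(y):
    """Trapezoidal rule on the uniform grid X."""
    return H * (y.sum() - 0.5 * (y[0] + y[-1]))

def density(g, shift=0.0):
    """f(x - shift) with f proportional to exp(-g(|x|)), normalized on the grid."""
    z_norm = integrate(np.exp(-g(np.abs(X))))
    return np.exp(-g(np.abs(X - shift))) / z_norm

def ls_em_update(z, beta, g):
    """M(z, beta) = E_{X ~ f_z}[X * tanh(0.5 * F_beta(X))] with sigma = 1."""
    F = g(np.abs(X + beta)) - g(np.abs(X - beta))
    return integrate(density(g, z) * X * np.tanh(0.5 * F))

g_gauss = lambda t: 0.5 * t**2   # Gaussian example (assumption)
g_laplace = lambda t: t          # Laplace example (assumption)

for b in (0.5, 1.0, 2.0):
    print(b, ls_em_update(b, b, g_gauss), ls_em_update(b, b, g_laplace))
```

Each printed pair should agree with the first column up to quadrature error, which is what $M(\beta,\beta)=\beta$ predicts.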

We next argue that the LS-EM update has a unique fixed point $\beta^*$ when $\beta>0$ ($-\beta^*$ is another fixed point when $\beta<0$ by symmetry). In the region where $\beta>\beta^*$, we have

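 $$\begin{aligned}
M(\beta^*,\beta)-\beta^* &= M(\beta^*,\beta)-M(\beta,\beta)+\beta-\beta^* \\
&= \frac{\partial M(z,\beta)}{\partial z}\Big|_{z\in(\beta^*,\beta)}(\beta^*-\beta)+\beta-\beta^* \\
&= (\beta-\beta^*)\left(1-\frac{\partial M(z,\beta)}{\partial z}\Big|_{z\in(\beta^*,\beta)}\right) \\
&\le \sup_{z\in(\beta^*,\beta)}\left(1-\frac{\partial M(z,\beta)}{\partial z}\right)(\beta-\beta^*).
\end{aligned} \tag{14}$$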

In the first step above, we decompose the difference using the consistency property. This allows us to apply the mean value theorem to the function $M(\cdot,\beta)$ with respect to its first argument in the second step. In the case when $\beta<\beta^*$, we can derive the following relation in a similar way:

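 $$\begin{aligned}
\beta^*-M(\beta^*,\beta) &= \beta^*-\beta+M(\beta,\beta)-M(\beta^*,\beta) \\
&= \beta^*-\beta+\frac{\partial M(z,\beta)}{\partial z}\Big|_{z\in(\beta,\beta^*)}(\beta-\beta^*) \\
&= (\beta^*-\beta)\left(1-\frac{\partial M(z,\beta)}{\partial z}\Big|_{z\in(\beta,\beta^*)}\right) \\
&\le \sup_{z\in(\beta,\beta^*)}\left(1-\frac{\partial M(z,\beta)}{\partial z}\right)(\beta^*-\beta).
\end{aligned} \tag{15}$$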

In view of the above two cases, we conclude that: If ,

 $$|M(\beta^*,\beta)-\beta^*|\le\underbrace{\sup_{t\in[0,1]}\left[1-\frac{\partial M(z,\beta)}{\partial z}\Big|_{z=t\beta^*+(1-t)\beta}\right]}_{\kappa(\beta^*,\beta)}|\beta-\beta^*|. \tag{16}$$

The problem is reduced to lower bounding $\frac{\partial M(z,\beta)}{\partial z}$, where $z$ is between $\beta^*$ and $\beta$. Recall:

 $$\begin{aligned}
M(z,\beta) &= \mathbb{E}_{X\sim f_z}X\tanh(0.5F_\beta(X)) \\
&= \int_x f(x-z)\,x\tanh(0.5F_\beta(x))\,dx \\
&= \int_x \underbrace{f(x)(x+z)\tanh(0.5F_\beta(x+z))}_{h(x,z)}\,dx.
\end{aligned}$$

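In the last step, we applied the change of variable $x\mapsto x+z$. To differentiate $M(z,\beta)$ with respect to $z$, we can interchange the order of differentiation and integration: $\frac{\partial}{\partial z}\int_x h(x,z)\,dx=\int_x\frac{\partial h(x,z)}{\partial z}\,dx$ by the regularity condition. Note that $\frac{\partial h(x,z)}{\partial z}$ has the following expression: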

 $$\frac{\partial h(x,z)}{\partial z}=f(x)\left(\tanh(0.5F_\beta(x+z))+0.5(x+z)\left(\frac{\partial}{\partial x}F_\beta(x+z)\right)\tanh'(0.5F_\beta(x+z))\right).$$

Therefore,

 $$\frac{\partial M(z,\beta)}{\partial z}=\underbrace{\mathbb{E}_{X\sim f_z}\tanh(0.5F_\beta(X))}_{T_1}+\underbrace{\mathbb{E}_{X\sim f_z}\left[0.5XF'_\beta(X)\tanh'(0.5F_\beta(X))\right]}_{T_2}.$$

From Lemma C.2, we see that $T_2\ge 0$, thus a lower bound for $\frac{\partial M(z,\beta)}{\partial z}$ follows:

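 $$\frac{\partial M(z,\beta)}{\partial z}\ge\mathbb{E}_{X\sim f_z}\tanh(0.5F_\beta(X))\ge\mathbb{E}_{X\sim f_{\min(\beta,\beta^*)}}\tanh\left(0.5F_{\min(\beta,\beta^*)}(X)\right), \tag{17}$$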

where (17) holds since $\mathbb{E}_{X\sim f_z}\tanh(0.5F_\beta(X))$ increases with $z$ and $\beta$, which is also established in Lemma C.2.
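The decomposition of the derivative into $T_1+T_2$ can be sanity-checked against a finite difference. The sketch below does this for the Gaussian case only, which is an assumption made for illustration: with $g(t)=t^2/2$ and $\sigma=1$ one gets $F_\beta(x)=2\beta x$, $F'_\beta(x)=2\beta$ and $\tanh'=1-\tanh^2$. The grid, step size, and function names are likewise choices made here.

```python
import numpy as np

X = np.linspace(-40.0, 40.0, 160001)   # uniform integration grid
H = X[1] - X[0]

def integrate(y):
    """Trapezoidal rule on the uniform grid X."""
    return H * (y.sum() - 0.5 * (y[0] + y[-1]))

def gauss_pdf(shift):
    """Standard Gaussian density centered at `shift`."""
    return np.exp(-0.5 * (X - shift) ** 2) / np.sqrt(2.0 * np.pi)

def M(z, beta):
    # Gaussian case: 0.5 * F_beta(x) = beta * x
    return integrate(gauss_pdf(z) * X * np.tanh(beta * X))

def T1(z, beta):
    return integrate(gauss_pdf(z) * np.tanh(beta * X))

def T2(z, beta):
    # 0.5 * X * F'_beta(X) * tanh'(0.5 * F_beta(X)) with F'_beta = 2 * beta
    return integrate(gauss_pdf(z) * beta * X * (1.0 - np.tanh(beta * X) ** 2))

def dM_dz_fd(z, beta, eps=1e-4):
    """Central finite difference of M in its first argument."""
    return (M(z + eps, beta) - M(z - eps, beta)) / (2.0 * eps)

z, beta = 1.0, 0.7
print(dM_dz_fd(z, beta), T1(z, beta) + T2(z, beta), T2(z, beta))
```

The first two printed numbers should agree closely, and the third should be nonnegative for positive $z$ and $\beta$, in line with the lower bound $\partial M/\partial z\ge T_1$.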

Combining inequalities (16) and (17), we conclude that

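 $$|M(\beta^*,\beta)-\beta^*|\le\mathbb{E}_{X\sim f_{\min(\beta,\beta^*)}}\left[1-\tanh\left(0.5F_{\min(\beta,\beta^*)}(X)\right)\right]|\beta-\beta^*| \tag{18}$$

by Corollary C.3. From the bound in (18), we see that $M(\beta^*,\beta)$ moves closer to $\beta^*$ whenever $\beta>0$ and $\beta\neq\beta^*$; therefore, $\beta^*$ is the unique fixed point on $(0,\infty)$. Similarly, $-\beta^*$ is the unique fixed point on $(-\infty,0)$. We have completed the proof of Theorem 4.1. ∎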
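To see the contraction in (18) at work, one can iterate the population LS-EM update numerically. The following is a minimal sketch under assumptions made here, not the authors' experiment: $d=1$, $\sigma=1$, true parameter $\beta^*=1$, initial point $0.1$, trapezoidal quadrature on a fixed grid, and two example densities (Gaussian $g(t)=t^2/2$ and Laplace $g(t)=t$). The function names are hypothetical.

```python
import numpy as np

X = np.linspace(-50.0, 50.0, 100001)   # uniform integration grid
H = X[1] - X[0]

def integrate(y):
    """Trapezoidal rule on the uniform grid X."""
    return H * (y.sum() - 0.5 * (y[0] + y[-1]))

def ls_em_update(beta_star, beta, g):
    """Population update M(beta_star, beta) for f proportional to exp(-g(|x|))."""
    unnorm = np.exp(-g(np.abs(X - beta_star)))
    f_shifted = unnorm / integrate(unnorm)          # density centered at beta_star
    F = g(np.abs(X + beta)) - g(np.abs(X - beta))   # F_beta with sigma = 1
    return integrate(f_shifted * X * np.tanh(0.5 * F))

def run_ls_em(beta_star, beta0, g, n_iter=100):
    """Iterate beta <- M(beta_star, beta) and return the whole trajectory."""
    betas = [beta0]
    for _ in range(n_iter):
        betas.append(ls_em_update(beta_star, betas[-1], g))
    return np.array(betas)

g_gauss = lambda t: 0.5 * t**2   # Gaussian example (assumption)
g_laplace = lambda t: t          # Laplace example (assumption)

err_gauss = np.abs(run_ls_em(1.0, 0.1, g_gauss) - 1.0)
err_laplace = np.abs(run_ls_em(1.0, 0.1, g_laplace) - 1.0)
print(err_gauss[:6])
print(err_laplace[:6])
```

In both examples the printed errors should shrink from one iteration to the next, which is the behavior (18) describes; the final accuracy is limited by the quadrature error of the grid rather than by the algorithm.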