1 Introduction
One important problem in statistics and machine learning is to learn a finite mixture of distributions
[20, 26]. In the parametric setting where the functional form of the distribution is known, this problem amounts to estimating the parameters (e.g., mean and covariance) that specify the distribution of each mixture component. The parameter estimation problem for mixture models is inherently nonconvex, posing challenges for both computation and analysis. While many algorithms have been proposed, rigorous performance guarantees are often elusive. One exception is the Gaussian Mixture Model (GMM), for which much theoretical progress has been made in recent years. The goal of this paper is to study algorithmic guarantees for a much broader class of mixture models, namely logconcave distributions. This class includes many common distributions (familiar examples include the Gaussian, Laplace, Gamma, and Logistic distributions [3]) and is interesting from both modelling and theoretical perspectives [2, 3, 6, 14, 28, 25].

We focus on the Expectation Maximization (EM) algorithm
[12], which is one of the most popular methods for estimating mixture models. Understanding the convergence properties of EM is highly nontrivial due to the nonconvexity of the negative log-likelihood function. The work in [4] developed a general framework for establishing local convergence to the true parameter. Proving global convergence of EM is more challenging, even in the simplest setting with a mixture of two Gaussians (2GMM). The recent work in [11, 30] considered balanced 2GMM with known covariance matrix and showed for the first time that EM converges to the true location parameter from random initialization. Subsequent work established global convergence results for a mixture of two truncated Gaussians [21], two linear regressions (2MLR) [19, 18], and two one-dimensional Laplace distributions [5].

All the above results (with the exception of [5]) rely on the explicit density form and specific properties of the Gaussian distribution. In particular, under the Gaussian distribution, the M-step in the EM algorithm has a closed-form expression, which allows a straightforward analysis of the convergence behavior of the algorithm. For general logconcave distributions, however, the M-step no longer admits a closed-form solution, which poses significant challenges for analysis. To address this difficulty, we consider a modification of the standard EM algorithm, Least Squares EM (LSEM), for learning the location parameter of a mixture of two logconcave distributions. The LSEM algorithm admits a simple, explicit update rule in the M-step.

As the main result of this paper, we show that for a mixture of rotation-invariant logconcave distributions, LSEM converges to the true location parameter from a randomly initialized point. Moreover, we provide explicit convergence rates and sample complexity bounds, which depend on the signal-to-noise ratio as well as the tail properties of the distribution. As the functional form of the true density may be unknown, we further establish a robustness property of LSEM when using a misspecified density. As a special case, we show that when a Gaussian density is used, LSEM globally converges to a solution close to the true parameter whenever the variance of the true logconcave density is moderate.
Technical Contributions
We generalize the sensitivity analysis in [11] to a broad class of logconcave distributions. In the process, we demonstrate that logconcavity and rotation invariance of the distribution are the only properties required to guarantee the global convergence of LSEM. Moreover, our analysis highlights the fundamental role of an angle-decreasing property in establishing the convergence of LSEM to the true location parameter in the high-dimensional setting. Note that contraction in distance, upon which the previous convergence results were built, no longer holds globally for general logconcave mixtures.
Organization
In Section 2, we formulate the parameter estimation problem for a mixture of logconcave distributions and review related work. In Section 3, we delineate the Least Squares EM algorithm and elucidate its connection with classical EM. Analysis of the global convergence of LSEM is provided in Section 4 under the population setting. Finite-sample results are presented in Section 5, with Section 6 dedicated to the model misspecification setting. The paper concludes with a discussion of future directions in Section 7. Some details of the proofs are deferred to the Appendix.
Notations
We use and to denote scalars and vectors, respectively, and and to denote scalar and vector random variables, respectively. The th coordinate of (or ) is (or ), and the th data point is denoted by or . The Euclidean norm in is . For two vectors , we use to denote the angle between them and to denote their inner product. Finally, is a -by- identity matrix.

2 Problem Setup
In this section, we set up the model for a mixture of logconcave distributions, and discuss the corresponding location estimation problem in the context of existing work.
2.1 Data Generating Model
Let be a class of rotation invariant logconcave densities in defined as follows:
(1)  
Without loss of generality, we assume . (Note that is a convex function, as it is the composition of a convex function and a convex increasing function.) The normalization constant can be computed explicitly by with , where is the volume of the unit ball in . It can be verified that each has mean and covariance matrix . For each , we may generate a location-scale family consisting of the densities , which have mean and covariance matrix .

We assume that each data point is sampled independently from the distribution , defined as a balanced mixture of two densities from the above logconcave location-scale family:
(2) 
It is often useful to view this mixture model as an equivalent latent variable model: independently for each , an unobserved label is first generated according to
and then the data point is sampled from the corresponding mixture component, i.e., from if and from otherwise.
Since is a location-scale family, the above generative process can be equivalently written as

where can be viewed as additive noise. This equivalent representation motivates us to define the signal-to-noise ratio (SNR)
(3) 
which is used throughout this paper.
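To make the generative model and the SNR concrete, the following sketch draws samples from a balanced two-component mixture with Laplace noise. The function names, the noise scaling, and the formula for the SNR (taken to be the ratio of the signal norm to the noise scale, matching definition (3) as we understand it) are our own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture(n, theta_star, sigma, noise):
    """Draw n points from the balanced mixture: first sample a latent
    label z uniformly from {-1, +1}, then set x = z * theta_star + sigma * e."""
    d = len(theta_star)
    z = rng.choice([-1.0, 1.0], size=n)      # unobserved labels
    e = noise((n, d))                        # additive noise, unit variance
    return z[:, None] * theta_star + sigma * e

theta_star = np.array([2.0, 0.0])
sigma = 1.0
# Laplace noise, scaled so that each coordinate has unit variance
laplace = lambda size: rng.laplace(scale=1.0 / np.sqrt(2.0), size=size)
X = sample_mixture(50_000, theta_star, sigma, laplace)

eta = np.linalg.norm(theta_star) / sigma     # assumed form of the SNR (3)
```

By symmetry, the mixture has mean zero even though each component is centered away from the origin, which is what makes the location parameter nontrivial to recover.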
Examples:
Below are several familiar examples of onedimensional logconcave distributions from :

Polynomial distributions: with . When , it corresponds to the Gaussian distribution. When , it corresponds to the Laplace distribution.

Logistic distribution: .
These distributions can be generalized to higher dimensional scenarios by replacing with . In Appendix B, we provide a review of some elementary properties of logconcave distributions.
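These examples can be sanity-checked numerically: log-concavity means the second differences of the log-density are nonpositive on any grid. The parameterization of the "polynomial" family below, exp(-|x|^alpha / alpha), is our assumption, chosen so that alpha = 2 gives the Gaussian and alpha = 1 the Laplace case.

```python
import numpy as np

# Assumed form of the "polynomial" family: log p_alpha(x) = -|x|^alpha / alpha + const.
def log_density_poly(x, alpha):
    return -np.abs(x) ** alpha / alpha

def log_density_logistic(x):
    # logistic density e^{-x} / (1 + e^{-x})^2, written in a numerically stable form
    return -x - 2.0 * np.log1p(np.exp(-x))

def is_logconcave(logf, lo=-10.0, hi=10.0, n=2001):
    """Check concavity of the log-density via second differences on a grid."""
    grid = np.linspace(lo, hi, n)
    return bool(np.all(np.diff(logf(grid), 2) <= 1e-9))

assert is_logconcave(lambda x: log_density_poly(x, 2.0))  # Gaussian (alpha = 2)
assert is_logconcave(lambda x: log_density_poly(x, 1.0))  # Laplace (alpha = 1)
assert is_logconcave(log_density_logistic)                # Logistic
```

The same check fails for a log-convex shape such as exp(+|x|), which is one quick way to see the test is not vacuous.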
2.2 Location Estimation and the EM Algorithm
We assume that is known, and our goal is to estimate the location parameter from data sampled i.i.d. from the mixture distribution as defined in (2). We first consider this problem for a given logconcave family for which the base density (equivalently, ) is known. The case with an unknown is discussed in Section 6.
Since the negative log-likelihood function of the mixture (2) is nonconvex, computing the standard MLE for involves a nonconvex optimization problem. EM is a popular iterative method for computing the MLE, consisting of an expectation (E) step and a maximization (M) step. In a standard implementation of EM, the E-step computes the conditional distribution of the labels under the current estimate of , and the M-step computes a new estimate by maximizing the conditional log-likelihood based on the distribution obtained in the E-step. The LSEM algorithm we consider, described in Section 3, is a variant of the standard EM algorithm with a modified M-step.
2.3 Convergence of EM and Related Work
Despite the popularity and empirical success of the EM algorithm, our understanding of its theoretical properties is far from complete. Due to the nonconvexity of the negative log-likelihood function, EM is only guaranteed to converge to a stationary point in general [29]. Quantitative convergence results only began to emerge in recent years. The work [4] proposed a general framework for establishing the local convergence of EM when initialized near the true parameter, with applications to 2GMM, 2MLR, and regression with missing coefficients. Extensions to multiple components are considered in [31].

Beyond local convergence, it is known that the likelihood function of a GMM may have bad local optima when there are more than two components, and EM fails to find the true parameter without careful initialization [16]. Analysis of the global convergence of EM has hence focused on the two-component setting, as is done in this paper. The work in [11, 30] showed that EM converges from a random initialization for 2GMM. Subsequent work in [19, 18, 21] established similar results in other settings, most of which involve Gaussian models. An exception is [5], which proved the global convergence of EM for a mixture of two Laplace distributions and derived an explicit convergence rate, but only in the one-dimensional population (infinite-sample) setting. We also note that the work [9] studied convergence properties of Lloyd's k-means algorithm, a close relative of EM, for Gaussian mixtures. In general, properties of EM for mixtures of other distributions are much less understood; this is the gap we target in this paper.
The logconcave family we consider is a natural and flexible generalization of the Gaussian family. It includes many common distributions and has broad applications in economics [2, 3], reliability theory [6], and sampling analysis [14]; see [28, 25] for further reviews. Existing work on estimating logconcave distributions and mixtures has mostly considered the non/semi-parametric setting [28, 10, 17, 23, 13]; these methods are flexible but typically more computation- and data-intensive than the parametric approach we consider. Other approaches for learning general mixtures include spectral methods [1, 24] and tensor methods [15, 8]; the EM algorithm is often applied to refine the output of these methods.

3 The Least Squares EM Algorithm
As mentioned, the M-step in the standard EM involves maximizing the conditional log-likelihood. For GMM, the M-step is equivalent to solving a least-squares problem. For a mixture of logconcave distributions, in contrast, the M-step is equivalent to solving a convex optimization problem that does not admit a closed-form solution in general. This introduces complexity for both computation and analysis.
We instead consider Least Squares EM (LSEM), a variant of EM that solves a least-squares problem in the M-step even for non-Gaussian mixtures. To elucidate its algorithmic properties, we first consider LSEM in the population setting, where we have access to infinitely many samples from the mixture distribution . The finite-sample version is discussed in Section 5.
Each iteration of the population LSEM algorithm consists of the following two steps:

Weight E-step: Compute the conditional probability that each data point belongs to each component under the current location estimate:

(4)

Least-squares M-step: Update the location estimate via weighted least squares:

(5)

In (5), we minimize the sum of squared distances of each data point to each component's location, weighted by the conditional probability of belonging to that component. One may interpret LSEM as a soft version of the K-means algorithm: instead of assigning each point exclusively to one of the components, we assign a corresponding probability computed using the logconcave density.
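A finite-sample version of this iteration can be sketched as follows. We assume the weighted least-squares step has the closed form theta_new = mean((2w - 1) * x), with w the posterior weight of the +theta component; the function `g` stands for the negative log of the rotation-invariant base density (e.g., g(r) = r^2/2 in the Gaussian case), and all names are illustrative rather than the paper's notation.

```python
import numpy as np

def lsem_step(X, theta, g, sigma=1.0):
    """One LSEM iteration on data X (n x d) for components at +/- theta.

    E-step: posterior weight w of the +theta component for each point,
    computed stably from the negative log-density g of the scaled radius.
    M-step: weighted least squares, whose minimizer is assumed here to
    take the closed form theta_new = mean((2 w - 1) * x)."""
    a = -g(np.linalg.norm(X - theta, axis=1) / sigma)   # log-density at +theta
    b = -g(np.linalg.norm(X + theta, axis=1) / sigma)   # log-density at -theta
    w = np.exp(a - np.logaddexp(a, b))                  # soft assignment in (0, 1)
    return ((2.0 * w - 1.0)[:, None] * X).mean(axis=0)

# Example: Gaussian base density g(r) = r^2 / 2, which recovers the usual EM step.
rng = np.random.default_rng(0)
theta_star = np.array([2.0, 0.0])
z = rng.choice([-1.0, 1.0], size=20_000)
X = z[:, None] * theta_star + rng.standard_normal((20_000, 2))

theta = np.array([1.0, 0.5])
for _ in range(30):
    theta = lsem_step(X, theta, g=lambda r: r ** 2 / 2.0)
```

In the Gaussian case the weights reduce to the familiar tanh form; for other logconcave densities only `g` changes, which is exactly the "soft K-means" reading of the M-step.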
3.1 Connection to Standard EM
In contrast to LSEM, the M-step in the standard EM algorithm maximizes the weighted log-likelihood (equivalently, minimizes the weighted negative log-likelihood):

Standard M-step:
(6) 
The standard EM iteration, consisting of (4) and (6), corresponds to a minorization-maximization procedure for finding the MLE under the statistical model (2). In particular, the function above is a lower bound of the (marginal) log-likelihood function of (2), and the standard M-step (6) finds the maximizer of this lower bound. In general, this maximization can only be solved approximately. For example, the "gradient EM" algorithm considered in [4] performs one gradient ascent step on the function.
The least-squares M-step (5) admits an explicit update. Moreover, it may be viewed as an approximation to the standard M-step (6): we observe numerically (see Appendix H.1) that the LSEM update satisfies

(7)

This observation indicates that the least-squares M-step finds an improved solution (compared to the previous iterate ) for the function .
4 Analysis of Least Squares EM
In this section, we analyze the convergence behavior of the LSEM update (5) in the population setting. We first consider the one-dimensional case () in Section 4.1 and establish the global convergence of LSEM, extending the techniques in [11] for 2GMM to logconcave mixtures. In Section 4.2, we prove global convergence in the multi-dimensional case (). In this setting, the LSEM update is not contractive in , so the analysis requires a new ingredient: an angle-decreasing property.
For convenience, we introduce the shorthand ; when , we simply write . Since the integrand in (5) is an even function of , the update (5) can be simplified to an equivalent form by integrating over one component of the mixture:
(8) 
Throughout the section, we refer to the technical conditions permitting the interchange of differentiation and integration as the regularity condition. This condition is usually satisfied by logconcave distributions — a detailed discussion is provided in Appendix E.
4.1 One-Dimensional Case ()
For one-dimensional logconcave mixtures, the behavior of LSEM is similar to that of the EM algorithm for 2GMM: there exist only three fixed points, , , and , among which is non-attractive. Consequently, LSEM converges to the true parameter ( or ) from any nonzero initial point . This is established in the following theorem.
Theorem 4.1 (Global Convergence, 1D).
Suppose that satisfies the regularity condition. The LSEM update (5), , has exactly three fixed points: , and . Moreover, the following one-step bound holds:
where the contraction factor
satisfies when .
We prove this theorem in Appendix C.1. The crucial property used in the proof is the selfconsistency of the LSEM update (5), namely for all . This property allows us to extend the sensitivity analysis technique for 2GMM to general logconcave distributions.
It can be further shown that the contraction factor becomes smaller as the iterate approaches the true (see Lemma C.2). We thus obtain the following corollary on global convergence at a geometric rate. Without loss of generality, we assume .
Corollary 4.2 (step Convergence Rate, 1D).
Suppose that satisfies the regularity condition. Let denote the output of LSEM after iterations, starting from . The following holds:
If is in or , running LSEM for iterations outputs a solution in . In addition, if is in , running LSEM for iterations outputs an close estimate of , where is a constant depending only on and the SNR .
Special cases
We provide explicit convergence rates for mixtures of some common logconcave distributions. Again, we assume and without loss of generality, and set .

Gaussian: and

Laplace: and

Logistic: and .
See Appendix C.2 for the proofs of the above results. Note that the convergence rate depends on the signal-to-noise ratio as well as the asymptotic growth rate of the log-density function . In the above examples, , where for the Laplace and Logistic distributions, and for the Gaussian distribution.
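The one-dimensional population picture (three fixed points and contraction toward the true parameter) can be checked numerically. The sketch below evaluates an assumed form of the population map, f(theta) = E[(2 w_theta(X) - 1) X], by quadrature for a Laplace-type mixture; the closed form of the map and the scaling conventions are our assumptions, not the paper's exact formulas.

```python
import numpy as np
from scipy.integrate import quad

def population_map(neg_log_p, theta_star, sigma=1.0):
    """Assumed population LSEM map f(theta) = E[(2 w_theta(X) - 1) X],
    with X drawn from the balanced 1-D mixture at +/- theta_star."""
    p = lambda x: np.exp(-neg_log_p(np.abs(x) / sigma))      # unnormalized density
    mix = lambda x: 0.5 * (p(x - theta_star) + p(x + theta_star))

    def integrand(x, theta):
        w = p(x - theta) / (p(x - theta) + p(x + theta))     # soft weight
        return (2.0 * w - 1.0) * x * mix(x)

    zmass, _ = quad(mix, -40, 40, limit=400)                 # normalization
    return lambda theta: quad(integrand, -40, 40, args=(theta,),
                              limit=400)[0] / zmass

f = population_map(neg_log_p=lambda r: r, theta_star=2.0)    # Laplace-type tail

assert abs(f(0.0)) < 1e-8          # 0 is a (non-attractive) fixed point
assert abs(f(2.0) - 2.0) < 1e-4    # self-consistency: f(theta*) = theta*

theta = 0.5                         # iterate from a small nonzero start
for _ in range(40):
    theta = f(theta)
```

With these conventions, self-consistency holds exactly: at theta = theta*, the weighted integrand collapses to half the difference of the two shifted densities, whose first moment is theta* times the total mass.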
4.2 High-Dimensional Case ()
Extension to higher dimensions is more challenging for logconcave mixtures than for Gaussian mixtures. Unlike the Gaussian case, a logconcave distribution with a diagonal covariance matrix need not have independent coordinates. A more severe challenge arises because LSEM is not contractive in distance to the true parameter for general logconcave mixtures. This phenomenon, established in the lemma below, stands in sharp contrast to the Gaussian mixture problem.
Lemma 4.3 (Noncontraction in ).
Consider a logconcave density of the form with . When , is the only fixed point of LSEM in the direction orthogonal to . When , there exists a fixed point other than in the orthogonal direction. Consequently, when , there exists such that .
We prove Lemma 4.3 in Appendix D.3. The lemma shows that it is fundamentally impossible to prove global convergence of LSEM solely based on distance, which was the approach taken in [11] for Gaussian mixtures.
Despite the above challenges, we show affirmatively that LSEM converges globally to for mixtures of rotationinvariant logconcave distributions, as long as the initial iterate is not orthogonal to (a measure zero set).
As the first step, we use rotation invariance to show that the LSEM iterates stay in a two-dimensional subspace. This is done in the following lemma, with proof in Appendix D.1.
Lemma 4.4 (LSEM is 2Dimensional).
The LSEM update satisfies: . Moreover, if or , then .
We next establish the asymptotic global convergence property of LSEM.
Theorem 4.5 (Global Convergence, Dimensional).
Suppose that satisfies the regularity condition. The LSEM algorithm converges to from any randomly initialized point that is not orthogonal to .
We prove the theorem using a sensitivity analysis that establishes a decrease in angle rather than in distance to the true parameter. The proof does not depend on the explicit form of the density, only on logconcavity and rotation invariance. We sketch the main ideas of the proof below, deferring the details to Appendix D.2.
Proof Sketch.
Let be the initial point that is not orthogonal to . Without loss of generality, we assume . Consequently, all the future iterates satisfy (see Lemma D.3).
If is in the span of (i.e., parallels ), Lemma D.2 ensures that the iterates remain in the direction of and converge to . On the other hand, if is not in the span of , we make use of the following two key properties of the LSEM update :

Angle-Decreasing Property (Lemma D.1): Whenever , the LSEM update strictly decreases the iterate's angle toward , i.e., ;

Local Contraction Region (Corollary D.6): There is a local region around such that if any iterate falls in that region, all future iterates remain in that region.
Since the sequence of LSEM iterates is bounded, it must have accumulation points. Using the angle-decreasing property and the continuity of in the second variable , we show that all accumulation points must lie in the direction of . In view of the dynamics of the one-dimensional case (Theorem 4.1), we can further show that the set of accumulation points must fall into one of the following three possibilities: , , or . Below we argue that and are impossible, by contradiction.

If is the set of accumulation points, the sequence of nonzero iterates would converge to and stay in a neighborhood of after some time ; in this case, Lemma D.7 states that the norm of the iterates is bounded away from zero in the limit, and hence they cannot converge to .

If is the set of accumulation points, then there is at least one iterate in the local region around ; by the local contraction region property above, all future iterates remain close to . Therefore, cannot be another accumulation point.
Finally, we conclude that is the only accumulation point, to which LSEM converges. ∎
5 Finite Sample Analysis
In this section, we consider the finite-sample scenario, where we are given data points sampled i.i.d. from . Using the equivalent expression (8) for the population LSEM update and replacing the expectation with the sample average, we obtain the finite-sample LSEM update (this expression is for analytical purposes only; to actually implement LSEM, we use samples from the mixture distribution , which is equivalent to (9)):
(9) 
One approach to extending the population results of Section 4 to this case is to couple the population update with the finite-sample update . To this end, we use the fact that logconcave distributions are automatically subexponential (see Lemma F.2), so the random variables are i.i.d. subexponential for each coordinate . Therefore, the concentration bound holds, and we expect the convergence properties of the population LSEM to carry over to the finite-sample case, modulo a statistical error of .
The above argument is made precise in the following proposition for the one-dimensional case, which is proved in Appendix F.1.
Proposition 5.1 (1d Finite Sample).
Suppose the density function satisfies the regularity condition. With being the current estimate, the finitesample LSEM update (9) satisfies the following bound with probability at least :
(10) 
where is the contraction factor defined in Theorem 4.1 and is the Orlicz norm (i.e., the subexponential parameter) of a random variable with density .
Using Proposition 5.1, we further deduce the global convergence of LSEM in the finite sample case, which parallels the population result in Corollary 4.2. We develop this result assuming sample splitting, i.e., each iteration uses a fresh, independent set of samples. This assumption is standard in finitesample analysis of EM [4, 31, 11, 30, 19, 18]. In this setting, we establish the following quantitative convergence guarantee for LSEM initialized at any nonzero .
Without loss of generality, let . The convergence has two stages. In the first stage, the LSEM iterates enter a local neighborhood around , regardless of whether is close to or far from . This is the content of the result below.
Proposition 5.2 (First Stage: Escape from 0 and ).
Suppose the initial point is either close to (e.g., ) or far away from (e.g., ). After iterations, with fresh samples per iteration, LSEM outputs a solution with probability at least .
Within this local neighborhood, the LSEM iterates converge to geometrically, up to a statistical error determined by the sample size. This second stage convergence result is given below.
Proposition 5.3 (Second Stage: Local Convergence).
The following holds for any . Suppose . After iterations, with fresh samples per iteration, LSEM outputs a solution satisfying with probability at least .
Next, we parse the above results in the special cases of the Gaussian, Laplace and Logistic distributions, assuming that for simplicity. Accordingly, . In Section 4.1 we showed that , where is the growth rate of the log-density . Consequently, the first stage requires iterations with samples per iteration, and the second stage requires iterations with samples per iteration. We thus obtain better iteration and sample complexities with a larger (larger separation between the components) and a larger (lighter tails of the components).
In contrast, in the low SNR regime with , the sample complexity actually becomes worse for a larger (lighter tails). Indeed, low SNR means that two components are close in location when . If their tails are lighter, then it becomes more likely that the mixture density has a unique mode at 0 instead of two modes at . In this case, the mixture problem becomes harder as it is more difficult to distinguish between the two components.
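The two-stage, sample-splitting procedure behind Propositions 5.2 and 5.3 can be sketched in the Gaussian special case, where the soft weights reduce to the tanh form 2w - 1 = tanh(<theta, x> / sigma^2). Drawing a fresh, independent batch at every iteration mirrors the sample-splitting assumption; the function names are ours.

```python
import numpy as np

rng = np.random.default_rng(1)

def finite_sample_step(X, theta, sigma=1.0):
    # Gaussian special case of the finite-sample update (9):
    # 2 * w_theta(x) - 1 = tanh(<theta, x> / sigma^2)
    s = np.tanh(X @ theta / sigma ** 2)
    return (s[:, None] * X).mean(axis=0)

def lsem_sample_splitting(theta0, theta_star, sigma, n_iter, n_per_iter):
    """Run LSEM with a fresh, independent batch at every iteration."""
    theta = np.asarray(theta0, dtype=float)
    d = len(theta_star)
    for _ in range(n_iter):
        z = rng.choice([-1.0, 1.0], size=n_per_iter)
        X = z[:, None] * theta_star + sigma * rng.standard_normal((n_per_iter, d))
        theta = finite_sample_step(X, theta, sigma)
    return theta

theta_star = np.array([2.0, 0.0])
theta_hat = lsem_sample_splitting([0.5, 0.3], theta_star, sigma=1.0,
                                  n_iter=25, n_per_iter=5_000)
```

After the burn-in iterations, the iterates hover around the true parameter at the statistical-error scale set by the per-iteration sample size, which is exactly the two-stage behavior described above.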
In the higher-dimensional setting, we can similarly show coupling in (i.e., bounding ) via subexponential concentration. However, extending the convergence results above to is more subtle, due to the issue of non-contraction (see Lemma 4.3). Addressing this issue would require coupling in a different metric (e.g., in angle; see [19, 30]); we leave this to future work.
6 Robustness Under Model Misspecification
In practice, it is sometimes difficult to know a priori the exact parametric form of a logconcave distribution that generates the data. This motivates us to consider the following scenario: the data is from the mixture in (2) with a true logconcave distribution and unknown location parameter , but we run LSEM assuming some other logconcave distribution . Using the same symmetry argument as in deriving (8), we obtain the following expression for the misspecified LSEM update in the population case:
(11) 
where .
Multiple properties of the LSEM update are preserved in the misspecified setting. In particular, using the same approach as in Lemma 4.4 and Lemma D.1, we can show that the misspecified LSEM update is also a two-dimensional object and satisfies the same strict angle-decreasing property . Therefore, to study the convergence behavior of misspecified LSEM, it suffices to understand the one-dimensional case (i.e., along the direction).
We provide results focusing on the setting in which is Gaussian; that is, we fit a Gaussian mixture to a true mixture of logconcave distributions. In this setting, we can show that misspecified LSEM has only three fixed points (Lemma G.1). Moreover, we can bound the distance between and the true , thereby establishing the following convergence result:
Proposition 6.1 (Fit with 2GMM).
Under the above one-dimensional setting with Gaussian , the following holds for some absolute constant : If , then the LSEM algorithm with a nonzero initialization point converges to a solution satisfying and
We prove this proposition in Appendix G.1. The proposition establishes the robustness of LSEM: even in the misspecified setting, LSEM still converges globally. Moreover, when the SNR is high (i.e., small noise level ), the final estimation error is small and scales linearly with .
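The robustness statement can be probed empirically: below we generate one-dimensional data from a Laplace mixture but run LSEM with Gaussian weights (the tanh form). The noise scaling and the tanh form of the misspecified update are our assumptions; the point is only that the iteration still converges from a nonzero start to a point near the true location.

```python
import numpy as np

rng = np.random.default_rng(2)

theta_star, sigma, n = 2.0, 1.0, 200_000
z = rng.choice([-1.0, 1.0], size=n)
# true noise: Laplace, scaled to unit variance (an illustrative choice)
X = z * theta_star + sigma * rng.laplace(scale=1.0 / np.sqrt(2.0), size=n)

theta = 0.5                         # any nonzero initialization
for _ in range(50):
    # misspecified E-step: Gaussian weights, 2 * w - 1 = tanh(theta * x / sigma^2)
    theta = np.mean(np.tanh(theta * X / sigma ** 2) * X)

error = abs(theta - theta_star)     # small residual bias plus sampling error
```

At this moderate noise level the fixed point of the misspecified iteration sits close to the true location, consistent with the high-SNR regime of Proposition 6.1.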
7 Conclusion
In this paper, we have established the global convergence of the Least Squares EM algorithm for a mixture of two logconcave densities. Rotation invariance of the density is the only additional requirement for our theoretical guarantee. An immediate future direction is to establish quantitative global convergence guarantees in high dimensions, for both the population and finite-sample cases, which would require generalizing the angle convergence property in [19] to logconcave distributions. It is also of interest to relax the rotation invariance assumption (as many interesting logconcave distributions are skewed) and to consider mixtures with multiple components.
Acknowledgement
W. Qian and Y. Chen are partially supported by NSF CRII award 1657420 and grant 1704828. Y. Zhang is supported by NSF award 1740822.
References

[1] Dimitris Achlioptas and Frank McSherry. On spectral learning of mixtures of distributions. In International Conference on Computational Learning Theory, pages 458–469. Springer, 2005.
[2] Mark Yuying An. Log-concave probability distributions: Theory and statistical testing. Duke University Dept of Economics Working Paper, (9503), 1997.
[3] Mark Bagnoli and Ted Bergstrom. Log-concave probability and its applications. Economic Theory, 26(2):445–469, 2005.
[4] Sivaraman Balakrishnan, Martin J. Wainwright, and Bin Yu. Statistical guarantees for the EM algorithm: From population to sample-based analysis. The Annals of Statistics, 45(1):77–120, 2017.
[5] Babak Barazandeh and Meisam Razaviyayn. On the behavior of the expectation-maximization algorithm for mixture models. In 2018 IEEE Global Conference on Signal and Information Processing (GlobalSIP), pages 61–65. IEEE, 2018.
[6] Richard E. Barlow and Frank Proschan. Statistical theory of reliability and life testing: Probability models. Technical report, Florida State University, Tallahassee, 1975.
[7] Patrick Billingsley. Probability and Measure. John Wiley & Sons, 2008.
[8] Arun Tejasvi Chaganty and Percy Liang. Spectral experts for estimating mixtures of linear regressions. In International Conference on Machine Learning, pages 1040–1048, 2013.
[9] Kamalika Chaudhuri, Sanjoy Dasgupta, and Andrea Vattani. Learning mixtures of Gaussians using the k-means algorithm. arXiv preprint arXiv:0912.0086, 2009.
[10] Madeleine Cule and Richard Samworth. Theoretical properties of the log-concave maximum likelihood estimator of a multidimensional density. Electronic Journal of Statistics, 4:254–270, 2010.
[11] Constantinos Daskalakis, Christos Tzamos, and Manolis Zampetakis. Ten steps of EM suffice for mixtures of two Gaussians. arXiv preprint arXiv:1609.00368, 2016.
[12] Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22, 1977.
[13] Ilias Diakonikolas, Anastasios Sidiropoulos, and Alistair Stewart. A polynomial time algorithm for maximum likelihood estimation of multivariate log-concave densities. arXiv preprint arXiv:1812.05524, 2018.
[14] Walter R. Gilks and Pascal Wild. Adaptive rejection sampling for Gibbs sampling. Journal of the Royal Statistical Society: Series C (Applied Statistics), 41(2):337–348, 1992.
[15] Daniel Hsu and Sham M. Kakade. Learning mixtures of spherical Gaussians: Moment methods and spectral decompositions. In Proceedings of the 4th Conference on Innovations in Theoretical Computer Science, pages 11–20. ACM, 2013.
[16] Chi Jin, Yuchen Zhang, Sivaraman Balakrishnan, Martin J. Wainwright, and Michael I. Jordan. Local maxima in the likelihood of Gaussian mixture models: Structural results and algorithmic consequences. In Advances in Neural Information Processing Systems, pages 4116–4124, 2016.
[17] Geurt Jongbloed. The iterative convex minorant algorithm for nonparametric estimation. Journal of Computational and Graphical Statistics, 7(3):310–321, 1998.
[18] Jason M. Klusowski, Dana Yang, and W. D. Brinda. Estimating the coefficients of a mixture of two linear regressions by expectation maximization. IEEE Transactions on Information Theory, 2019.
[19] Jeongyeol Kwon, Wei Qian, Constantine Caramanis, Yudong Chen, and Damek Davis. Global convergence of EM algorithm for mixtures of two component linear regression. arXiv preprint arXiv:1810.05752, 2018.
[20] Bruce G. Lindsay. Mixture models: Theory, geometry and applications. In NSF-CBMS Regional Conference Series in Probability and Statistics, pages i–163. JSTOR, 1995.
[21] Sai Ganesh Nagarajan and Ioannis Panageas. On the convergence of EM for truncated mixtures of two Gaussians. arXiv preprint arXiv:1902.06958, 2019.
[22] Nathan Ross. Fundamentals of Stein's method. Probability Surveys, 8:210–293, 2011.
[23] Kaspar Rufibach. Log-concave density estimation and bump hunting for IID observations. PhD thesis, 2006.
[24] Sanjeev Arora and Ravi Kannan. Learning mixtures of arbitrary Gaussians. In Proceedings of the Thirty-Third Annual ACM Symposium on Theory of Computing, pages 247–257. ACM, 2001.
[25] Adrien Saumard and Jon A. Wellner. Log-concavity and strong log-concavity: A review. Statistics Surveys, 8:45–114, 2014.
[26] D. Michael Titterington, Adrian F. M. Smith, and Udi E. Makov. Statistical Analysis of Finite Mixture Distributions. Wiley, 1985.
[27] Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.
[28] Guenther Walther. Detecting the presence of mixing with multiscale maximum likelihood. Journal of the American Statistical Association, 97(458):508–513, 2002.
[29] C. F. Jeff Wu. On the convergence properties of the EM algorithm. The Annals of Statistics, 11(1):95–103, 1983.
[30] Ji Xu, Daniel J. Hsu, and Arian Maleki. Global analysis of expectation maximization for mixtures of two Gaussians. In Advances in Neural Information Processing Systems, pages 2676–2684, 2016.
[31] Bowei Yan, Mingzhang Yin, and Purnamrita Sarkar. Convergence of gradient EM on multi-component mixture of Gaussians. In Advances in Neural Information Processing Systems, pages 6956–6966, 2017.
Appendix A Additional Notations for Appendix
We use to denote the unit vector of , and to denote a vector orthogonal to . is the th standard basis vector.
Appendix B Elementary Properties of Logconcave Distributions
A function is logconcave if it satisfies:
for every and . Equivalently, is a concave function. We consider logconcave distributions that further satisfy: . The following is a classical result for logconcave distributions, which says that logconcavity is preserved under marginalization and convolution.
Theorem B.1.
All marginals, as well as the density function, of a logconcave distribution are logconcave. The convolution of two logconcave distributions is again a logconcave distribution.
A logconcave distribution on has the following monotone likelihood ratio property:
Proposition B.2.
A density function on is logconcave if and only if the translation family has a monotone likelihood ratio: for every , the ratio is a monotone nondecreasing function of .
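Proposition B.2 can be illustrated numerically for the Laplace density: the log of the likelihood ratio of two shifted copies is piecewise linear (constant, then increasing, then constant), hence monotone nondecreasing. The density and the chosen shifts below are just examples.

```python
import numpy as np

def laplace_logpdf(x):
    return -np.abs(x)          # log-density up to an additive constant

x = np.linspace(-5.0, 5.0, 1001)
theta1, theta2 = -0.7, 0.7     # any pair of shifts with theta1 < theta2

# log of the likelihood ratio f(x - theta2) / f(x - theta1)
log_ratio = laplace_logpdf(x - theta2) - laplace_logpdf(x - theta1)

# monotone nondecreasing in x, as the monotone likelihood ratio property predicts
assert np.all(np.diff(log_ratio) >= -1e-12)
```

Repeating the same check with a non-logconcave density would produce a non-monotone log-ratio, which is the "only if" direction of the proposition.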
Furthermore, logconcave distributions have finite moments of all orders.
Lemma B.3.
For a rotation invariant logconcave density: , all the moments exist.
Proof.
It suffices to show that . By the rotation invariant property, we need to show:
(12) 
Note that is the marginal distribution and is thus logconcave by Theorem B.1. The problem is now reduced to showing that a one-dimensional symmetric logconcave distribution has finite moments. By convexity,
(13) 
for some and . In particular, we have shown that there exist such that . Therefore,
We conclude that all the moments exist. ∎
Appendix C Analysis for
In this section, we prove the convergence results for . In particular, the proof of Theorem 4.1 is presented in Section C.1, and in Section C.2 we discuss the convergence rates for some explicit examples of logconcave distributions.
C.1 Proof of Theorem 4.1
We recall the shorthand notation:
When , we abbreviate as . For readability, we restate the theorem here:
See 4.1
Proof.
Without loss of generality, . When , one verifies that , therefore is a trivial fixed point. Without loss of generality, we assume (in the case for , replace with .) By scaling , and , we can further assume that in the following analysis.
We first establish the consistency property of the LSEM update: for all . This follows from the algebra:
where the last step holds since
is an odd function. Consequently, the integrals and vanish.

We next argue that the LSEM update has a unique fixed point when ( is another fixed point when by symmetry). In the region where , we have
(14) 
In the first step above, we decompose the difference using the consistency property. This allows us to apply the intermediate value theorem to the function with respect to its first argument in the second step. In the case when , we can derive the following relation in a similar way:
(15) 
In view of the above two cases, we conclude that: If ,
(16) 
The problem is reduced to lower-bounding , where is between and . Recall:
In the last step, we applied a change of variable for the term . To differentiate with respect to , we can interchange the order of differentiation and integration by the regularity condition. Note that has the following expression:
Therefore,
From Lemma C.2, we see that , thus a lower bound for follows:
(17) 
where (17) holds since increases with and , which is also established in Lemma C.2.