1 Introduction
Mixtures of linear regressions are useful for modeling different linear relationships between input and response variables across several unobserved heterogeneous groups in a population. First proposed by [24] as a generalization of “switching regressions”, this model has found broad applications in areas such as plant science [28], musical perception theory [11, 30], and educational policy [16]. In this paper, we consider estimating the model parameters in a symmetric two-component mixture of linear regressions. Toward a theoretical understanding of this model, suppose we observe i.i.d. data $(x_i, y_i)$, $i = 1, \dots, n$, where
(1.1) $\quad y_i = R_i \langle x_i, \theta^* \rangle + \varepsilon_i,$
$x_i \sim N(0, I_d)$, $\varepsilon_i \sim N(0, \sigma^2)$, $R_i \sim \mathrm{Uniform}\{-1, +1\}$, and $x_i$, $\varepsilon_i$, and $R_i$ are independent of each other. In other words, each predictor variable is Gaussian, and the response is centered at either the $+\theta^*$ or $-\theta^*$ linear combination of the predictor. The two classes are equally probable, and the label of each observation is unknown. We seek to estimate $\theta^*$ (or $-\theta^*$, which produces the same model distribution). The likelihood function of the model,
$$\ell_n(\theta) = \frac{1}{n} \sum_{i=1}^n \log\left[ \tfrac{1}{2}\phi_\sigma\big(y_i - \langle x_i, \theta \rangle\big) + \tfrac{1}{2}\phi_\sigma\big(y_i + \langle x_i, \theta \rangle\big) \right],$$
where $\phi_\sigma$ is the $N(0, \sigma^2)$ density, is a multidimensional, multimodal (it has many spurious local maxima), nonconvex objective function, and hence direct maximization (e.g., grid search) is intractable. Even the population likelihood (in the infinite data setting) has global maxima at $\theta^*$ and $-\theta^*$, and a local minimum at the zero vector. Given these computational concerns, other less expensive methods have been used to estimate the model coefficients. For example, mixtures of linear regressions can be interpreted as a particular instance of subspace clustering, since each regressor / regressand pair $(x_i, y_i)$ lies in (or, with noise, near) one of the two $d$-dimensional subspaces of $\mathbb{R}^{d+1}$ determined by the model parameter vectors ($\theta^*$ and $-\theta^*$). When the covariates and errors are Gaussian, algebro-geometric and probabilistic interpretations of PCA [29, 27] motivate related clustering schemes, since there is an inherent geometric aspect to such mixture models.
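As a concrete illustration, data from the model just described can be simulated in a few lines. The sketch below is a minimal illustration; the helper name `sample_mixture` and the specific parameter values are ours, not the paper's.

```python
import numpy as np

def sample_mixture(n, theta_star, sigma, rng=None):
    """Draw n samples from the symmetric two-component mixture of linear
    regressions: y = R * <x, theta*> + eps, with x ~ N(0, I_d),
    eps ~ N(0, sigma^2), and R uniform on {-1, +1}, all independent."""
    rng = np.random.default_rng(rng)
    d = len(theta_star)
    X = rng.standard_normal((n, d))        # Gaussian predictors
    R = rng.choice([-1.0, 1.0], size=n)    # hidden (unobserved) labels
    eps = sigma * rng.standard_normal(n)   # additive Gaussian noise
    y = R * (X @ theta_star) + eps
    return X, y

X, y = sample_mixture(1000, np.array([1.0, -2.0]), sigma=0.5, rng=0)
```

Note that the marginal distribution of $y_i$ is symmetric about zero, reflecting the fact that the labels are equally probable.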
Another competitor is the Expectation-Maximization (EM) algorithm, which has been shown to have desirable empirical performance in various simulation studies [11], [30], [18]. Introduced in a seminal paper of Dempster, Laird, and Rubin [12], the EM algorithm is a widely used technique for parameter estimation, with common applications in latent variable models (e.g., mixture models) and incomplete-data problems (e.g., corrupted or missing data) [2]. It is an iterative procedure that monotonically increases the likelihood [12, Theorem 1]. When the likelihood is not concave, it is well known that EM can converge to a non-global optimum [31, page 97]. However, recent work has sidestepped the question of whether EM reaches the likelihood maximizer, working instead directly toward statistical guarantees on its loss. For certain well-specified models, these explorations have identified regions of local contractivity of the EM operator near the true parameter so that, when initialized properly, the EM iterates approach the true parameter with high probability.
This line of research was spurred by [1], which established general conditions under which a ball centered at the true parameter is a basin of attraction for the population version of the EM operator. For a large enough sample size, the difference (in that ball) between the sample EM operator and the population EM operator can be bounded such that the EM estimate approaches the true parameter with high probability. That bound is the sum of two terms with distinct interpretations. There is an algorithmic convergence term $\kappa^t \|\theta^0 - \theta^*\|$ for initial guess $\theta^0$, truth $\theta^*$, and some modulus of contraction $\kappa \in (0, 1)$; this comes from the analysis of the population EM operator. The second term captures statistical convergence and is proportional to the supremum norm of the difference between the population and sample EM operators, $M$ and $M_n$, respectively. This result is also shown for a “sample-splitting” version of EM, where the sample is partitioned into batches and each batch governs a single step of the algorithm.
Our purpose here is to follow up on the analysis of [1] by proving a larger basin of attraction for the mixture of two linear models and by establishing an exact probabilistic bound on the error of the sample-splitting EM estimate when the initial guess falls in the specified region. In particular, we show that:

- The EM algorithm converges to the target parameter vector $\theta^*$ when it is initialized in a cone (defined in terms of the cosine similarity between the initial guess $\theta^0$ and the target model parameter $\theta^*$).
- The EM algorithm can fail to converge to $\theta^*$ if the cosine similarity is too small.
In related works, typically some variant of the mean value theorem is employed to establish contractivity toward the true parameter, and the rate of geometric decay is then determined by relying heavily on the fact that the initial guess belongs to a bounded set and is not too far from the target parameter vector (i.e., lies in a ball centered at the target parameter vector). Our technique relies on Stein’s Lemma, which allows us to reduce the problem to the two-dimensional case and exploit certain monotonicity properties of the population EM operator. Such methods allow one to be very careful and explicit in the analysis and more cleanly reveal the role of the initial conditions. These results cannot be deduced from preexisting works (such as [1]), even by sharpening their analysis. Our improvements are not solely in terms of constants. Indeed, we will show that as long as the cosine angle between the initial guess and the target parameter vector (i.e., their degree of alignment) is sufficiently large, the EM algorithm converges to the target parameter vector $\theta^*$. In particular, the norm of the initial guess can be arbitrarily large, provided the cosine angle condition is met.
In the machine learning community, mixtures of linear regressions are known as Hierarchical Mixtures of Experts (HME) and, there, the EM algorithm has also been employed [20]. The mixtures of linear regressions problem has also drawn recent attention from other scholars (e.g., [9, 8, 33, 7, 34, 25, 22]), although none of them have attempted to sharpen the analysis of the EM algorithm in the sense that many works still require initialization in a small ball around the target parameter vector. For example, the general case with multiple components was considered in [34], but initialization is still required to be in a ball around each of the true component coefficient vectors.

This paper is organized as follows. In Section 2, we explain the model and show how the population EM operator is contractive toward the true parameter on a cone in $\mathbb{R}^d$. We also show that the operator is not contractive toward the true parameter on certain regions of $\mathbb{R}^d$. We connect our problem to phase retrieval in Section 3 and borrow preexisting techniques to find a good initial guess in Section 4. Section 5 looks at the behavior of the sample-splitting EM operator in this cone and states our main result in the form of a high-probability bound. Section 6 and Section 7 are devoted to proving the contractivity of the population EM operator toward the target vector over a cone and proving our main result, respectively. A discussion of our findings, including evidence of the failure of the EM algorithm for poor initial guesses from a simulated experiment, is provided in Section 8. A simulation study of the EM algorithm under model misspecification is also given therein. Finally, more technical proofs are relegated to the Appendix.
2 The Empirical and Population EM Operator
The sample EM operator for estimating $\theta^*$ (see [1, page 6] for a derivation) is
(2.1) $\quad M_n(\theta) = \hat{\Sigma}^{-1} \, \frac{1}{n} \sum_{i=1}^n \left( 2 s_\sigma(y_i \langle x_i, \theta \rangle) - 1 \right) y_i x_i,$
where $s_\sigma(t) = (1 + e^{-2t/\sigma^2})^{-1}$ is a horizontally stretched logistic sigmoid. Here $\hat{\Sigma}^{-1}$ is the inverse of the Gram matrix $\hat{\Sigma} = \frac{1}{n} \sum_{i=1}^n x_i x_i^T$. In the limit with infinite data, the population EM operator replaces sample averages with expectations, and thus
(2.2) $\quad M(\theta) = \mathbb{E}\left[ \left( 2 s_\sigma(Y \langle X, \theta \rangle) - 1 \right) Y X \right] = \mathbb{E}\left[ \tanh\!\left( \frac{Y \langle X, \theta \rangle}{\sigma^2} \right) Y X \right],$
since $\mathbb{E}[X X^T] = I_d$.
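In code, one application of the sample EM operator eq:sampleem might look as follows. This is a minimal sketch (the helper name `em_update` is ours); it uses a tanh weight, which is equivalent to the stretched-sigmoid form under the parameterization $2 s_\sigma(t) - 1 = \tanh(t/\sigma^2)$ assumed above.

```python
import numpy as np

def em_update(theta, X, y, sigma):
    """One step of the sample EM operator for the symmetric two-component
    mixture of linear regressions: solve Gram * theta_new = weighted moment,
    with soft labels w_i = tanh(y_i <x_i, theta> / sigma^2)."""
    n = len(y)
    w = np.tanh(y * (X @ theta) / sigma**2)   # soft (signed) responsibilities
    gram = X.T @ X / n                        # empirical Gram matrix
    rhs = X.T @ (w * y) / n                   # (1/n) sum_i w_i y_i x_i
    return np.linalg.solve(gram, rhs)
```

Iterating `theta = em_update(theta, X, y, sigma)` from an initial guess implements the (non-split) EM algorithm.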
As we mentioned in the introduction, [1] showed that if the EM operator eq:sampleem is initialized in a ball around $\theta^*$ with radius proportional to $\|\theta^*\|$, then the EM algorithm converges to $\theta^*$ with high probability. It is natural to ask whether this good region of initialization can be expanded, possibly allowing for initial guesses with unbounded norm. The purpose of this paper is to relax the aforementioned ball condition of [1] and show that if the cosine angle between $\theta^*$ and the initial guess $\theta^0$ is not too small, the EM algorithm also converges. We also simplify the analysis considerably and use only elementary facts about multivariate Gaussian distributions. Our improvement is manifested in the set containment
$$\left\{ \theta : \|\theta - \theta^*\| \le \tfrac{1}{32} \|\theta^*\| \right\} \subset \left\{ \theta : \cos\angle(\theta, \theta^*) \ge \sqrt{1 - (1/32)^2} \right\},$$
since for all $\theta$ in the set on the left side,
(2.3) $\quad \cos\angle(\theta, \theta^*) = \frac{\langle \theta, \theta^* \rangle}{\|\theta\| \, \|\theta^*\|} \ge \sqrt{1 - (1/32)^2}.$
The conditions in [1, Corollary 5] require the initial guess to be at most $\|\theta^*\|/32$ away from $\theta^*$, which corresponds to $\cos\angle(\theta^0, \theta^*) \ge \sqrt{1 - (1/32)^2}$ and $\|\theta^0\| \le (33/32)\|\theta^*\|$, whereas our condition allows the norm of $\theta^0$ to be unbounded and the cosine angle to be bounded away from one.
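The ball-to-cone comparison above is elementary geometry: any $\theta$ within distance $r\|\theta^*\|$ of $\theta^*$ makes an angle with $\theta^*$ whose sine is at most $r$, so its cosine similarity to $\theta^*$ is at least $\sqrt{1 - r^2}$. A quick numerical check of this fact (with $r = 1/32$, the ball radius from [1, Corollary 5]) is sketched below; the sampling scheme is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 5, 1.0 / 32.0
theta_star = rng.standard_normal(d)
norm_star = np.linalg.norm(theta_star)

# Sample points from the ball of radius r * ||theta*|| centered at theta*
# and verify that every one of them satisfies the cosine condition.
min_cos = 1.0
for _ in range(10000):
    u = rng.standard_normal(d)
    u *= r * norm_star * rng.random() ** (1.0 / d) / np.linalg.norm(u)
    theta = theta_star + u
    cos = theta @ theta_star / (np.linalg.norm(theta) * norm_star)
    min_cos = min(min_cos, cos)

assert min_cos >= np.sqrt(1.0 - r**2)  # the ball lies inside the cone
```

The converse containment fails: the cone contains vectors of arbitrarily large norm, which is exactly the freedom our condition permits.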
Let $v_1$ be the unit vector in the direction of $\theta^*$ and let $v_2$ be the unit vector that belongs to the plane spanned by $\theta$ and $\theta^*$ and is orthogonal to $\theta^*$ (i.e., $\langle v_1, v_2 \rangle = 0$ and $v_2 \in \operatorname{span}\{\theta, \theta^*\}$). We will later show in Section 6 that $M(\theta)$ belongs to $\operatorname{span}\{\theta, \theta^*\}$, as illustrated in fig:thetas. Denote the angle between $\theta$ and $\theta^*$ by $\angle(\theta, \theta^*)$, with $\rho = \cos\angle(\theta, \theta^*)$ and $\rho > 0$. As we will see from the following results, as long as $\rho$ is not too small, $M$ is a contracting operation whose output is always closer to the truth $\theta^*$ than its input $\theta$. The next lemma allows us to derive a region of $\mathbb{R}^d$ on which $M$ is contractive toward $\theta^*$. We defer its proof until Section 6.

Lemma 2.1.
For any $\theta$ in $\mathbb{R}^d$ with $\langle \theta, \theta^* \rangle > 0$,
(2.4) 
where
(2.5) 
and
(2.6) 
If we define the input signal-to-noise ratio as $\eta = \|\theta\|/\sigma$ and the model signal-to-noise ratio (SNR) as $\eta^* = \|\theta^*\|/\sigma$, then the contractivity constant eq:gamma can be rewritten as
(2.7) 
Remark 2.1.
If $\rho$, $\eta$, and $\eta^*$ are each sufficiently large, then the constants in eq:main_bound are bounded by universal constants less than one, implying the population EM operator converges to the truth exponentially fast.
3 Relationship to Phase Retrieval
The problem of estimating the true parameter vector $\theta^*$ in a mixture of two linear regressions is related to the phase retrieval problem, where one has access to magnitude-only data according to the model
(3.1) $\quad y_i = \langle x_i, \theta^* \rangle^2 + \varepsilon_i.$
In the noiseless case, i.e., $\sigma = 0$, one can obtain the phase retrieval model from the symmetric two-component mixture of linear regressions by squaring each response variable from eq:mixreg, and vice versa by setting $y_i = R_i \sqrt{\tilde{y}_i}$, where $R_i$ is a Rademacher random variable independent of the data $(x_i, \tilde{y}_i)$. Here the sample subsets giving rise to the model parameters $\theta^*$ and $-\theta^*$ are $\{i : R_i = +1\}$ and $\{i : R_i = -1\}$, respectively. Even in the case of noise, squaring each response variable and subtracting the variance $\sigma^2$ of the error distribution yields
(3.2) $\quad y_i^2 - \sigma^2 = \langle x_i, \theta^* \rangle^2 + \xi_i,$
where $\xi_i = 2 R_i \langle x_i, \theta^* \rangle \varepsilon_i + \varepsilon_i^2 - \sigma^2$ is a mean-zero random variable with variance $4\sigma^2 \langle x_i, \theta^* \rangle^2 + 2\sigma^4$. This is essentially the phase retrieval model eq:phase with heteroskedastic errors. See also [8, Section 3.5] for a similar reduction to the “Noisy Phase Model”, where the measurement error is pre-added to the inner product and then squared, viz., $y_i = (\langle x_i, \theta^* \rangle + \varepsilon_i)^2$. Recent algorithms used to recover $\theta^*$ from such data include PhaseLift [6], PhaseMax [19, 13], PhaseLamp [15, 14], and Wirtinger flow [4, 5]
, to name a few. PhaseLift operates by solving a semidefinite relaxation of the nonconvex formulation of the phase retrieval problem. PhaseMax and PhaseLamp solve a linear program over a polytope via convex programming. Finally, Wirtinger flow is an iterative gradient-based method that requires proper initialization. Parallel to our work, [15, 14] reveal that exact recovery (when $\sigma = 0$) in PhaseMax is governed by a critical threshold [15, Theorem 3], which is measured in terms of the cosine angle between the initial guess and the target parameter vector. Analogous to our lmm:negative (which is asymptotic in the sense that $n \to \infty$), they prove that recovery can fail if this cosine angle is too small. PhaseLamp is an iterative variant of PhaseMax that allows for a smaller cosine angle criterion than the critical threshold from PhaseMax. Our setting is slightly more general than [15, 14] in that we allow for measurement error and our bounds are nonasymptotic in $n$ and $d$.

4 Initialization
thm:main_splitting_empirical below requires the initial guess to have a good inner product with $\theta^*$. But how should one initialize in practice? There is considerable literature showing the efficacy of initialization based on spectral [33], [7], [34] or Bayesian [30] methods. For example, inspired by the link eq:connection between phase retrieval and our problem, we can use the same spectral initialization method of [5] for the Wirtinger flow iterates (cf. [33] for a similar strategy). That is, set
(4.1) $\quad \lambda = \sqrt{\frac{1}{n} \sum_{i=1}^n y_i^2}$
and take $\theta^0$ equal to the eigenvector corresponding to the largest eigenvalue of
(4.2) $\quad \frac{1}{n} \sum_{i=1}^n y_i^2 \, x_i x_i^T,$
scaled so that $\|\theta^0\| = \lambda$. According to [5, Theorem 3.3], we are guaranteed that $\theta^0$ is close to $\theta^*$ with high probability, and hence by eq:innerprodbound the cosine angle and SNR conditions hold. Provided the model SNR is large enough, we will see in thm:main_splitting_empirical that this $\theta^0$ satisfies our criteria for a good initial guess. Although the joint distributions of the data under the phase retrieval model and the mixture of regressions model are not exactly the same, for large $n$ the quantities eq:lam and eq:eigen are approximately equal to the same population quantities with $\theta^*$ in place of the phase retrieval parameter.

The next lemma, proved in the Appendix, shows that the initialization conditions in rmk:contract are essentially necessary, in the sense that contractivity of $M$ toward $\theta^*$ can fail for certain initial guesses that do not meet our cosine angle criterion. In contrast, it is known [10, 32] that the population EM operator for a symmetric mixture of two Gaussians is contractive toward $\theta^*$ on the entire half-plane defined by $\langle \theta, \theta^* \rangle > 0$.¹ The disparity between the EM operators for the two models is revealed in the proof of the contractivity of $M$ toward $\theta^*$ (see Section 6). Indeed, we will see in rmk:smoothed that the population EM operator for mixtures of regressions is essentially a “stretched” version of the population EM operator for Gaussian mixtures.

¹ Note that this is the best one can hope for: if $\langle \theta, \theta^* \rangle > 0$ (resp. $\langle \theta, \theta^* \rangle = 0$), then the population EM operator is contractive toward $\theta^*$ (resp. the zero vector). Thus, unless $\langle \theta, \theta^* \rangle = 0$ (i.e., $\theta$ belongs to the hyperplane perpendicular to $\theta^*$), the population EM is contractive towards either model parameter $\theta^*$ or $-\theta^*$.
Lemma 4.1.
There is a subset of $\mathbb{R}^d$ with positive Lebesgue measure, each of whose members $\theta$ satisfies $\langle \theta, \theta^* \rangle > 0$ and $\|M(\theta) - \theta^*\| > \|\theta - \theta^*\|$.
While this result does not generally imply that the empirical iterates will fail to converge to $\theta^*$ for such initial guesses, it does suggest that difficulties may arise in this regime. Indeed, the discussion in Section 8 gives empirical evidence for this theoretical observation.
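The spectral initialization of this section can be sketched as follows. This is a minimal illustration: the helper name `spectral_init` is ours, and the scaling $\lambda = \sqrt{\frac{1}{n}\sum_i y_i^2}$ is our reading of the Wirtinger-flow-style initializer. The sign ambiguity of the eigenvector is harmless because $\theta^*$ and $-\theta^*$ parameterize the same model.

```python
import numpy as np

def spectral_init(X, y):
    """Spectral initial guess: top eigenvector of the weighted covariance
    (1/n) * sum_i y_i^2 x_i x_i^T, scaled to have norm sqrt(mean(y^2))."""
    n = len(y)
    M = (X * (y**2)[:, None]).T @ X / n   # (1/n) sum_i y_i^2 x_i x_i^T
    evals, evecs = np.linalg.eigh(M)      # eigenvalues in ascending order
    v = evecs[:, -1]                      # eigenvector of largest eigenvalue
    lam = np.sqrt(np.mean(y**2))
    return lam * v
```

Under the mixture model, $\mathbb{E}[y^2 x x^T] = 2\theta^* {\theta^*}^T + (\|\theta^*\|^2 + \sigma^2) I_d$, so the top eigenvector aligns with $\theta^*/\|\theta^*\|$ for large $n$, which is what makes this a good initial guess.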
5 Main Theorem
As in [1], we analyze a sample-splitting version of the EM algorithm, where for an allocation of $n$ samples and $T$ iterations, we divide the data into $T$ disjoint subsets of size $\lfloor n/T \rfloor$. We then perform the updates $\theta^{t+1} = M_{n/T}(\theta^t)$, using a new subset of samples to compute the update at each iteration. The advantage of sample-splitting is purely for ease of analysis. In particular, conditional on the portion of the data used to construct $\theta^t$ at iteration $t$, the distribution of $\theta^{t+1}$ depends on the remaining portion of the data only through $\theta^t$. For the next theorem, let $\eta_0 = \|\theta^0\|/\sigma$ denote the initial SNR and $\eta^* = \|\theta^*\|/\sigma$ denote the model SNR.
Theorem 1.
Let for , , and . Fix . Suppose furthermore that for some positive universal constant and positive constant . Then there exists a universal modulus of contraction and a positive universal constant such that the samplesplitting empirical EM iterates based on samples per step satisfy
with probability at least .
Note that $T$ governs the number of iterations of the EM operator; if it is too small, the algorithmic convergence term from thm:main_splitting_empirical may fail to reach the parametric rate. Hence, $T$ must scale like $\log(n/d)$.
We will prove thm:main_splitting_empirical in Section 7. The main aspect of the analysis lies in showing that the EM iterates satisfy an invariance property: each iterate remains in a set on which $M$ is contractive toward $\theta^*$. The algorithmic error is a result of repeated evaluation of the population EM operator and the contractivity of $M$ toward $\theta^*$ from lmm:main. The stochastic error is obtained from a high-probability bound on the difference between the sample and population EM operators, which is contained in the proof of [1, Corollary 5].
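The sample-splitting scheme analyzed in this section can be sketched as follows (a minimal illustration; `em_update` is a hypothetical helper implementing one step of the sample EM operator eq:sampleem on a batch).

```python
import numpy as np

def em_update(theta, X, y, sigma):
    # One sample EM step: solve Gram * theta_new = (1/m) sum_i w_i y_i x_i
    # with soft labels w_i = tanh(y_i <x_i, theta> / sigma^2).
    m = len(y)
    w = np.tanh(y * (X @ theta) / sigma**2)
    return np.linalg.solve(X.T @ X / m, X.T @ (w * y) / m)

def sample_splitting_em(theta0, X, y, sigma, T):
    """Split the n samples into T disjoint batches of size n // T and
    use a fresh batch for each EM iteration, as in the analysis of [1]."""
    n = len(y)
    m = n // T
    theta = theta0
    for t in range(T):
        batch = slice(t * m, (t + 1) * m)
        theta = em_update(theta, X[batch], y[batch], sigma)
    return theta
```

Because each update sees fresh data, the iterate at step $t+1$ is conditionally independent of the earlier batches given $\theta^t$, which is exactly the property the proof exploits.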
6 Proof of lmm:main
For $\theta$ with $\langle \theta, \theta^* \rangle > 0$, a few applications of Stein’s Lemma [26, Lemma 1] yield
(6.1) 
Letting
(6.2) 
and
(6.3) 
we see that $M(\theta)$ belongs to $\operatorname{span}\{\theta, \theta^*\}$. This is a crucial fact that we will exploit multiple times.
Observe that for any in ,
and
Specializing this to yields
The strategy for establishing contractivity of $M$ toward $\theta^*$ will be to show that the sum of the two terms above is less than $\|\theta - \theta^*\|$. This idea was used in [10] to obtain contractivity of the population EM operator for a mixture of two Gaussians. Due to the similarity of the two problems, it turns out that many of the same ideas transfer to our (more complicated) setting.
To reduce this $d$-dimensional problem to a two-dimensional problem, we first show that
where . The coefficients and are
and
This is because of the distributional equality
(6.4) 
Note further that
because they have the same moment generating function. Using this, we deduce that
(6.5) 
lmm:ABbounds implies that
and consequently,
(6.6) 
Next, we note that
Finally, define
Note that by definition of , . In fact, is the one-dimensional population EM operator for this model when and . By the self-consistency property of EM [23, page 79], . Translating this to our problem, we have that . Since , we have from lmm:derivative,
Combining this with inequality eq:perp_inequality yields eq:main_bound. This completes the proof of lmm:main.
Remark 6.1.
The function is related to the EM operator for the one-dimensional symmetric mixture of two Gaussians model . One can derive (see [21, page 11]) that the population EM operator is
Then is a “stretched” version of as seen through the identity
In light of this relationship, it is perhaps not surprising that the EM operator for the mixture of linear regressions problem also enjoys a large basin of attraction.
On the other hand, from [21, page 11], the population EM operator for the symmetric two-component mixture of Gaussians is equal to
where and .
Compare the values of and with and from eq:A and eq:B. We see that is essentially a “stretched” and “scaled” version of by the random dilation factors and . As will be seen in the proof of lmm:negative in the Appendix, this additional source of variability causes the repellent behavior of in lmm:negative.
Remark 6.2.
Recently in [3], the authors analyzed gradient descent for a single-hidden-layer convolutional neural network structure with no overlap and Gaussian input. In this setup, we observe i.i.d. data , where and and are independent of each other. The neural network has the form and the only nonzero coordinates of are in successive blocks of coordinates, equal to a fixed filter vector . One desires to minimize the risk . Interestingly, the gradient of belongs to the linear span of and , akin to our (and also to the Gaussian mixture problem [21]). This property also plays a critical role in their analysis.

7 Proof of thm:main_splitting_empirical
The first step of the proof is to show that the empirical EM operator satisfies , where is a set on which is contractive toward . In other words, the empirical EM iterates remain in a set where is closer to than its input . To this end, define the set . By rmk:contract, the stated conditions on , , and ensure that is contractive toward on and that .
Next, we use lmm:Mbounds which shows that
The fact that allows us to claim that when is large enough, , and hence . To show this, assume . That implies
(7.1) 
For the last inequality, we used the fact that for all in , which follows from eq:span and lmm:ABbounds. By eq:close and lmm:Mbounds eq:cosinea, we have that
provided and, by eq:span and lmm:ABbounds,
provided , which is positive since .
For , let be the smallest number such that for any fixed in , we have
with probability at least . Moreover, suppose is a constant so that if , then
This guarantees that . For any iteration , we have
with probability at least . Thus by a union bound and ,
with probability at least .
Hence if belongs to , then by lmm:main,
with probability at least . Solving this recursive inequality yields,
with probability at least .
Finally, by a slight modification to the proof of [1, Corollary 5] that uses from eq:cosinea1, it follows that if , then there exists a universal constant such that
with probability at least . This completes the proof of thm:main_splitting_empirical.
8 Discussion
In this paper, we showed that the empirical EM iterates converge to the true coefficients of a mixture of two linear regressions as long as the initial guess lies within a cone (see the condition on thm:main_splitting_empirical: ).
In fig:SimEM, we perform a simulation study of the EM algorithm. All entries of the covariate vectors and the noise are generated i.i.d. from a standard Gaussian distribution. We consider the estimation error plotted as a function of the cosine similarity of the initial guess at various iteration counts.
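A simulation in this spirit can be sketched as below; all parameter values here are illustrative, not those of the paper. Initial guesses that are well aligned with $\theta^*$ drive the error down after a modest number of EM iterations, while poorly aligned guesses may not.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, sigma, T = 2000, 2, 0.5, 25
theta_star = np.array([2.0, 0.0])

# Generate one data set from the symmetric mixture of two regressions.
X = rng.standard_normal((n, d))
R = rng.choice([-1.0, 1.0], size=n)
y = R * (X @ theta_star) + sigma * rng.standard_normal(n)

def run_em(theta, T):
    # Full-sample EM iterations with tanh soft labels.
    for _ in range(T):
        w = np.tanh(y * (X @ theta) / sigma**2)
        theta = np.linalg.solve(X.T @ X / n, X.T @ (w * y) / n)
    return theta

errors = {}
for cos0 in [0.05, 0.5, 0.95]:
    # Initial guess at angle arccos(cos0) to theta*, with the same norm.
    theta0 = 2.0 * np.array([cos0, np.sqrt(1 - cos0**2)])
    theta_T = run_em(theta0, T)
    errors[cos0] = min(np.linalg.norm(theta_T - theta_star),
                       np.linalg.norm(theta_T + theta_star))
```

Inspecting `errors` for the three initial cosine similarities gives a small-scale version of the comparison in fig:SimEM.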