Mixtures of linear regressions are useful for modeling different linear relationships between input and response variables across several unobserved heterogeneous groups in a population. First proposed by as a generalization of “switching regressions”, this model has found broad applications in areas such as plant science , musical perception theory [11, 30], and educational policy .
In this paper, we consider estimating the model parameters in a symmetric two component mixture of linear regressions. Towards a theoretical understanding of this model, suppose we observe data , where
, , , and , and
are independent of each other. In other words, each predictor variable is Gaussian, and the response is centered at either theor linear combination of the predictor. The two classes are equally probable, and the label of each observation is unknown. We seek to estimate (or , which produces the same model distribution).
The likelihood function of the model
where and , is a multi-dimensional, multi-modal (it has many spurious local maxima), and nonconvex objective function, and hence direct maximization (e.g., grid search) is intractable. Even the population likelihood (in the infinite data setting) has global maxima at and , and a local minimum at the zero vector. Given these computational concerns, other less expensive methods have been used to estimate the model coefficients. For example, mixtures of linear regressions can be interpreted as a particular instance of subspace clustering, since each regressor / regressand pair lies in the -dimensional subspace determined by their model parameter vectors ( and ). When the covariates and errors are Gaussian, algebro-geometric and probabilistic interpretations of PCA [29, 27] motivate related clustering schemes, since there is an inherent geometric aspect to such mixture models.
Another competitor is the Expectation-Maximization (EM) algorithm, which has been shown to have desirable empirical performance in various simulation studies , , . Introduced in a seminal paper of Dempster, Laird, and Rubin , the EM algorithm is a widely used technique for parameter estimation, with common applications in latent variable models (e.g., mixture models) and incomplete-data problems (e.g., corrupted or missing data) . It is an iterative procedure that monotonically increases the likelihood [12, Theorem 1]. When the likelihood is not concave, it is well known that EM can converge to a non-global optimum [31, page 97]. However, recent work has side-stepped the question of whether EM reaches the likelihood maximizer, instead by directly working out statistical guarantees on its loss. For certain well-specified models, these explorations have identified regions of local contractivity of the EM operator near the true parameter so that, when initialized properly, the EM iterates approach the true parameter with high probability.
This line of research was spurred by , which established general conditions for which a ball centered at the true parameter would be a basin of attraction for the population version of the EM operator. For a large enough sample size, the difference (in that ball) between the sample EM operator and the population EM operator can be bounded such that the EM estimate approaches the true parameter with high probability. That bound is the sum of two terms with distinct interpretations. There is an algorithmic convergence term for initial guess , truth , and some modulus of contraction ; this comes from the analysis of the population EM operator. The second term captures statistical convergence and is proportional to the supremum norm of , the difference between the population and sample EM operators, and , respectively. This result is also shown for a “sample-splitting” version of EM, where the sample is partitioned into batches and each batch governs a single step of the algorithm.
Our purpose here is to follow up on the analysis of  by proving a larger basin of attraction for the mixture of two linear models and by establishing an exact probabilistic bound on the error of the sample-splitting EM estimate when the initial guess falls in the specified region. In particular, we show that
The EM algorithm converges to the target parameter vector when it is initialized in a cone (defined in terms of the cosine similarity between the initial guessand the target model parameter ).
The EM algorithm can fail to converge to if the cosine similarity is too small.
In related works, typically some variant of the mean value theorem is employed to establish contractivity toward the true parameter and the rate of geometric decay is then determined by relying heavily on the fact that initial guess belongs to a bounded set and is not too far from the target parameter vector (i.e., a ball centered at the target parameter vector). Our technique relies on Stein’s Lemma, which allows us to reduce the problem to the two-dimensional case and exploit certain monotonicity properties of the population EM operator. Such methods allow one to be very careful and explicit in the analysis and more cleanly reveal the role of the initial conditions. These results cannot be deduced from preexisting works (such as ), even by sharpening their analysis. Our improvements are not solely in terms of constants. Indeed, we will show that as long as the cosine angle between the initial guess and the target parameter vector (i.e., their degree of alignment) is sufficiently large, the EM algorithm converges to the target parameter vector . In particular, the norm of the initial guess can be arbitrarily large, provided the cosine angle condition is met.
In the machine learning community, mixtures of linear regressions are known as Hierarchical Mixture of Experts (HME) and, there, the EM algorithm has also been employed. The mixtures of linear regressions problem has also drawn recent attention from other scholars (e.g., [9, 8, 33, 7, 34, 25, 22]), although none of them have attempted to sharpen the EM algorithm in the sense that many works still require initialization is a small ball around the target parameter vector. For example, the general case with multiple components was considered in , but initialization is still required to be in a ball around each of the true component coefficient vectors.
This paper is organized as follows. In Section 2, we explain the model and explain how the population EM operator is contractive toward the true parameter on a cone in . We also show that the operator is not contractive toward the true parameter on certain regions of . We connect our problem to phase retrieval in sec:phase and borrow preexisting techniques to find a good initial guess in sec:initial. Section 5 looks at the behavior of the sample-splitting EM operator in this cone and states our main result in the form of a high-probability bound. sec:proof and sec:proofthm are devoted to proving the contractivity of the population EM operator toward the target vector over a cone and proving our main result, respectively. A discussion of our findings, including evidence of the failure of the EM algorithm for poor initial guesses from a simulated experiment, is provided in sec:discussion. A simulation study of the EM algorithm under model misspecification is also given therein. Finally, more technical proofs are relegated to app:appendix.
2 The Empirical and Population EM Operator
The EM operator for estimating (see [1, page 6] for a derivation) is
where is a horizontally stretched logistic sigmoid. Here is the inverse of the Gram matrix . In the limit with infinite data, the population EM operator replaces sample averages with expectations, and thus
As we mentioned in the introduction,  showed that if the EM operator eq:sample-em is initialized in a ball around with radius proportional , then the EM algorithm converges to with high probability. It is natural to ask whether this good region of initialization can be expanded, possibly allowing for initial guesses with unbounded norm. The purpose of this paper is to relax the aforementioned ball condition of  and show that if the cosine angle between
and the initial guess is not too small, the EM algorithm also converges. We also simplify the analysis considerably and use only elementary facts about multivariate Gaussian distributions. Our improvement is manifested in the set containment
since for all in the set on the left side,
The conditions in [1, Corollary 5] require the initial guess to be at most away from , which corresponds to and , whereas our condition allows for the norm of to be unbounded and .
Let be the unit vector in the direction of and let
be the unit vector that belongs to the hyperplane spanned byand orthogonal to (i.e., and ). Let . We will later show in sec:proof that belongs to , as illustrated in fig:thetas. Denote the angle between and as , with and . As we will see from the following results, as long as is not too small, is a contracting operation that is always closer to the truth than . The next lemma allows us to derive a region of on which is contractive toward . We defer its proof until sec:proof.
For any in with ,
If we define the input signal-to-noise ratio as and model signal-to-noise ratio (SNR) as and use the fact that , then the contractivity constant eq:gamma can be rewritten as
If , , and , then is bounded by a universal constant less than and is bounded by a universal constant less than , implying the population EM operator converges to the truth exponentially fast.
3 Relationship to Phase Retrieval
The problem of estimating the true parameter vector in a mixture of two linear regressions is related to the phase retrieval problem, where one has access to magnitude-only data according to the model
In the no noise case, i.e., , one can obtain the phase retrieval model from the symmetric two component mixture of linear regressions by squaring each response variable from eq:mixreg and visa versa by setting , where is independent of the data . Here the sample subsets giving rise to the model parameters and are and
, respectively. Even in the case of noise, squaring each response variable and subtracting the varianceof the error distribution yields
is a mean zero random variable with variance. This is essentially the phase retrieval model eq:phase with heteroskedastic errors. See also [8, Section 3.5] for a similar reduction to the “Noisy Phase Model”, where the measurement error is pre-added to the inner product and then squared, viz., .
, to name a few. PhaseLift operates by solving a semi-definite relaxation of the nonconvex formulation of the phase retrieval problem. PhaseMax and PhaseLamp solve a linear program over a polytope via convex programming. Finally, Wirtinger flow is an iterative gradient-based method that requires proper initialization. Parallel to our work,[15, 14] reveal that exact recovery (when ) in PhaseMax is governed by a critical threshold [15, Theorem 3], which is measured in terms of the cosine angle between the initial guess and the target parameter vector. Analogous to our lmm:negative (which is asymptotic in the sense that ), they prove that recovery can fail is this cosine angle is too small. PhaseLamp is an iterative variant of PhaseMax that allows for a smaller cosine angle criterion than the critical threshold from PhaseMax. Our setting is slightly more general than [15, 14] in that we allow for measurement error and our bounds are non-asymptotic in and .
thm:main_splitting_empirical below requires the initial guess to have a good inner product with . But how should one initialize in practice? There is considerable literature showing the efficacy of initialization based on spectral , ,  or Bayesian  methods. For example, inspired by the link eq:connection between phase retrieval and our problem, we can use the same spectral initialization method of  for the Wirtinger flow iterates (c.f.,  for a similar strategy). That is, set
scaled so that . According to [5, Theorem 3.3], we are guaranteed that with high probability , and hence by eq:innerprodbound, and . Provided that , we will see in thm:main_splitting_empirical that this
satisfies our criteria for a good initial guess. Although the joint distributions ofand are not exactly the same, for large , , and hence eq:lam and eq:eigen are approximately equal to the same quantity with replaced by .
The next lemma, proved in app:appendix, shows that the initialization conditions in rmk:contract are essentially necessary in the sense that contractivity of toward can fail for certain initial guesses that do not meet our cosine angle criterion. In contrast, it is known [10, 32] that the population EM operator for a symmetric mixture of two Gaussians is contractive toward on the entire half-plane defined by .111Note that this is the best one can hope for: if (reps. ), then the population EM operator is contractive toward (resp. the zero vector). Thus, unless (i.e., belongs to the hyperplane perpendicular to ), the population EM is contractive towards either model parameter or . The disparity between the EM operators for the two models is revealed in the proof of the contractivity of toward (see sec:proof). Indeed, we will see in rmk:smoothed that the population EM operator for mixtures of regressions is essentially a “stretched” version of the population EM operator for Gaussian mixtures.
There is a subset of with positive Lebesgue measure, each of whose members satisfies and
While this result does not generally imply that the empirical iterates will fail to converge to for , it does suggest that difficulties may arise in this regime. Indeed, the discussion in sec:discussion gives empirical evidence for this theoretical observation.
5 Main Theorem
As in , we analyze a sample-splitting version of the EM algorithm, where for an allocation of samples and iterations, we divide the data into subsets of size . We then perform the updates , using a new subset of samples to compute at each iteration. The advantage of sample-splitting is purely for ease of analysis. In particular, conditional on the portion of data used to construct at iteration , the distribution of depends only on the other portion of the data through . For the next theorem, let denote the initial SNR and denote the model SNR.
Let for , , and . Fix . Suppose furthermore that for some positive universal constant and positive constant . Then there exists a universal modulus of contraction and a positive universal constant such that the sample-splitting empirical EM iterates based on samples per step satisfy
with probability at least .
Note that governs the number of iterations of the EM operator; if it is too small, the term from thm:main_splitting_empirical may fail to reach the parametric rate. Hence, must scale like .
We will prove thm:main_splitting_empirical in sec:proofthm. The main aspect of the analysis lies in showing that satisfies an invariance property, i.e., , where is a set on which is contractive toward . The algorithmic error is a result of repeated evaluation of the population EM operator and the contractivity of towards from lmm:main. The stochastic error is obtained from a high-probability bound on , which is contained in the proof of [1, Corollary 5]).
6 Proof of lmm:main
If , a few applications of Stein’s Lemma [26, Lemma 1] yields
we see that belongs to . This is a crucial fact that we will exploit multiple times.
Observe that for any in ,
Specializing this to yields
The strategy for establishing contractivity of toward will be to show that the sum of and is less than . This idea was used in  to obtain contractivity of the population EM operator for a mixture of two Gaussians. Due to the similarity of the two problems, it turns out that many of the same ideas transfer to our (more complicated) setting.
To reduce this -dimensional problem to a -dimensional problem , we first show that
where . The coefficients and are
This is because of the distributional equality
Note further that
because they have the same moment generating function. Using this, we deduce that
lmm:ABbounds implies that
Next, we note that
Note that by definition of , . In fact, is the one-dimensional population EM operator for this model when and . By the self-consistency property of EM [23, page 79], . Translating this to our problem, we have that . Since , we have from lmm:derivative,
Combining this with inequality eq:perp_inequality yields eq:main_bound. This completes the proof of lmm:main.
The function is related to the EM operator for the one-dimensional symmetric mixture of two Gaussians model . One can derive that (see [21, page 11]) the population EM operator is
Then is a “stretched” version of as seen through the identity
In light of this relationship, it is perhaps not surprising that the EM operator for the mixture of linear regressions problem also enjoys a large basin of attraction.
On the other hand, from [21, page 11], the population EM operator for the symmetric two component mixture of Gaussians , is equal to
where and .
Compare the values of and with and from eq:A and eq:B. We see that is essentially a “stretched” and “scaled” version of by the random dilation factors and . As will be seen in the proof lmm:negative in app:appendix, this additional source of variability causes the repellant behavior of in lmm:negative.
Recently in  , the authors analyzed gradient descent for a single-hidden layer convolutional neural network structure with no overlap and Gaussian input. In this setup, we observe i.i.d. data
, the authors analyzed gradient descent for a single-hidden layer convolutional neural network structure with no overlap and Gaussian input. In this setup, we observe i.i.d. data, where and and are independent of each other. The neural network has the form and the only nonzero coordinates of are in the successive block of coordinates and are equal to a fixed dimensional filter vector . One desires to minimize the risk . Interestingly, the gradient of belongs to the linear span of and , akin to our (and also in the Gaussian mixture problem ). This property also plays a critical role in their analysis.
7 Proof of thm:main_splitting_empirical
The first step of the proof is to show that the empirical EM operator satisfies , where is a set on which is contractive toward . In other words, the empirical EM iterates remain in a set where is closer to than its input . To this end, define the set . By rmk:contract, the stated conditions on , , and ensure that is contractive toward on and that .
Next, we use lmm:Mbounds which shows that
The fact that allows us to claim that when is large enough, , and hence . To show this, assume . That implies
For the last inequality, we used the fact that for all in , which follows from eq:span and lmm:ABbounds. By eq:close and lmm:Mbounds eq:cosinea, we have that
provided and, by eq:span and lmm:ABbounds,
provided , which is positive since .
For , let be the smallest number such that for any fixed in , we have
with probability at least . Moreover, suppose is a constant so that if , then
This guarantees that . For any iteration , we have
with probability at least . Thus by a union bound and ,
with probability at least .
Hence if belongs to , then by lmm:main,
with probability at least . Solving this recursive inequality yields,
with probability at least .
Finally, by a slight modification to the proof of [1, Corollary 5] that uses from eq:cosinea1, it follows that if , then there exists a universal constant such that
with probability at least . This completes the proof of thm:main_splitting_empirical.
In this paper, we showed that the empirical EM iterates converge to true coefficients of a mixture of two linear regressions as long as the initial guess lies within a cone (see the condition on thm:main_splitting_empirical: ).
In fig:SimEM, we perform a simulation study of with , , , and . All entries of the covariate vector and the noise are generated i.i.d. from a standard Gaussian distribution. We consider the error plotted as a function of at iterations