This paper considers the problem of mixed linear regression
mixed linear regression, where the output variable we see comes from one of two unknown regressors. Thus we see data , where
where can be thought of as a hidden label, and is the noise. Given the label for each sample, the problem decomposes into two standard regression problems, and can be easily solved. Without it, however, the problem is significantly more difficult. The main challenge of mixture models, and in particular mixed regression falls in the intersection of the statistical and computational constraints: the problem is difficult when one cares both about an efficient algorithm, and about near-optimal (
) sample complexity. Exponential-effort brute force search typically results in statistically near-optimal estimators; on the other hand, recent tensor-based methods give a polynomial-time algorithm, but at the cost ofsample complexity (recall ) instead of the optimal rate, .111It should be possible to improve the tensor rates to for the case of Gaussian design.
The Expectation Maximization (EM) algorithm is computationally very efficient, and widely used in practice. However, its behavior is poorly understood, and in particular, no theoretical guarantees on global convergence are known.
In this paper, we tackle both statistical and algorithmic objectives at once. The algorithms we give are efficient, specified by solutions of convex optimization problems; in the noiseless, arbitrary noise and stochastic noise regimes, they provide the best known sample complexity results; in the balanced case where nearly half the samples come from each of and , we provide matching minimax lower bounds, showing our results are optimal.
Specifically, our contributions are as follows:
In the arbitrary noise setting where the noise can be adversarial, we show that under certain technical conditions, as long as the number of observations for each regressor satisfy , our algorithm produces an estimator which satisfies
Note that this immediately implies exact recovery in the noiseless case with samples.
In the stochastic noise setting with sub-Gaussian noise and balanced labels, we show under the necessary assumption and a Gaussian design matrix, our estimate satisfies the following (ignoring polylog factors):
where and is any lower bound of and
is the variance of the noise.
In both the arbitrary and stochastic noise settings, we provide minimax lower bounds that match the above upper bounds up to at most polylog factors, thus showing that the results obtained by our convex optimization solution are information-theoretically optimal. Particularly in the stochastic setting, the situation is a bit more subtle: the minimax rates in fact depend on the signal-to-noise and exhibit several phases, thus showing a qualitatively different behavior than in standard regression and many other parametric problems (for which the scaling is ).
2 Related Work and Contributions
-means clustering are popular examples of unsupervised learning for mixture models. The most popular and broadly implemented approach to mixture problems, including mixed regression, is the so-called Expectation-Maximization (EM) algorithm[14, 18]. In fact, EM has been used for mixed regression for various application domains [28, 16]. Despite its wide use, still little is known about its performance beyond local convergence [30, 3].
One exception is the recent work in , which considers mixed regression in the noiseless setting, where they propose an alternating minimization approach initialized by a grid search and show that it recovers the regressors in the noiseless case with a sample complexity of . Extension to the noisy setting is very recently considered in . Focusing on the stochastic noise setting and the high SNR regime (i.e., when ; cf. Section 1), they show that the EM algorithm with good initialization achieves the error bound . Another notable exception is the work in . There, EM is adapted to the high-dimensional sparse regression setting, where the regressors are known to be sparse. The authors use EM to solve a penalized (for sparsity) likelihood function. A generalized EM approach achieves support-recovery, though once restricted to that support where the problem becomes a standard mixed regression problem, only convergence to a local optimum can be guaranteed.
Mixture models have been recently explored using the recently developed technology of tensors in [1, 17]. In , the authors consider a tensor-based approach, regressing against , and then using the tensor decomposition techniques to efficiently recover each . These methods are not limited to the mixture of only two models, as we are. Yet, the tensor approach requires samples, which is several orders of magnitude more than the that our work requires. As noted in their work, the higher sampling requirement for using third order tensors seems intrinsic.
In this work we consider the setting with two mixture components. Many interesting applications have binary latent factors: gene mutation present/not, gender, healthy/sick individual, children/adult, etc.; see also the examples in . Theoretically, the minimax rate was previously unknown even in the two-component case. Extension to more than two components is of great interest.
Finally, we note that our focus is on estimating the regressors rather than identifying the hidden labels or predicting the response for future data points. The relationship between covariates and response is often equally (some times more) important as prediction. For example, the regressors may correspond to unknown signals or molecular structures, and the response-covariate pairs are linear measurements; here the regressors are themselves the object of interest. For many mixture problems, including clustering, identifying the labels accurately for all data points may be (statistically) impossible. Obtaining the regressors allows for an estimate of this label (see  for a related setting).
3 Main Results
In this section we present this paper’s main results. In addition, we present the precise setup and assumptions, and introduce the basic notation we use.
3.1 Problem Set Up
Suppose there are two unknown vectorsand in . We observe noisy linear measurements which satisfy the following: for and ,
where with and with denote the subsets of the measurements corresponding to and , respectively. Given , the goal is to recover and . In particular, for the true regressor pair and an estimator of it, we are interested in bounding the recovery error
i.e., the total error in both regressors up to permutation. Unlike the noiseless setting, in the presence of noise, the correct labels are in general irrecoverable.
The key high-level insight that leads to our optimization formulations, is to work in the lifted space of matrices, yet without lifting to -tensors. Using basic matrix concentration results not available for tensors, this ultimately allows us to provide optimal statistical rates. In this work, we seek to recover the following:
Clearly and can be recovered from and . Indeed, note that
Let and. We have ; together with we can recover and . Given approximate versions and of and , we obtain estimates and using a similar approach, which we give in Algorithm 1.
We show below that in fact this recovery procedure is stable, so that if and are close to and , Algorithm 1 outputs that are close to .
We now give the two formulations for arbitrary and stochastic noise, and we state the main results of the paper. For the arbitrary noise case, while one can use the same quadratic objective as we do in arbitrary case, it turns out that the analysis is more complicated than considering a similar objective – an objective. In the noiseless setting, our results immediately imply exact recovery with an optimal number of samples, and in fact remove the additional log factors in the sample complexity requirements in . In both the arbitrary/adversarial noise setting and the stochastic noise setting, our results are information-theoretically optimal, as they match (up to at most a polylog factor) the minimax lower bounds we derive in Section 3.4.
We use lower case bold letters to denote vectors, and capital bold-face letters for matrices. For a vector , and both denote its -th coordinate. We use standard notation for matrix and vector norms, e.g.,
to denote the nuclear norm (as known as the trace norm, which is the sum of the singular values of a matrix),the Frobenius norm, and the operator norm. We define a quantity we use repeatedly. Let
Note that when , and is always bounded by . We say a number is a numerical constant if is independent of the dimension , the number of measurements and the quantity . For ease of parsing, we typically use to denote a large constant, and for a small constant.
3.2 Arbitrary Noise
We consider first the setting of arbitrary noise, with the following specific setting. We take to have i.i.d., zero-mean and sub-Gaussian entries222Recall that, as shown in , the general deterministic covariate mixed regression problem is NP-hard even in the noiseless setting. with sub-Gaussian norm bounded by a numeric constant, , and for all and . We assume that is a fixed constant and independent of and . If are standard Gaussian vectors, then these assumptions are satisfied with sub-Gaussian norm and . The only assumption on the noise is that it is bounded in norm. The noise is otherwise arbitrary, possibly adversarial, and even possibly depending on and .
We consider the following convex program:
The intuition is that in the noiseless case with , if we substitute the desired solution given by (2) into the above program, the LHS of (5) becomes zero; moreover, the rank of is , and minimizing the nuclear norm term in (4) encourages the optimal solution to have low rank. Our theoretical results give a precise way to set the right hand side, , of the constraint. The next two theorems summarize our results for arbitrary noise. Theorem 1 provides guarantees on how close the optimal solution is to ; then the companion result, Theorem 2, provides quality bounds on , produced by using Algorithm 1 on the output .
Theorem 1 (Arbitrary Noise).
We then use Algorithm 1 to estimate , which is stable as shown by the theorem below.
Theorem 2 (Estimating , arbitrary noise).
Theorem 2 immediately implies exact recovery in the noiseless case.
Corollary 1 (Exact Recovery).
Discussion of Assumptions:
(1) In Theorem 1, the condition is satisfied, for instance, if is Gaussian (with ). Moreover, this condition is in general necessary. To see this, suppose each is a Rademacher variable, which has , and
. The response variablemust have the form
Consider two possibilities: or . In both cases, may take any one of the values in with equal probabilities. Thus, it is impossible to distinguish between these two possibilities.
(2) The condition holds if and are not equal. Suppose is lower-bounded by a constant. The main assumption on the noise, namely, (the condition 4 in Theorem 1) cannot be substantially relaxed if we want a bound on . Indeed, if for all , then an adversary may choose such that
in which case the convex program (4)–(5) becomes independent of . That said, the case with condition 4 violated can be handled trivially. Suppose for any constant . A standard argument for ordinal linear regression shows that the blind estimator satisfies w.h.p.
and this bound is optimal (see the minimax lower bound in Section 3.4). Therefore, the condition 4 in Theorem 1 is not really restrictive, i.e., the case when it holds is precisely the interesting setting.
(3) Finally, note that if or , then a single explains 100% (asymptotically) of the observed data. Moreover, the standard least squares solution recovers this at the same rates as in standard (not mixed) regression.
Optimality of sample complexity.
3.3 Stochastic Noise and Consistency
We now consider the stochastic noise setting. We show that for Gaussian covariate in the balanced setting, we have asymptotic consistency and the rates we obtain match information-theoretic bounds we give in Section 3.4, and hence are minimax optimal. Specifically, our setup is as follows. We assume the covariates have i.i.d. Gaussian entries with zero mean and unit variance . For the noise, we assume are i.i.d., zero-mean sub-Gaussian with and their sub-Gaussian norm for some absolute constant , and are independent of .
Much like in standard regression, the independence assumption on makes the least-squares objective analytically convenient. In particular, we consider a Lagrangian formulation, regularizing the squared loss objective with the nuclear norm of . Thus, we solve the following:
We assume the noise variance is known and can be estimated.333We note that similar assumptions are made in . It might be possible to avoid the dependence on by using a symmetrized error term (see, e.g., ). As with the arbitrary noise case, our first theorem guarantees is close to , and then a companion theorem gives error bounds on estimating .
For any constant , there exist numerical positive constant , which might depend on , such that the following hold. Assume Suppose: (1) ; (2) ; (3) are Gaussian; and (4) satisfies
With probability at least , any optimal solution to the regularized least squares program (6) satisfies
The bounds in the above theorem depend on . This appears as a result of the objective function in the formulation (6) and not an artifact of our analysis.444Intuitively, if the majority of the observations are generated by one of the , then the objective produces a solution that biases toward this since this solution fits more observations. It might be possible to compensate for such bias by optimizing a different objective. Nevertheless, in the balanced setting with small, we have consistency with optimal convergence rate. In this case, running Algorithm 1 on the optimal solution of the program (6) to estimate the ’s, we have the following guarantees.
Theorem 4 (Estimating , stochastic noise).
Notice the error bound has three terms which are proportional to , and , respectively (ignoring log factors). We shall see that these three terms match well with the information-theoretic lower bounds given in Section 3.4, and represent three phases of the error rate.
Discussion of Assumptions.
The theoretical results in this sub-section assume Gaussian covariate distribution in addition to sub-Gaussianity of the noise. This assumption can be relaxed, but using our analysis, it comes at a cost in terms of convergence rate (and hence sample complexity required for bounded error). It can be shown that suffices under a general sub-Gaussian assumption on the covariate. We believe this additional cost is an artifact of our analysis.
3.4 Minimax Lower Bounds
In this subsection, we derive minimax lower bounds on the estimation errors for both the arbitrary and stochastic noise settings. Recall that is the true regressor pairs, and we use to denote any estimator, which is a measurable function of the observed data . For any and in , we have defined the error (semi)-metric
We show in the appendix that satisfies the triangle inequality.
We consider the following class of parameters:
i.e., pairs of regressors whose norms and separation are lower bounded.
We first consider the arbitrary noise setting, where the noise is assumed to lie in the -ball and otherwise arbitrary. We have the following theorem.
Theorem 5 (Lower bound, arbitrary noise).
There exist universal constants such that the following is true. If , then for any and any hidden labels , we have
with probability at least , where the probability is w.r.t. the randomness in .
The lower bound above matches the upper bound given in Theorem 2, thus showing that our convex formulation is minimax optimal and cannot be improved. Therefore, Theorems 2 and 5 together establish the following minimax rate of the arbitrary noise setting
which holds when .
For the stochastic noise setting, we further assume the two components have equal mixing weights. Recall that is the -th hidden label, i.e., if and only if for . We have the following theorem.
Theorem 6 (Lower bound, stochastic noise).
Suppose , has i.i.d. standard Gaussian entries, has i.i.d. zero-mean Gaussian entries with variance and . The following holds for some absolute constants .
For any , we have
For any , we have
For any , we have
Here denotes the expectation w.r.t. the covariate , the hidden labels and the noise .
We see that the three lower bounds in the above theorem match the three terms in the upper bound given in Theorem 4 respectively up to a polylog factor, proving the minimax optimality of the error bounds of our convex formulation. Therefore, Theorems 4 and 6 together establish the following minimax error rate (up to a polylog factor) in the stochastic noise setting:
where is any lower bound on Notice how the scaling of the minimax error rate exhibits three phases depending on the Signal-to-Noise Ratio (SNR) . (1) In the high SNR regime with , we see a fast rate – proportional to – that is dominated by the error of estimating a single and is the same as the rate for standard linear regression. (2) In the low SNR regime with , we have a slow rate that is proportional to and is associated with the demixing of the two components . (3) In the medium SNR regime, the error rate transitions between the fast and slow phases and depends in a precise way on the SNR. For a related phenomenon, see [2, 11].
3.5 Implications for Phase Retrieval
As an illustration of the power of our results, we discuss an application to the Phase Retrieval problem, which has recently received much attention (e.g., [9, 6, 7, 12, 19, 5]). Recall that in the real setting, the phase retrieval problem is essentially a regression problem without sign information. Most recent work has focused on the noiseless case. Here, the problem is as follows: we observe , , where
The goal is to recover the unknown vector . The stability of recovery algorithms has also been considered. Most work has focused on the setting where noise is added to the phase-less measurements, that is,
In many applications, however, it is also natural to consider the setting where the measurement noise is added before the phase is lost. This corresponds to the model:
We may call (13) the Noisy Phase Model, as opposed to the Noisy Magnitude Model (12) considered by previous work on phase retrieval. This problem can be reduced to a mixed regression problem and solved by our algorithm. The reduction is as follows. We generate
independent Rademacher random variables. For each , we set . Let and , where we use the convention that . Then we have
If we let , , and , then the model becomes
which is precisely the mixed regression model we consider.
Note that with probability at least , for , so . Also note that . Conditioned on , the distribution of is the same as its unconditional distribution. Therefore, applying our arbitrary-noise result from Theorem 2, we immediately get the following guarantees for phase retrieval under the Noisy Phase Model.
Corollary 2 (Phase retrieval, arbitrary noise).
Consider the Noisy Phase Model in (13). Suppose the are i.i.d., zero-mean sub-Gaussian with bounded sub-Gaussian norm, unit variance and fourth moment
are i.i.d., zero-mean sub-Gaussian with bounded sub-Gaussian norm, unit variance and fourth moment, , and the noise is arbitrary, but bounded in magnitude: . Then using the reduction described above, the output of the program (4)–(5) followed by Algorithm 1 satisfies
with probability at least .
The error bound above is again order-wise optimal, as we cannot achieve a smaller error even if the phase is not lost. Similarly as before, the large noise case with can be handled trivially using the blind estimator , which in this case satisfies the optimal error bound .
Next, consider the stochastic noise case where is i.i.d., zero-mean symmetric sub-Gaussian with variance . Conditioned on , the conditional distributions of and inherit the properties of and the unconditional , and are independent of each other. Applying Theorem 4, we have the following.
Corollary 3 (Phase retrieval, stochastic noise).
Consider the Noisy Phase Model in (13). Suppose the are i.i.d., zero-mean Gaussian with unit variance, and suppose that the noise is i.i.d., zero-mean symmetric sub-Gaussian with sub-Gaussian norm bounded by and variance equal to . Suppose further that and . Then using the reduction described above, the output of the program (6) followed by Algorithm 1 satisfies (up to the sign of )
with probability at least .
Phase retrieval is most interesting in the complex setting. Extension to this case is an interesting future direction.
Finally, we make a comment on the scalability of the approach illustrated here. Both formulations (4)–(5) and (6) are Semidefinite Programs (SDP). In the arbitrary noise setting, the constraint in the convex program (4)–(5) can be rewritten as a collection of linear constraints through the standard transformation of convex constraints. The Lagrangian formulation (6) in the setting of stochastic noise, involves minimizing the sum of a trace norm term and a smooth quadratic term. The complexity of solving this regularized quadratic in the matrix space has similar complexity to problems such as matrix completion and PhaseLift, and first order methods can easily be adapted, thus allowing solution of large scale instances of the mixed regression problem.
4 Proof Outline
In this section, we provide the outline and the key ideas in the proofs of Theorems 1, 3, 5 and 6. The complete proofs, along with the perturbation results of Theorems 2, 4, are deferred to the appendix.
The main hurdle is proving strict curvature near the desired solution in the allowable directions. This is done by demonstrating that a linear operator related to the errors satisfies a restricted-isometry-like condition, and that this in turn implies a strict convexity condition along the cone centered at of all directions defined by potential optima.
4.1 Notation and Preliminaries
We use to denote if and if . Let Without loss of generality, we assume and . For , we define , and ; correspondingly, for , we define , and . For each , let be the matrix with rows . For and , define the matrix . Also let
For , define the mapping by
Since , , we have for any , and for all ,
For each , we also define the matrices , and the mapping given by
The following notation and definitions are standard. Let the rank- SVD of be . Note that and have the same column space, which equals . Define the projection matrix