In this paper, we consider the following mixed linear equation problem. Suppose we are given samples of response-covariate pairs that are determined by equations
where , are model parameters corresponding to different linear models, and is the unobserved label of sample indicating which model it is generated from. We assume random label assignment, i.e.,
are i.i.d. copies of a multinomial random variablethat has distribution
Here represent the weights of every linear model, and naturally satisfy . Our goal is to find parameters from mixed samples . While solving linear systems is straightforward, this problem, with the introduction of latent variables, is hard to solve in general. Work in [yi2014alternating] shows that the subset sum problem can be reduced to mixed linear equations in the case of and certain designs of and . Therefore, given , determining whether there exist two ’s that satisfy is NP-complete, and thus in general the case is already hard. In this paper, we consider the setting for general , where the covariates
’s are independently drawn from the standard Gaussian distribution:
Under this random design, we provide a tractable algorithm for the mixed linear equation problem, and give sufficient conditions on the
’s under which we guarantee exact recovery with high probability.
The problem of solving mixed linear equations (or regression when each is perturbed by a small amount of noise) arises in applications where samples are from a mixture of discriminative linear models and the interest is in parameter estimation. Mixed standard and generalized linear regression models are introduced in 1990s [wedel1995mixture] and have become an important set of techniques for market segmentation [wedel2012market]. These models have also been applied to study music perception [Viele2002] and health care demand [deb2000estimates]. See [grun2007applications] for other related applications and datasets. Mixed linear regression is closely related to another classical model called hierarchical mixtures of experts [jacobs1991adaptive]
, which also allows the distribution of labels to be adaptive to covariate vectors.
Due to the combinatorial nature of mixture models, popular approaches, including EM and gradient descent, are often based on local optimization and thus suffer from local minima. Indeed, to the best of our knowledge, there is no rigorous analysis of the convergence behavior of EM or other gradient descent-based methods for . Beyond real-world applications, the statistical limits of solving problem (1) by computationally efficient algorithms are even less well understood. This paper is motivated by this question: how many samples are necessary to recover exactly and efficiently?
In a nutshell, we prove that under certain technical conditions, there exists an efficient algorithm for solving mixed linear equations with sample size , and we provide an algorithm which achieves this. Notably, the dependence on is nearly linear and thus optimal up to some log factors. Our proposed algorithm has two phases. The first step is a spectral method called tensor decomposition, which is guaranteed to produce -close solutions with samples. In the second step, we apply an alternating minimization (AltMin) procedure to successively refine the estimation until exact recovery happens. As a key ingredient, we show that AltMin, as a non-convex optimization technique, enjoys linear convergence to the global optima when initialized closely enough to the optimal solution.
1.1 Comparison to Prior Art
The use of the method of moments for learning latent variable models can be dated back to Pearson’s work[pearson1894contributions]
on estimating Gaussian mixtures. There is now an increasing interest in computing high order moments and leveraging tensor decomposition for parameter estimation in various mixture models including Hidden Markov Models[anandkumar2012method], Gaussian mixtures [Hsu2012Gussian], and topic models [anandkumar2012two]. Following the same idea, we propose some novel moments for mixed linear equations, on which approximate estimation of parameters can be computed by tensor decomposition. Different from our moments, Chaganty and Liang [chaganty13] propose a method of regressing against to estimate a certain third-moment tensor of mixed linear regression under bounded and random covariates. Because of performing regression in the lifted space with dimension , their method suffers from much higher sample complexity compared to our results, while the latter builds on a different covariate assumption (3).
Mixed linear equation/regression with two components is now well understood. In particular, our earlier work [yi2014alternating] proves the local convergence of AltMin for mixed linear equations with two components. Through a convex optimization formula, work in [chen2014convex] establishes the minimax optimal statistical rates under stochastic and deterministic noises. Notably, Balakrishnan et al. [balakrishnan2014statistical]
develop a framework for analyzing the local convergence of expectation-maximization (EM), i.e., EM is guaranteed to produce statistically consistent points with good initializations. In the case of mixed linear regression with
and Gaussian noise with variance, applying the framework leads to estimation error . Even in the case of no noise (), their results do not imply exact recovery. Moreover, it is unclear how to apply the framework to the case of components. It is obvious that AltMin is equivalent to EM in the noiseless setting. Our analysis of AltMin takes a step further towards understanding EM in the case of multiple latent clusters.
Beyond linear models, learning mixture of generalized linear models is recently studied in [sun2014learning] and [sedghi2014provable]. Specifically, [sun2014learning] proposes a spectral method based on second order moments for estimating the subspace spanned by model parameters. Later on, Sedghi et al. [sedghi2014provable] construct specific third order moments that allow tensor decomposition to be applied to estimate individual vectors. In detail, when , they show that obtaining recovery error requires sample size . In a more recent update [sedghi2016provable] of their paper, they establish the same sample complexity for mixed linear regression using different moments, which we realize coincide with ours during the preparation of this paper. Nevertheless, we perform a sharper analysis that leads to a near-linear-in- sample complexity .
Conceptually, we establish the power of combining spectral method and likelihood based estimation for learning latent variable models. Spectral method excludes most bad local optima on the surface of likelihood loss, and as a consequence, it becomes much easier for non-convex local search methods such as EM and AltMin, to find statistically efficient solutions. Such phenomenon in the context of mixed linear regression is observed empirically in [chaganty13]. We provide a theoretical explanation in this paper. It is worth mentioning the applications of such idea in other problems including crowdsourcing [zhang2014spectral], phase retrieval (e.g. [candes2015phase, chen2015solving]) and matrix completion (e.g. [jain2013low, sun2015guaranteed, chen2015fast]). Most of these works focus on estimating bilinear or low rank structures. In the context of crowdsourcing, work in [zhang2014spectral] shows that performing one step of EM can achieve optimal rate given good initialization. In contrast, we establish an explicit convergence trajectory of multiple steps of AltMin for our problem. It would be interesting to study the convergence path of AltMin or EM for other latent variable models.
1.2 Notation and Outline
We lay down some notations commonly used throughout this paper. For counting number , we use to denote the set . We let , denote respectively. For sub-Gaussian random variable , we denote its -Orlicz norm [van1996weak] by , i.e.,
where . For vector , we use to denote the standard norm of . For matrix , we use to denote its
-th largest singular value. We also commonly useto denote and . In particular, we denote the operator norm of matrix as . We also use to denote the operator norm of symmetric third order tensor , namely
Here, denotes the multi-linear matrix multiplication of by , namely,
For two sequences indexed by , we write to mean there exists a constant such that for all . By , we mean there exist constants such that . We also use as shorthand for . Similarly, we say if .
The rest of this paper is organized as follows. In Section 2, we describe the specific details of our two-phase algorithm for solving mixed linear equations. We present the theoretical results of initialization and AltMin in Section 3.1 and 3.2 respectively. We combine these two parts and give the overall sample and time complexities for exact recovery in Section 3.3. We provide the experimental results in Section 4. All proofs are collected in Section 5.
A natural idea to solve problem (1) is to apply an alternating minimization (AltMin) procedure between parameters and labels : (1) Given , assign the labels for each sample by choosing a model that has minimal recovery error ; (2) When labels are available, each parameter is updated by applying the method of least square optimization to samples with the corresponding labels. One can show that in our setting, alternating minimization is equivalent to Expectation-Maximization (EM), which is one of the most important algorithms for inference in latent variable models. In general, similar to EM, AltMin is vulnerable to local optima. Our experiment (see Figure 1) demonstrates that even under random setting , AltMin with random initializations fails to exactly recover each with significantly large probability.
To overcome the local-optima issue of AltMin, our algorithm consists of two stages. The first stage builds on carefully designed moments of samples, and aims to find rough estimates of . Starting with the initialization, the second stage involves using AltMin to successively refine the estimates. In the following, we describe these two steps with more details.
2.1 Tensor Decomposition
In the first step, we use method of moments to compute initial estimates of . Consider moments , and as
where is a mapping from to with form
It is reasonable to choose these moments because of the next result, which shows that the expectations of and contain the structure of . See Section 5.1 for its proof.
Lemma 1 (Moment Expectation).
With the special structure given on the right hand sides of (7) and (8), tensor decomposition techniques can discover in three steps under a non-degeneracy condition (see Condition 1). First, apply SVD on to compute a whitening matrix such that . Then we use to transform into an orthogonal tensor2). Lastly,
can be reconstructed by applying simple linear transformation upon the previously discovered spectral components from. With sufficient amount of samples, it is reasonable to believe that and are close to their expectations such that the stability of tensor decomposition will lead to good enough estimates. For the ease of analysis, we need to ensure the independence between whitening matrix and . Accordingly, we split the samples used in initialization into two disjoint parts for computing and respectively. We present the details in Algorithm 1.
2.2 Alternating Minimization
The motivation for using AltMin is to consider the least-square loss function below
The minimization over discrete labels makes the above loss function non-convex and yields hardness of solving mixed linear equations in general. A natural idea to minimize is by minimizing and alternatively and iteratively. Given initial estimates , each iteration consists of the following two steps:
Label Assignment: Pick the model that has the smallest reconstruction error for each sample
AltMin runs quickly and is thus favored in practice. However, as we discussed before, its convergence to global optima is commonly intractable. In order to alleviate such issue, we already discussed how to construct good initial estimates by method of moments. Here, we introduce another ingredient—resampling—for making the analysis of AltMin tractable. The key idea is to split all samples into multiple disjoint subsets and use a fresh piece of samples in each iteration. While slightly inefficient regarding sample complexity, this trick decouples the probabilistic dependence between two successive estimates and , and thus makes our analysis hold. The details are presented in Algorithm 3.
3 Theoretical Results
Moreover, we impose the following non-degeneracy condition on .
Condition 1 (Non-degeneracy).
Parameters are linearly independent and all weights are strictly greater than , namely
Under the above condition, has rank , which leads to
We use to denote the minimum distance between any two parameters, namely
The above three quantities represent the hardness of our problem, and will appear in the results of our analysis. For estimates , we define the estimation error as
where the infimum is taken over all permutations on .
3.1 Analysis of Tensor Decomposition
Theorem 1 (Tensor Decomposition).
Consider Algorithm 1 for initial estimation of . Pick any . There exist constants such that the following holds. Pick any . If
then with probability at least , the output satisfy
Theorem 1 shows that have inverse dependencies on . In the well balanced setting, we have . In general, can be quite small, especially in the case where some parameter almost lies in the subspace spanned by the rest parameters and has a very small magnitude along the orthogonal direction. Below we provide a sufficient condition under which has a well established lower bound.
Condition 2 (Nearly Orthonormal Condition()).
For all , . Moreover, for all .
Under the above condition, the next result provides a lower bound of . See Section 5.2 for the proof.
Suppose satisfy the nearly orthonormal condition with . Then we have
Remark 1 (Sample Complexity).
We treat in Theorem 1 as a constant. Then (13) implies that is sufficient to guarantee that the estimates produced by Algorithm 1 have accuracy at most . Moreover, we have , , which indicates that more samples are required to compute than . To provide some intuitions why this conclusion makes sense, note that the estimation accuracy of determines the accuracy of identifying the subspace spanned by in the original -dimensional space. While has higher order, it is only required to concentrate well on a -dimensional subspace computed from thanks to the whitening procedure. It turns out subspace accuracy has a more critical impact on the final error and needs to sharpened with more samples.
Remark 2 (Time Complexity).
Except the line 6 in Algorithm 1, the other steps have total complexity . Note that it’s not necessary to compute directly since we can compute from whitened covariate vectors . Running time of robust tensor power method is . According to Lemma 4, it is sufficient to set and for some polynomial function . When is large enough, can be very close to be linear in (see Theorem 5.1 in [anandkumar2014tensor] for details). Roughly, we take , which gives the running time of Algorithm 2 as when . Therefore, the overall complexity of Algorithm 1 is .
3.2 Analysis of Alternating Minimization
Now we turn to the analysis of Algorithm 3. Let .
Theorem 2 (Alternating Minimization).
Consider Algorithm 3 for successively refining estimation of . Pick any . There exist constants such that the following holds. Suppose
With probability at least , satisfies
See Section 5.5 for the proof of the above result. Theorem 3 suggests that with good enough initialization, iterates have at least linear convergence to the ground truth parameters. Due to the fast convergence, it is sufficient to set to obtain estimation with accuracy . In the case of well balanced clusters, i.e. , is required to be in order to guarantee the convergence to global optima. Next, we give two remarks for sample and time complexities. In our discussion, we assume and that is a small constant.
Remark 3 (Sample Complexity).
For accuracy , it is sufficient to have when satisfies . Compared to the sample complexity of tensor decomposition, AltMin avoids the high-order polynomial factor of . Moreover, it also changes the dependence on from to , which is a big save especially when we focus on exact recovery, which can happen as we show in the next section, after one step of AltMin when . Notably, the statistical efficiency comes from a good initialization provided by tensor/spectral method. On one hand, AltMin alleviates the statistical inefficiency of spectral method; on the other hand, spectral method resolves the algorithmic intractability of AltMin.
Remark 4 (Time Complexity).
Each iteration of AltMin has time complexity . Hence, the overall running time is 222Factor in the second term stands for the complexity of inverting a -by- matrix by Gauss-Jordan elimination. It can be further reduced by more complicated algorithms such as Strassen algorithm that has .. Using the minimum requirement of , we obtain complexity . Recall that solving linear regression by most practical algorithms has complexity . Therefore, even labels are available, solving sets of linear equations requires time . AltMin almost has an extra factor as the price for addressing latent variables.
3.3 Exact Recovery and Overall Guarantee
We now consider putting the previous analysis of tensor decomposition and AltMin together to show exact recovery of .
We provide the proof of the above result in Section 5.6. Putting all ingredients together, we have the following overall guarantee:
Corollary 1 (Exact Recovery).
Consider splitting samples from (1) into two disjoint sets with size as inputs of Algorithm 1 and 3 for solving mixed linear equations as a two-stage method. Pick any . There exist constants such that the following holds. If we choose in Algorithm 3, and satisfy
then with probability at least , we have exact recovery, i.e. .
The proof is provided in Section 5.3. When and Condition 2 holds with and ( in the case), Corollary 1 implies that is enough for exact recovery with high probability, say . With this amount of samples, Remarks 2 and 4 give the overall time complexity as . Note that solving sets of linear equations (labels are known) needs at least samples, and usually requires time . Hence, under the aforementioned setting, our two-stage algorithm is nearly optimal in with respect to sample and time complexities.
4 Numerical Results
In this section, we provide some numerical results to demonstrate the empirical performance of the proposed method (combination of Algorithms 1 and 3) for solving mixed linear equations, and also compare it with random initialized Alternating minimization (AltMin). All algorithms are implemented in MATLAB. While sample-splitting is useful for our theoretical analysis, we find it unnecessary in practice. Therefore, we remove the sample-splittings in Algorithms 1 and 3, and use the whole sample set in the entire process. AltMin is implemented to terminate when the label assignment no longer changes or the maximal number of iterations is reached. In all experiments, we set .
For given problem size , we generate synthetic datasets as follows. Covariate vectors are drawn independently from . Model parameters are a random set of vectors in , where every two distinct s have distance . Therefore, these parameters are not orthogonal. Suppose denotes the matrix with as the -th column. We let , where represents the basis of a random -dimensional subspace in . Matrices are from the eigen-decomposition of symmetric matrix , where the diagonal terms of are and the rest entries are . We assign equal weights for all clusters.
Our first set of results, presented in Figure 1, show the convergence of estimation errors of AltMin with random and tensor initializations. Recall that estimation error is defined in (12). In random setting, AltMin starts with a set of uniformly random vectors in . We find that AltMin with random starting points has quite slow convergence, and fails to produce true s with significant probability. In contrast, with the same amount of samples, tensor method provides more accurate starting points, which leads to much faster convergence of AltMin to the global optima. These results thus back up our convergence theory of AltMin (Theorem 2), and demonstrate the power of using tensor decomposition initialization.
The second set of results, presented in Figure 2, explore the statistical efficiency of the proposed algorithm—tensor initialized AltMin. For fixed , Figure (1(a)) reveals a linear dependence of the necessary sample size on , which matches our results in Corollary 1. With fixed , Figure (1(b)) indicates that samples could be enough in practice, which is much better than our theoretical guarantee . Sharpening the polynomial factor on is an interesting direction of future research.
5.1 Proof of Lemma 1
Recall that denotes the latent label associated with each sample. Suppose , and has the distribution of each . We find that
One can check that for any , . Therefore,
One can check , which leads to .