1 Introduction
Discriminative latentvariable models, which combine the high accuracy of discriminative models with the compact expressiveness of latentvariable models, have been widely applied to many tasks, including object recognition (Quattoni et al., 2004), human action recognition (Wang & Mori, 2009), syntactic parsing (Petrov & Klein, 2008), and machine translation (Liang et al., 2006). However, parameter estimation in these models is difficult; past approaches rely on local optimization (EM, gradient descent) and are vulnerable to local optima.
Our broad goal is to develop efficient provably consistent estimators for discriminative latentvariable models. In this paper, we provide a first step in this direction by proposing a new algorithm for a simple model, a mixture of linear regressions (Viele & Tong, 2002).
Recently, method of moments estimators have been developed for
generative latentvariable models, including mixture models, HMMs (Anandkumar et al., 2012b), Latent Dirichlet Allocation (Anandkumar et al., 2012a), and parsing models (Hsu et al., 2012). The basic idea of these methods is to express the unknown model parameters as a tensor factorization of the thirdorder moments of the model distribution, a quantity which can be estimated from data. The moments have a special symmetric structure which permits the factorization to be computed efficiently using the robust tensor power method (Anandkumar et al., 2012c).In a mixture of linear regressions, using thirdorder moments does not directly reveal the tensor structure of the problem, so we cannot simply apply the above tensor factorization techniques. Our approach is to employ lowrank linear regression (Negahban & Wainwright, 2009; Tomioka et al., 2011) to predict the second and third powers of the response. The solution to these regression problems provide the appropriate symmetric tensors, on which we can then apply the tensor power method to retrieve the final parameters.
The result is a simple and efficient twostage algorithm, which we call Spectral Experts. We prove that our algorithm yields consistent parameter estimates under certain identifiability conditions. We also conduct an empirical evaluation of our technique to understand its statistical properties (Section 5). While Spectral Experts generally does not outperform EM, presumably due to its weaker statistical efficiency, it serves as an effective initialization for EM, significantly outperforming EM with random initialization.
1.1 Notation
Let denote the first positive integers. We use to denote a function such that .
We use to represent the th order tensor formed by taking the tensor product of ; i.e. . We will use to denote the generalized dot product between two th order tensors: . A tensor is symmetric if for all which are permutations of each other, = (all tensors in this paper will be symmetric). For a th order tensor , the mode unfolding of is a matrix , whose th row contains all the elements of whose th index is equal to .
For a vector
, let denote the 2norm. For a matrix , letdenote the nuclear (trace) norm (sum of singular values),
denote the Frobenius norm (square root of sum of squares of singular values), denote the max norm (elementwise maximum), denote the operator norm (largest singular value), and be the th largest singular value of . For a th order tensor , let denote the average nuclear norm over all unfoldings, and let denote the average operator norm over all unfoldings.Let be the vectorization of a th order tensor. For example, if , . For a tensor , let be the collapsed vectorization of . For example, if , . In general, each component of is indexed by a vector of counts with total sum . The value of that component is , where are the set of index vectors whose count profile is . We note that for a symmetric tensor and any tensor , ; this property is not true in general though. Later, we’ll see that vectorization allow us to perform regression on tensors, and collapsing simplifies our identifiability condition.
2 Model
The mixture of linear regressions model (Viele & Tong, 2002) defines a conditional distribution over a response given covariates . Let be the number of mixture components. The generation of given involves three steps: (i) draw a mixture component according to mixture proportions ; (ii) draw observation noise from a known zeromean noise distribution , and (iii) set deterministically based on and . More compactly:
(1)  
(2)  
(3) 
The parameters of the model are , where are the mixture proportions and are the regression coefficients. Note that the choice of mixture component and the observation noise are independent. The learning problem is stated as follows: given i.i.d. samples drawn from the model with some unknown parameters , return an estimate of the parameters .
The mixture of linear regressions model has been applied in the statistics literature for modelling music perception, where is the actual tone and is the tone perceived by a musician (Viele & Tong, 2002). The model is an instance of the hierarchical mixture of experts (Jacobs et al., 1991), in which the mixture proportions are allowed to depend on , known as a gating function. This dependence allow the experts to be localized in input space, providing more flexibility, but we do not consider this dependence in our model.
The estimation problem for a mixture of linear regressions is difficult because the mixture components
are unobserved, resulting in a nonconvex log marginal likelihood. The parameters are typically learned using expectation maximization (EM) or Gibbs sampling
(Viele & Tong, 2002), which suffers from local optima. In the next section, we present a new algorithm that sidesteps the local optima problem entirely.3 Spectral Experts algorithm
In this section, we describe our Spectral Experts algorithm for estimating model parameters . The algorithm consists of two steps: (i) lowrank regression to estimate certain symmetric tensors; and (ii) tensor factorization to recover the parameters. The two steps can be performed efficiently using convex optimization and tensor power method, respectively.
To warm up, let us consider linear regression on the response given . From the model definition, we have . The challenge is that the regression coefficients depend on the random . The first key step is to average over this randomness by defining average regression coefficients . Now we can express as a linear function of with nonrandom coefficients plus a noise term :
(4) 
The noise is the sum of two terms: (i) the mixing noise due to the random choice of the mixture component , and (ii) the observation noise . Although the noise depends on , it still has zero mean conditioned on . We will later show that we can perform linear regression on the data to produce a consistent estimate of . But clearly, knowing is insufficient for identifying all the parameters , as only contains degrees of freedom whereas contains .
Intuitively, performing regression on given provides only firstorder information. The second key insight is that we can perform regression on higherorder powers to obtain more information about the parameters. Specifically, for an integer , let us define the average th order tensor power of the parameters as follows:
(5) 
Now consider performing regression on given . Expanding , using the fact that , we have:
(6)  
Again, we have expressed has a linear function of with regression coefficients , plus a known bias and noise.^{1}^{1}1If were not known, we could treat it as another coefficient to be estimated. The coefficients and can be estimated jointly provided that does not already contain a bias ( must be nonconstant for every ). Importantly, the noise has mean zero; in fact each of the three terms has zero mean by definition of and independence of and .
Performing regression yields a consistent estimate of , but still does not identify all the parameters . In particular, is only identified up to rotation: if satisfies and is uniform, then
for any orthogonal matrix
.Let us now look to the third moment for additional information. We can write as a linear function of with coefficients , a known bias and some noise :
(7)  
The only wrinkle here is that does not quite have zero mean. It would if were replaced with , but is not available to us. Nonetheless, as concentrates around , the noise bias will go to zero. Performing this regression yields an estimate of . We will see shortly that knowledge of and are sufficient to recover all the parameters.
(8)  
(9)  
(10)  

Compute whitening matrix (such that ) using SVD.

Compute eigenvalues
and eigenvectors
of the whitened tensor by using the robust tensor power method. 
Return parameter estimates and .
Now we are ready to state our full algorithm, which we call Spectral Experts (Algorithm 1). First, we perform three regressions to recover the compound parameters (4), (6), and (7). Since and both only have rank , we can use nuclear norm regularization (Tomioka et al., 2011; Negahban & Wainwright, 2009) to exploit this lowrank structure and improve our compound parameter estimates. In the algorithm, the regularization strengths and are set to for some constant .
Having estimated the compound parameters , and , it remains to recover the original parameters . Anandkumar et al. (2012c) showed that for and of the forms in (5), it is possible to efficiently accomplish this. Specifically, we first compute a whitening matrix based on the SVD of and use that to construct a tensor whose factors are orthogonal. We can use the robust tensor power method to compute all the eigenvalues and eigenvectors of , from which it is easy to recover the parameters and .
Related work
In recent years, there has a been a surge of interest in “spectral” methods for learning latentvariable models. One line of work has focused on observable operator models (Hsu et al., 2009; Song et al., 2010; Parikh et al., 2012; Cohen et al., 2012; Balle et al., 2011; Balle & Mohri, 2012) in which a reparametrization of the true parameters are recovered, which suffices for prediction and density estimation. Another line of work is based on the method of moments and uses eigendecomposition of a certain tensor to recover the parameters (Anandkumar et al., 2012b, a; Hsu et al., 2012; Hsu & Kakade, 2013). Our work extends this second line of work to models that require regression to obtain the desired tensor.
In spirit, Spectral Experts bears some resemblance to the unmixing algorithm for estimation of restricted PCFGs (Hsu et al., 2012). In that work, the observations (moments) provided a linear combination over the compound parameters. “Unmixing” involves solving for the compound parameters by inverting a mixing matrix. In this work, each data point (appropriately transformed) provides a different noisy projection of the compound parameters.
Other work has focused on learning discriminative models, notably Balle et al. (2011) for finite state transducers (functions from strings to strings), and Balle & Mohri (2012) for weighted finite state automata (functions from strings to real numbers). Similar to Spectral Experts, Balle & Mohri (2012) used a twostep approach, where convex optimization is first used to estimate moments (the Hankel matrix in their case), after which these moments are subjected to spectral decomposition. However, these methods are developed in the observable operator framework, whereas we consider parameter estimation.
The idea of performing lowrank regression on has been explored in the context of signal recovery from magnitude measurements (Candes et al., 2011; Ohlsson et al., 2012). There, the actual observed response was , whereas in our case, we deliberately construct powers to identify the underlying parameters.
4 Theoretical results
In this section, we provide theoretical guarantees for the Spectral Experts algorithm. Our main result shows that the parameter estimates converge to at a rate that depends polynomially on the bounds on the parameters, covariates, and noise, as well the th smallest singular values of the compound parameters and various covariance matrices.
Theorem 1 (Convergence of Spectral Experts).
Assume each dataset (for ) consists of i.i.d. points independently drawn from a mixture of linear regressions model with parameter .^{2}^{2}2Having three independent copies simplifies the analysis. Further, assume , for all , and is rank . Let , and assume for each . Let . Suppose the number of samples is where
If each regularization strength is set to
for , then the parameter estimates returned by Algorithm 1 (with the columns appropriately permuted) satisfies
for all .
While the dependence on some of the norms () looks formidable, it is in some sense unavoidable, since we need to perform regression on thirdorder moments. Classically, the number of samples required is squared norm of the covariance matrix, which itself is bounded by the squared norm of the data, . This thirdorder dependence also shows up in the regularization strengths; the cubic terms bound each of , and
with high probability.
The proof of the theorem has two parts. First, we bound the error in the compound parameters estimates using results from Tomioka et al. (2011). Then we use results from Anandkumar et al. (2012c) to convert this error into a bound on the actual parameter estimates derived from the robust tensor power method. But first, let us study a more basic property: identifiability.
4.1 Identifiability from moments
In ordinary linear regression, the regression coefficients are identifiable if and only if the data has full rank: , and furthermore, identifying requires only moments and (by observing the optimality conditions for (4)). However, in mixture of linear regressions, these two moments only allow us to recover . Theorem 1 shows that if we have the higher order analogues, and for , we can then identify the parameters , provided the following identifiability condition holds: for .
This identifiability condition warrants a little care, as we can run into trouble when components of are dependent on each other in a particular algebraic way. For example, suppose , the common polynomial basis expansion, so that all the coordinates are deterministically related. While might be satisfied (sufficient for ordinary linear regression), is singular for any data distribution. To see this, note that contains components and , which are linearly dependent. Therefore, Spectral Experts would not be able to identify the parameters of a mixture of linear regressions for this data distribution.
We can show that some amount of unidentifiability is intrinsic to estimation from loworder moments, not just an artefact of our estimation procedure. Suppose . Even if we observed all moments and for for some , all the resulting coordinates would be monomials of up to only degree , and thus the moments live in a dimensional subspace. On the other hand, the parameters live in a subspace of at least dimension . Therefore, at least moments are required for identifiability of any algorithm for this monomial example.
4.2 Analysis of lowrank regression
In this section, we will bound the error of the compound parameter estimates and , where and . Our analysis is based on the lowrank regression framework of Tomioka et al. (2011) for tensors, which builds on Negahban & Wainwright (2009) for matrices. The main calculation involved is controlling the noise , which involves various polynomial combinations of the mixing noise and observation noise.
Let us first establish some notation that unifies the three regressions ((8), (9), and (10)). Define the observation operator mapping compound parameters :
(11) 
Let be the restricted strong convexity constant, and let be the adjoint.
Lemma 1 (Tomioka et al. (2011), Theorem 1).
Suppose there exists a restricted strong convexity constant such that
Then the error of is bounded as follows:
Going forward, we need to lower bound the restricted strong convexity constant and upper bound the operator norm of the adjoint operator . The proofs of the following lemmas follow from standard concentration inequalities and are detailed in Appendix A.
Lemma 2 (lower bound on restricted strong convexity constant).
If
then with probability at least :
for each .
Lemma 3 (upper bound on adjoint operator).
If
then with probability at least :
for each .
4.3 Analysis of the tensor factorization
Having bounded the error of the compound parameter estimates and , we will now study how this error propagates through the tensor factorization step of Algorithm 1, which includes whitening, applying the robust tensor power method (Anandkumar et al., 2012c), and unwhitening.
Lemma 4.
Let . Let and both be less than
for some . Then, there exists a permutation of indices such that the parameter estimates found in step 2 of Algorithm 1 satisfy the following with probability at least :
for all .
The proof follows by applying standard matrix perturbation results for the whitening and unwhitening operators and can be found in Appendix B.
4.4 Synthesis
Together, these lemmas allow us to control the compound parameter error and the recovery error. We now apply them in the proof of Theorem 1:
5 Empirical evaluation
In the previous section, we showed that Spectral Experts provides a consistent estimator. In this section, we explore the empirical properties of our algorithm on simulated data. Our main finding is that Spectral Experts alone attains higher parameter error than EM, but this is not the complete story. If we initialize EM with the estimates returned by Spectral Experts, then we end up with much better estimates than EM from a random initialization.
5.1 Experimental setup
Algorithms
We experimented with three algorithms. The first algorithm (Spectral) is simply the Spectral Experts. We set the regularization strengths and ; the algorithm was not very sensitive to these choices. We solved the lowrank regression to estimate and using an offtheshelf convex optimizer, CVX (Grant & Boyd, 2012). The second algorithm (EM) is EM where the ’s are initialized from a standard normal and
was set to the uniform distribution plus some small perturbations. We ran EM for 1000 iterations. In the final algorithm (Spectral+EM), we initialized EM with the output of Spectral Experts.
Data
We generated synthetic data as follows: First, we generated a vector sampled uniformly over the dimensional unit hypercube . Then, to get the actual covariates , we applied a nonlinear function of that conformed to the identifiability criteria discussed in Section 3. The true regression coefficients were drawn from a standard normal and is set to the uniform distribution. The observation noise
is drawn from a normal with variance
. Results are presented below for , but we did not observe any qualitatively different behavior for choices of in the range .As an example, one feature map we considered in the onedimensional setting was . The data and the curves fit using Spectral Experts, EM with random initialization and EM initialized with the parameters recovered using Spectral Experts are shown in Figure 2. We note that even on wellseparated data such as this, EM converged to the correct basin of attraction only 13% of the time.
Variables ()  Features ()  Components ()  Spectral  EM  Spectral + EM 

1  4  2  2.45 3.68  0.28 0.82  0.17 0.57 
2  5  2  1.38 0.84  0.00 0.00  0.00 0.00 
2  5  3  2.92 1.71  0.43 1.07  0.31 1.02 
2  6  2  2.33 0.67  0.63 1.29  0.01 0.01 

5.2 Results
Variables ()  Features ()  Components ()  Spectral  EM  Spectral + EM 

1  4  2  1.70 0.85  0.29 0.85  0.03 0.09 
2  5  3  1.37 0.85  0.44 1.12  0.00 0.00 
2  6  5  9.89 4.46  2.53 1.77  2.69 1.83 
2  8  7  23.07 7.10  9.62 1.03  8.16 2.31 

Table 1 presents the Frobenius norm of the difference between true and estimated parameters for the model, averaged over 20 different random instances for each feature set and 10 attempts for each instance. The experiments were run using samples.
One of the main reasons for the high variance is the variation across random instances; some are easy for EM to find the global minima and others more difficult. In general, while Spectral Experts did not recover parameters by itself extremely well, it provided a good initialization for EM.
To study the stability of the solutions returned by Spectral Experts, consider the histogram in Figure 1, which shows the recovery errors of the algorithms over 170 attempts on a dataset with . Typically, Spectral Experts returned a stable solution. When these parameters were close enough to the true parameters, we found that EM almost always converged to the global optima. Randomly initialized EM only finds the true parameters a little over 10% of the time and shows considerably higher variance.
Effect of number of data points
In Figure 3, we show how the recovery error varies as we get more data. Each data point shows the mean error over 10 attempts, with error bars. We note that the recovery performance of EM does not particularly improve; this suggests that EM continues to get stuck in a local optima. The spectral algorithm’s error decays slowly, and as it gets closer to zero, EM initialized at the spectral parameters finds the true parameters more often as well. This behavior highlights the tradeoff between statistical and computational error.
Misspecified data
To evaluate how robust the algorithm was to model misspecification, we removed large contiguous sections from and ran the algorithms again. Table 2 reports recovery errors in this scenario. The error in the estimates grows larger for higher .
6 Conclusion
In this paper, we developed a computationally efficient and statistically consistent estimator for mixture of linear regressions. Our algorithm, Spectral Experts, regresses on higherorder powers of the data with a regularizer that encourages low rank structure, followed by tensor factorization to recover the actual parameters. Empirically, we found Spectral Experts to be an excellent initializer for EM.
Acknowledgements
We would like to thank Lester Mackey for his fruitful suggestions and the anonymous reviewers for their helpful comments.
References
 Anandkumar et al. (2012a) Anandkumar, A., Foster, D. P., Hsu, D., Kakade, S. M., and Liu, Y. Two SVDs suffice: Spectral decompositions for probabilistic topic modeling and latent Dirichlet allocation. In Advances in Neural Information Processing Systems (NIPS), Cambridge, MA, 2012a. MIT Press.

Anandkumar et al. (2012b)
Anandkumar, A., Hsu, D., and Kakade, S. M.
A method of moments for mixture models and hidden Markov models.
In Conference on Learning Theory (COLT), 2012b.  Anandkumar et al. (2012c) Anandkumar, Anima, Ge, Rong, Hsu, Daniel, Kakade, Sham M., and Telgarsky, Matus. Tensor decompositions for learning latent variable models. CoRR, abs/1210.7559, 2012c.
 Balle & Mohri (2012) Balle, B. and Mohri, M. Spectral learning of general weighted automata via constrained matrix completion. In Advances in Neural Information Processing Systems (NIPS), Cambridge, MA, 2012. MIT Press.
 Balle et al. (2011) Balle, B., Quattoni, A., and Carreras, X. A spectral learning algorithm for finite state transducers. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2011.
 Candes et al. (2011) Candes, E. J., Strohmer, T., and Voroninski, V. Phaselift: Exact and stable signal recovery from magnitude measurements via convex programming. Technical report, ArXiv, 2011.
 Chaganty & Liang (2013) Chaganty, A. and Liang, P. Spectral experts for estimating mixtures of linear regressions. International Conference on Machine Learning, 2013.
 Cohen et al. (2012) Cohen, S. B., Stratos, K., Collins, M., Foster, D. P., and Ungar, L. Spectral learning of latentvariable PCFGs. In Association for Computational Linguistics (ACL). Association for Computational Linguistics, 2012.
 Grant & Boyd (2012) Grant, Michael and Boyd, Stephen. CVX: Matlab software for disciplined convex programming, version 2.0 beta. http://cvxr.com/cvx, September 2012.
 Hsu & Kakade (2013) Hsu, D. and Kakade, S. M. Learning mixtures of spherical gaussians: Moment methods and spectral decompositions. In Innovations in Theoretical Computer Science (ITCS), 2013.
 Hsu et al. (2009) Hsu, D., Kakade, S. M., and Zhang, T. A spectral algorithm for learning hidden Markov models. In Conference on Learning Theory (COLT), 2009.
 Hsu et al. (2012) Hsu, D., Kakade, S. M., and Liang, P. Identifiability and unmixing of latent parse trees. In Advances in Neural Information Processing Systems (NIPS), Cambridge, MA, 2012. MIT Press.
 Jacobs et al. (1991) Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. Adaptive mixtures of local experts. Neural Computation, 3:79–87, 1991.
 Liang et al. (2006) Liang, P., BouchardCôté, A., Klein, D., and Taskar, B. An endtoend discriminative approach to machine translation. In International Conference on Computational Linguistics and Association for Computational Linguistics (COLING/ACL), Sydney, Australia, 2006. Association for Computational Linguistics.
 Negahban & Wainwright (2009) Negahban, S. and Wainwright, M. J. Estimation of (near) lowrank matrices with noise and highdimensional scaling. ArXiv eprints, December 2009.
 Ohlsson et al. (2012) Ohlsson, H., Yang, A., Dong, R., and Sastry, S. CPRL – an extension of compressive sensing to the phase retrieval problem. In Advances in Neural Information Processing Systems (NIPS), Cambridge, MA, 2012. MIT Press.

Parikh et al. (2012)
Parikh, A., Song, L., Ishteva, M., Teodoru, G., and Xing, E.
A spectral algorithm for latent junction trees.
In
Uncertainty in Artificial Intelligence (UAI)
, 2012.  Petrov & Klein (2008) Petrov, S. and Klein, D. Discriminative loglinear grammars with latent variables. In Advances in Neural Information Processing Systems (NIPS), Cambridge, MA, 2008. MIT Press.
 Quattoni et al. (2004) Quattoni, A., Collins, M., and Darrell, T. Conditional random fields for object recognition. In Advances in Neural Information Processing Systems (NIPS), Cambridge, MA, 2004. MIT Press.
 Song et al. (2010) Song, L., Boots, B., Siddiqi, S., Gordon, G., and Smola, A. Hilbert space embeddings of hidden Markov models. In International Conference on Machine Learning (ICML), Haifa, Israel, 2010. Omnipress.
 Tomioka et al. (2011) Tomioka, R., Suzuki, T., Hayashi, K., and Kashima, H. Statistical performance of convex tensor decomposition. Advances in Neural Information Processing Systems (NIPS), pp. 137, 2011.
 Viele & Tong (2002) Viele, Kert and Tong, Barbara. Modeling with mixtures of linear regressions. Statistics and Computing, 12:315–330, 2002. ISSN 09603174. doi: 10.1023/A:1020779827503. URL http://dx.doi.org/10.1023/A%3A1020779827503.
 Wang & Mori (2009) Wang, Y. and Mori, G. Maxmargin hidden conditional random fields for human action recognition. In Computer Vision and Pattern Recognition (CVPR), 2009.
Appendix A Proofs: Regression
Let us review the regression problem set up in Section 3. We assume we are given data generated by the following process,
where , the expected value of , is an estimable bias and is zero mean noise. In particular, for , we showed that and were,
(13)  
(14)  
(15) 
We then defined the observation operator ,
for . This let us succinctly represent the lowrank regression problem for as follows,
Let us also recall the adjoint of the observation operator, ,
where we have used to represent the vector .
Tomioka et al. (2011) showed that error in the estimated can be bounded as follows;
Lemma 1 (Tomioka et al. (2011), Theorem 1).
Suppose there exists a restricted strong convexity constant such that
Then the error of is bounded as follows: .
In this section, we will derive an upper bound on and a lower bound on , allowing us to apply Lemma 1 in the particular context of our noise setting.
Lemma 2 (Lower bound on restricted strong convexity).
Let . If
then, with probability at least ,
Proof.
Recall that is defined to be a constant such that the following inequality holds,
where
To proceed, we will unfold the tensors and to get a lower bound in terms of . This will allow us to choose an appropriate value for that will hold with high probability.
First, note that is symmetric, and thus . This allows us to simplify as follows,
Let , so that . For symmetric , ^{3}^{3}3 The correspond to residuals , which can easily be shown to always be symmetric.. Then, we have
By Weyl’s theorem,
Since , it suffices to show that the empirical covariance concentrates in Frobenius norm. Applying Lemma 5, with probability at least ,
Now we seek to control . Since , we can use the bound
Finally, with probability at least if,
∎
Lemma 3 (Upper bound on adjoint operator).
With probability at least , the following holds,