Convergence of Parameter Estimates for Regularized Mixed Linear Regression Models

by   Taiyao Wang, et al.
Boston University

We consider Mixed Linear Regression (MLR), where training data have been generated from a mixture of distinct linear models (or clusters) and we seek to identify the corresponding coefficient vectors. We introduce a Mixed Integer Programming (MIP) formulation for MLR subject to regularization constraints on the coefficient vectors. We establish that as the number of training samples grows large, the MIP solution converges to the true coefficient vectors in the absence of noise. Subject to slightly stronger assumptions, we also establish that the MIP identifies the clusters from which the training samples were generated. In the special case where training data come from a single cluster, we establish that the corresponding MIP yields a solution that converges to the true coefficient vector even when training data are perturbed by (martingale difference) noise. We provide a counterexample indicating that in the presence of noise, the MIP may fail to produce the true coefficient vectors for more than one clusters. We also provide numerical results testing the MIP solutions in synthetic examples with noise.



There are no comments yet.


page 1

page 2

page 3

page 4


Outlier Detection Using Distributionally Robust Optimization under the Wasserstein Metric

We present a Distributionally Robust Optimization (DRO) approach to outl...

Iterative Least Trimmed Squares for Mixed Linear Regression

Given a linear regression setting, Iterative Least Trimmed Squares (ILTS...

Bayesian Estimation Approach for Linear Regression Models with Linear Inequality Restrictions

Univariate and multivariate general linear regression models, subject to...

Spectral Deconfounding and Perturbed Sparse Linear Models

Standard high-dimensional regression methods assume that the underlying ...

A New Look at an Old Problem: A Universal Learning Approach to Linear Regression

Linear regression is a classical paradigm in statistics. A new look at i...

Alternating Minimization Converges Super-Linearly for Mixed Linear Regression

We address the problem of solving mixed random linear equations. We have...

Penalized k-means algorithms for finding the correct number of clusters in a dataset

In many applications we want to find the number of clusters in a dataset...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

Mixed Linear Regression (MLR) [1, 2] is also known as mixtures of linear regressions [3] or cluster-wise linear regression [4]. It involves the identification of two or more linear regression models from unlabeled samples generated from an unknown mixture of these models. This can be seen as a joint clustering and regression problem. The problem is related to the identification of hybrid and switched linear systems [5] and has many applications in diverse areas. In this work, we focus on the fundamental problem of establishing strong consistency of parameter estimates, i.e., establishing that the estimated parameters converge to their true values as the number of the training samples grows.

MLR is typically solved by local search such as Expectation Maximization (EM) or alternating minimization, where one alternates between clustering given regression parameters and regression given cluster labels. It has recently been shown that EM converges to the true parameters if it starts from a small enough neighborhood around them [1, 6, 7].

Much effort has focused on the case where training samples are not perturbed by noise. [8]

proposed a combination of tensor decomposition and alternating minimization, showing that initialization by the tensor method allows alternating minimization to converge to the global optimum at a linear rate with high probability (w.h.p.)

[2] proposes a non-convex objective function with a tensor method for initialization so that the initial coefficients are in the neighborhood of the true coefficients w.h.p. [9] proposes a second-order cone program, showing that it recovers all mixture components in the noiseless setting under conditions that include a well-separation assumption and a balanced measurement assumption on the data. [10] considers data generated from a mixture of Gaussians, showing global convergence of the proposed algorithm with nearly optimal sample complexity.

Establishing convergence (w.h.p.) and consistency results for MLR in the presence of noise is much harder. [3] develops a provably consistent estimator for MLR that relies on a low-rank linear regression to recover a symmetric tensor, which can be factorized into the parameters using a tensor power method. [11] provides a convex optimization formulation for MLR with two components and upper bounds on the recovery errors under subgaussian noise assumptions. [12] studies a MixLasso approach but convergence results are limited to the objective function and not the solution. [13] develops an algorithm based on ideas from sparse graph codes; the convergence results however are asymptotic under Gaussian noise and for MLR with only two components.

Recently, an increasing number of machine learning and statistics problems have been tackled using MIP methods 

[14, 15, 16, 17, 18]. [19]

proposes an MIP formulation for MLR and a pre-clustering heuristic approach to solve large-scale problems.

[4] proposes a Mixed-Integer Quadratic Program (MIQP) formulation for MLR.

At the same time, regularization in learning problems has become widespread, following the early success of LASSO [20, 21]. Recent work has obtained regularization as a consequence of solving a robust learning problem (see, [22] and references therein).

In this paper, we introduce a general MIP formulation for MLR subject to norm-based regularization constraints. We establish that optimal solutions of the MIP converge almost surely (rather than w.h.p.) to the true parameters in the noiseless case as the sample size increases. Subject to cluster separability assumptions, we also establish that MIP solutions can identify the proper cluster for each given sample. For the special case of a single cluster, we show that the MIP solution converges to the true parameter vector in the presence of noise satisfying a martingale difference assumption. For multiple clusters in the presence of noise we provide a counterexample, suggesting that one can not in general recover the true parameters.

Our convergence analysis leverages techniques for proving strong consistency of least-squares estimates for linear regression, but under weaker assumptions. The related literature is substantial. A breakthrough paper [23] established strong consistency of least-squares estimates for stochastic regression models. Asymptotic properties and strong consistency of least-squares parameter estimates have been studied in many areas including system identification and adaptive control [24], econometric theory [25] and time series analysis [26].

The paper is organized as follows. Sec. II, presents the formulation of the problem. The main results are given in Sec. III. Sec. IV presents numerical results. Sec. V concludes the paper.

Notation: All vectors are column vectors and denoted by bold lowercase letters. Bold uppercase letters denote matrices. For economy of space, we write to denote the column vector , where is the dimension of . Prime denotes transpose and the norm, where . Unless otherwise specified, denotes the norm and the counting “norm.” and

denote the minimum and maximum eigenvalue of a (symmetric) matrix. We use

to denote the empty set, for the set , and for the cardinality of the set . We write if there exist positive numbers and such that . We write if there exist positive numbers and such that . We write if both and hold. Finally, and denote expectation and probability, respectively.

Ii Problem Formulation

Consider the MLR model where data are generated by

where are the ground truth coefficient vectors. The problem is to estimate these parameters from data.

Given a training dataset , we formulate the problem as the following MIP:


where is a large constant (big-). Notice that when , the first two constraints imply , whereas implies that no constraint is imposed on since is a large positive constant. At optimality, (1) minimizes a

-norm loss function for the regression problem and assigns each data point to the cluster achieving minimal loss. The last constraint in (

1) imposes a -norm regularization constraint to each coefficient vector and are given constants. For , (1) is a linear MIP, while if either or (or both) are equal to it is a quadratic MIP. Both can be solved by modern MIP solvers.

Without the regularization constraint, another equivalent formulation of (1) is


Let denote an optimal solution to (1); we use a superscript to explicitly denote dependence on the training set. Define and which form the true and the estimated partition of the training set into the clusters, respectively. We can show the following lemma (proof omitted due to space limitations).

Lemma II.1

In order to analyze the strong consistency of parameter estimates obtained by the MIP formulation, we introduce the following assumptions.

  • The clusters are not degenerate and different, i.e., , and , .

  • The noise sequence is a martingale difference sequence, where is a sequence of increasing -fields, and a.s.

  • The noise sequence is a martingale difference sequence, where is a sequence of increasing -fields, and .

  • .

The noise assumptions are mild. For instance, (A2) holds if are random variables with zero mean and variance, including but not limited to Gaussian white noise.

The following lemmas are important in proving the strong consistency of least squares estimates in stochastic linear regression models.

Lemma II.2

([23]) Suppose the noise sequence satisfies Assumption (A2). Let be measurable for every . Define . Assume that a.s., and for , define

Let the maximum eigenvalue of . Then is non-decreasing in and

  • On , a.s.

  • On (, we have that for ,

  • On , the previous result can be strengthened to

    if the assumption (A2) is replaced by (A3).

The following estimates for the weighted sums of martingale difference sequences are based on the Kronecker lemma, the local convergence theorem and the strong law for martingales [27, 28].

Lemma II.3

([23, 24]) Suppose the noise sequence satisfies the assumption (A2). Let be measurable for every and . Then

  • if the assumption (A2) is replaced by (A3).

The estimates given by these results are not as sharp as those given by the law of the iterative logarithm [27] but the assumptions needed here are much more general.

Iii Main Results

Iii-a Noiseless case

Suppose are optimal solutions to (1) for and .

Theorem 1

Suppose Assumptions (A1) and (A4) hold in the MLR model, and


Then we have the strong consistency, i.e.,

where is a permutation of the clusters.

Furthermore, if the optimal solution of (1) is unique, or there are no ties for assigning samples to clusters, i.e., , it follows


such that


From (3), such that and


Since the true are a feasible solution of (1) and there is no noise, the optimal objective value must be , i.e., for some and all . It follows

As a result,



From the definition of the smallest eigenvalue, we have


Accordingly, when it follows from equations (5), (6) and (7) that .

Next, we show the mapping is a bijection by contradiction. Assume there exists for . Then, when is large enough, which contradicts the assumption that the clusters are different. Thus, is a permutation. Finally, we can prove the cluster set equality (4) by contradiction. ∎

Remark 1
  1. Assumption (3) on is not only sufficient but also necessary. For instance, if , and which violates Assumption (3), and model parameters are not identifiable no matter which method is being used.

  2. Thm. 1 holds for either or . From a computational complexity aspect, a linear MIP () can be solved faster than a quadratic MIP ().

  3. Even if we have the strong consistency for the model parameters, i.e., , we may not have when . For instance, if , for some , then the sample may be assigned to any cluster.

Thm. 1 holds for any and . Assumption (3) is equivalent to requiring that every cluster has at least linearly independent measurements. The exact convergence can also be achieved by other methods, e.g., the second-order cone program in [9]. However, to the best of our knowledge, our assumptions are the weakest since we do not need a “sufficient-separation” and a balanced measurement assumption on the data as in [9].

Iii-B Noisy case with a single cluster

Linear regularized regression can be seen as a specific case of MLR with a single cluster. In this section, we assume and consider the presence of noise. We establish strong consistency for the model parameters under general covariate and noise distributions. Consider the regularized regression problem



, corresponding to ridge regression, LASSO and regression with a subset selection constraint, respectively. For

, the problem can be formulated as a MIP problem [15]. Let denote an optimal solution of (8).

Let and .

Theorem 2

Suppose that Assumptions (A1), (A2) and (A4) hold for (8), and let be measurable for every . If

then for all ,


Since must is a feasible solution of (8) (cf. Ass. (A4)), we have

Substituting in the l.h.s., we obtain

and by expanding the l.h.s. we obtain


From the definition of the minimum eigenvalue, we have


Next, we study the r.h.s. of (9) in order to bound the convergence rate of . From Lemma II.3(ii), we have that for any ,

For any ,

where denotes the trace of a matrix. By combining the above two equations, we have for any ,

and thus,

We bound the r.h.s. of (9) as


Combining (9), (10) and (III-B), we obtain

If the Assumption (A2) is replaced by (A3), we have a stronger version of the previous theorem (proof omitted).

Theorem 3

Suppose that Assumptions (A1), (A3) and (A4) hold for (8), and let be measurable for every . If

then we have

As a point of comparison, the classical convergence rate for least square estimates of unregularized linear regression subject to noise is given by:


It can be seen that for well-conditioned data, i.e., , our yield the same convergence rate with the unregularized case. The following Corollary outlines necessary conditions to that effect.

Corollary 1

Suppose are i.i.d. random vectors with being a positive definite matrix. Suppose are i.i.d. random variables, independent of the , with zero mean and variance . Then, we have

Furthermore, if , we have


From the strong law of large numbers,

It follows

Thus, from Thm. 2 we have

Similarly, using Thm. 3 we can prove (13). ∎

We point out that our setting is more general, including ridge regression, LASSO and regression with subset selection. The convergence rate is sharper in terms of the sample size than the rate achieved for LASSO in [21]. In addition, our noise assumptions are more general and weaker than the Gaussian noise used in [21].

Iii-C A counterexample for the noisy case with two clusters

Next, we provide a counterexample indicating that the presence of noise may prevent convergence to the true coefficients when we have more than a single cluster.

Consider the 2-cluster MLR model where the data are generated as follows:

where are i.i.d. random variables with zero mean and positive variance. If , the optimal solution of (1) will be different from the ground truth parameters. In particular, the optimal objective function value of (1), is smaller than the objective function value of the ground truth parameters, and depend on rather than .

Not only (1), but also other methods may fail to recover the ground truth parameters since they are “unidentifiable,” i.e., two sets of parameters can generate the same distribution. This counterexample shows strong consistency may fail to hold under the weakest assumptions used in our earlier analysis.

Iv Numerical results

In this section we provide numerical results on the convergence of parameters estimates in MLR obtained by formulation (1). Computations were performed with GUROBI 8.0 and python 3.6.2 on the Boston University Shared Computing Cluster. The results below coincide with the intuition that if the clusters are well-separated, the convergence of the estimates becomes almost equivalent to having separate linear regression models.

Iv-a MLR under Gaussian noise

Consider a model where the data are generated independently by

The first element of of

consists of random samples from the uniform distribution over

. The second element of is constant.

are drawn from the standard normal distribution. One-half of the data samples are from Cluster 1. Fig. 

1 plots the evolution of as a function of the number of samples used in solving problem (1), where we used , , and did not include a regularization constraint.

Fig. 1: Gaussian noise case.

Iv-B MLR under uniform noise

Consider now a model where the data samples are generated independently by

The first element of consists of random samples from the uniform distribution over , and the second element is the constant . are drawn from the uniform distribution over . One-half of the samples are generated from Cluster 1. Fig. 2 plots as a function of the number of samples used in (1).

Fig. 2: Uniform noise case.

V Conclusions

We established the convergence of parameter estimates for regularized mixed linear regression models with multiple components in the noiseless case. Regularized linear regression can be seen as a specific case of MLR with a single cluster. We establish strong consistency and characterize the convergence rate of parameter estimates for such regression models subject to martingale difference noise under very weak assumptions on the data distribution.

To the best of our knowledge, our paper is the first to study strong consistency of parameter estimates for mixed linear regression models under general noise conditions and general feature conditions rather than convergence with high probability. It can be used directly and extended in many areas, including but not limited to system identification and control, econometric theory and time series analysis.


  • [1] X. Yi, C. Caramanis, and S. Sanghavi, “Alternating minimization for mixed linear regression,” in International Conference on Machine Learning, 2014, pp. 613–621.
  • [2] K. Zhong, P. Jain, and I. S. Dhillon, “Mixed linear regression with multiple components,” in Advances in neural information processing systems, 2016, pp. 2190–2198.
  • [3] A. T. Chaganty and P. Liang, “Spectral experts for estimating mixtures of linear regressions,” in International Conference on Machine Learning, 2013, pp. 1040–1048.
  • [4] Y. W. Park, Y. Jiang, D. Klabjan, and L. Williams, “Algorithms for generalized clusterwise linear regression,” INFORMS Journal on Computing, vol. 29, no. 2, pp. 301–317, 2017.
  • [5] S. Paoletti, A. L. Juloski, G. Ferrari-Trecate, and R. Vidal, “Identification of hybrid systems a tutorial,” European journal of control, vol. 13, no. 2-3, pp. 242–260, 2007.
  • [6] X. Yi and C. Caramanis, “Regularized em algorithms: A unified framework and statistical guarantees,” in Advances in Neural Information Processing Systems, 2015, pp. 1567–1575.
  • [7] S. Balakrishnan, M. J. Wainwright, B. Yu et al., “Statistical guarantees for the em algorithm: From population to sample-based analysis,” The Annals of Statistics, vol. 45, no. 1, pp. 77–120, 2017.
  • [8] X. Yi, C. Caramanis, and S. Sanghavi, “Solving a mixture of many random linear equations by tensor decomposition and alternating minimization,” arXiv preprint arXiv:1608.05749, 2016.
  • [9] P. Hand and B. Joshi, “A convex program for mixed linear regression with a recovery guarantee for well-separated data,” Information and Inference: A Journal of the IMA, vol. 7, no. 3, pp. 563–579, 2018.
  • [10] Y. Li and Y. Liang, “Learning mixtures of linear regressions with nearly optimal complexity,” arXiv preprint arXiv:1802.07895, 2018.
  • [11] Y. Chen, X. Yi, and C. Caramanis, “A convex formulation for mixed regression with two components: Minimax optimal rates,” in Conference on Learning Theory, 2014, pp. 560–604.
  • [12] I. E.-H. Yen, W.-C. Lee, K. Zhong, S.-E. Chang, P. K. Ravikumar, and S.-D. Lin, “Mixlasso: Generalized mixed regression via convex atomic-norm regularization,” in Advances in Neural Information Processing Systems, 2018, pp. 10 891–10 899.
  • [13] D. Yin, R. Pedarsani, Y. Chen, and K. Ramchandran, “Learning mixtures of sparse linear regressions using sparse graph codes,” IEEE Transactions on Information Theory, 2018.
  • [14] D. Bertsimas and A. King, “Or forum—an algorithmic approach to linear regression,” Operations Research, vol. 64, no. 1, pp. 2–16, 2015.
  • [15] D. Bertsimas, A. King, R. Mazumder et al., “Best subset selection via a modern optimization lens,” The annals of statistics, vol. 44, no. 2, pp. 813–852, 2016.
  • [16] T. Xu, T. S. Brisimi, T. Wang, W. Dai, and I. C. Paschalidis, “A joint sparse clustering and classification approach with applications to hospitalization prediction,” in Decision and Control (CDC), 2016 IEEE 55th Conference on.   IEEE, 2016, pp. 4566–4571.
  • [17] T. S. Brisimi, T. Xu, T. Wang, W. Dai, W. G. Adams, and I. C. Paschalidis, “Predicting chronic disease hospitalizations from electronic health records: an interpretable classification approach,” Proceedings of the IEEE, vol. 106, no. 4, pp. 690–707, 2018.
  • [18] T. S. Brisimi, T. Xu, T. Wang, W. Dai, and I. C. Paschalidis, “Predicting diabetes-related hospitalizations based on electronic health records,” Statistical Methods in Medical Research, vol. 0, no. 0, p. 0962280218810911, 0, pMID: 30474497. [Online]. Available:
  • [19] D. Bertsimas and R. Shioda, “Classification and regression via integer optimization,” Operations Research, vol. 55, no. 2, pp. 252–271, 2007.
  • [20] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 58, no. 1, pp. 267–288, 1996.
  • [21] S. Chatterjee, “Assumptionless consistency of the lasso,” arXiv preprint arXiv:1303.5817, 2013.
  • [22] R. Chen and I. C. Paschalidis, “A robust learning approach for regression models based on distributionally robust optimization,” The Journal of Machine Learning Research, vol. 19, no. 1, pp. 517–564, 2018.
  • [23] T. L. Lai, C. Z. Wei et al., “Least squares estimates in stochastic regression models with applications to identification and control of dynamic systems,” The Annals of Statistics, vol. 10, no. 1, pp. 154–166, 1982.
  • [24] H.-F. Chen and L. Guo, Identification and stochastic adaptive control.   Springer Science & Business Media, 2012.
  • [25] B. Nielsen, “Strong consistency results for least squares estimators in general vector autoregressions with deterministic terms,” Econometric Theory, vol. 21, no. 3, pp. 534–561, 2005.
  • [26] C. Wei et al., “Adaptive prediction by least squares predictors in stochastic regression models with applications to time series,” The Annals of Statistics, vol. 15, no. 4, pp. 1667–1682, 1987.
  • [27] Y. S. Chow and H. Teicher, Probability theory: independence, interchangeability, martingales.   Springer Science & Business Media, 2012.
  • [28] Y. S. Chow, “Local convergence of martingales and the law of large numbers,” The Annals of Mathematical Statistics, vol. 36, no. 2, pp. 552–558, 1965.