Learning mixtures of linear regressions is a natural generalization of the basic linear regression problem. In the basic problem, the goal is to learn the best linear relationship between the scalar responses (i.e., labels) and the explanatory variables (i.e., features). In the generalization, each scalar response is stochastically generated by picking a function uniformly from a set of unknown linear functions, evaluating this function on the explanatory variables, and possibly adding noise; the goal is to learn the set of unknown linear functions. The problem was introduced by De Veaux [11] over thirty years ago and has recently attracted growing interest [8, 14, 22, 24, 25, 27]. Recent work focuses on a query-based scenario in which the input to the randomly chosen linear function can be specified by the learner. The sparse setting, in which each linear function depends on only a small number of variables, was recently considered by Yin et al. [27], and can be viewed as a generalization of the well-studied compressed sensing problem [7, 13]. The problem has numerous applications in modelling heterogeneous data arising in medical applications, behavioral health, and music perception.
Formal Problem Statement.
There are unknown distinct vectors and each is -sparse, i.e., the number of non-zero entries in each is at most where is some known parameter. We define an oracle which, when queried with a vector , returns the noisy output :
Here the noise is a zero-mean random variable that represents the measurement error, and the vector that generates each response is chosen uniformly from the set of unknown vectors. (Many of our results can be generalized to non-uniform distributions, but we assume a uniform distribution throughout for the sake of clarity.) The goal is to recover all the unknown vectors by making a set of queries to the oracle. We refer to the values returned by the oracle given these queries as samples. Note that the case of a single unknown vector corresponds to the problem of compressed sensing. Our primary focus is on the sample complexity of the problem, i.e., minimizing the number of queries that suffices to recover the sparse vectors up to some tolerable error.
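As a minimal sketch of the oracle described above, the following Python snippet simulates one query: it picks one of the unknown vectors uniformly at random and returns its inner product with the query, optionally perturbed by zero-mean Gaussian noise. The names `make_oracle`, `betas`, and `noise_std` are hypothetical and chosen only for illustration.

```python
import random


def make_oracle(betas, noise_std=0.0, seed=None):
    """Return a query function for a uniform mixture of linear functions.

    betas     : list of unknown vectors (hypothetical name, for illustration)
    noise_std : standard deviation of the zero-mean Gaussian noise (0 = noiseless)
    """
    rng = random.Random(seed)

    def oracle(x):
        # One of the unknown vectors is chosen uniformly at random per query.
        beta = rng.choice(betas)
        signal = sum(xi * bi for xi, bi in zip(x, beta))
        if noise_std == 0.0:
            return signal
        return signal + rng.gauss(0.0, noise_std)

    return oracle
```

Repeating the same query many times in the noiseless setting reveals the set of inner products with the unknown vectors, but not which response came from which vector; that is the alignment difficulty the paper addresses.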
The most relevant previous work is by Yin et al. . For the noiseless case, i.e., , they show that queries are sufficient to recover all vectors in
with high probability. However, their result requires a restrictive assumption on the set of vectors and does not hold for an arbitrary set of sparse vectors. Specifically, they require that for any,
Their approach depends crucially on this assumption, which limits its applicability. Our results do not depend on such an assumption. For the noisy case, the approach taken by Yin et al. handles only the case of two unknown vectors, and they state the case of more than two vectors as an important open problem. Resolving this open problem is another of our contributions.
Mixture models and linear regression are immensely popular topics across statistics, signal processing and machine learning, with a large body of prior work. Mixture of linear regressions is a natural synthesis of mixture models and linear regression, a very basic machine learning primitive. Most of the work on the problem has considered learning generic vectors, i.e., not necessarily sparse, and proposes a variety of algorithmic techniques to obtain polynomial sample complexity [8, 14, 19, 24, 26]. To the best of our knowledge, Städler et al. [22] were the first to impose sparsity on the solutions. However, many of the earlier papers on mixtures of linear regressions essentially consider the queries to be fixed, i.e., part of the input, whereas in this paper, and in Yin et al. [27], we are interested in designing queries so as to minimize the number of queries.
Our Results and Techniques.
We present results for both the noiseless and noisy cases. The latter is significantly more involved and is the main technical contribution of this paper.
Noiseless Case: In the case where there is no noise and the unknown vectors are -sparse, we show that queries suffice and that queries are necessary. The upper bound matches the query complexity of the result by Yin et al., but our result applies to all -sparse vectors rather than just those satisfying the assumption in Eq. (2). The approach we take is as follows. In compressed sensing, exact recovery of -sparse vectors is possible by taking samples with an matrix with any columns linearly independent. Such matrices exist with (such as Vandermonde matrices) and are called MDS matrices. We use rows of such a matrix repeatedly to generate samples. Since there are different vectors in the mixture, with measurements with a row we will be able to see the samples corresponding to each of the vectors with that row. However, even if this is true for measurements with each row, we will still not be able to align measurements across the rows. For example, even though we will obtain for all and for all that are rows of an MDS matrix, we will be unable to identify the samples corresponding to . To tackle this problem, we propose using a special type of MDS matrix that allows us to align measurements corresponding to the same unknown vector. After that, we just use the sparse recovery property of the MDS matrix to individually recover each of the vectors.
Noisy Case: We assume that the noise is a Gaussian random variable with zero mean. Going forward, we write $\mathcal{N}(\mu, \sigma^2)$ to denote a Gaussian distribution with mean $\mu$ and variance $\sigma^2$. Furthermore, we will no longer assume that the unknown vectors are necessarily sparse. From the noisy samples, our objective is to recover an estimate for each unknown vector such that
where is an absolute constant and is the best -sparse approximation of , i.e., all except the largest (by absolute value) coordinates set to zero. The norms in the above equation can be arbitrary, defining the strength of the guarantee, e.g., when we refer to an guarantee both norms are . Our results should be contrasted with [27], where the results hold only for two unknown vectors and under the assumption in Eq. (2), and the vectors are also strictly -sparse. However, like [27], we assume -precision of the unknown vectors, i.e., the value in each coordinate of each is an integer multiple of . (Note that we do not assume -precision in the noiseless case.)
Notice that in this model the noise is additive and not multiplicative. Hence, it is possible to increase the norm of the queries arbitrarily so that the noise becomes inconsequential. However, in a real setting, this cannot be allowed since increasing the strength (norm) of the queries has a cost and it is in our interest to minimize the cost. Suppose the algorithm designs the query vector by first choosing a distribution and subsequently sampling a query vector . Let us now define the signal to noise ratio as follows:
Our objective in the noisy setting is to recover the unknown vectors while simultaneously minimizing the number of queries and the SNR. In this setting, assuming that all the unknown vectors have unit norm, we show that queries with suffice to reconstruct the vectors in with the approximation guarantees given in Eq. (3) with high probability if the noise is a zero-mean Gaussian with a variance of . This is equivalent to stating that queries suffice to recover the unknown vectors with high probability.
Note that in the previous work is assumed to be at least constant and, if this is the case, our result is optimal up to polynomial factors since queries are required even if . More generally, the dependence upon in our result improves upon the dependence in the result by Yin et al. Note that we assumed in our result because the dependence of sample complexity on is complicated as it is implicit in the signal-to-noise ratio.
As in the noiseless case, our approach is to use a compressed sensing matrix and use its rows multiple times as queries to the oracle. As a first step, we would like to separate the responses of the different unknown vectors among the samples obtained with the same row. Unlike the noiseless case, even this turns out to be a difficult task. Under the assumption of Gaussian noise, however, we are able to show that this is equivalent to learning a mixture of Gaussians with different means. In this case, the means of the Gaussians belong to an “-grid” because of the assumption on the precision of the unknown vectors. This is not a standard setting in the literature on learning Gaussian mixtures, e.g., [1, 16, 20]. Note that this is possible if the vector that we are sampling with has integer entries. As we will see, a binary-valued compressed sensing matrix will do the job. We will rely on a novel complex-analytic technique to exactly learn the means of a mixture of Gaussians, with means belonging to an -grid. This technique is paralleled by recent developments in trace reconstruction, where similar methods were used for learning a mixture of binomials [18, 21].
Once the samples are separated for each query, we are still tasked with aligning them so that we know which samples were produced by the same unknown vector across different queries. The method for the noiseless case fails to work here. Instead, we use a new method motivated by error-correcting codes. In particular, we perform several redundant queries that help us to do this alignment. For example, in addition to a pair of queries, we also perform the queries defined by their sum and their difference.
After the alignment, we use compressed sensing recovery to estimate the unknown vectors. For this, we must start with a matrix that, with a minimal number of rows, allows us to recover any vector with a guarantee such as (3). On top of this, we also need the matrix to have integer entries so that we can use our method of learning a mixture of Gaussians with means on an -grid. Fortunately, a random binary matrix satisfies all the requirements. Putting together these three steps of learning mixtures, aligning and compressed sensing lets us arrive at our results.
While we concentrate on sample complexity in this paper, our algorithm for the noiseless case is computationally efficient, and the only computationally inefficient step in the general noisy case is that of learning Gaussian mixtures. However, in practice one can perform a simple clustering (such as Lloyd’s algorithm) to learn the means of the mixture.
Organization and Notation.
In Section 2, we present our results for the noiseless case. In Section 3.1 we consider the case with noise when and then consider noise and general in Section 3.2. Most proofs are deferred to the appendix in the supplementary material. Throughout, we write to denote taking an element from a finite set uniformly at random. For , let .
2 Exact sparse vectors and noiseless samples
To begin, we deal with the case of a uniform mixture of exactly sparse vectors, with the oracle returning noiseless answers when queried with a vector. For this case, our scheme is provided in Algorithm 1. The main result for this section is the following.
Theorem 1. For a collection of vectors such that , one can recover all of them exactly with probability at least with a total of oracle queries. See Algorithm 1.
A Vandermonde matrix is a matrix such that the entries in each row of the matrix are in geometric progression, i.e., for an dimensional Vandermonde matrix the th entry is where are distinct values. We will use the following useful property of Vandermonde matrices; see, e.g., [15, Section XIII.8] for the proof.
Lemma 1. Any square submatrix of a Vandermonde matrix has full rank, assuming are distinct and positive.
This implies that, with the samples from a Vandermonde matrix, a -sparse vector can be exactly recovered. This is because for any two unknown vectors and , the same set of responses for all the rows of the Vandermonde matrix implies that a square submatrix of the Vandermonde matrix is not full rank which is a contradiction to Lemma 1.
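The full-rank property can be checked exhaustively on a small instance. The sketch below builds a Vandermonde matrix with exact rational arithmetic and tests every square submatrix of a given size; the function names and the particular nodes are assumptions made for illustration. It also shows why positivity matters: with nodes 1 and -1, the submatrix on the columns of exponents 0 and 2 is singular.

```python
from fractions import Fraction
from itertools import combinations


def det_exact(matrix):
    """Determinant via Gaussian elimination over the rationals (no rounding)."""
    m = [[Fraction(v) for v in row] for row in matrix]
    n = len(m)
    det = Fraction(1)
    for c in range(n):
        pivot = next((r for r in range(c, n) if m[r][c] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != c:
            m[c], m[pivot] = m[pivot], m[c]
            det = -det
        det *= m[c][c]
        for r in range(c + 1, n):
            factor = m[r][c] / m[c][c]
            m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[c])]
    return det


def all_square_submatrices_full_rank(nodes, num_cols, size):
    """Check every size-by-size submatrix of the Vandermonde matrix
    whose row for node a is (a^0, a^1, ..., a^(num_cols - 1))."""
    vand = [[Fraction(a) ** j for j in range(num_cols)] for a in nodes]
    for rows in combinations(range(len(nodes)), size):
        for cols in combinations(range(num_cols), size):
            sub = [[vand[r][c] for c in cols] for r in rows]
            if det_exact(sub) == 0:
                return False
    return True
```

For example, `all_square_submatrices_full_rank([1, 2, 3, 5], 6, 3)` examines all 3-by-3 submatrices for four distinct positive nodes, while `all_square_submatrices_full_rank([1, -1], 3, 2)` fails because a node is negative.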
We are now ready to prove Theorem 1.
Proof. For the case of a single unknown vector, note that the setting is the same as the well-known compressed sensing problem. Furthermore, if a matrix has the property that any submatrix is full rank, then using the rows of this matrix as queries is sufficient to recover any -sparse vector. By Lemma 1, any Vandermonde matrix has the necessary property.
Let be the set of unknown -sparse vectors. Notice that a particular row of the Vandermonde matrix looks like for some value of . Therefore, for some vector and a particular row of the Vandermonde matrix, the inner product of the two can be interpreted as a degree polynomial evaluated at such that the coefficients of the polynomial form the vector . More formally, the inner product can be written as where is the polynomial corresponding to the vector . For any value , we can define an ordering over the polynomials such that iff .
For two distinct indices , we will call the polynomial a difference polynomial. Each difference polynomial has at most non-zero coefficients and therefore has at most positive roots by Descartes’ Rule of Signs [9]. Since there are at most distinct difference polynomials, the total number of distinct values that are roots of at least one difference polynomial is less than . Note that if an interval does not include any of these roots, then the ordering of remains consistent for any point in that interval. In particular, consider the intervals where . At most of these intervals include a root of a difference polynomial and hence, if we pick a random interval, then with probability at least the ordering of is consistent throughout the interval. If the interval chosen is then set for .
Now for each value of , define the vector . For each , the vector will be used as query to the oracle repeatedly for times. We will call the set of query responses from the oracle for a fixed query vector a batch. For a fixed batch and ,
Taking a union bound over all the vectors ( of them) and all the batches ( of them), we get that in every batch every vector for is sampled at least once with probability at least . Now, for each batch, we will retain the unique values (there should be exactly of them with high probability) and sort the values in each batch. Since the ordering of the polynomials remains the same, after sorting, all the values in a particular position in each batch correspond to the same vector for some unknown index . We can aggregate the query responses of all the batches in each position and, since there are linear measurements corresponding to the same vector, we can recover all the unknown vectors using Lemma 1. The failure probability of this algorithm is at most . ∎
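The sort-and-align argument in the proof can be illustrated on a toy instance. The following is only a sketch under stated assumptions, not the paper's Algorithm 1: all names are hypothetical, sparse recovery is done by brute force over supports, and the query nodes are taken from the interval [3, 4), which was checked by hand to contain no root of any difference polynomial for the three example vectors used in the test (the proof instead picks the interval at random).

```python
from fractions import Fraction
from itertools import combinations
import random


def make_oracle(betas, rng):
    """Noiseless oracle: inner product with a uniformly chosen unknown vector."""
    return lambda x: sum(xi * bi for xi, bi in zip(x, rng.choice(betas)))


def solve_exact(a, b):
    """Solve a nonsingular square system exactly (Gauss-Jordan over rationals)."""
    n = len(a)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for c in range(n):
        pivot = next(r for r in range(c, n) if m[r][c] != 0)
        m[c], m[pivot] = m[pivot], m[c]
        d = m[c][c]
        m[c] = [v / d for v in m[c]]
        for r in range(n):
            if r != c:
                f = m[r][c]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[c])]
    return [m[r][n] for r in range(n)]


def recover_sparse(alphas, ys, n, k):
    """Brute-force k-sparse recovery from 2k Vandermonde measurements."""
    for support in combinations(range(n), k):
        a = [[al ** j for j in support] for al in alphas[:k]]
        coef = solve_exact(a, ys[:k])
        # Accept the support only if it also explains the remaining measurements.
        if all(sum(c * al ** j for c, j in zip(coef, support)) == y
               for al, y in zip(alphas[k:], ys[k:])):
            x = [Fraction(0)] * n
            for c, j in zip(coef, support):
                x[j] = c
            return x
    return None


def recover_mixture(oracle, n, k, num_vectors, alphas, batch_size):
    """Query each Vandermonde row repeatedly, sort each batch, align by rank."""
    batches = []
    for al in alphas:
        row = [al ** j for j in range(n)]
        values = sorted({oracle(row) for _ in range(batch_size)})
        if len(values) != num_vectors:
            raise RuntimeError("a batch missed a vector or two values collided")
        batches.append(values)
    # Position p in every sorted batch belongs to the same unknown vector.
    return [recover_sparse(alphas, [b[p] for b in batches], n, k)
            for p in range(num_vectors)]
```

Exact rational arithmetic is used here because nodes confined to a short interval make the Vandermonde system badly conditioned in floating point; this is a design choice of the sketch rather than something the proof prescribes.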
The following theorem establishes that our method is almost optimal in terms of sample complexity.
Theorem 2. At least oracle queries are necessary to recover an arbitrary set of vectors that are -sparse.
3 Noisy Samples and Sparse Approximation
We now consider the more general setting where the oracle is noisy and the vectors are not necessarily sparse. We assume is an arbitrary constant, i.e., it does not grow with or , and that the unknown vectors have precision, i.e., each entry is an integer multiple of . The noise will be Gaussian with zero mean and variance , i.e., . Our main result of this section is the following.
Theorem 3. It is possible to recover approximations with the guarantee in Eq. (3) with probability at least of all the unknown vectors with oracle queries where .
Before we proceed with the ideas of proof, it would be useful to recall the restricted isometry property (RIP) of matrices in the context of recovery guarantees of (3). A matrix satisfies the -RIP if for any vector with
algorithm, an efficient algorithm based on linear programming. It is also known that a random matrix (with normalized columns) satisfies the property with rows, where is an absolute constant.
There are several key ideas of the proof. Since the case of is simpler to handle, we start with that and then provide the extra steps necessary for the general case subsequently.
3.1 Gaussian Noise: Two vectors
Algorithm 2 addresses the setting with only two unknown vectors. We will assume , so that we can subsequently show that the SNR is simply . This assumption is not necessary, but we make it for ease of presentation. The assumption of -precision for was made in Yin et al. [27], and we stick to the same assumption. On the other hand, Yin et al. require further assumptions that we do not need to make. Furthermore, the result of Yin et al. is restricted to exactly sparse vectors, whereas our result holds for general sparse approximation.
For the two-vector case, the result we aim to show is the following.
This result is directly comparable with [27]. On the statistical side, we improve their result in several ways: (1) we improve the dependence on in the sample complexity from to (note that [27] treat as constant in their theorem statement, but the dependence can be extracted from their proof); (2) our result applies to dense vectors, recovering the best -sparse approximations; and (3) we do not need the overlap assumption (Eq. (2)) used in their work.
Once we show , Theorem 4 trivially implies Theorem 3 in the case . Indeed, from Algorithm 2, notice that we have used vectors sampled uniformly at random from as query vectors. We must have for . Further, we have used the sum and difference query vectors, which have the form and respectively, where are sampled uniformly and independently from . Therefore, we must have for . According to our definition of , we have that .
The main insight is that for a fixed sensing vector , if we repeatedly query with , we obtain samples from a mixture of Gaussians . If we can exactly recover the means of these Gaussians, we essentially reduce to the noiseless case from the previous section. The first key step upper bounds the sample complexity for exactly learning the parameters of a mixture of Gaussians.
Lemma 2 (Learning Gaussian mixtures).
Let be a uniform mixture of univariate Gaussians, with known shared variance and with means . Then, for some constant and some , there exists an algorithm that requires samples from and exactly identifies the parameters with probability at least .
If we sense with then , so appealing to the above lemma, we can proceed assuming we know these two values exactly. Unfortunately, the sensing vectors here are more restricted (we must maintain bounded SNR, and our technique of mixture learning requires that the means have finite precision), so we cannot simply appeal to our noiseless results for the alignment step. Instead we design a new alignment strategy, inspired by error-correcting codes. Given two query vectors and the exact means , , we must identify which values correspond to and . In addition to sensing with any pair and we sense with , and we use these two additional measurements to identify which recovered means correspond to and which correspond to . Intuitively, we can check if our alignment is correct via these reference measurements.
Therefore, we can obtain aligned, denoised inner products with each of the two parameter vectors. At this point we can apply a standard compressed sensing result as mentioned at the start of this section to obtain the sparse approximations of vectors.
3.2 General value of
In this setting, we will have unknown vectors, each of unit norm, from which the oracle samples with equal probability. We assume that does not grow with or and, as before, that all the elements in the unknown vectors lie on a -grid. Here, we will build on the ideas for the special case of .
The main result of this section is the following.
Theorem 3 follows as a corollary of this result.
There are two main hurdles in extending the steps explained for . For a query vector , we define the denoised query means to be the set of elements . Recall that a query vector is defined to be good if all the elements in the set of denoised query means are distinct. For , the probability of a query vector being good for is at least but for a value of larger than , it is not possible to obtain such guarantees without further assumptions. For a more concrete example, consider and the unknown vectors to be such that has 1 in the position and zero everywhere else. If is sampled from as before, then can take values only in and therefore it is not possible that all the values are distinct. Secondly, even if we have a good query vector, it is no longer trivial to extend the clustering or alignment step. Hence a number of new ideas are necessary to solve the problem for any general value of .
We need to define a few constants which are used in the algorithm. Let be a constant (we need a that allows -sparse approximation given a -RIP matrix). Let be a large positive constant such that
Secondly, let be another positive constant that satisfies the following for a given value of ,
Finally, for a given value of and , let be the smallest integer that satisfies the following:
The Denoising Step.
In each step of the algorithm, we sample a vector uniformly at random from , another vector uniformly at random from and a number uniformly at random from . Now, we will use a batch of queries corresponding to the vectors and . We define a triplet of query vectors to be good if for all triplets of indices such that are not identical,
We show that the query vector triplet is good with at least some probability. This implies that if we choose triplets of such query vectors, then at least one of the triplets is good with probability . It turns out that, for a good triplet of vectors , we can obtain for all .
Furthermore, it follows from Lemma 2 that for a query vector with integral entries, a batch size of , for some constant , is sufficient to recover the denoised query responses for all the queries with probability at least .
The Alignment Step.
Let a particular good query vector triplet be . From now, we will consider the elements to be labels and for a vector , we will associate a label with every element in . The labelling is correct if, for all , the element labelled as also corresponds to the same unknown vector . Notice that we can label the elements correctly because the triplet is good. Consider another good query vector triplet . This matches with the earlier query triplet if additionally, the vector triplet is also good.
Such a matching pair of good triplets exists and can be found by random choice with some probability. We show that the matching good triplets allow us to do the alignment in the case of general .
At this point we would again like to appeal to the standard compressed sensing results. However we need to show that the matching good vectors themselves form a matrix that has the required RIP property. As our final step, we establish this fact.
Remark 3 (Refinement and adaptive queries).
It is possible to have a sample complexity of in Theorem 3, but with a probability of . Also, it is possible to shave off another factor from the sample complexity if we can make the queries adaptive.
Acknowledgements: This research is supported in part by NSF Grants CCF 1642658, 1618512, 1909046, 1908849 and 1934846.
[1] S. Arora and R. Kannan. Learning mixtures of arbitrary Gaussians. In Symposium on Theory of Computing, 2001.
[2] R. Baraniuk, M. Davenport, R. DeVore, and M. Wakin. The Johnson-Lindenstrauss lemma meets compressed sensing. preprint, 100(1):0, 2006.
[3] R. Baraniuk, M. Davenport, R. DeVore, and M. Wakin. A simple proof of the restricted isometry property for random matrices. Constructive Approximation, 28(3):253–263, 2008.
[4] P. Borwein and T. Erdélyi. Littlewood-type problems on subarcs of the unit circle. Indiana University Mathematics Journal, 1997.
[5] S. Boucheron, G. Lugosi, and P. Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press, 2013.
[6] E. J. Candès. The restricted isometry property and its implications for compressed sensing. Comptes Rendus Mathematique, 346(9-10):589–592, 2008.
[7] E. J. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52(2):489–509, 2006.
[8] A. T. Chaganty and P. Liang. Spectral experts for estimating mixtures of linear regressions. In International Conference on Machine Learning, pages 1040–1048, 2013.
[9] D. Curtiss. Recent extensions of Descartes’ rule of signs. Annals of Mathematics, pages 251–278, 1918.
[10] S. Dasgupta. Learning mixtures of Gaussians. In Foundations of Computer Science, pages 634–644, 1999.
[11] R. D. De Veaux. Mixtures of linear regressions. Computational Statistics & Data Analysis, 8(3):227–245, 1989.
[12] L. Devroye and G. Lugosi. Combinatorial methods in density estimation. Springer Science & Business Media, 2012.
[13] D. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.
[14] S. Faria and G. Soromenho. Fitting mixtures of linear regressions. Journal of Statistical Computation and Simulation, 80(2):201–225, 2010.
[15] F. R. Gantmakher. The Theory of Matrices, volume 2. 1959.
[16] M. Hardt and E. Price. Tight bounds for learning a mixture of two Gaussians. In Symposium on Theory of Computing, 2015.
[17] A. Kalai, A. Moitra, and G. Valiant. Disentangling Gaussians. Communications of the ACM, 55(2):113–120, 2012.
[18] A. Krishnamurthy, A. Mazumdar, A. McGregor, and S. Pal. Trace reconstruction: Generalized and parameterized. In 27th Annual European Symposium on Algorithms, ESA 2019, September 9-11, 2019, Munich/Garching, Germany, pages 68:1–68:25, 2019.
[19] J. Kwon and C. Caramanis. Global convergence of EM algorithm for mixtures of two component linear regression. arXiv preprint arXiv:1810.05752, 2018.
[20] A. Moitra and G. Valiant. Settling the polynomial learnability of mixtures of Gaussians. In Foundations of Computer Science, 2010.
[21] F. Nazarov and Y. Peres. Trace reconstruction with $\exp(O(n^{1/3}))$ samples. In Symposium on Theory of Computing, 2017.
[22] N. Städler, P. Bühlmann, and S. Van De Geer. l1-penalization for mixture regression models. Test, 19(2):209–256, 2010.
[23] D. M. Titterington, A. F. Smith, and U. E. Makov. Statistical analysis of finite mixture distributions. Wiley, 1985.
[24] K. Viele and B. Tong. Modeling with mixtures of linear regressions. Statistics and Computing, 12(4):315–330, 2002.
[25] X. Yi, C. Caramanis, and S. Sanghavi. Alternating minimization for mixed linear regression. In International Conference on Machine Learning, pages 613–621, 2014.
[26] X. Yi, C. Caramanis, and S. Sanghavi. Solving a mixture of many random linear equations by tensor decomposition and alternating minimization. arXiv preprint arXiv:1608.05749, 2016.
[27] D. Yin, R. Pedarsani, Y. Chen, and K. Ramchandran. Learning mixtures of sparse linear regressions using sparse graph codes. IEEE Transactions on Information Theory, 65(3):1430–1451, 2019.
Appendix A Proof of Theorem 2
It is known that for any particular vector , at least queries to the oracle are necessary in order to recover the vector exactly. Suppose the random variable denotes the number of queries until the oracle has sampled the vector at least times. Notice that can be written as a sum of independent and identical random variables
distributed according to the geometric distribution with parameter , where denotes the number of attempts required to obtain the sample after the sample has been made by the oracle. Since is a sum of independent random variables, we must have
Therefore, by using Chebyshev’s inequality, we must have
and therefore with high probability which proves the statement of the theorem.
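The waiting-time argument above can be checked numerically. The sketch below assumes a uniform mixture, so that each query hits one fixed vector with probability one over the number of vectors, and counts the queries needed until that vector has been sampled a required number of times; the names `num_vectors` and `required` are hypothetical. The empirical average should be close to the product of the two, which is the mean of a sum of independent geometric random variables.

```python
import random


def queries_until_seen(num_vectors, required, rng):
    """Number of oracle calls until one fixed vector is sampled `required` times."""
    seen = 0
    queries = 0
    while seen < required:
        queries += 1
        # The oracle picks uniformly; index 0 stands for the fixed vector.
        if rng.randrange(num_vectors) == 0:
            seen += 1
    return queries


def average_queries(num_vectors, required, trials, seed=0):
    """Monte Carlo estimate of the expected number of queries."""
    rng = random.Random(seed)
    total = sum(queries_until_seen(num_vectors, required, rng) for _ in range(trials))
    return total / trials
```

For instance, `average_queries(4, 10, 4000)` should land near 40, since each of the 10 required samples takes 4 queries in expectation.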
Algorithm 2 (Design of queries and denoising): Let be the total number of queries that we will make. In the first step of the algorithm, for a particular query vector , our objective is to recover and , which we will call the denoised query responses corresponding to the vector . Intuitively, in order to do this, we need to use the same query vector repeatedly and aggregate the noisy query responses to recover the denoised counterparts.
Therefore, at every iteration in Step 1 of Algorithm 2, we sample a vector uniformly at random from . Once the vector is sampled, we use as the query vector repeatedly for times. We call the set of query responses to the same query vector a batch of size . Since is fixed, the query responses in a batch are sampled from a Gaussian mixture distribution with means and and variance , in short,
Therefore, the problem reduces to recovering the mean parameters from a mixture of Gaussian distributions with at most two mixture constituents (since the means can be the same) having the same variance. We will use the following important lemma for this problem.
Lemma (Lemma 2: Learning Gaussian mixtures).
Let be a uniform mixture of univariate Gaussians, with known shared variance and with means . Then, for some constant and some , there exists an algorithm that requires samples from and exactly identifies the parameters with probability at least .
The proof of this lemma can be found in Appendix C. We now have the following lemma to characterize the size of each batch .
For any query vector , a batch size of , for a constant , is sufficient to recover the denoised query responses and with probability at least .
Since , the claim follows from Lemma 2. ∎
For any query vectors sampled uniformly at random from , a batch size of , for some constant , is sufficient to recover the denoised query responses corresponding to every query vector with probability at least .
This statement is proved by taking a union bound over batches corresponding to that many query vectors. ∎
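To make the denoising of a single batch concrete, here is a small simulation. It is not the complex-analytic algorithm behind Lemma 2; it swaps in a plain method-of-moments estimate for a uniform two-component mixture with known variance and then rounds to the grid, which is only a sketch of the idea that grid-valued means can be recovered exactly once the estimation error falls below half the grid spacing. The names `sample_batch`, `denoise_batch`, and `eps` are hypothetical.

```python
import math
import random


def sample_batch(mu1, mu2, sigma, size, rng):
    """Noisy responses to one repeated query: uniform mixture of two Gaussians."""
    return [rng.choice((mu1, mu2)) + rng.gauss(0.0, sigma) for _ in range(size)]


def denoise_batch(samples, sigma, eps):
    """Estimate the two means by moments, then snap them to the eps-grid.

    For a uniform two-component mixture with shared variance sigma^2,
    mean = (mu1 + mu2) / 2 and variance = sigma^2 + ((mu2 - mu1) / 2)^2.
    """
    size = len(samples)
    mean = sum(samples) / size
    var = sum((y - mean) ** 2 for y in samples) / size
    half_gap = math.sqrt(max(var - sigma ** 2, 0.0))

    def snap(value):
        # Round to the nearest integer multiple of eps.
        return round(value / eps) * eps

    return snap(mean - half_gap), snap(mean + half_gap)
```

With means 0.5 and 2.0, unit noise variance, and a batch of 50000 samples, the moment estimates are typically within about 0.01 of the truth, far below half the grid spacing of 0.5, so rounding returns the exact means.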
Algorithm 2 (Alignment step):
Notice from the previous discussion that, for each batch corresponding to a query vector , we obtain the pair of values . However, we still need to cluster these values (by taking one value from each pair and assigning it to one of the clusters) into two clusters corresponding to and . We will first explain the clustering process for two particular query vectors and for which we have already obtained the pairs and .
The objective is to cluster the four samples into two groups of two samples each so that the samples in each cluster correspond to the same unknown sensed vector.
Now, we have two cases to consider:
Case 1: In this scenario, the values in at least one of the pairs are the same and any grouping works.
Case 2: . We use two more batches corresponding to the vectors and , which belong to . We will call the vector the sum query and the vector the difference query corresponding to , respectively. Hence, using Lemma 4 again, we will be able to obtain the pairs and . Now, we will choose two elements from the pairs and (one element from each pair) such that their sum belongs to the pair and their difference belongs to the pair . In our algorithm, we will put these two elements into one cluster and the other two elements into the other cluster. By construction, we must put in one cluster and in the other.
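The two cases above translate into a short routine. The sketch below assumes the denoised means are exact and that the sum and difference queries are the unscaled vectors formed by adding and subtracting the two query vectors (the paper's queries may be scaled differently); the function names are hypothetical.

```python
def inner(x, beta):
    """Inner product of a query vector with an unknown vector."""
    return sum(xi * bi for xi, bi in zip(x, beta))


def denoised_pair(x, beta_a, beta_b):
    """Unordered pair of exact means that a batch for query x would reveal."""
    return tuple(sorted((inner(x, beta_a), inner(x, beta_b)), reverse=True))


def align_pair(pair1, pair2, sum_pair, diff_pair):
    """Group one value from each pair so that each group belongs to one vector.

    pair1, pair2 : denoised means for the two query vectors (order unknown)
    sum_pair     : denoised means for the sum query
    diff_pair    : denoised means for the difference query
    """
    a1, a2 = pair1
    b1, b2 = pair2
    if a1 == a2 or b1 == b2:
        # Case 1: one pair has identical values, so any grouping is valid.
        return (a1, b1), (a2, b2)
    # Case 2: the partner of a1 is the value whose sum with a1 appears in
    # sum_pair and whose difference from a1 appears in diff_pair.
    for b, b_other in ((b1, b2), (b2, b1)):
        if (a1 + b) in sum_pair and (a1 - b) in diff_pair:
            return (a1, b), (a2, b_other)
    raise ValueError("inconsistent means: no valid alignment found")
```

As a usage example, with unknown vectors (2, 0, -1, 0) and (0, 1, 3, 0) and query vectors (1, -1, 1, 1) and (1, 1, -1, 1), the routine groups the mean 1 with 3 (first vector) and the mean 2 with -2 (second vector).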
Putting it all together, in Algorithm 2, we uniformly and randomly choose query vectors from and for each of them, we use it repeatedly for