# Sample Complexity of Learning Mixtures of Sparse Linear Regressions

In the problem of learning mixtures of linear regressions, the goal is to learn a collection of signal vectors from a sequence of (possibly noisy) linear measurements, where each measurement is evaluated on an unknown signal drawn uniformly from this collection. This setting is quite expressive and has been studied both in terms of practical applications and for the sake of establishing theoretical guarantees. In this paper, we consider the case where the signal vectors are sparse; this generalizes the popular compressed sensing paradigm. We improve upon the state-of-the-art results as follows: In the noisy case, we resolve an open question of Yin et al. (IEEE Transactions on Information Theory, 2019) by showing how to handle collections of more than two vectors and present the first robust reconstruction algorithm, i.e., if the signals are not perfectly sparse, we still learn a good sparse approximation of the signals. In the noiseless case, as well as in the noisy case, we show how to circumvent the need for a restrictive assumption required in the previous work. Our techniques are quite different from those in the previous work: for the noiseless case, we rely on a property of sparse polynomials and for the noisy case, we provide new connections to learning Gaussian mixtures and use ideas from the theory of error-correcting codes.

## 1 Introduction

Learning mixtures of linear regressions is a natural generalization of the basic linear regression problem. In the basic problem, the goal is to learn the best linear relationship between the scalar responses (i.e., labels) and the explanatory variables (i.e., features). In the generalization, each scalar response is stochastically generated by picking a function uniformly from a set of unknown linear functions, evaluating this function on the explanatory variables and possibly adding noise; the goal is to learn the set of unknown linear functions. The problem was introduced by De Veaux [11] over thirty years ago and has recently attracted growing interest [8, 14, 22, 24, 25, 27]. Recent work focuses on a query-based scenario in which the input to the randomly chosen linear function can be specified by the learner. The sparse setting, in which each linear function depends on only a small number of variables, was recently considered by Yin et al. [27], and can be viewed as a generalization of the well-studied compressed sensing problem [7, 13]. The problem has numerous applications in modelling heterogeneous data arising in medical applications, behavioral health, and music perception [27].

#### Formal Problem Statement.

There are $L$ unknown distinct vectors $\beta^1, \dots, \beta^L$ and each is $k$-sparse, i.e., the number of non-zero entries in each vector is at most $k$, where $k$ is some known parameter. We define an oracle which, when queried with a vector $x$, returns the noisy output $y$:

$$y = \langle x, \beta \rangle + \eta \qquad (1)$$

where $\eta$ is a zero-mean random variable that represents the measurement noise and $\beta$ is chosen uniformly from the set $\{\beta^1, \dots, \beta^L\}$. (Many of our results can be generalized to non-uniform distributions, but we assume a uniform distribution throughout for the sake of clarity.) The goal is to recover all $L$ vectors by making a set of queries to the oracle. We refer to the values returned by the oracle given these queries as samples. Note that the case of $L = 1$ corresponds to the problem of compressed sensing. Our primary focus is on the sample complexity of the problem, i.e., minimizing the number of queries that suffices to recover the sparse vectors up to some tolerable error.

#### Related Work.

The most relevant previous work is by Yin et al. [27]. For the noiseless case, i.e., $\eta = 0$, they show that $O(kL \log(kL))$ queries are sufficient to recover all the unknown vectors with high probability. However, their result requires a restrictive assumption on the set of vectors and does not hold for an arbitrary set of sparse vectors. Specifically, they require that for any two distinct unknown vectors $\beta, \beta'$,

$$\beta_j \neq \beta'_j \quad \text{for each } j \in \mathrm{supp}(\beta) \cap \mathrm{supp}(\beta'). \qquad (2)$$

Their approach depends crucially on this assumption, which limits its applicability. Our results do not depend on such an assumption. For the noisy case, the approach taken by Yin et al. only handles the case $L = 2$, and they state the case of $L > 2$ as an important open problem. Resolving this open problem is another of our contributions.

More generally, both compressed sensing [7, 13] and learning mixtures of distributions [10, 23] are immensely popular topics across statistics, signal processing and machine learning, with a large body of prior work. Mixture of linear regressions is a natural synthesis of mixture models and linear regression, a very basic machine learning primitive [11]. Most of the work on the problem has considered learning generic vectors, i.e., not necessarily sparse, and proposes a variety of algorithmic techniques to obtain polynomial sample complexity [8, 14, 19, 24, 26]. To the best of our knowledge, Städler et al. [22] were the first to impose sparsity on the solutions. However, many of the earlier papers on mixtures of linear regressions essentially consider the queries to be fixed, i.e., part of the input, whereas in this paper, and in Yin et al. [27], we are interested in designing queries so as to minimize the number of queries.

#### Our Results and Techniques.

We present results for both the noiseless and noisy cases. The latter is significantly more involved and is the main technical contribution of this paper.

Noiseless Case: In the case where there is no noise and the unknown vectors are $k$-sparse, we show that $O(kL \log(kL))$ queries suffice and that $\Omega(kL)$ queries are necessary. The upper bound matches the query complexity of the result by Yin et al., but our result applies to all $k$-sparse vectors rather than just those satisfying the assumption in Eq. (2). The approach we take is as follows. In compressed sensing, exact recovery of $k$-sparse vectors is possible by taking samples with a matrix in which any $2k$ columns are linearly independent. Such matrices exist with $2k$ rows (for example, Vandermonde matrices) and are called MDS matrices. We use the rows of such a matrix repeatedly to generate samples. Since there are $L$ different vectors in the mixture, with enough repeated measurements with a row we will see the samples corresponding to each of the $L$ vectors with that row. However, even if this is true for the measurements with each row, we will still not be able to align measurements across the rows. For example, even though we will obtain $\langle x, \beta^\ell \rangle$ for every unknown vector $\beta^\ell$ and for every $x$ that is a row of the MDS matrix, we will be unable to identify which samples correspond to the same vector. To tackle this problem, we propose using a special type of MDS matrix that allows us to align measurements corresponding to the same $\beta$. After that, we use the sparse recovery property of the MDS matrix to individually recover each of the vectors.

Noisy Case: We assume that the noise $\eta$ is a Gaussian random variable with zero mean. Going forward, we write $\mathcal{N}(\mu, \sigma^2)$ to denote a Gaussian distribution with mean $\mu$ and variance $\sigma^2$. Furthermore, we will no longer assume that the unknown vectors are necessarily sparse. From the noisy samples, our objective is to recover an estimate $\hat{\beta}$ for each unknown vector $\beta$ such that

$$\|\beta - \hat{\beta}\| \le c \, \|\beta - \beta^*\|, \qquad (3)$$

where $c$ is an absolute constant and $\beta^*$ is the best $k$-sparse approximation of $\beta$, i.e., all except the $k$ largest (by absolute value) coordinates are set to $0$. The norms in the above equation can be arbitrary, defining the strength of the guarantee, e.g., when we refer to an $\ell_1/\ell_1$ guarantee both norms are $\|\cdot\|_1$. Our results should be contrasted with [27], where the results not only hold only for $L = 2$ and under assumption (2), but the vectors are also strictly $k$-sparse. However, like [27], we assume $\epsilon$-precision of the unknown vectors, i.e., the value in each coordinate of each $\beta$ is an integer multiple of $\epsilon$. (Note that we do not assume $\epsilon$-precision in the noiseless case.)

Notice that in this model the noise is additive and not multiplicative. Hence, it is possible to increase the norm of the queries arbitrarily so that the noise becomes inconsequential. In a real setting, however, this cannot be allowed, since increasing the strength (norm) of the queries has a cost and it is in our interest to minimize that cost. Suppose the algorithm designs the $i$-th query vector by first choosing a distribution $Q_i$ and subsequently sampling a query vector $x_i \sim Q_i$. We define the signal-to-noise ratio as follows:

$$\mathrm{SNR} = \max_i \min_\ell \frac{\mathbb{E}_{x_i \sim Q_i} |\langle x_i, \beta^\ell \rangle|^2}{\mathbb{E}\,\eta^2}. \qquad (4)$$

Our objective in the noisy setting is to recover the unknown vectors while minimizing the number of queries and the SNR at the same time. In this setting, assuming that all the unknown vectors have unit norm, we show that queries with suffice to reconstruct the vectors in with the approximation guarantees given in Eq. (3) with high probability if the noise is a zero-mean Gaussian with a variance of . This is equivalent to stating that queries suffice to recover the unknown vectors with high probability.
Note that in the previous work is assumed to be at least constant and, if this is the case, our result is optimal up to polynomial factors since queries are required even if . More generally, the dependence upon in our result improves upon the dependence in the result by Yin et al. Note that we assumed in our result because the dependence of sample complexity on is complicated as it is implicit in the signal-to-noise ratio.
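
As a small illustration of definition (4), the following sketch evaluates the SNR exactly for a single query distribution. It assumes, purely for illustration, that queries are drawn uniformly from $\{-1,+1\}^n$ and that the noise variance `sigma2` is known; the helper names `second_moment` and `snr` are hypothetical and do not come from the paper. Under this assumption the cross terms cancel, so $\mathbb{E}\langle x, \beta\rangle^2 = \|\beta\|_2^2$ and unit-norm vectors give an SNR of $1/\sigma^2$.

```python
from itertools import product

def second_moment(beta):
    """Exact E<x, beta>^2 for x uniform over {-1, +1}^n (an illustrative assumption)."""
    n = len(beta)
    total = 0.0
    for x in product((-1, 1), repeat=n):
        total += sum(xi * bi for xi, bi in zip(x, beta)) ** 2
    return total / 2 ** n

def snr(betas, sigma2):
    """Eq. (4) for one query distribution: worst-case signal power over noise power."""
    return min(second_moment(b) for b in betas) / sigma2

# Two unit-norm example vectors (hypothetical data) and noise variance 0.25
betas = [[1.0, 0.0, 0.0, 0.0], [0.6, 0.8, 0.0, 0.0]]
value = snr(betas, sigma2=0.25)
```

With several query distributions, the outer maximum in Eq. (4) would simply be taken over the values returned by `snr` for each of them.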

As in the noiseless case, our approach is to use a compressed sensing matrix and use its rows multiple times as queries to the oracle. As a first step, we would like to separate out the different $\beta$s from their samples with the same row. Unlike the noiseless case, even this turns out to be a difficult task. Under the assumption of Gaussian noise, however, we are able to show that this is equivalent to learning a mixture of Gaussians with different means. In this case, the means of the Gaussians belong to an "$\epsilon$-grid", because of the assumption on the precision of the $\beta$s. This is not a standard setting in the literature on learning Gaussian mixtures, e.g., [1, 16, 20]. Note that this is possible if the vector that we are sampling with has integer entries. As we will see, a binary-valued compressed sensing matrix will do the job for us. We rely on a novel complex-analytic technique to exactly learn the means of a mixture of Gaussians with means belonging to an $\epsilon$-grid. This technique is paralleled by recent developments in trace reconstruction, where similar methods were used for learning a mixture of binomials [18, 21].

Once the samples are separated for each query, we are still tasked with aligning them so that we know which samples are produced by the same $\beta$ across different queries. The method for the noiseless case fails to work here. Instead, we use a new method motivated by error-correcting codes. In particular, we perform several redundant queries that help us to do this alignment. For example, in addition to a pair of queries $x_i, x_j$, we also perform the queries defined by their sum and difference, $x_i + x_j$ and $x_i - x_j$.

After the alignment, we use compressed sensing recovery to estimate the unknown vectors. For this, we must start with a matrix that, with a minimal number of rows, allows us to recover any vector with a guarantee such as (3). On top of this, we also need the matrix to have integer entries so that we can use our method of learning a mixture of Gaussians with means on an $\epsilon$-grid. Fortunately, a random binary matrix satisfies all the requirements [3]. Putting together these three steps (learning mixtures, aligning, and compressed sensing) lets us arrive at our results.

While we concentrate on sample complexity in this paper, our algorithm for the noiseless case is computationally efficient, and the only computationally inefficient step in the general noisy case is that of learning Gaussian mixtures. However, in practice one can perform a simple clustering (such as Lloyd’s algorithm) to learn the means of the mixture.
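
The practical heuristic mentioned above can be sketched as follows. This is not the mixture-learning algorithm analyzed in the paper; it is a minimal one-dimensional Lloyd iteration, under the assumptions that there are two well-separated means, that they lie on a grid of known spacing `eps`, and that the noise level is small relative to the separation. All function names and the example parameters are hypothetical.

```python
import random

def sample_batch(means, sigma, size, rng):
    """Noisy responses to one repeated query: a uniform mixture of Gaussians."""
    return [rng.gauss(rng.choice(means), sigma) for _ in range(size)]

def lloyd_two_means(samples, iterations=50):
    """One-dimensional Lloyd iterations with two centers."""
    low, high = min(samples), max(samples)
    for _ in range(iterations):
        threshold = (low + high) / 2
        left = [s for s in samples if s <= threshold]
        right = [s for s in samples if s > threshold]
        if not left or not right:
            break
        low, high = sum(left) / len(left), sum(right) / len(right)
    return low, high

def round_to_grid(value, eps):
    """Snap an estimate to the nearest integer multiple of eps."""
    return round(value / eps) * eps

rng = random.Random(1)
samples = sample_batch(means=(0.0, 2.0), sigma=0.5, size=4000, rng=rng)
estimates = tuple(round_to_grid(c, 0.5) for c in lloyd_two_means(samples))
```

Rounding to the grid is what turns approximate cluster centers into exact means; when the noise is large relative to the grid spacing, this heuristic carries no guarantee.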

#### Organization and Notation.

In Section 2, we present our results for the noiseless case. In Section 3.1 we consider the case with noise when $L = 2$, and then consider noise and general $L$ in Section 3.2. Most proofs are deferred to the appendix in the supplementary material. Throughout, we write to denote taking an element from a finite set uniformly at random. For , let .

## 2 Exact sparse vectors and noiseless samples

To begin, we deal with the case of uniform mixture of exact sparse vectors with the oracle returning noiseless answers when queried with a vector. For this case, our scheme is provided in Algorithm 1. The main result for this section is the following.

###### Theorem 1.

For a collection of vectors such that , one can recover all of them exactly with probability at least with a total of oracle queries. See Algorithm 1.

A Vandermonde matrix is a matrix such that the entries in each row of the matrix are in geometric progression, i.e., for an $m \times n$ Vandermonde matrix, the $(i,j)$-th entry is $\alpha_i^{j-1}$, where $\alpha_1, \dots, \alpha_m$ are distinct values. We will use the following useful property of Vandermonde matrices; see, e.g., [15, Section XIII.8] for the proof.

###### Lemma 1.

Any square submatrix of a Vandermonde matrix has full rank, assuming the values $\alpha_i$ that generate its rows are distinct and positive.

This implies that, with the samples from a Vandermonde matrix with $2k$ rows, a $k$-sparse vector can be exactly recovered. This is because, for any two distinct $k$-sparse vectors $\beta$ and $\beta'$, the same set of responses for all the rows of the Vandermonde matrix implies that a $2k \times 2k$ square submatrix of the Vandermonde matrix is not full rank, which contradicts Lemma 1.
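
The recovery claim above can be illustrated with a brute-force sketch in exact rational arithmetic. It is not meant as an efficient decoder: it simply tries every support of size $k$, solves the resulting square system, and keeps the candidate that agrees with all $2k$ responses. The function names, the generating values $1, 2, 3, 4$, and the example vector are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import combinations

def vandermonde_row(t, n):
    """Row (1, t, t^2, ..., t^(n-1)) of a Vandermonde matrix."""
    return [Fraction(t) ** j for j in range(n)]

def solve_square(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination; None if singular."""
    m = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(m):
        pivot = next((r for r in range(col, m) if M[r][col] != 0), None)
        if pivot is None:
            return None
        M[col], M[pivot] = M[pivot], M[col]
        pv = M[col][col]
        M[col] = [v / pv for v in M[col]]
        for r in range(m):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[r][m] for r in range(m)]

def recover_sparse(rows, responses, n, k):
    """Search for the k-sparse vector consistent with all 2k responses."""
    for support in combinations(range(n), k):
        A = [[row[j] for j in support] for row in rows[:k]]
        coeffs = solve_square(A, responses[:k])
        if coeffs is None:
            continue
        candidate = [Fraction(0)] * n
        for j, c in zip(support, coeffs):
            candidate[j] = c
        if all(sum(r[j] * candidate[j] for j in range(n)) == y
               for r, y in zip(rows, responses)):
            return candidate
    return None

n, k = 8, 2
beta = [0, 3, 0, 0, 0, -2, 0, 0]                      # hypothetical 2-sparse vector
rows = [vandermonde_row(t, n) for t in (1, 2, 3, 4)]  # 2k rows, distinct positive values
responses = [sum(r[j] * beta[j] for j in range(n)) for r in rows]
recovered = recover_sparse(rows, responses, n, k)
```

By the uniqueness argument above, no wrong support can pass the final consistency check, so `recovered` coincides with `beta`.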

We are now ready to prove Theorem 1.

###### Proof.

For the case of $L = 1$, note that the setting is the same as the well-known compressed sensing problem. Furthermore, if a matrix with $2k$ rows has the property that any $2k \times 2k$ submatrix is full rank, then using the rows of this matrix as queries is sufficient to recover any $k$-sparse vector. By Lemma 1, any Vandermonde matrix has the necessary property.

Let $\beta^1, \dots, \beta^L$ be the set of unknown $k$-sparse vectors. Notice that a particular row of the Vandermonde matrix looks like $(1, t, t^2, \dots, t^{n-1})$ for some value of $t$. Therefore, for some vector $\beta$ and a particular row of the Vandermonde matrix, the inner product of the two can be interpreted as a polynomial of degree at most $n-1$ evaluated at $t$ such that the coefficients of the polynomial form the vector $\beta$. More formally, the inner product can be written as $f_\beta(t) = \sum_{j} \beta_j t^{j-1}$, where $f_\beta$ is the polynomial corresponding to the vector $\beta$. For any value $t$, we can define an ordering over the polynomials $f_{\beta^1}, \dots, f_{\beta^L}$ such that $f_{\beta^i}$ precedes $f_{\beta^j}$ iff $f_{\beta^i}(t) < f_{\beta^j}(t)$.

For two distinct indices , we will call the polynomial a difference polynomial. Each difference polynomial has at most non-zero coefficients and therefore has at most positive roots by Descartes’ Rule of Signs [9]. Since there are at most distinct difference polynomials, the total number of distinct values that are roots of at least one difference polynomial is less than . Note that if an interval does not include any of these roots, then the ordering of remains consistent for any point in that interval. In particular, consider the intervals where . At most of these intervals include a root of a difference polynomial and hence if we pick a random interval then with probability at least , the ordering of are consistent throughout the interval. If the interval chosen is then set for .
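
Descartes' rule of signs bounds the number of positive roots of a polynomial by the number of sign changes in its sequence of non-zero coefficients, which is at most one less than the number of non-zero coefficients. The sketch below counts those sign changes for one difference polynomial; the helper `sign_changes` and the example coefficient vectors are hypothetical.

```python
def sign_changes(coeffs):
    """Number of sign changes among the non-zero coefficients (lowest degree first)."""
    signs = [c > 0 for c in coeffs if c != 0]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)

# Two 2-sparse coefficient vectors: f1(t) = 2t and f2(t) = 1 + t^3
f1 = [0, 2, 0, 0, 0, 0]
f2 = [1, 0, 0, 1, 0, 0]

# Difference polynomial f2 - f1 = 1 - 2t + t^3
difference = [a - b for a, b in zip(f2, f1)]
changes = sign_changes(difference)
```

Here the difference polynomial has three non-zero coefficients and two sign changes, so it has at most two positive roots; outside those roots the relative order of `f1` and `f2` cannot change.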

Now for each value of , define the vector . For each , the vector will be used as query to the oracle repeatedly for times. We will call the set of query responses from the oracle for a fixed query vector a batch. For a fixed batch and ,

$$\Pr\left(\beta^j \text{ is not sampled by the oracle in the batch}\right) \le \left(1 - \frac{1}{L}\right)^{L \log Lk^2} \le e^{-\log Lk^2} = \frac{1}{Lk^2}.$$

Taking a union bound over all the vectors ($L$ of them) and all the batches ($2k$ of them), we get that in every batch every vector $\beta^j$ is sampled with probability at least $1 - 2/k$. Now, for each batch, we retain the unique values (there should be exactly $L$ of them with high probability) and sort the values in each batch. Since the ordering of the polynomials remains the same, after sorting, all the values in a particular position in each batch correspond to the same vector $\beta^j$ for some unknown index $j$. We can aggregate the query responses of all the batches in each position and, since there are $2k$ linear measurements corresponding to the same vector, we can recover all the unknown vectors using Lemma 1. The failure probability of this algorithm is at most . ∎
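
The batching and sorting argument in the proof can be simulated in a few lines. The sketch below is illustrative only: the oracle, the three example vectors, and the query points are hypothetical, the query points are fixed in an interval where the ordering of the three polynomials is known to be consistent (rather than chosen at random as in the proof), and the batch size of 60 is deliberately larger than the proof requires so that the demonstration is robust.

```python
import random
from fractions import Fraction

def vandermonde_row(t, n):
    """Row (1, t, t^2, ..., t^(n-1)) of a Vandermonde matrix."""
    return [Fraction(t) ** j for j in range(n)]

def make_oracle(betas, seed=0):
    """Noiseless oracle: picks one unknown vector uniformly for every query."""
    rng = random.Random(seed)
    def oracle(x):
        beta = rng.choice(betas)
        return sum(xi * bi for xi, bi in zip(x, beta))
    return oracle

def aligned_responses(oracle, ts, n, L, batch_size):
    """Query each row repeatedly, keep the unique responses, and sort them."""
    batches = []
    for t in ts:
        row = vandermonde_row(t, n)
        values = sorted({oracle(row) for _ in range(batch_size)})
        if len(values) != L:
            raise RuntimeError("batch did not expose L distinct responses")
        batches.append(values)
    # Position p of every sorted batch belongs to the same unknown vector
    return [[batch[p] for batch in batches] for p in range(L)]

# Hypothetical 2-sparse vectors: 2t < 1 + t^3 < 3t^5 - t^2 for every t >= 2
betas = [[0, 2, 0, 0, 0, 0], [1, 0, 0, 1, 0, 0], [0, 0, -1, 0, 0, 3]]
ts = [Fraction(2), Fraction(9, 4), Fraction(5, 2), Fraction(11, 4)]  # 2k = 4 rows
aligned = aligned_responses(make_oracle(betas), ts, n=6, L=3, batch_size=60)
```

Each list in `aligned` now holds $2k$ noiseless measurements of a single unknown vector, which is exactly the input needed for sparse recovery via Lemma 1.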

The following theorem establishes that our method is almost optimal in terms of sample complexity.

###### Theorem 2.

At least oracle queries are necessary to recover an arbitrary set of vectors that are -sparse.

## 3 Noisy Samples and Sparse Approximation

We now consider the more general setting where the oracle is noisy and the vectors are not necessarily sparse. We assume $L$ is an arbitrary constant, i.e., it does not grow with $n$ or $k$, and that the unknown vectors have $\epsilon$ precision, i.e., each entry is an integer multiple of $\epsilon$. The noise will be Gaussian with zero mean and variance $\sigma^2$, i.e., $\eta \sim \mathcal{N}(0, \sigma^2)$. Our main result of this section is the following.

###### Theorem 3.

It is possible to recover approximations with the guarantee in Eq. (3) with probability at least of all the unknown vectors with oracle queries where .

Before we proceed with the ideas of the proof, it is useful to recall the restricted isometry property (RIP) of matrices in the context of the recovery guarantee (3). A matrix $\Phi$ satisfies the $(k, \delta)$-RIP if for any vector $z$ with $\|z\|_0 \le k$,

$$(1-\delta)\|z\|_2^2 \le \|\Phi z\|_2^2 \le (1+\delta)\|z\|_2^2. \qquad (5)$$

It is known that if a matrix is $(2k, \delta)$-RIP with $\delta < \sqrt{2} - 1$, then the guarantee (3) (in particular, an $\ell_1/\ell_1$ guarantee and also an $\ell_2/\ell_1$ guarantee) is possible [6] with the basis pursuit algorithm, an efficient algorithm based on linear programming. It is also known that a random binary matrix (with normalized columns) satisfies the property with $c\,k \log n$ rows, where $c$ is an absolute constant [3].

There are several key ideas in the proof. Since the case of $L = 2$ is simpler to handle, we start with that and then provide the extra steps necessary for the general case.

### 3.1 Gaussian Noise: Two vectors

Algorithm 2 addresses the setting with only two unknown vectors. We will assume that both unknown vectors have unit norm, so that the SNR takes a simple form. This assumption is not necessary, but we make it for ease of presentation. The assumption of $\epsilon$-precision for the unknown vectors was made in Yin et al. [27], and we stick to the same assumption. On the other hand, Yin et al. require further assumptions that we do not need to make. Furthermore, the result of Yin et al. is restricted to exactly sparse vectors, whereas our result holds for general sparse approximation.

For the two-vector case, the result we aim to show is the following.

###### Theorem 4.

Algorithm 2 uses queries to recover both the vectors and with an guarantee in Eq. (3) with probability at least .

This result is directly comparable with [27]. On the statistical side, we improve their result in several ways: (1) we improve the dependence on in the sample complexity from to (note that [27] treat as constant in their theorem statement, but the dependence can be extracted from their proof), (2) our result applies to dense vectors, recovering the best $k$-sparse approximations, and (3) we do not need the overlap assumption (Eq. (2)) used in their work.

Once we show , Theorem 4 trivially implies Theorem 3 in the case . Indeed, from Algorithm 2, notice that we have used vectors sampled uniformly at random from and use them as query vectors. We must have for . Further, we have used the sum and difference query vectors which have the form and respectively where are sampled uniformly and independently from . Therefore, we must have for , According to our definition of , we have that .

A description of Algorithm 2 that leads to the proof of Theorem 4 can be found in Appendix B. We provide a short sketch here and state an important lemma that we will use in the more general case.

The main insight is that for a fixed sensing vector $x$, if we repeatedly query with $x$, we obtain samples from a mixture of Gaussians $\frac{1}{2}\mathcal{N}(\langle x, \beta^1\rangle, \sigma^2) + \frac{1}{2}\mathcal{N}(\langle x, \beta^2\rangle, \sigma^2)$. If we can exactly recover the means of these Gaussians, we essentially reduce to the noiseless case from the previous section. The first key step upper bounds the sample complexity for exactly learning the parameters of a mixture of Gaussians.

###### Lemma 2 (Learning Gaussian mixtures).

Let be a uniform mixture of univariate Gaussians, with known shared variance and with means . Then, for some constant and some , there exists an algorithm that requires samples from and exactly identifies the parameters with probability at least .

If we sense with then , so appealing to the above lemma, we can proceed assuming we know these two values exactly. Unfortunately, the sensing vectors here are more restricted — we must maintain bounded SNR and our technique of mixture learning requires that the means have finite precision — so we cannot simply appeal to our noiseless results for the alignment step. Instead we design a new alignment strategy, inspired by error correcting codes. Given two query vectors and the exact means , , we must identify which values correspond to and . In addition to sensing with any pair and we sense with , and we use these two additional measurements to identify which recovered means correspond to and which correspond to . Intuitively, we can check if our alignment is correct via these reference measurements.

Therefore, we can obtain aligned, denoised inner products with each of the two parameter vectors. At this point we can apply a standard compressed sensing result as mentioned at the start of this section to obtain the sparse approximations of vectors.

### 3.2 General value of L

In this setting, we have $L$ unknown vectors, each of unit norm, from which the oracle samples with equal probability. We assume that $L$ does not grow with $n$ or $k$ and, as before, that all the elements in the unknown vectors lie on an $\epsilon$-grid. Here, we build on the ideas for the special case of $L = 2$.

The main result of this section is the following.

###### Theorem 5.

Algorithm 3 uses queries with to recover all the vectors with guarantees in Eq. (3) with probability at least .

Theorem 3 follows as a corollary of this result.

The analysis of Algorithm 3 and the proofs of Theorems  3 and  5 are provided in detail in Appendix D. Below we sketch some of the main points of the proof.

There are two main hurdles in extending the steps explained for . For a query vector , we define the denoised query means to be the set of elements . Recall that a query vector is defined to be good if all the elements in the set of denoised query means are distinct. For , the probability of a query vector being good for is at least but for a value of larger than , it is not possible to obtain such guarantees without further assumptions. For a more concrete example, consider and the unknown vectors to be such that has 1 in the position and zero everywhere else. If is sampled from as before, then can take values only in and therefore it is not possible that all the values are distinct. Secondly, even if we have a good query vector, it is no longer trivial to extend the clustering or alignment step. Hence a number of new ideas are necessary to solve the problem for any general value of .

We need to define a few constants which are used in the algorithm. Let $\delta$ be a constant (we need a $\delta$ that allows $k$-sparse approximation given a $(2k, \delta)$-RIP matrix). Let $c'$ be a large positive constant such that

$$\frac{\delta^2}{16} - \frac{\delta^3}{48} - \frac{1}{c'} > 0. \qquad (A)$$

Secondly, let be another positive constant that satisfies the following for a given value of ,

 α⋆=max{α:αα(α−1)α−1

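Finally, for given values of $L$ and $\alpha^\star$, let $z^\star$ be the smallest integer that satisfies the following:

$$z^\star = \min\left\{ z \in \mathbb{Z} : 1 - L^3\left(\frac{3}{4z+1} - \frac{1}{4z^2+1}\right) \ge \frac{1}{\sqrt{\alpha^\star}} \right\}. \qquad (C)$$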

#### The Denoising Step.

In each step of the algorithm, we sample a vector uniformly at random from , another vector uniformly at random from and a number uniformly at random from . Now, we will use a batch of queries corresponding to the vectors and . We define a triplet of query vectors to be good if for all triplets of indices such that are not identical,

$$\langle v_1, \beta^i \rangle + \langle v_2, \beta^j \rangle \neq \langle v_3, \beta^k \rangle.$$

We show that the query vector triplet is good with at least some probability. This implies if we choose triplets of such query vectors, then at least one of the triplets are good with probability It turns out that, for a good triplet of vectors , we can obtain for all .
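
The goodness condition can be checked directly once the inner products are available. The following sketch is only illustrative: the function name `is_good_triplet` and the example vectors are hypothetical, the entries are taken to be integers (as on a rescaled grid) so that exact comparison is meaningful, and "not identical" is read here as excluding only the index triplets with $i = j = k$.

```python
from itertools import product

def dot(u, v):
    """Inner product of two equal-length integer vectors."""
    return sum(a * b for a, b in zip(u, v))

def is_good_triplet(v1, v2, v3, betas):
    """True if <v1,b_i> + <v2,b_j> != <v3,b_k> whenever i, j, k are not all equal."""
    L = len(betas)
    for i, j, k in product(range(L), repeat=3):
        if i == j == k:
            continue
        if dot(v1, betas[i]) + dot(v2, betas[j]) == dot(v3, betas[k]):
            return False
    return True

# A good triplet and a bad one (third vector is the sum of the first two)
good = is_good_triplet([1, 0], [0, 1], [1, 1], [[1, 0], [0, 2]])
bad = is_good_triplet([1, 1], [1, 1], [2, 2], [[1, 0], [0, 1]])
```

In the second example every inner product with the first two query vectors equals 1, so mixed index triplets collide with the responses to the third vector and the triplet is rejected.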

Furthermore, it follows from Lemma 2 that for a query vector with integral entries, a batch size of , for some constant , is sufficient to recover the denoised query responses for all the queries with probability at least .

#### The Alignment Step.

Let a particular good query vector triplet be . From now, we will consider the elements to be labels and for a vector , we will associate a label with every element in . The labelling is correct if, for all , the element labelled as also corresponds to the same unknown vector . Notice that we can label the elements correctly because the triplet is good. Consider another good query vector triplet . This matches with the earlier query triplet if additionally, the vector triplet is also good.

Such a matching pair of good triplets exists and can be found by random choice with some probability. We show that the matching good triplets allow us to do the alignment in the case of general $L$.

At this point we would again like to appeal to the standard compressed sensing results. However we need to show that the matching good vectors themselves form a matrix that has the required RIP property. As our final step, we establish this fact.

###### Remark 3 (Refinement and adaptive queries).

It is possible to have a sample complexity of in Theorem 3, but with a probability of Also it is possible to shave-off another factor from sample complexity if we can make the queries adaptive.

Acknowledgements: This research is supported in part by NSF Grants CCF 1642658, 1618512, 1909046, 1908849 and 1934846.

## References

• [1] S. Arora and R. Kannan. Learning mixtures of arbitrary gaussians. In Symposium on Theory of Computing, 2001.
• [2] R. Baraniuk, M. Davenport, R. DeVore, and M. Wakin. The johnson-lindenstrauss lemma meets compressed sensing. preprint, 100(1):0, 2006.
• [3] R. Baraniuk, M. Davenport, R. DeVore, and M. Wakin. A simple proof of the restricted isometry property for random matrices. Constructive Approximation, 28(3):253–263, 2008.
• [4] P. Borwein and T. Erdélyi. Littlewood-type problems on subarcs of the unit circle. Indiana University Mathematics Journal, 1997.
• [5] S. Boucheron, G. Lugosi, and P. Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford university press, 2013.
• [6] E. J. Candes. The restricted isometry property and its implications for compressed sensing. Comptes rendus mathematique, 346(9-10):589–592, 2008.
• [7] E. J. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52(2):489–509, 2006.
• [8] A. T. Chaganty and P. Liang. Spectral experts for estimating mixtures of linear regressions. In International Conference on Machine Learning, pages 1040–1048, 2013.
• [9] D. Curtiss. Recent extentions of descartes’ rule of signs. Annals of Mathematics, pages 251–278, 1918.
• [10] S. Dasgupta. Learning mixtures of gaussians. In Foundations of Computer Science, pages 634–644, 1999.
• [11] R. D. De Veaux. Mixtures of linear regressions. Computational Statistics & Data Analysis, 8(3):227–245, 1989.
• [12] L. Devroye and G. Lugosi. Combinatorial methods in density estimation. Springer Science & Business Media, 2012.
• [13] D. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.
• [14] S. Faria and G. Soromenho. Fitting mixtures of linear regressions. Journal of Statistical Computation and Simulation, 80(2):201–225, 2010.
• [15] F. R. Gantmakher. The Theory of Matrices, volume 2.
• [16] M. Hardt and E. Price. Tight bounds for learning a mixture of two gaussians. In Symposium on Theory of Computing, 2015.
• [17] A. Kalai, A. Moitra, and G. Valiant. Disentangling Gaussians. Communications of the ACM, 55(2):113–120, 2012.
• [18] A. Krishnamurthy, A. Mazumdar, A. McGregor, and S. Pal. Trace reconstruction: Generalized and parameterized. In 27th Annual European Symposium on Algorithms, ESA 2019, September 9-11, 2019, Munich/Garching, Germany., pages 68:1–68:25, 2019.
• [19] J. Kwon and C. Caramanis. Global convergence of em algorithm for mixtures of two component linear regression. arXiv preprint arXiv:1810.05752, 2018.
• [20] A. Moitra and G. Valiant. Settling the polynomial learnability of mixtures of gaussians. In Foundations of Computer Science, 2010.
• [21] F. Nazarov and Y. Peres. Trace reconstruction with $\exp(O(n^{1/3}))$ samples. In Symposium on Theory of Computing, 2017.
• [22] N. Städler, P. Bühlmann, and S. Van De Geer. l1-penalization for mixture regression models. Test, 19(2):209–256, 2010.
• [23] D. M. Titterington, A. F. Smith, and U. E. Makov. Statistical analysis of finite mixture distributions. Wiley, 1985.
• [24] K. Viele and B. Tong. Modeling with mixtures of linear regressions. Statistics and Computing, 12(4):315–330, 2002.
• [25] X. Yi, C. Caramanis, and S. Sanghavi. Alternating minimization for mixed linear regression. In International Conference on Machine Learning, pages 613–621, 2014.
• [26] X. Yi, C. Caramanis, and S. Sanghavi. Solving a mixture of many random linear equations by tensor decomposition and alternating minimization. arXiv preprint arXiv:1608.05749, 2016.
• [27] D. Yin, R. Pedarsani, Y. Chen, and K. Ramchandran. Learning mixtures of sparse linear regressions using sparse graph codes. IEEE Transactions on Information Theory, 65(3):1430–1451, 2019.

## Appendix A Proof of Theorem 2

It is known that for any particular vector $\beta$, at least $2k$ queries to the oracle are necessary in order to recover the vector exactly. Suppose the random variable $X$ denotes the number of queries until the oracle has sampled the vector $\beta$ at least $2k$ times. Notice that $X$ can be written as a sum of $2k$ independent and identically distributed random variables $X_1, \dots, X_{2k}$, each distributed according to the geometric distribution with parameter $1/L$, where $X_i$ denotes the number of attempts required to obtain the $i$-th sample after the $(i-1)$-th sample has been made by the oracle. Since $X$ is a sum of independent random variables, we must have

$$\mathbb{E}X = 2Lk \quad \text{and} \quad \mathrm{Var}(X) = 2k(L^2 - L).$$

Therefore, by using Chebyshev's inequality [5], we must have

$$\Pr\left(X \le 2Lk - k^{1/4}\sqrt{2k(L^2 - L)}\right) \le \frac{1}{\sqrt{k}},$$

and therefore $X > 2Lk - k^{1/4}\sqrt{2k(L^2 - L)}$ with high probability, which proves the statement of the theorem.
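
For completeness, the moments and the tail bound used above can be derived in a few lines. The derivation assumes, as the stated mean and variance indicate, that $X$ is a sum of $2k$ independent geometric random variables with success probability $p = 1/L$; the symbols $p$, $X_i$ and $a$ are notation introduced here.

```latex
\begin{align*}
\mathbb{E}X_i &= \frac{1}{p} = L,
\qquad
\mathrm{Var}(X_i) = \frac{1-p}{p^2} = L^2 - L, \\
\mathbb{E}X &= \sum_{i=1}^{2k} \mathbb{E}X_i = 2Lk,
\qquad
\mathrm{Var}(X) = \sum_{i=1}^{2k} \mathrm{Var}(X_i) = 2k(L^2 - L), \\
\Pr\left(|X - \mathbb{E}X| \ge a\sqrt{\mathrm{Var}(X)}\right) &\le \frac{1}{a^2}
\quad\Longrightarrow\quad
\Pr\left(X \le 2Lk - k^{1/4}\sqrt{2k(L^2 - L)}\right) \le \frac{1}{\sqrt{k}}
\quad \text{for } a = k^{1/4}.
\end{align*}
```

Since $k^{1/4}\sqrt{2k(L^2 - L)} \le \sqrt{2}\, L\, k^{3/4}$, the deviation term is of lower order than $2Lk$ as $k$ grows.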

## Appendix B Description of Algorithm 2 and Proof of Theorem 4

Algorithm 2 (Design of queries and denoising): Let be the total number of queries that we will make. In the first step of the algorithm, for a particular query vector , our objective is to recover and which we will denote as the denoised query responses corresponding to the vector . It is intuitive, that in order to do this, we need to use the same query vector repeatedly a number of times and aggregate the noisy query responses to recover the denoised counterparts.

Therefore, at every iteration in Step 1 of Algorithm 2, we sample a vector uniformly at random from . Once the vector is sampled, we use as query vector repeatedly for times. We will say that the query responses to the same vector as query to be a batch of size . It can be seen that since is fixed, the query responses in a batch is sampled from a Gaussian mixture distribution with means and and variance , in short,

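$$\mathcal{M} = \frac{1}{2}\mathcal{N}(\langle v, \beta^1 \rangle, \sigma^2) + \frac{1}{2}\mathcal{N}(\langle v, \beta^2 \rangle, \sigma^2).$$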

Therefore the problem reduces to recovering the mean parameters from a mixture of Gaussian distributions with at most two mixture constituents (since the means can be the same) and having the same variance. We will use the following important lemma for this problem.

###### Lemma (Lemma 2: Learning Gaussian mixtures).

Let be a uniform mixture of univariate Gaussians, with known shared variance and with means . Then, for some constant and some , there exists an algorithm that requires samples from and exactly identifies the parameters with probability at least .

The proof of this lemma can be found in Appendix C. We now have the following lemma to characterize the size of each batch .

###### Lemma 4.

For any query vector , a batchsize of , for a constant , is sufficient to recover the denoised query responses and with probability at least .

###### Proof.

Since , Using Lemma 2, the claim follows. ∎

###### Corollary 5.

For any query vectors sampled uniformly at random from , a batch size of , for some constant , is sufficient to recover the denoised query responses corresponding to every query vector with probability at least .

###### Proof.

This statement is proved by taking a union bound over batches corresponding to that many query vectors. ∎

Algorithm 2 (Alignment step): Notice from the previous discussion, for each batch corresponding to a query vector , we obtain the pair of values . However, we still need to cluster these values (by taking one value from each pair and assigning it to one of the clusters) into two clusters corresponding to and . We will first explain the clustering process for two particular query vectors and for which we have already obtained the pairs and . The objective is to cluster the four samples into two groups of two samples each so that the samples in each cluster correspond to the same unknown sensed vector. Now, we have two cases to consider:
Case 1: In this scenario, the values in at least one of the pairs are same and any grouping works.

Case 2: . We use two more batches corresponding to the vectors and which belong to . We will call the vector the sum query and the vector the difference query corresponding to respectively. Hence using Lemma 4 again, we will be able to obtain the pairs and . Now, we will choose two elements from the pairs and (one element from each pair) such that their sum belongs to the pair and their difference belongs to the pair . In our algorithm, we will put these two elements into one cluster and the other two elements into the other cluster. From construction, we must put in one cluster and in other.
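
The Case 2 alignment can be sketched as follows, assuming the denoised values are available exactly. The sketch takes the sum and difference queries to be the plain vectors $v_1 + v_2$ and $v_1 - v_2$; any rescaling the algorithm may apply to them is not modeled. The function name `align_pairs` and all example vectors are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length integer vectors."""
    return sum(a * b for a, b in zip(u, v))

def align_pairs(pair1, pair2, sum_pair, diff_pair):
    """Split four denoised values into two clusters, one per unknown vector.

    Assumes each of pair1 and pair2 holds two distinct values (Case 2).
    """
    for a in pair1:
        for b in pair2:
            # Accept (a, b) only if both reference measurements agree
            if (a + b) in sum_pair and (a - b) in diff_pair:
                other_a = pair1[1] if a == pair1[0] else pair1[0]
                other_b = pair2[1] if b == pair2[0] else pair2[0]
                return (a, b), (other_a, other_b)
    return None

beta1, beta2 = [1, 0, 2, 0], [0, -1, 0, 3]
v1, v2 = [1, 1, -1, 1], [1, -1, 1, 1]
v_sum = [a + b for a, b in zip(v1, v2)]
v_diff = [a - b for a, b in zip(v1, v2)]

# Denoised pairs arrive in arbitrary order within each pair
pair1 = (dot(v1, beta2), dot(v1, beta1))
pair2 = (dot(v2, beta1), dot(v2, beta2))
sum_pair = (dot(v_sum, beta1), dot(v_sum, beta2))
diff_pair = (dot(v_diff, beta1), dot(v_diff, beta2))
clusters = align_pairs(pair1, pair2, sum_pair, diff_pair)
```

In this example the two clusters are the values generated by `beta2` and by `beta1` respectively; which cluster is labeled first is arbitrary, and only the grouping matters for the subsequent recovery step.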

Putting it all together, in Algorithm 2, we uniformly and randomly choose query vectors from and for each of them, we use it repeatedly for