 # Recovery of Sparse Signals from a Mixture of Linear Samples

Mixture of linear regressions is a popular learning theoretic model that is used widely to represent heterogeneous data. In the simplest form, this model assumes that the labels are generated from either of two different linear models and mixed together. Recent works of Yin et al. and Krishnamurthy et al., 2019, focus on an experimental design setting of model recovery for this problem. It is assumed that the features can be designed and queried to obtain their labels. When queried, an oracle randomly selects one of the two different sparse linear models and generates a label accordingly. How many such oracle queries are needed to recover both of the models simultaneously? This question can also be thought of as a generalization of the well-known compressed sensing problem (Candès and Tao, 2005; Donoho, 2006). In this work, we address this query complexity problem and provide efficient algorithms that improve on the previously best known results.


## 1 Introduction

Suppose there are two unknown distinct vectors $\beta^1, \beta^2 \in \mathbb{R}^n$ that we want to recover. We can measure these vectors by taking linear samples; however, the linear samples come without the identifier of the vectors. To make this statement rigorous, assume the presence of an oracle which, when queried with a vector $x \in \mathbb{R}^n$, returns the noisy output $y \in \mathbb{R}$:

$$y = \langle x, \beta \rangle + \zeta \qquad (1)$$

where $\beta$ is chosen uniformly from $\{\beta^1, \beta^2\}$ and $\zeta$ is additive Gaussian noise with zero mean and known variance $\sigma^2$. We will refer to the values returned by the oracle given these queries as samples.
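The oracle of Eq. (1) can be simulated in a few lines. The following is a minimal sketch, not code from the paper: the function name `make_oracle`, the `seed` argument, and the example vectors are illustrative assumptions.

```python
import random

def make_oracle(beta1, beta2, sigma, seed=0):
    """Return a function that answers queries as in Eq. (1) (hypothetical helper)."""
    rng = random.Random(seed)

    def oracle(x):
        # Pick one of the two unknown vectors uniformly at random.
        beta = beta1 if rng.random() < 0.5 else beta2
        # Noiseless inner product <x, beta>.
        signal = sum(xi * bi for xi, bi in zip(x, beta))
        # Add zero-mean Gaussian noise with standard deviation sigma.
        return signal + rng.gauss(0.0, sigma)

    return oracle

# Example vectors chosen only for illustration.
oracle = make_oracle([1.0, 0.0, 2.0], [0.0, -1.0, 0.0], sigma=0.1)
samples = [oracle([1.0, 1.0, 1.0]) for _ in range(5)]
```

Note that the caller never learns which vector produced a given sample; this missing label is what separates the problem from ordinary compressed sensing.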

For a $\beta \in \mathbb{R}^n$, the best $k$-sparse approximation $\beta_{(k)}$ is defined to be the vector obtained from $\beta$ where all except the $k$ largest (by absolute value) coordinates are set to $0$. For each $\beta \in \{\beta^1, \beta^2\}$, our objective in this setting is to return a sparse approximation $\hat{\beta}$ of $\beta$ using the minimum number of queries such that

$$\|\hat{\beta} - \beta\| \le c\,\|\beta - \beta_{(k)}\| + \gamma$$

where $c$ is an absolute constant, $\gamma$ is a user-defined nonnegative parameter representing the precision up to which we want to recover the unknown vectors, and the norms are arbitrary. For any algorithm that performs this task, the total number of samples acquired from the oracle is referred to as the query complexity.
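The best $k$-sparse approximation is easy to compute directly. Here is a small sketch; the function name `best_k_sparse` and the example vector are illustrative choices, not from the paper.

```python
def best_k_sparse(beta, k):
    """Keep the k largest-magnitude coordinates of beta and zero out the rest."""
    # Indices sorted by decreasing absolute value (ties broken by position).
    order = sorted(range(len(beta)), key=lambda j: -abs(beta[j]))
    keep = set(order[:k])
    return [b if j in keep else 0.0 for j, b in enumerate(beta)]

beta = [3.0, -5.0, 1.0, 0.5]
approx = best_k_sparse(beta, 2)  # [3.0, -5.0, 0.0, 0.0]
# l1 norm of the tail beta - beta_(k), the quantity on the right-hand side above.
tail_l1 = sum(abs(b - a) for b, a in zip(beta, approx))  # 1.5
```

When the vector is exactly $k$-sparse the tail is zero, and the guarantee reduces to recovery up to the precision term alone.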

If we had one, instead of two, unknown vectors, then the problem would be exactly that of compressed sensing Candès et al. (2006); Donoho (2006). However, having two vectors makes this problem significantly different and challenging. Further, if we allow , then we can treat all the samples as coming from the same vector and output only a single vector as an approximation to both vectors. So in practice, obtaining is more interesting.

On another technical note, under this setting it is always possible to make the noise negligible by increasing the norm of the query $x$. To make the problem well-posed, let us define the Signal-to-Noise Ratio (SNR) for a query $x$:

where the expectation is over the randomness of the query. Furthermore, define the overall SNR to be $\mathsf{SNR} := \max_{x} \mathsf{SNR}(x)$, where the maximization is over all the queries used in the recovery process.

### 1.1 Most Relevant Works

Previous works that are most relevant to our problem are by Yin et al. (2019) and Krishnamurthy et al. (2019). Both of these papers address the exact same problem as above, but provide results under some restrictive conditions on the unknown vectors. For example, the results of Yin et al. (2019) are valid only when

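• the unknown vectors are exactly $k$-sparse, i.e., each has at most $k$ nonzero entries;

• it must hold that

$$\beta^1_j \neq \beta^2_j \quad \text{for each } j \in \operatorname{supp}\beta^1 \cap \operatorname{supp}\beta^2,$$

where $\beta_j$ denotes the $j$-th coordinate of $\beta$, and $\operatorname{supp}\beta$ denotes the set of nonzero coordinates of $\beta$;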

• for some , .

All of these assumptions, especially the latter two, are severely restrictive. While the results of Krishnamurthy et al. (2019) are valid without the first two assumptions, they fail to get rid of the third, an assumption of the unknown vectors always taking discrete values. This is particularly unfavorable, because the resultant query/sample complexities (and hence the time complexity) in both the above papers have an exponential dependence on .

### 1.2 Our Main Result

In contrast to these earlier results, we provide a generic sample complexity result that does not require any of the assumptions used by the predecessor works. Our main result is the following.

###### Theorem 1.

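[Main Result] Let $\mathsf{NF} := \gamma/\sigma$ (the noise factor), where $\gamma$ is a parameter representing the desired recovery precision and $\sigma$ is the standard deviation of $\zeta$ in Eq. (1).

Case 1. For any , there exists an algorithm that makes

$$O\left(k \log n \log k \left\lceil \frac{\log k}{\log(\sqrt{\mathsf{SNR}}/\mathsf{NF})} \right\rceil \left\lceil \frac{1}{\mathsf{NF}^4 \sqrt{\mathsf{SNR}}} + \frac{1}{\mathsf{NF}^2} \right\rceil\right)$$

queries to recover $\hat{\beta}^1, \hat{\beta}^2$, estimates of $\beta^1, \beta^2$, with high probability such that, for $i = 1, 2$,

$$\|\hat{\beta}^i - \beta^{\pi(i)}\|_2 \le c\,\frac{\|\beta^i - \beta^i_{(k)}\|_1}{\sqrt{k}} + O(\gamma)$$

where $\pi$ is some permutation of $\{1, 2\}$ and $c$ is a universal constant.

Case 2. For any , there exists an algorithm that makes queries to recover $\hat{\beta}$, an estimate of both $\beta^1, \beta^2$, with high probability such that, for both $i = 1, 2$,

$$\|\hat{\beta} - \beta^i\|_2 \le c\,\frac{\|\beta^i - \beta^i_{(k)}\|_1}{\sqrt{k}} + O(\gamma)$$

where $c$ is a universal constant.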

For a the first case of the Theorem holds but using the second case may give better result in that regime of precision. The second case of the theorem shows that if we allow a rather large precision error, then the number of queries is similar to the required number for recovering a single vector. This is expected, because in this case we can find just one line approximating both regressions.

The recovery guarantee that we are providing (an $\ell_2$-$\ell_1$ guarantee) is in line with the standard guarantees of the compressed sensing literature. In this paper, we are interested in the regime as in compressed sensing. Note that our number of required samples scales linearly with $k$ and has only poly-logarithmic scaling with $n$, and polynomial scaling with the noise $\sigma$. In the previous works Yin et al. (2019); Krishnamurthy et al. (2019), the complexities scaled exponentially with noise.

Furthermore, the query complexity of our algorithm decreases with the Euclidean distance between the vectors (or the ‘gap’), which makes sense intuitively. Consider the case when we want a precise recovery ($\gamma$ very small). It turns out that when the gap is large, the query complexity varies as and when the gap is small the query complexity scales as .
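To get a feel for how the Case 1 bound reacts to the noise factor, one can evaluate the expression inside the $O(\cdot)$ numerically. This is only a sketch: the hidden constant is set to 1, the function name `case1_query_bound` is hypothetical, and the guard on $\sqrt{\mathsf{SNR}}/\mathsf{NF} > 1$ is an assumption added so that the logarithm in the denominator is positive.

```python
import math

def case1_query_bound(n, k, snr, nf):
    """Evaluate the Case 1 expression of Theorem 1 with the hidden constant set to 1."""
    ratio = math.sqrt(snr) / nf
    if ratio <= 1.0 or k < 2:
        # Assumption of this sketch, not a condition stated in the theorem.
        raise ValueError("sketch assumes sqrt(SNR)/NF > 1 and k >= 2")
    # First ceiling: log k / log(sqrt(SNR)/NF).
    rounds = math.ceil(math.log(k) / math.log(ratio))
    # Second ceiling: 1/(NF^4 sqrt(SNR)) + 1/NF^2.
    batch = math.ceil(1.0 / (nf ** 4 * math.sqrt(snr)) + 1.0 / nf ** 2)
    return k * math.log(n) * math.log(k) * rounds * batch

coarse = case1_query_bound(n=1000, k=10, snr=100.0, nf=1.0)
fine = case1_query_bound(n=1000, k=10, snr=100.0, nf=0.5)
```

Halving the noise factor (a tighter precision or a noisier oracle) increases the bound, while the dependence on the dimension stays logarithmic.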

###### Remark 1 (The zero noise case).

When $\sigma = 0$, i.e., the samples are not noisy, the problem is still nontrivial, and is not covered by the statement of Theorem 1. However, this case is strictly simpler to handle as it involves only the alignment step (as will be discussed later), and not the mixture learning step. Recovery with is possible with only queries (see Appendix F for a more detailed discussion on the noiseless setting).

### 1.3 Other Relevant Works

The problem we address can be seen as the active learning version of learning mixtures of linear regressions. Mixture of linear regressions is a natural synthesis of mixture models and linear regression; it generalizes the basic linear regression problem of learning the best linear relationship between the labels and the feature vectors. In this generalization, each label is stochastically generated by picking a linear relation uniformly from a set of two or more linear functions, evaluating this function on the features and possibly adding noise; the goal is to learn the set of unknown linear functions. The problem has been studied for at least the past three decades, starting with De Veaux (1989), with a recent surge of interest Chaganty & Liang (2013); Faria & Soromenho (2010); Städler et al. (2010); Kwon & Caramanis (2018); Viele & Tong (2002); Yi et al. (2014, 2016). In this literature, a variety of algorithmic techniques to obtain polynomial sample complexity were proposed. To the best of our knowledge, Städler et al. (2010) were the first to impose sparsity on the solutions, where each linear function depends on only a small number of variables. However, many of the earlier papers on mixtures of linear regression essentially consider the features to be fixed, i.e., part of the input, whereas recent works focus on the query-based model in the sparse setting, where features can be designed as queries Yin et al. (2019); Krishnamurthy et al. (2019). The problem has numerous applications in modelling heterogeneous data arising in medical applications, behavioral health, and music perception Yin et al. (2019).

This problem is a generalization of the compressed sensing problem Candès et al. (2006); Donoho (2006). As a building block to our solution, we use results from exact parameter learning for Gaussian mixtures. Both compressed sensing and learning mixtures of distributions Dasgupta (1999); Titterington et al. (1985) are immensely popular topics across statistics, signal processing and machine learning with a large body of prior work. We refer to an excellent survey by Boche et al. (2015) for compressed sensing results (in particular the results of Candes et al. (2008) and Baraniuk et al. (2008) are useful). For parameter learning in mixture models, we find the results of Daskalakis et al. (2017); Daskalakis & Kamath (2014); Hardt & Price (2015); Xu et al. (2016); Balakrishnan et al. (2017); Krishnamurthy et al. (2020) to be directly relevant.

### 1.4 Technical Contributions

If the responses to the queries were to contain tags of the models they are coming from, then we could use rows of any standard compressed sensing matrix as queries and just segregate the responses using the tags. Then, by running a compressed sensing recovery on the groups with the same tags, we would be done. In what follows, we try to infer this ‘tag’ information by making redundant queries.

If we repeat just the same query multiple times, the noisy responses are going to come from a mixture of Gaussians, with the actual responses being the component means. To learn the actual responses we rely on methods for parameter learning in Gaussian mixtures. It turns out that for different parameter regimes, different methods are best suited for our purpose, and it is not known in advance what regime we would be in. The method of moments is a well-known procedure for parameter learning in Gaussian mixtures and rigorous theoretical guarantees on sample complexity exist Hardt & Price (2015). However, we are in a specialized regime of scalar uniform mixtures with known variance, and we leverage this information to get a better sample complexity guarantee for exact parameter learning (Theorem 3). In particular we show that, in this case, the mean and variance of the mixture are sufficient statistics to recover the unknown means, as opposed to the first six moments of the general case Hardt & Price (2015). While recovery using the other methods (Algorithms 1 and 4) is a straightforward adaptation of known literature, we show that only a small set of samples is needed to determine what method to use.

It turns out that the method of moments still needs significantly more samples than the other methods. However, we can avoid the method of moments and use a less intensive method (such as EM, Algorithm 1), provided we are in a regime where the gap between the component means is high. The mere fact that $\beta^1$ and $\beta^2$ are far apart in Euclidean distance does not guarantee that. However, if we choose the queries to be Gaussian, then the gap is indeed high with a certain probability. If the queries were generated by any other distribution, such a fact would require strong anti-concentration inequalities that in general do not hold. Therefore, we cannot work with an arbitrary standard compressed sensing matrix, but have to choose Gaussian matrices (which are incidentally also good standard compressed sensing matrices).
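The role of Gaussian queries can be made concrete with a short calculation. This is a sketch using only standard facts about the normal distribution; the threshold $t > 0$ is an illustrative parameter and not notation from the paper. For a query $x$ with i.i.d. standard normal coordinates, the gap between the two component means is itself Gaussian, and its density is bounded by $1/\sqrt{2\pi}$ after normalization:

```latex
\langle x, \beta^1 \rangle - \langle x, \beta^2 \rangle
  = \langle x, \beta^1 - \beta^2 \rangle
  \sim \mathcal{N}\!\left(0, \|\beta^1 - \beta^2\|_2^2\right),
\qquad
\Pr\!\left[\, |\langle x, \beta^1 - \beta^2 \rangle| \le t\,\|\beta^1 - \beta^2\|_2 \right]
  = \Pr_{Z \sim \mathcal{N}(0,1)}\!\left[\, |Z| \le t \right]
  \le t\sqrt{2/\pi}.
```

So the gap seen by a Gaussian query is small compared to $\|\beta^1 - \beta^2\|_2$ only with probability proportional to $t$, which is the kind of anti-concentration that an arbitrary query distribution need not provide.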

The main technical challenge comes in the next step, alignment. For any two queries even if we know and , we do not know how to club and together as their order could be different. And this is an issue with all pairs of queries, which leaves us with an exponential number of possibilities to choose from. We form a simple error-correcting code to tackle this problem.

For two queries, we deal with this issue by designing two additional queries and . Now even if we mis-align, we can cross-verify with the samples from the ‘sum’ query and the ‘difference’ query, and at least one of these will show inconsistency. We subsequently extend this idea to align all the samples. Once the samples are all aligned, we can use any recovery algorithm for compressed sensing to deduce the sparse vectors.

The rest of this paper is organized as follows. We give an overview of our algorithm in Sec. 2.1; the actual algorithm is presented in Algorithm 8, which calls several subroutines. The process of denoising by Gaussian mixture learning is described in Sec. 2.2. The alignment problem is discussed in Sec. 2.3 and the proof of Theorem 1 is wrapped up in Sec. 2.4. Most proofs are deferred to the appendix in the supplementary material. Some ‘proof of concept’ simulation results are also in the appendix.

## 2 Main Results

### 2.1 Overview of Our Algorithm

Our scheme to recover the unknown vectors is described below. We will carefully choose the numbers so that the overall query complexity meets the promise of Theorem 1.


• We pick query vectors independently, each according to $\mathcal{N}(\mathbf{0}, I_n)$, where $\mathbf{0}$ is the $n$-dimensional all-zero vector and $I_n$ is the $n \times n$ identity matrix.

• (Mixture) We repeatedly query the oracle with for times for all in order to offset the noise. The samples obtained from the repeated querying of is referred to as a batch corresponding to . is referred to as the batchsize of . Our objective is to return and , estimates of and respectively from the batch of samples (details in Section 2.2). However, it will not be possible to label which estimated mean corresponds to and which one corresponds to .

• (Alignment) For some and for each such that , we also query the oracle with the vectors (sum query) and (difference query) repeatedly for and times respectively. Our objective is to cluster the set of estimated means into two equally sized clusters such that all the elements in a particular cluster are good estimates of querying the same unknown vector.

• Since the queries have the property of being a good compressed sensing matrix (they satisfy -RIP condition, a sufficient condition for - recovery in compressed sensing, with high probability), we can formulate a convex optimization problem using the estimates present in each cluster to recover the unknown vectors $\beta^1$ and $\beta^2$.

It is evident that the sample (query) complexity will be . In the subsections below, we will show each step more formally and provide upper bounds on the sufficient batchsize for each query.
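The first two steps of the scheme (Gaussian queries, then a batch of repeated samples per query) can be sketched as follows. This is an illustration only: the function name `collect_batches`, a single shared `batch_size`, and the example vectors are assumptions, whereas the paper chooses batch sizes per query.

```python
import random

def collect_batches(beta1, beta2, sigma, num_queries, batch_size, seed=0):
    """Draw Gaussian queries and gather a batch of oracle samples for each."""
    rng = random.Random(seed)
    n = len(beta1)
    batches = []
    for _ in range(num_queries):
        # Query vector with i.i.d. standard normal coordinates.
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        batch = []
        for _ in range(batch_size):
            # The oracle silently picks one of the two vectors per sample.
            beta = beta1 if rng.random() < 0.5 else beta2
            y = sum(xi * bi for xi, bi in zip(x, beta)) + rng.gauss(0.0, sigma)
            batch.append(y)
        batches.append((x, batch))
    return batches

batches = collect_batches([1.0, 0.0, 2.0], [0.0, -1.0, 0.0],
                          sigma=0.5, num_queries=4, batch_size=20)
# Total query complexity of this toy run: number of queries times batch size.
total_samples = sum(len(batch) for _, batch in batches)  # 4 * 20 = 80
```

Each batch is then a sample from a two-component Gaussian mixture, which is the object studied in the next subsection.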

### 2.2 Recovering Unknown Means from a Batch

For a query $x$, notice that the samples from the batch corresponding to $x$ are distributed according to a Gaussian mixture $\mathcal{M}$,

$$\mathcal{M} \triangleq \frac{1}{2}\mathcal{N}(\langle x, \beta^1 \rangle, \sigma^2) + \frac{1}{2}\mathcal{N}(\langle x, \beta^2 \rangle, \sigma^2),$$

an equally weighted mixture of two Gaussian distributions having means $\langle x, \beta^1 \rangle, \langle x, \beta^2 \rangle$ with known variance $\sigma^2$. For brevity, let us denote $\langle x, \beta^1 \rangle$ by $\mu_1$ and $\langle x, \beta^2 \rangle$ by $\mu_2$ from here on in this sub-section. In essence, our objective is to find the sufficient batchsize of $x$ so that it is possible to estimate $\mu_1$ and $\mu_2$ up to an additive error of $\gamma$. Below, we go over some methods providing theoretical guarantees on the sufficient sample complexity for approximating the means that will be suitable for different parameter regimes.

#### 2.2.1 Recovery using EM algorithm

The Expectation Maximization (EM) algorithm is widely known and used for the purpose of parameter learning of Gaussian mixtures, cf. Balakrishnan et al. (2017) and Xu et al. (2016). The EM algorithm tailored towards recovering the parameters of the mixture $\mathcal{M}$ is described in Algorithm 1. The following result, which gives a sample complexity guarantee for the EM algorithm, can be derived from Daskalakis et al. (2017) (stated in our terminology).

###### Theorem 2 (Finite sample EM analysis Daskalakis et al. (2017)).

From an equally weighted two component Gaussian mixture with unknown component means $\mu_1, \mu_2$ and known and shared variance $\sigma^2$, a total samples suffice to return $\hat{\mu}_1, \hat{\mu}_2$ such that, for some permutation $\pi$ of $\{1, 2\}$ and for $i = 1, 2$,

$$|\hat{\mu}_i - \mu_{\pi(i)}| \le \epsilon$$

using the EM algorithm with probability at least .

This theorem implies that EM algorithm requires smaller number of samples as the separation between the means grows larger. However, it is possible to have a better dependence on , especially when it is small compared to .
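For intuition, here is a textbook EM iteration for an equally weighted two-component mixture with known shared variance. It is a sketch and not necessarily identical to Algorithm 1: the function name `em_two_means`, the initialization at mean ± standard deviation, the fixed iteration count, and the clamp on the exponent are all assumptions made for this illustration, and `sigma` is assumed to be positive.

```python
import math
import random
import statistics

def em_two_means(samples, sigma, iterations=100):
    """Textbook EM for the mixture 0.5*N(mu1, sigma^2) + 0.5*N(mu2, sigma^2)."""
    center = statistics.fmean(samples)
    spread = statistics.pstdev(samples)
    mu1, mu2 = center - spread, center + spread  # heuristic initialization
    for _ in range(iterations):
        # E-step: posterior probability that each sample came from component 1.
        weights = []
        for y in samples:
            z = ((y - mu1) ** 2 - (y - mu2) ** 2) / (2.0 * sigma ** 2)
            z = max(-50.0, min(50.0, z))  # clamp to keep exp() well behaved
            weights.append(1.0 / (1.0 + math.exp(z)))
        # M-step: weighted averages, with the variance held fixed.
        w_sum = sum(weights)
        mu1 = sum(w * y for w, y in zip(weights, samples)) / w_sum
        mu2 = sum((1.0 - w) * y for w, y in zip(weights, samples)) / (len(samples) - w_sum)
    return tuple(sorted((mu1, mu2)))

# Synthetic batch with true means -2 and 3 (values chosen only for illustration).
rng = random.Random(0)
data = [rng.gauss(-2.0 if rng.random() < 0.5 else 3.0, 1.0) for _ in range(2000)]
estimates = em_two_means(data, sigma=1.0)
```

The estimates come back sorted, which mirrors the theorem: the means are recovered only up to a permutation, since nothing in a single batch says which component belongs to which unknown vector.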

#### 2.2.2 Method of Moments

Consider any Gaussian mixture with two components,

$$\mathcal{G} \triangleq p_1 \mathcal{N}(\mu_1, \sigma_1^2) + p_2 \mathcal{N}(\mu_2, \sigma_2^2),$$

where and

Define the variance of a random variable distributed according to $\mathcal{G}$ to be

$$\sigma_{\mathcal{G}}^2 \triangleq p_1 p_2 (\mu_1 - \mu_2)^2 + p_1 \sigma_1^2 + p_2 \sigma_2^2.$$

It was shown in Hardt & Price (2015) that samples are both necessary and sufficient to recover the unknown parameters up to an additive error of . However, in our setting the components of the mixture have the same known variance and, further, the mixture is equally weighted. Our first contribution is to show significantly better results for this special case.
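To see why the special case is easier, it helps to specialize the mixture variance defined above to equal weights and a shared variance, $p_1 = p_2 = 1/2$ and $\sigma_1 = \sigma_2 = \sigma$. The symbol $\sigma_{\mathcal{M}}^2$ for the variance of the resulting mixture is introduced here only for illustration:

```latex
\mathbb{E}_{X \sim \mathcal{M}}[X] = \frac{\mu_1 + \mu_2}{2},
\qquad
\sigma_{\mathcal{M}}^2 = \frac{(\mu_1 - \mu_2)^2}{4} + \sigma^2
\quad\Longleftrightarrow\quad
(\mu_1 - \mu_2)^2 = 4\sigma_{\mathcal{M}}^2 - 4\sigma^2 .
```

Since $\sigma$ is known, the mean and the variance of the mixture determine $\mu_1 + \mu_2$ and $|\mu_1 - \mu_2|$, and hence the unordered pair $\{\mu_1, \mu_2\}$.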

###### Theorem 3.

With samples, Algorithm 3 returns , such that for some permutation , we have, for with probability at least .

This theorem states that samples are sufficient to recover the unknown means of (as compared to the result for the general case). This is because the mean and variance are sufficient statistics for this special case (as compared to first six excess moments in the general case). We first show two technical lemmas providing guarantees on recovering the mean and the variance of a random variable distributed according to . The procedure to return and (estimates of and respectively) is described in Algorithm 2.

###### Lemma 1.

samples divided into equally sized batches are sufficient to compute (see Algorithm 2) such that with probability at least .

###### Lemma 2.

samples divided into equally sized batches are sufficient to compute (see Algorithm 2) such that with probability at least .

The detailed proofs of Lemmas 1 and 2 can be found in Appendix A. We are now ready to prove Theorem 3.

###### Proof of Theorem 3.

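We will set up the following system of equations in the variables $\hat{\mu}_1$ and $\hat{\mu}_2$:

$$\hat{\mu}_1 + \hat{\mu}_2 = 2\hat{M}_1 \quad \text{and} \quad (\hat{\mu}_1 - \hat{\mu}_2)^2 = 4\hat{M}_2 - 4\sigma^2$$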

Recall that from Lemma 1 and Lemma 2, we have computed and with the following guarantees: and Therefore, we must have , We can factorize the left hand side of the second equation in the following way: Notice that one of the factors must be less than . Without loss of generality, let us assume that This, along with the fact implies that (by adding and subtracting)
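Solving the system of equations from the proof is a two-line computation once estimates of the mixture mean and variance are available. The sketch below uses the plain sample mean and sample variance, which is a simplification: the lemmas above compute their estimates from equally sized batches (see Algorithm 2). The function name `moment_estimates` and the clipping at zero are assumptions of this illustration.

```python
import math
import statistics

def moment_estimates(samples, sigma):
    """Solve mu1 + mu2 = 2*M1 and (mu1 - mu2)^2 = 4*M2 - 4*sigma^2 for the means."""
    m1 = statistics.fmean(samples)      # plain sample mean of the mixture
    m2 = statistics.pvariance(samples)  # plain sample variance of the mixture
    # Half the gap between the means; clip at zero since noise can push
    # the estimated variance below sigma^2.
    half_gap = math.sqrt(max(m2 - sigma ** 2, 0.0))
    return m1 + half_gap, m1 - half_gap

# Noiseless toy batch: half the samples equal 1, the other half equal 3.
estimates = moment_estimates([1.0, 1.0, 3.0, 3.0], sigma=0.0)  # (3.0, 1.0)
```

As with EM, the two returned values carry no label; deciding which one corresponds to which unknown vector is left to the alignment step.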

#### 2.2.3 Fitting a single Gaussian

In the situation when both the variance of each component in $\mathcal{M}$ and the separation between the means are very small, fitting a single Gaussian to the samples obtained from $\mathcal{M}$ works better than the aforementioned techniques. The procedure to compute $\hat{M}_1$, an estimate of the mixture mean $(\mu_1 + \mu_2)/2$, is adapted from Daskalakis et al. (2017) and is described in Algorithm 4. Notice that Algorithm 4 is different from the naive procedure (averaging all samples) described in Algorithm 2 for estimating the mean of the mixture. The sample complexity for the naive procedure (see Lemma 1) scales with the gap even when the variance is small, which is undesirable. Instead, we have the following lemma.

###### Lemma 3 (Lemma 5 in Daskalakis et al. (2017)).

With Algorithm 4, samples are sufficient to compute such that with probability at least .

In this case, we will return $\hat{M}_1$ as the estimate of both the means $\mu_1, \mu_2$.

#### 2.2.4 Choosing appropriate methods

Among the above three methods to learn mixtures, the appropriate algorithm to apply for each parameter regime is listed below.

Case 1 (): We use the EM algorithm for this regime to recover . Notice that in this regime, by using Theorem 2 with , we obtain that samples are sufficient to recover up to an additive error of with probability at least .

Case 2 (): We use the method of moments to recover . In this regime, we must have . Therefore, by using Theorem 3 with , it is evident that samples are sufficient to recover up to an additive error of with probability at least .

Case 3 (): In this setting, we fit a single Gaussian. Using Lemma 3 with , we will be able to estimate up to an additive error of using samples. This, in turn, implies

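$$|\mu_i - \hat{M}_1| \le \frac{|\mu_1 - \mu_2|}{2} + \left|\frac{\mu_1 + \mu_2}{2} - \hat{M}_1\right| \le \gamma$$

for $i \in \{1, 2\}$, and therefore both the means $\mu_1, \mu_2$ are recovered up to an additive error of $\gamma$. Note that these three cases cover all possibilities.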

#### 2.2.5 Test for appropriate method

Now, we describe a test to infer which parameter regime we are in and therefore which algorithm to use. The final algorithm to recover the means from $\mathcal{M}$, including the test, is described in Algorithm 5. We have the following result, the proof of which is deferred to Appendix B.

###### Lemma 4.

The number of samples required for Algorithm 5 to infer the parameter regime correctly with probability at least is at most .

### 2.3 Alignment

For a query $x_i$, let us introduce the following notations for brevity:

$$\mu_{i,1} := \langle x_i, \beta^1 \rangle, \qquad \mu_{i,2} := \langle x_i, \beta^2 \rangle.$$

Now, using Algorithm 5, we can compute (estimates of ) using a batchsize of such that where is a permutation on .

The most important step in our process is to separate the estimates of the means according to the generative unknown sparse vectors and (i.e., alignment). Formally, we construct two -dimensional vectors and such that, for all the following hold:


• The elements of and , i.e., and , are and (but may not be respectively).

• Moreover, we must have the and to be good estimates of and respectively i.e. ; for all where is some permutation of .

In essence, for the alignment step, we want to find out all permutations . First, note that the aforementioned objective is trivial when . To see this, suppose is the identity permutation without loss of generality. In that case, we have for , and Similar guarantees also hold for and therefore the choice of the element of is trivial. This conclusion implies that the interesting case is only for those queries when . In other words, this objective is equivalent to separate out the permutations for into two groups such that all the permutations in each group are the same.

#### 2.3.1 Alignment for two queries

Consider two queries such that for . In this section, we will show how we can infer if is same as . Our strategy is to make two additional batches of queries corresponding to and (of size and respectively) which we shall call the sum and difference queries. Again, let us introduce the following notations: , As before, using Algorithm 5, we can compute (estimates of ) and (estimates of ) using a batchsize of and for the sum and difference query respectively such that and where are again unknown permutations of . We show the following lemma.

###### Lemma 5.

We can infer, using Algorithm 6, if and are same using the estimates provided , .

The proof of this lemma is delegated to appendix C and we provide an outline over here. In Algorithm 6, we first choose one value from (say ) and we check if we can choose one element (say ) from the set and one element (say ) in exactly one way such that . If that is true, then we infer that the tuple are estimates of the same unknown vector and accordingly return if is same as . If not possible, then we choose one value from (say ) and again we check if we can choose one element (say ) from the set and one element from (say ) in exactly one way such that . If that is true, then we infer that are estimates of the same unknown vector and accordingly return if is same as . It can be shown that we will succeed in this step using at least one of the sum or difference queries.
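The consistency check outlined above can be sketched in code. This is an illustration of the idea rather than a transcription of Algorithm 6: the function names, the tolerance parameter `tol`, and the convention of returning `None` when neither check is conclusive are assumptions made here.

```python
from itertools import product

def consistent_pairs(target, est_a, est_b, tol, sign):
    """Index pairs (i, j) with est_a[i] + sign * est_b[j] within tol of target."""
    return [(i, j) for i, j in product(range(2), range(2))
            if abs(est_a[i] + sign * est_b[j] - target) <= tol]

def same_order(est_a, est_b, est_sum, est_diff, tol):
    """Decide whether est_a and est_b list the two unknown vectors in the same order.

    est_a, est_b: mean estimates for two queries (each in unknown order).
    est_sum, est_diff: mean estimates for the sum and difference queries.
    Returns True or False when one check is unambiguous, otherwise None.
    """
    # Try the sum query first, then fall back to the difference query.
    for target, sign in ((est_sum[0], +1), (est_diff[0], -1)):
        pairs = consistent_pairs(target, est_a, est_b, tol, sign)
        if len(pairs) == 1:
            i, j = pairs[0]
            # A unique match pairs est_a[i] with est_b[j] as the same vector.
            return i == j
    return None

# Toy values: the first query sees means (1, 4), the second sees (3, -2).
est_a = (1.0, 4.0)
est_sum, est_diff = (4.0, 2.0), (-2.0, 6.0)
aligned = same_order(est_a, (3.0, -2.0), est_sum, est_diff, tol=0.1)
swapped = same_order(est_a, (-2.0, 3.0), est_sum, est_diff, tol=0.1)
```

In the toy example the sum query already settles the order; the difference query matters when the sum query admits two consistent pairings, for instance when two candidate sums coincide.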

#### 2.3.2 Alignment for all queries

We will align the mean estimates for all the queries by aligning one pair at a time. This routine is summarized in Algorithm 7, which works when . To understand the routine, we start with the following technical lemma:

###### Lemma 6.

Let, . For , there exists a query among such that with probability at least .