Linear regression without correspondence

05/19/2017 ∙ by Daniel Hsu, et al.

This article considers algorithmic and statistical aspects of linear regression when the correspondence between the covariates and the responses is unknown. First, a fully polynomial-time approximation scheme is given for the natural least squares optimization problem in any constant dimension. Next, in an average-case and noise-free setting where the responses exactly correspond to a linear function of i.i.d. draws from a standard multivariate normal distribution, an efficient algorithm based on lattice basis reduction is shown to exactly recover the unknown linear function in arbitrary dimension. Finally, lower bounds on the signal-to-noise ratio are established for approximate recovery of the unknown linear function by any estimator.

1 Introduction

Consider the problem of recovering an unknown vector $\bar{w} \in \mathbb{R}^d$ from noisy linear measurements when the correspondence between the measurement vectors and the measurements themselves is unknown. The measurement vectors (i.e., covariates) from $\mathbb{R}^d$ are denoted by $x_1, x_2, \dots, x_n$; for each $i$, the $i$-th measurement (i.e., response) $y_i \in \mathbb{R}$ is obtained using $x_{\bar\pi(i)}$:

$y_i = x_{\bar\pi(i)}^\top \bar{w} + \varepsilon_i, \qquad i = 1, 2, \dots, n. \qquad (1)$

Above, $\bar\pi$ is an unknown permutation on $\{1, 2, \dots, n\}$, and the $\varepsilon_i$ are unknown measurement errors.
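To make the measurement model concrete, the following sketch (ours, not from the paper) draws data from (1) in the Gaussian average-case setting considered later; the Gaussian choices for the covariates and errors are assumptions of that setting, not requirements of the model.

```python
import numpy as np

def sample_shuffled_regression(n, d, w, noise_std, seed=None):
    """Draw (X, y) from the measurement model (1) in the Gaussian setting:
    x_1, ..., x_n ~ N(0, I_d) i.i.d., a uniformly random unknown permutation
    pi, and y_i = <x_{pi(i)}, w> + eps_i with eps_i ~ N(0, noise_std^2)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))       # covariate vectors, one per row
    pi = rng.permutation(n)               # the unknown correspondence
    eps = noise_std * rng.standard_normal(n)
    y = X[pi] @ w + eps                   # y[i] = <x_{pi[i]}, w> + eps[i]
    return X, y, pi
```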

This problem (which has been called unlabeled sensing [22], linear regression with an unknown permutation [18], and linear regression with shuffled labels [1]) arises in many settings. For example, physical sensing limitations may create ambiguity in or lose the ordering of measurements. Or, the covariates and responses may be derived from separate databases that lack appropriate record linkage (perhaps for privacy reasons). See the aforementioned references for more details on these applications. The problem is also interesting because the missing correspondence makes an otherwise well-understood problem into one with very different computational and statistical properties.

Prior works.

Unnikrishnan et al. [22] study conditions on the measurement vectors that permit recovery of any target vector under noiseless measurements. They show that when the entries of the $x_i$ are drawn i.i.d. from a continuous distribution and $n \geq 2d$, then almost surely, every vector $\bar{w} \in \mathbb{R}^d$ is uniquely determined by noiseless correspondence-free measurements as in (1). (Under noisy measurements, it is shown that $\bar{w}$ can be recovered when an appropriate signal-to-noise ratio tends to infinity.) It is also shown that $n \geq 2d$ is necessary for such a guarantee to hold for all vectors $\bar{w}$.

Pananjady et al. [18] study statistical and computational limits on recovering the unknown permutation $\bar\pi$. On the statistical front, they consider necessary and sufficient conditions on the signal-to-noise ratio when the measurement errors are i.i.d. draws from the normal distribution $N(0, \sigma^2)$ and the measurement vectors are i.i.d. draws from the standard multivariate normal distribution $N(0, I_d)$. Roughly speaking, exact recovery of $\bar\pi$ is possible via maximum likelihood when the signal-to-noise ratio is at least $n^c$ for some absolute constant $c > 0$, and approximate recovery is impossible for any method when it is at most $n^{c'}$ for some other absolute constant $c' > 0$. On the computational front, they show that the least squares problem (which is equivalent to the maximum likelihood problem)

$\min_{\pi, \, w \in \mathbb{R}^d} \; \sum_{i=1}^n \big( y_i - x_{\pi(i)}^\top w \big)^2 \qquad (2)$

given arbitrary $x_i$'s and $y_i$'s is NP-hard when $d = \Omega(n)$,¹ but admits a polynomial-time algorithm (in fact, an $O(n \log n)$-time algorithm based on sorting) when $d = 1$.

¹Pananjady et al. [18] prove that Partition reduces to the problem of deciding if the optimal value of (2) is zero or non-zero. Note that Partition is weakly, but not strongly, NP-hard: it admits a pseudo-polynomial-time algorithm [10, Section 4.2]. In Appendix A, we prove that the least squares problem is strongly NP-hard by reduction from 3-Partition (which is strongly NP-complete [10, Section 4.2.2]).
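As an illustration of the $d = 1$ case, the following sketch (ours) implements an $O(n \log n)$ sorting-based solver in the spirit of the algorithm just mentioned: for a fixed sign of $w$, the optimal matching pairs the sorted responses with the covariates sorted in the same (respectively, opposite) order, so it suffices to try both matchings.

```python
import numpy as np

def shuffled_least_squares_1d(x, y):
    """Exact solver for the d = 1 least squares problem
    min_{w, pi} sum_i (y_i - w * x_{pi(i)})^2 in O(n log n) time.

    Rearrangement argument: for w >= 0 the optimal matching pairs x and y in
    the same sorted order; for w <= 0, in opposite orders.  Trying both
    matchings and fitting w by ordinary least squares on each one recovers
    the global optimum."""
    xs, ys = np.sort(x), np.sort(y)
    best_cost, best_w = np.inf, 0.0
    for xc in (xs, xs[::-1]):
        denom = xc @ xc
        w = (xc @ ys) / denom if denom > 0 else 0.0
        cost = np.sum((ys - w * xc) ** 2)
        if cost < best_cost:
            best_cost, best_w = cost, w
    return best_cost, best_w
```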

Abid et al. [1] observe that the maximum likelihood estimator can be inconsistent for estimating $\bar{w}$ in certain settings (including the normal setting of Pananjady et al. [18], with $\sigma^2$ fixed but $n \to \infty$). One of the alternative estimators they suggest is consistent under additional assumptions in dimension $d = 1$. Elhami et al. [8] give an algorithm that, in low dimension, is guaranteed to approximately recover $\bar{w}$ when the measurement vectors are chosen in a very particular way from the unit circle and the measurement errors are uniformly bounded.

Contributions.

We make progress on both computational and statistical aspects of the problem.

  1. We give an approximation algorithm for the least squares problem from (2) that, for any given $\varepsilon \in (0,1)$, $X$, and $y$, returns a solution with objective value at most $1 + \varepsilon$ times that of the minimum, in time polynomial in $n$ and $1/\varepsilon$ for any fixed dimension $d$. This is a fully polynomial-time approximation scheme for any constant dimension.

  2. We give an algorithm that exactly recovers $\bar{w}$ in the measurement model from (1), under the assumption that there are no measurement errors and the covariates are i.i.d. draws from $N(0, I_d)$. The algorithm, which is based on a reduction to a lattice problem and employs the lattice basis reduction algorithm of Lenstra et al. [16], runs in polynomial time when the covariate vectors and target vector are appropriately quantized. This result may also be regarded as a "for each"-type guarantee for exactly recovering a fixed vector $\bar{w}$, which complements the "for all"-type results of Unnikrishnan et al. [22] concerning the number of measurement vectors needed for recovering all possible vectors.

  3. We show that in the measurement model from (1), where the measurement errors are i.i.d. draws from $N(0, \sigma^2)$ and the covariate vectors are i.i.d. draws from $N(0, I_d)$, no algorithm can approximately recover $\bar{w}$ unless the signal-to-noise ratio exceeds a threshold determined by an absolute constant. We also show that when the covariate vectors are instead i.i.d. draws from a uniform distribution, approximate recovery is impossible unless the signal-to-noise ratio exceeds a threshold determined by some other absolute constant.

Our algorithms are not meant for practical deployment, but instead are intended to shed light on the computational difficulty of the least squares problem and the average-case recovery problem. Indeed, note that a naïve brute-force search over permutations requires time that grows as $n!$, and the only other previous algorithms (already discussed above) were restricted to dimension $d = 1$ [18] or only had some form of approximation guarantee in low dimension [8]. We are not aware of previous algorithms for the average-case problem in general dimension $d$.²

²A recent algorithm of Pananjady et al. [19] exploits a similar average-case setting but only for a somewhat easier variant of the problem where more information about the unknown correspondence is provided.
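To make the baseline concrete, a naive exhaustive search over permutations for problem (2) might look like the following sketch (ours); it requires time growing as $n!$ and is only feasible for very small $n$.

```python
import itertools
import numpy as np

def brute_force_least_squares(X, y):
    """Enumerate all n! permutations, solve ordinary least squares for each,
    and keep the best.  Only feasible for very small n."""
    best_cost, best_w, best_pi = np.inf, None, None
    for pi in itertools.permutations(range(len(y))):
        Xp = X[list(pi)]                               # rows x_{pi(i)}
        w, *_ = np.linalg.lstsq(Xp, y, rcond=None)     # min_w ||Xp w - y||^2
        cost = float(np.sum((Xp @ w - y) ** 2))
        if cost < best_cost:
            best_cost, best_w, best_pi = cost, w, pi
    return best_cost, best_w, best_pi
```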

Our lower bounds on the signal-to-noise ratio stand in contrast to what is achievable in the classical linear regression model (where the covariate/response correspondence is known): in that model, the signal-to-noise ratio required for approximately recovering $\bar{w}$ scales as $d/n$, and hence the problem becomes easier with larger $n$. The lack of correspondence thus drastically changes the difficulty of the problem.

2 Approximation algorithm for the least squares problem

In this section, we consider the least squares problem from Equation (2). The inputs are an arbitrary matrix $X \in \mathbb{R}^{n \times d}$ and an arbitrary vector $y \in \mathbb{R}^n$, and the goal is to find a vector $w \in \mathbb{R}^d$ and an $n \times n$ permutation matrix $P$³ to minimize $\|P X w - y\|_2^2$. This problem is NP-hard in the case where $d = \Omega(n)$ [18] (see also Appendix A). We give an approximation scheme that, for any $\varepsilon \in (0,1)$, returns a $(1+\varepsilon)$-approximation in time polynomial in $n$ and $1/\varepsilon$ for any fixed dimension $d$.

³Each permutation matrix $P$ corresponds to a permutation $\pi$ on $\{1, 2, \dots, n\}$; the $(i,j)$-th entry of $P$ is one if $j = \pi(i)$ and is zero otherwise.

We assume without loss of generality that $d \le n$ and that the columns of $X$ are orthonormal. This is because we can always replace $X$ with its matrix of left singular vectors $U$, obtained via the singular value decomposition $X = U \Sigma V^\top$, where $U$ and $V$ have orthonormal columns and $\Sigma$ is diagonal. A solution $(w, P)$ for $(U, y)$ has the same cost as the solution $(V \Sigma^{-1} w, P)$ for $(X, y)$, and a solution $(w, P)$ for $(X, y)$ has the same cost as the solution $(\Sigma V^\top w, P)$ for $(U, y)$.
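As a small illustration of this reduction (our sketch, with illustrative names), the following computes the orthonormal reparametrization via the singular value decomposition and returns a map from solutions for $U$ back to solutions for $X$.

```python
import numpy as np

def orthonormalize_covariates(X):
    """Replace X (n x d, assumed to have rank d) by its matrix of left
    singular vectors U, which has orthonormal columns.  A weight vector w_U
    found for (U, y) maps back to w_X = V diag(1/s) w_U for (X, y), and the
    two solutions have the same cost for any permutation."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(s) V^T
    to_original = lambda w_U: Vt.T @ (w_U / s)
    return U, to_original
```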

2.1 Algorithm

0:  Covariate matrix $X$; response vector $y$; approximation parameter $\varepsilon$.
0:  .
0:  Weight vector $\hat{w}$ and permutation matrix $\hat{P}$.
1:  Run “Row Sampling” algorithm with input matrix to obtain a matrix with .
2:  Let be the set of vectors satisfying the following: for each ,

  • if the -th column of is all zeros, then ;

  • otherwise, .

3:  Let .
4:  for each  do
5:     Compute , and let .
6:     Construct a -net for the Euclidean ball of radius around , so that for each with , there exists such that .
7:  end for
8:  return   and .
Algorithm 1 Approximation algorithm for least squares problem

Our approximation algorithm, shown as Algorithm 1, uses a careful enumeration to beat the naïve brute-force running time of $\Omega(n!)$. It uses as a subroutine a "Row Sampling" algorithm of Boutsidis et al. [5] (described in Appendix B), which has the following property.

Theorem 1 (Specialization of Theorem 12 in [5]).

There is an algorithm (“Row Sampling”) that, given any matrix with , returns in time a matrix with such that the following hold.

  1. Every row of has at most one non-zero entry.

  2. For every , every satisfies for .

The matrix returned by Row Sampling determines a (weighted) subset of rows of the covariate matrix such that solving an (ordinary) least squares problem, with any right-hand side, on this subset of rows and corresponding right-hand side entries yields an approximation (with the factor from Theorem 1) to the least squares problem over all rows and right-hand side entries. Row Sampling does not directly apply to our problem because (1) it does not minimize over permutations of the right-hand side, and (2) the approximation factor is too large. However, we are able to use it to narrow the search space in our problem.

An alternative to Row Sampling is to simply enumerate all subsets of $d$ rows of the covariate matrix. This is justified by a recent result of Dereziński and Warmuth [7], which shows that for any right-hand side, using "volume sampling" [3] to choose such a matrix (where each row has one non-zero entry) gives a guarantee similar to that of Row Sampling, except with the approximation factor replaced by $d + 1$ in expectation.

2.2 Analysis

The approximation guarantee of Algorithm 1 is given in the following theorem.

Theorem 2.

Algorithm 1 returns $\hat{w}$ and $\hat{P}$ satisfying

$\|\hat{P} X \hat{w} - y\|_2^2 \;\le\; (1 + \varepsilon) \cdot \min_{P, \, w} \|P X w - y\|_2^2 .$

Proof.

Let be the optimal cost, and let denote a solution achieving this cost. The optimality implies that satisfies the normal equations . Observe that there exists a vector satisfying . By Theorem 1 and the normal equations, the vector and cost value satisfy

Moreover, since , we have that . By construction of , there exists satisfying . For this , the normal equations imply

Therefore, the solution returned by Algorithm 1 has cost no more than . ∎

By the results of Pananjady et al. [18] for maximum likelihood estimation, our algorithm enjoys recovery guarantees for $\bar{w}$ and $\bar\pi$ when the data come from the Gaussian measurement model (1). However, the approximation guarantee also holds for worst-case inputs without generative assumptions.

Running time.

We now consider the running time of Algorithm 1. There is the initial cost of the singular value decomposition (as discussed at the beginning of the section) and of "Row Sampling"; both take time polynomial in $n$ and $d$. For the rest of the algorithm, we need to consider the size of the candidate set constructed in Step 2 and the size of the net constructed for each candidate in Step 6. The candidate set is not too large, since the matrix returned by Row Sampling has few rows and each row has at most a single non-zero entry. Next, for each candidate, we construct the net for the Euclidean ball by constructing a grid net for an enclosing $\ell_\infty$-ball (using an appropriate axis-aligned grid); its size is exponential only in $d$. Finally, each least squares computation takes polynomial time, and each minimization over permutations (for a fixed weight vector) takes $O(n \log n)$ time via sorting [18] (see also Appendix B). Hence the overall running time is polynomial in $n$ and $1/\varepsilon$ for any constant $d$.
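As an illustration of the grid-based net construction used in Step 6, the following sketch (ours) builds an axis-aligned grid net for a Euclidean ball; the radius, center, and constants used by Algorithm 1 are not reproduced here.

```python
import itertools
import numpy as np

def euclidean_ball_net(center, radius, eps):
    """Axis-aligned grid net for the Euclidean ball B(center, radius): every
    point of the ball is within eps * radius of some returned point.  Grid
    spacing eps * radius / sqrt(d) suffices; the net size grows roughly like
    (O(sqrt(d) / eps))^d, so this is only viable for small d."""
    center = np.asarray(center, dtype=float)
    d = center.size
    step = eps * radius / np.sqrt(d)
    ticks = np.arange(-radius, radius + step, step)
    net = []
    for offset in itertools.product(ticks, repeat=d):
        offset = np.array(offset)
        if np.linalg.norm(offset) <= radius + step * np.sqrt(d):
            net.append(center + offset)
    return net
```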

3 Exact recovery algorithm in noiseless Gaussian setting

To counter the intractability of the least squares problem in (2) confronted in Section 2, it is natural to explore distributional assumptions that may lead to faster algorithms. In this section, we consider the noiseless measurement model in which the covariate vectors are i.i.d. draws from $N(0, I_d)$ (as in [18]). We give an algorithm that exactly recovers $\bar{w}$ with high probability when $n \geq d + 1$. The algorithm runs in polynomial time when the covariate vectors and $\bar{w}$ are appropriately quantized.

It will be notationally simpler to consider $n + 1$ covariate vectors $x_0, x_1, \dots, x_n$ and responses $y_0, y_1, \dots, y_n$:

$y_i = x_{\bar\pi(i)}^\top \bar{w}, \qquad i = 0, 1, \dots, n. \qquad (3)$

Here, $x_0, x_1, \dots, x_n$ are i.i.d. draws from $N(0, I_d)$, the unknown permutation $\bar\pi$ is over $\{0, 1, \dots, n\}$, and the requirement of at least $d + 1$ measurements is expressed as $n \geq d$.

In fact, we shall consider a variant of the problem in which we are given one of the values of the unknown permutation $\bar\pi$. Without loss of generality, assume we are given that $\bar\pi(0) = 0$. Solving this variant of the problem suffices because there are only $n + 1$ possible values of $\bar\pi(0)$: we can try them all, incurring just an $O(n)$ factor in the computation time. So henceforth, we just consider $\bar\pi$ as an unknown permutation on $\{1, 2, \dots, n\}$.

3.1 Algorithm

Our algorithm, shown as Algorithm 2, is based on a reduction to the Subset Sum problem. An instance of Subset Sum is specified by an unordered collection of source numbers $a_1, a_2, \dots, a_N$ and a target sum $t$. The goal is to find a subset $S \subseteq \{1, 2, \dots, N\}$ such that $\sum_{i \in S} a_i = t$. Although Subset Sum is NP-hard in the worst case, it is tractable for certain structured instances [12, 9]. We prove that Algorithm 2 constructs such an instance with high probability. A similar algorithm based on such a reduction was recently used by Andoni et al. [2] for a different but related problem.

Algorithm 2 proceeds by (i) solving a Subset Sum instance based on the covariate vectors and response values (using Algorithm 3), and (ii) constructing a permutation on $\{1, 2, \dots, n\}$ based on the solution to the Subset Sum instance. With the permutation $\hat\pi$ in hand, we (try to) find a solution $\hat{w}$ to the system of linear equations $y_i = x_{\hat\pi(i)}^\top w$, $i = 1, 2, \dots, n$, for $w$. If $n \geq d$, then there is a unique such solution almost surely.
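Before turning to the pseudocode, here is a minimal sketch (ours, with illustrative names) of that final linear-algebraic step; a non-negligible residual indicates that the candidate permutation is wrong.

```python
import numpy as np

def recover_weights(X, y, pi):
    """Given covariates X (rows x_1, ..., x_n), responses y_1, ..., y_n, and a
    candidate permutation pi with y_i = <x_{pi(i)}, w>, recover w by solving
    the linear system in the least squares sense.  With n >= d Gaussian
    covariates and noiseless responses the solution is unique almost surely;
    a non-negligible residual signals an incorrect permutation."""
    Xp = X[np.asarray(pi)]
    w, *_ = np.linalg.lstsq(Xp, y, rcond=None)
    residual = float(np.linalg.norm(Xp @ w - y))
    return w, residual
```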

0:  Covariate vectors $x_0, x_1, \dots, x_n$ in $\mathbb{R}^d$; response values $y_0, y_1, \dots, y_n$ in $\mathbb{R}$; confidence parameter $\delta$; lattice parameter $M$.
0:  there exists and permutation on such that for each , and that .
0:  Permutation on or failure.
1:  Let , and its pseudoinverse be .
2:  Create Subset Sum instance with source numbers for and target sum .
3:  Run Algorithm 3 with the Subset Sum instance and lattice parameter $M$.
4:  if Algorithm 3 returns a solution  then
5:     return  any permutation on such that implies .
6:  else
7:     return  failure.
8:  end if
Algorithm 2 Find permutation
0:  Source numbers ; target sum ; lattice parameter .
0:  Subset or failure.
1:  Construct lattice basis where
2:  Run basis reduction [e.g., 16] to find non-zero lattice vector of length at most .
3:  if , with and is characteristic vector for some  then
4:     return  .
5:  else
6:     return  failure.
7:  end if
Algorithm 3 Lagarias and Odlyzko [12] subset sum algorithm
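Algorithm 3 follows the Lagarias–Odlyzko approach. The sketch below (ours) builds the standard form of that lattice basis and shows the decoding check from Step 3; the exact scaling by the lattice parameter $M$ in the paper's construction may differ, and an off-the-shelf reduction routine (for example, LLL run on a suitably scaled integer version of the basis) would be invoked between the two functions.

```python
import numpy as np

def lagarias_odlyzko_basis(a, t, M):
    """Standard Lagarias-Odlyzko lattice basis for a Subset Sum instance with
    source numbers a_1, ..., a_N and target t; M is the lattice parameter.
    Row i (for i < N) is (e_i, M * a_i); the last row is (0, ..., 0, M * t).
    If S solves the instance, then sum_{i in S} b_i - b_{N+1} = (1_S, 0) is a
    short non-zero lattice vector, which basis reduction can find."""
    a = np.asarray(a, dtype=float)
    N = len(a)
    B = np.zeros((N + 1, N + 1))
    B[:N, :N] = np.eye(N)
    B[:N, N] = M * a
    B[N, N] = M * t
    return B

def decode_short_vector(v, tol=1e-8):
    """Check whether a candidate short lattice vector is an integer multiple
    of the characteristic vector of a subset (the test in Step 3 of
    Algorithm 3); return that subset, or None."""
    v = np.asarray(v, dtype=float)
    if abs(v[-1]) > tol:
        return None
    nonzero = [c for c in v[:-1] if abs(c) > tol]
    if not nonzero:
        return None
    z = v[:-1] / nonzero[0]
    if np.all((np.abs(z) < tol) | (np.abs(z - 1.0) < tol)):
        return {i for i, zi in enumerate(z) if zi > 0.5}
    return None
```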

3.2 Analysis

The following theorem is the main recovery guarantee for Algorithm 2.

Theorem 3.

Pick any $\delta \in (0, 1)$. Suppose $x_0, x_1, \dots, x_n$ are i.i.d. draws from $N(0, I_d)$, that $y_0, y_1, \dots, y_n$ follow the noiseless measurement model from (3) for some non-zero $\bar{w}$ and permutation $\bar\pi$ on $\{0, 1, \dots, n\}$ with $\bar\pi(0) = 0$, and that $n \geq d$. Furthermore, suppose Algorithm 2 is run with these covariate vectors and responses, confidence parameter $\delta$, and a lattice parameter $M$ at least as large as the quantity defined in Equation (8). With probability at least $1 - \delta$, Algorithm 2 returns $\bar\pi$.

Remark 1.

The value of from Equation 8 is directly proportional to , and Algorithm 2 requires a lower bound on (in the setting of the lattice parameter ). Hence, it suffices to determine a lower bound on . Such a bound can be obtained from the measurement values: a standard tail bound (Lemma 6 in Appendix C) shows that with high probability, is a lower bound on , and is within a constant factor of it as well.

Remark 2.

Algorithm 2 strongly exploits the assumption of noiseless measurements, which is expected given the lower bounds of Pananjady et al. [18] for recovering $\bar\pi$. The algorithm, however, is very brittle and is likely to fail in the presence of noise.

Remark 3.

The recovery result does not contradict the results of Unnikrishnan et al. [22], which show that a collection of $2d$ measurement vectors is necessary for recovering all possible vectors $\bar{w}$, even in the noiseless measurement model of (3). Indeed, our result shows that for a fixed $\bar{w}$, with high probability $d + 1$ measurements in the model of (3) suffice to permit exact recovery of $\bar{w}$, but this same set of measurement vectors (when $d + 1 < 2d$) will fail for some other $\bar{w}$.

The proof of Theorem 3 is based on the following theorem—essentially due to Lagarias and Odlyzko [12] and Frieze [9]—concerning certain structured instances of Subset Sum that can be solved using the lattice basis reduction algorithm of Lenstra et al. [16]. Given a basis $b_1, b_2, \dots, b_N$ for a lattice $\mathcal{L} = \{\sum_{i=1}^N c_i b_i : c_1, c_2, \dots, c_N \in \mathbb{Z}\}$, this algorithm can be used to find, in polynomial time, a non-zero lattice vector whose length is at most $2^{(N-1)/2}$ times that of the shortest non-zero vector in the lattice.

Theorem 4 ([12, 9]).

Suppose the Subset Sum instance specified by source numbers $a_1, a_2, \dots, a_N$ and target sum $t$ satisfies the following properties.

  1. There is a subset $S \subseteq \{1, 2, \dots, N\}$ such that $\sum_{i \in S} a_i = t$.

  2. Define and . There exists such that for each that is not an integer multiple of , where is the characteristic vector for .

Let be the lattice basis constructed by Algorithm 3, and assume . Then every non-zero vector in the lattice with length at most times the length of the shortest non-zero vector in is an integer multiple of the vector , and the basis reduction algorithm of Lenstra et al. [16] returns such a non-zero vector.

The Subset Sum instance constructed in Algorithm 2 has source numbers and target sum . We need to show that it satisfies the two conditions of Theorem 4.

Let , and let be the permutation matrix with for all . Note that is the “characteristic vector” for . Define and

A crude bound shows that .

The following lemma establishes the first required property in Theorem 4.

Lemma 1.

The random matrix

has rank almost surely, and the subset satisfies .

Proof.

That has rank almost surely follows from the fact that the probability density of is supported on all of . This implies that , and

The next lemma establishes the second required property in Theorem 4. Here, we use the fact that the Frobenius norm is at least one whenever is not an integer multiple of .

Lemma 2.

Pick any such that . With probability at least , every with satisfies

Proof.

By Lemma 1, the matrix satisfies . Fix any with . Then

Using matrix and vector notations, this can be written compactly as the inner product . Since and is independent of

, the distribution of the inner product is normal with mean zero and standard deviation equal to

. By Lemma 7 (in Appendix C), with probability at least ,

(4)

Observe that since has rank by Lemma 1, so

(5)

By Lemma 4 (in Appendix C), with probability at least ,

(6)

And by Lemma 9 (in Appendix C), with probability at least ,

(7)

Since the matrix in question is orthogonal, it preserves Euclidean norms. Combining this with (4), (5), (6), and (7), and taking a union bound over all such vectors, proves the claim. ∎

Proof of Theorem 3.

Lemma 1 and Lemma 2 (with and ) together imply that with probability at least , the source numbers and target sum satisfy the conditions of Theorem 4 with

(8)

Thus, in this event, Algorithm 3 (with satisfying ) returns , which uniquely determines the permutation returned by Algorithm 2. ∎

Running time.

The basis reduction algorithm of Lenstra et al. [16] is iterative, with each iteration primarily consisting of Gram-Schmidt orthogonalization and another efficient linear algebraic process called "size reduction". The total number of iterations required is polynomial in the number of basis vectors and in the logarithm of the length of the longest basis vector. In our case, by Lemma 10 (in Appendix C), the logarithm of the length of each constructed basis vector is polynomially bounded. Using the tight setting of the lattice parameter required in Theorem 3, this gives a polynomial bound on the total number of iterations as well as on the total running time.

However, the basis reduction algorithm requires exact arithmetic and rounding operations, which are typically only available for finite-precision rational inputs. Therefore, a formal running time analysis would require the idealized real-valued covariate vectors and the unknown target vector $\bar{w}$ to be quantized to finite-precision values. This is doable, and is similar to using a discretized Gaussian distribution for the covariate vectors (and assuming $\bar{w}$ is a vector of finite-precision values), but leads to a messier analysis incomparable to the setup of previous works. Nevertheless, it would be desirable to find a different algorithm that avoids lattice basis reduction and still succeeds with so few measurements.
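As a small illustration of the kind of quantization meant here, one could round each real input to a rational with a fixed power-of-two denominator before running basis reduction; the number of bits below is an arbitrary illustrative choice, not the paper's.

```python
from fractions import Fraction
import numpy as np

def quantize_to_rationals(values, bits=32):
    """Round real inputs to rationals with denominator 2**bits, so that basis
    reduction can be run with exact rational arithmetic.  The precision
    'bits' is an arbitrary illustrative choice; a formal analysis would have
    to track how the rounding perturbs the constructed Subset Sum instance."""
    scale = 2 ** bits
    flat = np.asarray(values, dtype=float).ravel()
    return [Fraction(int(round(v * scale)), scale) for v in flat]
```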

4 Lower bounds on signal-to-noise for approximate recovery

In this section, we consider the measurement model from (1), where the covariate vectors $x_i$ are i.i.d. draws from either $N(0, I_d)$ or a uniform distribution, and the measurement errors $\varepsilon_i$ are i.i.d. draws from $N(0, \sigma^2)$. We establish lower bounds on the signal-to-noise ratio ($\mathsf{SNR} := \|\bar{w}\|_2^2 / \sigma^2$) required by any estimator of $\bar{w}$ to approximately recover it in expectation. The estimators may have a priori knowledge of the values of $\|\bar{w}\|_2$ and $\sigma^2$.
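As a small illustration (ours) of this signal-to-noise convention, the sketch below generates data from the Gaussian-covariate model at a prescribed ratio by fixing $\|\bar{w}\|_2 = 1$ and scaling the noise level accordingly.

```python
import numpy as np

def sample_at_snr(n, d, snr, seed=None):
    """Generate data from model (1) with Gaussian covariates at a prescribed
    signal-to-noise ratio ||w||^2 / sigma^2: fix ||w||_2 = 1 and set the noise
    standard deviation to 1 / sqrt(snr)."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)                      # ||w||_2 = 1
    sigma = 1.0 / np.sqrt(snr)                  # so ||w||^2 / sigma^2 = snr
    X = rng.standard_normal((n, d))
    pi = rng.permutation(n)
    y = X[pi] @ w + sigma * rng.standard_normal(n)
    return X, y, w, pi
```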

Theorem 5.

Assume the measurement errors $\varepsilon_1, \dots, \varepsilon_n$ are i.i.d. draws from $N(0, \sigma^2)$.

  1. There is an absolute constant such that the following holds. If , , are i.i.d. draws from , follow the measurement model from (1), and

    then for any estimator , there exists some such that

  2. If are i.i.d. draws from the uniform distribution on , and follow the measurement model from (1), and

    then for any estimator , there exists some such that

Note that in the classical linear regression model where