Lower Bounds for Compressed Sensing with Generative Models

12/06/2019 ∙ by Akshay Kamath, et al. ∙ The University of Texas at Austin

The goal of compressed sensing is to learn a structured signal x from a limited number of noisy linear measurements y ≈ Ax. In traditional compressed sensing, "structure" is represented by sparsity in some known basis. Inspired by the success of deep learning in modeling images, recent work starting with [BJP+17] has instead considered structure to come from a generative model G: R^k → R^n. We present two results establishing the difficulty of this latter task, showing that existing bounds are tight. First, we provide a lower bound matching the [BJP+17] upper bound for compressed sensing from L-Lipschitz generative models G. In particular, there exists such a function that requires roughly Ω(k log L) linear measurements for sparse recovery to be possible. This holds even for the more relaxed goal of nonuniform recovery. Second, we show that generative models generalize sparsity as a representation of structure. In particular, we construct a ReLU-based neural network G: R^2k → R^n with O(1) layers and O(kn) activations per layer, such that the range of G contains all k-sparse vectors.


1 Introduction

In compressed sensing, one would like to learn a structured signal x from a limited number of linear measurements y ≈ Ax. This is motivated by two observations: first, there are many situations where linear measurements are easy, in settings as varied as streaming algorithms, single-pixel cameras, genetic testing, and MRIs. Second, the unknown signals being observed are structured or “compressible”: although x lies in R^n, it would take far fewer than n words to describe x. In such a situation, one can hope to estimate x well from a number of linear measurements that is closer to the size of the compressed representation of x than to its ambient dimension n.

In order to do compressed sensing, one needs a formal notion of how signals are expected to be structured. The classic answer is to use sparsity. Given m linear measurements Ax of an arbitrary vector x ∈ R^n (the algorithms we discuss can also handle post-measurement noise, where one observes Ax plus a noise term; we remove this term for simplicity: this paper focuses on lower bounds, and handling this term could only make things harder), one can hope to recover an estimate x̂ of x satisfying

  ‖x̂ − x‖ ≤ C · min_{k-sparse x′} ‖x − x′‖        (1)

for some constant C and norm ‖·‖. In this paper, we will focus on the ℓ2 norm and achieving the guarantee with constant probability. Thus, if x is well-approximated by a k-sparse vector x′, it should be accurately recovered. Classic results such as [CRT06] show that (1) is achievable with m = O(k log(n/k)) independent Gaussian linear measurements. This bound is tight, and in fact no distribution of matrices with fewer rows can achieve this guarantee in either ℓ1 or ℓ2 [DIP+10].
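
As a concrete illustration of this classical setting (not part of the paper's argument), here is a small numpy sketch that takes Gaussian measurements of a k-sparse vector and recovers it with orthogonal matching pursuit; the constant in m and all other parameters are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 1000, 10
m = 5 * k * int(np.log(n / k))          # illustrative constant in O(k log(n/k))
A = rng.standard_normal((m, n)) / np.sqrt(m)

x = np.zeros(n)
support = rng.choice(n, size=k, replace=False)
x[support] = rng.standard_normal(k)
y = A @ x                               # noiseless linear measurements

def omp(A, y, k):
    """Orthogonal matching pursuit: greedily pick the column most correlated
    with the residual, then re-fit by least squares on the chosen support."""
    S, r = [], y.copy()
    for _ in range(k):
        S.append(int(np.argmax(np.abs(A.T @ r))))
        coef, *_ = np.linalg.lstsq(A[:, S], y, rcond=None)
        r = y - A[:, S] @ coef
    x_hat = np.zeros(A.shape[1])
    x_hat[S] = coef
    return x_hat

x_hat = omp(A, y, k)
print("relative error:", np.linalg.norm(x_hat - x) / np.linalg.norm(x))
```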

Although compressed sensing has had success, sparsity is a limited notion of structure. Can we learn a richer model of signal structure from data, and use this to perform recovery? In recent years, deep convolutional neural networks have had great success in producing rich models for representing the manifold of images, notably with generative adversarial networks (GANs) [GPM+14] and variational autoencoders (VAEs) [KW14]. These methods produce generative models that allow approximate sampling from the distribution of images. So a natural question is whether these generative models can be used for compressed sensing.

In [BJP+17] it was shown how to use generative models to achieve a guarantee analogous to (1): for any L-Lipschitz function G: R^k → R^n, one can achieve

  ‖x̂ − x‖ ≤ C · min_{z ∈ B_k(r)} ‖G(z) − x‖ + δ        (2)

where r and δ are parameters, B_k(r) denotes the radius-r ℓ2 ball in R^k, and Lipschitzness is defined with respect to the ℓ2 norms, using only m = O(k log(Lr/δ)) measurements. Thus, the recovered vector is almost as good as the nearest point in the range of the generative model, rather than in the set of k-sparse vectors. We will refer to the problem of achieving the guarantee in (2) as “function-sparse recovery”.
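
For intuition about how such guarantees are achieved algorithmically, here is a minimal sketch in the spirit of the [BJP+17] approach: minimize ‖AG(z) − y‖² over the latent variable z by gradient descent and output G(z). The toy random-weight generator and all hyperparameters below are illustrative assumptions of this sketch, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
k, hidden, n, m = 4, 32, 100, 60
W1 = rng.standard_normal((hidden, k))
W2 = rng.standard_normal((n, hidden)) / np.sqrt(hidden)
relu = lambda t: np.maximum(t, 0.0)
G = lambda z: W2 @ relu(W1 @ z)

x = G(rng.standard_normal(k))            # signal in the range of G
A = rng.standard_normal((m, n)) / np.sqrt(m)
y = A @ x                                # linear measurements

def descend(z, steps=3000, lr=5e-4):
    """Gradient descent on the non-convex loss ||A G(z) - y||^2."""
    for _ in range(steps):
        h = W1 @ z
        r = A @ (W2 @ relu(h)) - y
        grad_z = W1.T @ ((W2.T @ (A.T @ (2 * r))) * (h > 0))
        z = z - lr * grad_z
    return z

# The objective is non-convex, so in practice one uses a few random restarts
# and keeps the latent with the smallest measurement error.
best = min((descend(rng.standard_normal(k)) for _ in range(5)),
           key=lambda z: np.linalg.norm(A @ G(z) - y))
print("relative error:", np.linalg.norm(G(best) - x) / np.linalg.norm(x))
```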

Our main theorem is that the [BJP+17] result is tight: for any setting of parameters n, k, L, r, δ, there exists an L-Lipschitz function G: R^k → R^n such that any algorithm achieving (2) with constant probability must use Ω(min(k log(Lr/δ), n)) linear measurements (roughly Ω(k log L) in the typical parameter regime). Notably, the additive error δ that was unnecessary in sparse recovery is necessary for general Lipschitz generative model recovery.

A concurrent paper [LS19] proves a lower bound for a restricted version of (2). They show a lower bound in the case where the vector x lies in the image of G, and for a particular value of δ. Our results, in comparison, apply to the most general version of the problem and are proven using a simpler communication complexity technique.

The second result in this paper is to directly relate the two notions of structure: sparsity and generative models. We produce a simple Lipschitz neural network G: R^{2k} → R^n, with ReLU activations, O(1) hidden layers, and maximum width O(kn), so that the range of G contains all k-sparse vectors.

A second result of [BJP+17] is that for ReLU-based neural networks, one can avoid the additive δ term and achieve a guarantee different from (2):

  ‖x̂ − x‖ ≤ C · min_{z ∈ R^k} ‖G(z) − x‖        (3)

using O(kd log w) measurements, if d is the depth and w is the maximum number of activations per layer. Applying this result to our sparsity-producing network implies, with O(k log n) measurements, recovery achieving the standard sparsity guarantee (1). So the generative-model representation of structure really is more powerful than sparsity.
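
Concretely, for the sparsity-producing network of Theorem 2.3 this parameter count is

  m = O(k · d · log w) = O(k · log(kn)) = O(k log n)   (using d = O(1), w = O(kn), and k ≤ n).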

2 Proof overview

As described above, this paper contains two results: a lower bound of roughly Ω(k log(Lr/δ)) measurements for compressed sensing relative to a Lipschitz generative model, and an O(1)-layer generative model whose range contains all k-sparse vectors. These results are orthogonal, and we outline each in turn.

2.1 Lower bound for Lipschitz generative recovery.

Over the last decade, lower bounds for sparse recovery have been studied extensively. The techniques in this paper are most closely related to the techniques used in [DIP+10].

Similar to [DIP+10], our proof is based on communication complexity. We will exhibit an L-Lipschitz function G and a large finite set Y of points in its image that are well-separated. Then, given a point x that is picked uniformly at random from Y, we show how to identify it from Ax using the function-sparse recovery algorithm. This implies Ax contains a lot of information, so m must be fairly large.

Formally, we produce a generative model whose range includes a large, well-separated set:

Theorem 2.1.

Given n, k, L, r, and δ satisfying appropriate bounds, there exists an L-Lipschitz function G: R^k → R^n and a set of points Y in the image of G such that

  1. log |Y| = Ω(k log(Lr/δ)),

  2. ‖y‖ = O(Lr) for all y ∈ Y, and

  3. ‖y − y′‖ = Ω(δ) for all distinct y, y′ ∈ Y.

Now, suppose we have an algorithm that can perform function-sparse recovery with respect to the G from Theorem 2.1, with approximation factor C and additive error δ, where the minimum in (2) is taken over the radius-r ball in k dimensions. Take x = Σ_i ε^i x_i, where each x_i is an element of Y and ε is a small constant. The idea of the proof is the following: given Ax, we can recover an x̂ that is much closer to ε x_1 than to any other scaled element of Y, and so, because Y has minimum distance Ω(δ), we can exactly recover x_1 by rounding (after undoing the scaling) to the nearest element of Y. But then we can repeat the process on x − ε x_1 to find x_2, then x_3, up to x_t, and learn t · log|Y| bits total. Thus Ax must contain this many bits of information; but if the entries of A are rational numbers with bounded numerators and (the same) bounded denominator, then each entry of Ax can be described in a bounded number of bits, so m times this per-entry bound must be at least t · log|Y|, which rearranges to the lower bound on m.
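
Schematically, writing b for the number of bits used to describe each entry of Ax (a symbol introduced only for this remark), the counting step is

  m · b ≥ t · log|Y| = Ω(t · k log(Lr/δ)),

and the formal argument in Section 3 chooses t and the discretization so that this rearranges to the bound claimed in Theorem 2.2.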

There are two issues that make the above outline not totally satisfactory, and we only briefly describe here how to resolve them. First, the theorem statement makes no supposition that the entries of A are polynomially bounded. To resolve this, we perturb x with a tiny (polynomially small) amount of additive Gaussian noise, after which discretizing the measurements at an even tinier (but still polynomially small) precision has negligible effect on the failure probability. The second issue is that the above outline requires the algorithm to recover all the vectors x_i, so it only applies if the algorithm succeeds with high probability rather than constant probability. This is resolved by using a reduction from the augmented indexing problem, which is a one-way communication problem where Alice has a string y, Bob has an index i and the suffix of y after position i, and Alice must send Bob a message so that Bob can output y_i with constant probability. This still requires Ω(|y|) bits of communication, and it can be solved by sending the (discretized) measurements Ax as above.

Theorem 2.2.

Consider any n, k, L, r, and δ satisfying the conditions of Theorem 2.1. There exists an L-Lipschitz function G: R^k → R^n such that, if an algorithm picks a matrix A ∈ R^{m×n} and, given Ax, returns an x̂ satisfying (2) with constant probability, then m = Ω(min(k log(Lr/δ), n)).

Constructing the set.

The above lower bound approach relies on finding a large, well-separated set of points in the image of G, as in Theorem 2.1.

We construct this aforementioned set within the n-dimensional ball of radius O(Lr) such that any two points in the set are at least Ω(δ) apart. Furthermore, since we wish to use a function-sparse recovery algorithm, we describe a function G and set the radius of its domain so that G is L-Lipschitz. In order to get the desired lower bound, the image of G needs to contain a subset of at least 2^{Ω(k log(Lr/δ))} such points.

First, we construct a mapping as described above from an interval of the real line to R^t, i.e., we need to find points in R^t that are mutually far apart. We show that certain binary linear codes over the alphabet {±1} yield such points, which are mutually Θ(√t) apart. We then construct an L-Lipschitz mapping of points in the interval to a subset of these points.

In order to extend this construction to a mapping from R^k to R^n, we apply the above function in a coordinate-wise manner. This results in a mapping with the same Lipschitz parameter. The images of the resulting grid points lie in a ball of radius O(Lr) but could potentially be close to one another. To get around this, we use an error-correcting code over a large alphabet to choose a subset of these points that is large enough and whose elements are still mutually far apart.

2.2 Sparsity-producing generative model.

To produce a generative model whose range consists of all k-sparse vectors, we start by mapping R^2 to the set of positive 1-sparse vectors. For any pair of angles α < β, we can use a constant number of unbiased ReLUs to produce a neuron that is only active at points whose representation (ρ, θ) in polar coordinates has θ ∈ (α, β). Moreover, because unbiased ReLUs behave linearly, the activation can be made an arbitrary positive real by scaling the input appropriately. By applying this n times in parallel with disjoint angular ranges, we can produce n neurons with disjoint activation ranges, making a network R^2 → R^n whose range contains all 1-sparse vectors with nonnegative coordinates.

By doing this k times and adding up the results, we produce a network R^{2k} → R^n whose range contains all k-sparse vectors with nonnegative coordinates. To support negative coordinates, we just extend the solution so that each gadget has two angular ranges within which it is non-zero: for one range of θ the output is positive, and for the other the output is negative. (A simplified numerical sketch of this construction is given below.)
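
Below is a minimal numpy sketch of a simplified variant of this construction (cone-shaped activation regions and our own choice of weights, rather than the exact gadget used in Section 4); it numerically checks that an arbitrary k-sparse vector lies in the range of the resulting network.

```python
import numpy as np

def sparse_cover_net(n, k):
    """A constant-depth ReLU network G: R^{2k} -> R^n whose range contains
    every k-sparse vector (simplified variant; not the paper's exact weights).

    Each block of 2 inputs is read as a point z in the plane. Output
    coordinate i of a block is nonzero only when z lies in a narrow cone
    around a direction u_i (a second, antipodal cone gives negative values),
    and its value scales linearly with |z|. The raw input can be routed to
    the second ReLU stage via relu(t) - relu(-t), so this is still a
    constant-depth ReLU network of width O(n) per block."""
    relu = lambda t: np.maximum(t, 0.0)
    angles = np.pi * (np.arange(2 * n) + 0.5) / n      # 2n cone centers, pi/n apart
    c = 1.0 / np.tan(np.pi / (4 * n))                  # keeps the cones disjoint
    U = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # cone directions
    V = np.stack([-np.sin(angles), np.cos(angles)], axis=1)  # orthogonal directions

    def G(latent):
        out = np.zeros(n)
        for z in latent.reshape(k, 2):
            # First ReLU stage: |<v_j, z>| = relu(<v_j, z>) + relu(-<v_j, z>).
            absv = relu(V @ z) + relu(-(V @ z))
            # Second ReLU stage: gadget j is active only inside cone j.
            g = relu(U @ z - c * absv)
            out += g[:n] - g[n:]                       # signed value per coordinate
        return out

    def encode(support, values):
        """Latent vector whose image under G is the k-sparse vector with the
        given support and values: put block b's input at the center of the
        cone for coordinate support[b], at radius |values[b]|."""
        latent = np.zeros((k, 2))
        for b, (i, t) in enumerate(zip(support, values)):
            latent[b] = abs(t) * U[i if t >= 0 else i + n]
        return latent.reshape(-1)

    return G, encode

# Check that an arbitrary k-sparse target lies in the range of G.
n, k = 12, 3
G, encode = sparse_cover_net(n, k)
rng = np.random.default_rng(0)
support = rng.choice(n, size=k, replace=False)
values = rng.standard_normal(k)
target = np.zeros(n)
target[support] = values
assert np.allclose(G(encode(support, values)), target)
```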

This results in the following theorem:

Theorem 2.3.

There exists a 2-layer neural network G: R^{2k} → R^n with width O(kn) such that every k-sparse vector in R^n lies in the range of G.

3 Lower bound proof

In this section, we prove a lower bound for the sample complexity of function-sparse recovery by a reduction from a communication game. We show that the communication game can be won by sending a vector Ax and then performing function-sparse recovery. A lower bound on the communication complexity of the game implies a lower bound on the number of bits used to represent Ax if A is discretized. We can then use this to lower bound the number of measurements in A.

Since we are dealing in bits in the communication game and the entries of a sparse recovery matrix can be arbitrary reals, we will need to discretize each measurement. We first show that discretizing the measurement matrix by rounding does not change the resulting measurements too much, which allows our reduction to proceed.

Notation.

We use B_k(r) to denote the k-dimensional ℓ2 ball of radius r. Given a function f, we write f^{×k} for the function that maps a point (x_1, …, x_k) to (f(x_1), …, f(x_k)). For any function G, we use Im(G) to denote its image.

Matrix conditioning.

We first show that, without loss of generality, we may assume that the measurement matrix A is well-conditioned. In particular, we may assume that the rows of A are orthonormal.

We can multiply A on the left by any invertible matrix to get another measurement matrix with the same recovery characteristics. If we consider the singular value decomposition A = UΣV^⊤, where U and V are orthonormal and Σ is 0 off the diagonal, this means that we can eliminate U and make the entries of Σ be either 0 or 1. The result is a matrix consisting of orthonormal rows.
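
A minimal numpy sanity check of this reduction (the dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 20, 100                      # illustrative dimensions
A = rng.standard_normal((m, n))
U, S, Vt = np.linalg.svd(A, full_matrices=False)
# Multiplying A on the left by the invertible matrix diag(1/S) @ U.T gives
# Vt, a matrix with orthonormal rows that has the same recovery behavior.
A_cond = np.diag(1.0 / S) @ U.T @ A
assert np.allclose(A_cond, Vt)
assert np.allclose(A_cond @ A_cond.T, np.eye(m))
```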

Discretization.

For well-conditioned matrices A, we use the following lemma (similar to one from [DIP+10]) to show that we can discretize the entries without changing the behavior by much:

Lemma 3.1.

Let A ∈ R^{m×n} be a matrix with orthonormal rows. Let A′ be the result of rounding A to b bits per entry. Then for any x ∈ R^n there exists an s ∈ R^n with A′x = A(x + s) and ‖s‖ ≤ n2^{−b}‖x‖.

Proof.

Let E = A′ − A be the error when discretizing A to b bits, so each entry of E has magnitude less than 2^{−b}. Then for any x, we have ‖Ex‖ ≤ ‖E‖_F ‖x‖ ≤ √(mn) 2^{−b} ‖x‖ ≤ n2^{−b}‖x‖. Setting s = A^⊤Ex, we get A(x + s) = Ax + AA^⊤Ex = Ax + Ex = A′x, and ‖s‖ ≤ ‖Ex‖ ≤ n2^{−b}‖x‖, since the rows of A are orthonormal. ∎
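
A quick numerical illustration of Lemma 3.1 (the dimensions and the precision b are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, b = 20, 100, 16                 # b bits of precision (illustrative)
# Build a matrix with orthonormal rows.
Q, _ = np.linalg.qr(rng.standard_normal((n, m)))
A = Q.T                               # m x n, rows orthonormal: A @ A.T = I
Ap = np.round(A * 2.0**b) / 2.0**b    # round every entry to b fractional bits
x = rng.standard_normal(n)

# As in Lemma 3.1: with s = A^T (A' - A) x we have A' x = A (x + s),
# and ||s|| <= ||(A' - A) x|| <= sqrt(m n) 2^{-b} ||x||.
s = A.T @ ((Ap - A) @ x)
assert np.allclose(Ap @ x, A @ (x + s))
assert np.linalg.norm(s) <= np.sqrt(m * n) * 2.0**-b * np.linalg.norm(x)
```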

The Augmented Indexing problem.

As in [DIP+10], we use the Augmented Indexing communication game, which is defined as follows: there are two parties, Alice and Bob. Alice is given a string y ∈ {0,1}^d. Bob is given an index i ∈ [d], together with y_{i+1}, y_{i+2}, …, y_d. The parties also share an arbitrarily long common random string r. Alice sends a single message M(y, r) to Bob, who must output y_i with probability at least 2/3, where the probability is taken over r. We refer to this problem as Augmented Indexing. The communication cost of Augmented Indexing is the minimum, over all correct protocols, of the length of Alice's message on the worst-case choice of r and y.

The following theorem is well-known and follows from Lemma 13 of [MNS+98] (see, for example, an explicit proof in [DIP+10]).

Theorem 3.2.

The communication cost of Augmented Indexing is Ω(d).

A well-separated set of points.

We would like to prove Theorem 2.1, obtaining a large set of well-separated points in the image of a Lipschitz generative model. Before we do this, though, we prove the following simpler analog:

Lemma 3.3.

For any m, there is a set P of 2^m points in {±1}^t, with t = poly(m), such that each pair of distinct points in P is Θ(√t) apart in ℓ2 distance.

Proof.

Consider an ε-balanced linear code over the alphabet {±1} with message length m, i.e., a code in which every pair of distinct codewords has Hamming distance between (1/2 − ε) and (1/2 + ε) times the block length. It is known that such codes exist with block length t = poly(m/ε) [BT09]. Setting ε to be a small constant, we get a set of 2^m points in {±1}^t such that the pairwise Hamming distance is Θ(t), i.e. the pairwise ℓ2 distance is Θ(√t). ∎
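
As a quick numerical stand-in for the explicit codes of [BT09] (random sign vectors rather than an ε-balanced code), the following sketch shows pairwise Hamming distances concentrating near half the block length:

```python
import numpy as np

rng = np.random.default_rng(2)
t, N = 256, 256                     # block length and number of points (illustrative)
P = rng.choice([-1.0, 1.0], size=(N, t))

# For sign vectors, <u, v> = t - 2 * hamming(u, v).
ham = (t - P @ P.T) / (2.0 * t)     # normalized pairwise Hamming distances
off_diag = ham[~np.eye(N, dtype=bool)]
print("normalized Hamming distances in [%.3f, %.3f]"
      % (off_diag.min(), off_diag.max()))
# Two sign vectors differing in h coordinates are 2*sqrt(h) apart in l2,
# so all of these points are Theta(sqrt(t)) apart.
```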

Now we wish to extend this result to arbitrary k while achieving the parameters in Theorem 2.1.

Proof of Theorem 2.1.

We first define an L-Lipschitz map f that passes through a set of points that are pairwise far apart. Consider the set of points P from Lemma 3.3, scaled appropriately. Choose a subset T ⊆ P of the desired cardinality and let f be a piecewise linear function, defined on an interval of length r, that goes through all the points of T in order.

Let x_1 < x_2 < ⋯ be the points that are pre-images of elements of T. Observe that f is L-Lipschitz within this interval, since it maps each subinterval between consecutive x_j's to a segment whose length is at most L times the length of the subinterval.

Now, consider the function G = f^{×k}, which applies f to each of the k coordinates. Observe that G is also L-Lipschitz. Also, every grid point (a point each of whose coordinates is some x_j) is mapped to a point of bounded norm. However, there still exist distinct grid points (for instance, points that differ at exactly one coordinate) whose images under G are close together.

We construct a large subset of these grid points, such that any two of their images are far apart, using error correcting codes. Consider the largest q ≤ |T| such that q is a prime. For any integer s, there is a prime between s and 2s, so q ≥ |T|/2. Consider a Reed–Solomon code of block length k, message length k/2, distance k/2 + 1, and alphabet [q]. The existence of such a code implies that there is a subset of T^k of size at least (|T|/2)^{k/2} such that every pair of distinct elements from this set disagree in more than k/2 coordinates.

This translates into a large distance in 2-norm, since each disagreeing coordinate contributes at least the per-coordinate separation. So, if we set the scaling and the interval length appropriately, we have a set of points Y which are Ω(δ) apart in 2-norm and lie within the ball of radius O(Lr). ∎
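
The translation from Hamming distance to ℓ2 distance used here is the generic bound (with D the number of disagreeing coordinates and s the per-coordinate separation, symbols introduced only for this remark):

  ‖y − y′‖₂ ≥ s·√D   whenever y and y′ disagree in at least D coordinates, each by at least s.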

Lower bound.

We now prove the lower bound for function-sparse recovery.

Proof of Theorem 2.2.

An application of Theorem 2.1 gives us an L-Lipschitz function G and a set of points Y in its image such that log|Y| = Ω(k log(Lr/δ)), every point of Y has norm O(Lr), and any two distinct points of Y are Ω(δ) apart. Fix such a G and Y, and let d = t · log|Y| be the size of the Augmented Indexing instance we will solve, where t is a number of chunks to be chosen below.

We will show how to solve the Augmented Indexing problem on instances of size d with communication cost m times the number of bits used to describe each entry of the transmitted measurement vector. The theorem will then follow by Theorem 3.2.

Alice is given a string y ∈ {0,1}^d, and Bob is given an index i ∈ [d] together with y_{i+1}, …, y_d, as in the setup for Augmented Indexing.

Alice splits her string y into t contiguous chunks y^(1), …, y^(t), each containing log|Y| bits. She uses chunk y^(j) as an index into the set Y to choose a point x_j. Alice then defines x as a weighted sum of the x_j's, x = Σ_j ε^j x_j for a small constant ε, so that distinct chunks occupy distinct scales.

Alice and Bob use the common randomness to agree upon a random matrix A with orthonormal rows. Both Alice and Bob round A to form A′ with b bits per entry. Alice computes A′x and transmits it to Bob. Note that only A needs to be discretized; since Alice transmits A′x directly, the x_j's need not be discretized.

From Bob’s input i, he can compute the chunk index j for which the bit y_i occurs in chunk y^(j). Bob’s input also contains y_{i+1}, …, y_d, from which he can reconstruct every x_{j′} whose chunk he fully knows, and in particular he can compute the total contribution of those chunks to x. Using A′x and linearity, he can subtract this contribution and rescale, obtaining A′w for a vector w in which x_j is the dominant term and the chunks Bob does not know contribute only a small perturbation. From Lemma 3.1, there exists some s with ‖s‖ much smaller than δ such that A′w = A(w + s).

Ideally, Bob would perform recovery on the vector w + s and show that the correct point x_j is recovered. However, since s is correlated with A and w, Bob needs to use a slightly more complicated technique.

Bob first chooses another vector u uniformly at random from a small ball (of radius large compared to ‖s‖ but small compared to δ) and computes A′w + Au = A(w + s + u). He then runs the estimation algorithm on A and A(w + s + u), obtaining x̂. We have that u is independent of w and s, and that ‖s‖ is much smaller than the radius of the ball from which u is drawn. But then, as distributions over the randomness, the ranges of the random variables w + s + u and w + u overlap in all but a small fraction of their volumes. Therefore w + s + u and w + u have small statistical distance. The distribution of w + u is independent of A, so running the recovery algorithm on A(w + u) would work with the algorithm's stated constant probability. Hence, with probability at least a slightly smaller constant (for n large enough), x̂ satisfies the recovery criterion for w + u, meaning

  ‖x̂ − (w + u)‖ ≤ C · min_{z ∈ B_k(r)} ‖G(z) − (w + u)‖ + δ.

Now, by the triangle inequality, x̂ is close to x_j. Since the parameters were chosen so that this distance is strictly bounded by half the minimum distance of Y, and since the minimum distance in Y is Ω(δ), this means ‖x̂ − x_j‖ < ‖x̂ − y‖ for all other y ∈ Y. So Bob can correctly identify x_j, by rounding x̂ to the nearest element of Y, with constant probability. From x_j he can recover the chunk y^(j), and hence the bit y_i that occurs in it.

Hence, Bob solves Augmented Indexing with the required constant probability given the message A′x. Each entry of A′x takes a bounded number of bits to describe, because A′ is discretized to b bits per entry and ‖x‖ is bounded. Hence, the communication cost of this protocol is m times this per-entry bound. By Theorem 3.2, this product is Ω(d) = Ω(t log|Y|), which yields the lower bound on m claimed in Theorem 2.2. ∎

4 Reduction from k-sparse recovery

We show that the set of all k-sparse vectors in R^n is contained in the image of a 2-layer neural network. This shows that function-sparse recovery is a generalization of sparse recovery.

Lemma 4.1.

There exists a 2-layer neural network g: R^2 → R^n with width O(n) such that every 1-sparse vector in R^n lies in the range of g.

Our construction is intuitively very simple. For each output coordinate i, we define two gadgets f^i_+ and f^i_−. We have f^i_+(z) ≥ 0, and f^i_+(z) ≠ 0 iff the angle θ of the input z lies in an interval associated with coordinate i. Similarly, f^i_−(z) ≤ 0 and f^i_−(z) ≠ 0 iff θ lies in a second interval, disjoint from all the others. Then, we set the output node o_i = f^i_+ + f^i_−. Varying the distance of z from the origin will allow us to get the desired value at the output node o_i.

Proof.

Let z = (z_1, z_2) denote the input, written in polar coordinates as z = ρ(cos θ, sin θ). Let ReLU_+ denote the unbiased ReLU function that preserves positive values and ReLU_− the unbiased ReLU function that preserves negative values. We define f^i_+ as follows:

f^i_+ is a 2-layer neural network gadget that produces nonnegative values at its output node. Each of its hidden nodes applies ReLU_+ to a linear function of z of the form ρ sin(θ − φ) (that is, z_2 cos φ − z_1 sin φ) for an appropriate angle φ, and its output node applies ReLU_+ to a linear combination of the hidden nodes.

In a similar manner, we define f^i_−, which produces nonpositive values at its output node, with the internal nodes defined as in f^i_+; the last ReLU activation preserves only negative values. Since f^i_+ and f^i_− are identical up to signs in the second hidden layer, we only analyze the f^i_+'s.
Consider the hidden nodes h_1 and h_2 of f^i_+, and let α_i and β_i be the endpoints of the angular interval associated with coordinate i. Using the identity sin A + sin B = 2 sin((A + B)/2) cos((A − B)/2), one can compute the value of the second-layer node as a function of ρ and θ.

The node h_1 is positive only when θ ∈ (α_i, α_i + π). Similarly, h_2 is positive only when θ ∈ (β_i − π, β_i). So h_1 and h_2 are both non-zero when θ ∈ (α_i, β_i). Using some elementary trigonometry, we may see that the argument of the final ReLU is nonnegative exactly on this interval; in Fact A.1, we show a proof of the underlying identity. Observe that when θ ∉ [α_i, β_i], this term is negative and hence the output of f^i_+ is 0. So, we may conclude that f^i_+(z) ≠ 0 if and only if θ ∈ (α_i, β_i). Also, observe that the value of f^i_+ scales linearly with ρ. Similarly, f^i_− is non-zero if and only if θ lies in its own interval. Since all of these intervals are chosen to be disjoint, the intervals within which each of the gadgets f^1_±, …, f^n_± are non-zero do not intersect.

So, given a 1-sparse vector x = v·e_i with v ≠ 0: if v > 0, we set the input z so that its angle lies in the interval associated with f^i_+ and its distance from the origin is scaled so that f^i_+(z) = v; and if v < 0, we set z analogously inside the interval associated with f^i_−.

Observe that the output node o_i then takes the value v, and o_j = 0 for all j ≠ i.

So g is a 2-layer neural network with width O(n) such that every 1-sparse vector in R^n lies in its range. ∎

Proof of Theorem 2.3.

Given a vector x ∈ R^n that is non-zero at k coordinates, let i_1, …, i_k be the indices at which x is non-zero. We may use k copies of the network g from Lemma 4.1 to generate 1-sparse vectors x_1, …, x_k such that x_j agrees with x on coordinate i_j. Then, we add these vectors to obtain x. It is clear that we only used k copies of g to create x. So, the resulting network G: R^{2k} → R^n has 2 layers and width O(kn), and its range contains all k-sparse vectors. ∎

Theorem 2.3 provides a reduction which uses only 2 layers. Then, using the algorithm from [BJP+17] that achieves (3), we can recover the correct k-sparse vector using O(kd log w) measurements. Since d = O(1) and w = O(kn), this requires only O(k log n) linear measurements to perform k-sparse recovery.

References

  • [BT09] A. Ben-Aroya and A. Ta-Shma (2009) Constructing small-bias sets from algebraic-geometric codes. In 2009 50th Annual IEEE Symposium on Foundations of Computer Science, pp. 191–197.
  • [BJP+17] A. Bora, A. Jalal, E. Price, and A. G. Dimakis (2017) Compressed sensing using generative models. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, pp. 537–546.
  • [CRT06] E. J. Candès, J. Romberg, and T. Tao (2006) Stable signal recovery from incomplete and inaccurate measurements. Comm. Pure Appl. Math. 59 (8), pp. 1208–1223.
  • [DIP+10] K. Do Ba, P. Indyk, E. Price, and D. Woodruff (2010) Lower bounds for sparse recovery. In SODA.
  • [GPM+14] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
  • [KW14] D. P. Kingma and M. Welling (2014) Auto-encoding variational Bayes. In 2nd International Conference on Learning Representations (ICLR 2014).
  • [LS19] Z. Liu and J. Scarlett (2019) Information-theoretic lower bounds for compressive sensing with generative models. arXiv:1908.10744.
  • [MNS+98] P. B. Miltersen, N. Nisan, S. Safra, and A. Wigderson (1998) On data structures and asymmetric communication complexity. J. Comput. Syst. Sci. 57 (1), pp. 37–49.

Appendix A Trigonometric identity

Fact A.1.
Proof.

where we use the identity that sin A + sin B = 2 sin((A + B)/2) cos((A − B)/2). ∎