Constant-Expansion Suffices for Compressed Sensing with Generative Priors

06/07/2020 ∙ by Constantinos Daskalakis, et al. ∙ MIT

Generative neural networks have empirically proven very promising as structural priors for compressed sensing, since they can be trained to span low-dimensional data manifolds in high-dimensional signal spaces. Despite the non-convexity of the resulting optimization problem, it has also been shown theoretically that, for neural networks with random Gaussian weights, a signal in the range of the network can be efficiently, approximately recovered from a few noisy measurements. However, a major bottleneck of these theoretical guarantees is a network expansivity condition: that each layer of the neural network must be larger than the previous one by a logarithmic factor. Our main contribution is to break this strong expansivity assumption, showing that constant expansivity suffices for efficient recovery algorithms, and that constant expansivity is also information-theoretically necessary. To overcome the theoretical bottleneck in existing approaches, we prove a novel uniform concentration theorem for random functions that might not be Lipschitz but satisfy a relaxed notion which we call "pseudo-Lipschitzness." Using this theorem, we show that a matrix concentration inequality known as the Weight Distribution Condition (WDC), which was previously only known to hold for Gaussian matrices with logarithmic aspect ratio, in fact holds for constant aspect ratios too. Since the WDC is a fundamental matrix concentration inequality at the heart of all existing theoretical guarantees on this problem, our tighter bound immediately yields improvements in all known results in the literature on compressed sensing with deep generative priors, including one-bit recovery, phase retrieval, and more.

1 Introduction

Compressed sensing is the study of recovering a high-dimensional signal from as few measurements as possible, under some structural assumption about the signal that pins it into a low-dimensional subset of the signal space. The assumption that has driven the most research is sparsity; it is well known that a $k$-sparse signal in $\mathbb{R}^n$ can be efficiently recovered from only $O(k \log(n/k))$ linear measurements [4]. Numerous variants of this problem have been studied, e.g. tolerating measurement noise, recovering signals that are only approximately sparse, and recovering signals from phaseless or one-bit measurements, to name just a few [1].
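To make the sparsity baseline concrete, here is a minimal numerical sketch (ours, not from the paper) that recovers a synthetic $k$-sparse signal from Gaussian measurements via iterative soft-thresholding, a standard proxy for $\ell_1$ minimization; all dimensions, the regularization weight, and the iteration count are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 1000, 200, 10          # ambient dimension, measurements, sparsity (illustrative)

x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)

A = rng.standard_normal((m, n)) / np.sqrt(m)   # i.i.d. Gaussian measurement matrix
y = A @ x_true                                  # noiseless linear measurements

def ista(A, y, lam=0.01, iters=2000):
    """Iterative soft-thresholding for min_x 0.5*||Ax - y||^2 + lam*||x||_1."""
    x = np.zeros(A.shape[1])
    step = 1.0 / np.linalg.norm(A, 2) ** 2      # 1/L, where L = ||A||^2 bounds the gradient's Lipschitz constant
    for _ in range(iters):
        grad = A.T @ (A @ x - y)
        z = x - step * grad
        x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)   # soft-thresholding step
    return x

x_hat = ista(A, y)
print("relative recovery error:", np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))
```

With $m$ comfortably above $k\log(n/k)$, the reconstruction error is small; shrinking $m$ toward $k$ makes recovery fail, which is the regime where structural priors matter.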

However, in many applications sparsity in some basis may not be the most natural structural assumption to make about the signal to be reconstructed. Given recent strides in the performance of generative neural networks [6, 5, 11, 3, 12], there is strong evidence that data from some domain $\mathcal{D}$, e.g. faces, can be used to identify a deep neural network $G : \mathbb{R}^k \to \mathbb{R}^n$, with $k \ll n$, whose range over varying “latent codes” $x \in \mathbb{R}^k$ covers the objects of $\mathcal{D}$ well. Thus, if we want to perform compressed sensing on signals from this domain, the machine learning paradigm suggests that a reasonable structural assumption is that the signal lies in the range of $G$, leading to the following problem, first proposed in [2]:

Compressed Sensing with a Deep Generative Prior (CS-DGP)
Given: Deep neural network $G : \mathbb{R}^k \to \mathbb{R}^n$; measurement matrix $A \in \mathbb{R}^{m \times n}$, where $m \ll n$.
Given: $y = A G(x^*) + e$, for some unknown latent vector $x^* \in \mathbb{R}^k$ and noise vector $e \in \mathbb{R}^m$.
Goal: Recover $x^*$ (or, in a different variant of the problem, just $G(x^*)$).

It has been shown empirically that this problem (and some variants of it) can be solved efficiently [2]. It has also been shown empirically that, in the regime of few measurements, the quality of the reconstructed signals can greatly exceed that of signals reconstructed under a sparsity assumption. It has even been shown that the network need not be trained on data from the domain of interest: a convolutional neural network with random weights may suffice to regularize the reconstruction well [18, 19].

Despite the non-convexity of the optimization problem, some theoretical guarantees have also emerged [8, 10] when $G$ is a fully-connected ReLU neural network of the following form (where $d$ is the depth):

$G(x) = \mathrm{ReLU}(W_d \cdots \mathrm{ReLU}(W_2\, \mathrm{ReLU}(W_1 x)) \cdots),$

(1)

where each $W_i$ is a matrix of dimension $n_i \times n_{i-1}$, with $n_0 = k$ and $n_d = n$. These theoretical guarantees mirror well-known results in sparsity-based compressed sensing, where efficient recovery is possible if the measurement matrix satisfies a certain deterministic condition, e.g. the Restricted Isometry Property. But for arbitrary $G$, recovery is in general intractable [13], so some assumption about $G$ must also be made. Specifically, it has been shown in prior work that, if the measurement matrix $A$ satisfies a certain Range Restricted Isometry Condition (RRIC) with respect to $G$, and each weight matrix $W_i$ satisfies a Weight Distribution Condition (WDC), then $x^*$ can be efficiently recovered, up to an error roughly proportional to the noise level, from a number of measurements that scales linearly in $kd$ up to logarithmic factors [8, 10]. See Section 3 for a definition of the WDC, and Appendix 6 for a definition of the RRIC.

But it is critical to understand when these conditions are satisfied (for example, in the sparsity setting, the Restricted Isometry Property is satisfied by i.i.d. Gaussian matrices when $m = \Omega(k \log(n/k))$). Similarly, the RRIC has been shown to hold when $A$ is i.i.d. Gaussian and the number of measurements $m$ scales, up to logarithmic factors, linearly in $kd$; this is an essentially optimal measurement complexity if the depth $d$ is constant. However, until this work, the WDC has seemed more onerous. Under the assumption that each $W_i$ has i.i.d. Gaussian entries, the WDC was previously only known to hold when $n_i = \Omega(n_{i-1} \log n_{i-1})$, i.e. when every layer of the neural network is larger than the previous one by a logarithmic factor. This expansivity condition is a major limitation of the prior theory, since in practice neural networks do not expand at every layer.

Our work alleviates this limitation, settling a problem left open in [8, 10] and recently also posed in the survey [15]. We show that the WDC holds when $n_i = \Omega(n_{i-1})$, i.e. under constant expansion. This proves the following result, where our contribution is to replace the requirement $n_i = \Omega(n_{i-1} \log n_{i-1})$ with $n_i = \Omega(n_{i-1})$.

Theorem 1.1.

Suppose that each weight matrix $W_i \in \mathbb{R}^{n_i \times n_{i-1}}$ has expansion $n_i \geq C\, n_{i-1}$, and the number of measurements is $m \geq C\, k d \log(n_1 \cdots n_d)$, where $C$ is a constant depending only on the accuracy parameter $\epsilon$. Suppose that $A$ has i.i.d. Gaussian entries $\mathcal{N}(0, 1/m)$ and each $W_i$ has i.i.d. Gaussian entries $\mathcal{N}(0, 1/n_i)$. Then there is an efficient gradient-descent based algorithm which, given $A$, $G$, and $y = A G(x^*) + e$, outputs, with high probability, an estimate $\hat{x}$ satisfying an error bound, in terms of $\epsilon$ and the noise level $\|e\|$, matching that of [10], when $\epsilon$ is sufficiently small.

We note that the dependence on $\epsilon$ of the expansivity, number of measurements, and error in our theorem is the same as in [10]. Moreover, the techniques developed in this paper yield several generalizations of the above theorem, stated informally below and discussed further in Appendix D.

Theorem 1.2.

Suppose $G$ is a random neural network with constant expansion, and conditions analogous to those of Theorem 1.1 are satisfied. Then the following results also hold with high probability. The Gaussian-noise setting admits an efficient recovery algorithm whose error matches that of the corresponding prior work. Phase retrieval and one-bit recovery with a generative prior have no spurious local minima. And compressed sensing with a two-layer deconvolutional prior has no spurious local minima.

To see why expansivity plays a role in the first place, we provide some context:

Global landscape analysis.

The theoretical guarantees of [8, 10] fall under an emerging method for analyzing non-convex optimization problems called global landscape analysis [17]. Given a non-convex objective function $f$, the basic goal is to show that $f$ has no spurious local minima, implying that gradient descent will (eventually) converge to a global minimum. Stronger guarantees may provide bounds on the convergence rate. In less well-behaved settings, the goal may be to prove guarantees in a restricted region, or to show that the local minima inform the global minimum.

In stochastic optimization settings wherein $f$ is a random function, global landscape analysis typically consists of two steps: first, prove that the expected objective $\mathbb{E}f$ is well-behaved, and second, apply concentration results to prove that, with high probability, $f$ is sufficiently close to $\mathbb{E}f$ that no pathological behavior can arise. The analysis of compressed sensing with generative priors by [8] follows this two-step outline (see Appendix C for a sketch). The second step requires inducting on the layers of the network. For each layer, it is necessary to prove uniform concentration of a function which takes a weight matrix as argument; this concentration is precisely the content of the WDC. As a general rule, tall random matrices concentrate more strongly, which is why proving the WDC for random matrices requires an expansivity condition (a lower bound on the aspect ratio of each weight matrix).

Concentration of random functions.

The uniform concentration required for the above analysis can be abstracted as follows. Given a family $\mathcal{F}$ of functions on some metric space $X$, and given a distribution $\mathcal{D}$ over $\mathcal{F}$, we pick a random function $f \sim \mathcal{D}$. We seek to show that, with high probability, $f$ is uniformly near $\mathbb{E}f$. A generic approach to this problem is via Lipschitz bounds: if every $f \in \mathcal{F}$ is sufficiently Lipschitz, and $f(x)$ is near $\mathbb{E}f(x)$ with high probability for any single $x \in X$, then by union bounding over an $\epsilon$-net of $X$, uniform concentration follows.

However, in the global landscape analysis conducted by [8], the relevant functions have poor Lipschitz constants. Pushing the $\epsilon$-net argument through therefore requires very strong pointwise concentration, and this is what necessitates the expansivity condition.
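For intuition, here is a minimal sketch (ours, in a toy setting) of the generic Lipschitz $\epsilon$-net strategy described above: the random function $x \mapsto \|Wx\|^2/n$ on the unit circle, with $W$ Gaussian. The dimensions and net granularity are illustrative; the point is only that pointwise concentration plus a Lipschitz bound controls the supremum over the whole domain.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 2000, 2                       # tall Gaussian matrix; k = 2 so the domain is the unit circle
W = rng.standard_normal((n, k))
S = W.T @ W / n                      # f(x) = ||Wx||^2 / n = x^T S x; its mean is ||x||^2 = 1 on the circle

# Epsilon-net of the unit circle: O(1/eps) points suffice in two dimensions.
eps = 0.01
net_angles = np.arange(0.0, 2 * np.pi, eps)
net = np.stack([np.cos(net_angles), np.sin(net_angles)], axis=1)
net_dev = np.abs(np.einsum('ij,jk,ik->i', net, S, net) - 1.0).max()

# Approximate the true supremum of the deviation over a much finer grid of the circle.
fine_angles = np.arange(0.0, 2 * np.pi, eps / 50)
fine = np.stack([np.cos(fine_angles), np.sin(fine_angles)], axis=1)
sup_dev = np.abs(np.einsum('ij,jk,ik->i', fine, S, fine) - 1.0).max()

L = 2 * np.linalg.norm(S, 2)         # a Lipschitz constant of f on the unit disk
print(f"deviation on the net:     {net_dev:.4f}")
print(f"deviation on a fine grid: {sup_dev:.4f}  <=  net + L*eps = {net_dev + L * eps:.4f}")
```

When the Lipschitz constant $L$ is mild, as here, a coarse net suffices; the difficulty described in the paper is that the WDC functions force either an extremely fine net or extremely strong pointwise concentration.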

1.1 Technical contributions

Concentration of Lipschitz random functions is a widely-used tool in probability theory, which has found many applications in global landscape analysis for the purposes of understanding non-convex optimization, as outlined above. For the functions arising in our analysis, however, Lipschitzness is actually too strong a property, and leads to suboptimal results. A main technical contribution of our paper is to define a relaxed notion of pseudo-Lipschitz functions and derive a concentration inequality for pseudo-Lipschitz random functions. This serves as a central tool in our analysis, and is a general tool that we envision will find other applications in probability and non-convex optimization.

Informally, a function family is pseudo-Lipschitz if for every parameter there is a pseudo-ball such that the corresponding function has small deviations when its argument is varied by a vector within the pseudo-ball. If for every parameter the pseudo-ball is simply a ball, then the function family is plainly Lipschitz. But this definition is more flexible: the pseudo-ball can be any small-diameter convex body with non-negligible volume and, importantly, every parameter could have a different pseudo-ball of small deviations. We show that uniform concentration still holds; here is a simplified (and slightly specialized) statement of our result (presented in full detail in Section 4 along with a formal definition of pseudo-Lipschitzness):

Theorem 1.3 (Informal: concentration of pseudo-Lipschitz random functions).

Let $W$ be a random variable taking values in $\mathbb{R}^{n \times k}$. Let $\{f_x\}_{x \in S}$ be a function family, where $S$ is a subset of $B(0,1)$, the unit ball in $\mathbb{R}^k$. Suppose that:

  1. For any fixed $x \in S$, the random variable $f_x(W)$ is well-concentrated around its mean,

  2. $\{f_x\}_{x \in S}$ is pseudo-Lipschitz,

  3. $f_x(W)$ is Lipschitz in $x$.

Then $f_x(W)$ is well-concentrated around its mean, uniformly in $x \in S$. Quantitatively, the strength of the concentration required in (1) only needs to scale with the inverse volume of the pseudo-balls of small deviation guaranteed by (2), i.e. with the number of pseudo-balls needed to cover $S$.

(a) Spherical $\epsilon$-net: for all parameters, the function deviates by at most $\epsilon$ within each ball.
(b) Aspherical $\epsilon$-net: for one specific parameter, the function deviates by at most $\epsilon$ within each weirdly-shaped pseudo-ball.
(c) Aspherical $\epsilon$-net: for a different parameter, the function deviates by at most $\epsilon$ within each weirdly-shaped pseudo-ball.
Figure 1: A parameter-independent spherical $\epsilon$-net, versus a parameter-dependent aspherical $\epsilon$-net for two specific parameters; in particular, the parameter determines the rotational angle of the weirdly-shaped pseudo-balls within which the function changes by at most $\epsilon$. The shaded square is the metric space. Each ball in the leftmost panel is the intersection of the weirdly-shaped pseudo-balls with the same center, under all possible rotations. As such, the radius of each ball is small, so covering the space requires a large number of small balls. The aspherical, parameter-dependent $\epsilon$-nets are more efficient.

This result achieves asymptotically stronger bounds than are possible through Lipschitz concentration. Where does the gain come from? For each parameter, consider the pseudo-ball of small deviations of the corresponding function, as guaranteed by pseudo-Lipschitzness. A standard $\epsilon$-net would cover the metric space by balls (as exemplified in Figure 1(a)), each of which would have to lie in the intersection of the pseudo-balls of all parameters. If for each parameter the pseudo-ball is “wide” in a different direction (see Figures 1(b) and 1(c) for a schematic), then the balls of the standard $\epsilon$-net may be very small compared to any pseudo-ball, and the size of the standard $\epsilon$-net could be very large compared to the size of the net obtained for any fixed parameter using pseudo-balls. Hence, the standard Lipschitzness argument may require much stronger concentration in (1) than our result does, in order to union bound over a potentially much larger net.
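The following back-of-the-envelope computation (ours, not the paper's) illustrates the volume gain. In $\mathbb{R}^k$, covering the unit ball by Euclidean balls of radius $\epsilon$ needs on the order of $(1/\epsilon)^k$ pieces, whereas covering it by translates of a convex "slab" that is $\epsilon$-thin in only one direction and unit-width in the others needs only on the order of $1/\epsilon$ pieces, because the slab occupies a non-negligible fraction of the unit ball's volume. Volume ratios are used here as a proxy for covering numbers (they agree up to dimension-dependent constant factors).

```python
import math

def ball_volume(k, r=1.0):
    """Volume of the Euclidean ball of radius r in R^k."""
    return math.pi ** (k / 2) / math.gamma(k / 2 + 1) * r ** k

eps = 0.01
for k in [5, 10, 20, 50]:
    v_ball = ball_volume(k, eps)                 # tiny Euclidean ball of radius eps
    # "Slab" pseudo-ball: eps-thin in one direction, cross-section a unit (k-1)-ball.
    v_slab = 2 * eps * ball_volume(k - 1, 1.0)
    v_domain = ball_volume(k, 1.0)
    print(f"k={k:2d}  eps-balls needed ~ {v_domain / v_ball:.2e}   "
          f"slab pseudo-balls needed ~ {v_domain / v_slab:.2e}")
```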

There is an obvious technical obstacle to our proof: the pseudo-balls depend on the parameter, so an efficient covering of the space by pseudo-balls will necessarily depend on the parameter. It is then unclear how to union bound over the centers of the pseudo-balls (as in the standard Lipschitz concentration argument). We resolve the issue with a decoupling argument. Thus, we ultimately show that under mild conditions, a pseudo-Lipschitz random function is asymptotically as well-concentrated as a Lipschitz random function, even though its Lipschitz constant may be much worse.

Applications.

With our new technique, we are able to show that the WDC holds for Gaussian matrices $W \in \mathbb{R}^{n \times k}$ whenever $n \geq Ck$, for a constant $C$ depending only on the WDC parameter, where previously it was only known to hold if $n = \Omega(k \log k)$. As a consequence, we obtain Theorem 1.1: compressed sensing with a random neural-network prior does not require the logarithmic expansivity condition.

In addition, there has been follow-up research on variations of the CS-DGP problem described above. The WDC is a critical assumption which enables efficient recovery in the setting of Gaussian noise [9], as well as global landscape analysis in the settings of phaseless measurements [7], one-bit (sign) measurements [16], and two-layer convolutional neural networks [14]. Moreover, there are currently no known theoretical results in this area—compressed sensing with generative priors—that avoid the WDC: hence, until now, logarithmic expansion was necessary to achieve any provable guarantees. Our result extends the prior work on these problems, in a black-box fashion, to random neural networks with constant expansion. We refer to Appendix D for details about these extensions.

Lower bound.

As a complementary contribution, we also provide a simple lower bound on the expansion required to recover the latent vector. This lower bound is strong in several senses: it applies to one-layer neural networks, even in the absence of compression and noise, and it is information-theoretic. Without compression and noise, the problem is simply that of inverting a neural network, and it has been shown [13] that inversion is computationally tractable if the network consists of Gaussian matrices with a sufficiently large constant expansion. In this setting our lower bound is tight: we show that expansion by a constant factor is in fact necessary for exact recovery. Details are deferred to Appendix A.
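To see, in a toy case, why some expansion is needed even for noiseless, uncompressed recovery, the sketch below takes a square Gaussian weight matrix (no expansion) and constructs a second latent vector $x' \neq x$ with $\mathrm{ReLU}(Wx') = \mathrm{ReLU}(Wx)$, so the layer is not invertible. This construction is our own illustration, not the paper's lower-bound argument.

```python
import numpy as np

rng = np.random.default_rng(3)
k = 20
W = rng.standard_normal((k, k))          # square layer: no expansion
x = rng.standard_normal(k)
y = np.maximum(W @ x, 0.0)               # observed output of the ReLU layer

pos = W @ x > 0                           # "on" rows: they impose equality constraints on any preimage
neg = ~pos                                # "off" rows: only the inequality (W x')_i <= 0 matters

# Move within the null space of the active rows; a small enough step keeps the
# inactive rows strictly negative, so the ReLU output is unchanged.
_, _, Vt = np.linalg.svd(W[pos], full_matrices=True)
null_dir = Vt[pos.sum():].T @ rng.standard_normal(k - pos.sum())   # direction with W[pos] @ d = 0
slack = -(W[neg] @ x)                      # positive slacks of the inactive constraints
t = 0.5 * slack.min() / (np.abs(W[neg] @ null_dir).max() + 1e-12)
x_alt = x + t * null_dir

print("latents differ by: ", np.linalg.norm(x_alt - x))
print("outputs differ by: ", np.linalg.norm(np.maximum(W @ x_alt, 0.0) - y))
```

Because roughly half of the rows are inactive, the equality constraints pin down only about half of the latent dimensions, leaving a whole polytope of valid preimages; expansion is what removes this ambiguity.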

1.2 Roadmap

In Section 2, we introduce basic notation. In Section 3 we formally introduce the Weight Distribution Condition, and present our main theorem about the Weight Distribution Condition for random matrices. In Section 4 we define pseudo-Lipschitz function families, allowing us to formalize and prove Theorem 1.3. In Section 5, we show how uniform concentration of pseudo-Lipschitz random functions implies that Gaussian random matrices with constant expansion satisfy the WDC. Finally, in Section 6 we prove Theorem 1.1.

2 Preliminaries

For any vector $v$, let $\|v\|$ refer to the $\ell_2$ norm of $v$, and for any matrix $M$ let $\|M\|$ refer to the operator norm of $M$. If $M$ is symmetric, then $\|M\|$ is also equal to the largest absolute value of an eigenvalue of $M$.

Let $B(0, 1)$ refer to the unit ball in $\mathbb{R}^k$ centered at the origin, and let $S^{k-1}$ refer to the corresponding unit sphere. For a set $T \subseteq \mathbb{R}^k$, a scalar $c$, and a vector $x$, let $cT = \{ct : t \in T\}$ and let $T + x = \{t + x : t \in T\}$.

For a matrix $W \in \mathbb{R}^{n \times k}$ with rows $w_1, \ldots, w_n$ and a vector $x \in \mathbb{R}^k$, let $W_{+,x}$ be the matrix obtained from $W$ by zeroing out the rows that do not activate on $x$. That is, row $i$ of $W_{+,x}$ is equal to $w_i$ if $w_i \cdot x > 0$, and is equal to $0$ otherwise.

3 Weight Distribution Condition

In the existing compressed sensing literature, many results on the recovery of a sparse signal are based on an assumption on the measurement matrix called the Restricted Isometry Property (RIP). Many results then follow the same paradigm: they first prove that sparse recovery is possible under the RIP, and then show that a random matrix drawn from some specific distribution or class of distributions satisfies the RIP. The same paradigm has been followed in the literature on signal recovery under a deep generative prior. In virtually all of these results, the properties that correspond to the RIP are the combination of the Range Restricted Isometry Condition (RRIC) and the Weight Distribution Condition (WDC). Our main focus in this paper is to improve upon the existing results related to the WDC. The WDC has the following definition, due to [8].

Definition 3.1.

A matrix $W \in \mathbb{R}^{n \times k}$ is said to satisfy the (normalized) Weight Distribution Condition (WDC) with parameter $\epsilon$ if for all nonzero $x, y \in \mathbb{R}^k$ it holds that $\left\| \tfrac{1}{n} W_{+,x}^T W_{+,y} - Q_{x,y} \right\| \leq \epsilon$, where $Q_{x,y} = \mathbb{E}\left[ \tfrac{1}{n} W_{+,x}^T W_{+,y} \right]$ (with the expectation over i.i.d. $\mathcal{N}(0,1)$ entries of $W$).

Remark.

Note that the normalization factor $\tfrac{1}{n}$ in front of the sum is not present in the actual condition of [8], hence the term “normalized". Correspondingly, the entries of $W$ are scaled up to unit variance, which simplifies later notation.
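As a sanity check of the (reconstructed) normalized condition above, the sketch below draws a Gaussian $W$ with unit-variance entries, forms the empirical matrix $\tfrac{1}{n} W_{+,x}^T W_{+,y}$ for a fixed pair $x, y$, estimates its expectation by averaging over many independent copies of $W$, and reports the typical spectral deviation for a few aspect ratios $n/k$. The sizes are illustrative and this is our own check, not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(4)

def empirical_Q(W, x, y):
    """(1/n) * W_{+,x}^T W_{+,y}, where W_{+,x} zeroes the rows of W not activated by x."""
    n = W.shape[0]
    Wx = W * (W @ x > 0)[:, None]
    Wy = W * (W @ y > 0)[:, None]
    return Wx.T @ Wy / n

k = 10
x = rng.standard_normal(k); x /= np.linalg.norm(x)
y = rng.standard_normal(k); y /= np.linalg.norm(y)

for ratio in [2, 8, 32]:
    n = ratio * k
    # Monte Carlo estimate of Q_{x,y} over fresh Gaussian matrices.
    Q_mc = np.mean([empirical_Q(rng.standard_normal((n, k)), x, y) for _ in range(2000)], axis=0)
    # Operator-norm deviation of a single draw from the (estimated) expectation.
    devs = [np.linalg.norm(empirical_Q(rng.standard_normal((n, k)), x, y) - Q_mc, 2)
            for _ in range(50)]
    print(f"n/k = {ratio:3d}: typical ||Q_hat - Q|| ~ {np.mean(devs):.3f}")
```

The deviation shrinks as the aspect ratio $n/k$ grows, which is exactly the concentration phenomenon the WDC quantifies; the question addressed by this paper is how small an aspect ratio still suffices.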

In Appendix C, we provide a detailed explanation of why the WDC arises and we also give a sketch of the basic theory of global landscape analysis for compressed sensing with generative priors.

3.1 Weight Distribution Condition from constant expansion

To prove Theorem 1.1, our strategy is to prove that the WDC holds for Gaussian random matrices with constant aspect ratio:

Theorem 3.2.

There is a constant $C_\epsilon$, depending only on $\epsilon$, with the following property. Let $\epsilon > 0$ be sufficiently small and let $n, k \in \mathbb{N}$. Suppose that $n \geq C_\epsilon k$. If $W \in \mathbb{R}^{n \times k}$ is a matrix with independent entries drawn from $\mathcal{N}(0, 1)$, then $W$ satisfies the normalized WDC with parameter $\epsilon$, with probability exponentially close to $1$.

Equivalently, if $W$ has entries i.i.d. $\mathcal{N}(0, 1/n)$, then it satisfies the unnormalized WDC with high probability. The proof of Theorem 3.2 is provided in Section 5. It uses concentration of pseudo-Lipschitz functions, which are introduced formally in Section 4. As shown in Section 6, Theorem 1.1 then follows from prior work.

4 Uniform concentration beyond Lipschitzness

In this section we present our main technical result about uniform concentration bounds. We generalize a folklore result about uniform concentration of Lipschitz functions by relaxing Lipschitzness to a weaker condition which we call pseudo-Lipschitzness. This concentration result can be used to prove Theorem 3.2. Moreover, we believe that it may have broader applications.

Before stating our result, let us first define the notion of pseudo-Lipschitzness for function families and compare it with the classical notion of Lipschitzness. Let $\{f_x\}_{x \in S}$ be a family of functions over matrices, parametrized by $x \in S$. We have the following definitions:

Definition 4.1.

A set system $\{T_x\}_{x \in S}$ is wide (for a given diameter bound and volume fraction) if each $T_x$ has diameter at most the given bound, is convex and symmetric, and has volume at least the given fraction of a ball of that diameter.

Definition 4.2 (Pseudo-Lipschitzness).

Suppose that there exists a wide set system $\{T_x\}_{x \in S}$ such that, for every $x \in S$, the value of $f_x$ changes by at most a small deviation whenever its argument is perturbed by a vector lying in $T_x$. Then we say that $\{f_x\}_{x \in S}$ is pseudo-Lipschitz (with the corresponding deviation and wideness parameters).

Example 4.3.

Consider a family of functions of the kind arising in the WDC analysis, involving indicators of half-spaces determined by the parameter. Such a family has a very poor Lipschitz constant: to guarantee small deviations under arbitrary perturbations of the argument, those perturbations must be taken extremely small.

On the other hand, it can be seen that the set system defined by the corresponding cap-like regions (perturbations that rarely flip the indicators) is wide with a constant volume fraction, by standard arguments about spherical caps. Therefore the family is pseudo-Lipschitz with much better parameters.
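The following Monte Carlo sketch (our own illustration) is in the spirit of the spherical-cap estimates invoked above: in $\mathbb{R}^k$, the slab $\{v \in B(0,1) : |\langle v, x \rangle| \le 1/\sqrt{k}\}$ is thin in the direction of $x$ yet occupies a constant fraction of the unit ball's volume, whereas a Euclidean ball of radius $1/\sqrt{k}$ occupies a vanishing fraction.

```python
import numpy as np

rng = np.random.default_rng(5)

def uniform_in_ball(num, k):
    """Uniform samples from the unit ball in R^k (Gaussian direction, radius ~ U^(1/k))."""
    g = rng.standard_normal((num, k))
    g /= np.linalg.norm(g, axis=1, keepdims=True)
    r = rng.random(num) ** (1.0 / k)
    return g * r[:, None]

for k in [5, 20, 80]:
    x = np.zeros(k); x[0] = 1.0                  # slab direction (any unit vector works)
    pts = uniform_in_ball(200_000, k)
    slab_frac = np.mean(np.abs(pts @ x) <= 1.0 / np.sqrt(k))
    ball_frac = (1.0 / np.sqrt(k)) ** k          # volume fraction of a radius-1/sqrt(k) ball
    print(f"k={k:3d}: slab fraction ~ {slab_frac:.3f}, small-ball fraction = {ball_frac:.1e}")
```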

Our main technical result is that the above relaxation of Lipschitzness suffices to obtain strong uniform concentration of measure results; see Section 4.1 for the proof.

Theorem 4.4.

Let $W$ be a random variable taking values in the space of $n \times k$ matrices. Let $\{f_x\}_{x \in S}$ be a function family, and let $g$ be a reference function on $S$. Fix a spherical shell (the set of vectors whose norms lie in a prescribed range). Suppose that:

  1. For any fixed $x$, $f_x(W)$ concentrates around $g(x)$,

  2. $\{f_x\}_{x \in S}$ is pseudo-Lipschitz,

  3. the pointwise bound in (1) is stable under the small perturbations permitted by the pseudo-balls and the shell.

Then, with high probability, $f_x(W)$ is close to $g(x)$ uniformly over all $x \in S$.

As a comparison, if the family were Lipschitz with a sufficiently good constant, then uniform concentration would hold by standard arguments. The content of the theorem is that a pseudo-Lipschitz family enjoys essentially the same “effective Lipschitz constant,” with the strength of the pointwise concentration required in (1) governed by the volume of the pseudo-balls.

4.1 Proof of Theorem 4.4

Let $\{T_x\}_{x \in S}$ be a family of sets witnessing that $\{f_x\}_{x \in S}$ is pseudo-Lipschitz; i.e. $\{T_x\}$ is wide, and $f_x$ has small deviations whenever its argument is perturbed by a vector in $T_x$. The standard proof technique for uniform concentration is to fix an $\epsilon$-net, and to show that, with high probability over the randomness of $W$, every point in the net satisfies the bound. Here, instead, we use the pseudo-balls to construct a random, aspherical net that depends on the set system $\{T_x\}$ and on additional, independent randomness. We will show that, with high probability over all of the randomness, every point in the net satisfies the bound. In particular, we use the following process to construct the net:

We choose the net points iteratively. At each step, consider the portion of the domain that is not yet covered by the pseudo-balls placed so far; if it is empty, terminate the process. Otherwise, by some deterministic rule, pick the next center from the uncovered portion and place a perturbed copy of it, obtained by adding a uniform sample from (a bounded portion of) its pseudo-ball.

Say the process terminates after finitely many steps. The chosen centers and their perturbed copies are random variables (with randomness introduced by the perturbations), and so is the number of steps.

The perturbed copies form the net. Observe that, at termination, the domain is covered by the pseudo-balls that have been placed.
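The following toy 2-D sketch (our own, with axis-aligned rectangles standing in for the pseudo-balls and a discrete grid standing in for the domain) mimics this construction: greedily pick a center from the uncovered portion, cover its pseudo-ball, and record a perturbed copy drawn uniformly from half of that pseudo-ball.

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy domain: a grid over the unit square stands in for the set S.
grid = np.stack(np.meshgrid(np.linspace(0, 1, 200), np.linspace(0, 1, 200)), axis=-1).reshape(-1, 2)

# Toy pseudo-ball: an axis-aligned rectangle, wide in one direction and thin in the other.
half_widths = np.array([0.25, 0.02])

def in_pseudo_ball(points, center):
    return np.all(np.abs(points - center) <= half_widths, axis=1)

uncovered = np.ones(len(grid), dtype=bool)
centers, perturbed = [], []
while uncovered.any():
    c = grid[np.argmax(uncovered)]               # deterministic rule: first uncovered grid point
    centers.append(c)
    uncovered &= ~in_pseudo_ball(grid, c)        # its pseudo-ball is now covered
    # Record a perturbed copy: uniform sample from half of the pseudo-ball around the center.
    perturbed.append(c + (rng.random(2) * 2 - 1) * half_widths / 2)

print("net size with rectangular pseudo-balls:", len(centers))
print("for comparison, covering by squares of side", 2 * half_widths.min(),
      "would need about", int(1.0 / (2 * half_widths.min()) ** 2), "pieces")
```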

By a volume argument, we can upper bound the size of the net.

Lemma 4.5.

The size of the net is at most the ratio of the volume of (a slight enlargement of) the domain to the smallest volume of a (scaled) pseudo-ball.

Proof.

For each net point, define an auxiliary set: a scaled-down copy of its pseudo-ball, centered at that point.

We claim that these auxiliary sets are disjoint. Suppose not; then there are two distinct indices and some point lying in both corresponding auxiliary sets. It follows from convexity and symmetry of the pseudo-balls that one of the two centers lies in the pseudo-ball placed around the other, contradicting the rule by which the centers were chosen. So the auxiliary sets are indeed disjoint.

But each auxiliary set is a subset of a slight enlargement of the domain. By the volume lower bound on the pseudo-balls, the number of disjoint auxiliary sets is at most the claimed ratio.

The lemma follows. ∎

We now show that, with high probability, the desired inequality holds at all net points simultaneously. The main idea is that the random perturbations partially decouple the net from $W$. Since each point of the net is distributed uniformly over a set of non-negligible volume, the probability that any fixed net point fails the concentration inequality can be bounded against the probability that a uniformly random point from the shell fails.
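The decoupling step rests on an elementary fact: if a point is distributed uniformly over a region occupying at least a $\nu$ fraction of a larger set, then the probability that it lands in any fixed "bad" subset is at most $1/\nu$ times the probability that a uniform point of the larger set does. The toy Monte Carlo below (ours, with arbitrary sets on the unit square) just illustrates this inequality numerically.

```python
import numpy as np

rng = np.random.default_rng(7)
N = 500_000

# Larger set: the unit square. Smaller region: a horizontal strip of volume fraction nu = 0.1.
nu = 0.1

def bad(p):
    # An arbitrary fixed "bad" set: a disk near the bottom edge of the square.
    return (p[:, 0] - 0.3) ** 2 + (p[:, 1] - 0.05) ** 2 < 0.04

uniform_square = rng.random((N, 2))
uniform_strip = np.column_stack([rng.random(N), nu * rng.random(N)])  # uniform over [0,1] x [0,nu]

p_square = bad(uniform_square).mean()
p_strip = bad(uniform_strip).mean()
print(f"P[bad | uniform on strip] = {p_strip:.4f}  <=  P[bad | uniform on square]/nu = {p_square / nu:.4f}")
```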

Lemma 4.6.

We have

Proof.

For each point of the shell, consider the event that it fails the concentration inequality, and fix a possible value of the number of net points. (Recall that the centers are a deterministic function of the perturbations, which are random.) Introduce independent comparison points, each distributed uniformly over the shell. Conditioned on the centers, the perturbed net points are independent, each uniform over a subset of the shell of non-negligible volume fraction; hence the probability that any one of them lands in a fixed bad set is at most the inverse volume fraction times the probability that the corresponding comparison point does. Thus,

(2)

Since the comparison points are independent and uniformly distributed over the shell, each of them fails the concentration inequality with small probability, by the pointwise bound. Substituting into Equation (2) and integrating over the conditioning, the same holds, up to a factor of the inverse volume fraction, for the perturbed net points.

If the net points were deterministic, the pointwise concentration would apply to them directly. They are not deterministic, but they are independent of $W$, which suffices to imply the above inequality.

Finally, we take a union bound over the net points, whose number is bounded by Lemma 4.5. Combining these bounds, the total failure probability is bounded

as claimed. ∎

We conclude the proof of Theorem 4.4. Suppose that the bad event of Lemma 4.6 does not occur; that is, every net point satisfies the concentration bound. Now fix an arbitrary point of the domain. By construction of the net, it is covered by the pseudo-ball of some net point, and by convexity and symmetry of the pseudo-balls, the difference between the point and that net point lies in (an enlargement of) the pseudo-ball. Hence, by pseudo-Lipschitzness, the function values at the two points are close; since the net point satisfies the bound, we have

Thus,

as desired. By Lemma 4.6, this uniform bound holds with high probability over the randomness of $W$ and of the net perturbations. Since the bound is a statement about $W$ alone, it also holds with at least this probability over just the randomness of $W$.

5 Proof of Theorem 3.2

In [8], a weaker version of Theorem 3.2 was proven, requiring a logarithmic aspect ratio (i.e. $n = \Omega(k \log k)$). The proof was by a standard $\epsilon$-net argument. In Section 5.1 we discuss why Theorem 3.2 cannot be proven by standard arguments, and sketch how Theorem 4.4 yields a proof.

Then, in Section 5.2 we provide the full proof of Theorem 3.2.

Throughout, we let $W$ be an $n \times k$ random matrix with independent rows $w_1, \ldots, w_n$.

5.1 Outline

At a high level, the proof of Theorem 3.2 uses an $\epsilon$-net argument, with several crucial twists. The first twist is borrowed from the prior work [8]: we wish to prove concentration for the random matrix appearing in the WDC, but it is not continuous in $x$ and $y$, because of the indicators. So we introduce a smoothing parameter and define continuous relaxations of the indicators. Following [8], we can then define continuous approximations of the WDC matrix: one is an upper approximation and the other a lower approximation. Thus, for all $W$ and all nonzero $x, y$ we have the matrix inequality

(3)

So it suffices to upper bound the upper approximation and lower bound the lower approximation. The two arguments are essentially identical, and we will focus on the former. We seek to prove that, with high probability over $W$, the required bound holds for all $x, y$ simultaneously. At this point, [8] employs a standard $\epsilon$-net argument. This does not suffice for our purposes, because it uses the following bounds:

  1. For fixed $x, y$, the inequality holds with probability exponentially close to $1$.

  2. The approximation is Lipschitz in $(x, y)$, but only with a poor, dimension-dependent Lipschitz constant.

The second bound means that the $\epsilon$-net needs very fine granularity, so we must union bound over a very large net. Thus, a high-probability bound requires $n = \Omega(k \log k)$. Moreover, both bounds (1) and (2) are asymptotically optimal, so a different approach is needed.

This is where Theorem 4.4 comes in. If we center a ball of small constant radius at some point, then there is always some point in the ball where the approximation differs by a large amount; but only a small fraction of the points in the ball exhibit such a large deviation. More formally, it can be shown that the relevant function family is pseudo-Lipschitz, so its "effective Lipschitz parameter" carries no extra dependence on the dimension. The desired concentration is then a corollary of Theorem 4.4.

5.2 Full proof

In this section we prove Theorem 3.2. Fix $\epsilon > 0$. Let $W$ have independent rows $w_1, \ldots, w_n$. Recall from Section 5.1 that we have the matrix inequality

(4)

So it suffices to upper bound the upper approximation and lower bound the lower approximation. The two arguments are essentially identical, and we will focus on the former. We seek to prove that, with high probability over $W$, the required bound holds for all parameters simultaneously.

Moreover, we want this to hold whenever $n \geq Ck$ for a constant $C$ depending only on $\epsilon$. We will use two standard concentration inequalities:

Lemma 5.1.

Suppose that $n \geq Ck$. Then:

  1. The operator norm $\|W\|$ is at most a constant multiple of $\sqrt{n}$, with probability exponentially close to $1$.

  2. Every row norm $\|w_i\|$ is within a constant factor of $\sqrt{k}$, with high probability.

Proof.

See [20] for a reference on the first bound. The second bound is by concentration of chi-squared random variables with $k$ degrees of freedom. ∎
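Both facts can be checked numerically; the sketch below draws tall Gaussian matrices with unit-variance entries and compares the operator norm against the standard $\sqrt{n} + \sqrt{k}$ bound, and the row norms against the chi-squared prediction $\approx \sqrt{k}$. The sizes are illustrative, and this is our own sanity check of the descriptive restatement above.

```python
import numpy as np

rng = np.random.default_rng(8)
k = 50
for ratio in [4, 16, 64]:
    n = ratio * k
    W = rng.standard_normal((n, k))              # i.i.d. N(0, 1) entries
    op_norm = np.linalg.norm(W, 2)               # largest singular value
    row_norms = np.linalg.norm(W, axis=1)        # each ||w_i||^2 is chi-squared with k dof
    print(f"n/k={ratio:3d}: ||W|| = {op_norm:7.1f}  (sqrt(n)+sqrt(k) = {np.sqrt(n) + np.sqrt(k):7.1f}), "
          f"row norms in [{row_norms.min():.1f}, {row_norms.max():.1f}]  (sqrt(k) = {np.sqrt(k):.1f})")
```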

Let $\mathcal{W}$ be the set of matrices satisfying the two bounds of Lemma 5.1, and consider the event that the random matrix $W$ lies in $\mathcal{W}$.

For the relevant parameters, define the function family $\{f_x\}$ and the reference function $g$ induced by the upper approximation above. We check that $\{f_x\}$ and $g$ satisfy the three conditions of Theorem 4.4 with appropriate parameters.

Lemma 5.2.

For any parameter in the relevant domain, the pointwise concentration required by condition (1) of Theorem 4.4 holds.

Proof.

We first consider the reference function $g$. It is shown in the proof of Lemma 12 in [8] that the expectation has an explicit form, which under our assumptions implies the required bound relating it to $g$. Now expand $f_x(W)$ as a sum over the rows of $W$. Each summand is the product of a bounded term and a quantity that is quadratic in a Gaussian vector, and its scale is bounded by a constant; it follows that each summand is subexponential with a constant variance proxy. Therefore, by Bernstein's inequality, the probability that the sum deviates from its mean by any fixed amount decays exponentially in $n$; taking the deviation to be a sufficiently small constant gives the desired pointwise concentration. Finally, since the event of Lemma 5.1 holds with probability at least $1/2$, conditioning on it at most doubles the failure probability, as desired. ∎

Next, we show that $\{f_x\}$ is pseudo-Lipschitz with appropriate parameters.

Definition 5.3.

For the relevant parameters, define the pseudo-ball to be the set of perturbation vectors satisfying a small number of linear and norm constraints determined by those parameters.

The pseudo-ball captures the directions in which $f_x$ is Lipschitz more effectively than any spherical ball does, as the following lemma shows.

Lemma 5.4.

Suppose two arguments differ by a vector lying in the pseudo-ball and satisfy the norm constraints of the shell. Then the corresponding values of $f_x$ differ by at most the required deviation.

Proof.

We expand the difference and bound it term by term; the second-to-last inequality uses that the smoothed indicator is Lipschitz, and the last inequality uses the assumptions on the perturbation and on the norms. ∎

Next, we need to lower bound the volume of the pseudo-balls.

Lemma 5.5.

For the relevant parameters, the set system of pseudo-balls is wide: each pseudo-ball is convex and symmetric, has bounded diameter, and occupies at least a constant fraction of the volume of a ball of comparable radius.

Proof.

Fix the parameters. It is clear from the definition that the pseudo-ball is symmetric (i.e.