I Introduction
Over the past 1–2 decades, tremendous research effort has been placed on theoretical and algorithmic studies of high-dimensional linear inverse problems [1, 2]. The prevailing approach has been to model low-dimensional structure via assumptions such as sparsity or low-rankness, and numerous algorithmic approaches have been shown to be successful, including convex relaxations [3, 4], greedy methods [5, 6], and more. The problem of sparse estimation via linear measurements (commonly referred to as compressive sensing) is particularly well-understood, with theoretical developments including sharp performance bounds for both practical algorithms [4, 7, 8, 6] and (potentially intractable) information-theoretically optimal algorithms [9, 10, 11, 12].
Following the tremendous success of deep generative models in a variety of applications [13], a new perspective on compressive sensing was recently introduced, in which the sparsity assumption is replaced by the assumption that the underlying signal is well-modeled by a generative model (typically corresponding to a deep neural network) [14]. This approach was seen to exhibit impressive performance in experiments, with reductions in the number of measurements by large factors compared to sparsity-based methods. In addition, [14] provided theoretical guarantees on their proposed algorithm, essentially showing that an L-Lipschitz generative model with bounded k-dimensional inputs leads to reliable recovery with m = O(k log(Lr/δ)) random Gaussian measurements (see Section II for a precise statement). Moreover, for a ReLU network generative model from R^k to R^n with width w and depth d (see Appendix C for definitions), it suffices to have m = O(kd log w). Some further related works are outlined below.
In this paper, we address a prominent gap in the existing literature by establishing algorithm-independent lower bounds on the number of measurements needed. Using tools from minimax statistical analysis, we show that for generative models satisfying the assumptions of [14], the above-mentioned dependencies O(k log(Lr/δ)) and O(kd log w) cannot be improved (or in the latter case, cannot be improved by more than a logarithmic factor) without further assumptions. Our argument is essentially based on a reduction to compressive sensing with a group sparsity model (e.g., see [15]), i.e., forming a generative model that is capable of producing such signals.
I-A Related Work
The above-mentioned work of Bora et al. [14] performed its theoretical analysis assuming that one can find an input to the generative model minimizing an empirical loss function up to a given additive error (see Theorem 1 below). In practice, such a minimization problem may be hard, so it was proposed to use gradient descent in the latent space (i.e., on input vectors to the generative model). A variety of follow-up works of [14] provided additional theoretical guarantees for compressive sensing with generative models. For example, instead of using a gradient descent algorithm as in [14], the authors of [16, 17] provide recovery guarantees for a projected gradient descent algorithm under various assumptions, where now the gradient steps are taken in the ambient space (i.e., on output vectors from the generative model).
In [14], the recovered signal is assumed to lie in (or close to) the range of the generative model G, which poses limitations in cases where the true signal is further from the range of G. To overcome this problem, a more general model allowing sparse deviations from the range of G was introduced and analyzed in [18].
In the case that the generative model is a ReLU network, under certain assumptions on the layer sizes and randomness assumptions on the weights, the authors of [19] show that the non-convex objective function given by empirical risk minimization does not have any spurious stationary points, and accordingly establish a simple gradient-based algorithm that is guaranteed to find the global minimum.
Sample complexity upper bounds are presented in [20] for various generalizations of the original model in [14], namely, non-Gaussian measurements, nonlinear observation models, and heavy-tailed responses. Another line of works considers compressive sensing via untrained neural network models [21, 22], but these are less relevant to the present paper.
Despite this progress, to the best of our knowledge, there is no literature providing algorithm-independent lower bounds on the number of measurements needed, and without these, it is unclear to what extent the existing results can be improved. In particular, this is explicitly posed as an open problem in [20].
I-B Contributions
In this paper, we establish information-theoretic lower bounds that certify the optimality or near-optimality of the above-mentioned upper bounds from [14]. More specifically, our main results are as follows:

In Section III, we construct a bounded L-Lipschitz generative model capable of generating group-sparse signals, and show that the resulting necessary number of measurements for accurate recovery is m = Ω(k log(Lr/σ)).

In Section IV, using similar ideas, we construct two-layer ReLU networks with a large width w requiring m = Ω(k log w) measurements for accurate recovery, as well as lower-width ReLU networks with a large depth d requiring m = Ω(kd) measurements.
Note that these results are only summarized informally here; see the relevant sections for formal statements, and in particular Theorems 3, 4, and 7.
II Problem Setup and Overview of Upper Bounds
In this section, we formally introduce the problem, and overview one of the main results of [14] giving an upper bound on the sample complexity for Lipschitz-continuous generative models, so as to set the stage for our algorithm-independent lower bounds.
Compressive sensing aims to reconstruct an unknown vector x* from a number of noisy linear measurements of the form y = Ax* + η (formally defined below). In [14], instead of making use of the common assumption that x* is sparse [1], the authors assume that x* is close to some vector in the range of a generative function G. We adopt the same setup as [14], but for convenience we consider both rectangular and spherical input domains (whereas [14] focused on the latter). In more detail, the setup is as follows:

A generative model is a function G : D → R^n, with latent dimension k, ambient dimension n, and input domain D ⊆ R^k.

When the signal to be estimated is x* ∈ R^n, the observed vector is given by
(1) y = Ax* + η, where A ∈ R^{m×n} is the measurement matrix, η ∈ R^m is the noise vector, and m is the number of measurements. For now x* is arbitrary, but it should be thought of as being close to G(z) for some z ∈ D.

We define the ℓ2-ball B₂^k(r) = {z ∈ R^k : ‖z‖₂ ≤ r}, and the ℓ∞-ball B∞^k(r) = {z ∈ R^k : ‖z‖∞ ≤ r}. We will focus primarily on the case that the input domain D is one of these two types, and we refer to the corresponding cases as spherical domains and rectangular domains.

For S ⊆ D, we define
(2) G(S) = { G(z) : z ∈ S }.
When S = D is the domain of G, we also use the notation Range(G) = G(D), which we call the range of the generative model.
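To make the setup concrete, the observation model in (1) and the idea of minimizing ‖y − AG(z)‖₂ over the latent domain can be sketched numerically as follows; the toy two-layer network, the dimensions, and the use of random search in place of gradient descent are all illustrative assumptions rather than part of the formal setup.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n, m, sigma = 4, 50, 20, 0.1

# Toy generative model (a fixed random two-layer map; purely illustrative,
# not the construction used later in the paper).
W1 = rng.standard_normal((32, k))
W2 = rng.standard_normal((n, 32))
def G(z):
    return W2 @ np.maximum(W1 @ z, 0.0)

# Observation model y = A x* + eta, with A_ij ~ N(0, 1/m).
z_true = rng.uniform(-1, 1, size=k)
x_star = G(z_true)
A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
eta = rng.normal(0.0, sigma, size=m)
y = A @ x_star + eta

# Decoding: minimize ||y - A G(z)||_2 over the latent domain.
# Here a crude random search stands in for gradient descent in z.
best_z, best_loss = None, np.inf
for _ in range(5000):
    z = rng.uniform(-1, 1, size=k)
    loss = np.linalg.norm(y - A @ G(z))
    if loss < best_loss:
        best_z, best_loss = z, loss

err = np.linalg.norm(G(best_z) - x_star)
```

The decoder only ever searches over the k-dimensional latent space, which is the source of the k-dependent (rather than n-dependent) sample complexity discussed below.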
One of the two main results in [14] is the following, providing general recovery guarantees for compressive sensing with generative models and Gaussian measurements.
Theorem 1.
([14, Thm. 1.2]) Fix r, δ > 0, let G : B₂^k(r) → R^n be an L-Lipschitz function, and let A ∈ R^{m×n} be a random measurement matrix whose entries are i.i.d. with distribution N(0, 1/m). Given the observed vector y = Ax* + η, let ẑ minimize ‖y − AG(z)‖₂ to within additive error ε of the optimum over B₂^k(r). Then, for any number of measurements satisfying m ≥ Ck log(Lr/δ) for a universal constant C and any x* ∈ R^n, the following holds with probability 1 − e^{−Ω(m)}:
(3) ‖G(ẑ) − x*‖₂ ≤ 6 min_{z ∈ B₂^k(r)} ‖G(z) − x*‖₂ + 3‖η‖₂ + 2ε + 2δ.
The sample complexity m = O(k log(Lr/δ)) comes from the covering number of B₂^k(r) with parameter δ/L (see Appendix A for formal definitions), which is (O(Lr/δ))^k. The analysis of [14] extends directly to other compact domains; in the following, we provide the extension to the rectangular domain B∞^k(r), which will be particularly relevant in this paper. The covering number of B∞^k(r) with parameter δ/L is also (O(Lr/δ))^k (see Appendix A), and as a result, the sample complexity again becomes m = O(k log(Lr/δ)), and we have the following.
Theorem 2.
(Adapted from [14, Thm. 1.2]) Fix r, δ > 0, let G : B∞^k(r) → R^n be an L-Lipschitz function, and let A ∈ R^{m×n} be a random measurement matrix whose entries are i.i.d. with distribution N(0, 1/m). Given the observed vector y = Ax* + η, let ẑ minimize ‖y − AG(z)‖₂ to within additive error ε of the optimum over B∞^k(r). Then, with a number of measurements satisfying m ≥ Ck log(Lr/δ) for a universal constant C and any x* ∈ R^n, the following holds with probability 1 − e^{−Ω(m)}:
(4) ‖G(ẑ) − x*‖₂ ≤ 6 min_{z ∈ B∞^k(r)} ‖G(z) − x*‖₂ + 3‖η‖₂ + 2ε + 2δ.
Another main result of [14] (namely, Theorem 1.1 therein) concerns the sample complexity for generative models formed by neural networks with ReLU activations. We also establish corresponding lower bounds for such results, but the formal statements are deferred to Section IV.
For the sake of comparison with our lower bounds, it will be useful to manipulate Theorems 1 and 2 into a different form. To do this, we specialize the setting as follows:

We consider the case that the constrained minimum of ‖y − AG(z)‖₂ can be found exactly, so that ε = 0.

We consider the case of i.i.d. Gaussian noise: η ∼ N(0, σ²I_m), where I_m is the identity matrix, and σ² is a positive constant indicating the noise level. By standard concentration [2, Sec. 2.1], we have ‖η‖₂ ≤ O(σ√m) with probability 1 − e^{−Ω(m)}, and when this holds, the 3‖η‖₂ term is upper bounded by O(σ√m).

We focus on the case that the goal is to bring the error bound down to the noise level, and hence, we set δ = σ√m so as to match the 2δ term (to within a constant factor) with the 3‖η‖₂ term with high probability.
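The concentration claim ‖η‖₂ = O(σ√m) used above can be checked with a quick Monte Carlo simulation; the dimensions and trial counts below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, m, trials = 0.5, 400, 2000

# Draw eta ~ N(0, sigma^2 I_m) and record ||eta||_2 over many trials.
norms = np.linalg.norm(rng.normal(0.0, sigma, size=(trials, m)), axis=1)

# Concentration: ||eta||_2 is tightly clustered around sigma * sqrt(m).
ratio = norms / (sigma * np.sqrt(m))
```

The empirical ratios concentrate near 1 with fluctuations of order 1/√m, consistent with the stated high-probability bound.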
In this specialized setting, we have the following.
Corollary 1.
Under the preceding specialized setting, suppose that m ≥ Ck log(Lr/(σ√m)) for a suitable universal constant C. Then, for any x* ∈ Range(G), the estimate G(ẑ) satisfies
(5) E[‖G(ẑ) − x*‖₂] ≤ C′σ√m
for a universal constant C′, for both the spherical domain B₂^k(r) and the rectangular domain B∞^k(r).
Proof.
We first prove the case of a spherical domain using Theorem 1. Since ‖η‖₂ ≤ O(σ√m) with probability 1 − e^{−Ω(m)} as mentioned above, and since min_{z ∈ B₂^k(r)} ‖G(z) − x*‖₂ = 0 for x* ∈ Range(G), Theorem 1 (with the first and third terms in (3) removed due to our assumptions) yields the following:
(6) ‖G(ẑ) − x*‖₂ ≤ O(σ√m)
with probability at least 1 − e^{−Ω(m)}. On the other hand, when this high-probability event fails, we can trivially upper bound ‖G(ẑ) − x*‖₂ by the maximum difference between any two vectors in R^n generated by G. Since the input domain is B₂^k(r) and G is L-Lipschitz, we have ‖G(ẑ) − x*‖₂ ≤ 2Lr, and combining the preceding findings gives
(7) E[‖G(ẑ) − x*‖₂] ≤ O(σ√m) + 2Lr · e^{−Ω(m)}.
Finally, observe that the right-hand side of (7) behaves as O(σ√m) provided that m = Ω(log(Lr/σ)) with a sufficiently large implied constant. This condition is milder than the assumed condition on m, and the result follows.
The analogous claim following from Theorem 2 (with D = B∞^k(r)) is proved in an identical manner, with the only notable difference being that the worst-case bound 2Lr is replaced by 2Lr√k. ∎
III Lower Bound for Bounded Lipschitz-Continuous Models
In this section, we construct a Lipschitz-continuous generative model that can generate bounded group-sparse vectors. Then, by making use of minimax statistical analysis for group-sparse recovery, we provide information-theoretic lower bounds that match the upper bounds in Corollary 1.
III-A Choice of Generative Model for the Rectangular Domain
We would like to construct an L-Lipschitz function G such that recovering an arbitrary x* in its range with high probability and with squared error at most a small constant times σ²k requires m = Ω(k log(Lr/σ)). Recall that we consider the rectangular domain D = B∞^k(r), with i.i.d. N(0, σ²) noise and ε = 0.
Our approach is to construct such a generative model that is able to generate group-sparse signals, and then follow the steps of the minimax converse for (group-)sparse estimation [11, 23]. More precisely, we say that a signal in R^n is group-sparse if, when divided into k blocks of size n/k,¹ each block contains at most one nonzero entry.² See Figure 1 for an illustration. (¹To simplify the notation, we assume that n is an integer multiple of k; for general values of n, the same analysis goes through by letting the final few entries always equal zero. ²More general notions of group sparsity exist, but for compactness we simply refer to this specific notion as group-sparse.) We define
(8) S = { x ∈ R^n : every length-(n/k) block of x contains at most one nonzero entry },
and the following constrained variants:
(9) S(β) = { x ∈ S : ‖x‖∞ ≤ β },
(10) S₀(β) = { x ∈ S : every block of x contains exactly one nonzero entry, of magnitude β }.
The vectors in S₀(β) have exactly k nonzero entries, all having magnitude β. These vectors alone will suffice for establishing our lower bound (with a suitable choice of β), but we construct a generative model capable of producing all signals in S(β); this is done as follows:

The output vector x = G(z) is divided into k blocks of length n/k, denoted by x₁, …, x_k.

A given block x_i is only a function of the corresponding input entry z_i, for i = 1, …, k.

The mapping from z_i to x_i is as shown in Figure 2. The interval [−r, r] is divided into n/k intervals of length 2rk/n, and the j-th entry of x_i can only be nonzero if z_i takes a value in the j-th interval. Within that interval, the mapping takes a "double-triangular" shape: the endpoints and midpoint are mapped to zero, the points 1/4 and 3/4 of the way into the interval are mapped to +β and −β respectively (where β denotes the peak magnitude), and the remaining points follow a linear interpolation. As a result, all values in the range [−β, β] can be produced.
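A minimal sketch of one block of this construction is given below, assuming unit peak magnitude; the function names and the choice of peak value are ours, not the paper's notation.

```python
import numpy as np

def double_triangle(t):
    """Map t in [0, 1] to the double-triangular shape: zero at t = 0, 1/2, 1,
    peak +1 at t = 1/4 and -1 at t = 3/4, linear in between."""
    if t <= 0.5:
        return 1.0 - abs(4.0 * t - 1.0)   # rises 0 -> +1 -> 0 on [0, 1/2]
    return -(1.0 - abs(4.0 * t - 3.0))    # dips 0 -> -1 -> 0 on [1/2, 1]

def block_map(z_i, r, block_len):
    """Map a scalar z_i in [-r, r] to a block of length block_len with at
    most one nonzero entry (so the block is group-sparse by construction)."""
    x = np.zeros(block_len)
    width = 2.0 * r / block_len
    j = min(int((z_i + r) // width), block_len - 1)  # which sub-interval
    t = ((z_i + r) - j * width) / width              # position within it
    x[j] = double_triangle(t)
    return x
```

Running block_map over a grid of inputs traces out exactly the family of one-nonzero-per-block signals described above.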
While this generative model is considerably simpler than those used to generate complex synthetic data (e.g., natural images), it suffices for our purposes because it satisfies the assumptions imposed in [14]. Our main goal is to show that the results of [14] cannot be improved without further assumptions.
The simplicity of the preceding generative model permits a direct calculation of the Lipschitz constant, stated as follows.
Lemma 1.
The generative model described above, with input domain B∞^k(r), dimensions n and k, and peak output magnitude β (the largest value attained by the mappings in Figure 2), has a Lipschitz constant given by
(11) L = 2βn / (kr).
Proof.
Recall that x_i is the length-(n/k) block corresponding to z_i, and for concreteness consider i = 1. For two distinct values z₁, z₁′, it is easy to see that the ratio ‖x₁(z₁) − x₁(z₁′)‖₂ / |z₁ − z₁′| is maximized when z₁ and z₁′ are in the same small interval of length 2rk/n. This implies that the Lipschitz constant for the sub-block is the absolute value of the slope of a line segment in that interval, namely L = β / ((2rk/n)/4) = 2βn/(kr). Then, combining the sub-blocks, we have
(12) ‖G(z) − G(z′)‖₂² = Σ_{i=1}^{k} ‖x_i(z_i) − x_i(z_i′)‖₂²
(13) ≤ Σ_{i=1}^{k} L² (z_i − z_i′)²
(14) = L² ‖z − z′‖₂²,
so the overall Lipschitz constant is also L. ∎
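The slope calculation above can be sanity-checked numerically. The sketch below uses r = 1, block length 8, and unit peaks (all names and parameter choices are ours), for which the slope argument predicts a Lipschitz constant of 1 / ((2·1/8)/4) = 16 for a single block.

```python
import numpy as np

r, B = 1.0, 8  # domain radius and block length (illustrative choices)

def coord_map(z):
    # Same per-coordinate "double-triangular" map as in the text, unit peak.
    width = 2.0 * r / B
    j = min(int((z + r) // width), B - 1)
    t = ((z + r) - j * width) / width
    x = np.zeros(B)
    x[j] = (1.0 - abs(4.0 * t - 1.0)) if t <= 0.5 else -(1.0 - abs(4.0 * t - 3.0))
    return x

# Empirical Lipschitz estimate over many close pairs of inputs.
rng = np.random.default_rng(2)
z = rng.uniform(-r, r, size=20000)
zp = np.clip(z + rng.uniform(-1e-4, 1e-4, size=z.size), -r, r)
slopes = [np.linalg.norm(coord_map(a) - coord_map(b)) / abs(a - b)
          for a, b in zip(z, zp) if a != b]
L_hat = max(slopes)
```

The empirical maximum slope matches the predicted value 16 (up to floating-point error), and never exceeds it, since pairs straddling a kink only average two slopes of equal magnitude.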
III-B Minimax Lower Bound for Group-Sparse Recovery
Consider the problem of estimating a group-sparse signal x* (see (10)) from linear measurements y = Ax* + η, where η ∼ N(0, σ²I_m) (we will later substitute signals produced by our generative model). Specifically, given knowledge of y and A, an estimate x̂ is formed. We are interested in establishing a lower bound on the minimax risk inf_{x̂} sup_{x*} E_{x*}[‖x̂ − x*‖₂²], where E_{x*} denotes expectation when the underlying vector is x*.
The following lemma states a minimax lower bound for group-sparse recovery under a suitable choice of the nonzero magnitude β. This result can be proved using similar steps to the case of sparse recovery (without group structure) [11, 23], with suitable modifications.
Lemma 2.
Consider the problem of group-sparse recovery with parameters n, k, σ, and nonzero magnitude β, with a given measurement matrix A ∈ R^{m×n}. If ‖A‖_F² ≤ Cn for an absolute constant C, and if n ≥ 2k and β = O(σ), then we have
(15) inf_{x̂} sup_{x*} E_{x*}[‖x̂ − x*‖₂²] = Ω( kβ² ( 1 − O( m / (k log(n/k)) ) ) ).
In particular, to achieve sup_{x*} E_{x*}[‖x̂ − x*‖₂²] ≤ ε₀kβ² for a sufficiently small positive constant ε₀, we require
(16) m = Ω(k log(n/k)).
Proof.
See Appendix B. ∎
Of course, (15) trivially remains true when the supremum is taken over any set containing the signals in (10), in particular including Range(G) for any generative model G capable of producing them.
III-C Statement of Main Result
Combining the preceding auxiliary results, we deduce the following informationtheoretic lower bound for compressive sensing with generative models.
Theorem 3.
Consider the problem of compressive sensing with L-Lipschitz generative models with input domain B∞^k(r), and i.i.d. N(0, σ²) noise. Let σ > 0 and r > 0 be fixed constants, and assume that Lr/σ is sufficiently large (i.e., exceeds a sufficiently large universal constant). Then there exists an L-Lipschitz generative model G (and associated output dimension n) such that, for any A satisfying ‖A‖_F² = O(n), if we have
(17) sup_{x* ∈ Range(G)} E[‖x̂ − x*‖₂²] ≤ ε₀σ²k
for a sufficiently small constant ε₀ > 0, then we must also have m = Ω(k log(Lr/σ)).
Proof.
We are free to choose the output dimension n to our liking for the purpose of proving the theorem, and accordingly, we set
(18) n = k (Lr/σ)^c
for some constant c > 0 to be chosen later. As a result, we have
(19) k log(n/k) = Θ(k log(Lr/σ)),
since we assumed that Lr/σ is sufficiently large. Hence, it suffices to show that m = Ω(k log(n/k)) is necessary for achieving (17).
To do this, we make use of Lemma 2 on group-sparse recovery, and the fact that our choice of generative model is able to produce such signals. Since we assumed that ‖A‖_F² = O(n), the contrapositive form of Lemma 2 states that under the assumptions therein, it is not possible to achieve (17) when
(20) m ≤ c₀ k log(n/k)
for a sufficiently small constant c₀ > 0. While this has the desired behavior, the result only holds true under the conditions n ≥ 2k and β = O(σ) from Lemma 2 (after setting the number of groups to k and choosing β in accordance with Lemma 1). We proceed by checking that the assumptions of Theorem 3 imply that both of these conditions are true.
The condition n ≥ 2k follows directly from (18) and the assumption that Lr/σ is sufficiently large. For the condition on β, we equate the choice of n in (18) with the finding L = 2βn/(kr) from Lemma 1; canceling the k terms and rearranging gives
(21) β = (Lr)^{1−c} σ^c / 2.
As a result, we have the required condition β = O(σ) as long as
(22) (Lr/σ)^{1−c} = O(1).
Hence, we have shown that it is impossible to achieve (17) in the case that both (22) and (20) hold. To make these two conditions consistent, we set c = 1, meaning (22) reduces to a trivially true statement (and β = σ/2).
The preceding findings show, in particular, that if m is the largest integer satisfying (20) (henceforth denoted by m₀), then it is impossible to achieve (17). To show that the same is true for all smaller values of m, we use the simple fact that additional measurements can only ever help. More formally, suppose that A is an m × n measurement matrix achieving (17) for some m < m₀. Consider adding m₀ − m rows of zeros to A to produce A′, so that ‖A′‖_F = ‖A‖_F. If one ignores the final m₀ − m entries of y, then the problem of recovery from m₀ measurements is reduced to that from m measurements. In fact, in the latter case, the total amount of noise observed is also reduced, but to precisely recover the desired setting corresponding to m measurements, the recovery algorithm can artificially add noise to each ignored entry. ∎
Theorem 3 not only shows that the scaling laws of Corollary 1 cannot be improved under i.i.d. N(0, 1/m) measurements (in which case ‖A‖_F² is close to n with high probability), but also that no further improvements (beyond constant factors) are possible even for general measurement matrices having a similar Frobenius norm. The result holds under the assumption that Lr/σ is sufficiently large, which is a very mild assumption, since typical Lipschitz constants are at least equal to one, if not much higher.³ ³In fact, if we were to have Lr = O(σ), then the scaling of Corollary 1 would seemingly not make sense, since the logarithm would be non-positive. The explanation is that in this regime, outputting any point of Range(G) suffices for the recovery guarantee, and no measurements are needed at all.
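The zero-padding step at the end of the proof (appending all-zero rows so that the extra measurements carry no information while the Frobenius norm is unchanged) is easy to verify mechanically; the dimensions below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
m, m0, n = 5, 9, 12
A = rng.standard_normal((m, n))

# Append m0 - m all-zero rows: the Frobenius norm is unchanged, and the
# first m entries of the padded (noiseless) measurements coincide with
# the original ones, while the appended entries are identically zero.
A_pad = np.vstack([A, np.zeros((m0 - m, n))])
x = rng.standard_normal(n)
y, y_pad = A @ x, A_pad @ x
```

Under the padded matrix, the final rows of the noisy observation vector would contain pure noise, which is exactly why ignoring them reduces the m₀-measurement problem to the m-measurement one.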
III-D Extension to the Spherical Domain
The above analysis focuses on the rectangular domain B∞^k(r). At first glance, it may appear non-trivial to use the same ideas to obtain corresponding lower bounds for the spherical domain B₂^k(r). However, in the following, we show that by simply considering the largest possible ℓ∞-ball inside the ℓ2-ball, we can obtain a matching lower bound to Corollary 1 even for spherical domains. The fact that this crude approach gives a tight result is somewhat surprising, and is discussed further below.
Let G_rect denote the above-formed generative model for rectangular domains, applied with radius r/√k, and note that B∞^k(r/√k) ⊆ B₂^k(r). To handle the spherical domain B₂^k(r), we construct the generative model G as follows:

For any z ∈ B∞^k(r/√k), we simply let G(z) equal the output of the rectangular-domain generative model just described. It is only these input values that will be used to establish the lower bound, as these values alone suffice for generating all of the required group-sparse signals. However, we still need to set the other values to ensure that Lipschitz continuity is maintained.

To handle the other values of z ∈ B₂^k(r), we extend the functions in Figure 2 (with r/√k in place of r) to take values on the whole real line: for all values outside the indicated interval, each function value simply remains zero.

The preceding dot point leads to a Lipschitz-continuous function defined on all of R^k, and we simply take G to be that function restricted to B₂^k(r).
By the first dot point above, we can directly apply Theorem 3 with r/√k in place of r, yielding the following.
Theorem 4.
Consider the problem of compressive sensing with L-Lipschitz generative models, with input domain B₂^k(r) and i.i.d. N(0, σ²) noise. Let σ > 0 and r > 0 be fixed constants, and assume that Lr/(σ√k) is sufficiently large. Then there exists an L-Lipschitz generative model G (and associated output dimension n) such that, for any A satisfying ‖A‖_F² = O(n), if we have
(23) sup_{x* ∈ Range(G)} E[‖x̂ − x*‖₂²] ≤ ε₀σ²k
for a sufficiently small constant ε₀ > 0, then we must also have m = Ω(k log(Lr/(σ√k))).
This result establishes the tightness of Corollary 1 up to constant factors for spherical domains. The assumption is different from that of Theorem 3, but is similarly mild (see the discussion in Footnote 3).
The above reduction may appear to be overly crude because, as k grows large, the volume of B∞^k(r/√k) is a vanishingly small fraction of the volume of B₂^k(r). However, as discussed following Theorem 1, the key geometric quantity in the proof of the upper bound is in fact the covering number (see also Appendix A), and both B∞^k(r/√k) and B₂^k(r) yield the same scaling laws for the logarithm of the covering number (with a sufficiently small distance parameter). As a result, it is reasonable to expect that these two domains also require the same scaling laws on the number of measurements.
IV Generative Models Based on ReLU Networks
In this section, as opposed to considering general Lipschitz-continuous generative models, we provide a more detailed treatment of neural networks with ReLU activations (see Appendix C for brief definitions). We are particularly interested in comparing against the following result from [14]; this result holds even when the domain is unbounded (D = R^k), so we do not need to distinguish between the rectangular and spherical domains.
Theorem 5.
([14, Thm. 1.1]) Let G : R^k → R^n be a generative model given by a d-layer neural network with ReLU activations⁴ and at most w nodes per layer, and let A ∈ R^{m×n} be a random measurement matrix whose entries are i.i.d. with distribution N(0, 1/m). Given the observed vector y = Ax* + η, let ẑ minimize ‖y − AG(z)‖₂ to within additive error ε of the optimum over R^k. Then, with a number of measurements satisfying m ≥ Ckd log w for a universal constant C and any x* ∈ R^n, the following holds with probability 1 − e^{−Ω(m)}: (⁴As discussed in [14], the same result holds for any piecewise linear activation with two components (e.g., leaky ReLU).)
(24) ‖G(ẑ) − x*‖₂ ≤ 6 min_{z ∈ R^k} ‖G(z) − x*‖₂ + 3‖η‖₂ + 2ε.
It is interesting to note that this result makes no assumptions about the neural network weights (nor the domain size), but rather, only concerns the input size, width, and depth. In addition, we have the following counterpart to Corollary 1, with a slight modification to only state the existence of a good matrix A rather than concerning Gaussian random matrices.
Corollary 2.
Consider the setup of Theorem 5 with x* = G(z*) for some z* ∈ R^k, no optimization error (ε = 0), and i.i.d. Gaussian noise with η ∼ N(0, σ²I_m), but with a deterministic measurement matrix in place of the random Gaussian measurement matrix. Then, when m ≥ Ckd log w for a universal constant C, there exists some such A with ‖A‖_F² = O(n) such that
(25) E[‖G(ẑ) − x*‖₂] ≤ C′σ√m
for a universal constant C′.
Proof.
We need to modify the proof of Corollary 1, since in principle we may no longer have a bound on the error when the high-probability event in Theorem 5 fails. Fortunately, an inspection of the proof of Theorem 5 in [14] reveals that the high-probability event only amounts to establishing properties of A, most notably including the so-called restricted eigenvalue condition. Since the conclusion of Theorem 5 holds on average when A has i.i.d. N(0, 1/m) entries, it also holds for the best possible choice of A. Since standard concentration [2, Sec. 2.1] yields ‖A‖_F² = O(n) with high probability for Gaussian measurements, we may also assume that the "best possible" A here satisfies such a condition. ∎
Before establishing corresponding lower bounds to this result, it is useful to first discuss how the generative model from Figure 2 can be constructed using ReLU networks; this is done in Section IV-A. In Section IV-B, we build on these ideas to form different (but related) generative models that properly reveal the dependence of m on the width and depth.
IV-A Constructing the Generative Model Used in Theorem 3
In the case of a rectangular domain, the triangular shapes of the mappings in Figure 2 are such that the generative model can directly be implemented as a ReLU network with a single hidden layer. Indeed, this would remain true if the mappings between z_i and the entries of x_i in Figure 2 were replaced by any piecewise linear functions [28].
A limitation of this interpretation as a one-layer ReLU network is that for increasing values of n, the corresponding network has increasingly large weights. In particular, for fixed values of r and σ, a rearrangement of (18) gives L = Θ(σn/(kr)), which amounts to large weights in the case that n/k is large.
In the following, we argue that the construction of Figure 2 can be implemented using a deep ReLU network with bounded weights. To see this, we use similar ideas to those used to generate rapidly-varying (e.g., "sawtooth") functions using ReLU networks [28].
Consider the two functions shown in Figure 3. If we compose the first function with itself several times, then we obtain a function that transitions between its extreme values on an increasingly narrow portion of the interval, linearly interpolating in between. By further composing this function with the second, we obtain a function of the form shown in Figure 3 (Right), which matches those in Figure 2. By incorporating suitable offsets into this procedure, one can obtain the same "double-triangular" shape shifted along the horizontal axis, and hence recover all of the mappings shown in Figure 2.
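The composition idea can be illustrated with the standard ReLU "tent map" construction from [28]: each composition doubles the number of linear pieces while every individual weight stays bounded. A small sketch follows (the function names are ours, and the textbook tent map stands in for the exact functions of Figure 3).

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def tent(x):
    # Tent map on [0, 1] built from two ReLUs: 2x for x <= 1/2, 2 - 2x after.
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def tent_d(x, d):
    # Composing the tent map d times yields a sawtooth with 2^(d-1) teeth,
    # i.e. 2^d linear pieces, while every individual weight stays bounded.
    for _ in range(d):
        x = tent(x)
    return x
```

For example, d compositions map the point 2^(−d) to the peak value 1, reflecting the exponentially fine resolution achievable with depth.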
Since the steepest slopes of the two functions in Figure 3 are bounded, both of these functions can be implemented with a single hidden layer with bounded weights and bounded offsets. To bring the width of the "double-triangular" region down to the width 2rk/n appearing in Figure 2, we need O(log(n/k)) compositions (each of which adds another layer to the network).⁵ Finally, the number of one-dimensional mappings of the form shown in Figure 1 is n, and we let the network incorporate these in parallel. Combining these findings, we have the following. (⁵The case that n/k is not a power of two can be handled by slightly modifying the function in Figure 3, i.e., moving the indicated changepoints.)
Theorem 6.
Consider the setup of Theorem 3, and suppose that n/k is an integer power of two and that β ≤ 1. Then, the generative model therein can be realized using a ReLU network with depth O(log(n/k)), width O(n), and weights and offsets bounded by universal constants.
Note that the assumption β ≤ 1 is very mild in view of (21), and even if one wishes to handle more general values, it is not difficult to generalize the above arguments accordingly.
IV-B Understanding the Dependence on Width and Depth
Thus far, we have considered forming a generative model capable of producing group-sparse signals with one nonzero entry per length-(n/k) block, which leads to a lower bound of m = Ω(k log(n/k)). While this precise approach does not appear to be suited to properly understanding the dependence on width and depth in Theorem 5, we now show that a simple variant indeed suffices: We form a wide and/or deep ReLU network in which each input coordinate is responsible for a number of blocks that may be much larger than one.
It is instructive to first consider the case k = 1, and to construct a non-continuous generative model that will later be slightly modified to be continuous. For later convenience, we momentarily denote the output length by ñ, viewed as consisting of d blocks of length B (so that ñ = dB). We consider the input interval, which we view as being split into (2B)^d small sub-intervals of equal length; note that (2B)^d is the number of possible signed sparsity patterns for group-sparse signals of length ñ with exactly one nonzero entry of fixed magnitude per block. The idea is to let each value of z corresponding to the midpoint of a given sub-interval produce a signal with a different sparsity pattern.
In more detail, we consider the following (see Figure 4 for an illustration):

(Coarsest scale) The input interval is split into 2B intervals of equal length, where B denotes the block length. Then:

If z lies in the first interval, the first entry of the first block equals +β (the common nonzero magnitude), and if z lies in the second interval, it equals −β (in all other cases, it equals zero);

If z lies in the third interval, the second entry of the first block equals +β, and if z lies in the fourth interval, it equals −β (in all other cases, it equals zero);

This continues similarly for the remaining entries of the first block.


(Second-coarsest scale) Each interval at the coarsest scale is split into 2B equal sub-intervals. Then, within each of the coarsest intervals, we have the following:

If z lies in the first sub-interval, the first entry of the second block equals +β, and if z lies in the second sub-interval, it equals −β (in all other cases, it equals zero);

If z lies in the third sub-interval, the second entry of the second block equals +β, and if z lies in the fourth sub-interval, it equals −β (in all other cases, it equals zero);

This continues similarly for the remaining entries of the second block.


We continue recursively until we reach the finest scale, whose sub-intervals dictate the values of the final block.
While the discontinuous points in Figure 4 are problematic when it comes to implementation with a ReLU network, we can overcome this by simply replacing them by straight-line transitions having a finite slope (i.e., the rectangular shapes become trapezoidal), while being sufficiently sharp so that all the input values at the midpoints of the finest-scale sub-intervals produce the same outputs as the idealized function described above. Then, ReLU-based implementation is mathematically possible, since the mappings are piecewise linear [28].
The above construction generates all group-sparse signals with one nonzero entry of the common magnitude per block. To see this, one can consider "entering" the appropriate coarsest region according to the desired location and sign in the first block of the sparse signal, then recursively entering the appropriate second-coarsest region based on the second block, and so on.
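A discrete sketch of this multi-scale indexing is given below: a single index (one per finest-scale sub-interval) is decoded scale by scale into blocks, each with one signed nonzero entry, and distinct indices yield distinct patterns. The block length B, number of scales s, and all names are our illustrative choices.

```python
import numpy as np

def pattern_from_index(i, B, s):
    """Decode index i in {0, ..., (2B)^s - 1} into s blocks of length B,
    each with exactly one nonzero entry equal to +1 or -1 (coarsest scale
    first, matching the recursive interval splitting in the text)."""
    x = np.zeros(s * B)
    digits = []
    for _ in range(s):
        digits.append(i % (2 * B))
        i //= 2 * B
    for block, digit in enumerate(reversed(digits)):  # coarsest digit first
        pos = digit // 2                              # which entry in block
        sign = 1.0 if digit % 2 == 0 else -1.0        # which sign
        x[block * B + pos] = sign
    return x

B, s = 3, 2
patterns = {tuple(pattern_from_index(i, B, s)) for i in range((2 * B) ** s)}
```

Since the decoding is a bijection between indices and signed sparsity patterns, the set of patterns has exactly (2B)^s elements, mirroring the counting argument in the text.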
To generalize the above ideas to k-dimensional inputs, we form k such functions in parallel, thereby allowing the generation of group-sparse signals of total length n = kñ having one nonzero entry per block. Then, we can use Lemma 2 and a suitable choice of the nonzero magnitude β to deduce the following.
Theorem 7.
Fix the number of scales d, and consider the problem of compressive sensing with generative models under i.i.d. N(0, σ²) noise, a measurement matrix satisfying ‖A‖_F² = O(n), and the above-described generative model with parameters n, k, σ, and d. Then, if n ≥ Ckd for an absolute constant C, there exists a constant c > 0 such that the choice β = cσ yields the following:

Any algorithm attaining sup_{x* ∈ Range(G)} E[‖x̂ − x*‖₂²] ≤ ε₀σ²kd (for a sufficiently small constant ε₀ > 0) must also have m = Ω(kd log(n/(kd))) (or equivalently m = Ω(k log w), since log w = Θ(d log(n/(kd))) for the network described in the next dot point).

The generative function can be implemented as a ReLU network with a single hidden layer (i.e., depth two) of width at most w, where log w = Θ(d log(n/(kd))).

Alternatively, if the number of sub-intervals at the finest scale is an integer power of two,⁶ the generative function can be implemented as a ReLU network with depth O(d log(n/(kd))) and width O(n). (⁶This is a mild assumption given that we already assumed an unspecified constant C; for general parameters, one can consider only using the first 2^j entries of each block for the largest possible integer j, and letting the remaining entries always be zero. This means that at least half of the entries are used, and the same follows for the entries of the combined output.)
In the settings described in the second and third dot points, the sample complexity from Corollary 2 behaves as O(k log w) and O(kd log(n/(kd)) · log n), respectively.
Proof.
The first claim is proved similarly to the proof of Theorem 3, so we only outline the differences. In accordance with Lemma 2, we view the signal as consisting of kd groups, each of size equal to the largest integer not exceeding n/(kd) (with any leftover entries set to zero). Then, Lemma 2 states that if ‖A‖_F² = O(n) and β = O(σ), it is not possible to achieve sup_{x*} E[‖x̂ − x*‖₂²] ≤ ε₀kdβ² when m is below a constant times kd log(n/(kd)). Substituting the choice β = cσ gives the claimed behavior. The first claim follows by using the argument at the end of the proof of Theorem 3 to argue that since the recovery goal cannot be attained at this value of m, it also cannot be attained at any smaller value.
For the second claim, we observe that each mapping from z_i to an output entry in Figure 4 has a bounded number of rectangular "pieces" at the coarsest scale, and the number of pieces grows geometrically across the scales, so that summing over the scales gives a total dominated by the finest scale. Recall also that these rectangles are replaced by trapezoidal shapes to make them implementable. Hence, we can apply the well-known fact that any piecewise-linear function with p pieces can be implemented using a ReLU network of width O(p) with a single hidden layer [28]; in our case, the total number of pieces satisfies log p = Θ(d log(n/(kd))). The desired claim follows by multiplying by k in accordance with the fact that we implement the network of Figure 4 in parallel k times.
Due to the periodic nature of the signals in Figure 4, the third claim also follows using well-established ideas [28]. We would like to produce trapezoidal pulses at regular intervals similarly to Figure 4. To obtain the positive pulses, we can take a half-trapezoidal shape of the form in Figure 5 (Right) and pass it through a sawtooth function having some number of triangular regions as in Figure 5 (Middle), possibly using suitable offsets to shift the location. The negative pulses can be produced similarly, and the two can be added together in the final layer.
As exemplified in Figure 5 and proved in [28], a sawtooth function with p repetitions can itself be implemented by a network of bounded width and depth O(log p) when p is a power of two. In our case, the maximal number of such repetitions occurs at the finest scale, and since this is a power of two by assumption, the depth required is logarithmic in this number, i.e., O(d log(n/(kd))). ∎