Over the past 1–2 decades, tremendous research effort has been placed on theoretical and algorithmic studies of high-dimensional linear inverse problems [1, 2]. The prevailing approach has been to model low-dimensional structure via assumptions such as sparsity or low rankness, and numerous algorithmic approaches have been shown to be successful, including convex relaxations [3, 4], greedy methods [5, 6], and more. The problem of sparse estimation via linear measurements (commonly referred to as compressive sensing) is particularly well-understood, with theoretical developments including sharp performance bounds for both practical algorithms [4, 7, 8, 6] and (potentially intractable) information-theoretically optimal algorithms [9, 10, 11, 12].
Following the tremendous success of deep generative models in a variety of applications, a new perspective on compressive sensing was recently introduced, in which the sparsity assumption is replaced by the assumption of the underlying signal being well-modeled by a generative model (typically corresponding to a deep neural network). This approach was seen to exhibit impressive performance in experiments, reducing the number of measurements by large factors compared to sparsity-based methods.
In addition,  provided theoretical guarantees on their proposed algorithm, essentially showing that an -Lipschitz generative model with bounded -dimensional inputs leads to reliable recovery with random Gaussian measurements (see Section II for a precise statement). Moreover, for a ReLU network generative model from to with width and depth (see Appendix -C for definitions), it suffices to have . Some further related works are outlined below.
In this paper, we address a prominent gap in the existing literature by establishing algorithm-independent lower bounds on the number of measurements needed. Using tools from minimax statistical analysis, we show that for generative models satisfying the assumptions of , the above-mentioned dependencies and cannot be improved (or in the latter case, cannot be improved by more than a factor) without further assumptions. Our argument is essentially based on a reduction to compressive sensing with a group sparsity model (e.g., see ), i.e., forming a generative model that is capable of producing such signals.
I-A Related Work
The above-mentioned work of Bora et al. [14] performed the theoretical analysis assuming that one can find an input to the generative model minimizing an empirical loss function up to a given additive error (see Theorem 1 below). In practice, such a minimization problem may be hard, so it was proposed to use gradient descent in the latent space (i.e., on input vectors to the generative model).
A variety of follow-up works to [14] have provided additional theoretical guarantees for compressive sensing with generative models. For example, instead of using a gradient descent algorithm as in [14], the authors of [16, 17] provide recovery guarantees for a projected gradient descent algorithm under various assumptions, where now the gradient steps are taken in the ambient space (i.e., on output vectors from the generative model).
In [14], the recovered signal is assumed to lie in (or close to) the range of the generative model, which poses limitations in cases where the true signal is further from this range. To overcome this problem, a more general model allowing sparse deviations from the range was introduced and analyzed in .
In the case that the generative model is a ReLU network, under certain assumptions on the layer sizes and randomness assumptions on the weights, the authors of  show that the non-convex objective function given by empirical risk minimization does not have any spurious stationary points, and accordingly establish a simple gradient-based algorithm that is guaranteed to find the global minimum.
Sample complexity upper bounds are presented in  for various generalizations of the original model in [14], namely, non-Gaussian measurements, non-linear observation models, and heavy-tailed responses. Another line of work considers compressive sensing via untrained neural network models [21, 22], but this is less relevant to the present paper.
Despite this progress, to the best of our knowledge, there is no literature providing algorithm-independent lower bounds on the number of measurements needed, and without these, it is unclear to what extent the existing results can be improved. In particular, this is explicitly posed as an open problem in .
In this paper, we establish information-theoretic lower bounds that certify the optimality or near-optimality of the above-mentioned upper bounds from [14]. More specifically, our main results are as follows:
In Section III, we construct a bounded -Lipschitz generative model capable of generating group-sparse signals, and show that the resulting necessary number of measurements for accurate recovery is .
In Section IV, using similar ideas, we construct two-layer ReLU networks with a large width requiring measurements for accurate recovery, as well as lower-width ReLU networks with a large depth requiring measurements.
II Problem Setup and Overview of Upper Bounds
In this section, we formally introduce the problem, and overview one of the main results of [14], which gives an upper bound on the sample complexity for Lipschitz-continuous generative models, so as to set the stage for our algorithm-independent lower bounds.
Compressive sensing aims to reconstruct an unknown vector from a number of noisy linear measurements of the form (formally defined below). In , instead of making use of the common assumption that is -sparse , the authors assume that is close to some vector in the range of a generative function . We adopt the same setup as , but for convenience we consider both rectangular and spherical input domains (whereas  focused on the latter). In more detail, the setup is as follows:
A generative model is a function , with latent dimension , ambient dimension , and input domain .
When the signal to be estimated is , the observed vector is given by
where is the measurement matrix, is the noise vector, and is the number of measurements. For now is arbitrary, but should be thought of as being close to for some .
We define the -ball , and the -ball . We will focus primarily on the case that the input domain is of one of these two types, and we refer to the cases as spherical domains and rectangular domains.
For , we define
When is the domain of , we also use the notation , which we call the range of the generative model.
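As a minimal sketch of the measurement model above, the following simulates noisy linear measurements of a signal. The names `measure`, `x`, `m`, and `sigma` are illustrative and not taken from the original text. The choice of N(0, 1/m) entries for the measurement matrix is an assumption, made so that the measurements preserve the squared norm of the signal on average.

```python
import numpy as np

def measure(x, m, sigma, rng):
    """Return (A, y) with y = A @ x + eta for i.i.d. Gaussian A and eta."""
    n = x.shape[0]
    # Assumption: entries of A are N(0, 1/m), so that E||A x||^2 = ||x||^2.
    A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
    # Noise vector with i.i.d. N(0, sigma^2) entries.
    eta = rng.normal(0.0, sigma, size=m)
    return A, A @ x + eta

rng = np.random.default_rng(0)
x = rng.normal(size=50)            # stand-in for the unknown signal
A, y = measure(x, m=20, sigma=0.1, rng=rng)
print(A.shape, y.shape)
```

Here the number of measurements (20) is smaller than the ambient dimension (50), which is the regime of interest in compressive sensing.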
One of the two main results in [14] is the following, providing general recovery guarantees for compressive sensing with generative models and Gaussian measurements.
Theorem 1 ([14, Thm. 1.2])
Fix , let be an -Lipschitz function, and let be a random measurement matrix whose entries are i.i.d. with distribution . Given the observed vector , let minimize to within additive error of the optimum over . Then, for any number of measurements satisfying for a universal constant and any , the following holds with probability :
The sample complexity comes from the covering number of with parameter (see Appendix -A for formal definitions), which is . The analysis of  extends directly to other compact domains ; in the following, we provide the extension to the rectangular domain , which will be particularly relevant in this paper. The covering number of with parameter is (see Appendix -A), and as a result, the sample complexity becomes , and we have the following.
Theorem 2 (Adapted from [14, Thm. 1.2]) Fix , let be an -Lipschitz function, and let be a random measurement matrix whose entries are i.i.d. with distribution . Given the observed vector , let minimize to within additive error of the optimum over . Then, with a number of measurements satisfying for a universal constant and any , the following holds with probability :
Another main result of [14] (namely, Theorem 1.1 therein) concerns the sample complexity for generative models formed by neural networks with ReLU activations. We also establish corresponding lower bounds for such results, but the formal statements are deferred to Section IV.
We consider the case that the constrained minimum of can be found exactly, so that .
We focus on the case that the goal is to bring the error bound down to the noise level, and hence, we set to match the term (to within a constant factor) with high probability.
In this specialized setting, we have the following.
We first prove the case of a spherical domain using Theorem 1. Since with probability as mentioned above, and since , Theorem 1 (with the first and third terms in (3) removed due to our assumptions) yields the following:
with probability at least . On the other hand, when this high-probability event fails, we can trivially upper bound by the maximum difference between any two vectors in generated by . Since the input domain is and is -Lipschitz, we have , and combining the preceding findings gives
Finally, observe that behaves as provided that with a sufficiently large implied constant. This condition is milder than the assumed behavior, and the result follows.
The analogous claim following from Theorem 2 (with ) is proved in an identical manner, with the only notable difference being that the worst-case bound is replaced by . ∎
III Lower Bound for Bounded Lipschitz-Continuous Models
In this section, we construct a Lipschitz-continuous generative model that can generate bounded -group-sparse vectors. Then, by making use of minimax statistical analysis for group-sparse recovery, we provide information-theoretic lower bounds that match the upper bounds in Corollary 1.
III-A Choice of Generative Model for the Rectangular Domain
We would like to construct an -Lipschitz function such that recovering an arbitrary in its range with high probability and with squared error requires . Recall that we consider with and .
Our approach is to construct such a generative model that is able to generate group-sparse signals, and then follow the steps of the minimax converse for (group-)sparse estimation [11, 23]. More precisely, we say that a signal in is -group-sparse if, when divided into blocks of size , each block contains at most one non-zero entry. (Footnote 1: To simplify the notation, we assume that is an integer multiple of . For general values of , the same analysis goes through by letting the final entries of always equal zero.) (Footnote 2: More general notions of group sparsity exist, but for compactness we simply refer to this specific notion as -group-sparse.) See Figure 1 for an illustration. We define
and the following constrained variants:
The vectors in have exactly non-zero entries all having magnitude . These vectors alone will suffice for establishing our lower bound (with a suitable choice of ), but we construct a generative model capable of producing all signals in ; this is done as follows:
The output vector is divided into blocks of length , denoted by .
A given block is only a function of the corresponding input , for .
The mapping from to is as shown in Figure 2. The interval is divided into intervals of length , and the -th entry of can only be non-zero if takes a value in the -th interval. Within that interval, the mapping takes a “double-triangular” shape – the endpoints and midpoint are mapped to zero, the points and of the way into the interval are mapped to and respectively, and the remaining points follow a linear interpolation. As a result, all values in the range can be produced.
While this generative model is considerably simpler than those used to generate complex synthetic data (e.g., natural images), it suffices for our purposes because it satisfies the assumptions imposed in [14]. Our main goal is to show that the results of [14] cannot be improved without further assumptions.
The simplicity of the preceding generative model permits a direct calculation of the Lipschitz constant, stated as follows.
Lemma 1. The generative model described above, with parameters , , , and , has a Lipschitz constant given by
Recall that is the length- block corresponding to , and for concreteness consider . For two distinct , it is easy to see that the ratio is maximized when and are in the same small interval with length . This implies that the Lipschitz constant for the sub-block is the absolute value of the slope of a line segment in that interval, denoted by . Then, combining the sub-blocks, we have
so the overall Lipschitz constant is also . ∎
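The construction and its Lipschitz constant can be checked numerically with a short sketch. The names `r` (domain radius), `n` (output length), `k` (latent dimension), and `x_max` (peak magnitude) are assumed notation. The sign order of the two peaks (positive first, negative second) is also an assumption. The slope used below is simply the peak height divided by one quarter of an interval length, i.e. 2·n·x_max/(k·r); this is arithmetic on the verbal description of the mapping rather than a quotation of the lemma's displayed formula.

```python
import numpy as np

def block_map(z, r, b, x_max):
    """Map a scalar z in [-r, r] to a length-b block with at most one non-zero entry."""
    width = 2.0 * r / b                        # [-r, r] is split into b intervals
    j = min(int((z + r) // width), b - 1)      # index of the interval containing z
    t = (z + r - j * width) / width            # relative position within that interval
    # Double triangle: 0 at the endpoints and midpoint, peaks at 1/4 and 3/4.
    # Assumption: the first peak is +x_max and the second is -x_max.
    out = np.zeros(b)
    out[j] = np.interp(t, [0.0, 0.25, 0.5, 0.75, 1.0],
                       [0.0, x_max, 0.0, -x_max, 0.0])
    return out

def generate(z, r, n, x_max):
    """Concatenate one block per latent coordinate (n assumed divisible by len(z))."""
    b = n // len(z)
    return np.concatenate([block_map(zi, r, b, x_max) for zi in z])

r, n, k, x_max = 1.0, 10, 2, 1.0
slope = 2.0 * n * x_max / (k * r)              # peak height / quarter-interval length

# Empirical Lipschitz ratio along the first latent coordinate.
grid = np.linspace(-r, r, 4001)
outputs = np.array([generate([z, 0.3], r, n, x_max) for z in grid])
ratios = np.linalg.norm(np.diff(outputs, axis=0), axis=1) / np.diff(grid)
print(ratios.max(), slope)
```

The largest observed ratio should agree with the single-segment slope, in line with the argument that the ratio is maximized when both inputs lie on the same linear segment.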
III-B Minimax Lower Bound for Group-Sparse Recovery
Consider the problem of estimating a -group-sparse signal (see (10)) from linear measurements , where (we will later substitute ). Specifically, given knowledge of and , an estimate is formed. We are interested in establishing a lower bound on the minimax risk , where denotes expectation when the underlying vector is .
The following lemma states a minimax lower bound for -group-sparse recovery under a suitable choice of . This result can be proved using similar steps to the case of -sparse recovery (without group structure) [11, 23], with suitable modifications.
Lemma 2. Consider the problem of -group-sparse recovery with parameters , , , and , with a given measurement matrix . If for an absolute constant , and if , then we have
In particular, to achieve for a positive constant , we require
See Appendix -B. ∎
Of course, (15) trivially remains true when the supremum is taken over any set containing , in particular including for any .
III-C Statement of Main Result
Combining the preceding auxiliary results, we deduce the following information-theoretic lower bound for compressive sensing with generative models.
Theorem 3. Consider the problem of compressive sensing with -Lipschitz generative models with input domain , and i.i.d. noise. Let and be fixed constants, and assume that with a sufficiently large implied constant. Then there exists an -Lipschitz generative model (and associated output dimension ) such that, for any satisfying , if we have
then we must also have .
We are free to choose the output dimension to our liking for the purpose of proving the theorem, and accordingly, we set
for some constant to be chosen later. As a result, we have
since we assumed that with a sufficiently large implied constant. Hence, it suffices to show that is necessary for achieving (17).
To do this, we make use of Lemma 2 on -group-sparse recovery, and the fact that our choice of generative model is able to produce such signals. Since we assumed that , the contrapositive form of Lemma 2 states that under the assumptions therein, it is not possible to achieve (17) when
While this has the desired behavior, the result only holds true under the conditions and from Lemma 2 (after setting and ). We proceed by checking that the assumptions of Theorem 3 imply that both of these conditions are true.
The condition follows directly from (18) and the assumption that with a sufficiently large implied constant. For the condition on , we equate the condition from (18) with the finding from Lemma 1; canceling the terms and re-arranging gives
As a result, we have the required condition as long as
The preceding findings show, in particular, that if is the largest integer satisfying (20) (henceforth denoted by ), then it is impossible to achieve (17). To show that the same is true for all smaller values of , we use the simple fact that additional measurements can only ever help. More formally, suppose that is an measurement matrix achieving (17) for some . Consider adding rows of zeros to to produce , so that . If one ignores the final entries of , then the problem of recovery from measurements is reduced to that from measurements. In fact, in the latter case, the noise variance is also reduced to, but to precisely recover the desired setting corresponding to measurements, the recovery algorithm can artificially add noise to each entry. ∎
Theorem 3 not only shows that the scaling laws of Corollary 1 cannot be improved under i.i.d. measurements (in which case is close to with high probability), but also that no further improvements (beyond constant factors) are possible even for general measurement matrices having a similar Frobenius norm. The result holds under the assumption that with a sufficiently large implied constant, which is a very mild assumption since for fixed and , the right-hand side tends to zero as grows large (whereas typical Lipschitz constants are at least equal to one, if not much higher). (Footnote 3: In fact, if we were to have , then the scaling of Corollary 1 would seemingly not make sense. The explanation is that in this regime, outputting any suffices for the recovery guarantee, and no measurements are needed at all.)
III-D Extension to the Spherical Domain
The above analysis focuses on the rectangular domain . At first glance, it may appear non-trivial to use the same ideas to obtain corresponding lower bounds for the spherical domain . However, in the following we show that by simply considering the largest possible -ball inside the -ball, we can obtain a matching lower bound to Corollary 1 even for spherical domains. The fact that this crude approach gives a tight result is somewhat surprising, and is discussed further below.
Let denote the above-formed generative model for rectangular domains with radius , and note that . To handle the spherical domain , we construct the generative model as follows:
For any , we simply let . It is only these input values that will be used to establish the lower bound, as these values alone suffice for generating all of . However, we still need to set the other values to ensure that Lipschitz continuity is maintained.
To handle the other values of , we extend the functions in Figure 2 (with in place of ) to take values on the whole real line: For all values outside the indicated interval, each function value simply remains zero.
The preceding dot point leads to a Lipschitz-continuous function defined on all of , and we simply take to be that function restricted to .
By the first dot point above, we can directly apply Theorem 3 with in place of , yielding the following.
Theorem 4. Consider the problem of compressive sensing with -Lipschitz generative models, with input domain and i.i.d. noise. Let and be fixed constants, and assume that with a sufficiently large implied constant. Then there exists an -Lipschitz generative model (and associated output dimension ) such that, for any satisfying , if we have
then we must also have .
This result establishes the tightness of Corollary 1 up to constant factors for spherical domains. The assumption is different from that of Theorem 3, but is similarly mild (see the discussion in Footnote 3).
The above reduction may appear to be overly crude, because as grows large the volume of is a vanishingly small fraction of the volume of . However, as discussed following Theorem 1, the key geometric quantity in the proof of the upper bound is in fact the covering number (see also Appendix -A), and both and yield the same scaling laws for the logarithm of the covering number (with a sufficiently small distance parameter). As a result, it is reasonable to expect that these two domains also require the same scaling laws on the number of measurements.
IV Generative Models Based on ReLU Networks
In this section, as opposed to considering general Lipschitz-continuous generative models, we provide a more detailed treatment of neural networks with ReLU activations (see Appendix -C for brief definitions). We are particularly interested in comparing against the following result from [14]; this result holds even when the domain is unbounded (), so we do not need to distinguish between the rectangular and spherical domains.
Theorem 5 ([14, Thm. 1.1]) Let be a generative model from a -layer neural network with ReLU activations (Footnote 4: As discussed in [14], the same result holds for any piecewise linear activation with two components, e.g., leaky ReLU.) and at most nodes per layer, and let be a random measurement matrix whose entries are i.i.d. with distribution . Given the observed vector , let minimize to within additive error of the optimum over . Then, with a number of measurements satisfying for a universal constant and any , the following holds with probability :
It is interesting to note that this result makes no assumptions about the neural network weights (nor domain size), but rather, only the input size, width, and depth. In addition, we have the following counterpart to Corollary 1, with a slight modification to only state the existence of a good matrix rather than concerning Gaussian random matrices.
Corollary 2. Consider the setup of Theorem 5 with for some , no optimization error (), and i.i.d. Gaussian noise with , but with a deterministic measurement matrix in place of the random Gaussian measurement matrix. Then, when for a universal constant , there exists some such with such that
for a universal constant .
We need to modify the proof of Corollary 1, since in principle we may no longer have a bound on the error when the high-probability event in Theorem 5 fails. Fortunately, an inspection of the proof of Theorem 5 in [14] reveals that the high-probability event only amounts to establishing properties of , most notably including the so-called restricted eigenvalue condition. Since the conclusion of Theorem 5 holds on average when has i.i.d. entries, it also holds for the best possible choice of . Since standard concentration [2, Sec. 2.1] yields with probability for Gaussian measurements, we may also assume that the “best possible” here satisfies such a condition.
Before establishing corresponding lower bounds to this result, it is useful to first discuss how the generative model from Figure 2 can be constructed using ReLU networks; this is done in Section IV-A. In Section IV-B, we build on these ideas to form different (but related) generative models that properly reveal the dependence of on the width and depth.
IV-A Constructing the Generative Model Used in Theorem 3
In the case of a rectangular domain, the triangular shapes of the mappings in Figure 2 are such that the generative model can directly be implemented as a ReLU network with a single hidden layer. Indeed, this would remain true if the mappings between and (with being a single entry of ) in Figure 2 were replaced by any piecewise linear function .
A limitation of this interpretation as a one-layer ReLU network is that for increasing values of , the corresponding network has increasingly large weights. In particular, for fixed values of and , a re-arrangement of (18) gives , which amounts to large weights in the case that .
In the following, we argue that the construction of Figure 2 can be implemented using a deep ReLU network with bounded weights. To see this, we use similar ideas to those used to generate rapidly-varying (e.g., “sawtooth”) functions using ReLU networks .
Consider the functions and shown in Figure 3. If we compose with itself times, then we obtain a function equaling for , equaling for , and linearly interpolating in between. By further composing this function with , we obtain a function of the form shown in Figure 3 (Right), which matches those in Figure 2. By incorporating suitable offsets into this procedure, one can obtain the same “double triangular” shape shifted along the horizontal axis, and hence recover all of the mappings shown in Figure 2.
Since the steepest slope among and has a gradient of , both of these functions can be implemented with a single hidden layer with -bounded weights and -bounded offsets. To bring the -width “double triangular” region down to the width in Figure 2, we need compositions of (each of which adds another layer to the network). (Footnote 5: The case that is not a power of two can be handled by slightly modifying the function in Figure 3, i.e., moving the changepoints currently occurring at and .) Finally, the number of one-dimensional mappings of the form shown in Figure 2 is , and we let the network incorporate these in parallel. Combining these findings, we have the following.
Consider the setup of Theorem 3, and suppose that . Then, the generative model therein can be realized using a ReLU network with depth , width , weights bounded by , and offsets bounded by .
Note that the assumption is very mild in view of (21), and even if one wishes to handle more general values, it is not difficult to generalize the above arguments accordingly.
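The idea of sharpening a transition by repeated composition, while keeping every weight and offset bounded, can be illustrated with a small sketch. The function `steepen` below is a hypothetical stand-in for the slope-two function in Figure 3 (whose exact form is not reproduced here). It is a single-hidden-layer ReLU network equal to a clipped linear map, and composing it with itself doubles the slope of the transition region at each step.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def steepen(x):
    """One hidden layer with weights at most 2 and offsets at most 1.

    Equals clip(2x, -1, 1): slope 2 near the origin, saturating at -1 and +1.
    """
    return relu(2.0 * x + 1.0) - relu(2.0 * x - 1.0) - 1.0

def compose(x, times):
    """Apply `steepen` repeatedly; each application adds one layer."""
    for _ in range(times):
        x = steepen(x)
    return x

x = np.linspace(-2.0, 2.0, 1001)
for c in range(1, 6):
    # After c compositions the transition region has slope 2**c.
    assert np.allclose(compose(x, c), np.clip(2.0**c * x, -1.0, 1.0))
```

Under these assumptions, a transition of slope 2**c costs c layers but no weight larger than 2, which mirrors the trade-off between depth and weight magnitude discussed above.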
IV-B Understanding the Dependence on Width and Depth
Thus far, we have considered forming a generative model capable of producing -group-sparse signals, which leads to a lower bound of . While this precise approach does not appear to be suited to properly understanding the dependence on width and depth in Theorem 5, we now show that a simple variant indeed suffices: We form a wide and/or deep ReLU network capable of producing all -sparse signals for some that may be much larger than one.
It is instructive to first consider the case and , and to construct a non-continuous generative model that will later be slightly modified to be continuous. For later convenience, we momentarily denote the output length by . We consider the interval , which we view as being split into small intervals of equal length; note that is the number of possible signed sparsity patterns for group-sparse signals of length with exactly non-zero entries. The idea is to let each value of corresponding to the mid-point of a given length- interval in produce a signal with a different sparsity pattern.
In more detail, we consider the following (see Figure 4 for an illustration):
(Coarsest scale) The interval is split into intervals of length . Then:
If lies in the first interval, we have , and if lies in the second interval, we have (in all other cases, );
If lies in the third interval, we have , and if lies in the fourth interval, we have (in all other cases, );
This continues similarly for .
(Second coarsest scale) Each interval at the coarsest scale is split into equal sub-intervals of length . Then, within each of the coarsest intervals, we have the following:
If lies in the first sub-interval, we have , and if lies in the second sub-interval, we have (in all other cases, );
If lies in the third sub-interval, we have , and if lies in the fourth sub-interval, we have (in all other cases, );
This continues similarly for .
We continue recursively until we are at the finest scale with sub-intervals of length that dictate the values of .
While the discontinuous points in Figure 4 are problematic when it comes to implementation with a ReLU network, we can overcome this by simply replacing them by straight-line transitions that have a finite slope (i.e., the rectangular shapes become trapezoidal), while being sufficiently sharp so that all the input values at the midpoints of the length- intervals produce the same outputs as the idealized function described above. Then, ReLU-based implementation is mathematically possible, since the mappings are piecewise linear.
The above construction generates all -group-sparse signals in with non-zero entries equaling . To see this, one can consider “entering” the appropriate coarsest region according to the desired location and sign in the first block (of length ) of the -sparse signal, then recursively entering the appropriate second-coarsest region based on the second block, and so on.
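The recursive "entering" of intervals amounts to reading off the digits of the scalar input in a suitable base. The sketch below illustrates this for the idealized (discontinuous) mapping, evaluated only at interval midpoints. The input domain [0, 1), the names `k` (number of blocks), `b` (block length), and `xi` (non-zero magnitude), and the convention that even digits give a positive sign are all assumptions made for illustration.

```python
import numpy as np
from itertools import product

def pattern_from_scalar(z, k, b, xi=1.0):
    """Map z in [0, 1) to a length k*b signal with one non-zero (+/- xi) per block."""
    base = 2 * b                        # signed choices per block: position and sign
    x = np.zeros(k * b)
    for block in range(k):              # coarsest scale first
        z *= base
        digit = min(int(z), base - 1)   # sub-interval containing z at this scale
        z -= digit
        position, negative = divmod(digit, 2)
        x[block * b + position] = -xi if negative else xi
    return x

def midpoint(digits, b):
    """Midpoint of the finest-scale interval addressed by the given digits."""
    base = 2 * b
    z = sum(d / base ** (i + 1) for i, d in enumerate(digits))
    return z + 0.5 / base ** len(digits)

k, b = 3, 2
patterns = {
    tuple(pattern_from_scalar(midpoint(d, b), k, b))
    for d in product(range(2 * b), repeat=k)
}
print(len(patterns), (2 * b) ** k)
```

Every midpoint yields a distinct signed sparsity pattern, so the number of distinct outputs matches the number of finest-scale intervals.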
To generalize the above ideas to -input generative models, we form such functions in parallel, thereby allowing the generation of -group-sparse signals in (with ) having non-zero entries . Then, we can use Lemma 2 and a suitable choice of to deduce the following.
Fix , and consider the problem of compressive sensing with generative models under i.i.d. noise, a measurement matrix satisfying , and the above-described generative model with parameters , , , and . Then, if for an absolute constant , then there exists a constant such that the choice yields the following:
Any algorithm attaining must also have (or equivalently , since ).
The generative function can be implemented as a ReLU network with a single hidden layer (i.e., ) of width at most .
Alternatively, if is an integer power of two, the generative function can be implemented as a ReLU network with depth and width . (Footnote 6: This is a mild assumption given that we already assumed an unspecified constant ; for general , one can consider only using the first entries of each length- vector for the largest possible integer value of , and letting the remaining entries always be zero. This means that at least half of the entries are used, and the same follows for the entries of the combined output.)
In the settings described in the second and third dot points, the sample complexity from Corollary 2 behaves as and respectively.
The first claim is proved similarly to the proof of Theorem 3, so we only outline the differences. In accordance with Lemma 2, let be the largest integer smaller than . Then, Lemma 2 states that if and , it is not possible to achieve . Substituting this definition of into this choice of gives the claimed behavior with . The first claim follows by using the argument at the end of the proof of Theorem 3 to argue that since the recovery goal cannot be attained when , it also cannot be attained when .
For the second claim, we observe that each mapping from to in Figure 4 has a bounded number of rectangular “pieces”, and at the -th scale, the number of pieces is . Summing over gives a total of at most pieces. Recall also that these rectangles are replaced by trapezoidal shapes to make them implementable. Hence, we can apply the well-known fact that any piecewise-linear function with pieces can be implemented using a ReLU network of width with a single hidden layer , and in our case we have . The desired claim follows by multiplying by in accordance with the fact that we implement the network of Figure 4 in parallel times.
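The fact that a piecewise-linear function can be realized by a single hidden layer whose width is linear in the number of pieces can be sketched as follows. The function names and the trapezoidal example are illustrative. The construction uses one hidden ReLU unit per breakpoint, with the output weight equal to the change in slope at that breakpoint.

```python
import numpy as np

def relu_net_from_breakpoints(xs, ys):
    """Return (biases, out_weights, offset) of a one-hidden-layer ReLU network
    that linearly interpolates the points (xs, ys) on [xs[0], xs[-1]]."""
    xs, ys = np.asarray(xs, float), np.asarray(ys, float)
    slopes = np.diff(ys) / np.diff(xs)
    # One hidden unit per breakpoint; its output weight is the slope change there.
    out_weights = np.diff(slopes, prepend=0.0)
    return xs[:-1], out_weights, ys[0]

def evaluate(x, biases, out_weights, offset):
    """Evaluate offset + sum_i out_weights[i] * relu(x - biases[i])."""
    hidden = np.maximum(x[:, None] - biases[None, :], 0.0)
    return offset + hidden @ out_weights

# A single trapezoidal pulse on [0, 1] with 5 linear pieces.
xs = [0.0, 0.1, 0.2, 0.8, 0.9, 1.0]
ys = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0]
biases, out_weights, offset = relu_net_from_breakpoints(xs, ys)
grid = np.linspace(0.0, 1.0, 501)
print(len(biases), np.allclose(evaluate(grid, biases, out_weights, offset),
                               np.interp(grid, xs, ys)))
```

The hidden width equals the number of linear pieces, so running several such one-dimensional mappings in parallel multiplies the width by the number of mappings.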
Due to the periodic nature of the signals in Figure 4, the third claim also follows using well-established ideas . We would like to produce trapezoidal pulses at regular intervals similarly to Figure 4. To obtain the positive pulses, we can take a half-trapezoidal shape of the form in Figure 5 (Right) and pass it through a sawtooth function having some number of triangular regions as in Figure 5 (Middle), possibly using suitable offsets to shift the location. The negative pulses can be produced similarly, and the two can be added together in the final layer.
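The sawtooth idea can be sketched with the standard tent-map composition, which is one well-known way to obtain many triangular regions from a narrow, deep ReLU network. The specific tent function below, defined on [0, 1], is an assumption for illustration and is not necessarily the exact function in Figure 5.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def tent(x):
    """Two hidden ReLU units: equals 2x on [0, 1/2] and 2 - 2x on [1/2, 1]."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def sawtooth(x, depth):
    """Compose the tent map `depth` times, giving 2**depth linear pieces on [0, 1]."""
    for _ in range(depth):
        x = tent(x)
    return x

depth = 4
points = np.arange(2**depth + 1) / 2**depth
# Zeros at even multiples of 2**-depth and peaks at odd multiples.
print(sawtooth(points, depth))
```

Each composition doubles the number of linear pieces while adding only one layer of constant width, which is why the depth grows only logarithmically with the number of repetitions.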
As exemplified in Figure 5 and proved in , the -piece sawtooth function itself can be implemented by a network with width and depth when is a power of two. In our case, the maximal number of such repetitions is (at the finest scale), and since this is a power of two by assumption, the depth required is