Compressed Sensing using Generative Models

03/09/2017 ∙ by Ashish Bora, et al.

The goal of compressed sensing is to estimate a vector from an underdetermined system of noisy linear measurements, by making use of prior knowledge on the structure of vectors in the relevant domain. For almost all results in this literature, the structure is represented by sparsity in a well-chosen basis. We show how to achieve guarantees similar to standard compressed sensing but without employing sparsity at all. Instead, we suppose that vectors lie near the range of a generative model G: R^k → R^n. Our main theorem is that, if G is L-Lipschitz, then roughly O(k log L) random Gaussian measurements suffice for an ℓ_2/ℓ_2 recovery guarantee. We demonstrate our results using generative models from published variational autoencoder and generative adversarial networks. Our method can use 5-10x fewer measurements than Lasso for the same accuracy.


1 Introduction

Compressive or compressed sensing is the problem of reconstructing an unknown vector $x^* \in \mathbb{R}^n$ after observing $m < n$ linear measurements of its entries, possibly with added noise:

$$y = A x^* + \eta,$$

where $A \in \mathbb{R}^{m \times n}$ is called the measurement matrix and $\eta \in \mathbb{R}^m$ is noise. Even without noise, this is an underdetermined system of linear equations, so recovery is impossible unless we make an assumption on the structure of the unknown vector $x^*$. We need to assume that the unknown vector is “natural,” or “simple,” in some application-dependent way.
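To make the "underdetermined" point concrete, here is a minimal NumPy sketch. The dimensions, the random seed, and the variable names are illustrative assumptions, and the minimum-norm solver merely stands in for "some solution with no structural prior": it matches every measurement exactly yet does not recover the unknown vector.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 20                                   # n unknowns, only m < n measurements
x_true = rng.standard_normal(n)                 # hypothetical unknown vector x*
A = rng.standard_normal((m, n)) / np.sqrt(m)    # Gaussian measurement matrix
y = A @ x_true                                  # noiseless measurements (eta = 0)

# Minimum-norm solution of the underdetermined system A x = y.
x_min_norm, *_ = np.linalg.lstsq(A, y, rcond=None)

fits_measurements = np.allclose(A @ x_min_norm, y)
recovers_x_true = np.allclose(x_min_norm, x_true)
print(fits_measurements, recovers_x_true)
```

Any structural assumption (sparsity, or closeness to the range of a generator) serves to pick the right candidate among the infinitely many vectors that are consistent with the measurements.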

The most common structural assumption is that the vector $x^*$ is $k$-sparse in some known basis (or approximately $k$-sparse). Finding the sparsest solution to an underdetermined system of linear equations is NP-hard, but convex optimization can still provably recover the true sparse vector $x^*$ if the matrix $A$ satisfies conditions such as the Restricted Isometry Property (RIP) or the related Restricted Eigenvalue Condition (REC) [35, 7, 14, 6]. The problem is also called high-dimensional sparse linear regression, and there is a vast literature on establishing conditions for different recovery algorithms, different assumptions on the design of $A$, and generalizations of RIP and REC for other structures; see e.g. [6, 33, 1, 30, 3].

This significant interest is justified since a large number of applications can be expressed as recovering an unknown vector from noisy linear measurements. For example, many tomography problems can be expressed in this framework: $x^*$ is the unknown true tomographic image and the linear measurements are obtained by x-ray or another physical sensing system that produces sums or more general linear projections of the unknown pixels. Compressed sensing has been studied extensively for medical applications including computed tomography (CT) [8], rapid MRI [31] and neuronal spike train recovery [21]. Another impressive application is the “single pixel camera” [15], where digital micro-mirrors provide linear combinations to a single pixel sensor that then uses compressed sensing reconstruction algorithms to reconstruct an image. These results have been extended by combining sparsity with additional structural assumptions [4, 22], and by generalizations such as translating sparse vectors into low-rank matrices [33, 3, 17]. These results can improve performance when the structural assumptions fit the sensed signals. Other works perform “dictionary learning,” seeking overcomplete bases where the data is more sparse (see [9] and references therein).

In this paper, instead of relying on sparsity, we use structure from a generative model. Recently, several neural network based generative models such as variational auto-encoders (VAEs) [26] and generative adversarial networks (GANs) [19] have found success at modeling data distributions. In these models, the generative part learns a mapping from a low dimensional representation space $z \in \mathbb{R}^k$ to the high dimensional sample space $G(z) \in \mathbb{R}^n$. While training, this mapping is encouraged to produce vectors that resemble the vectors in the training dataset. We can therefore use any pre-trained generator to approximately capture the notion of a vector being “natural” in our domain: the generator defines a probability distribution over vectors in sample space and tries to assign higher probability to more likely vectors, for the dataset it has been trained on. We expect that vectors “natural” to our domain will be close to some point in the support of this distribution, i.e., in the range of $G$.

Our Contributions: We present an algorithm that uses generative models for compressed sensing. Our algorithm simply uses gradient descent to optimize the representation $z \in \mathbb{R}^k$ such that the corresponding image $G(z)$ has small measurement error $\|AG(z) - y\|_2^2$. While this is a nonconvex objective to optimize, we empirically find that gradient descent works well, and the results can significantly outperform Lasso with relatively few measurements.

We obtain theoretical results showing that, as long as gradient descent finds a good approximate solution to our objective, our output $G(z)$ will be almost as close to the true $x^*$ as the closest possible point in the range of $G$.

The proof is based on a generalization of the Restricted Eigenvalue Condition (REC) that we call the Set-Restricted Eigenvalue Condition (S-REC). Our main theorem is that if a measurement matrix satisfies the S-REC for the range of a given generator $G$, then the measurement error minimization optimum is close to the true $x^*$. Furthermore, we show that random Gaussian measurement matrices satisfy the S-REC condition with high probability for large classes of generators. Specifically, for $d$-layer neural networks such as VAEs and GANs, we show that $O(kd \log n)$ Gaussian measurements suffice to guarantee good reconstruction with high probability. One result, for ReLU-based networks, is the following:

Theorem 1.1.

Let $G: \mathbb{R}^k \to \mathbb{R}^n$ be a generative model from a $d$-layer neural network using ReLU activations. Let $A \in \mathbb{R}^{m \times n}$ be a random Gaussian matrix for $m = O(kd \log n)$, scaled so $A_{i,j} \sim N(0, 1/m)$. For any $x^* \in \mathbb{R}^n$ and any observation $y = Ax^* + \eta$, let $\hat{z}$ minimize $\|y - AG(z)\|_2$ to within additive $\epsilon$ of the optimum. Then with $1 - e^{-\Omega(m)}$ probability,

$$\|G(\hat{z}) - x^*\|_2 \le 6 \min_{z^* \in \mathbb{R}^k} \|G(z^*) - x^*\|_2 + 3\|\eta\|_2 + 2\epsilon.$$

Let us examine the terms in our error bound in more detail. The first two are the minimum possible error of any vector in the range of the generator and the norm of the noise; these are necessary for such a technique, and have direct analogs in standard compressed sensing guarantees. The third term $\epsilon$ comes from gradient descent not necessarily converging to the global optimum; empirically, $\epsilon$ does seem to converge to zero, and one can check post-observation that this is small by computing the upper bound $\|y - AG(\hat{z})\|_2$.

While the above is restricted to ReLU-based neural networks, we also show similar results for arbitrary $L$-Lipschitz generative models, for $m \approx O(k \log L)$. Typical neural networks have $\mathrm{poly}(n)$-bounded weights in each layer, so $L \le n^{O(d)}$, giving for all activation functions the same $O(kd \log n)$ sample complexity as for ReLU networks.

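Theorem 1.2.

Let $G: \mathbb{R}^k \to \mathbb{R}^n$ be an $L$-Lipschitz function. Let $A \in \mathbb{R}^{m \times n}$ be a random Gaussian matrix for $m = O(k \log \frac{Lr}{\delta})$, scaled so $A_{i,j} \sim N(0, 1/m)$. For any $x^* \in \mathbb{R}^n$ and any observation $y = Ax^* + \eta$, let $\hat{z}$ minimize $\|y - AG(z)\|_2$ to within additive $\epsilon$ of the optimum over vectors with $\|\hat{z}\|_2 \le r$. Then with $1 - e^{-\Omega(m)}$ probability,

$$\|G(\hat{z}) - x^*\|_2 \le 6 \min_{z^* \in \mathbb{R}^k,\ \|z^*\|_2 \le r} \|G(z^*) - x^*\|_2 + 3\|\eta\|_2 + 2\epsilon + 2\delta.$$

The downside is two minor technical conditions: we only optimize over representations $z$ with $\|z\|$ bounded by $r$, and our error gains an additive $\delta$ term. Since the dependence on these parameters is $\log(rL/\delta)$, and $L$ is something like $n^{O(d)}$, we may set $r = n^{O(d)}$ and $\delta = 1/n^{O(d)}$ while only losing constant factors, making these conditions very mild. In fact, generative models normally have the coordinates of $z$ be independent uniform or Gaussian, so $\|z\| \approx \sqrt{k} \ll n^d$, and a constant signal-to-noise ratio would have $\|\eta\|_2 \approx \|x^*\| \approx \sqrt{n} \gg 1/n^d$.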

We remark that, while these theorems are stated in terms of Gaussian matrices, the proofs only involve the distributional Johnson-Lindenstrauss property of such matrices. Hence the same results hold for matrices with subgaussian entries or fast-JL matrices [2].
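The norm-preservation property mentioned above can be observed numerically. The following is a small sketch, with illustrative dimensions chosen here (not taken from the paper), comparing a Gaussian matrix with entries of variance 1/m against a random sign matrix, one simple example of a matrix with subgaussian entries.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 2000, 1000
x = rng.standard_normal(n)                       # any fixed vector

# Gaussian entries with zero mean and variance 1/m.
A_gauss = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
# Random signs scaled by 1/sqrt(m): subgaussian entries, same variance.
A_sign = rng.choice([-1.0, 1.0], size=(m, n)) / np.sqrt(m)

ratio_gauss = np.linalg.norm(A_gauss @ x) / np.linalg.norm(x)
ratio_sign = np.linalg.norm(A_sign @ x) / np.linalg.norm(x)
print(ratio_gauss, ratio_sign)   # both should be close to 1
```

For a fixed vector, the ratio concentrates around 1 as m grows; the theorems need this to hold simultaneously over a whole set, which is where the measurement bounds come from.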

2 Our Algorithm

All norms are 2-norms unless specified otherwise.

Let $x^* \in \mathbb{R}^n$ be the vector we wish to sense. Let $A \in \mathbb{R}^{m \times n}$ be the measurement matrix and $\eta \in \mathbb{R}^m$ be the noise vector. We observe the measurements $y = Ax^* + \eta$. Given $y$ and $A$, our task is to find a reconstruction $\hat{x}$ close to $x^*$.

A generative model is given by a deterministic function $G: \mathbb{R}^k \to \mathbb{R}^n$, and a distribution $P_Z$ over $z \in \mathbb{R}^k$. To generate a sample from the generator, we can draw $z \sim P_Z$ and the sample then is $G(z)$. Typically, we have $k \ll n$, i.e. the generative model maps from a low dimensional representation space to a high dimensional sample space.

Our approach is to find a vector in representation space such that the corresponding vector in the sample space matches the observed measurements. We thus define the objective to be

$$\mathrm{loss}(z) = \|AG(z) - y\|^2. \tag{1}$$

By using any optimization procedure, we can minimize $\mathrm{loss}(z)$ with respect to $z$. In particular, if the generative model $G$ is differentiable, we can evaluate the gradients of the loss with respect to $z$ using backpropagation and use standard gradient based optimizers. If the optimization procedure terminates at $\hat{z}$, our reconstruction for $x^*$ is $G(\hat{z})$. We define the measurement error to be $\|AG(\hat{z}) - y\|^2$ and the reconstruction error to be $\|G(\hat{z}) - x^*\|^2$.
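The procedure described in this section can be sketched in a few lines of NumPy. Everything specific below is an assumption made for illustration: the "generator" is a random two-layer ReLU network rather than a trained VAE or GAN, the gradient is written out by hand instead of using an autodiff framework, and plain gradient descent with a simple accept/reject step-size rule stands in for whichever optimizer one would use in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
k, h, n, m = 5, 40, 100, 30      # latent, hidden, signal, measurement dims

# Hypothetical stand-in for a pre-trained generator: random 2-layer ReLU net.
W1 = rng.standard_normal((h, k)) / np.sqrt(k)
W2 = rng.standard_normal((n, h)) / np.sqrt(h)

def G(z):
    return W2 @ np.maximum(W1 @ z, 0.0)

def loss_and_grad(z, A, y):
    """Return ||A G(z) - y||^2 and its gradient w.r.t. z (manual backprop)."""
    pre = W1 @ z
    r = A @ (W2 @ np.maximum(pre, 0.0)) - y
    grad_x = 2.0 * (A.T @ r)                    # d loss / d G(z)
    grad_pre = (W2.T @ grad_x) * (pre > 0)      # back through the ReLU
    return float(r @ r), W1.T @ grad_pre

def reconstruct(A, y, restarts=5, steps=300):
    """Random restarts; keep the z with the smallest measurement error."""
    best_z, best_loss = None, np.inf
    for _ in range(restarts):
        z, step = rng.standard_normal(k), 1e-2
        f, g = loss_and_grad(z, A, y)
        for _ in range(steps):
            z_new = z - step * g
            f_new, g_new = loss_and_grad(z_new, A, y)
            if f_new < f:                       # accept, grow the step a little
                z, f, g, step = z_new, f_new, g_new, step * 1.2
            else:                               # reject, shrink the step
                step *= 0.5
        if f < best_loss:
            best_z, best_loss = z, f
    return best_z, best_loss

A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
x_true = G(rng.standard_normal(k))              # a signal in the range of G
y = A @ x_true + 0.01 * rng.standard_normal(m)
z_hat, meas_err = reconstruct(A, y)
recon_err = float(np.sum((G(z_hat) - x_true) ** 2))
print(meas_err, recon_err)
```

Selecting the restart with the lowest measurement error uses only quantities available to the algorithm; the reconstruction error is computed here only because the toy example knows the true signal.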

3 Related Work

Several recent lines of work explore generative models for reconstruction. The first line of work attempts to project an image onto the representation space of the generator. These works assume full knowledge of the image, and are special cases of the linear measurements framework where the measurement matrix is the identity. Excellent reconstruction results with SGD in the representation space to find an image in the generator range have been reported by [28] with stochastic clipping and [11] with logistic measurement loss. A different approach is introduced in [16] and [12]. In their method, a recognition network that maps from the sample space vector to the representation space vector is learned jointly with the generator in an adversarial setting.

A second line of work explores reconstruction with structured partial observations. The inpainting problem consists of predicting the values of missing pixels given a part of the image. This is a special case of linear measurements where each measurement corresponds to an observed pixel. The use of generative models for this task has been studied in [38], where the objective is taken to be a combination of the error in the measurements and a perceptual loss term given by the discriminator. Super-resolution is a related task that attempts to increase the resolution of an image. We can view this problem as observing local spatial averages of the unknown higher resolution image and hence cast this as another special case of linear measurements. For prior work on super-resolution see e.g. [37, 13, 23] and references therein.
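To see why inpainting is a special case of linear measurements, note that observing a subset of pixels amounts to multiplying the vectorized image by a row-selection matrix. A minimal sketch, with a made-up 4×4 image and an arbitrary set of observed pixel indices:

```python
import numpy as np

rng = np.random.default_rng(0)
height, width = 4, 4
x = rng.random(height * width)                # vectorized image (hypothetical)
observed = np.array([0, 3, 5, 6, 10, 15])     # indices of the observed pixels

# Each row of A is a standard basis vector that picks out one observed pixel.
A = np.eye(height * width)[observed]
y = A @ x                                     # measurements = observed pixels
```

With this `A`, the objective $\|AG(z) - y\|^2$ only penalizes disagreement on the observed pixels, and the generator fills in the rest.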

We also take note of the related work of [18] that connects model-based compressed sensing with the invertibility of Convolutional Neural Networks.

A related result appears in [5], which studies the measurement complexity of an RIP condition for smooth manifolds. This is analogous to our S-REC for the range of $G$, but the range of $G$ is neither smooth (because of ReLUs) nor a manifold (because of self-intersection). Their recovery result was extended in [20] to unions of two manifolds.

4 Theoretical Results

We begin with a brief review of the Restricted Eigenvalue Condition (REC) in standard compressed sensing. The REC is a sufficient condition on $A$ for robust recovery to be possible. The REC essentially requires that all “approximately sparse” vectors are far from the nullspace of the matrix $A$. More specifically, $A$ satisfies REC for a constant $\gamma > 0$ if for all approximately sparse vectors $x$,

$$\|Ax\| \ge \gamma \|x\|. \tag{2}$$

It can be shown that this condition is sufficient for recovery of sparse vectors using Lasso. If one examines the structure of Lasso recovery proofs, a key property that is used is that the difference of any two $k$-sparse vectors is also approximately sparse (for sparsity up to $2k$). This is a coincidence that is particular to sparsity. By contrast, the difference of two vectors “natural” to our domain may not itself be natural. The condition we need is that the difference of any two natural vectors is far from the nullspace of $A$.

We propose a generalized version of the REC for a set $S \subseteq \mathbb{R}^n$ of vectors, the Set-Restricted Eigenvalue Condition (S-REC):

Definition 1.

Let $S \subseteq \mathbb{R}^n$. For some parameters $\gamma > 0$, $\delta \ge 0$, a matrix $A \in \mathbb{R}^{m \times n}$ is said to satisfy the S-REC$(S, \gamma, \delta)$ if $\forall\, x_1, x_2 \in S$,

$$\|A(x_1 - x_2)\| \ge \gamma \|x_1 - x_2\| - \delta.$$

There are two main differences between the S-REC and the standard REC in compressed sensing. First, the condition applies to differences of vectors in an arbitrary set $S$ of “natural” vectors, rather than just the set of approximately $k$-sparse vectors in some basis. This will let us apply the definition to $S$ being the range of a generative model.

Second, we allow an additive slack term $\delta$. This is necessary for us to achieve the S-REC when $S$ is the output of general Lipschitz functions. Without it, the S-REC depends on the behavior of $S$ at arbitrarily small scales. Since there are arbitrarily many such local regions, one cannot guarantee the existence of an $A$ that works for all these local regions. Fortunately, as we shall see, poor behavior at a small scale will only increase our error by $O(\delta)$.

The S-REC definition requires that for any two vectors in $S$, if they are significantly different (so the right hand side is large), then the corresponding measurements should also be significantly different (left hand side). Hence we can hope to approximate the unknown vector from the measurements, if the measurement matrix satisfies the S-REC.
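The condition can be probed empirically on a finite sample of a set. The helper below is a hypothetical illustration (its name and interface are assumptions made here): it returns the smallest slack for which the inequality $\|A(x_1 - x_2)\| \ge \gamma\|x_1 - x_2\| - \delta$ holds over all sampled pairs. Passing such a check on a sample does not certify the condition on the whole set.

```python
import numpy as np

def srec_slack(A, X, gamma):
    """Smallest delta >= 0 such that ||A(x1 - x2)|| >= gamma*||x1 - x2|| - delta
    for every pair of rows x1, x2 of X (a finite sample of the set S)."""
    diffs = (X[:, None, :] - X[None, :, :]).reshape(-1, X.shape[1])
    lhs = np.linalg.norm(diffs @ A.T, axis=1)      # ||A(x1 - x2)||
    rhs = gamma * np.linalg.norm(diffs, axis=1)    # gamma * ||x1 - x2||
    return max(0.0, float(np.max(rhs - lhs)))

# Example: samples from a toy 3-dimensional subspace of R^100, Gaussian A.
rng = np.random.default_rng(0)
n, m, k = 100, 40, 3
basis = rng.standard_normal((n, k))
X = rng.standard_normal((50, k)) @ basis.T
A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
print(srec_slack(A, X, gamma=0.5))
```

A returned value of zero means the sampled pairs satisfy the inequality with no slack at the chosen $\gamma$.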

But how can we find such a matrix? To answer this, we present two lemmas showing that random Gaussian matrices of relatively few measurements $m$ satisfy the S-REC for the outputs of large and practically useful classes of generative models $G: \mathbb{R}^k \to \mathbb{R}^n$.

In the first lemma, we assume that the generative model $G$ is $L$-Lipschitz, i.e., $\forall\, z_1, z_2 \in \mathbb{R}^k$, we have

$$\|G(z_1) - G(z_2)\| \le L \|z_1 - z_2\|.$$

Note that state of the art neural network architectures with linear layers, (transposed) convolutions, max-pooling, residual connections, and all popular non-linearities satisfy this assumption. In Lemma 8.5 in the Appendix we give a simple bound on $L$ in terms of parameters of the network; for typical networks this is $n^{O(d)}$. We also require the input $z$ to the generator to have bounded norm. Since generative models such as VAEs and GANs typically assume their input is drawn with independent uniform or Gaussian inputs, this only prunes an exponentially unlikely fraction of the possible outputs.

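Lemma 4.1.

Let $G: \mathbb{R}^k \to \mathbb{R}^n$ be $L$-Lipschitz. Let

$$B^k(r) = \{ z \mid z \in \mathbb{R}^k, \|z\| \le r \}$$

be an $\ell_2$-norm ball in $\mathbb{R}^k$. For $\alpha < 1$, if

$$m = \Omega\left( \frac{k}{\alpha^2} \log \frac{Lr}{\delta} \right),$$

then a random matrix $A \in \mathbb{R}^{m \times n}$ with IID entries such that $A_{ij} \sim N(0, \frac{1}{m})$ satisfies the S-REC$(G(B^k(r)), 1 - \alpha, \delta)$ with $1 - e^{-\Omega(\alpha^2 m)}$ probability.

All proofs, including this one, are deferred to Appendix A.

Note that even though we proved the lemma for an $\ell_2$ ball, the same technique works for any compact set.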

For our second lemma, we assume that the generative model is a neural network such that each layer is a composition of a linear transformation followed by a pointwise non-linearity. Many common generative models have such architectures. We also assume that all non-linearities are piecewise linear with at most two pieces. The popular ReLU or LeakyReLU non-linearities satisfy this assumption. We do not make any other assumption, and in particular, the magnitude of the weights in the network does not affect our guarantee.

Lemma 4.2.

Let $G: \mathbb{R}^k \to \mathbb{R}^n$ be a $d$-layer neural network, where each layer is a linear transformation followed by a pointwise non-linearity. Suppose there are at most $c$ nodes per layer, and the non-linearities are piecewise linear with at most two pieces, and let

$$m = \Omega\left( \frac{1}{\alpha^2} k d \log c \right)$$

for some $\alpha < 1$. Then a random matrix $A \in \mathbb{R}^{m \times n}$ with IID entries $A_{ij} \sim N(0, \frac{1}{m})$ satisfies the S-REC$(G(\mathbb{R}^k), 1 - \alpha, 0)$ with $1 - e^{-\Omega(\alpha^2 m)}$ probability.

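To show Theorems 1.1 and 1.2, we just need to show that the S-REC implies good recovery. In order to make our error guarantee relative to error in the image space $\mathbb{R}^n$, rather than in the measurement space $\mathbb{R}^m$, we also need that $A$ preserves norms with high probability [10]. Fortunately, Gaussian matrices (or other distributional JL matrices) satisfy this property.

Lemma 4.3.

Let $A \in \mathbb{R}^{m \times n}$ be drawn from a distribution that (1) satisfies the S-REC$(S, \gamma, \delta)$ with probability $1 - p$ and (2) has for every fixed $x \in \mathbb{R}^n$, $\|Ax\| \le 2\|x\|$ with probability $1 - p$.

For any $x^* \in \mathbb{R}^n$ and noise $\eta$, let $y = Ax^* + \eta$. Let $\hat{x}$ approximately minimize $\|y - Ax\|$ over $x \in S$, i.e.,

$$\|y - A\hat{x}\| \le \min_{x \in S} \|y - Ax\| + \epsilon.$$

Then,

$$\|\hat{x} - x^*\| \le \left( \frac{4}{\gamma} + 1 \right) \min_{x \in S} \|x^* - x\| + \frac{1}{\gamma} \left( 2\|\eta\| + \epsilon + \delta \right)$$

with probability $1 - 2p$.

Combining Lemma 4.1, Lemma 4.2, and Lemma 4.3 gives Theorems 1.1 and 1.2. In our setting, $S$ is the range of the generator, and $\hat{x}$ in the lemma above is the reconstruction $G(\hat{z})$ returned by our algorithm.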
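As a quick arithmetic check of how the constants could line up, the following sketch assumes that the recovery lemma bounds the error by $(4/\gamma + 1)\min_{x \in S}\|x^* - x\| + (2\|\eta\| + \epsilon + \delta)/\gamma$. The specific choice $\gamma = 4/5$ (that is, $\alpha = 1/5$) is an illustrative assumption, not necessarily the value used in the paper's proof.

```latex
\begin{align*}
\gamma = 1 - \alpha = \tfrac{4}{5}
\quad\Longrightarrow\quad
\tfrac{4}{\gamma} + 1 &= 6, &
\tfrac{2}{\gamma} &= \tfrac{5}{2} \le 3, &
\tfrac{1}{\gamma} &= \tfrac{5}{4} \le 2 .
\end{align*}
```

Under this choice the bound is at most $6 \min_{x \in S}\|x^* - x\| + 3\|\eta\| + 2\epsilon + 2\delta$, which has the same shape as the guarantees in Theorem 1.2 and, with $\delta = 0$, Theorem 1.1.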

Figure 1: We compare the performance of our algorithm with baselines. We show a plot of per pixel reconstruction error as we vary the number of measurements. The vertical bars indicate 95% confidence intervals. (a) Results on MNIST. (b) Results on celebA.

5 Models

In this section we describe the generative models used in our experiments. We used two image datasets and two different generative model types (a VAE and a GAN). This provides some evidence that our approach can work with many types of models and datasets.

In our experiments, we found that it was helpful to add a regularization term $L(z)$ to the objective to encourage the optimization to explore more in the regions that are preferred by the respective generative models (see comparison to unregularized versions in Fig. 1). Thus the objective function we use for minimization is

$$\|AG(z) - y\|^2 + L(z).$$

Both VAEs and GANs typically impose an isotropic Gaussian prior on $z$. Thus $\|z\|^2$ is proportional to the negative log-likelihood under this prior. Accordingly, we use the following regularizer:

$$L(z) = \lambda \|z\|^2, \tag{3}$$

where $\lambda$ measures the relative importance of the prior as compared to the measurement error.
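A regularized objective of this kind, measurement error plus a multiple of the squared norm of the representation, can be written as a small function. The function name and the argument `lam` are illustrative choices made here, and `G` can be any callable generator.

```python
import numpy as np

def regularized_loss(z, G, A, y, lam):
    """Measurement error ||A G(z) - y||^2 plus the prior term lam * ||z||^2."""
    residual = A @ G(z) - y
    return float(residual @ residual + lam * (z @ z))

# Tiny usage example with an identity "generator" (purely illustrative).
z = np.array([1.0, 2.0])
value = regularized_loss(z, lambda v: v, np.eye(2), np.array([1.0, 0.0]), lam=0.5)
```

Setting `lam` to zero recovers the unregularized measurement error; larger values pull the solution toward the high-density region of the Gaussian prior.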

5.1 MNIST with VAE

The MNIST dataset consists of about 60,000 images of handwritten digits, where each image is of size 28 × 28 [27]. Each pixel value is either 0 (background) or 1 (foreground). No pre-processing was performed. We trained a VAE on this dataset. The input to the VAE is a vectorized binary image of input dimension 784. We set the size of the representation space . The recognition network is a fully connected network. The generator is also fully connected with the architecture . We train the VAE using the Adam optimizer [25] with a mini-batch size and a learning rate of .

We found that using in Eqn. (3) gave the best performance, and we use this value in our experiments.

The digit images are reasonably sparse in the pixel space. Thus, as a baseline, we use the pixel values directly for sparse recovery using Lasso. We set shrinkage parameter to be for all the experiments.

5.2 CelebA with DCGAN

CelebA is a dataset of more than face images of celebrities [29]. The input images were cropped to a RGB image, giving inputs per image. Each pixel value was scaled so that all values are between . We trained a DCGAN [34, 24] on this dataset, reusing code from https://github.com/carpedm20/DCGAN-tensorflow. We set the input dimension and use a standard normal distribution. The architecture follows that of [34]. The model was trained by one update to the discriminator and two updates to the generator per cycle. Each update used the Adam optimizer [25] with minibatch size , learning rate and .

We found that using in Eqn. (3) gave the best results and thus, we use this value in our experiments.

For baselines, we perform sparse recovery using Lasso on the images in two domains: (a) 2D Discrete Cosine Transform (2D-DCT) and (b) 2D Daubechies-1 Wavelet Transform (2D-DB1). While we provide Gaussian measurements of the original pixel values, the $\ell_1$ penalty is on either the DCT coefficients or the DB1 coefficients of each color channel of an image. For all experiments, we set the shrinkage parameter to be and respectively for 2D-DCT, and 2D-DB1.

6 Experiments and Results

6.1 Reconstruction from Gaussian measurements

We take $A$ to be a random matrix with IID Gaussian entries with zero mean and standard deviation of . Each entry of the noise vector $\eta$ is also an IID Gaussian random variable. We compare performance of different sensing algorithms qualitatively and quantitatively. For quantitative comparison, we use the reconstruction error $\|\hat{x} - x^*\|^2$, where $\hat{x}$ is an estimate of $x^*$ returned by the algorithm. In all cases, we report the results on a held out test set, unseen by the generative model at training time.

6.1.1 MNIST

The standard deviation of the noise vector is set such that . We use the Adam optimizer [25], with a learning rate of . We do random restarts with steps per restart and pick the reconstruction with best measurement error.

In Fig. 1(a), we show the reconstruction error as we change the number of measurements both for Lasso and our algorithm. We observe that our algorithm is able to get low errors with far fewer measurements. For example, our algorithm’s performance with measurements matches Lasso’s performance with measurements. Fig. 2(a) shows sample reconstructions by Lasso and our algorithm.

However, our algorithm is limited since its output is constrained to be in the range of the generator. After measurements, our algorithm’s performance saturates, and additional measurements give no additional performance. Since Lasso has no such limitation, it eventually surpasses our algorithm, but this takes more than measurements of the 784-dimensional vector. We expect that a more powerful generative model with representation dimension can make better use of additional measurements.

6.1.2 celebA

The standard deviation of entries in the noise vector is set such that . We use the Adam optimizer [25], with a learning rate of . We do random restarts with update steps per restart and pick the reconstruction with best measurement error.

In Fig. 1(b), we show the reconstruction error as we change the number of measurements both for Lasso and our algorithm. In Fig. 3 we show sample reconstructions by Lasso and our algorithm. We observe that our algorithm is able to produce reasonable reconstructions with as few as measurements, while the output of the baseline algorithms is quite blurry. Similar to the results on MNIST, if we continue to give more measurements, our algorithm saturates, and for more than measurements, Lasso gets a better reconstruction. We again expect that a more powerful generative model with would perform better in the high-measurement regime.

(a) We show original images (top row) and reconstructions by Lasso (middle row) and our algorithm (bottom row).
(b) We show original images (top row), low resolution version of original images (middle row) and reconstructions (last row).
Figure 2: Results on MNIST. Reconstruction with 100 measurements (left) and Super-resolution (right)

6.2 Super-resolution

Super-resolution is the task of constructing a high resolution image from a low resolution version of the same image. This problem can be thought of as a special case of our general framework of linear measurements, where the measurements correspond to local spatial averages of the pixel values. Thus, we try to use our recovery algorithm to perform this task with a measurement matrix $A$ tailored to give only the relevant observations. We note that this measurement matrix may not satisfy the S-REC condition (with good constants $\gamma$ and $\delta$), and consequently, our theorems may not be applicable.
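The statement that local spatial averaging is a linear measurement can be made explicit by building the corresponding matrix. The sketch below is an illustration only: the helper name and the pooling factor `f` are assumptions made here, and it uses non-overlapping `f × f` blocks.

```python
import numpy as np

def average_pool_matrix(height, width, f):
    """Matrix A such that A @ image.ravel() gives the f x f block averages
    (stride f) of a height x width image, in row-major order."""
    assert height % f == 0 and width % f == 0
    out_h, out_w = height // f, width // f
    A = np.zeros((out_h * out_w, height * width))
    for i in range(out_h):
        for j in range(out_w):
            rows = np.arange(i * f, (i + 1) * f)
            cols = np.arange(j * f, (j + 1) * f)
            idx = (rows[:, None] * width + cols[None, :]).ravel()
            A[i * out_w + j, idx] = 1.0 / (f * f)   # average over one block
    return A

image = np.arange(16.0).reshape(4, 4)
A_pool = average_pool_matrix(4, 4, 2)
low_res = (A_pool @ image.ravel()).reshape(2, 2)
```

Each row of the matrix has only `f * f` nonzero entries, so it is highly structured rather than random, which is one way to see why guarantees proved for Gaussian matrices need not carry over.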

6.2.1 MNIST

We construct a low resolution image by spatial pooling with a stride of to produce a image. These measurements are used to reconstruct the original image. Fig. 2(b) shows reconstructions produced by our algorithm on images from a held out test set. We observe sharp reconstructions which closely match the fine structure in the ground truth.

6.2.2 celebA

We construct a low resolution image by spatial pooling with a stride of to produce a image. These measurements are used to reconstruct the original image. In Fig. 4 we show results on images from a held out test set. We see that our algorithm is able to fill in the details to match the original image.

Figure 3: Reconstruction results on celebA with measurements (of dimensional vector). We show original images (top row), and reconstructions by Lasso with DCT basis (second row), Lasso with wavelet basis (third row), and our algorithm (last row).
Figure 4: Super-resolution results on celebA. Top row has the original images. Second row shows the low resolution ( smaller) version of the original image. Last row shows the images produced by our algorithm.
Figure 5: Results on the representation error experiments on celebA. Top row shows original images and the bottom row shows closest images found in the range of the generator.
(a) Results on MNIST
(b) Results on celebA
Figure 6: Reconstruction error for images in the range of the generator. The vertical bars indicate 95% confidence intervals.

6.3 Understanding sources of error

Although better than baselines, our reconstructions still admit some error. There are three sources of this error: (a) Representation error: the image being sensed is far from the range of the generator; (b) Measurement error: the finite set of random measurements does not contain all the information about the unknown image; (c) Optimization error: the optimization procedure did not find the best $z$.

In this section we present some experiments that suggest that the representation error is the dominant term. In our first experiment, we ensure that the representation error is zero, and try to minimize the sum of other two errors. In the second experiment, we ensure that the measurement error is zero, and try to minimize the sum of other two.

Figure 7: Results on the representation error experiments on MNIST. Top row shows original images and the bottom row shows closest images found in the range of the generator.

6.3.1 Sensing images from the range of the generator

Our first approach is to sense an image that is in the range of the generator. More concretely, we sample a $z^*$ from $P_Z$. Then we pass it through the generator to get $x^* = G(z^*)$. Now, we pretend that this is a real image and try to sense that. This method eliminates the representation error and allows us to check if our gradient based optimization procedure is able to find $z^*$ by minimizing the objective.

In Fig. 6(a) and Fig. 6(b), we show the reconstruction error for images in the range of the generators trained on MNIST and celebA datasets respectively. We see that we get almost perfect reconstruction with very few measurements. This suggests that the objective is being properly minimized and we indeed get $\hat{z}$ close to $z^*$, i.e., the sum of the optimization error and the measurement error is not very large in the absence of the representation error.

6.3.2 Quantifying representation error

We saw that in the absence of the representation error, the overall error is small. However, from Fig. 1, we know that the overall error is still non-zero. So, in this experiment, we seek to quantify the representation error, i.e., how far are the real images from the range of the generator?

From the previous experiment, we know that the $\hat{z}$ recovered by our algorithm is close to $z^*$, the best possible value, if the image being sensed is in the range of the generator. Based on this, we make an assumption that this property is also true for real images. With this assumption, we get an estimate of the representation error as follows: We sample real images from the test set. Then we use the full image in our algorithm, i.e., our measurement matrix $A$ is the identity. This eliminates the measurement error. Using these measurements, we get the reconstructed image $G(\hat{z})$ through our algorithm. The estimated representation error is then $\|G(\hat{z}) - x^*\|^2$. We repeat this procedure several times over randomly sampled images from our dataset and report average representation error values. The task of finding the closest image in the range of the generator has been studied in prior work [11, 16, 12].

On the MNIST dataset, we get average per pixel representation error of . The recovered images are shown in Fig. 7. In contrast with only Gaussian measurements, we are able to get a per pixel reconstruction error of about .

On the celebA dataset, we get average per pixel representation error of . The recovered images are shown in Fig. 5. On the other hand, with only Gaussian measurements, we get a per pixel reconstruction error of about .

These experiments suggest that the representation error is the major component of the total error. Thus, a more flexible generative model can help to decrease the overall error on both datasets.

7 Conclusion

We demonstrate how to perform compressed sensing using generative models from neural nets. These models can represent data distributions more concisely than standard sparsity models, while their differentiability allows for fast signal reconstruction. This will allow compressed sensing applications to make significantly fewer measurements.

Our theorems and experiments both suggest that, after relatively few measurements, the signal reconstruction gets close to the optimal within the range of the generator. To reach the full potential of this technique, one should use larger generative models as the number of measurements increases. Whether this can be expressed more concisely than by training multiple independent generative models of different sizes is an open question.

Generative models are an active area of research with ongoing rapid improvements. Because our framework applies to general generative models, this improvement will immediately yield better reconstructions with fewer measurements. We also believe that one could use the performance of generative models for our task as one benchmark for the quality of different models.

Acknowledgements

We would like to thank Philipp Krähenbühl for helpful discussions.

References

8 Appendix A

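Lemma 8.1.

Given $S \subseteq \mathbb{R}^n$, $y \in \mathbb{R}^m$, $A \in \mathbb{R}^{m \times n}$, and $\gamma, \delta, \epsilon_1, \epsilon_2 > 0$, if matrix $A$ satisfies the S-REC$(S, \gamma, \delta)$, then for any two $x_1, x_2 \in S$, such that $\|Ax_1 - y\| \le \epsilon_1$ and $\|Ax_2 - y\| \le \epsilon_2$, we have

$$\|x_1 - x_2\| \le \frac{\epsilon_1 + \epsilon_2 + \delta}{\gamma}.$$

Proof.

By the S-REC and the triangle inequality,

$$\|x_1 - x_2\| \le \frac{1}{\gamma} \left( \|Ax_1 - Ax_2\| + \delta \right) \le \frac{1}{\gamma} \left( \|Ax_1 - y\| + \|Ax_2 - y\| + \delta \right) \le \frac{\epsilon_1 + \epsilon_2 + \delta}{\gamma}.$$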

8.1 Proof of Lemma 4.1

Definition 2.

A random variable is said to be if  , we have

Lemma 8.2.

Let be an -Lipschitz function. Let be the -ball in with radius , , and be a -net on such that . Let be a

random matrix with IID Gaussian entries with zero mean and variance

. If

then for any , if , we have with probability .

Note that for any given point in , if we try to find its nearest neighbor of that point in an -net on , then the difference between the two is at most the . In words, this lemma says that even if we consider measurements made on these points, i.e. a linear projection using a random matrix , then as long as there are enough measurements, the difference between measurements is of the same order . If the point was in the net, then this can be easily achieved by Johnson-Lindenstrauss Lemma. But to argue that this is true for all in , which can be an uncountably large set, we construct a chain of nets on . We now present the formal proof.

Proof.

Observe that is . Thus, for any ,

is sufficient to ensure that

Now, let be a chain of epsilon nets of such that is a -net and , with . We know that there exist nets such that

Let . Then due to Lipschitzness of , ’s form a chain of epsilon nets such that is a -net of , with .

For , let

Thus,

Now assume ,

and

By choice of and , we have ,

Thus by union bound, we have

Now,

Observe that for any , we can write

where and .

Since each , with probability at least , we have

Now, , and due to properties of epsilon-nets. We know that with probability at least (Corollary 5.35 [36]). By setting , we get that, with probability .

Combining these two results, and noting that it is possible to choose , we get that with probability ,

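Lemma.

Let $G: \mathbb{R}^k \to \mathbb{R}^n$ be $L$-Lipschitz. Let

$$B^k(r) = \{ z \mid z \in \mathbb{R}^k, \|z\| \le r \}$$

be an $\ell_2$-norm ball in $\mathbb{R}^k$. For $\alpha < 1$, if

$$m = \Omega\left( \frac{k}{\alpha^2} \log \frac{Lr}{\delta} \right),$$

then a random matrix $A \in \mathbb{R}^{m \times n}$ with IID entries such that $A_{ij} \sim N(0, \frac{1}{m})$ satisfies the S-REC$(G(B^k(r)), 1 - \alpha, \delta)$ with $1 - e^{-\Omega(\alpha^2 m)}$ probability.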

Proof.

We construct a -net, , on . There exists a net such that

Since is a -cover of , due to the -Lipschitz property of , we get that is a -cover of .

Let denote the pairwise differences between the elements in , i.e.,

Then,

For any , , such that are -close to and respectively. Thus, by triangle inequality,

Again by triangle inequality,

Now, by Lemma 8.2, with probability , , and . Thus,

By the Johnson-Lindenstrauss Lemma, for a fixed , . Therefore, we can union bound over all vectors in to get

Since , and , , we have

Combining the three results above we get that with probability ,

Thus, satisfies with probability .

8.2 Proof of Lemma 4.2

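Lemma 8.3.

Consider $c$ different $(k-1)$-dimensional hyperplanes in $\mathbb{R}^k$. Consider the $k$-dimensional faces (hereafter called $k$-faces) generated by the hyperplanes, i.e. the elements in the partition of $\mathbb{R}^k$ such that relative to each hyperplane, all points inside a partition are on the same side. Then, the number of $k$-faces is $O(c^k)$.

Proof.

Proof is by induction, and follows [32].

Let $f(c, k)$ denote the number of $k$-faces generated in $\mathbb{R}^k$ by $c$ different $(k-1)$-dimensional hyperplanes. As a base case, let $k = 1$. Then $(k-1)$-dimensional hyperplanes are just points on a line. $c$ points partition $\mathbb{R}$ into $c + 1$ pieces. This gives $f(c, 1) = O(c)$.

Now, assuming that $f(c, k-1) = O(c^{k-1})$ is true, we need to show $f(c, k) = O(c^k)$. Assume we have $(c-1)$ different hyperplanes $H = \{h_1, h_2, \ldots, h_{c-1}\} \subset \mathbb{R}^k$, and a new hyperplane $h_c$ is added. $h_c$ intersects $H$ at $(c-1)$ different $(k-2)$-faces given by $F = \{ f_j \mid f_j = h_j \cap h_c,\ 1 \le j \le c-1 \}$. The $(k-2)$-faces in $F$ partition $h_c$ into $f(c-1, k-1)$ different $(k-1)$-faces. Additionally, each $(k-1)$-face in $h_c$ divides an existing $k$-face into two. Hence the number of new $k$-faces introduced by the addition of $h_c$ is $f(c-1, k-1)$. This gives the recursion

$$f(c, k) = f(c-1, k) + f(c-1, k-1) = f(c-1, k) + O(c^{k-1}) = O(c^k).$$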
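The face-counting recursion used in this induction can be evaluated directly. The sketch below assumes the recursion has the form f(c, k) = f(c-1, k) + f(c-1, k-1), with the conventions that zero hyperplanes leave a single face and that a 0-dimensional space has a single face; these base cases and the function name are choices made here for illustration. It counts the maximum number of faces, attained by hyperplanes in general position.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def max_faces(c, k):
    """Maximum number of k-faces that c hyperplanes can cut R^k into."""
    if c == 0 or k == 0:
        return 1
    # Adding one hyperplane creates as many new faces as the other c - 1
    # hyperplanes cut that hyperplane (a copy of R^(k-1)) into.
    return max_faces(c - 1, k) + max_faces(c - 1, k - 1)

print(max_faces(5, 1), max_faces(3, 2), max_faces(10, 3))
```

The recursion has the closed form of a partial sum of binomial coefficients, `sum(comb(c, i) for i in range(k + 1))`, which is at most `(c + 1) ** k` and hence consistent with an $O(c^k)$ bound for fixed $k$.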

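Lemma.

Let $G: \mathbb{R}^k \to \mathbb{R}^n$ be a $d$-layer neural network, where each layer is a linear transformation followed by a pointwise non-linearity. Suppose there are at most $c$ nodes per layer, and the non-linearities are piecewise linear with at most two pieces, and let

$$m = \Omega\left( \frac{1}{\alpha^2} k d \log c \right)$$

for some $\alpha < 1$. Then a random matrix $A \in \mathbb{R}^{m \times n}$ with IID entries $A_{ij} \sim N(0, \frac{1}{m})$ satisfies the S-REC$(G(\mathbb{R}^k), 1 - \alpha, 0)$ with $1 - e^{-\Omega(\alpha^2 m)}$ probability.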

Proof.

Consider the first layer of $G$. Each node in this layer can be represented as a hyperplane in $\mathbb{R}^k$, where the points on the hyperplane are those where the input to the node switches from one linear piece to the other. Since there are at most $c$ nodes in this layer, by Lemma 8.3, the input space is partitioned by at most $c$ different hyperplanes, into $O(c^k)$ $k$-faces. Applying this over the $d$ layers of $G$, we get that the input space $\mathbb{R}^k$ is partitioned into at most $c^{kd}$ sets.

Recall that the non-linearities are piecewise linear, and the partition boundaries were made precisely at those points where the non-linearities change from one piece to another. This means that within each set of the input partition, the output is a linear function of the inputs. Thus $G(\mathbb{R}^k)$ is a union of $c^{kd}$ different $k$-faces in $\mathbb{R}^n$.

We now use an oblivious subspace embedding to bound the number of measurements required to embed the range of $G$. For a single $k$-face $F \subseteq \mathbb{R}^n$, a random matrix $A \in \mathbb{R}^{m \times n}$ with IID entries such that $A_{ij} \sim N(0, \frac{1}{m})$ satisfies S-REC$(F, 1 - \alpha, 0)$ with probability $1 - e^{-\Omega(\alpha^2 m)}$ if $m = \Omega(k / \alpha^2)$.

Since the range of $G$ is a union of $c^{kd}$ different $k$-faces, we can union bound over all of them, such that $A$ satisfies the S-REC$(G(\mathbb{R}^k), 1 - \alpha, 0)$ with probability $1 - c^{kd} e^{-\Omega(\alpha^2 m)}$. Thus, we get that $A$ satisfies the S-REC$(G(\mathbb{R}^k), 1 - \alpha, 0)$ with probability $1 - e^{-\Omega(\alpha^2 m)}$ if

$$m = \Omega\left( \frac{k d \log c}{\alpha^2} \right).$$

8.3 Proof of Lemma 4.3

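Lemma.

Let $A \in \mathbb{R}^{m \times n}$ be drawn from a distribution that (1) satisfies the S-REC$(S, \gamma, \delta)$ with probability $1 - p$ and (2) has for every fixed $x \in \mathbb{R}^n$, $\|Ax\| \le 2\|x\|$ with probability $1 - p$. For any $x^* \in \mathbb{R}^n$ and noise $\eta$, let $y = Ax^* + \eta$. Let $\hat{x}$ approximately minimize $\|y - Ax\|$ over $x \in S$, i.e.,

$$\|y - A\hat{x}\| \le \min_{x \in S} \|y - Ax\| + \epsilon.$$

Then

$$\|\hat{x} - x^*\| \le \left( \frac{4}{\gamma} + 1 \right) \min_{x \in S} \|x^* - x\| + \frac{1}{\gamma} \left( 2\|\eta\| + \epsilon + \delta \right)$$

with probability $1 - 2p$.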

Proof.

Let $\bar{x} = \arg\min_{x \in S} \|x^* - x\|$. Then we have by Lemma 8.1 and the hypothesis on $\hat{x}$ that