1 Introduction
Modern deep generative models are demonstrating excellent performance as signal priors, frequently outperforming the previous state of the art for various inverse problems including denoising, inpainting, reconstruction from Gaussian projections and phase retrieval (see e.g. bora2017compressed ; fletcher2017inference ; gupta2018deep ; dhar2018modeling ; hand2018phase ; tripathi2018correction and references therein). Consequently, there is substantial work on improving compressed sensing with generative adversarial network (GANs) grover2018uncertainty ; mardani2018neural ; heckel2018deep ; mixon2018sunlayer ; pandit2019asymptotics . Similar ideas have been recently applied also for sparse PCA with a generative prior aubin2019spiked .
A central problem that appears when trying to solve inverse problems using deep generative models is inverting a generator (bora2017compressed, ; hand2017global, ; shah2018solving, )
. We are interested in deep generative models, parameterized as feedforward neural networks with ReLU/LeakyReLU activations. For a generator
that maps lowdimensional vectors in to high dimensional vectors (e.g. images) in , we want to reconstruct the latent code if we can observe (realizable case) or a noisy version where denotes some measurement noise. We are therefore interested in the optimization problem(1) 
for some norm. With this procedure, we learn a concise image representation of a given image as . This applies to image compressions and denoising tasks as studied in (heckel2018deep2, ; heckel2018deep, ). Meanwhile, this problem is a starting point for general linear inverse problems:
(2) 
since several recent works leverage inversion as a key step in solving more general inverse problems, see e.g. shah2018solving ; romano2017little . Specifically, Shah et al. (shah2018solving, ) provide theoretical guarantees on obtaining the optimal solution for (2) with projected gradient descent, provided one could solve (1) exactly. This work provides a provable algorithm to perform this projection step under some assumptions.
Previous work focuses on the norm that works slowly with gradient descent bora2017compressed ; huang2018provably . In this work, we focus on direct solvers and error bound analysis for and norm instead.^{1}^{1}1 Notice the relation between norm guarantees . Therefore the studies on and is enough to bound all intermediate norms for . Note that this is a nonconvex optimization problem even for a singlelayer network with ReLU activations. Therefore gradient descent may get stuck at local minimima or require a long time to converge. For example, for MNIST, compressing a single image by optimizing (1) takes on average several minutes and may need multiple restarts.
Our Contributions: For the realizable case we show that for a single layer solving (1) is equivalent to solving a linear program. For networks more than one layer, however, we show it is NPhard to simply determine whether exact recovery exists. For a twolayer network we show that the preimage in the latent space can be a nonconvex set.
For realizable inputs and arbitrary depth we show that inversion is possible in polynomial time if the network layers have sufficient expansion and the weights are randomly selected. A similar result was established very recently for gradient descent huang2018provably . We instead propose inversion by layerwise Gaussian elimination. Our result holds even if each layer is expanding by a constant factor while huang2018provably requires a logarithmic multiplicative expansion in each layer.
For noisy inputs and arbitrary depth we propose two algorithms that rely on iteratively solving linear programs to reconstruct each layer. We establish provable error bounds on the reconstruction error when the weights are random and have constant expansion. We also show empirically that our method matches and sometimes outperforms gradient descent for inversion, especially when the latent dimension becomes larger.
2 Setup
We consider deep generative models with the latent dimension being smaller than the signal dimension , parameterized by a layer feedforward network of the form
(3) 
where each layer is defined as a composition of activations and linear maps: . We focus on the ReLU activations applied coordinatewise, and we will also consider the activation as , where the scaling factor is typically 0.1^{2}^{2}2The inversion of LeakyReLU networks is much easier than ReLU networks and we therefore only mention it when needed.. are the weights of the network, and are the bias terms. Therefore, and indicate the dimensionality of the input and output of the generator . We use to denote the output of the th layer. Note that one can absorb the bias term into by adding one more dimension with a constant input. Therefore, without loss of generality, we sometimes omit when writing the equation, unless we explicitly needed it.
We use bold lowercase symbols for vectors, e.g. , and for its coordinates. We use uppercase symbols for denote matrices, e.g. , where is its th row vector. For a indexed set , represents the submatrix of consisting of each th row of for any .
The central challenge is to determine the signs for the intermediate variables of the hidden layers. We refer to these sign patterns as "ReLU configurations" throughout the paper, indicating which neurons are ‘on’ and which are ‘off’.
3 Invertibility for ReLU Realizable Networks
In this section we study the realizable case, i.e., when we are given an observation vector for which there exists such that . In particular, we show that the problem is NPhard for ReLU activations in general, but could be solved in polynomial time with some mild assumptions with high probability. We present our theoretical findings first and all proofs of the paper are presented later in the Appendix.
3.1 Inverting a Single Layer
We start with the simplest onelayer case to find if for any norm. Since the problem is nonconvex, further assumptions of are required huang2018provably for gradient descent to work. When the problem is realizable, however, to find feasible such that , one could invert the function by solving a linear programming:
(4) 
Its solution set is convex and forms a polytope, but possibly includes uncountable feasible points. Therefore, it becomes unclear how to continue the process of layerwise inversion unless further assumptions are made. To demonstrate the challenges to generalize the result to deeper nets, we show that the solution set becomes nonconvex, and to determine whether there exists any solution is NPcomplete.
3.2 Challenges to Invert a Two or More Layered ReLU Network
As a warmup, we first present the NPhardness to recover a binary latent code for a two layer network, and then generalize it to the realvalued case.
Theorem 1 (NPhardness to Recover Binary Latent Code For Twolayer ReLU Network).
Given a twolayer ReLU network where weights are all fixed, and an observation , the problem to determine whether there exists such that is NPcomplete.
We defer the proof to the Appendix, which is constructive and shows the 3SAT problem is reducible to the above twolayer binary code recovery problem. Meanwhile, when the ReLU configuration for each layer is given, the recovery problem becomes to solve a simple linear system. Therefore the problem lies in NP, and together we have NPcompleteness. With similar procedure, we could also construct a 4layer network with real input and prove the following statement:
Theorem 2 (NPhardness to Recover ReLU Networks with Real Domain).
Given a fourlayered ReLU neural network where weights are all fixed, and an observation vector , the problem to determine whether there exists such that is NPcomplete.
The conclusion holds naturally for generative models with deeper architecture.
Meanwhile, although the preimage for a single layer is a polytope thus convex, it doesn’t continue to hold for more than one layers, see Example 1. Fortunately, we present next that some moderate conditions guarantee a polynomial time solution with high probability.
3.3 Inverting Expansive Random Network in Polynomial Time
Assumption 1.
For a weight matrix , we assume 1) its entries are sampled i.i.d Gaussian, and 2) the weight matrix is tall: for some constant .
In the previous section, we indicate that the per layer inversion can be achieved through linear programming (4). With Assumption 1 we will be able to prove that the solution is unique with high probability, and thus Theorem 3 holds for ReLU networks with arbitrary depth.
Theorem 3.
Let be a generative model from a layer neural network using ReLU activations. If for each layer, the weight matrix satisfies Assumption 1, then for any prior and observation , with probability , could be achieved from by solving layerwise linear equations. Namely, a random, expansive and realizable generative model could be inverted in polynomial time with high probability.
In our proof, we show that with high probability the observation has at least nonzero entries, which forms equalities and the coefficient matrix is invertible with probability 1. Therefore the time complexity of exact recovery is no worse than golub2012matrix since the recovery simply requires solving linear equations with dimension .
Inverting LeakyReLU Network: On the other hand, inversion of LeakyReLU layers are significantly easier for the realizable case. Unlike ReLU, LeakyReLU is a bijective map, i.e., each observation corresponds to a unique preimage:
(5) 
Therefore, as long as each is of rank , each layer map is also bijective and could be computed by the inverse of LeakyReLU (5
) and linear regression.
4 Invertibility for Noisy ReLU Networks
Besides the realizable case, the study of noise tolerance is essential for many real applications. In this section, we thus consider the noisy setting with observation , and investigate the approximate recovery for by relaxing some equalities in (4). We also analyze the problem with both and error bound, in favor of different types of random noise distribution. In this section, all generators are without the bias term.
4.1 Norm Error Bound
Again we start with a single layer, i.e. we observe . Depending on the distribution over the measurement noise , different norm in the objective should be used, with corresponding error bound analysis. We first look at the case where the entries of are uniformly bounded and the approximation of .
Note that for an error , the true prior that produces the observation falls into the following constraints:
(6) 
which is also equivalent to the set . Therefore a natural way to approximate the prior is to use linear programming to solve the above constraints.
If is known, inversion is straightforward from constraints (6
). However, suppose we don’t want to use a loose guess, we could start from a small estimation and gradually increase the tolerance until feasibility is achieved. A layerwise inversion is formally presented in Algorithm
1^{3}^{3}3For practical use, we introduce a factor to gradually increase the error estimation. In our theorem, it assumed we expicitly set to invert the th layer as the error estimation ..A key assumption that possibly conveys the error bound from the output to the solution is the following assumption:
Assumption 2 (Submatrix Extends Norm).
For the weight matrix , there exists an integer and a constant , such that for any , satisfies
with high probability for any , and is a constant. Recall that is the subrows of confined to .
With this assumption, we are able to show the following theorem that bounds the recovery error.
Theorem 4.
Let be a noisy observation produced by the generator , a layer ReLU network mapping from . Let each weight matrix satisfies Assumption 2 with the integer and constant . Let the error satisfies , and for each , at least coordinates are larger than . Then by recursively applying Algorithm 1 backwards, it produces an that satisfies with high probability.
We argue that the assumptions required could be satisfied by random weight matrices sampled from i.i.d Gaussian distribution, and present the following corollary.
Corollary 1.
Let be a noisy observation produced by the generator , a layer ReLU network mapping from . Let each weight matrix () be sampled from i.i.d Gaussian distribution , then satisfies Assumption 2 with some constant . Let the error satisfies , where . By recursively applying Algorithm 1, it produces an that satisfies with high probability.
Remark 1.
For LeakyReLU, we could do at least as good as ReLU, since we could simply view all negative coordinates as inactive coordinates of ReLU, and each observation will produce a loose bound. On the other hand, if there are significant number of negative entries, we could also change the linear programming constraints of Algorithm 1 as follows:
4.2 Norm Error Bound
In this section we develop a generative model inversion framework using the norm. We introduce Algorithm 2 that tolerates error in different level for each output coordinate and intends to minimize the norm error bound.
Remark 2.
Similar to LP, we could extend this LP to LeakyReLU by simply adding similar constraints for the negative observations.
s.t.  
Different from Algorithm 1, the deviating error allowed on each observation is no longer uniform and the new algorithm is actually optimizing over the error. Similar to the error bound analysis with norm we are able to get some tight approximation guarantee under some mild assumption related to Restricted Isometry Property for norm:
Assumption 3 (Submatrix Extends Norm).
For a weight matrix , there exists an integer and a constant , such that for any , satisfies
(7) 
with high probability for any .
This assumption is a special case of the lower bound of the wellstudied Restricted Isometry Property, for norm and sparsity , i.e., RIP1. Similar to the analysis, we are able to get recovery guarantees for generators with arbitary depth.
Theorem 5.
Let be a noisy observation produced by the generator , a layer ReLU network mapping from . Let each weight matrix satisfy Assumption 3 with the integer and constant . Let the error satisfy , and for each , at least coordinates are larger than . Then by recursively applying Algorithm 2, it produces a that satisfies with high probability.
There is a significant volume of prior work on the RIP1 condition. For instance, studies in (berinde2008combining, ) showed that a (scaled) random sparse binary matrix with rows is RIP1 with high probability. In our case and could be arbitrarily large, therefore again we only require the expansion factor to be constant. Similar results with different weight matrices are also shown in nachin2010lower ; indyk2013model ; allen2016restricted .
4.3 Relaxation on the ReLU Configuration Estimation
Our previous methods critically depend on the correct estimation of the ReLU configurations. In both Algorithm 1 and 2, we require the ground truth of all intermediate layer outputs to have many coordinates with large magnitude so that they can be distinguished from noise. An incorrect estimate from an "off" configuration to an "on" condition will possibly cause primal infeasibility when solving the LP. Increasing ameliorates this problem but also increases the recovery error.
With this intuition, a natural workaround is to perform some relaxation to tolerate incorrectly estimated signs of the observations.
(8) 
Here the ReLU configuration is no longer explicitly reflected in the constraints. Instead, we only include the upper bound for each inner product , which is always valid whether the ReLU is on or off. The previous requirement for the lower bound is now relaxed and hidden in the objective part. When the value of is relatively large, the solver will produce a larger value of to achieve optimality. Since this value is also upper bounded by , the optimal solution would be approaching to if possible. On the other hand, when the value of is close to 0, the objective dependence on is almost negligible.
Meanwhile, in the realizable case when such that , and , it is easy to show that the solution set for (8) is exactly the preimage of . This also trivially holds for Algorithm 1 and 2.
Random Net  MNIST Net  
(a) Uniform Noise  (b) Gaussian Noise  (c) Uniform Noise  (d) Gaussian Noise 
In experiments (c)(d) the network is generative model for the MNIST dataset. In this case, gradient descent fails to find global minimum in almost all the cases.
5 Experiments
In this section, we describe our experimental setup and report the performance comparisons of our algorithms with the gradient descent method huang2018provably ; hand2017global . We conduct simulations in various aspects with Gaussian random weights, and a simple GAN architecture with MNIST dataset to show that our approach can work in practice for the denoising problem. We refer to our Algorithm 1 as LP and Algorithm 2 as LP. We focus in the main text the experiments with these two proposals and also include some more empirical findings with the relaxed version described in (8) in the Appendix.
5.1 Synthetic Data
We validate our algorithms on synthetic data at various noise levels and verify Theorem 4 and 5 numerically. For our methods, we choose the scaling factor . With gradient descent, we use learning rate of and up to 1,000 iterations or until the gradient norm is no more than .
Model architecture: The architecture we choose in the simulation aligns with our theoretical findings. We choose a two layer network with constant expansion factor : latent dimension , hidden neurons of size and observation dimension . The entries in the weight matrix are independently drawn from .
Noise generation:
We use two kinds of random distribution to generate the noise, i.e., uniform distribution
and Gaussian random noise , in favor of the and error bound analysis respectively. We choose for both noise types.Recovery with Various Observation Noise: In Figure 1(a)(b) we plot the relative recovery error
at different noise levels. It supports our theoretical findings that with other parameters fixed, the recovery error grows almost linearly to the observation noise. Meanwhile, we observe in both cases, our methods perform similarly to gradient descent on average, while gradient descent is less robust and produces more outlier points. As expected, our
LP performs slightly better than gradient descent when the input error is uniformly bounded; see Figure 1(a). However, with a large variance in the observation error, as seen in Figure
1(b), LP is not as robust as LP or gradient descent.Additional experiments can be found in the Appendix including the performance of the LP relaxation that mimics LP but is more efficient and robust.
(a) ReLU  (b) LeakyReLU 
Recovery with Various Input Neurons: According to the theoretical result, one advantage of our proposals is the much smaller expansion requirement than gradient descent hand2017global (constant vs factors). Therefore we conduct the experiments to verify this point. We follow the exact setting as huang2018provably ; we fix the hidden layer and output sizes as and and vary the input size to measure the empirical success rate of recovery influenced by the input size.
In Figure 2 we report the empirical success rate of recovery for our proposals and gradient descent. With exact setting as in huang2018provably , a run is considered successful when . We observe that when input width is small, both gradient descent and our methods grant success rate. However, as the input neurons grows, gradient descent drops to complete failure when 60, while our algorithms continue to present 100% success rate until . The performance of gradient descent is slightly worse than reported in huang2018provably since they have conducted
number of measurements for each run while we only considered the measurement matrix as identity matrix.
5.2 Experiments on Generative Model for MNIST Dataset
Observation  Ground Truth  
Ours ( LP)  Gradient Descent huang2018provably  
0  3  7  8  9  0  3  7  8  9 
Observation  Ground Truth  
Ours ( LP)  Gradient Descent huang2018provably  
7  1  5  6  9  7  1  5  6  9 
To verify the practical contribution of our model, we conduct experiments on a real generative network with the MNIST dataset. We set a simple fullyconnected architecture with latent dimension , hidden neurons of size and output size . The network has a single channel. We train the network using the original Generative Adversarial Network goodfellow2014generative . We set to be small since the output usually only has around to nonzero pixels.
Similar to the simulation part, we compared our methods with gradient descent hand2017global ; huang2018provably . Under this setting, we choose the learning rate to be and number of iterations up to 10,000 (or until gradient norm is below ).
We first randomly select some empirical examples to visually show performance comparison in Figure 3. In these examples, observations are perturbed with some Gaussian random noise with variance and we use LP as our algorithm to invert the network. From the figures, we could see that our method could almost perfectly denoise and reconstruct the input image, while gradient descent impairs the completeness of the original images to some extent.
We also compare the distribution of relative recovery error with respect to different input noise levels, as ploted in Figure 1(c)(d). From the figures, we observe that for this real network, our proposals still successfully recover the ground truth with good accuracy most of the time, while gradient descent usually gets stuck in local minimum. This explains why it produces defective image reconstructions as shown in 3.
Finally, we presented some sensing results when we mask part of the observations using PGD with our inverting procedure. As shown in Figure 4, our algorithm always show reliable recovery while gradient descent sometimes fails to output reasonable result. More experiments are presented in the Appendix.
6 Conclusion
We introduced a novel algorithm to invert a generative model through linear programming, one layer at a time, given (noisy) observations of its output. We prove that for expansive and random Gaussian networks, we can exactly recover the true latent code in the noiseless setting. For noisy observations we also establish provable performance bounds. Our work is different from the closely related huang2018provably since we require less expansion, we bound for and norm (as opposed to ), and we also only focus on inversion, i.e., without a forward operator. Our method can be used as a projection step to solve general linear inverse problems with projected gradient descent shah2018solving . Empirically we demonstrate good performance, sometimes outperforming gradient descent when the latent vectors are high dimensional.
References
 (1) Zeyuan AllenZhu, Rati Gelashvili, and Ilya Razenshteyn. Restricted isometry property for general pnorms. IEEE Transactions on Information Theory, 62(10):5839–5854, 2016.
 (2) Benjamin Aubin, Bruno Loureiro, Antoine Maillard, Florent Krzakala, and Lenka Zdeborová. The spiked matrix model with generative priors. arXiv preprint arXiv:1905.12385, 2019.
 (3) Radu Berinde, Anna C Gilbert, Piotr Indyk, Howard Karloff, and Martin J Strauss. Combining geometry and combinatorics: A unified approach to sparse signal recovery. In Communication, Control, and Computing, 2008 46th Annual Allerton Conference on, pages 798–805. IEEE, 2008.
 (4) Ashish Bora, Ajil Jalal, Eric Price, and Alexandros G Dimakis. Compressed sensing using generative models. arXiv preprint arXiv:1703.03208, 2017.
 (5) Manik Dhar, Aditya Grover, and Stefano Ermon. Modeling sparse deviations for compressed sensing using generative models. arXiv preprint arXiv:1807.01442, 2018.
 (6) Alyson K Fletcher and Sundeep Rangan. Inference in deep networks in high dimensions. arXiv preprint arXiv:1706.06549, 2017.
 (7) Gene H Golub and Charles F Van Loan. Matrix computations, volume 3. JHU Press, 2012.
 (8) Ian Goodfellow, Jean PougetAbadie, Mehdi Mirza, Bing Xu, David WardeFarley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
 (9) Aditya Grover and Stefano Ermon. Uncertainty autoencoders: Learning compressed representations via variational information maximization. arXiv preprint arXiv:1812.10539, 2018.
 (10) Sidharth Gupta, Konik Kothari, Maarten V de Hoop, and Ivan Dokmanić. Deep mesh projectors for inverse problems. arXiv preprint arXiv:1805.11718, 2018.
 (11) Paul Hand, Oscar Leong, and Vlad Voroninski. Phase retrieval under a generative prior. In Advances in Neural Information Processing Systems, pages 9154–9164, 2018.
 (12) Paul Hand and Vladislav Voroninski. Global guarantees for enforcing deep generative priors by empirical risk. arXiv preprint arXiv:1705.07576, 2017.
 (13) Reinhard Heckel and Paul Hand. Deep decoder: Concise image representations from untrained nonconvolutional networks. arXiv preprint arXiv:1810.03982, 2018.
 (14) Reinhard Heckel, Wen Huang, Paul Hand, and Vladislav Voroninski. Deep denoising: Rateoptimal recovery of structured signals with a deep prior. arXiv preprint arXiv:1805.08855, 2018.
 (15) Wen Huang, Paul Hand, Reinhard Heckel, and Vladislav Voroninski. A provably convergent scheme for compressive sensing under random generative priors. arXiv preprint arXiv:1812.04176, 2018.
 (16) Piotr Indyk and Ilya Razenshteyn. On modelbased rip1 matrices. In International Colloquium on Automata, Languages, and Programming, pages 564–575. Springer, 2013.
 (17) Morteza Mardani, Qingyun Sun, Shreyas Vasawanala, Vardan Papyan, Hatef Monajemi, John Pauly, and David Donoho. Neural proximal gradient descent for compressive imaging. arXiv preprint arXiv:1806.03963, 2018.
 (18) Dustin G Mixon and Soledad Villar. Sunlayer: Stable denoising with generative networks. arXiv preprint arXiv:1803.09319, 2018.
 (19) Mergen Nachin. Lower bounds on the column sparsity of sparse recovery matrices. UAP: MIT Undergraduate Thesis, 2010.
 (20) Parthe Pandit, Mojtaba Sahraee, Sundeep Rangan, and Alyson K Fletcher. Asymptotics of map inference in deep networks. arXiv preprint arXiv:1903.01293, 2019.
 (21) Yaniv Romano, Michael Elad, and Peyman Milanfar. The little engine that could: Regularization by denoising (red). SIAM Journal on Imaging Sciences, 10(4):1804–1844, 2017.

(22)
Mark Rudelson and Roman Vershynin.
Smallest singular value of a random rectangular matrix.
Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences 62.12, pages 1707–1739, 2009.  (23) Viraj Shah and Chinmay Hegde. Solving linear inverse problems using gan priors: An algorithm with provable guarantees. arXiv preprint arXiv:1802.08406, 2018.
 (24) Subarna Tripathi, Zachary C Lipton, and Truong Q Nguyen. Correction by projection: Denoising images with generative adversarial networks. arXiv preprint arXiv:1803.04477, 2018.
Appendix A Methodology Details
In this section we present the detailed steps for our proposed methods.
a.1 LP Relaxation
We formally present the relaxed version based on (8):
s.t 
We also propose the relaxed LP for LeakyReLU activation, with key step as follows:
s.t. 
Similarly when and , the solution to (A.1) is exactly .
Appendix B Theoretical Analysis
b.1 Hardness
Warmup: NPhardness to Invert a Binary TwoLayer Network:
We show that 3SAT is reducible to the inversion problem. We first review the MAX3SAT problem:
Given a 3CNF formula (i.e. a formula in conjunctive normal form where each clause is limited to at most three literals), determine its satisfiability.
Proof of Theorem 1.
Now we design a network with binary input vectors that could be reduced from 3SAT problem.
Firstly, let the network consists of input nodes . Next, the connecting layers consists of nodes, where each node indicates one clause: will be connected to nodes among , where the weight is for a positive literal, and for a negative literal. In other words, , where the th row of is 3sparse, and corresponds to the 3 variables in the th clause. Let the bias for each to be , i.e. . Therefore, only when none of the three literals is satisfied, will output , otherwise the ReLU activation will make output 0.
Afterwards, the final layer is one node that takes the summation of , i.e. . We will set the output to be . Therefore only when all clauses are satisfied, the problem has feasible solutions. Therefore the original problem is also NPhard.
∎
NPhardness for Real Network.
Proof of Theorem 2.
Now we design a realvalued network that could be reduced from 3SAT problem. Firstly, the network consists of input nodes . Next, the connecting 2 layers map each to . Now, the third connecting layer consists of nodes, where the first nodes indicate each clause: will be connected to nodes among , where the weight is for a positive literal, and for a negative literal. Let , and . The bias term on this third layer is such that the first values are and the last two values are 0. Finally, the last layer is of 2 nodes, first one is the summation of the first nodes of , and the second one is
We will set the output to be . Notice the first two layers make sure each value of is in the range of When the output of , it means all values of must be . Therefore we go back to the previous setting with binary input vectors and simply means that all clauses are satisfied. Therefore a 4 layered ReLU network could be polynomially reduced from 3SAT problem. ∎
Proof of Nonconvexity.
The following example demonstrate this property is no longer true for a twolayer case:
Example 1.
For , , and observation , the solution set for
is nonconvex.
Example 1 is very straightforward to show the nonconvexity of the preimage. Notice point and are in the solution set, but their convex combination is not a solution point with .
b.2 Proof of Exact Recovery for the Realizable Case
The proof of Theorem 4 highly depends on the exact inversion for a single layer:
Lemma 1.
Under Assumption 1, a mapping is injective with high probability . Namely, when .
Proof.
Notice for each th index, is positive w.p. . Therefore, the number of positive coordinates in , denoted by variable
, follows Binomial distribution
, where and . With Hoeffding’s inequality, . Meanwhile, for a matrix with entries following Gaussian distribution, with probability 1 it is invertible. Therefore could only have unique solution if there is one. ∎Within the proof of Lemma 1, we show that with high probability the observation has at least nonzero entries, meaning the original linear programming has at least equalities. Therefore the corresponding
rows forms an invertible matrix with high probability. Therefore simply by solving the linear equations we will attain the ground truth.
b.3 error bound
With Assumption 2, we are able to show the following theorem that bounds the recovery error.
Proof of Approximate Recovery with and Error Bound:
Theorem 4 depends on the layerwise recovery of the intermediate ground truth vectors. We first present the following lemma for recovering a single layer with Algorithm 1 and then extend the findings to arbitrary depth .
Lemma 2 (Approximate Inversion of a Noisy Layer with Error Bound).
Proof.
Denote , and to be the true output. Notice it also satisfies from the error bound assumption. Since has more than entries , the observation satisfies . Notice for a feasible vector with constraints in (6), it satisfies that
(10)  
since the error is bounded uniformly for each coordinate in . Meanwhile, notice the real satisfies , we have . With Assumption 2, satisfies for an arbitrary whp. Therefore together with (10) and let and get:
(11) 
Therefore with probability .
∎
Theorem 4 is the direct extension to the multilayer case and we simply apply Lemma 2 from th layer backwards to the input vector with initial error of for the th layer.
Now we look at some examples that fulfill the assumptions. The proof of extension is not easy and we look at the following looser result instead.
Lemma 3 (Related result from [22]).
For a subGaussian random matrix
with height and width , where . Its smallest singular valuesatisfies with high probability , where is some absolute constant.
The original paper requires and we presented above with a relaxed condition that .
Proof of Corollary 1.
With the aid of Lemma 3, Assumption 2 is satisfied with for each layer with high probability. This is because for a random Gaussian matrix , w.h.p. Without loss of generality we assume . We hereby only need to prove that for each th layer, , the output satisfies: with high probability. We start with the input layer. Notice each entry of follows , . Meanwhile, the number of coordinates in that are larger or equal to follows binomial distribution Bin. Therefore the number of valid coordinates (since ) with probability . Afterwards since and is always smaller than and with high probability since the network is expansive, the condition for the remaining layers is easier and also satisfied with probability at least . By using union bound over all layers, the proof is complete.
∎
The proof for the error bound analysis is similar to that of norm and we only show the essential difference. The key point in transmitting the error from next layer to previous layer is as follows:
(Optimality of Algorithm 2 and being a feasible point) 
Together with Assumption 3, we have:
Here is the ground truth of th intermediate vector. is the one we observe and is the solution Algorithm 2 produces.
Appendix C More Experimental Results
More Results on LP Relaxation.
In Figure 5, we compare the performance with respect to different noise levels over all our proposals, including the results of Algorithm 3 that we omit in the main text. Although we do not see significant improvement of the LP relaxation method over our other proposals, we believe the relaxation over the strict ReLU configurations estimation is of good potential and should be more investigated in the future.
(a) Uniform Noise; Random Net  (b) Gaussian Noise; Random Net 
(c) Uniform Noise; Real Net  (d) Gaussian Noise; Real Net 
Time comparison.
Firstly, we should declare that for the very wellconditioned random weighted networks, gradient descent converges with large stepsize and we don’t observe much supriority over GD in terms of the running time. In the table below we presented the running time for random net with different input dimensions ranging from 10 to 110.
k  10  30  50  70  90  110  MNIST(k=20) 
LP  0.63  0.73  0.83  0.90  0.95  1.03  0.5 
LP  1.05  1.05  1.23  1.28  1.39  1.22  1.1 
LP relaxation  0.66  0.53  0.58  0.76  0.75  0.70  0.6 
GD  1.59  1.65  1.72  1.80  2.09  2.01  72 
However, for MNIST dataset, since the weight matrices are not longer wellconditioned, a large learning rate makes GD to diverge, and we have to choose small learning rate 1e3. The average running time for gradient descent to converge is roughly 1.2 minute, while for LP it only takes no more than 0.5 second.
More experiments on the Sensing Problem.
Finally we add some more examples for some impainting problem on MNIST with nonidentity forward operator .
Observation  Ground Truth 
Ours ( LP)  Gradient Descent [15] 