# Inverting Deep Generative models, One layer at a time

We study the problem of inverting a deep generative model with ReLU activations. Inversion corresponds to finding a latent code vector that explains observed measurements as much as possible. In most prior works this is performed by attempting to solve a non-convex optimization problem involving the generator. In this paper we obtain several novel theoretical results for the inversion problem. We show that for the realizable case, single layer inversion can be performed exactly in polynomial time, by solving a linear program. Further, we show that for multiple layers, inversion is NP-hard and the pre-image set can be non-convex. For generative models of arbitrary depth, we show that exact recovery is possible in polynomial time with high probability, if the layers are expanding and the weights are randomly selected. Very recent work analyzed the same problem for gradient descent inversion. Their analysis requires significantly higher expansion (logarithmic in the latent dimension) while our proposed algorithm can provably reconstruct even with constant factor expansion. We also provide provable error bounds for different norms for reconstructing noisy observations. Our empirical validation demonstrates that we obtain better reconstructions when the latent dimension is large.

## Authors

• 22 publications
• 10 publications
• 36 publications
• 52 publications
• ### Neural Spectrahedra and Semidefinite Lifts: Global Convex Optimization of Polynomial Activation Neural Networks in Fully Polynomial-Time

The training of two-layer neural networks with nonlinear activation func...
01/07/2021 ∙ by Burak Bartan, et al. ∙ 11

• ### I Don't Need 𝐮: Identifiable Non-Linear ICA Without Side Information

In this work we introduce a new approach for identifiable non-linear ICA...
06/09/2021 ∙ by Matthew Willetts, et al. ∙ 0

• ### A Generalization of Wirtinger Flow for Exact Interferometric Inversion

Interferometric inversion involves recovery of a signal from cross-corre...
01/13/2019 ∙ by Bariscan Yonel, et al. ∙ 0

• ### Vector-output ReLU Neural Network Problems are Copositive Programs: Convex Analysis of Two Layer Networks and Polynomial-time Algorithms

We describe the convex semi-infinite dual of the two-layer vector-output...
12/24/2020 ∙ by Arda Sahiner, et al. ∙ 16

• ### Stochastic seismic waveform inversion using generative adversarial networks as a geological prior

We present an application of deep generative models in the context of pa...
06/10/2018 ∙ by Lukas Mosser, et al. ∙ 0

• ### Fast Algorithms for Robust PCA via Gradient Descent

We consider the problem of Robust PCA in the fully and partially observe...
05/25/2016 ∙ by Xinyang Yi, et al. ∙ 0

• ### Fairness constraints can help exact inference in structured prediction

Many inference problems in structured prediction can be modeled as maxim...
07/01/2020 ∙ by Kevin Bello, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Modern deep generative models are demonstrating excellent performance as signal priors, frequently outperforming the previous state of the art for various inverse problems including denoising, inpainting, reconstruction from Gaussian projections and phase retrieval (see e.g. bora2017compressed ; fletcher2017inference ; gupta2018deep ; dhar2018modeling ; hand2018phase ; tripathi2018correction and references therein). Consequently, there is substantial work on improving compressed sensing with generative adversarial network (GANs) grover2018uncertainty ; mardani2018neural ; heckel2018deep ; mixon2018sunlayer ; pandit2019asymptotics . Similar ideas have been recently applied also for sparse PCA with a generative prior aubin2019spiked .

A central problem that appears when trying to solve inverse problems using deep generative models is inverting a generator (bora2017compressed, ; hand2017global, ; shah2018solving, )

. We are interested in deep generative models, parameterized as feed-forward neural networks with ReLU/LeakyReLU activations. For a generator

that maps low-dimensional vectors in to high dimensional vectors (e.g. images) in , we want to reconstruct the latent code if we can observe (realizable case) or a noisy version where denotes some measurement noise. We are therefore interested in the optimization problem

 argminz∥x−G(z)∥p, (1)

for some norm. With this procedure, we learn a concise image representation of a given image as . This applies to image compressions and denoising tasks as studied in (heckel2018deep2, ; heckel2018deep, ). Meanwhile, this problem is a starting point for general linear inverse problems:

 argminz∥x−AG(z)∥p, (2)

since several recent works leverage inversion as a key step in solving more general inverse problems, see e.g. shah2018solving ; romano2017little . Specifically, Shah et al. (shah2018solving, ) provide theoretical guarantees on obtaining the optimal solution for (2) with projected gradient descent, provided one could solve (1) exactly. This work provides a provable algorithm to perform this projection step under some assumptions.

Previous work focuses on the norm that works slowly with gradient descent bora2017compressed ; huang2018provably . In this work, we focus on direct solvers and error bound analysis for and norm instead.111 Notice the relation between norm guarantees . Therefore the studies on and is enough to bound all intermediate norms for . Note that this is a non-convex optimization problem even for a single-layer network with ReLU activations. Therefore gradient descent may get stuck at local minimima or require a long time to converge. For example, for MNIST, compressing a single image by optimizing (1) takes on average several minutes and may need multiple restarts.

Our Contributions: For the realizable case we show that for a single layer solving (1) is equivalent to solving a linear program. For networks more than one layer, however, we show it is NP-hard to simply determine whether exact recovery exists. For a two-layer network we show that the pre-image in the latent space can be a non-convex set.

For realizable inputs and arbitrary depth we show that inversion is possible in polynomial time if the network layers have sufficient expansion and the weights are randomly selected. A similar result was established very recently for gradient descent huang2018provably . We instead propose inversion by layer-wise Gaussian elimination. Our result holds even if each layer is expanding by a constant factor while huang2018provably requires a logarithmic multiplicative expansion in each layer.

For noisy inputs and arbitrary depth we propose two algorithms that rely on iteratively solving linear programs to reconstruct each layer. We establish provable error bounds on the reconstruction error when the weights are random and have constant expansion. We also show empirically that our method matches and sometimes outperforms gradient descent for inversion, especially when the latent dimension becomes larger.

## 2 Setup

We consider deep generative models with the latent dimension being smaller than the signal dimension , parameterized by a -layer feed-forward network of the form

 G(z)=ϕd(ϕd−1(⋯ϕ2(ϕ1(z))⋯)), (3)

where each layer is defined as a composition of activations and linear maps: . We focus on the ReLU activations applied coordinate-wise, and we will also consider the activation as , where the scaling factor is typically 0.1222The inversion of LeakyReLU networks is much easier than ReLU networks and we therefore only mention it when needed.. are the weights of the network, and are the bias terms. Therefore, and indicate the dimensionality of the input and output of the generator . We use to denote the output of the -th layer. Note that one can absorb the bias term into by adding one more dimension with a constant input. Therefore, without loss of generality, we sometimes omit when writing the equation, unless we explicitly needed it.

We use bold lower-case symbols for vectors, e.g. , and for its coordinates. We use upper-case symbols for denote matrices, e.g. , where is its -th row vector. For a indexed set , represents the submatrix of consisting of each -th row of for any .

The central challenge is to determine the signs for the intermediate variables of the hidden layers. We refer to these sign patterns as "ReLU configurations" throughout the paper, indicating which neurons are ‘on’ and which are ‘off’.

## 3 Invertibility for ReLU Realizable Networks

In this section we study the realizable case, i.e., when we are given an observation vector for which there exists such that . In particular, we show that the problem is NP-hard for ReLU activations in general, but could be solved in polynomial time with some mild assumptions with high probability. We present our theoretical findings first and all proofs of the paper are presented later in the Appendix.

### 3.1 Inverting a Single Layer

We start with the simplest one-layer case to find if for any -norm. Since the problem is non-convex, further assumptions of are required huang2018provably for gradient descent to work. When the problem is realizable, however, to find feasible such that , one could invert the function by solving a linear programming:

 w⊤iz+bi=xi, ∀i s.t. xi>0 w⊤iz+bi≤0, ∀i s.t. xi=0 (4)

Its solution set is convex and forms a polytope, but possibly includes uncountable feasible points. Therefore, it becomes unclear how to continue the process of layer-wise inversion unless further assumptions are made. To demonstrate the challenges to generalize the result to deeper nets, we show that the solution set becomes non-convex, and to determine whether there exists any solution is NP-complete.

### 3.2 Challenges to Invert a Two or More Layered ReLU Network

As a warm-up, we first present the NP-hardness to recover a binary latent code for a two layer network, and then generalize it to the real-valued case.

###### Theorem 1 (NP-hardness to Recover Binary Latent Code For Two-layer ReLU Network).

Given a two-layer ReLU network where weights are all fixed, and an observation , the problem to determine whether there exists such that is NP-complete.

We defer the proof to the Appendix, which is constructive and shows the 3SAT problem is reducible to the above two-layer binary code recovery problem. Meanwhile, when the ReLU configuration for each layer is given, the recovery problem becomes to solve a simple linear system. Therefore the problem lies in NP, and together we have NP-completeness. With similar procedure, we could also construct a 4-layer network with real input and prove the following statement:

###### Theorem 2 (NP-hardness to Recover ReLU Networks with Real Domain).

Given a four-layered ReLU neural network where weights are all fixed, and an observation vector , the problem to determine whether there exists such that is NP-complete.

The conclusion holds naturally for generative models with deeper architecture.

Meanwhile, although the preimage for a single layer is a polytope thus convex, it doesn’t continue to hold for more than one layers, see Example 1. Fortunately, we present next that some moderate conditions guarantee a polynomial time solution with high probability.

### 3.3 Inverting Expansive Random Network in Polynomial Time

###### Assumption 1.

For a weight matrix , we assume 1) its entries are sampled i.i.d Gaussian, and 2) the weight matrix is tall: for some constant .

In the previous section, we indicate that the per layer inversion can be achieved through linear programming (4). With Assumption 1 we will be able to prove that the solution is unique with high probability, and thus Theorem 3 holds for ReLU networks with arbitrary depth.

###### Theorem 3.

Let be a generative model from a -layer neural network using ReLU activations. If for each layer, the weight matrix satisfies Assumption 1, then for any prior and observation , with probability , could be achieved from by solving layer-wise linear equations. Namely, a random, expansive and realizable generative model could be inverted in polynomial time with high probability.

In our proof, we show that with high probability the observation has at least non-zero entries, which forms equalities and the coefficient matrix is invertible with probability 1. Therefore the time complexity of exact recovery is no worse than  golub2012matrix since the recovery simply requires solving linear equations with dimension .

Inverting LeakyReLU Network:   On the other hand, inversion of LeakyReLU layers are significantly easier for the realizable case. Unlike ReLU, LeakyReLU is a bijective map, i.e., each observation corresponds to a unique preimage:

 LeakyReLU−1(x)={xif x≥01/cxotherwise. (5)

Therefore, as long as each is of rank , each layer map is also bijective and could be computed by the inverse of LeakyReLU (5

) and linear regression.

## 4 Invertibility for Noisy ReLU Networks

Besides the realizable case, the study of noise tolerance is essential for many real applications. In this section, we thus consider the noisy setting with observation , and investigate the approximate recovery for by relaxing some equalities in (4). We also analyze the problem with both and error bound, in favor of different types of random noise distribution. In this section, all generators are without the bias term.

### 4.1 ℓ∞ Norm Error Bound

Again we start with a single layer, i.e. we observe . Depending on the distribution over the measurement noise , different norm in the objective should be used, with corresponding error bound analysis. We first look at the case where the entries of are uniformly bounded and the approximation of .

Note that for an error , the true prior that produces the observation falls into the following constraints:

 xj−ϵ≤w⊤jz≤xj+ϵ if xj>ϵ,j∈[n] w⊤jz≤xj+ϵ if xj≤ϵ,j∈[n], (6)

which is also equivalent to the set . Therefore a natural way to approximate the prior is to use linear programming to solve the above constraints.

If is known, inversion is straightforward from constraints (6

). However, suppose we don’t want to use a loose guess, we could start from a small estimation and gradually increase the tolerance until feasibility is achieved. A layer-wise inversion is formally presented in Algorithm

1333For practical use, we introduce a factor to gradually increase the error estimation. In our theorem, it assumed we expicitly set to invert the -th layer as the error estimation ..

A key assumption that possibly conveys the error bound from the output to the solution is the following assumption:

###### Assumption 2 (Submatrix Extends ℓ∞ Norm).

For the weight matrix , there exists an integer and a constant , such that for any , satisfies

 ∥WI,:x∥∞≥c∞∥x∥∞,

with high probability for any , and is a constant. Recall that is the sub-rows of confined to .

With this assumption, we are able to show the following theorem that bounds the recovery error.

###### Theorem 4.

Let be a noisy observation produced by the generator , a -layer ReLU network mapping from . Let each weight matrix satisfies Assumption 2 with the integer and constant . Let the error satisfies , and for each , at least coordinates are larger than . Then by recursively applying Algorithm 1 backwards, it produces an that satisfies with high probability.

We argue that the assumptions required could be satisfied by random weight matrices sampled from i.i.d Gaussian distribution, and present the following corollary.

###### Corollary 1.

Let be a noisy observation produced by the generator , a -layer ReLU network mapping from . Let each weight matrix () be sampled from i.i.d Gaussian distribution , then satisfies Assumption 2 with some constant . Let the error satisfies , where . By recursively applying Algorithm 1, it produces an that satisfies with high probability.

###### Remark 1.

For LeakyReLU, we could do at least as good as ReLU, since we could simply view all negative coordinates as inactive coordinates of ReLU, and each observation will produce a loose bound. On the other hand, if there are significant number of negative entries, we could also change the linear programming constraints of Algorithm 1 as follows:

 argminz,δ δ, s.t.  ⎧⎪ ⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪ ⎪⎩xj−δ≤w⊤jz≤xj+δif xj>ϵ1/c(xj−δ)≤w⊤jz≤xj+δif −ϵ

### 4.2 ℓ1 Norm Error Bound

In this section we develop a generative model inversion framework using the norm. We introduce Algorithm 2 that tolerates error in different level for each output coordinate and intends to minimize the norm error bound.

###### Remark 2.

Similar to LP, we could extend this LP to LeakyReLU by simply adding similar constraints for the negative observations.

 z(t),e(t)← argminz,e ∑iei s.t. xj−ej≤w⊤jz≤xj+ej if xj>ϵ 1/c(xj−ej)≤w⊤jz≤xj+ej if ϵ≤xj≤ϵ xj−ej≤cw⊤jz≤xj+ej if xj<−ϵ ej≥0 ∀j∈[n].

Different from Algorithm 1, the deviating error allowed on each observation is no longer uniform and the new algorithm is actually optimizing over the error. Similar to the error bound analysis with norm we are able to get some tight approximation guarantee under some mild assumption related to Restricted Isometry Property for norm:

###### Assumption 3 (Submatrix Extends ℓ1 Norm).

For a weight matrix , there exists an integer and a constant , such that for any , satisfies

 ∥WI,:x∥1≥c1∥x∥1, (7)

with high probability for any .

This assumption is a special case of the lower bound of the well-studied Restricted Isometry Property, for -norm and sparsity , i.e., -RIP-1. Similar to the analysis, we are able to get recovery guarantees for generators with arbitary depth.

###### Theorem 5.

Let be a noisy observation produced by the generator , a -layer ReLU network mapping from . Let each weight matrix satisfy Assumption 3 with the integer and constant . Let the error satisfy , and for each , at least coordinates are larger than . Then by recursively applying Algorithm 2, it produces a that satisfies with high probability.

There is a significant volume of prior work on the RIP-1 condition. For instance, studies in (berinde2008combining, ) showed that a (scaled) random sparse binary matrix with rows is -RIP-1 with high probability. In our case and could be arbitrarily large, therefore again we only require the expansion factor to be constant. Similar results with different weight matrices are also shown in nachin2010lower ; indyk2013model ; allen2016restricted .

### 4.3 Relaxation on the ReLU Configuration Estimation

Our previous methods critically depend on the correct estimation of the ReLU configurations. In both Algorithm 1 and 2, we require the ground truth of all intermediate layer outputs to have many coordinates with large magnitude so that they can be distinguished from noise. An incorrect estimate from an "off" configuration to an "on" condition will possibly cause primal infeasibility when solving the LP. Increasing ameliorates this problem but also increases the recovery error.

With this intuition, a natural workaround is to perform some relaxation to tolerate incorrectly estimated signs of the observations.

 maxz∑imax{0,xi}w⊤iz, s.t,  w⊤iz≤xi+ϵ. (8)

Here the ReLU configuration is no longer explicitly reflected in the constraints. Instead, we only include the upper bound for each inner product , which is always valid whether the ReLU is on or off. The previous requirement for the lower bound is now relaxed and hidden in the objective part. When the value of is relatively large, the solver will produce a larger value of to achieve optimality. Since this value is also upper bounded by , the optimal solution would be approaching to if possible. On the other hand, when the value of is close to 0, the objective dependence on is almost negligible.

Meanwhile, in the realizable case when such that , and , it is easy to show that the solution set for (8) is exactly the preimage of . This also trivially holds for Algorithm 1 and 2.

## 5 Experiments

In this section, we describe our experimental setup and report the performance comparisons of our algorithms with the gradient descent method huang2018provably ; hand2017global . We conduct simulations in various aspects with Gaussian random weights, and a simple GAN architecture with MNIST dataset to show that our approach can work in practice for the denoising problem. We refer to our Algorithm 1 as LP and Algorithm 2 as LP. We focus in the main text the experiments with these two proposals and also include some more empirical findings with the relaxed version described in (8) in the Appendix.

### 5.1 Synthetic Data

We validate our algorithms on synthetic data at various noise levels and verify Theorem 4 and 5 numerically. For our methods, we choose the scaling factor . With gradient descent, we use learning rate of and up to 1,000 iterations or until the gradient norm is no more than .

Model architecture: The architecture we choose in the simulation aligns with our theoretical findings. We choose a two layer network with constant expansion factor : latent dimension , hidden neurons of size and observation dimension . The entries in the weight matrix are independently drawn from .

Noise generation:

We use two kinds of random distribution to generate the noise, i.e., uniform distribution

and Gaussian random noise , in favor of the and error bound analysis respectively. We choose for both noise types.

Recovery with Various Observation Noise: In Figure 1(a)(b) we plot the relative recovery error

at different noise levels. It supports our theoretical findings that with other parameters fixed, the recovery error grows almost linearly to the observation noise. Meanwhile, we observe in both cases, our methods perform similarly to gradient descent on average, while gradient descent is less robust and produces more outlier points. As expected, our

LP performs slightly better than gradient descent when the input error is uniformly bounded; see Figure 1

(a). However, with a large variance in the observation error, as seen in Figure

1(b), LP is not as robust as LP or gradient descent.

Additional experiments can be found in the Appendix including the performance of the LP relaxation that mimics LP but is more efficient and robust.

Recovery with Various Input Neurons: According to the theoretical result, one advantage of our proposals is the much smaller expansion requirement than gradient descent hand2017global (constant vs factors). Therefore we conduct the experiments to verify this point. We follow the exact setting as huang2018provably ; we fix the hidden layer and output sizes as and and vary the input size to measure the empirical success rate of recovery influenced by the input size.

In Figure 2 we report the empirical success rate of recovery for our proposals and gradient descent. With exact setting as in huang2018provably , a run is considered successful when . We observe that when input width is small, both gradient descent and our methods grant success rate. However, as the input neurons grows, gradient descent drops to complete failure when 60, while our algorithms continue to present 100% success rate until . The performance of gradient descent is slightly worse than reported in huang2018provably since they have conducted

number of measurements for each run while we only considered the measurement matrix as identity matrix.

### 5.2 Experiments on Generative Model for MNIST Dataset

To verify the practical contribution of our model, we conduct experiments on a real generative network with the MNIST dataset. We set a simple fully-connected architecture with latent dimension , hidden neurons of size and output size . The network has a single channel. We train the network using the original Generative Adversarial Network goodfellow2014generative . We set to be small since the output usually only has around to non-zero pixels.

Similar to the simulation part, we compared our methods with gradient descent hand2017global ; huang2018provably . Under this setting, we choose the learning rate to be and number of iterations up to 10,000 (or until gradient norm is below ).

We first randomly select some empirical examples to visually show performance comparison in Figure 3. In these examples, observations are perturbed with some Gaussian random noise with variance and we use LP as our algorithm to invert the network. From the figures, we could see that our method could almost perfectly denoise and reconstruct the input image, while gradient descent impairs the completeness of the original images to some extent.

We also compare the distribution of relative recovery error with respect to different input noise levels, as ploted in Figure 1(c)(d). From the figures, we observe that for this real network, our proposals still successfully recover the ground truth with good accuracy most of the time, while gradient descent usually gets stuck in local minimum. This explains why it produces defective image reconstructions as shown in 3.

Finally, we presented some sensing results when we mask part of the observations using PGD with our inverting procedure. As shown in Figure 4, our algorithm always show reliable recovery while gradient descent sometimes fails to output reasonable result. More experiments are presented in the Appendix.

## 6 Conclusion

We introduced a novel algorithm to invert a generative model through linear programming, one layer at a time, given (noisy) observations of its output. We prove that for expansive and random Gaussian networks, we can exactly recover the true latent code in the noiseless setting. For noisy observations we also establish provable performance bounds. Our work is different from the closely related huang2018provably since we require less expansion, we bound for and norm (as opposed to ), and we also only focus on inversion, i.e., without a forward operator. Our method can be used as a projection step to solve general linear inverse problems with projected gradient descent shah2018solving . Empirically we demonstrate good performance, sometimes outperforming gradient descent when the latent vectors are high dimensional.

## Appendix A Methodology Details

In this section we present the detailed steps for our proposed methods.

### a.1 LP Relaxation

We formally present the relaxed version based on (8):

We also propose the relaxed LP for LeakyReLU activation, with key step as follows:

 maxz x⊤Wz s.t. 1/cmin{xi−ϵ,0}≤w⊤iz≤max{xi+ϵ,0}

Similarly when and , the solution to (A.1) is exactly .

## Appendix B Theoretical Analysis

### b.1 Hardness

Warm-up: NP-hardness to Invert a Binary Two-Layer Network:
We show that 3SAT is reducible to the inversion problem. We first review the MAX-3SAT problem: Given a 3-CNF formula (i.e. a formula in conjunctive normal form where each clause is limited to at most three literals), determine its satisfiability.

###### Proof of Theorem 1.

Now we design a network with binary input vectors that could be reduced from 3SAT problem.

Firstly, let the network consists of input nodes . Next, the connecting layers consists of nodes, where each node indicates one clause: will be connected to nodes among , where the weight is for a positive literal, and for a negative literal. In other words, , where the -th row of is 3-sparse, and corresponds to the 3 variables in the -th clause. Let the bias for each to be , i.e. . Therefore, only when none of the three literals is satisfied, will output , otherwise the ReLU activation will make output 0.

Afterwards, the final layer is one node that takes the summation of , i.e. . We will set the output to be . Therefore only when all clauses are satisfied, the problem has feasible solutions. Therefore the original problem is also NP-hard.

NP-hardness for Real Network.

###### Proof of Theorem 2.

Now we design a real-valued network that could be reduced from 3SAT problem. Firstly, the network consists of input nodes . Next, the connecting 2 layers map each to . Now, the third connecting layer consists of nodes, where the first nodes indicate each clause: will be connected to nodes among , where the weight is for a positive literal, and for a negative literal. Let , and . The bias term on this third layer is such that the first values are and the last two values are 0. Finally, the last layer is of 2 nodes, first one is the summation of the first nodes of , and the second one is

We will set the output to be . Notice the first two layers make sure each value of is in the range of When the output of , it means all values of must be . Therefore we go back to the previous setting with binary input vectors and simply means that all clauses are satisfied. Therefore a 4 layered ReLU network could be polynomially reduced from 3SAT problem. ∎

Proof of Non-convexity.
The following example demonstrate this property is no longer true for a two-layer case:

###### Example 1.

For , , and observation , the solution set for

 {z|G(z)=x}, where G(z)≡ReLU(W2ReLU(W1z)),

is non-convex.

Example 1 is very straightforward to show the non-convexity of the preimage. Notice point and are in the solution set, but their convex combination is not a solution point with .

### b.2 Proof of Exact Recovery for the Realizable Case

The proof of Theorem 4 highly depends on the exact inversion for a single layer:

###### Lemma 1.

Under Assumption 1, a mapping is injective with high probability . Namely, when .

###### Proof.

Notice for each -th index, is positive w.p. . Therefore, the number of positive coordinates in , denoted by variable

, follows Binomial distribution

, where and . With Hoeffding’s inequality, . Meanwhile, for a matrix with entries following Gaussian distribution, with probability 1 it is invertible. Therefore could only have unique solution if there is one. ∎

Within the proof of Lemma 1, we show that with high probability the observation has at least non-zero entries, meaning the original linear programming has at least equalities. Therefore the corresponding

rows forms an invertible matrix with high probability. Therefore simply by solving the linear equations we will attain the ground truth.

###### Proof of Theorem 3.

From Lemma 1, for each layer , with probability , and for each observed , by solving a linear system we are able to find . By union bound, failure in the whole layerwise inverting process is upper bounded by , since for each . ∎

### b.3 ℓ∞ error bound

With Assumption 2, we are able to show the following theorem that bounds the recovery error.

Proof of Approximate Recovery with and Error Bound:
Theorem 4 depends on the layer-wise recovery of the intermediate ground truth vectors. We first present the following lemma for recovering a single layer with Algorithm 1 and then extend the findings to arbitrary depth .

###### Lemma 2 (Approximate Inversion of a Noisy Layer with ℓ∞ Error Bound).

Given a noisy observation . Let If satisfies Assumption 2 with the integer , and the observation has at least coordinates that is larger than , then Algorithm 1 outputs an that satisfies with high probability .

###### Proof.

Denote , and to be the true output. Notice it also satisfies from the error bound assumption. Since has more than entries , the observation satisfies . Notice for a feasible vector with constraints in (6), it satisfies that

 ∥WI,:z−(x∗)I∥∞ (10) ≤ ∥WI,:z−xI∥∞+∥xI−x∗I∥∞≤2ϵ,

since the error is bounded uniformly for each coordinate in . Meanwhile, notice the real satisfies , we have . With Assumption 2, satisfies for an arbitrary whp. Therefore together with (10) and let and get:

 c∞∥z−z∗∥∞≤∥WI(z−z∗)∥∞≤2ϵ. (11)

Therefore with probability .

Theorem 4 is the direct extension to the multi-layer case and we simply apply Lemma 2 from -th layer backwards to the input vector with initial error of for the -th layer.

Now we look at some examples that fulfill the assumptions. The proof of extension is not easy and we look at the following looser result instead.

###### Lemma 3 (Related result from [22]).

For a sub-Gaussian random matrix

with height and width , where . Its smallest singular value

 sn(A):=inf∥x∥2=1∥Ax∥2.

satisfies with high probability , where is some absolute constant.

The original paper requires and we presented above with a relaxed condition that .

###### Proof of Corollary 1.

With the aid of Lemma 3, Assumption 2 is satisfied with for each layer with high probability. This is because for a random Gaussian matrix , w.h.p. Without loss of generality we assume . We hereby only need to prove that for each -th layer, , the output satisfies: with high probability. We start with the input layer. Notice each entry of follows , . Meanwhile, the number of coordinates in that are larger or equal to follows binomial distribution Bin. Therefore the number of valid coordinates (since ) with probability . Afterwards since and is always smaller than and with high probability since the network is expansive, the condition for the remaining layers is easier and also satisfied with probability at least . By using union bound over all layers, the proof is complete.

The proof for the error bound analysis is similar to that of norm and we only show the essential difference. The key point in transmitting the error from next layer to previous layer is as follows:

 ∥WI,:zi−1−(z∗i)I∥1 ≤ ∥WI,:zi−1−(zi)I∥1+∥(zi)I−(z∗i)I∥1 ≤ 2∥(zi)I−(z∗i)I∥1 (Optimality of Algorithm 2 and z∗i−1 being a feasible point)

Together with Assumption 3, we have:

 ∥WI,:zi−1−(z∗i)I∥1≥c1∥zi−1−z∗i−1∥1 ⇒ ∥zi−1−z∗i−1∥1≤2c1∥zi−z∗i∥1.

Here is the ground truth of -th intermediate vector. is the one we observe and is the solution Algorithm 2 produces.

## Appendix C More Experimental Results

More Results on LP Relaxation.
In Figure 5, we compare the performance with respect to different noise levels over all our proposals, including the results of Algorithm 3 that we omit in the main text. Although we do not see significant improvement of the LP relaxation method over our other proposals, we believe the relaxation over the strict ReLU configurations estimation is of good potential and should be more investigated in the future.

Time comparison.
Firstly, we should declare that for the very well-conditioned random weighted networks, gradient descent converges with large stepsize and we don’t observe much supriority over GD in terms of the running time. In the table below we presented the running time for random net with different input dimensions ranging from 10 to 110.

However, for MNIST dataset, since the weight matrices are not longer well-conditioned, a large learning rate makes GD to diverge, and we have to choose small learning rate 1e-3. The average running time for gradient descent to converge is roughly 1.2 minute, while for LP it only takes no more than 0.5 second.

More experiments on the Sensing Problem.
Finally we add some more examples for some impainting problem on MNIST with non-identity forward operator .