# Compressed Sensing with Deep Image Prior and Learned Regularization

We propose a novel method for compressed sensing recovery using untrained deep generative models. Our method is based on the recently proposed Deep Image Prior (DIP), wherein the convolutional weights of the network are optimized to match the observed measurements. We show that this approach can be applied to solve any differentiable inverse problem. We also introduce a novel learned regularization technique which incorporates a small amount of prior information, further reducing the number of measurements required for a given reconstruction error. Our algorithm requires approximately 4-6x fewer measurements than classical Lasso methods. Unlike previous approaches based on generative models, our method does not require the model to be pre-trained. As such, we can apply our method to various medical imaging datasets for which data acquisition is expensive and no known generative models exist.

## Authors

• 1 publication
• 9 publications
• 27 publications
• 27 publications
• 51 publications
• ### Compressed Sensing using Generative Models

The goal of compressed sensing is to estimate a vector from an underdete...
03/09/2017 ∙ by Ashish Bora, et al. ∙ 0

• ### Modeling Sparse Deviations for Compressed Sensing using Generative Models

In compressed sensing, a small number of linear measurements can be used...
07/04/2018 ∙ by Manik Dhar, et al. ∙ 2

• ### Image-Adaptive GAN based Reconstruction

In the recent years, there has been a significant improvement in the qua...
06/12/2019 ∙ by Shady Abu Hussein, et al. ∙ 0

• ### Using auto-encoders for solving ill-posed linear inverse problems

Compressed sensing algorithms recover a signal from its under-determined...
01/15/2019 ∙ by Shirin Jalali, et al. ∙ 0

• ### One-dimensional Deep Image Prior for Time Series Inverse Problems

We extend the Deep Image Prior (DIP) framework to one-dimensional signal...
04/18/2019 ∙ by Sriram Ravula, et al. ∙ 18

• ### SUNLayer: Stable denoising with generative networks

It has been experimentally established that deep neural networks can be ...
03/25/2018 ∙ by Dustin G. Mixon, et al. ∙ 0

• ### Phase Retrieval using Single-Instance Deep Generative Prior

Several deep learning methods for phase retrieval exist, but most of the...
06/09/2021 ∙ by Kshitij Tayal, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

We consider the well-studied compressed sensing problem of recovering an unknown signal by observing a set of noisy measurements of the form

 y=Ax∗+η. (1)

Here is a known measurement matrix, typically generated with random independent Gaussian entries. Since the number of measurements is smaller than the dimension

of the unknown vector

, this is an under-determined system of noisy linear equations and hence ill-posed. There are many solutions, and some structure must be assumed on to have any hope of recovery. Pioneering research donoho2006compressed ; candes2006robust ; candes2005decoding established that if is assumed to be sparse in a known basis, a small number of measurements will be provably sufficient to recover the unknown vector in polynomial time using methods such as Lasso tibshirani1996regression .

Sparsity approaches have proven successful, but more complex models with additional structure have been recently proposed such as model-based compressive sensing baraniuk2010model and manifold models hegde2008random ; hegde2012signal ; eftekhari2015new . Bora et al. bora2017compressed

showed that deep generative models can be used as excellent priors for images. They also showed that backpropagation can be used to solve the signal recovery problem by performing gradient descent in the generative latent space. This method enabled image generation with significantly fewer measurements compared to Lasso for a given reconstruction error. Compressed sensing using deep generative models was further improved in very recent work

tripathi2018correction ; grover2018amortized ; kabkab2018task ; shah2018solving ; fletcher2017inference ; DBLP:journals/corr/abs-1802-04073 . Additionally a theoretical analysis of the nonconvex gradient descent algorithm bora2017compressed was proposed by Hand et al. hand2017global under some assumptions on the generative model.

Inspired by these impressive benefits of deep generative models, we chose to investigate the application of such methods for medical imaging, a canonical application of compressive sensing. A significant problem, however, is that all these previous methods require the existence of pre-trained models. While this has been achieved for various types of images, e.g. human faces of CelebA liu2015faceattributes via DCGAN radford2015unsupervised , it remains significantly more challenging for medical images wolterink2017generative ; schlegl2017unsupervised ; nie2017medical ; schlemper2017deep . Instead of addressing this problem in generative models, we found an easier way to circumvent it.

Surprising recent work by Ulyanov et al. ulyanov2017deep proposed Deep Image Prior (DIP), which uses untrainedconvolutional neural networks. In DIP-based schemes, a convolutional neural network generator (e.g. DCGAN) is initialized with random weights; these weights are subsequently optimized to make the network produce an output as close to the target image as possible. This procedure is unlearned, using no prior information from other images. The prior is enforced only by the fixed convolutional structure of the generator network.

Generators used for DIP are typically over-parameterized, i.e. the number of network weights is much larger compared to the output dimension. For this reason DIP has empirically been found to overfit to noise if run for too many iterations ulyanov2017deep . In this paper we theoretically prove that this phenomenon occurs with gradient descent and justify the use of early stopping and other regularization methods.

Our Contributions:

• In Section 3 we propose DIP for compressed sensing (CS-DIP). Our basic method is as follows. Initialize a DCGAN generator with random weights; use gradient descent to optimize these weights such that the network produces an output which agrees with the observed measurements as much as possible. This unlearned method can be improved with a novel learned regularization technique, which regularizes the DCGAN weights throughout the optimization process.

• In Section 4 we theoretically prove that DIP will fit any signal to zero error with gradient descent. Our result is established for a network with a single hidden layer and sufficient constant fraction over-parametrization. While it is expected that over-parametrized neural networks can fit any signal, the fact that gradient descent can provably solve this non-convex problem is interesting and provides theoretical justification for early stopping.

• In Section 5 we empirically show that CS-DIP outperforms previous unlearned methods in many cases. While pre-trained or “learned” methods will likely perform better bora2017compressed , we have the advantage of not requiring a generative model trained over large datasets. As such, we can apply our method to various medical imaging datasets for which data acquisition is expensive and generative models are difficult to train.

## 2 Background

### 2.1 Compressed Sensing: Classical and Unlearned Approaches

A classical assumption made in compressed sensing is that the vector is -sparse in some basis such as wavelet or discrete cosine transform (DCT). Finding the sparsest solution to an underdetermined linear system of equations is NP-hard in general; however, if the matrix

satisfies conditions such as the Restricted Eigenvalue Condition (REC) or Restricted Isometry Property (RIP)

candes2006stable ; bickel2009simultaneous ; donoho2006compressed ; tibshirani1996regression , then can be recovered in polynomial time via convex relaxations tropp2006just or iterative methods. There is extensive compressed sensing literature regarding assumptions on , numerous recovery algorithms, and variations of RIP and REC bickel2009simultaneous ; negahban2009unified ; agarwal2010fast ; bach2012optimization ; loh2011high .

Compressed sensing methods have found many applications in imaging, for example the single-pixel camera (SPC) duarte2008single . Medical tomographic applications include x-ray radiography, microwave imaging, magnetic resonance imaging (MRI) winters2010sparsity ; chen2008prior ; lustig2007sparse . Obtaining measurements for medical imaging can be costly, time-consuming, and in some cases dangerous to the patient qaisar2013compressive . As such, an important goal is to reduce the number of measurements while maintaining good reconstruction quality.

Aside from the classical use of sparsity, recent work has used other priors to solve linear inverse problems. Plug-and-play priors venkatakrishnan2013plug ; chan2017plug and Regularization by Denoising romano2017little have shown how image denoisers can be used to solve general linear inverse problems. A key example of this is BM3D-AMP, which applies a Block-Matching and 3D filtering (BM3D) denoiser to an Approximate Message Passing (D-AMP) algorithm metzler2016denoising ; metzler2015bm3d . AMP has also been applied to linear models in other contexts, e.g. schniter2016vector . Another related algorithm is TVAL3 zhang2013improved ; li2009user which leverages augmented Lagrangian multipliers to achieve impressive performance on compressed sensing problems. In many different settings, we compare our algorithm to these prior methods: BM3D-AMP, TVAL3, and Lasso.

### 2.2 Compressed Sensing: Learned Approaches

While sparsity in some chosen basis is well-established, recent work has shown better empirical performance when neural networks are used bora2017compressed . This success is attributed to the fact that neural networks are capable of learning image priors from very large datasets goodfellow2014generative ; kingma2013auto . There is significant recent work on solving linear inverse problems using various learned techniques, e.g. recurrent generative models mardani2017recurrent and auto-regressive models dave2018solving . Additionally approximate message passing (AMP) has been extended to a learned setting by Metzler et al. metzler2017learned .

Bora et al. bora2017compressed is the closest to our set-up. In this work the authors assume that the unknown signal is in the range of a pre-trained generative model such as a generative adversarial network (GAN) goodfellow2014generative

or variational autoencoder (VAE)

kingma2013auto . The recovery of the unknown signal is obtained via gradient descent in the latent space by searching for a signal that satisfies the measurements. This can be directly applied for linear inverse problems and more generally to any differentiable measurement process. Recent work has built upon these methods using new optimization techniques Chang17 , uncertainty autoencoders grover2018uncertainty , and other approaches dhar2018modeling ; kabkab2018task ; mixon2018sunlayer ; pandit2019asymptotics ; rusu2018meta . The key point is that all this prior work requires pre-trained generative models, in contrast to CS-DIP. Finally, there is significant ongoing work to understand DIP and develop related approaches, see e.g. heckel2018deep ; dittmer2018regularization .

## 3 Proposed Algorithm

Let be the signal that we are trying to reconstruct, be the measurement matrix, and be independent noise. Given the measurement matrix and the observations , we wish to reconstruct an that is close to .

A generative model is a deterministic function which takes as input a seed and is parameterized by “weights” , producing an output . These models have shown excellent performance generating real-life signals such as images goodfellow2014generative ; kingma2013auto and audio wavenet . We investigate deep convolutional generative models, a special case in which the model architecture has multiple cascaded layers of convolutional filters krizhevsky2012imagenet . In this paper we apply a DCGAN radford2015unsupervised model and restrict the signals to be images.

### 3.1 Compressed Sensing with Deep Image Prior (CS-DIP)

Our approach is to find a set of weights for the convolutional network such that the measurement matrix applied to the network output, i.e. , matches the measurements we are given. Hence we initialize an untrained network with some fixed and solve the following optimization problem:

 w∗=argminw∥y−AG(z;w)∥2. (2)

This is, of course, a non-convex problem because

is a complex feed-forward neural network. Still we can use gradient-based optimizers for any generative model and measurement process that is differentiable. Generator networks such as DCGAN are biased toward smooth, natural images due to their convolutional structure; thus the network structure alone provides a good prior for reconstructing images in problems such as inpainting and denoising

ulyanov2017deep . Our finding is that this applies to general linear measurement processes. We restrict our solution to lie in the span of a convolutional neural network. If a sufficient number of measurements is given, we obtain an output such that .

Note that this method uses an untrained generative model and optimizes over the network weights . In contrast previous methods, such as that of Bora et al. bora2017compressed , use a trained model and optimize over the latent -space, solving . We instead initialize a random with Gaussian i.i.d. entries and keep this fixed throughout the optimization process.

In our algorithm we leverage the well-established total variation regularization rudin1992nonlinear ; wang2008new ; liu2018image , denoted as . We also propose an additional learned regularization technique, ; note that without this technique, i.e. when , our method is completely unlearned. Lastly we use early stopping, a phenomena that will be analyzed theoretically in Section 4.

Thus the final optimization problem becomes

 w∗=argminw ∥y−AG(z;w)∥2+R(w;λT,λL). (3)

The regularization term contains hyperparameters

and for total variation and learned regularization: . We now discuss this term.

### 3.2 Learned Regularization

Without learned regularization CS-DIP relies only on linear measurements taken from one unknown image. We now introduce a novel method which leverages a small amount of training data to optimize regularization. In this case training data refers to measurements from additional ground truth of a similar type, e.g. other x-ray images.

To leverage this additional information, we pose Eqn. 3

as a Maximum a Posteriori (MAP) estimation problem and propose a novel prior on the weights of the generative model. This prior then acts as a regularization term, penalizing the model toward an optimal set of weights

.

For a set of weights , we model the likelihood of the measurements and the prior on the weights

as Gaussian distributions given by

where and .

In this setting we want to find a set of weights that maximizes the posterior on given , i.e.,

 w∗ = argmaxwp(w|y)≡argminw∥y−AG(z;w)∥2+λL(w−μ)TΣ−1(w−μ). (4)

This gives us the learned regularization term

 LR(w)=(w−μ)TΣ−1(w−μ), (5)

where the coefficient in Eqn. 4 controls the strength of the prior.

Notice that when and this regularization term is equivalent to -regularization. Thus this method can be thought of as a more strategic version of standard weight decay.

#### 3.2.1 Learning the Prior Parameters

In the previous section, we introduced the learned regularization term defined in Eqn. 5. However we have not yet learned values for parameters that incorporate prior knowledge of the network weights. We now propose a way to estimate these parameters.

Assume we have a set of measurements from different images , each obtained with a different measurement matrix . For each measurement we run CS-DIP to solve the optimization problem in Eqn. 3 and obtain an optimal set of weights . Note that when optimizing for the weights we only have access to the measurements , not the ground truth .

The number of weights in deep networks tends to be very large. As such, learning a distribution over each weight, i.e. estimating and , becomes intractable. We instead use a layer-wise approach: with network layers, we have and . Thus each weight within layer is modeled according to the same distribution. For simplicity we assume , i.e. that network weights are independent across layers. The process of estimating statistics from is described in Algorithm 1 of the appendix.

We use this learned in the regularization term from Eqn. 5 for reconstructing measurements of images. We refer to this technique as learned regularization

. While this may seem analogous to batch normalization

ioffe2015batch , note that we only use to penalize the -norm of the weights and do not normalize the layer outputs themselves.

#### 3.2.2 Discussion of Learned Regularization

The proposed CS-DIP does not require training if no learned regularization is used, i.e. if in Eqn. 3. This means that CS-DIP can be applied only with measurements from a single image and no prior information of similar images in a dataset.

Our next idea, learned regularization, utilizes a small amount of prior information, requiring access to measurements from a small number of similar images (roughly ). In contrast, other pre-trained models such as that of Bora et al. bora2017compressed require access to ground truth from a massive number of similar images (tens of thousands for CelebA). If such a large dataset is available, and if a good generative model can be trained on that dataset, we expect that pre-trained models bora2017compressed ; grover2018amortized ; kabkab2018task ; mardani2017recurrent would outperform our method. Our approach is instead more suitable for reconstructing problems where large amounts of data or good generative models are not readily available.

## 4 Theoretical Results

In this section we provide theoretical evidence to highlight the importance of early stopping for DIP-based approaches. Here we focus on denoising a noisy signal via DIP. The optimization problem in this case takes the form

 minw L(w):=∥y−G(z;w)∥2. (6)

This is a special instance of Eqn. 2 with the measurement matrix

corresponding to denoising. We focus on generators consisting of a single hidden-layer ReLU network with

inputs, hidden units, and outputs. Using the generator model in this case is given by

 G(z;W,V)=V⋅ReLU(Wz), (7)

where is the input, the input-to-hidden weights, and the hidden-to-output weights. We assume is fixed at random and train over using gradient descent. With these formulations in place, we are now ready to state our theoretical result.

###### Theorem 4.1.

Consider fitting a generator of the form to a signal with , , , and . Furthermore, assume

is a random matrix with i.i.d.

entries with . Starting from an initial weight matrix selected at random with i.i.d.  entries, we run gradient descent updates of the form on the loss

 L(W)=12∥∥V⋅ReLU(Wz)−y∥∥2,

with step size where . Assuming that

 d≥Cn,

with a fixed numerical constant, then

 ∥∥V⋅ReLU(Wτz)−y∥∥≤3(1−¯η8(4n+d))τ∥y∥

holds for all

with probability at least

.

Our theoretical result shows that after many iterative updates, gradient descent will solve this non-convex optimization problem and fit any signal , if the generator network is sufficiently wide. This occurs as soon as the number of hidden units exceeds the signal size by a constant factor. While our proof is for the case of , a similar result can be shown for other measurement matrices, since the resulting is essentially a Gaussian i.i.d. measurement matrix of different output dimension. This result demonstrates that early stopping is necessary for DIP-based methods to be successful; otherwise the network can fit any signal, including one that is noisy or corrupted.

Our proof builds on theoretical ideas from Oymak et al. oymak2019towards which provide a general framework for establishing global convergence guarantees for overparameterized nonlinear learning problems based on various properties of the Jacobian mapping along the gradient descent trajectory. See also du2018gradient ; Oymak:2018aa and references therein for other related literature. We combine delicate tools from empirical process theory, random matrix theory, and matrix algebra to show that, starting from a random initialization, the Jacobian mapping across all iterates has favorable properties with high probability, hence facilitating convergence to a global optima.

## 5 Experiments

To replicate these experiments or run new experiments using this method, please see our GitHub repository at github.com/davevanveen/compsensing_dip.

### 5.1 Experimental Setup

Measurements: We evaluate our algorithm using two different measurements processes, i.e. matrices . First we set the entries of to be Gaussian i.i.d. such that . Recall is the number of measurements, and is the number of pixels in the ground truth image. This measurement process is standard practice in compressed sensing literature, and hence we use it on each dataset. Additionally we use a Fourier measurement process common in MRI applications mardani2018neural ; mardani2017deep ; hammernik2018learning ; lehtinen2018noise2noise ; lustig2008compressed and evaluate it on the x-ray dataset.

Datasets: We use our algorithm to reconstruct both grayscale and RGB images. For grayscale we use the first 100 images in the test set of MNIST lecun1998gradient and also 60 random images from the Shenzhen Chest X-Ray Dataset jaeger2014two , downsampling a 512x512 crop to 256x256 pixels. For RGB we use retinopathy images from the STARE dataset hoover2000locating with 512x512 crops downsized to 128x128 pixels.

Baselines: We compare our algorithm to state-of-the-art unlearned methods such as BM3D-AMP metzler2016denoising ; metzler2015bm3d , TVAL3 li2011compressive ; li2009user ; zhang2013improved , and Lasso in a DCT basis ahmed1974discrete . We also evaluated the performance of Lasso in a Daubechies wavelet basis daubechies1988orthonormal ; wasilewski2010pywavelets but found this performed worse than Lasso - DCT on all datasets. Thus for simplicity we refer to Lasso - DCT as “Lasso” and do not include results of Lasso - Wavelet. To reconstruct RGB retinopathy images, we must use the colored version CBM3D-AMP. Unfortunately an RGB version of TVAL3 does not currently exist, although related TV algorithms such as FTVd perform similar tasks such as denoising RGB images wang2008new .

Metrics: To quantitatively evaluate the performance of our algorithm, we use per-pixel mean-squared error (MSE) between the reconstruction and true image , i.e. . Note that because these pixels are over the range , it’s possible for MSE to be greater than .

Implementation: To find a set of weights that minimize Eqn. 3

, we use PyTorch

paszke2017automatic with a DCGAN architecture. For baselines BM3D-AMP and TVAL3, we use the repositories provided by the authors Metzler et al. metzler2018 and Li et al. li2013 , respectively. For baseline reconstructions Lasso, we use scikit-learn scikit-learn . Section A in the appendix provides further details on our experimental procedures, e.g. choosing hyperparameters.

### 5.2 Experimental Results

#### Results: Learned Regularization

We first evaluate the benefits of learned regularization by comparing our algorithm with and without learned regularization, i.e. and , respectively. The latter setting is an unlearned method, as we are not leveraging () from a specific dataset. In the former setting we first learn () from a particular set of x-ray images; we then evaluate on a different set of x-ray images. We compare these two settings with varying noise and across different number of measurements.

Our results in Table 1 show that learned regularization does indeed provide benefit. This benefit tends to increase with more noise or fewer measurements. Thus we can infer that assuming a learned Gaussian distribution over weights is useful, especially when the original signal is noisy or significantly compressed.

#### Results: Unlearned CS-DIP

For the remainder of this section, we evaluate our algorithm in the noiseless case without learned regularization, i.e. when in Eqn. 1 and in Eqn. 3. Hence CS-DIP is completely unlearned; as such, we compare it to other state-of-the-art unlearned algorithms on various datasets and with different measurement matrices.

MNIST: In Figure 0(b) we plot reconstruction error with varying number of measurements of = 784. This demonstrates that our algorithm outperforms baselines in almost all cases. Figure 1(b) shows reconstructions for 75 measurements, while remaining reconstructions are in the appendix.

Chest X-Rays: In Figure 0(a) we plot reconstruction error with varying number of measurements of = 65536. Figure 1(a) shows reconstructions for 2000 measurements, while the remaining reconstructions are in the appendix. On this dataset we outperform all baselines except BM3D-AMP for higher . However for lower , e.g. when the ratio , BM3D-AMP often doesn’t converge. This finding seems to support the work of Metzler et al. metzler2015bm3d : BM3D-AMP performs impressively on higher , e.g. , but recovery at lower sampling rates is not demonstrated.

Retinopathy: We plot reconstruction error with varying number of measurements of = 49152 in Figure 2(a) of the appendix. On this RGB dataset we quantitatively outperform all baselines except BM3D-AMP on higher ; however, even at these higher , patches of green and purple pixels corrupt the image reconstructions as seen in Figure 9. Similar to x-ray for lower , BM3D-AMP often fails to produce anything sensible. All retinopathy reconstructions are located in the appendix.

Fourier Measurement Process: All previous experiments used a measurement matrix containing Gaussian i.i.d. entries. We now consider the case where the measurement matrix is a subsampled Fourier matrix. That is, for a 2D image and a set of indices , the measurements we receive are given by , where

is the 2D Fourier transform. We choose

to be indices along radial lines, as shown in Figure 12 of the appendix; this choice of is common in literature candes2006robust and MRI applications mardani2017deep ; lustig2008compressed ; eksioglu2018denoising . We compare our algorithm to baselines on the x-ray dataset for radial lines in the Fourier domain, which corresponds to Fourier coefficients, respectively. We plot reconstruction error with varying number of Fourier coefficients in Figure 2(b) of the appendix, outperforming baselinse BM3D-AMP and TVAL3. Reconstructions can also be found in the appendix.

Runtime: In Table 2 we show the runtimes of CS-DIP on the x-ray dataset. While runtime is not the focus of our work, because our algorithm can utilize GPU, it is competitive with or faster than baseline algorithms. The baselines are implemented in MATLAB or scikit-learn scikit-learn and only leverage CPU, while we run our experiments on a NVIDIA GTX 1080-Ti.

## 6 Conclusion

We demonstrate compressed sensing recovery using untrained, randomly initialized convolutional neural networks. Our method outperforms previous state-of-the-art unlearned methods in most cases, especially when the number of obtained measurements is small. Additionally we propose a learned regularization method, which enforces a learned Gaussian prior on the network weights. This prior reduces reconstruction error, particularly for noisy or compressed measurements. Finally we show that a sufficiently wide single-layer network can fit any signal, thus motivating regularization by early stopping.

## Appendix A Experimentation Details and Insights

Our algorithm CS-DIP is implemented in PyTorch using the RMSProp optimizer [74] with learning rate , momentum , and update steps for every set of measurements. These parameters are the same across all datasets.

We also made some dataset-specific design choices. On larger images such as xray () and retinopathy (), we found no difference using random restarts of the initial seed . However for smaller vectors such as MNIST (), restarts did provide some benefit. As such our experiments utilize 5 random restarts for MNIST and one initial seed (no restarts) for x-ray and retinopathy images. For choosing hyperparameters and in Eqn. 3, we used a standard grid search and selected the best one. We used a similar grid search procedure for choosing dataset-specific hyperparameters in baseline algorithms BM3D-AMP, TVAL3, and Lasso.

The network’s initial seed in Eqn. 3 is initialized with random Gaussian i.i.d. entries and then held fixed as we optimize over network weights . We found negligible difference when varying the dimension of (within reason), as this only affects the number of channels in the network’s first layer. As such we set the dimension of to be , a standard choice for DCGAN architectures.

We further note that the “Error vs. Iterations” curve of CS-DIP with RMSProp did not monotonically decrease for some learning rates, even though error gradually decreased in all cases. As such we implemented a stopping condition which chooses the reconstruction with least error over the last 20 iterations. Note we choose this reconstruction based off measurement loss and do not look at the ground truth image.

## Appendix B Proof of Section 4: Theoretical Justification for Early Stopping

In this section we prove our theoretical result in Theorem 4.1. We begin with a summary of some notations we use throughout in Section B.1. Next, we state some preliminary calculations in Section B.2. Then, we state a few key lemmas in Section B.3 with the proofs deferred to Appendix C. Finally, we complete the proof of Theorem 4.1 in Section B.4.

### b.1 Notation

In this section we gather some notation used throughout the proofs. We use ReLU with . For two matrices/vectors and of the same size we use to denote the entrywise Hadamard product of these two matrices/vectors. We also use to denote their Kronecker product. For two matrices and , we use the Khatrio-Rao product as the matrix with rows given by . For a matrix we use vect to denote a vector obtained by aggregating the rows of the matrix into a vector, i.e. vect. For a matrix we use and

denotes the minimum singular value and spectral norm of

. Similarly, for a symetric matrix we use to denote its smallest eigenvalue.

### b.2 Preliminaries

In this section we carryout some simple calculations yielding simple formulas for the gradient and Jacobian mappings. We begin by noting we can rewrite the gradient descent iterations in the form

 vect(Wτ+1)=vect(Wτ)−ηvect(∇L(Wτ)).

Here,

 vect(∇L(Wτ))=JT(Wτ)r(Wτ),

where

 J(W)=∂∂vect(W)f(W)and

is the Jacobian mapping associated to the network and

 r(W)=ϕ(Vϕ(Wz))−y.

is the misfit or residual vector. Note that

 ∂∂vect(W)vTϕ(Wz)= [v1ϕ′(wT1z)xTv2ϕ′(wT2z)xT…vkϕ′(wTkz)xT] = (v⊙ϕ′(Wx))T⊗xT

Thus

 J(W)=(Vdiag(ϕ′(Wz)))∗(1zT),

This in turn yields

 J(W)JT(W)= (Vdiag(ϕ′(Wz))diag(ϕ′(Wz))VT)⊙(∥z∥211T) = ∥z∥2Vdiag(ϕ′(Wz)⊙ϕ′(Wz))VT (8)

### b.3 Lemmas for controlling the spectrum of the Jacobian and initial misfit

In this section we state a few lemmas concerning the spectral properties of the Jacobian mapping, its perturbation and initial misfit of the model with the proofs deferred to Appendix C.

###### Lemma B.1 (Minimum singular value of the Jacobian at initialization).

Let and be random matrices with i.i.d.  and entries and define the Jacobian mapping . Then as long as ,

 σmin(J(W))≥12ν√d∥z∥.

holds with probability at least .

###### Lemma B.2 (Perturbation lemma).

Let be a matrix with i.i.d.  entries, , and define the Jacobian mapping . Also let be a matrix with i.i.d.  entries. Then,

 ∥J(W)−J(W0)∥≤ν∥z∥⎛⎜ ⎜⎝2√n+  ⎷6(2dR)23log⎛⎜⎝d3(2dR)23⎞⎟⎠⎞⎟ ⎟⎠,

holds for all obeying with probability at least .

###### Lemma B.3 (Spectral norm of the Jacobian).

Let be a matrix with i.i.d.  entries, , and define the Jacobian mapping . Then,

 ∥J(W)∥≤ν(√d+2√n)∥z∥,

holds for all with probability at least .

###### Lemma B.4 (Initial misfit).

Let be a matrix with i.i.d.  entries with . Also let be a matrix with i.i.d.  entries. Then

 ∥Vϕ(Wz)−y∥≤3∥y∥,

holds with probability at least .

### b.4 Proof of Theorem 4.1

Consider a nonlinear least-squares optimization problem of the form

 minθ∈Rp L(θ):=12∥f(θ)−y∥2,

with and . Suppose the Jacobian mapping associated with obeys the following three assumptions.

###### Assumption 1.

Fix a point . We have that .

###### Assumption 2.

Let denote a norm that is dominated by the Euclidean norm i.e.  holds for all . Fix a point and a number . For any satisfying , we have that .

###### Assumption 3.

For all , we have that .

Under these assumptions we can state the following theorem from [60].

###### Theorem B.5 (Non-smooth Overparameterized Optimization).

Given , suppose Assumptions 1, 2, and 3 hold with

 R=5∥y−f(θ0)∥α.

Then, picking constant learning rate , all gradient iterations obey the followings

 ∥y−f(θτ)∥≤(1−ηα24)τ∥y−f(θ0)∥ (9) α5∥θτ−θ0∥+∥y−f(θτ)∥≤∥y−f(θ0)∥. (10)

We shall apply this theorem to the case where the parameter is and the nonlinear mapping is given by and . All that is needed to be able to apply this theorem is check that the assumptions hold. Per the assumptions of the theorem we use

 ν=1√dn∥y∥∥z∥.

To this aim note that using Lemma B.1 Assumption 1 holds with

 α=14ν√d∥z∥=14√n∥y∥,

with probability at least . Furthermore, by Lemma B.3 Assumption 3 holds with

 β=∥y∥√8n√4n+d≥12(√d4n+1)∥y∥=ν(√d+2√n)∥z∥.

with probability at least . All that remains for applying the theorem above is to verify Assumption 2 holds with high probability

 R=60√n=15∥y∥α≥5α∥Vϕ(Wz)−y∥

In the above we have used Lemma B.4 to conclude that holds with probability at least . Thus, using Lemma B.2 all that remains is to show that

 1√dn∥y∥⎛⎜ ⎜⎝2√n+  ⎷6(2dR)23ln⎛⎜⎝d3(2dR)23⎞⎟⎠⎞⎟ ⎟⎠≤α3=112√n∥y∥,

holds with and with probability at least . The latter is equivalent to

which can be rewritten in the form

 2√nd+    ⎷6(120)233√ndln⎛⎜ ⎜⎝13(120)233√nd⎞⎟ ⎟⎠≤112,

which holds as long as . Thus with then Assumptions 1, 2, and 3 holds with probability at least . Thus, Theorem B.5 holds with high probability. Applying Theorem B.5 completes the proof.

## Appendix C Proof of Lemmas for the Spectral Properties of the Jacobian

### c.1 Proof of Lemma b.1

We prove the result for , the general result follows from a simple re-scaling. Define the vectors

 aℓ=Vℓϕ′(⟨wℓ,z⟩)∈Rn,

with the th column of . Using (B.2) we have

 J(W)JT(W)= ∥z∥2Vdiag(ϕ′(Wz)⊙ϕ′(Wz))VT, = ∥z∥2(d∑ℓ=1aℓaTℓ), = d∥z∥2(1dd∑ℓ=1aℓaTℓ). (11)

To bound the minimum eigenvalue we state a result from [58].

###### Theorem C.1.

Assume

are i.i.d. random positive semidefinite matrices whose coordinates have bounded second moments. Define

(this is an entry-wise expectation) and

 ˆΣd=1dd∑ℓ=1Aℓ.

Let be such that for all . Then for any we have

 P{∀u∈Rn:uTˆΣku≥⎛⎝1−7h√n+2ln(2/δ)d⎞⎠uTΣu}≥1−δ

We shall apply this theorem with . To do this we need to calculate the various parameters in the theorem. We begin with and note that for ReLU we have

 Σ:= E[A1] = E[a1aT1] = = = Ew∼N(0,Ik)[I{wTz≥0}]In = 12In.

To calculate we have

 √E[(uTA1u)2]≤ √E[(aT1u)4] ≤ ≤ √32∥u∥4 ≤ √3√2∥u∥2 = √6uT(12In)u = √6⋅uTΣu.

Thus we can take . Therefore, using Theorem C.1 with we can conclude that

 λmin(1dd∑ℓ=1aℓaTℓ)≥14

holds with probability at least as long as

 d≥3528⋅n.

Plugging this into (C.1) we conclude that with probability at least

 σmin(J(W))≥12√d∥z∥.

### c.2 Proof of Lemma b.2

We prove the result for , the general result follows from a simple rescaling. Based on (B.2) we have

 (J(W)−J(W0))(J(W)−J(W0))T=∥z∥2Vdiag((ϕ′(Wz)−ϕ′(W0z))⊙(ϕ′(Wz)−ϕ′(W0z)))VT.

Thus

 ∥J(W)−J(W0)∥≤ ∥z∥∥∥Vdiag(ϕ′(Wz)−ϕ′(W0z))∥∥c (12) = ∥z∥∥∥Vdiag(I{Wz≥0}−I{W0z≥0})∥∥ ≤ ∥z∥∥∥VS(W)∥∥, (13)

where is the set of indices where and have different signs i.e.  and is a submatrix obtained by picking the columns corresponding to .

To continue further note that by Gordon’s lemma we have

 sup|S|≤s∥VS∥≤√n+√2slog(d/s)+t,

with probability at least . In particular using we conclude that

 sup|S|≤s∥VS∥≤2√n+√2slog(d/s), (14)

with probability at least . To continue further we state a lemma controlling the size of based on the size of the radius .

###### Lemma C.2 (sign changes in local neighborhood).

Let be a matrix with i.i.d.  entries. Also for a matrix define . Then for any obeying

 |S(W)|≤2⌈(2dR)23⌉

holds with probability at least .

Combining (12) together with (14) (using ) and Lemma C.2 we conclude that

 ∥J(W)−J(W0)∥≤∥z∥⎛⎜ ⎜⎝2√n+  ⎷6(2dR)23log⎛⎜⎝d3(2dR)23⎞⎟⎠⎞⎟ ⎟⎠

holds with probability at least .

### c.3 Proof of Lemma c.2

To prove this result we utilize two lemmas from [60]. In these lemmas we use to denote the th smallest entry of after sorting its entries in terms of absolute value.

###### Lemma C.3.

[60, Lemma C.2] Given an integer , suppose

 ∥W−W0∥≤√m|W0z|m−∥z∥,

then

 |S(W)|≤2m.
###### Lemma C.4.

[60, Lemma C.3] Let . Also let be a matrix with i.i.d.  entries. Then, with probability at least ,

 |W0z|m−∥z∥≥m2d.

Combining the latter two lemmas with we conclude that when

 ∥W−W0∥≤ R ≤ m322d ≤ √mm2d ≤ √m|W0z|m−∥z∥,

then with probability at least we have

 |S