compsensing_dip
Compressed sensing with deep image prior algorithm
view repo
We propose a novel method for compressed sensing recovery using untrained deep generative models. Our method is based on the recently proposed Deep Image Prior (DIP), wherein the convolutional weights of the network are optimized to match the observed measurements. We show that this approach can be applied to solve any differentiable inverse problem. We also introduce a novel learned regularization technique which incorporates a small amount of prior information, further reducing the number of measurements required for a given reconstruction error. Our algorithm requires approximately 4-6x fewer measurements than classical Lasso methods. Unlike previous approaches based on generative models, our method does not require the model to be pre-trained. As such, we can apply our method to various medical imaging datasets for which data acquisition is expensive and no known generative models exist.
READ FULL TEXT VIEW PDFCompressed sensing with deep image prior algorithm
We consider the well-studied compressed sensing problem of recovering an unknown signal by observing a set of noisy measurements of the form
(1) |
Here is a known measurement matrix, typically generated with random independent Gaussian entries. Since the number of measurements is smaller than the dimension
of the unknown vector
, this is an under-determined system of noisy linear equations and hence ill-posed. There are many solutions, and some structure must be assumed on to have any hope of recovery. Pioneering research donoho2006compressed ; candes2006robust ; candes2005decoding established that if is assumed to be sparse in a known basis, a small number of measurements will be provably sufficient to recover the unknown vector in polynomial time using methods such as Lasso tibshirani1996regression .Sparsity approaches have proven successful, but more complex models with additional structure have been recently proposed such as model-based compressive sensing baraniuk2010model and manifold models hegde2008random ; hegde2012signal ; eftekhari2015new . Bora et al. bora2017compressed
showed that deep generative models can be used as excellent priors for images. They also showed that backpropagation can be used to solve the signal recovery problem by performing gradient descent in the generative latent space. This method enabled image generation with significantly fewer measurements compared to Lasso for a given reconstruction error. Compressed sensing using deep generative models was further improved in very recent work
tripathi2018correction ; grover2018amortized ; kabkab2018task ; shah2018solving ; fletcher2017inference ; DBLP:journals/corr/abs-1802-04073 . Additionally a theoretical analysis of the nonconvex gradient descent algorithm bora2017compressed was proposed by Hand et al. hand2017global under some assumptions on the generative model.Inspired by these impressive benefits of deep generative models, we chose to investigate the application of such methods for medical imaging, a canonical application of compressive sensing. A significant problem, however, is that all these previous methods require the existence of pre-trained models. While this has been achieved for various types of images, e.g. human faces of CelebA liu2015faceattributes via DCGAN radford2015unsupervised , it remains significantly more challenging for medical images wolterink2017generative ; schlegl2017unsupervised ; nie2017medical ; schlemper2017deep . Instead of addressing this problem in generative models, we found an easier way to circumvent it.
Surprising recent work by Ulyanov et al. ulyanov2017deep proposed Deep Image Prior (DIP), which uses untrainedconvolutional neural networks. In DIP-based schemes, a convolutional neural network generator (e.g. DCGAN) is initialized with random weights; these weights are subsequently optimized to make the network produce an output as close to the target image as possible. This procedure is unlearned, using no prior information from other images. The prior is enforced only by the fixed convolutional structure of the generator network.
Generators used for DIP are typically over-parameterized, i.e. the number of network weights is much larger compared to the output dimension. For this reason DIP has empirically been found to overfit to noise if run for too many iterations ulyanov2017deep . In this paper we theoretically prove that this phenomenon occurs with gradient descent and justify the use of early stopping and other regularization methods.
Our Contributions:
In Section 3 we propose DIP for compressed sensing (CS-DIP). Our basic method is as follows. Initialize a DCGAN generator with random weights; use gradient descent to optimize these weights such that the network produces an output which agrees with the observed measurements as much as possible. This unlearned method can be improved with a novel learned regularization technique, which regularizes the DCGAN weights throughout the optimization process.
In Section 4 we theoretically prove that DIP will fit any signal to zero error with gradient descent. Our result is established for a network with a single hidden layer and sufficient constant fraction over-parametrization. While it is expected that over-parametrized neural networks can fit any signal, the fact that gradient descent can provably solve this non-convex problem is interesting and provides theoretical justification for early stopping.
In Section 5 we empirically show that CS-DIP outperforms previous unlearned methods in many cases. While pre-trained or “learned” methods will likely perform better bora2017compressed , we have the advantage of not requiring a generative model trained over large datasets. As such, we can apply our method to various medical imaging datasets for which data acquisition is expensive and generative models are difficult to train.
A classical assumption made in compressed sensing is that the vector is -sparse in some basis such as wavelet or discrete cosine transform (DCT). Finding the sparsest solution to an underdetermined linear system of equations is NP-hard in general; however, if the matrix
satisfies conditions such as the Restricted Eigenvalue Condition (REC) or Restricted Isometry Property (RIP)
candes2006stable ; bickel2009simultaneous ; donoho2006compressed ; tibshirani1996regression , then can be recovered in polynomial time via convex relaxations tropp2006just or iterative methods. There is extensive compressed sensing literature regarding assumptions on , numerous recovery algorithms, and variations of RIP and REC bickel2009simultaneous ; negahban2009unified ; agarwal2010fast ; bach2012optimization ; loh2011high .Compressed sensing methods have found many applications in imaging, for example the single-pixel camera (SPC) duarte2008single . Medical tomographic applications include x-ray radiography, microwave imaging, magnetic resonance imaging (MRI) winters2010sparsity ; chen2008prior ; lustig2007sparse . Obtaining measurements for medical imaging can be costly, time-consuming, and in some cases dangerous to the patient qaisar2013compressive . As such, an important goal is to reduce the number of measurements while maintaining good reconstruction quality.
Aside from the classical use of sparsity, recent work has used other priors to solve linear inverse problems. Plug-and-play priors venkatakrishnan2013plug ; chan2017plug and Regularization by Denoising romano2017little have shown how image denoisers can be used to solve general linear inverse problems. A key example of this is BM3D-AMP, which applies a Block-Matching and 3D filtering (BM3D) denoiser to an Approximate Message Passing (D-AMP) algorithm metzler2016denoising ; metzler2015bm3d . AMP has also been applied to linear models in other contexts, e.g. schniter2016vector . Another related algorithm is TVAL3 zhang2013improved ; li2009user which leverages augmented Lagrangian multipliers to achieve impressive performance on compressed sensing problems. In many different settings, we compare our algorithm to these prior methods: BM3D-AMP, TVAL3, and Lasso.
While sparsity in some chosen basis is well-established, recent work has shown better empirical performance when neural networks are used bora2017compressed . This success is attributed to the fact that neural networks are capable of learning image priors from very large datasets goodfellow2014generative ; kingma2013auto . There is significant recent work on solving linear inverse problems using various learned techniques, e.g. recurrent generative models mardani2017recurrent and auto-regressive models dave2018solving . Additionally approximate message passing (AMP) has been extended to a learned setting by Metzler et al. metzler2017learned .
Bora et al. bora2017compressed is the closest to our set-up. In this work the authors assume that the unknown signal is in the range of a pre-trained generative model such as a generative adversarial network (GAN) goodfellow2014generative
or variational autoencoder (VAE)
kingma2013auto . The recovery of the unknown signal is obtained via gradient descent in the latent space by searching for a signal that satisfies the measurements. This can be directly applied for linear inverse problems and more generally to any differentiable measurement process. Recent work has built upon these methods using new optimization techniques Chang17 , uncertainty autoencoders grover2018uncertainty , and other approaches dhar2018modeling ; kabkab2018task ; mixon2018sunlayer ; pandit2019asymptotics ; rusu2018meta . The key point is that all this prior work requires pre-trained generative models, in contrast to CS-DIP. Finally, there is significant ongoing work to understand DIP and develop related approaches, see e.g. heckel2018deep ; dittmer2018regularization .Let be the signal that we are trying to reconstruct, be the measurement matrix, and be independent noise. Given the measurement matrix and the observations , we wish to reconstruct an that is close to .
A generative model is a deterministic function which takes as input a seed and is parameterized by “weights” , producing an output . These models have shown excellent performance generating real-life signals such as images goodfellow2014generative ; kingma2013auto and audio wavenet . We investigate deep convolutional generative models, a special case in which the model architecture has multiple cascaded layers of convolutional filters krizhevsky2012imagenet . In this paper we apply a DCGAN radford2015unsupervised model and restrict the signals to be images.
Our approach is to find a set of weights for the convolutional network such that the measurement matrix applied to the network output, i.e. , matches the measurements we are given. Hence we initialize an untrained network with some fixed and solve the following optimization problem:
(2) |
This is, of course, a non-convex problem because
is a complex feed-forward neural network. Still we can use gradient-based optimizers for any generative model and measurement process that is differentiable. Generator networks such as DCGAN are biased toward smooth, natural images due to their convolutional structure; thus the network structure alone provides a good prior for reconstructing images in problems such as inpainting and denoising
ulyanov2017deep . Our finding is that this applies to general linear measurement processes. We restrict our solution to lie in the span of a convolutional neural network. If a sufficient number of measurements is given, we obtain an output such that .Note that this method uses an untrained generative model and optimizes over the network weights . In contrast previous methods, such as that of Bora et al. bora2017compressed , use a trained model and optimize over the latent -space, solving . We instead initialize a random with Gaussian i.i.d. entries and keep this fixed throughout the optimization process.
In our algorithm we leverage the well-established total variation regularization rudin1992nonlinear ; wang2008new ; liu2018image , denoted as . We also propose an additional learned regularization technique, ; note that without this technique, i.e. when , our method is completely unlearned. Lastly we use early stopping, a phenomena that will be analyzed theoretically in Section 4.
Thus the final optimization problem becomes
(3) |
The regularization term contains hyperparameters
and for total variation and learned regularization: . We now discuss this term.Without learned regularization CS-DIP relies only on linear measurements taken from one unknown image. We now introduce a novel method which leverages a small amount of training data to optimize regularization. In this case training data refers to measurements from additional ground truth of a similar type, e.g. other x-ray images.
To leverage this additional information, we pose Eqn. 3
as a Maximum a Posteriori (MAP) estimation problem and propose a novel prior on the weights of the generative model. This prior then acts as a regularization term, penalizing the model toward an optimal set of weights
.For a set of weights , we model the likelihood of the measurements and the prior on the weights
as Gaussian distributions given by
where and .
In this setting we want to find a set of weights that maximizes the posterior on given , i.e.,
(4) |
This gives us the learned regularization term
(5) |
where the coefficient in Eqn. 4 controls the strength of the prior.
Notice that when and this regularization term is equivalent to -regularization. Thus this method can be thought of as a more strategic version of standard weight decay.
In the previous section, we introduced the learned regularization term defined in Eqn. 5. However we have not yet learned values for parameters that incorporate prior knowledge of the network weights. We now propose a way to estimate these parameters.
Assume we have a set of measurements from different images , each obtained with a different measurement matrix . For each measurement we run CS-DIP to solve the optimization problem in Eqn. 3 and obtain an optimal set of weights . Note that when optimizing for the weights we only have access to the measurements , not the ground truth .
The number of weights in deep networks tends to be very large. As such, learning a distribution over each weight, i.e. estimating and , becomes intractable. We instead use a layer-wise approach: with network layers, we have and . Thus each weight within layer is modeled according to the same distribution. For simplicity we assume , i.e. that network weights are independent across layers. The process of estimating statistics from is described in Algorithm 1 of the appendix.
We use this learned in the regularization term from Eqn. 5 for reconstructing measurements of images. We refer to this technique as learned regularization
. While this may seem analogous to batch normalization
ioffe2015batch , note that we only use to penalize the -norm of the weights and do not normalize the layer outputs themselves.The proposed CS-DIP does not require training if no learned regularization is used, i.e. if in Eqn. 3. This means that CS-DIP can be applied only with measurements from a single image and no prior information of similar images in a dataset.
Our next idea, learned regularization, utilizes a small amount of prior information, requiring access to measurements from a small number of similar images (roughly ). In contrast, other pre-trained models such as that of Bora et al. bora2017compressed require access to ground truth from a massive number of similar images (tens of thousands for CelebA). If such a large dataset is available, and if a good generative model can be trained on that dataset, we expect that pre-trained models bora2017compressed ; grover2018amortized ; kabkab2018task ; mardani2017recurrent would outperform our method. Our approach is instead more suitable for reconstructing problems where large amounts of data or good generative models are not readily available.
In this section we provide theoretical evidence to highlight the importance of early stopping for DIP-based approaches. Here we focus on denoising a noisy signal via DIP. The optimization problem in this case takes the form
(6) |
This is a special instance of Eqn. 2 with the measurement matrix
corresponding to denoising. We focus on generators consisting of a single hidden-layer ReLU network with
inputs, hidden units, and outputs. Using the generator model in this case is given by(7) |
where is the input, the input-to-hidden weights, and the hidden-to-output weights. We assume is fixed at random and train over using gradient descent. With these formulations in place, we are now ready to state our theoretical result.
Consider fitting a generator of the form to a signal with , , , and . Furthermore, assume
is a random matrix with i.i.d.
entries with . Starting from an initial weight matrix selected at random with i.i.d. entries, we run gradient descent updates of the form on the losswith step size where . Assuming that
with a fixed numerical constant, then
holds for all
with probability at least
.Our theoretical result shows that after many iterative updates, gradient descent will solve this non-convex optimization problem and fit any signal , if the generator network is sufficiently wide. This occurs as soon as the number of hidden units exceeds the signal size by a constant factor. While our proof is for the case of , a similar result can be shown for other measurement matrices, since the resulting is essentially a Gaussian i.i.d. measurement matrix of different output dimension. This result demonstrates that early stopping is necessary for DIP-based methods to be successful; otherwise the network can fit any signal, including one that is noisy or corrupted.
Our proof builds on theoretical ideas from Oymak et al. oymak2019towards which provide a general framework for establishing global convergence guarantees for overparameterized nonlinear learning problems based on various properties of the Jacobian mapping along the gradient descent trajectory. See also du2018gradient ; Oymak:2018aa and references therein for other related literature. We combine delicate tools from empirical process theory, random matrix theory, and matrix algebra to show that, starting from a random initialization, the Jacobian mapping across all iterates has favorable properties with high probability, hence facilitating convergence to a global optima.
To replicate these experiments or run new experiments using this method, please see our GitHub repository at github.com/davevanveen/compsensing_dip.
Measurements, | |||||
---|---|---|---|---|---|
500 | 1000 | 2000 | 4000 | 8000 | |
0 | 9.9% | 2.9% | 0.2% | 2.0% | 0.6% |
10 | 11.6% | 4.6% | 4.5% | 2.4% | 1.0% |
100 | 14.9% | 19.2% | 5.0% | 3.9% | 2.8% |
1000 | 37.4% | 30.6% | 19.8% | 3.0% | 6.2% |
corresponds to variance of the noise vector
in Eqn. 1, i.e. each entry of is drawn independently . These results indicate that LR tends to provide greater benefit with noisy signals and with fewer measurements.Measurements: We evaluate our algorithm using two different measurements processes, i.e. matrices . First we set the entries of to be Gaussian i.i.d. such that . Recall is the number of measurements, and is the number of pixels in the ground truth image. This measurement process is standard practice in compressed sensing literature, and hence we use it on each dataset. Additionally we use a Fourier measurement process common in MRI applications mardani2018neural ; mardani2017deep ; hammernik2018learning ; lehtinen2018noise2noise ; lustig2008compressed and evaluate it on the x-ray dataset.
Datasets: We use our algorithm to reconstruct both grayscale and RGB images. For grayscale we use the first 100 images in the test set of MNIST lecun1998gradient and also 60 random images from the Shenzhen Chest X-Ray Dataset jaeger2014two , downsampling a 512x512 crop to 256x256 pixels. For RGB we use retinopathy images from the STARE dataset hoover2000locating with 512x512 crops downsized to 128x128 pixels.
Baselines: We compare our algorithm to state-of-the-art unlearned methods such as BM3D-AMP metzler2016denoising ; metzler2015bm3d , TVAL3 li2011compressive ; li2009user ; zhang2013improved , and Lasso in a DCT basis ahmed1974discrete . We also evaluated the performance of Lasso in a Daubechies wavelet basis daubechies1988orthonormal ; wasilewski2010pywavelets but found this performed worse than Lasso - DCT on all datasets. Thus for simplicity we refer to Lasso - DCT as “Lasso” and do not include results of Lasso - Wavelet. To reconstruct RGB retinopathy images, we must use the colored version CBM3D-AMP. Unfortunately an RGB version of TVAL3 does not currently exist, although related TV algorithms such as FTVd perform similar tasks such as denoising RGB images wang2008new .
Metrics: To quantitatively evaluate the performance of our algorithm, we use per-pixel mean-squared error (MSE) between the reconstruction and true image , i.e. . Note that because these pixels are over the range , it’s possible for MSE to be greater than .
Implementation: To find a set of weights that minimize Eqn. 3
, we use PyTorch
paszke2017automatic with a DCGAN architecture. For baselines BM3D-AMP and TVAL3, we use the repositories provided by the authors Metzler et al. metzler2018 and Li et al. li2013 , respectively. For baseline reconstructions Lasso, we use scikit-learn scikit-learn . Section A in the appendix provides further details on our experimental procedures, e.g. choosing hyperparameters.Per-pixel reconstruction error (MSE) vs. number of measurements. Vertical bars indicate 95% confidence intervals. BM3D-AMP frequently fails to converge for fewer than
measurements on x-ray images, as denoted by error values far above the vertical axis.We first evaluate the benefits of learned regularization by comparing our algorithm with and without learned regularization, i.e. and , respectively. The latter setting is an unlearned method, as we are not leveraging () from a specific dataset. In the former setting we first learn () from a particular set of x-ray images; we then evaluate on a different set of x-ray images. We compare these two settings with varying noise and across different number of measurements.
Our results in Table 1 show that learned regularization does indeed provide benefit. This benefit tends to increase with more noise or fewer measurements. Thus we can infer that assuming a learned Gaussian distribution over weights is useful, especially when the original signal is noisy or significantly compressed.
For the remainder of this section, we evaluate our algorithm in the noiseless case without learned regularization, i.e. when in Eqn. 1 and in Eqn. 3. Hence CS-DIP is completely unlearned; as such, we compare it to other state-of-the-art unlearned algorithms on various datasets and with different measurement matrices.
MNIST: In Figure 0(b) we plot reconstruction error with varying number of measurements of = 784. This demonstrates that our algorithm outperforms baselines in almost all cases. Figure 1(b) shows reconstructions for 75 measurements, while remaining reconstructions are in the appendix.
Chest X-Rays: In Figure 0(a) we plot reconstruction error with varying number of measurements of = 65536. Figure 1(a) shows reconstructions for 2000 measurements, while the remaining reconstructions are in the appendix. On this dataset we outperform all baselines except BM3D-AMP for higher . However for lower , e.g. when the ratio , BM3D-AMP often doesn’t converge. This finding seems to support the work of Metzler et al. metzler2015bm3d : BM3D-AMP performs impressively on higher , e.g. , but recovery at lower sampling rates is not demonstrated.
Retinopathy: We plot reconstruction error with varying number of measurements of = 49152 in Figure 2(a) of the appendix. On this RGB dataset we quantitatively outperform all baselines except BM3D-AMP on higher ; however, even at these higher , patches of green and purple pixels corrupt the image reconstructions as seen in Figure 9. Similar to x-ray for lower , BM3D-AMP often fails to produce anything sensible. All retinopathy reconstructions are located in the appendix.
Fourier Measurement Process: All previous experiments used a measurement matrix containing Gaussian i.i.d. entries. We now consider the case where the measurement matrix is a subsampled Fourier matrix. That is, for a 2D image and a set of indices , the measurements we receive are given by , where
is the 2D Fourier transform. We choose
to be indices along radial lines, as shown in Figure 12 of the appendix; this choice of is common in literature candes2006robust and MRI applications mardani2017deep ; lustig2008compressed ; eksioglu2018denoising . We compare our algorithm to baselines on the x-ray dataset for radial lines in the Fourier domain, which corresponds to Fourier coefficients, respectively. We plot reconstruction error with varying number of Fourier coefficients in Figure 2(b) of the appendix, outperforming baselinse BM3D-AMP and TVAL3. Reconstructions can also be found in the appendix.Runtime: In Table 2 we show the runtimes of CS-DIP on the x-ray dataset. While runtime is not the focus of our work, because our algorithm can utilize GPU, it is competitive with or faster than baseline algorithms. The baselines are implemented in MATLAB or scikit-learn scikit-learn and only leverage CPU, while we run our experiments on a NVIDIA GTX 1080-Ti.
Algorithm | 1000 | 2000 | 4000 | 8000 |
---|---|---|---|---|
CS-DIP | 15.6 | 17.1 | 20.4 | 29.9 |
BM3D-AMP | 51.1 | 54.0 | 67.8 | 71.2 |
TVAL3 | 13.8 | 22.1 | 31.9 | 56.7 |
Lasso DCT | 27.1 | 33.0 | 52.2 | 96.4 |
We demonstrate compressed sensing recovery using untrained, randomly initialized convolutional neural networks. Our method outperforms previous state-of-the-art unlearned methods in most cases, especially when the number of obtained measurements is small. Additionally we propose a learned regularization method, which enforces a learned Gaussian prior on the network weights. This prior reduces reconstruction error, particularly for noisy or compressed measurements. Finally we show that a sufficiently wide single-layer network can fit any signal, thus motivating regularization by early stopping.
Fast global convergence rates of gradient methods for high-dimensional statistical recovery.
In Advances in Neural Information Processing Systems, pages 37–45, 2010.Foundations and Trends® in Machine Learning
, 4(1):1–106, 2012.Decoding by linear programming.
IEEE transactions on information theory, 51(12):4203–4215, 2005.Proceedings of International Conference on Computer Vision (ICCV)
, 2015.Unsupervised anomaly detection with generative adversarial networks to guide marker discovery.
In International Conference on Information Processing in Medical Imaging, pages 146–157. Springer, 2017.Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude.
COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.Our algorithm CS-DIP is implemented in PyTorch using the RMSProp optimizer [74] with learning rate , momentum , and update steps for every set of measurements. These parameters are the same across all datasets.
We also made some dataset-specific design choices. On larger images such as xray () and retinopathy (), we found no difference using random restarts of the initial seed . However for smaller vectors such as MNIST (), restarts did provide some benefit. As such our experiments utilize 5 random restarts for MNIST and one initial seed (no restarts) for x-ray and retinopathy images. For choosing hyperparameters and in Eqn. 3, we used a standard grid search and selected the best one. We used a similar grid search procedure for choosing dataset-specific hyperparameters in baseline algorithms BM3D-AMP, TVAL3, and Lasso.
The network’s initial seed in Eqn. 3 is initialized with random Gaussian i.i.d. entries and then held fixed as we optimize over network weights . We found negligible difference when varying the dimension of (within reason), as this only affects the number of channels in the network’s first layer. As such we set the dimension of to be , a standard choice for DCGAN architectures.
We further note that the “Error vs. Iterations” curve of CS-DIP with RMSProp did not monotonically decrease for some learning rates, even though error gradually decreased in all cases. As such we implemented a stopping condition which chooses the reconstruction with least error over the last 20 iterations. Note we choose this reconstruction based off measurement loss and do not look at the ground truth image.
In this section we prove our theoretical result in Theorem 4.1. We begin with a summary of some notations we use throughout in Section B.1. Next, we state some preliminary calculations in Section B.2. Then, we state a few key lemmas in Section B.3 with the proofs deferred to Appendix C. Finally, we complete the proof of Theorem 4.1 in Section B.4.
In this section we gather some notation used throughout the proofs. We use ReLU with . For two matrices/vectors and of the same size we use to denote the entrywise Hadamard product of these two matrices/vectors. We also use to denote their Kronecker product. For two matrices and , we use the Khatrio-Rao product as the matrix with rows given by . For a matrix we use vect to denote a vector obtained by aggregating the rows of the matrix into a vector, i.e. vect. For a matrix we use and
denotes the minimum singular value and spectral norm of
. Similarly, for a symetric matrix we use to denote its smallest eigenvalue.In this section we carryout some simple calculations yielding simple formulas for the gradient and Jacobian mappings. We begin by noting we can rewrite the gradient descent iterations in the form
Here,
where
is the Jacobian mapping associated to the network and
is the misfit or residual vector. Note that
Thus
This in turn yields
(8) |
In this section we state a few lemmas concerning the spectral properties of the Jacobian mapping, its perturbation and initial misfit of the model with the proofs deferred to Appendix C.
Let and be random matrices with i.i.d. and entries and define the Jacobian mapping . Then as long as ,
holds with probability at least .
Let be a matrix with i.i.d. entries, , and define the Jacobian mapping . Also let be a matrix with i.i.d. entries. Then,
holds for all obeying with probability at least .
Let be a matrix with i.i.d. entries, , and define the Jacobian mapping . Then,
holds for all with probability at least .
Let be a matrix with i.i.d. entries with . Also let be a matrix with i.i.d. entries. Then
holds with probability at least .
Consider a nonlinear least-squares optimization problem of the form
with and . Suppose the Jacobian mapping associated with obeys the following three assumptions.
Fix a point . We have that .
Let denote a norm that is dominated by the Euclidean norm i.e. holds for all . Fix a point and a number . For any satisfying , we have that .
For all , we have that .
Under these assumptions we can state the following theorem from [60].
We shall apply this theorem to the case where the parameter is and the nonlinear mapping is given by and . All that is needed to be able to apply this theorem is check that the assumptions hold. Per the assumptions of the theorem we use
To this aim note that using Lemma B.1 Assumption 1 holds with
with probability at least . Furthermore, by Lemma B.3 Assumption 3 holds with
with probability at least . All that remains for applying the theorem above is to verify Assumption 2 holds with high probability
In the above we have used Lemma B.4 to conclude that holds with probability at least . Thus, using Lemma B.2 all that remains is to show that
holds with and with probability at least . The latter is equivalent to
which can be rewritten in the form
which holds as long as . Thus with then Assumptions 1, 2, and 3 holds with probability at least . Thus, Theorem B.5 holds with high probability. Applying Theorem B.5 completes the proof.
We prove the result for , the general result follows from a simple re-scaling. Define the vectors
with the th column of . Using (B.2) we have
(11) |
To bound the minimum eigenvalue we state a result from [58].
Assume
are i.i.d. random positive semidefinite matrices whose coordinates have bounded second moments. Define
(this is an entry-wise expectation) andLet be such that for all . Then for any we have
We shall apply this theorem with . To do this we need to calculate the various parameters in the theorem. We begin with and note that for ReLU we have
To calculate we have
Thus we can take . Therefore, using Theorem C.1 with we can conclude that
holds with probability at least as long as
Plugging this into (C.1) we conclude that with probability at least
We prove the result for , the general result follows from a simple rescaling. Based on (B.2) we have
Thus
(12) | ||||
(13) |
where is the set of indices where and have different signs i.e. and is a submatrix obtained by picking the columns corresponding to .
To continue further note that by Gordon’s lemma we have
with probability at least . In particular using we conclude that
(14) |
with probability at least . To continue further we state a lemma controlling the size of based on the size of the radius .
Let be a matrix with i.i.d. entries. Also for a matrix define . Then for any obeying
holds with probability at least .
To prove this result we utilize two lemmas from [60]. In these lemmas we use to denote the th smallest entry of after sorting its entries in terms of absolute value.
[60, Lemma C.3] Let . Also let be a matrix with i.i.d. entries. Then, with probability at least ,
Combining the latter two lemmas with we conclude that when
then with probability at least we have
We prove the result for