The "deep image prior" proposed by Ulyanov et al. is an intriguing property of neural nets: a convolutional encoder-decoder network can be used as a prior for natural images. The network architecture implicitly introduces a bias: if we train the model to map white noise to a corrupted image, this bias guides the model to fit the true image before fitting the corrupted regions. This paper explores why the deep image prior helps in denoising natural images. We present a novel method to analyze trajectories generated by the deep image prior optimization and demonstrate that (i) the convolution layers of an encoder-decoder decouple the frequency components of the image, learning each at a different rate, and (ii) the model fits lower frequencies first, making early stopping behave as a low-pass filter. The experiments study an extension of Cheng et al., which showed that at initialization the deep image prior is equivalent to a stationary Gaussian process.
It is well known that large neural nets have the capacity to overfit training data, even fitting random labels perfectly (Zhang et al., 2016). Arpit et al. (2017) confirmed this, but showed that networks learn "simple patterns" first. This may explain why these models, despite their capacity, generalize well with early stopping (Neyshabur et al., 2014, 2017; Gunasekar et al., 2018; Soudry et al., 2018). One way to formalize the notion of a simple pattern is to observe the frequency components of the function learned by the model. Simple patterns are smooth, composed of low frequencies. A number of works use this approach to describe the bias induced by deep networks. For example, Xu et al. (2018, 2019) study the F-principle, which states that models first learn the low-frequency components when fitting a signal. Rahaman et al. (2018) demonstrated a similar spectral bias of a deep fully-connected network with ReLU activation.
The Deep Image Prior (DIP) considers the following setup. Let $f_\theta : \mathcal{X} \to \mathcal{Y}$ be a convolutional encoder-decoder parameterized by $\theta$, where $\mathcal{X}$ and $\mathcal{Y}$ are spaces of $d$-dimensional signals. DIP studies the following optimization:

$$\min_\theta \; \| f_\theta(z) - x_0 \|_2^2,$$

where $z \in \mathcal{X}$ is a fixed $d$-dimensional white noise vector and $x_0 \in \mathcal{Y}$ is the signal to fit.

$T$ steps of gradient descent on this objective trace out a trajectory in the parameter space, $\theta_1, \dots, \theta_T$, with a corresponding trajectory in the output space $\mathcal{Y}$: $f_{\theta_1}(z), \dots, f_{\theta_T}(z)$. Given enough capacity and a suitable learning rate scheme, for a large $T$ the model will perfectly fit the signal, i.e., $f_{\theta_T}(z) = x_0$. For image denoising, $\mathcal{X}$ and $\mathcal{Y}$ are spaces of images and $x_0$ is a noisy image: a clean image $x^*$ with added Gaussian noise. Experiments in Ulyanov et al. (2018) show that early stopping with gradient descent leads to a denoised image. In other words, the trajectory contains a point that is close to the clean image $x^*$.

We use the spectral bias of the network to explain this denoising behavior. It is known that at initialization the output of the DIP is drawn from a distribution that is approximately a stationary Gaussian process with a smooth covariance function (Cheng et al., 2019). The experiments presented here suggest that this trend continues throughout the optimization, i.e., the model learns to construct the image from low to high frequencies. Thus, early stopping prevents fitting the high-frequency components introduced by the additive Gaussian noise. The source code to reproduce the results is available online at https://github.com/PCJohn/dipspectral.
[Figure 1: panels (a)-(h).]
We first demonstrate the spectral bias of the DIP in the exact setting of Ulyanov et al. (2018). Using the same model, i.e., a 5-layered convolutional autoencoder with channels each, the DIP optimization was run on clean images which have a range of frequency components, as shown in Figure 1(a). For each image, the optimization was run twice to generate two trajectories in the output space, and we track how the sum of squared error (SSE) between the two trajectories varies over the course of the optimization.
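The trajectory comparison can be sketched as follows. This is a minimal illustration, assuming each trajectory is stored as an array with one model output per iteration; the function name `sse_between_trajectories` and the toy inputs are hypothetical, not taken from the authors' code.

```python
import numpy as np

def sse_between_trajectories(traj_a, traj_b):
    """Per-iteration sum of squared error (SSE) between two output trajectories.

    Each trajectory is array-like with shape (T, ...): one output per iteration.
    Returns an array of length T.
    """
    a = np.asarray(traj_a, dtype=float)
    b = np.asarray(traj_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("trajectories must have the same shape")
    diff = (a - b).reshape(a.shape[0], -1)  # flatten each iteration's output
    return np.sum(diff ** 2, axis=1)

# Hypothetical example: two runs, 3 iterations, 2x2 outputs.
run_1 = np.zeros((3, 2, 2))
run_2 = np.stack([np.full((2, 2), v) for v in (0.0, 1.0, 2.0)])
sse = sse_between_trajectories(run_1, run_2)
```

Plotting `sse` against the iteration index gives a curve of the kind tracked in Fig. 1; a rise in the curve marks iterations where the two runs diverge.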
Fig. 1 tracks the SSE between trajectories on two images: barbara.png naturally has high-frequency components, while triangle.png is a smooth image with a high-frequency pattern superimposed on it. We observe the following:
• Controlled amounts of the high-frequency pattern were added to triangle.png. As the pattern covers a larger spatial extent, the trajectories diverge more (Fig. 1(f-h)). This suggests that the divergence arises when the model learns the higher frequencies.
These observations show that the model learns the frequency components of the image at different rates, fitting the low frequencies first (see Appendix A). Thus, early stopping is similar to low-pass filtering in the frequency domain.
The denoising experiments in Ulyanov et al. (2018) used additive Gaussian noise, which introduces high-frequency components. Early stopping with DIP prevents fitting these, predicting the clean image before fitting the noise. We can now construct samples where the deep image prior is guaranteed to fail. Low-frequency noise is added to barbara.png and baboon.png, images which naturally have high frequencies, as shown in Fig. 2. After iterations, the model fits the noise but not the high-frequency components, and it goes on to fit the input perfectly after iterations. Thus, there is no point at which the model predicts the clean image. This strongly suggests that the ability of DIP to denoise images is due to the frequency bias.
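One way to build low-frequency noise of this kind is to low-pass filter white Gaussian noise in the Fourier domain. The sketch below is an assumption about how such noise could be generated, not the authors' procedure; the function name and the `cutoff`, `sigma` and `seed` values are hypothetical.

```python
import numpy as np

def low_frequency_noise(shape, cutoff=0.05, sigma=0.1, seed=0):
    """White Gaussian noise, low-pass filtered in the Fourier domain.

    cutoff: radius of the pass band in cycles per pixel (assumed; Nyquist is 0.5).
    sigma:  standard deviation of the returned noise (assumed).
    """
    rng = np.random.default_rng(seed)
    white = rng.standard_normal(shape)
    fy = np.fft.fftfreq(shape[0])[:, None]
    fx = np.fft.fftfreq(shape[1])[None, :]
    mask = np.sqrt(fx ** 2 + fy ** 2) <= cutoff  # keep only low frequencies
    mask[0, 0] = False                           # drop DC so the noise has zero mean
    filtered = np.real(np.fft.ifft2(np.fft.fft2(white) * mask))
    return sigma * filtered / filtered.std()

noise = low_frequency_noise((64, 64))
```

Adding `noise` to an image with strong high-frequency content gives an input whose corruption lies in the band that a low-frequency-first model fits earliest.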
Here we investigate what elements of the DIP lead to the aforementioned frequency bias in learning. In particular we show that both convolutions and upsampling introduce a bias.
[Figure 3: panels (a)-(h).]
The frequency selectiveness of a convolutional network architecture was shown in Saxe et al. (2011). DIP uses a convolutional encoder-decoder. Here, we demonstrate using a simple experiment that the convolution layers of this architecture decouple the frequencies of the signal, fitting each at a different rate. This does not happen if the encoder-decoder has only linear layers. Consider DIP on 1D signals. A signal with two frequency components, one low and one high, is shown in Fig. 3(a). We run the DIP optimization on this signal and track the squared error $(\hat{a}_k - a_k)^2$, where $\hat{a}_k$ is the predicted amplitude and $a_k$ is the true amplitude for frequency $k$. We say that a frequency has converged at a given iteration if its error falls below a threshold after that many iterations ( for these experiments). This is similar to Experiment 1 in Rahaman et al. (2018). We compare the use of convolutions against linear layers in the model:
• DIP-Conv. A 10-layered encoder-decoder with 1D convolution layers, 256 channels. The variation in the error for the two frequencies is shown in Fig. 3(e). The error for the low frequency drops sharply, and that component converges at 45 iterations. The high frequency is learned slowly, converging at 151 iterations. Observing the predictions after each component converges (Fig. 3(b,c)), we see that when the low frequency converges, the model predicts a smooth reconstruction of the signal. Clearly, the frequency components are learned at different rates.
• DIP-Linear. The convolution layers of DIP-Conv were replaced with 256-unit linear (fully-connected) layers and the optimization was run again. Fig. 3(f) shows the results. The errors for both frequencies drop at the same rate. The model never predicts a low-frequency reconstruction of the signal. Further, the lack of decoupling remains if we change the depth or width of the network (Fig. 3(g,h)), which suggests that it is intrinsic to linear layers and not related to the model capacity.
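The per-frequency error tracking and convergence criterion can be sketched as below. To keep the example self-contained, the trajectory comes from a toy linear model (free parameters passed through a fixed box smoothing filter), which is only a stand-in and not the DIP-Conv or DIP-Linear network. The frequencies, threshold `eps`, step size and helper names are all assumptions made for illustration.

```python
import numpy as np

def amplitude(signal, k):
    """Amplitude of the sinusoid with k cycles per signal length (0 < k < n/2)."""
    return 2.0 * np.abs(np.fft.rfft(signal)[k]) / len(signal)

def convergence_iteration(errors, eps):
    """First iteration from which the error stays below eps (None if never)."""
    below = np.asarray(errors) < eps
    if not below[-1]:
        return None
    above = np.flatnonzero(~below)
    return 0 if above.size == 0 else int(above[-1]) + 1

# Target: two unit-amplitude sinusoids (assumed frequencies k_low and k_high).
n, k_low, k_high = 128, 2, 20
t = np.arange(n)
target = np.sin(2 * np.pi * k_low * t / n) + np.sin(2 * np.pi * k_high * t / n)

# Toy stand-in for the network: output = circular convolution of free
# parameters theta with a fixed box smoothing kernel of width 4.
kernel = np.zeros(n)
kernel[:4] = 0.25
k_hat = np.fft.fft(kernel)

def circ_filter(v, response):
    """Apply a circular filter given by its frequency response."""
    return np.real(np.fft.ifft(response * np.fft.fft(v)))

theta = np.zeros(n)
err_low, err_high = [], []
for step in range(300):
    out = circ_filter(theta, k_hat)
    err_low.append((amplitude(out, k_low) - 1.0) ** 2)
    err_high.append((amplitude(out, k_high) - 1.0) ** 2)
    # Gradient of ||out - target||^2 with respect to theta (adjoint filter).
    theta -= 0.4 * 2.0 * circ_filter(out - target, np.conj(k_hat))

it_low = convergence_iteration(err_low, eps=1e-4)
it_high = convergence_iteration(err_high, eps=1e-4)
```

In this toy setting the low frequency converges earlier than the high one because the smoothing filter attenuates high frequencies, which slows their gradient updates. Swapping in the outputs of a real network only requires replacing the loop body that produces `out`.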
Table 1: Mean PSNR (dB) across 9 images.
DIP    DIP Linear-128    DIP Linear-2048    ReLUNet
27.47  20.54             19.17              27.58
Effect on image denoising. To confirm that the above results extend to 2D signals, the same architecture variations were applied to images:
• DIP: Standard DIP for images using the model from Ulyanov et al. (2018) (similar to DIP-Conv above).
• DIP Linear-128: An encoder-decoder with 5 fully-connected layers, units each.
• DIP Linear-2048: DIP Linear, with units each per layer (higher capacity).
• ReLUNet: A 10-layered fully-connected network with nodes per layer and ReLU activations, used to model images as signals. The model is trained to map pixel coordinates to the corresponding intensities (this will exhibit frequency bias, as shown in Rahaman et al. (2018)).
The ReLUNet and DIP-Linear architectures do not explicitly have convolution layers. However, their behavior is closely related to special cases of the DIP model. A ReLUNet predicts individual pixels without using neighborhood information, similar to a convolutional encoder-decoder with a kernel size of 1. DIP-Linear is similar to the other extreme, when the kernel size equals the size of the entire signal.
These models were used to fit images used in Ulyanov et al. (2018) and Cheng et al. (2019) from a standard dataset (http://www.cs.tut.fi/~foi/GCFBM3D/index.html). For efficiency, the images were downsampled by a factor of . The entire trajectories in the output space were saved to see if they contain the denoised image (samples in Appendix D). To eliminate the effect of a badly chosen stopping time, we track the PSNR across the entire trajectory and record the best.
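Recording the best PSNR over a saved trajectory can be sketched as follows. The helper names and the `max_val` convention (intensities in [0, 1]) are assumptions for illustration, not details taken from the authors' code.

```python
import numpy as np

def psnr(pred, clean, max_val=1.0):
    """Peak signal-to-noise ratio in dB, for intensities in [0, max_val]."""
    mse = np.mean((np.asarray(pred, dtype=float) - np.asarray(clean, dtype=float)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

def best_psnr(trajectory, clean, max_val=1.0):
    """Return (iteration index, PSNR) of the output closest to the clean image."""
    scores = [psnr(out, clean, max_val) for out in trajectory]
    best = int(np.argmax(scores))
    return best, scores[best]

# Hypothetical trajectory of three 2x2 outputs against an all-zero clean image.
clean = np.zeros((2, 2))
trajectory = [np.full((2, 2), v) for v in (0.5, 0.1, 0.2)]
best_iter, best_score = best_psnr(trajectory, clean)
```

Taking the maximum over the whole trajectory removes the choice of stopping time from the comparison, at the cost of using the clean image, which is unavailable in a real denoising setting.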
Table 1 shows the mean PSNR across 9 images (see Appendix C for results per image). DIP and ReLUNet perform best, with comparable mean PSNR (27.47 and 27.58 dB). Encoder-decoders with only fully-connected layers perform significantly worse, irrespective of capacity, as they do not decouple frequencies. These results reinforce the idea that the denoising ability is a result of the frequency bias.
[Figure 4: panels (a)-(d). iterations using (a) stride= and (b) stride=. (c-d) Output at iterations with (c) stride= and (d) stride=. A larger stride produces a smoother output at the same iteration.]

Two types of upsampling methods were explored in Ulyanov et al. (2018): nearest neighbor and bilinear. Both methods can be viewed as upsampling with zeros followed by a convolution with a fixed smoothing kernel. These strongly bias the model towards smooth images. For example, when upsampling a 1D signal, the frequency responses of the associated convolution operations for nearest-neighbor and bilinear upsampling (Appendix E) decay with both frequency and stride. This is qualitatively demonstrated in Fig. 4. The low-error region of the power spectrum (dark region at the bottom of Fig. 4(a,b)) grows more slowly with a higher stride.
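The view of upsampling as zero insertion followed by a fixed convolution can be checked numerically in 1D. The sketch below uses a box kernel for nearest neighbor and a triangle kernel for (bi)linear interpolation; the function names and the test signal are hypothetical.

```python
import numpy as np

def zero_stuff(x, stride):
    """Insert stride - 1 zeros after every sample of a 1D signal."""
    up = np.zeros(len(x) * stride)
    up[::stride] = x
    return up

def upsample_nearest(x, stride):
    """Nearest-neighbor upsampling: zero insertion, then a box kernel."""
    box = np.ones(stride)
    return np.convolve(zero_stuff(x, stride), box)[: len(x) * stride]

def upsample_linear(x, stride):
    """Linear upsampling: zero insertion, then a triangle kernel."""
    # Triangle kernel 1 - |j| / stride for |j| < stride.
    tri = 1.0 - np.abs(np.arange(-(stride - 1), stride)) / stride
    full = np.convolve(zero_stuff(x, stride), tri)
    start = stride - 1  # centre the kernel on each original sample
    return full[start : start + (len(x) - 1) * stride + 1]

x = np.array([0.0, 1.0, 3.0, 2.0])
s = 4
nearest = upsample_nearest(x, s)
linear = upsample_linear(x, s)
```

Here `nearest` matches `np.repeat(x, s)` and `linear` matches linear interpolation between the original samples, so both upsamplers are fixed smoothing filters applied to the zero-stuffed signal.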
This paper studies the DIP, proposing an explanation for the denoising experiments presented in Ulyanov et al. (2018). We associate the phenomenon with the notion of frequency bias: convolutional networks fit the low frequencies of a signal first. Experiments on 1D signals and images with additive Gaussian noise strongly suggest that this is the causative factor behind the denoising behavior observed in DIP.
Arpit et al. (2017). A closer look at memorization in deep networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 233–242.
Neyshabur et al. (2017). Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pp. 5947–5956.
Rahaman et al. (2018). On the spectral bias of deep neural networks. arXiv preprint arXiv:1806.08734.
Ulyanov et al. (2018). Deep image prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9446–9454.

Fig. 5 shows the trajectory in Exp. 1 (Fig. 1) and the corresponding power spectrum. This clearly shows the frequency components growing from low to high frequencies.
We study the effect of changing the model capacity in the experiments on 1D signals with 2 frequency components (Section 3.1). The depth and width of the model were varied, keeping all other settings fixed. For each capacity, we record the mean and standard deviation of the time to convergence for each frequency component across runs.

Fig. 6(a,b) show the results when the model uses only convolution layers. We observe the following:
• Increasing the depth makes the model separate frequencies more. In Fig. 6(a), as the depth increases, the mean convergence time for the higher frequency increases much faster than that for the lower frequency, indicating that the model fits the higher frequency more slowly.
• Increasing the width (number of channels per layer) leads to faster convergence for both frequencies (Fig. 6(b)). The convergence time of the higher frequency drops faster, eventually becoming lower than that of the lower frequency ( channels per layer generally fits the higher frequency first).
These results suggest that a deep, narrow model will be more effective at decoupling frequencies than a wide, shallow one. Fig. 6(c,d) show the results when the model uses only fully-connected layers. The overall trend is the same as that of the convolutional model: time to convergence increases with depth and decreases with width. However, both frequency components converge at the same time. This supports the conclusion that fully-connected layers are unable to decouple frequencies.
[Figure 6: panels (a)-(d).]
Denoising results (PSNR in dB) on each image using the architectures in Section 3.1. Table 1 summarizes the table below, showing the means per column.
Image    DIP            DIP Linear-128  DIP Linear-2048  ReLUNet
Baboon   24.73 ± 0.04   20.48 ± 0.06    20.35 ± 0.43     24.56 ± 0.02
F16      27.00 ± 0.15   20.72 ± 0.12    18.60 ± 1.85     26.67 ± 0.14
House    26.89 ± 0.10   20.69 ± 0.04    18.89 ± 1.45     27.21 ± 0.22
kodim01  25.90 ± 0.08   20.35 ± 0.07    20.54 ± 0.07     25.95 ± 0.07
kodim02  29.20 ± 0.13   20.74 ± 0.01    20.59 ± 0.19     29.99 ± 0.09
kodim03  29.26 ± 0.13   20.47 ± 0.04    18.13 ± 0.13     29.62 ± 0.16
kodim12  29.61 ± 0.06   20.36 ± 0.03    17.74 ± 0.04     29.63 ± 0.16
Lena     27.49 ± 0.06   20.48 ± 0.01    18.67 ± 0.72     27.78 ± 0.16
Peppers  27.12 ± 0.11   20.60 ± 0.01    18.98 ± 1.31     26.82 ± 0.16
Fig. 7 shows samples from the trajectories while denoising with DIP (Fig. 7(a)) and with a ReLUNet (Fig. 7(b)). The input image is House.png, downsampled by a factor of (to x) with added Gaussian noise. Both models demonstrate frequency bias and predict the clean image before fitting the noise. The samples also show the difference between the DIP and ReLUNet models. ReLUNets are extremely biased towards smooth images, taking much longer to fit the higher frequencies: the DIP model starts fitting the noise at iterations, while the ReLUNet model does so at iterations (even then, the model does not fit these components perfectly, predicting a blurred version of the noise).
[Figure 7: panels (a)-(b).]
Standard upsampling methods can be viewed as upsampling by padding with zeros followed by a convolution with a fixed smoothing kernel. Let $K(\omega)$ be the frequency response of this kernel (where $\omega$ is the frequency). Upsampling with zeros scales the image by a factor $s$, but adds high-frequency components. These components are removed after convolution with the kernel if $K(\omega)$ decays with frequency. The spatial extent of the kernel is the upsampling stride $s$. For example, in 1D, these kernels are as follows:
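• For nearest neighbor, the kernel is the box function:
$$k_{\mathrm{nn}}(x) = \begin{cases} 1, & |x| \le s/2 \\ 0, & \text{otherwise} \end{cases}$$
The frequency response of this kernel is:
$$K_{\mathrm{nn}}(\omega) = \frac{2 \sin(\omega s / 2)}{\omega}$$
• For bilinear upsampling, the kernel is the triangle function:
$$k_{\mathrm{bl}}(x) = \max\left(0,\; 1 - \frac{|x|}{s}\right)$$
The frequency response of this kernel is:
$$K_{\mathrm{bl}}(\omega) = \frac{4 \sin^2(\omega s / 2)}{s\,\omega^2}$$
Here $s$ is the upsampling stride and $\omega$ is the frequency. Clearly, both favor low-frequency components, with the bilinear having a stronger bias ($\propto 1/\omega^2$) than nearest neighbor ($\propto 1/\omega$).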
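The relative strength of the two biases can be checked numerically with discrete kernels. This sketch assumes a box kernel of width `s` for nearest neighbor and the triangle obtained by convolving two such boxes for bilinear, both normalized to unit gain at zero frequency; the stride value and helper name are hypothetical.

```python
import numpy as np

def magnitude_response(kernel, n_freq=512):
    """Magnitude of the DFT of a kernel, zero-padded to n_freq samples."""
    return np.abs(np.fft.rfft(kernel, n=n_freq))

s = 4  # upsampling stride (assumed value)
box = np.ones(s) / s                                 # nearest neighbor, unit DC gain
tri = np.convolve(np.ones(s), np.ones(s)) / s ** 2   # bilinear, unit DC gain

box_resp = magnitude_response(box)
tri_resp = magnitude_response(tri)
```

Because the triangle is the box convolved with itself, `tri_resp` equals `box_resp ** 2`, so it is never larger than `box_resp` and attenuates high frequencies more strongly. Repeating the computation with a larger `s` narrows the pass band of both kernels.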