Invertible generative models for inverse problems: mitigating representation error and dataset bias

05/28/2019 ∙ by Muhammad Asim, et al. ∙ Information Technology University ∙ Northeastern University

Trained generative models have shown remarkable performance as priors for inverse problems in imaging. For example, Generative Adversarial Network priors permit recovery of test images from 5-10x fewer measurements than sparsity priors. Unfortunately, these models may be unable to represent any particular image because of architectural choices, mode collapse, and bias in the training dataset. In this paper, we demonstrate that invertible neural networks, which have zero representation error by design, can be effective natural signal priors at inverse problems such as denoising, compressive sensing, and inpainting. Given a trained generative model, we study the empirical risk formulation of the desired inverse problem under a regularization that promotes high likelihood images, either directly by penalization or algorithmically by initialization. For compressive sensing, invertible priors can yield higher accuracy than sparsity priors across almost all undersampling ratios. For the same accuracy on test images, they can use 10-20x fewer measurements. We demonstrate that invertible priors can yield better reconstructions than sparsity priors for images that have rare features of variation within the biased training set, including out-of-distribution natural images.


1 Introduction

[Figure 1 panels: Training | Truth | Lasso | DCGAN | Ours]

Figure 1: We train an invertible generative model with CelebA images (including those at left). When used as a prior for compressed sensing, it can yield higher quality image reconstructions than Lasso and a trained DCGAN, even on out-of-distribution images. Note that the DCGAN reflects biases of the training set by removing the man’s glasses and beard, whereas our invertible prior does not.

Generative deep neural networks have shown remarkable performance as natural signal priors in imaging inverse problems, such as denoising, inpainting, compressed sensing, blind deconvolution, and phase retrieval. These generative models can be trained from datasets consisting of images of particular natural signal classes, such as faces, fingerprints, MRIs, and more [Karras et al., 2017, Minaee and Abdolrashidi, 2018, Shin et al., 2018, Chen et al., 2018]. Some such models, including variational autoencoders (VAEs) and generative adversarial networks (GANs) [Goodfellow et al., 2014, Kingma and Welling, 2013, Rezende et al., 2014], learn an explicit low-dimensional manifold that approximates a natural signal class. We will refer to such models as GAN priors. With an explicit parameterization of the natural signal manifold by a low-dimensional latent representation, these generative models allow for direct optimization over a natural signal class. Consequently, they can obtain significant performance improvements over non-learning based methods. For example, GAN priors have been shown to outperform sparsity priors at compressed sensing with 5-10x fewer measurements. Additionally, GAN priors have led to theory for signal recovery in the linear compressive sensing and nonlinear phase retrieval problems [Bora et al., 2017, Hand and Voroninski, 2017, Hand et al., 2018], and they have also shown promising results for the nonlinear blind image deblurring problem [Asim et al., 2018].

A significant drawback of GAN priors for solving inverse problems is that they can have representation error or bias due to architecture and training. This can happen for many reasons: because the generator only approximates the natural signal manifold, because the natural signal manifold is of higher dimensionality than modeled, because of mode collapse, or because of bias in the training dataset itself. As many aspects of generator architecture and training lack clear principles, representation error of GANs may continue to be a challenge even after substantial hand-crafting and engineering. Additionally, learning-based methods are particularly vulnerable to the biases of their training data, and training data, no matter how carefully collected, will always contain degrees of bias. As an example, the CelebA dataset [Liu et al., 2015] is biased toward people who are young, who do not have facial hair or glasses, and who have a light skin tone. As we will see, a GAN prior trained on this dataset learns these biases and exhibits image recovery failures because of them.

In contrast, invertible neural networks can be trained as generators with zero representation error. These networks are invertible (one-to-one and onto) by architectural design [Dinh et al., 2016, Gomez et al., 2017, Jacobsen et al., 2018, Kingma and Dhariwal, 2018]. Consequently, they are capable of recovering any image, including those significantly out-of-distribution relative to a biased training set. We call the domain of an invertible generator the latent space, and we call the range of the generator the signal space. These must have equal dimensionality. Flow-based invertible generative models are composed of a sequence of learned invertible transformations. Their strengths include: their architecture allows exact and efficient latent-variable inference, direct log-likelihood evaluation, and efficient image synthesis; they have the potential for significant memory savings in gradient computations; and they can be trained by directly optimizing the likelihood of training images. This paper emphasizes an additional strength: because they lack representation error, invertible models can mitigate dataset bias and improve performance on out-of-distribution data.

In this paper, we study generative invertible neural network priors for imaging inverse problems. We will specifically use the Glow architecture, though our framework could be used with other architectures. A Glow-based model is composed of a sequence of invertible affine coupling layers, 1x1 convolutional layers, and normalization layers. Glow models have been successfully trained to generate high resolution photorealistic images of human faces [Kingma and Dhariwal, 2018].

We present a method for using pretrained generative invertible neural networks as priors for imaging inverse problems. The invertible generator, once trained, can be used for a wide variety of inverse problems, with no specific knowledge of those problems used during the training process. Our method is a standard empirical risk formulation, which we supplement with regularization either by a penalty on the norm of an image’s latent representation or by an initialization at a latent representation of zero or small norm. This regularization promotes images with high likelihood under the invertible model.

We train a generative invertible model using the CelebA dataset. With this fixed model as a signal prior, we study its performance at denoising, compressive sensing, and inpainting. For denoising, it can outperform BM3D [Dabov et al., 2007]. For compressive sensing on test images, it can obtain higher quality reconstructions than Lasso across almost all subsampling ratios, and at similar reconstruction errors it can succeed with 10-20x fewer measurements than Lasso. It needs about 2-3x fewer linear measurements than the method of [Bora et al., 2017] for comparable reconstruction errors. Despite being trained on the CelebA dataset, our generative invertible prior can give higher quality reconstructions than Lasso on out-of-distribution images of faces, and, to a lesser extent, unrelated natural images. Our invertible prior outperforms a pretrained DCGAN [Radford et al., 2015] at face inpainting and exhibits qualitatively reasonable results on out-of-distribution human faces. We provide additional experiments in the supplemental materials, including for training on other datasets.

2 Method

We assume that we have access to a pretrained generative invertible neural network G. We write x = G(z) and z = G^{-1}(x), where x is an image that corresponds to the latent representation z. We will consider a G that has the Glow architecture introduced in [Kingma and Dhariwal, 2018]. It can be trained by direct optimization of the likelihood of a collection of training images of a natural signal class, under a standard Gaussian distribution over the latent space. We consider recovering an image x from possibly-noisy linear measurements given by y = Ax + η, where η models noise. Given a pretrained generator G, we propose the following penalized empirical risk formulation for recovering the image x. One can solve

min_z ||A G(z) - y||^2 + γ ||z||        (1)

beginning from an initialization z_0. The estimate for the image is then given by x̂ = G(ẑ), where ẑ minimizes (1). The penalty term on the norm of z is meant to enforce 'naturalness' of the resulting image: ||z|| is, up to constants, the root of the negative log-likelihood of z under a Gaussian prior. Similar performance is observed if the ||z||^2 penalization is used instead, as demonstrated in the supplement. In the case of a GAN prior and γ = 0, this formulation reduces to that of [Bora et al., 2017].
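As a concrete illustration of this formulation, the following minimal numpy sketch solves the penalized empirical risk problem by gradient descent. The elementwise map G(z) = tanh(z) is a hypothetical stand-in for an invertible generator (a real Glow model is far richer); the function names and hyperparameters here are our own illustrative choices, not the paper's implementation.

```python
import numpy as np

def G(z):
    """Toy elementwise invertible 'generator' standing in for a flow model."""
    return np.tanh(z)

def G_grad(z):
    """Diagonal Jacobian of the elementwise tanh."""
    return 1.0 - np.tanh(z) ** 2

def recover(A, y, gamma=0.1, steps=2000, lr=0.05, z0=None):
    """Minimize ||A G(z) - y||^2 + gamma * ||z|| by gradient descent."""
    n = A.shape[1]
    z = np.zeros(n) if z0 is None else z0.copy()
    for _ in range(steps):
        r = A @ G(z) - y                     # residual in measurement space
        grad = 2.0 * (A.T @ r) * G_grad(z)   # chain rule through elementwise G
        nz = np.linalg.norm(z)
        if nz > 1e-12:
            grad += gamma * z / nz           # subgradient of gamma * ||z||
        z -= lr * grad
    return G(z), z

rng = np.random.default_rng(0)
n = 20
x_true = np.tanh(rng.normal(size=n))         # a signal in the range of G
A = np.eye(n)                                # denoising corresponds to A = I
y = x_true + 0.05 * rng.normal(size=n)       # noisy measurements
x_hat, z_hat = recover(A, y, gamma=0.1)
print("data misfit:", np.linalg.norm(x_hat - y),
      "latent norm:", np.linalg.norm(z_hat))
```

Larger values of `gamma` pull the latent estimate toward zero, trading data fit for 'naturalness', mirroring the behavior studied in the denoising experiments below.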

All the experiments that follow will be for an invertible model we trained on the CelebA dataset of celebrity faces, as in [Kingma and Dhariwal, 2018]. Similar results for models trained on birds and flowers [Wah et al., 2011, Nilsback and Zisserman, 2008] can be found in the supplemental materials. Due to computational considerations, we run experiments on downsampled color images with scaled pixel values. The train and test sets contain a total of 27,000 and 3,000 images, respectively. We trained a Glow architecture [Kingma and Dhariwal, 2018]; see the supplementary material for details. Once trained, the Glow prior is fixed for use in each of the inverse problems below. We also trained a DCGAN on the same dataset. We solve (1) using LBFGS, which was found to outperform Adam [Kingma and Ba, 2014]. Unless otherwise stated, all Glow experiments were initialized at z_0 = 0, and thus there is no randomness in solving (1). DCGAN results are reported for an average of 3 runs because we observed some variance due to random initialization.

3 Applications

3.1 Denoising

We consider the denoising problem with A = I and Gaussian noise η, for images in the CelebA test dataset. We evaluate the performance of a Glow prior, a DCGAN prior, and BM3D at two different noise levels. Figure 2 shows the recovered PSNR values as a function of the penalization parameter γ for denoising by the Glow and DCGAN priors, along with the PSNR attained by BM3D. The figure shows that the performance of the regularized Glow prior first increases with γ and then decreases. If γ is too low, then the network fits to the noise in the image. If γ is too high, then data fit is not enforced strongly enough. The left panel reveals that an appropriately regularized Glow prior can outperform BM3D by almost 2 dB. The experiments also reveal that appropriately regularized Glow priors outperform the DCGAN prior, which suffers from representation error and is not aided by the regularization. The right panel confirms that with smaller noise levels, less regularization is needed for optimal performance.

Figure 2: Recovered PSNR values as a function of γ for denoising by the Glow and DCGAN priors. All the results are averaged over 12 test set images. For reference, we show the average PSNRs of the original noisy images, after applying BM3D, and under the Glow prior in the noiseless case.

A visual comparison of the recoveries at the higher noise level using the Glow prior, the DCGAN prior, and BM3D can be seen in Figure 3. Note that the recoveries with Glow are sharper than those of BM3D. See the supplementary material for more quantitative and qualitative results.

[Figure 3 rows: Truth, Noisy, DCGAN, BM3D, Glow]

Figure 3: Denoising results using the Glow prior, the DCGAN prior, and BM3D at the higher noise level. Note that the Glow prior gives a sharper image than BM3D.

3.2 Compressed Sensing

In compressed sensing, one is given undersampled linear measurements of an image, and the goal is to recover the image from those measurements. In our notation, y = Ax + η with A ∈ R^{m×n} and m < n. As the image is undersampled, there is an affine space of images consistent with the measurements, and an algorithm must select which is most 'natural.' A common proxy for naturalness in the literature has been sparsity with respect to the DCT or wavelet bases. With a GAN prior, an image is considered natural if it lies in or near the range of the GAN. For an invertible prior, we consider an image to be natural if it has a latent representation of small norm.

We study compressed sensing in the case that A is an m × n matrix of i.i.d. Gaussian entries, and x is an image from the CelebA test set, so m < n. We consider the case where η is iid Gaussian random noise normalized to a fixed expected energy. We compare Glow, DCGAN, and Lasso (the inverse problems with Lasso were solved using coordinate descent) with respect to the DCT and wavelet bases.
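The measurement model just described can be sketched as follows. The 1/m variance for the entries of A is a common convention (so that ||Ax|| roughly preserves ||x|| in expectation); the image size and the exact noise energy here are illustrative stand-ins, since the experimental values are stated in the supplement.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 32 * 32          # pixels in a small stand-in image (experiments use larger ones)
m = n // 8           # an illustrative undersampling ratio of 1/8

# iid N(0, 1/m) entries: E||A x||^2 is approximately ||x||^2
A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
x = rng.uniform(size=n)              # stand-in for a vectorized test image
eta = rng.normal(size=m)
eta *= 0.1 / np.linalg.norm(eta)     # scale noise to a fixed (assumed) energy
y = A @ x + eta                      # undersampled noisy measurements
print(A.shape, y.shape)
```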

The left panel of Figure 4 shows that the Glow prior outperforms both DCGAN and Lasso in reconstruction quality over all undersampling ratios. In the case of no undersampling, the Glow and Lasso priors are comparable. Replicating the results of [Bora et al., 2017], our experiments demonstrate that the DCGAN prior can achieve comparable reconstruction error as sparsity priors with 5-10x fewer measurements in some contexts. As also demonstrated in that paper, when there are a sufficient number of measurements, the sparsity priors outperform the GAN prior because the DCT and wavelet bases have zero representation error. The Glow prior (1) can result in 15 dB higher PSNRs than DCGAN, and (2) can give comparable recovery errors with 2-3x fewer measurements at high undersampling ratios. This difference is explained by the representation error of DCGAN. Additional plots and visual comparisons, available in the supplemental material, show notable improvements in quality of in- and out-of-distribution images using an invertible prior relative to DCGAN and Lasso.

Figure 4: The left panel shows recovered PSNRs averaged over 12 test set images under the Glow and DCGAN priors, and under Lasso with respect to the DCT and a wavelet transform. We initialize with z_0 = 0. See the supplement for a zoom-in of the case of small m. The right panel shows the resulting PSNR with a Glow prior after different initialization strategies, as described in the text. The highest PSNR was recovered with the zero initialization and γ = 0.

We conducted several additional experiments to understand the regularizing effects of γ and the initialization z_0. The right panel of Figure 4 shows the PSNRs under multiple initialization strategies, including the zero initialization, an initialization at the latent representation of the solution to Lasso with respect to the wavelet basis, and that initialization perturbed by a random point in the null space of A. We observe that setting γ > 0 can noticeably improve recovery error for certain initializations. This behavior is expected: if γ = 0 and the initialization is consistent with the measurements, the optimization algorithm will stay stuck at the initialization. With positive γ and such initializations, the formulation will return an image with some data-fit error, but higher PSNR. (Unnatural initializations consistent with the measurements could have arbitrarily low PSNRs. We do not include such experiments because the Glow prior has unstable numerical behavior for highly unnatural images, which are well outside the bounds where the network was trained.)

Surprisingly, we observe that if the optimization of (1) for compressed sensing is initialized with a small latent variable, then optimal recovery quality occurs when γ = 0. As there is then no explicit regularization in the objective, this indicates that algorithmic regularization is occurring and that initialization plays a role. (These experiments are presented for an LBFGS solver; see the supplemental material for experiments revealing the same effect for the Adam solver.) The left panel in Figure 5 shows that as the norm of the latent initialization increases, the norm of the recovered latent representation increases and the PSNR of the recovered image decreases. Moreover, the right panel in Figure 5 shows the norm of the estimated latent representation at each iteration of the optimization. In all our experiments, it grows monotonically with iteration number. These experiments provide further evidence that smaller latent initializations lead to outputs that are more natural and have smaller latent representations.

Figure 5: The left panel shows the average PSNR over 12 test set images, and the norm of the optimizer, as a function of the norm of the initialization, for the LBFGS solver applied to (1) for compressed sensing under the Glow prior with γ = 0. The right panel shows the norm of the estimated latent representation as a function of iteration number for multiple initializations. The Adam solver behaves similarly.
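This kind of algorithmic regularization has a clean linear analogue: gradient descent on an underdetermined least-squares problem converges to the solution closest to its initialization, so a zero initialization returns the minimum-norm solution. The sketch below demonstrates this; the dimensions and step-size rule are illustrative choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 10, 50                        # underdetermined: many z satisfy A z = y
A = rng.normal(size=(m, n)) / np.sqrt(m)
y = rng.normal(size=m)

def gd(z0, steps=5000):
    """Plain gradient descent on ||A z - y||^2 from initialization z0."""
    lr = 0.5 / np.linalg.norm(A, 2) ** 2   # conservative, convergent step size
    z = z0.copy()
    for _ in range(steps):
        z -= lr * 2.0 * A.T @ (A @ z - y)
    return z

z_small = gd(np.zeros(n))
z_large = gd(3.0 * rng.normal(size=n))
# Both iterates fit the measurements exactly, but the zero initialization
# yields the minimum-norm solution; the large initialization retains its
# null-space component, analogous to the larger latent estimates in Figure 5.
print(np.linalg.norm(z_small), np.linalg.norm(z_large))
```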

Additionally, we remark that the optimization landscape of the Glow prior appears to be smoother than that of the DCGAN prior. In Figure 6, we plot the loss landscape around the latent representation of a fixed test image, along pairs of directions scaled to have the same norm as that representation. For DCGAN, we plot the loss landscape versus two pairs of random directions. For Glow, we plot the loss landscape versus a pair of random directions and a pair of directions that linearly interpolate in latent space between the fixed image and another test image. With the GAN prior, some random latent directions lead to highly irregular landscapes. In contrast, for both random and interpolating directions in latent space, the Glow prior exhibits a smoother objective. These landscapes help explain the observation that DCGAN is sensitive to its random initialization, whereas the Glow prior is not.

(a) DCGAN random dir.
(b) DCGAN random dir.
(c) Glow random dir.
(d) Glow interpolating dir.
Figure 6: Loss landscapes for (1) around the latent representation of a fixed image, with respect to either random latent directions or latent directions that interpolate between images.

Finally, we observe that the Glow prior is much more robust to out-of-distribution examples than the GAN prior. Figure 7 shows recovered images using (1) for compressive sensing on images not belonging to the CelebA dataset. DCGAN's performance reveals biases of the underlying dataset and limitations of low-dimensional modeling. For example, projecting onto the CelebA-trained DCGAN causes incorrect skin tones, genders, and ages. Its performance on out-of-distribution images is poor. In contrast, the Glow prior mitigates this bias, even demonstrating image recovery for natural images that are not representative of the CelebA training set, including people who are older, have darker skin tones, wear glasses, have a beard, or have unusual makeup. The Glow prior's performance also extends to significantly out-of-distribution images, such as animated characters and natural images unrelated to faces. See the supplemental material for additional experiments.


[Figure 7 rows: Truth, DCT, WVT, DCGAN, Glow]

Figure 7: Compressed sensing (CS) of out-of-distribution images from an undersampled number of measurements. Visual comparisons: CS under the Glow prior, the DCGAN prior, Lasso-WVT, and Lasso-DCT. In each case, we choose the value of the penalization parameter that yields the best performance for that method.

3.3 Inpainting

In inpainting, one is given a masked image of the form y = Mx, where M is a masking matrix with binary entries and x is an n-pixel image. The goal is to find x. We can rewrite (1) with M in place of A: min_z ||M G(z) - y||^2 + γ ||z||.

[Figure 8 rows: Truth, Masked, DCGAN, Glow; left columns within distribution, right columns out of distribution]

Figure 8: Inpainting: Recoveries under DCGAN and Glow, both with γ = 0.

There is an affine space of images consistent with the measurements, and an algorithm must select which is most natural. As before, using the minimizer ẑ, the estimated image is given by G(ẑ). Our experiments reveal the same story as for compressed sensing. If the initialization for an invertible prior has a small or zero latent representation, then the empirical risk formulation with γ = 0 exhibits high PSNRs on test images. Algorithmic regularization is again occurring due to initialization. In contrast, DCGAN is limited by its representation error. See Figure 8, and the supplemental materials for more results, including visually reasonable face inpainting, even for out-of-distribution human faces.
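The masking operator is easy to make concrete. The sketch below builds a binary diagonal mask on a tiny flattened "image"; the size and observation fraction are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64                                # tiny flattened "image" for illustration
x = rng.uniform(size=n)

keep = rng.uniform(size=n) > 0.3      # observe roughly 70% of the pixels
M = np.diag(keep.astype(float))       # binary masking matrix
y = M @ x                             # masked image: zeros where pixels are missing

# Inpainting then reuses formulation (1) with A replaced by M:
#     min_z  || M G(z) - y ||^2  +  gamma * || z ||
print(int(keep.sum()), "of", n, "pixels observed")
```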

4 Related Work

The idea of analyzing inverse problems with invertible neural networks has appeared in [Ardizzone et al., 2018]. The authors study estimation of the complete posterior parameter distribution under a forward process, conditioned on observed measurements. Specifically, the authors approximate a particular forward process by training an invertible neural network. The inverse map is then directly available. In order to cope with information loss, the authors augment the measurements with additional variables. This work differs from ours because it involves training a separate net for every particular inverse problem. For example, inpainting problems with different masks would each require a separately trained network, and a net trained for inpainting could not be used for denoising. In contrast, our work studies how to use a pretrained invertible generator of a particular signal class for a variety of inverse problems not known at training time. Training invertible networks is challenging and computationally expensive; hence, it is desirable to separate the training of an off-the-shelf invertible model from potential applications in a broad variety of scientific domains.

5 Discussion

We have demonstrated that pretrained generative invertible models can be used as natural signal priors in imaging inverse problems. Their strength is that every desired image is in the range of an invertible model; the challenge that they overcome is that every undesired image is also in the range of the model. We study a regularization for empirical loss minimization that promotes recovery of images that have high likelihood under the generative model. We demonstrate that the invertible prior can quantitatively and qualitatively outperform BM3D at denoising. Additionally, it has lower recovery errors than Lasso across all levels of undersampling, and it can attain comparable errors from 10-20x fewer measurements, a 2x reduction relative to [Bora et al., 2017]. Our invertible model, though trained on CelebA, yields significantly better reconstructions than Lasso even on out-of-distribution images, including images with rare features of variation within the training set, and on unrelated natural images.

The lack of representation error of invertible nets presents a significant opportunity for imaging with a learned prior. Any image is potentially recoverable, even if the image is significantly outside of the training distribution. In contrast, methods based on projecting onto an explicit low-dimensional representation of a natural signal manifold will have representation error, perhaps due to modeling assumptions, mode collapse, or bias in a training set. Such methods will see performance prematurely saturate as the number of measurements increases, whereas an invertible prior would not. In the extreme case of having a full set of exact measurements, an invertible prior could recover any image exactly.

The lack of representation error of invertible models also presents a challenge for inverse problems. An image to be denoised is already in the range of the model. Similarly, in compressive sensing, there is a space of images that are consistent with undersampled linear measurements and are in the range of the generator. Fortunately, invertible nets come with direct estimates of the likelihood of images, which can be used for regularization. Because of their training, invertible nets attempt to assign high likelihoods to natural images under a Gaussian prior on latent space. Hence, images that are more natural are expected to have smaller norm in latent space.
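The link between latent norm and modeled likelihood can be made explicit. For a flow model with z = G^{-1}(x) and a standard Gaussian latent prior, the change-of-variables formula gives

```latex
\log p_X(x) \;=\; \log p_Z\!\left(G^{-1}(x)\right) \;+\; \log\left|\det J_{G^{-1}}(x)\right|,
\qquad
p_Z(z) \;=\; (2\pi)^{-n/2} \exp\!\left(-\tfrac{1}{2}\|z\|^2\right),
```

so that, up to the Jacobian term, the negative log-likelihood of x grows like ||G^{-1}(x)||^2 / 2: images with smaller latent norm are modeled as more likely, which motivates penalizing (or implicitly shrinking) ||z||.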

Experiments verify that natural images have smaller latent representations than unnatural images. In the supplemental materials, we show that adding noise to natural images increases the norm of their latent representations, and that higher noise levels result in larger increases. Additional evidence is that random perturbations in image space induce larger changes in the latent representation than comparable natural perturbations in image space. Figure 9 shows a plot of the norm of the change in image space, averaged over 1000 test images, as a function of the size of a perturbation in latent space. Natural directions are given by the interpolation between the latent representations of two images. This difference in sensitivity indicates that the optimization algorithm might obtain a larger decrease in the latent norm by an image modification that reduces unnatural image components than by a correspondingly large modification in a natural direction.

Figure 9: The magnitude of the change in image space as a function of the size of a perturbation in latent space. Solid lines show the mean behavior, and the shaded region depicts one standard deviation.


We have demonstrated that invertible generators can be regularized by direct penalization of the norm of latent image representations. This observation is perhaps surprising because when applied to GANs it exacerbates the representation error, as visible in Figure 2 and as reported in [Athar et al., 2018]. We suspect the regularization is effective for invertible priors because of the high dimensionality of latent space and the small fraction of perturbations that correspond to natural image alterations.

Surprisingly, it is possible to regularize an invertible model merely by initializing the optimization procedure at a latent representation z_0 of small norm. Without direct penalization of ||z||, the optimization algorithm could in principle find any image consistent with the measurements. Our experiments show that this failure mode does not happen when the algorithm is initialized at small z_0. When solving compressive sensing via (1) with γ = 0, we observe that larger latent initializations result in larger latent estimates and lower PSNRs, as shown in Figure 5. We note that this effect persists for both the LBFGS and Adam solvers. Over the course of solving (1), the algorithm finds latent representations whose norms increase monotonically with iteration number. A reasonable way to incentivize small ||z|| is thus to initialize with a small z_0, and in particular z_0 = 0.

It is natural to wonder which images can be effectively recovered using an invertible prior trained on a particular signal class. As expected, we see the best reconstruction errors on in-distribution images, and performance degrades as images get further out-of-distribution. Nonetheless, we observe that reconstructions of unrelated natural images are still of higher quality than those given by Lasso. It appears that the invertible generator learns some general attributes of the class of natural images. We suspect that an invertible prior will be effective at recovering any natural image to which the model assigns a higher likelihood than other feasible images. This leads to several questions: when a generative invertible net is trained, how far out-of-distribution can an image be while maintaining a high likelihood? How do invertible nets learn useful statistics of natural images? Is that due primarily to training? Or is there architectural bias toward natural images, as suggested by the Deep Image Prior and Deep Decoder [Ulyanov et al., 2018, Heckel and Hand, 2018]?

The results of this paper provide further evidence that reducing the representational error of generators can significantly enhance the performance of generative models for inverse problems in imaging. This idea was also recently explored in [Athar et al., 2018], where the authors trained a GAN-like prior with a high-dimensional latent space. The high dimensionality of this space lowers representational error, though it is not zero. In their work, the high-dimensional latent space had a structure that was difficult to optimize over directly, so the authors successfully modeled latent representations as the output of an untrained convolutional neural network whose parameters are estimated at test time. Their paper and ours raise several questions: Which generator architectures provide a good balance between low representation error, ease of training, and ease of inversion? Should a generative model be capable of producing all images in order to perform well on out-of-distribution images of interest? Are there cheaper architectures that perform comparably? These questions are quite important, as solving (1) in our 64×64 experiments took 15 GPU-minutes. New developments are needed on architectures and frameworks in between low-dimensional generative priors and fully invertible generative priors. Such methods could leverage the strengths of invertible models while being much cheaper to train and use.


6 Appendix

6.1 Experimental Setup

Simulations were completed mainly on the CelebA-HQ dataset, used in [Kingma and Dhariwal, 2018]; it has 30,000 color images that were resized to 64 × 64 for computational reasons, and were split into 27,000 training and 3,000 test images. We also provide some additional experiments on the Flowers [Nilsback and Zisserman, 2008] and Birds [Wah et al., 2011] datasets. The Flowers dataset contains 8,189 color images, and the Birds dataset contains a total of 11,788 center-aligned images; both were resized similarly, with a held-out subset of each set aside for testing.

We specifically model our invertible networks after the recently proposed Glow [Kingma and Dhariwal, 2018] architecture, which consists of multiple flow steps. Each flow step comprises an activation normalization layer, an invertible 1×1 convolutional layer, and an affine coupling layer, each of which is invertible. Let K be the number of steps of flow before a splitting layer, and L be the number of times the splitting is performed. To train over CelebA, we use affine coupling with fixed choices of K, L, learning rate, and batch size at the training resolution. The model was trained on reduced-bit-depth images with 10,000 warmup iterations as in [Kingma and Dhariwal, 2018], but when solving inverse problems using Glow, the original full-bit-depth images were used. We refer the reader to [Kingma and Dhariwal, 2018] for specific details of the operations performed in each network layer.
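The affine coupling layer at the heart of each flow step can be sketched in a few lines. Here `s_t` is a hypothetical stand-in for the small convolutional network that Glow uses to produce the scale and shift; any function of the untouched half works, since invertibility comes from the coupling structure, not from `s_t`.

```python
import numpy as np

def coupling_forward(x, s_t):
    """One affine coupling step: transform half of x conditioned on the other half."""
    x1, x2 = np.split(x, 2)
    s, t = s_t(x1)                    # scale and shift computed from x1 only
    y2 = x2 * np.exp(s) + t           # invertible affine map applied to x2
    return np.concatenate([x1, y2])

def coupling_inverse(y, s_t):
    y1, y2 = np.split(y, 2)
    s, t = s_t(y1)                    # y1 == x1, so s and t are reproducible
    x2 = (y2 - t) * np.exp(-s)
    return np.concatenate([y1, x2])

def s_t(x1):
    # Illustrative stand-in; a real Glow step uses a small conv net here.
    return np.tanh(x1), x1 ** 2

x = np.random.default_rng(0).normal(size=8)
y = coupling_forward(x, s_t)
x_back = coupling_inverse(y, s_t)
print(np.allclose(x, x_back))         # prints True: exact inversion by construction
```

The log-determinant of this step is simply the sum of the scale outputs s, which is what makes exact likelihood evaluation tractable.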

We use LBFGS to solve the inverse problems. For best performance, we set the number of iterations and the learning rate separately for denoising, compressed sensing, and inpainting. We use PyTorch to implement Glow network training and to solve the inverse problems. Glow training was conducted on a single Titan Xp GPU using the maximum allowable (under given computational constraints) batch size of 6. In the case of compressed sensing, recovering a single image on the Titan Xp using the LBFGS solver takes about 15 minutes. However, we can solve inverse problems in parallel on the given hardware platform.

Unless specified otherwise, inverse problems under the Glow prior are always initialized with z_0 = 0, whereas under the DCGAN prior we initialize randomly and report the average over three random restarts. In all the quantitative experiments above, the reported quality metrics, such as PSNR and reconstruction error, are averaged over 12 randomly drawn test set images.

Figure 10: Samples from the training set of CelebA, downsampled to 64 × 64.

6.2 Denoising: Additional Experiments

We present additional quantitative experiments on image denoising here. The complete set of experiments on average PSNR over 12 CelebA test set images (we add the redundant phrase 'within distribution' to emphasize that these test images are drawn from the same distribution as the training set, to avoid confusion with the out-of-distribution recoveries also presented in this paper) versus the penalization parameter γ at two noise levels is presented in Figure 11 below. The central message is that the Glow prior outperforms the DCGAN prior uniformly across all γ, due to the representation limit of DCGAN. In addition, striking the right balance between the misfit term and the penalization term by appropriately choosing γ improves the performance of Glow: it approaches the state-of-the-art BM3D algorithm at low noise levels and clearly surpasses it at higher noise, improving upon BM3D by 2 dB at the higher noise level. Visually, the recoveries under the Glow prior are superior even to the BM3D recoveries, which are generally blurry and oversmoothed, as can be seen in the qualitative results below. To avoid fitting the noisy image with the Glow model, we force the recoveries to be natural by choosing γ large enough.

Figure 11: Image Denoising — Recovered PSNR values as a function of under the Glow prior and the DCGAN prior on (within-distribution) test set CelebA images. For reference, we show the average PSNR of the original noisy images and of the Glow prior in the noiseless case () in both panels. The average PSNR after applying BM3D and the average PSNR under the Glow prior at noise levels are also reported.

Recall that we are solving a regularized empirical risk minimization program

In general, one can instead solve , where is a monotonically increasing function. Figure 12 compares the most common choices, linear (already used in the rest of the paper) and quadratic, in the context of denoising. We find that the highest achievable PSNR remains the same in both cases; however, the penalization parameter has to be adjusted accordingly.
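A one-dimensional toy calculation illustrates why the penalization parameter must be rescaled when switching from a linear to a quadratic penalty; all numbers below are purely illustrative.

```python
import numpy as np

# Minimize (z - y)^2 + gamma * f(|z|) on a fine grid, with f linear or
# quadratic. Closed forms: the linear penalty gives z = y - gamma/2 (for
# y > gamma/2), the quadratic penalty gives z = y / (1 + gamma). The same
# optimum is reachable with either penalty once gamma is adjusted.
y = 4.0
zs = np.linspace(-10.0, 10.0, 200001)

def argmin_z(gamma, penalty):
    if penalty == "linear":
        loss = (zs - y) ** 2 + gamma * np.abs(zs)
    else:
        loss = (zs - y) ** 2 + gamma * zs ** 2
    return zs[np.argmin(loss)]

z_lin = argmin_z(2.0, "linear")            # closed form: 4 - 2/2 = 3.0
z_quad = argmin_z(1.0 / 3.0, "quadratic")  # closed form: 4 / (1 + 1/3) = 3.0
```

Both penalties reach the same shrunken optimum, but only after gamma is rescaled, matching the observation about Figure 12.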

Figure 12: Image Denoising — Recovered PSNR values as a function of under the Glow prior with and penalization on (within-distribution) test set CelebA images. A comparison with BM3D denoising at noise level is provided.

We train Glow and DCGAN on CelebA. Additional qualitative image denoising results at a higher noise level, comparing the Glow prior against the DCGAN prior and BM3D, are presented below in Figures 13 and 14.

Figure 13: Image Denoising — Visual comparisons under the Glow prior, the DCGAN prior, and BM3D at a noise level on CelebA (within-distribution) test set images. Under the DCGAN prior, we only show the case of , as this consistently gave the best performance for DCGAN. Under the Glow prior, the best performance over is achieved with ; overfitting of the image occurs with , and underfitting occurs at . Note that the Glow prior with also gives a sharper image than BM3D.

Figure 14: Image Denoising — Visual comparisons under the Glow prior, the DCGAN prior, and BM3D at noise level on CelebA (within-distribution) test set images. Under the DCGAN prior, we only show the case of , as this consistently gives the best performance. Under the Glow prior, the best performance is achieved with ; overfitting of the image occurs with , and underfitting occurs with . Note that the Glow prior with also gives a sharper image than BM3D.

We also trained a Glow model on the Flowers dataset. Below we present its qualitative denoising performance against BM3D on the test set Flowers images. We also show the effect of varying : a smaller leads to overfitting and a larger one to underfitting.

Figure 15: Image Denoising — Visual comparisons under the Glow prior and BM3D at noise level on (within-distribution) test set Flowers images. Under the Glow prior, the best performance is obtained with . Note that the Glow prior with also gives a sharper image than BM3D.

6.3 Compressed Sensing: Additional Experiments

Some additional quantitative image recovery results on the test set of the CelebA dataset are presented in Figure 16, which compares the Glow prior, the DCGAN prior, LASSO-DCT, and LASSO-WVT at compressed sensing. We plot the reconstruction error := , where is the recovered image and is the number of pixels in the CelebA images. Glow uniformly outperforms DCGAN and LASSO across the entire range of the number of measurements. LASSO-DCT and LASSO-WVT eventually catch up to Glow, but only when the observed measurements are a significant fraction of the total number of pixels. On the other hand, DCGAN is initially better than LASSO but saturates prematurely due to its limited representation capacity.
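For concreteness, the per-pixel error metric as we read it off the text can be computed as follows; the exact normalization used in the figure may differ, so treat this as our assumed definition.

```python
import numpy as np

def reconstruction_error(x_hat, x):
    # Mean squared error per pixel: ||x_hat - x||^2 / n, where n is the
    # number of pixels (our reading of the metric plotted in Figure 16).
    x_hat = np.asarray(x_hat, dtype=float).ravel()
    x = np.asarray(x, dtype=float).ravel()
    return float(np.sum((x_hat - x) ** 2) / x.size)

def psnr(x_hat, x, peak=1.0):
    # PSNR in dB for images with pixel values in [0, peak], for reference
    # against the PSNR curves reported elsewhere in the paper.
    return float(10 * np.log10(peak ** 2 / reconstruction_error(x_hat, x)))
```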


Figure 16: Compressed sensing — Reconstruction error vs. number of measurements under the Glow prior, DCGAN prior, LASSO-DCT, and LASSO-WVT on CelebA (within-distribution) test set images. Noise is scaled such that and the penalization parameter for Glow and DCGAN, and for LASSO-DCT and LASSO-WVT.

Figure 17: Compressed sensing — Zoomed-in version of the left panel of Figure 4 in the main paper in the low-measurement regime for CelebA. PSNR vs. number of measurements under the Glow prior, DCGAN prior, LASSO-DCT, and LASSO-WVT on the CelebA (within-distribution) test set images. Noise is scaled such that and the penalization parameter for Glow and DCGAN, and for LASSO-DCT and LASSO-WVT.
Figure 18: Compressed sensing under the Glow prior — Performance comparison between the LBFGS and Adam solvers for the inverse problem. For the Adam solver, gradient steps were taken with the learning rate chosen to be . The rest of the parameters were fixed to be the same as with LBFGS.
Figure 19: Residual error vs. number of iterations. The left panel compares the DCGAN and Glow priors; both converge at roughly the same rate to their respective saturation levels. The right panel compares the LBFGS and Adam solvers for compressed sensing under the Glow prior; LBFGS tends to converge far more quickly than Adam. We choose in both experiments.

Recall that natural face images correspond to smaller . In Figure 20, we plot the norm of the latent codes of the iterates of each algorithm vs. the number of iterations. The central message is that initializing with a smaller norm tends to yield natural recoveries (those with smaller-norm latent representations). This is one explanation of why, in compressed sensing, one is able to obtain the true solution out of the affine space of solutions without penalizing the unnaturalness of the recoveries.
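The effect of the initialization norm is easy to reproduce in a linear toy model. In the sketch below (our construction, with G(z) = Qz standing in for Glow and a zero penalty), the gradient of the misfit always lies in the row space of the effective measurement operator, so gradient descent never alters the null-space component of the latent; a smaller-norm initialization therefore converges to a smaller-norm latent.

```python
import torch

# Toy version of the Figure 20 experiment: G(z) = Q z (orthogonal Q as a
# stand-in for Glow), no penalization, and gradient descent on the misfit
# ||A G(z) - y||^2 from initializations of different norms.
torch.manual_seed(0)
n, m = 64, 32
Q, _ = torch.linalg.qr(torch.randn(n, n))
A = torch.randn(m, n) / m ** 0.5
y = A @ (Q @ torch.randn(n))

def final_latent_norm(init_scale, steps=3000, lr=0.02):
    z = (init_scale * torch.randn(n)).requires_grad_(True)
    opt = torch.optim.SGD([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        ((A @ (Q @ z) - y) ** 2).sum().backward()
        opt.step()
    return z.detach().norm().item()

small = final_latent_norm(0.1)   # small-norm initialization
large = final_latent_norm(5.0)   # large-norm initialization
# The small initialization ends at a smaller-norm (more natural) latent.
```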

Figure 20: Compressed sensing — Norm of the latent codes over iterations. The left panel shows how the norm of the latent codes evolves over iterations of the LBFGS solver under different-sized initializations. The right panel shows the same experiment for the Adam solver (over a much larger number of iterations, as Adam requires comparatively more iterations to converge). Each point is averaged over 12 test set images under random rescaled initializations . We set the penalization parameter in both experiments.

We now present visual recovery results on test images from the CelebA dataset under a varying number of measurements in compressed sensing. We compare recoveries under the Glow prior, the DCGAN prior, LASSO-DCT, and LASSO-WVT.

Figure 21: Compressed sensing visual comparisons — Recoveries on (within-distribution) test set images with a number () of measurements under the Glow prior, the DCGAN prior, LASSO-WVT, and LASSO-DCT at a noise level . In each case, we choose the value of the penalization parameter that yields the best performance among the tested values. We use for both the DCGAN and Glow priors, and for LASSO-WVT and LASSO-DCT, respectively.

Figure 22: Compressed sensing visual comparisons — Recoveries on (within-distribution) test set images with a number () of measurements under the Glow prior, the DCGAN prior, LASSO-WVT, and LASSO-DCT at a noise level . In each case, we choose the value of the penalization parameter that yields the best performance among the tested values. We use for both the DCGAN and Glow priors, and for LASSO-WVT and LASSO-DCT, respectively.

Figure 23: Compressed sensing visual comparisons — Recoveries on (within-distribution) test set images with a number () of measurements under the Glow prior, the DCGAN prior, LASSO-WVT, and LASSO-DCT at a noise level . In each case, we choose the value of the penalization parameter that yields the best performance among the tested values. We use for both the DCGAN and Glow priors, and for LASSO-WVT and LASSO-DCT, respectively.

Figure 24: Compressed sensing visual comparisons — Recoveries on (within-distribution) test set images with a number () of measurements under the Glow prior, the DCGAN prior, LASSO-WVT, and LASSO-DCT at a noise level . In each case, we choose the value of the penalization parameter that yields the best performance among the tested values. We use for both the DCGAN and Glow priors, and for LASSO-WVT and LASSO-DCT, respectively.

Figure 25: Compressed sensing visual comparisons — Recoveries on (within-distribution) test set images with a number () of measurements under the Glow prior, the DCGAN prior, LASSO-WVT, and LASSO-DCT at a noise level . In each case, we choose the value of the penalization parameter that yields the best performance among the tested values. We use for both the DCGAN and Glow priors, and for LASSO-WVT and LASSO-DCT, respectively.

Figure 26: Compressed sensing visual comparisons — Recoveries on (within-distribution) test set images with a number () of measurements under the Glow prior, the DCGAN prior, LASSO-WVT, and LASSO-DCT at a noise level . In each case, we choose the value of the penalization parameter that yields the best performance among the tested values. We use for both the DCGAN and Glow priors, and for LASSO-WVT and LASSO-DCT, respectively.

Figure 27: Compressed sensing visual comparisons — Recoveries on (within-distribution) test set images with a number () of measurements under the Glow prior, the DCGAN prior, LASSO-WVT, and LASSO-DCT at a noise level . In each case, we choose the value of the penalization parameter that yields the best performance among the tested values. We use for both the DCGAN and Glow priors, and for LASSO-WVT and LASSO-DCT, respectively.

Figure 28: Compressed sensing visual comparisons — Recoveries on (within-distribution) test set images with a number () of measurements under the Glow prior, the DCGAN prior, LASSO-WVT, and LASSO-DCT at a noise level . In each case, we choose the value of the penalization parameter that yields the best performance among the tested values. We use for both the DCGAN and Glow priors, and for LASSO-WVT and LASSO-DCT, respectively.

Figure 29: Compressed sensing visual comparisons — Recoveries on (within-distribution) test set images with a number () of measurements under the Glow prior, the DCGAN prior, LASSO-WVT, and LASSO-DCT at a noise level . In each case, we choose the value of the penalization parameter that yields the best performance among the tested values. We use for both the DCGAN and Glow priors, and for LASSO-WVT and LASSO-DCT, respectively.

Figure 30: Compressed sensing visual comparisons — Recoveries on (within-distribution) test set images with a number () of measurements under the Glow prior, the DCGAN prior, LASSO-WVT, and LASSO-DCT at a noise level . In each case, we choose the value of the penalization parameter that yields the best performance among the tested values. We use for both the DCGAN and Glow priors, and for LASSO-WVT and LASSO-DCT, respectively.

6.3.1 Compressed Sensing on the Flowers and Birds Datasets

We also performed compressed sensing experiments, similar to those on the CelebA dataset above, on the Birds and Flowers datasets. We trained a Glow invertible network for each dataset, and present below the quantitative and qualitative recoveries for each.

Figure 31: PSNR vs. number of measurements in compressed sensing under the Glow prior, LASSO-DCT, and LASSO-WVT on the Birds dataset (left panel) and the Flowers dataset (right panel). Noise is scaled such that and the penalization parameter for Glow, and for LASSO-DCT and LASSO-WVT.

Figure 32: Compressed sensing — Visual comparisons on (within-distribution) test set images from the Birds and Flowers datasets with a number () of measurements under the Glow prior, LASSO-WVT, and LASSO-DCT at a noise level . In each case, we choose the value of the penalization parameter that yields the best performance among the tested values. We use for the Glow prior and for LASSO-WVT and LASSO-DCT, respectively.

Figure 33: Compressed sensing — Visual comparisons on (within-distribution) test set images from the Birds and Flowers datasets with a number () of measurements under the Glow prior, LASSO-WVT, and LASSO-DCT at a noise level . In each case, we choose the value of the penalization parameter that yields the best performance among the tested values. We use for the Glow prior and for LASSO-WVT and LASSO-DCT, respectively.

Figure 34: Compressed sensing — Visual comparisons on (within-distribution) test set images from the Birds and Flowers datasets with a number () of measurements under the Glow prior, LASSO-WVT, and LASSO-DCT at a noise level . In each case, we choose the value of the penalization parameter that yields the best performance among the tested values. We use for the Glow prior and for LASSO-WVT and LASSO-DCT, respectively.

Figure 35: Compressed sensing — Visual comparisons on (within-distribution) test set images from the Birds and Flowers datasets with a number () of measurements under the Glow prior, LASSO-WVT, and LASSO-DCT at a noise level . In each case, we choose the value of the penalization parameter that yields the best performance among the tested values. We use for the Glow prior and for LASSO-WVT and LASSO-DCT, respectively.

Figure 36: Compressed sensing — Visual comparisons on test set images from the Birds and Flowers datasets with a number () of measurements under the Glow prior, LASSO-WVT, and LASSO-DCT at a noise level . In each case, we choose the value of the penalization parameter that yields the best performance among the tested values. We use for the Glow prior and for LASSO-WVT and LASSO-DCT, respectively.

Figure 37: Compressed sensing — Visual comparisons on test set images from the Birds and Flowers datasets with a number () of measurements under the Glow prior, LASSO-WVT, and LASSO-DCT at a noise level . In each case, we choose the value of the penalization parameter that yields the best performance among the tested values. We use for the Glow prior and for LASSO-WVT and LASSO-DCT, respectively.

Figure 38: Compressed sensing — Visual comparisons on test set images from the Birds and Flowers datasets with a number () of measurements under the Glow prior, LASSO-WVT, and LASSO-DCT at a noise level . In each case, we choose the value of the penalization parameter that yields the best performance among the tested values. We use for the Glow prior and for LASSO-WVT and LASSO-DCT, respectively.

Figure 39: Compressed sensing — Visual comparisons on test set images from the Birds and Flowers datasets with a number () of measurements under the Glow prior, LASSO-WVT, and LASSO-DCT at a noise level . In each case, we choose the value of the penalization parameter that yields the best performance among the tested values. We use for the Glow prior and for LASSO-WVT and LASSO-DCT, respectively.

Figure 40: Compressed sensing — Visual comparisons on test set images from the Birds and Flowers datasets with a number () of measurements under the Glow prior, LASSO-WVT, and LASSO-DCT at a noise level . In each case, we choose the value of the penalization parameter that yields the best performance among the tested values. We use for the Glow prior and for LASSO-WVT and LASSO-DCT, respectively.

Figure 41: Compressed sensing — Visual comparisons on test set images from the Birds and Flowers datasets with a number () of measurements under the Glow prior, LASSO-WVT, and LASSO-DCT at a noise level . In each case, we choose the value of the penalization parameter that yields the best performance among the tested values. We use for the Glow prior and for LASSO-WVT and LASSO-DCT, respectively.

6.3.2 Compressed Sensing on Out of Distribution Images

The lack of representation error in invertible nets leads us to an important and interesting question: does the trained network fit related natural images that are underrepresented, or even unrepresented, in the training dataset? Specifically, can a Glow network trained on CelebA faces be a good prior for other faces, for example, those with darker skin tones, glasses, or facial hair, or even animated faces? In general, our experiments show that the Glow prior performs very well on such out-of-distribution images that are semantically similar to celebrity faces but not representative of the CelebA dataset. In particular, we have been able to recover faces with darker skin tones, older people with beards, Eastern women, men with hats, and animated characters such as Shrek from compressed measurements under the Glow prior. Recoveries under the Glow prior convincingly beat those under the DCGAN prior, which shows a definite bias due to its training. Moreover, the Glow prior also outperforms unbiased methods such as LASSO-DCT and LASSO-WVT.

Can we expect the Glow prior to continue to be effective for arbitrarily out-of-distribution images? To answer this question, we tested arbitrary natural images, such as a car, a house door, and butterfly wings, that are semantically unrelated to CelebA images. In general, we found that Glow is an effective prior for compressed sensing of out-of-distribution natural images that are assigned a high likelihood score (small-norm latent representations). On these images, Glow also outperforms LASSO.

Recoveries of natural images that are assigned very low likelihood scores by the Glow model generally run into instability issues. During training, invertible nets learn to assign high likelihood scores to the training images, and all the network parameters, such as the scaling in the coupling layers of the Glow network, are learned to behave stably on such high-likelihood representations. On very low-likelihood representations, unseen during the training process, the network becomes unstable and its outputs begin to diverge to very large values. This may happen for several reasons, such as the normalization (scaling) layers not being tuned to the unseen representations. An LBFGS search for the solution of an inverse problem to recover a low-likelihood image leads the iterates into neighborhoods of low-likelihood representations, which may drive the network into this instability.

We find that the Glow network has a tendency to assign high likelihood scores even to arbitrary out-of-distribution natural images. This suggests that invertible networks have at least partially learned something more general about natural images from the CelebA dataset, perhaps some high-level features that face images share with other natural images, such as smooth regions separated by discontinuities. This allows the Glow prior to extend its effectiveness beyond the training set to other natural images.

Figures 42, 43, 44, 45, and 46 compare the performance of LASSO-DCT, LASSO-WVT, the DCGAN prior, and the Glow prior for compressed sensing of out-of-distribution images under a varying number of measurements.


Figure 42: Compressed sensing ( of ) visual comparisons on out-of-distribution images. We compare the recoveries under the Glow prior (trained on CelebA), the DCGAN prior (trained on CelebA), LASSO-WVT, and LASSO-DCT at a noise level . In each case, we choose the value of the penalization parameter that yields the best performance. We use for both the DCGAN and Glow priors, and optimize for each recovery using LASSO-WVT and LASSO-DCT.


Figure 43: Compressed sensing ( of ) visual comparisons on out-of-distribution images. We compare the recoveries under the Glow prior (trained on CelebA), the DCGAN prior (trained on CelebA), LASSO-WVT, and LASSO-DCT at a noise level . In each case, we choose the value of the penalization parameter that yields the best performance. We use for both the DCGAN and Glow priors, and optimize for each recovery using LASSO-WVT and LASSO-DCT.


Figure 44: Compressed sensing ( of ) visual comparisons on out-of-distribution images. We compare the recoveries under the Glow prior (trained on CelebA), the DCGAN prior (trained on CelebA), LASSO-WVT, and LASSO-DCT at a noise level . In each case, we choose the value of the penalization parameter that yields the best performance. We use for both the DCGAN and Glow priors, and optimize for each recovery using LASSO-WVT and LASSO-DCT.


Figure 45: Compressed sensing ( of ) visual comparisons on out-of-distribution images. We compare the recoveries under the Glow prior (trained on CelebA), the DCGAN prior (trained on CelebA), LASSO-WVT, and LASSO-DCT at a noise level . In each case, we choose the value of the penalization parameter that yields the best performance. We use for both the DCGAN and Glow priors, and optimize for each recovery using LASSO-WVT and LASSO-DCT.

