Training Image Estimators without Image Ground-Truth

by   Zhihao Xia, et al.
Washington University in St Louis

Deep neural networks have been very successful in image estimation applications such as compressive sensing and image restoration, as a means to estimate images from partial, blurry, or otherwise degraded measurements. These networks are trained on a large number of corresponding pairs of measurements and ground-truth images, and thus implicitly learn to exploit domain-specific image statistics. But unlike measurement data, it is often expensive or impractical to collect a large training set of ground-truth images in many application settings. In this paper, we introduce an unsupervised framework for training image estimation networks from a training set that contains only measurements (two varied measurements per image) but no ground truth for the full images desired as output. We demonstrate that our framework can be applied to both regular and blind image estimation tasks, where in the latter case the parameters of the measurement model (e.g., the blur kernel) are unknown during inference and, potentially, also during training. We evaluate our method for training networks for compressive sensing and blind deconvolution, considering both non-blind and blind training for the latter. Our unsupervised framework yields models that are nearly as accurate as those from fully supervised training, despite not having access to any ground-truth images.




1 Introduction

Reconstructing images from imperfect observations is a classic inference task in many imaging applications. In compressive sensing donoho2006compressed , a sensor makes partial measurements for efficient acquisition. These measurements correspond to a low-dimensional projection of the higher-dimensional image signal, and the system relies on computational inference for recovering the full-dimensional image. In other cases, cameras capture degraded images that are low-resolution, blurry, etc., and require a restoration algorithm freeman2002example ; yuan2007image ; zoran2011learning to recover a corresponding uncorrupted image. Deep convolutional neural networks (CNNs) have recently emerged as an effective tool for such image estimation tasks chen2017trainable ; ircnn ; chakrabarti2016neural ; dong2015image ; kulkarni2016reconnet ; facedeblur ; istanet . Specifically, a CNN for a given application is trained on a large dataset that consists of pairs of ground-truth images and observed measurements (in many cases where the measurement or degradation process is well characterized, having a set of ground-truth images is sufficient to generate corresponding measurements). This training set allows the CNN to learn to exploit the expected statistical properties of images in that application domain, to solve what is essentially an ill-posed inverse problem.

Figure 1: Unsupervised Training from Measurements. Our method allows training image estimation networks from sets of pairs of varied measurements, but without the underlying ground-truth images. (Top Right) We supervise training by requiring that network predictions from one measurement be consistent with the other, when measured with the corresponding parameter. (Bottom) In the blind training setting, when both the image and measurement parameters are unavailable, we also train a parameter estimator. Here, we generate a proxy training set from the predictions of the model (as it is training), and use synthetic measurements from these proxies to supervise training of the parameter estimator and to augment training of the image estimator.

But for many domains, it is impractical or prohibitively expensive to capture full-dimensional or uncorrupted images and construct such a large representative training set. Unfortunately, it is often in such domains that a computational imaging solution is most useful. Recently, Lehtinen et al. noise2noise proposed a solution to this issue for denoising, with a method that trains with only pairs of noisy observations. While their method yields remarkably accurate network models without needing any ground-truth images for training, it is applicable only to the specific case of estimation from noisy measurements, when each image intensity is observed as a sample from a (potentially unknown) distribution with mean or mode equal to its corresponding true value.

In this work, we introduce an unsupervised method for training image estimation networks that can be applied to a general class of observation models, in which measurements are a linear function of the true image, potentially with additive noise. As training data, it requires only two observations of the same image, but not the underlying image itself. (At test time, the trained network requires only one observation as input, as usual.) The two measurements in each pair are made with different parameters (such as different compressive measurement matrices or different blur kernels), and these parameters vary across different pairs. Collecting such a training set provides a practical alternative to the more laborious one of collecting full image ground-truth. Given these measurements, our method trains an image estimation network by requiring that its prediction from one measurement of a pair be consistent with the other measurement, when observed with the corresponding parameter. With sufficient diversity in measurement parameters for different training pairs, we show this is sufficient to train an accurate network model despite lacking direct ground-truth supervision.

While our method requires knowledge of the measurement model (e.g., blur by convolution), it also incorporates a novel mechanism to handle the blind setting during training—when the measurement parameters (e.g., the blur kernels) for training observations are unknown. To be able to enforce consistency as above, we use an estimator for measurement parameters that is trained simultaneously using a “proxy” training set. This set is created on-the-fly by taking predictions from the image network even as it trains, and pairing them with observations synthetically created using randomly sampled, and thus known, parameters. The proxy set provides supervision for training the parameter estimator, and to augment training of the image estimator as well. This mechanism allows our method to nearly match the accuracy of fully supervised training on image and parameter ground-truth.

We validate our method with experiments on image reconstruction from compressive measurements and on blind deblurring of face images, with blind and non-blind training for the latter, and compare to fully-supervised baselines with state-of-the-art performance. The supervised baselines use a training set of ground-truth images and generate observations with random parameters on the fly in each epoch, to create a much larger number of effective image-measurement pairs. In contrast, our method is trained with only two measurements per image from the same training set (but not the image itself), with the pairs kept fixed through all epochs of training. Despite this, our unsupervised training method yields models with test accuracy close to that of the supervised baselines, and thus presents a practical way to train CNNs for image estimation when lacking access to image ground truth.

2 Related Work

CNN-based Image Estimation. Many imaging tasks require inverting the measurement process to obtain a clean image from partial or degraded observations: denoising buades2005non , deblurring yuan2007image , super-resolution freeman2002example , compressive sensing donoho2006compressed , etc. While traditionally solved using statistical image priors foe ; zoran2011learning ; figueiredo2007gradient , CNN-based estimators have been successfully employed for many of these tasks. Most methods Nah2017DeepMC ; chen2017trainable ; ircnn ; chakrabarti2016neural ; dong2015image ; kulkarni2016reconnet ; facedeblur ; istanet learn a network to map measurements to corresponding images from a large training set of pairs of measurements and ideal ground-truth images. Some learn CNN-based image priors, as denoisers ircnn ; onenet ; romano2017little or GANs anirudh2018unsupervised , that are agnostic to the inference task (denoising, deblurring, etc.), but still tailored to a chosen class of images. All these methods require access to a large domain-specific dataset of ground-truth images for training. However, capturing image ground-truth is burdensome or simply infeasible in many settings (e.g., for MRI scans lustig2008compressed and other biomedical imaging applications). In such settings, our method provides a practical alternative by allowing estimation networks to be trained from measurement data alone.

Unsupervised Learning. Unsupervised learning for CNNs is broadly useful in many applications where large-scale training data is hard to collect. Accordingly, researchers have proposed unsupervised and weakly-supervised methods for such applications, such as depth estimation zhou2017unsupervised ; godard2017unsupervised , intrinsic image decomposition ma2018single ; li2018learning , etc. However, these methods are closely tied to their specific applications. In this work, we seek to enable unsupervised learning for image estimation networks. In the context of image modeling, Bora et al. bora2018ambientgan propose a method to learn a GAN model from only degraded observations. Their method, like ours, includes a measurement model with its discriminator for training (but requires knowledge of measurement parameters, while we are able to handle the blind setting). Their method proves successful in training a generator for ideal images. We seek a similar unsupervised means for training image reconstruction and restoration networks.

The closest work to ours is the recent Noise2Noise method of Lehtinen et al. noise2noise , who propose an unsupervised framework for training denoising networks by training on pairs of noisy observations of the same image. In their case, supervision comes from requiring the denoised output from one observation to be close to the other. This works surprisingly well, but is based on the assumption that the expected or median value of the noisy observations is the image itself. We focus on a more general class of observation models, which requires injecting the measurement process into the loss computation. We also introduce a proxy training approach to handle blind image estimation applications.

3 Proposed Approach

Given a measurement $y \in \mathbb{R}^M$ of an ideal image $x \in \mathbb{R}^N$, related as

\[ y = \theta x + \epsilon, \tag{1} \]

our goal is to train a CNN $f(\cdot)$ to produce an estimate $\hat{x} = f(y)$ of the image from $y$. Here, $\epsilon \sim p_\epsilon$ is random noise with a distribution $p_\epsilon(\cdot)$ that is assumed to be zero-mean and independent of the image $x$, and the parameter $\theta$ is an $M \times N$ matrix that models the linear measurement operation. Often, the measurement matrix $\theta$ is structured with fewer than $MN$ degrees of freedom based on the measurement model; e.g., it is block-Toeplitz for deblurring, with entries defined by the blur kernel. We consider both non-blind estimation, when the measurement parameter $\theta$ is known for a given measurement during inference, and the blind setting, where $\theta$ is unavailable but we know its distribution $p_\theta(\cdot)$. For blind estimators, we address both non-blind and blind training: when $\theta$ is known for each measurement in the training set but not at test time, and when it is unknown during training as well.
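As a rough illustration of this measurement model, the sketch below writes an observation as y = θx + ε for a toy 1-D "image", with θ built as the Toeplitz matrix of a blur kernel. The helper names (`blur_matrix`, `measure`), the 1-D setting, and the 2-tap kernel are assumptions made for illustration and are not taken from the paper.

```python
import random

def blur_matrix(kernel, n):
    """Build the (n - k + 1) x n Toeplitz matrix of a 'valid' 1-D convolution."""
    k = len(kernel)
    rows = []
    for r in range(n - k + 1):
        row = [0.0] * n
        for j, w in enumerate(reversed(kernel)):  # convolution flips the kernel
            row[r + j] = w
        rows.append(row)
    return rows

def measure(theta, x, noise_std=0.0, rng=random):
    """Simulate y = theta @ x + eps with zero-mean Gaussian noise."""
    return [sum(a * b for a, b in zip(row, x)) + rng.gauss(0.0, noise_std)
            for row in theta]

x = [1.0, 2.0, 4.0, 8.0, 16.0]       # toy "image" with N = 5 pixels
kernel = [0.5, 0.5]                  # hypothetical 2-tap box blur
theta = blur_matrix(kernel, len(x))  # 4 x 5 matrix with only 2 free parameters
y = measure(theta, x)                # noise-free observation (noise_std = 0)
```

Here θ has 20 entries but is fully determined by the two kernel taps, which is the sense in which a structured measurement matrix has far fewer degrees of freedom than its size suggests.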

Since (1) is typically non-invertible, image estimation requires reasoning with the statistical distribution of images for the application domain, and conventionally, this is provided by a large training set of typical ground-truth images $\{x_t\}$. In particular, CNN-based image estimation methods train a network $f(\cdot)$ on a large training set of pairs of corresponding images and measurements, based on a loss that measures error between predicted and true images across the training set. In the non-blind setting, the measurement parameter $\theta$ is known and provided as input to the network (we omit this in the notation for convenience), while in the blind setting, the network must also reason about the unknown measurement parameter $\theta$.

To avoid the need for a large number of ground-truth training images, we propose an unsupervised learning method that is able to train an image estimation network using measurements alone. Specifically, we assume we are given a training set of two measurements $(y_{t:1}, y_{t:2})$ for each image $x_t$:

\[ y_{t:1} = \theta_{t:1} x_t + \epsilon_{t:1}, \qquad y_{t:2} = \theta_{t:2} x_t + \epsilon_{t:2}, \tag{2} \]

but not the images $\{x_t\}$ themselves. We require the corresponding measurement parameters $\theta_{t:1}$ and $\theta_{t:2}$ to be different for each pair, and further, to also vary across different training pairs. These parameters are assumed to be known for the non-blind training setting, but not for blind training.

3.1 Unsupervised Training for Non-Blind Image Estimation

We begin with the simpler case of non-blind estimation, when the parameter $\theta$ for a given measurement $y$ is known, both during inference and training. Given pairs of measurements with known parameters, our method trains the network $f(\cdot)$ using a “swap-measurement” loss on each pair, defined as:

\[ L_{\text{swap}} = \frac{1}{T} \sum_{t=1}^{T} \Big[ \rho\big(\theta_{t:2} f(y_{t:1}) - y_{t:2}\big) + \rho\big(\theta_{t:1} f(y_{t:2}) - y_{t:1}\big) \Big]. \tag{3} \]

This loss evaluates the accuracy of the full images predicted by the network from each measurement in a pair, by comparing each prediction to the other measurement, using an error function $\rho(\cdot)$, after simulating observation with the corresponding measurement parameter. Note that Noise2Noise noise2noise can be seen as a special case of (3) in which measurements are degraded only by noise, with $\theta_{t:1} = \theta_{t:2} = I$.
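The swap-measurement idea can be sketched in a few lines: predict an image from one measurement, re-measure it with the other measurement's parameter, and compare against the other measurement. The code below is a minimal illustration that assumes a squared-error function, plain Python lists in place of tensors, and a callable `f(y, theta)` standing in for the network; none of these names come from the paper.

```python
def matvec(theta, x):
    """Apply a measurement matrix (list of rows) to an image vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in theta]

def sq_err(u, v):
    """Squared L2 error between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def swap_loss(pairs, f):
    """pairs: list of ((y1, theta1), (y2, theta2)); f(y, theta) -> image estimate."""
    total = 0.0
    for (y1, th1), (y2, th2) in pairs:
        x_from_1 = f(y1, th1)  # prediction from the first measurement
        x_from_2 = f(y2, th2)  # prediction from the second measurement
        total += sq_err(matvec(th2, x_from_1), y2)  # re-measure with theta2
        total += sq_err(matvec(th1, x_from_2), y1)  # re-measure with theta1
    return total / len(pairs)

# Toy pair: two different 2 x 4 measurement matrices of the same hidden image.
x_true = [1.0, 2.0, 3.0, 4.0]
theta1 = [[1, 1, 0, 0], [0, 0, 1, 1]]
theta2 = [[1, 0, 1, 0], [0, 1, 0, 1]]
pairs = [((matvec(theta1, x_true), theta1), (matvec(theta2, x_true), theta2))]

loss_oracle = swap_loss(pairs, lambda y, th: x_true)     # exact predictor
loss_zero = swap_loss(pairs, lambda y, th: [0.0] * 4)    # all-zero predictor
```

In this noise-free toy case, a predictor that returns the hidden image attains zero loss even though the loss never sees that image directly, while a poor predictor is penalized through both swapped measurements.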

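When the parameters $\theta_{t:1}, \theta_{t:2}$ used to acquire the training set are sufficiently diverse and statistically independent for each underlying $x_t$, this loss provides sufficient supervision to train the network $f(\cdot)$. To see this, we consider using the $L_2$ distance for the error function $\rho(\cdot)$, i.e., $\rho(z) = \|z\|^2$, and note that (3) represents an empirical approximation of the expected loss over image, parameter, and noise distributions. Assuming the training measurement pairs are obtained using (2) with $x_t \sim p_x$, $\theta_{t:1}, \theta_{t:2} \sim p_\theta$, and $\epsilon_{t:1}, \epsilon_{t:2} \sim p_\epsilon$ drawn i.i.d. from their respective distributions, we have

\[ L_{\text{swap}} \approx 2\, \mathbb{E}_{x}\, \mathbb{E}_{\theta}\, \mathbb{E}_{\epsilon}\, \mathbb{E}_{\theta'}\, \mathbb{E}_{\epsilon'} \big\| \theta' f(\theta x + \epsilon) - (\theta' x + \epsilon') \big\|^2 = 2\, \mathbb{E}_{\epsilon'} \|\epsilon'\|^2 + 2\, \mathbb{E}_{x}\, \mathbb{E}_{\theta}\, \mathbb{E}_{\epsilon} \big( f(\theta x + \epsilon) - x \big)^{T} Q \big( f(\theta x + \epsilon) - x \big), \qquad Q = \mathbb{E}_{\theta'} \big[ \theta'^{T} \theta' \big]. \tag{4} \]

The cross term vanishes because $\epsilon'$ is zero-mean and independent of the other variables. Therefore, because the measurement matrices are independent, we find that in expectation the swap-measurement loss is equivalent to supervised training against the true image $x$, with an $L_2$ loss that is weighted by the $N \times N$ matrix $Q$ (up to an additive constant given by the noise variance). With a sufficiently diverse distribution $p_\theta$ of measurement parameters, $Q$ will be full-rank (even though the individual $\theta$ are not). Then, the swap-measurement loss will provide supervision along all image dimensions, and will reach its theoretical minimum iff the network makes exact predictions.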
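The rank argument can be checked numerically on a small example. The sketch below assumes, purely for illustration, that measurement matrices are 2 x 4 with i.i.d. Gaussian entries (this distribution is not from the paper), and compares the rank of a single Gram matrix θᵀθ with that of the average over many samples, which approximates the weighting matrix E[θᵀθ].

```python
import random

def gram(theta):
    """Return theta^T theta for an M x N matrix given as a list of rows."""
    n = len(theta[0])
    return [[sum(row[i] * row[j] for row in theta) for j in range(n)]
            for i in range(n)]

def rank(a, tol=1e-9):
    """Numerical rank by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in a]
    rows, cols = len(a), len(a[0])
    r = 0
    for c in range(cols):
        if r == rows:
            break
        pivot = max(range(r, rows), key=lambda i: abs(a[i][c]))
        if abs(a[pivot][c]) < tol:
            continue  # no usable pivot in this column
        a[r], a[pivot] = a[pivot], a[r]
        for i in range(r + 1, rows):
            factor = a[i][c] / a[r][c]
            for j in range(c, cols):
                a[i][j] -= factor * a[r][j]
        r += 1
    return r

random.seed(0)
M, N = 2, 4  # each theta observes only a 2-D projection of a 4-D image

def sample_theta():
    """Hypothetical parameter distribution: i.i.d. standard Gaussian entries."""
    return [[random.gauss(0.0, 1.0) for _ in range(N)] for _ in range(M)]

single = gram(sample_theta())
num = 200
Q = [[0.0] * N for _ in range(N)]
for _ in range(num):
    g = gram(sample_theta())
    for i in range(N):
        for j in range(N):
            Q[i][j] += g[i][j] / num  # running average approximates E[theta^T theta]

rank_single = rank(single)
rank_Q = rank(Q)
```

Under these assumptions, any single θᵀθ is rank-deficient (rank at most M), whereas the averaged matrix is expected to have full rank N, which is what allows the swap loss to constrain every image dimension.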

In addition to the swap loss, we also find it useful to train with an additional “self-measurement” loss that measures consistency between an image prediction and its own corresponding input measurement:

\[ L_{\text{self}} = \frac{1}{T} \sum_{t=1}^{T} \Big[ \rho\big(\theta_{t:1} f(y_{t:1}) - y_{t:1}\big) + \rho\big(\theta_{t:2} f(y_{t:2}) - y_{t:2}\big) \Big]. \tag{5} \]

While not sufficient by itself, we find the additional supervision it provides to be practically useful in yielding more accurate network models. Therefore, our overall unsupervised training objective is a weighted combination of the two losses, $L_{\text{swap}} + \gamma L_{\text{self}}$, with the weight $\gamma$ chosen on a validation set.
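A small example may help show why self-measurement consistency alone is not enough. The sketch below uses a deliberately degenerate "estimator" (back-projection, θᵀy) with hypothetical pixel-selection matrices: it reproduces its own measurement exactly, so its self-measurement error is zero, yet it fails the swapped comparison. The matrices and the estimator are illustrative assumptions; the weight 0.05 is the value reported later for the compressive sensing experiments.

```python
def matvec(theta, x):
    """Apply a measurement matrix (list of rows) to an image vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in theta]

def transpose_matvec(theta, y):
    """Compute theta^T y."""
    return [sum(theta[r][c] * y[r] for r in range(len(theta)))
            for c in range(len(theta[0]))]

def sq_err(u, v):
    """Squared L2 error between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def back_project(y, theta):
    """Degenerate estimator: exact on observed pixels, zero everywhere else."""
    return transpose_matvec(theta, y)

theta1 = [[1, 0, 0, 0], [0, 1, 0, 0]]  # observes pixels 0 and 1
theta2 = [[0, 0, 1, 0], [0, 0, 0, 1]]  # observes pixels 2 and 3
x_true = [1.0, 2.0, 3.0, 4.0]
y1, y2 = matvec(theta1, x_true), matvec(theta2, x_true)

x1, x2 = back_project(y1, theta1), back_project(y2, theta2)
l_self = sq_err(matvec(theta1, x1), y1) + sq_err(matvec(theta2, x2), y2)
l_swap = sq_err(matvec(theta2, x1), y2) + sq_err(matvec(theta1, x2), y1)

gamma = 0.05  # weight on the self-measurement term
objective = l_swap + gamma * l_self
```

The self term cannot distinguish this estimator from a perfect one, while the swap term exposes the unobserved pixels; this is consistent with using the self-measurement loss only as an auxiliary term.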

3.2 Unsupervised Training for Blind Image Estimation

We next consider the more challenging case of blind estimation, when the measurement parameter $\theta$ for an observation $y$ is unknown, and specifically the blind training setting, when it is unknown even during training. The blind training setting complicates the use of our unsupervised losses in (3) and (5), since the values of $\theta_{t:1}$ and $\theta_{t:2}$ used there are unknown. Also, blind estimation tasks often have a more diverse set of possible parameters $\theta$. While supervised training methods with access to ground-truth images can generate a very large database of synthetic image-measurement pairs by pairing the same image with many different $\theta$ (assuming $p_\theta(\cdot)$ is known), our unsupervised framework has access to only two measurements per image.

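To address this, we propose a “proxy training” approach that treats estimates from our network during training as a source of image ground-truth to train an estimator $g(\cdot)$ for measurement parameters. We use the image network’s predictions to construct synthetic observations as:

\[ \tilde{x}_{t:i} \leftarrow f(y_{t:i}), \qquad \tilde{y}_{t:i} = \tilde{\theta}_{t:i}\, \tilde{x}_{t:i} + \tilde{\epsilon}_{t:i}, \qquad i \in \{1, 2\}, \tag{6} \]

where $\tilde{\theta}_{t:i} \sim p_\theta$ and $\tilde{\epsilon}_{t:i} \sim p_\epsilon$ are sampled on the fly from the parameter and noise distributions, and $\leftarrow$ indicates an assignment with a “stop-gradient” operation (to prevent loss gradients on the proxy images from affecting the image estimator $f(\cdot)$). We use these synthetic observations $\tilde{y}_{t:i}$, with known sampled parameters $\tilde{\theta}_{t:i}$, to train the parameter estimation network $g(\cdot)$ based on the loss:

\[ L_{\text{prox}:\theta} = \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{2} \rho\big( g(\tilde{y}_{t:i}) - \tilde{\theta}_{t:i} \big). \tag{7} \]
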
As the parameter network $g(\cdot)$ trains with augmented data, we simultaneously use it to compute estimates of parameters for the original observations, $\hat{\theta}_{t:i} \leftarrow g(y_{t:i})$ for $i \in \{1, 2\}$, and compute the swap- and self-measurement losses in (3) and (5) on the original observations using these estimated, instead of true, parameters. Notice that we use a stop-gradient here as well, since we do not wish to train the parameter estimator $g(\cdot)$ based on the swap- or self-measurement losses: the analysis of Section 3.1 no longer holds in this case, and we empirically observe that removing the stop-gradient leads to instability and often causes training to fail.

In addition to training the parameter estimator $g(\cdot)$, the proxy training data in (6) can be used to augment training for the image estimator $f(\cdot)$, now with full supervision from the proxy images as:

\[ L_{\text{prox}:x} = \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{2} \rho\big( f(\tilde{y}_{t:i}) - \tilde{x}_{t:i} \big). \tag{8} \]

This loss can be used even in the non-blind training setting, and provides a means of generating additional training data with more pairings of image and measurement parameters. Also note that although our proxy images $\tilde{x}_{t:i}$ are approximate estimates of the true images, they represent the ground-truth for the synthetically generated observations $\tilde{y}_{t:i}$. Hence, the losses $L_{\text{prox}:\theta}$ and $L_{\text{prox}:x}$ are approximate only in the sense that they are based on images that are not sampled from the true image distribution $p_x$. The effect of this approximation diminishes as training progresses and the image estimation network produces better image predictions (especially on the training set).
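The data flow of proxy training can be sketched as follows. Current image predictions are treated as fixed ground truth, fresh parameters are sampled, synthetic observations are generated, and both estimators are scored against the known proxy targets. This is a minimal sketch under several assumptions that are not from the paper: 1-D signals, circular blur, a hypothetical kernel sampler, placeholder estimators `f` and `g`, and losses averaged over proxy samples. Plain Python has no gradients, so the stop-gradient is represented only by a copy.

```python
import random

def circ_blur(x, kernel):
    """Circular 1-D convolution, so the output has the same length as x."""
    n = len(x)
    return [sum(w * x[(i - j) % n] for j, w in enumerate(kernel)) for i in range(n)]

def sq_err(u, v):
    """Squared L2 error between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def sample_kernel(rng, taps=3):
    """Hypothetical parameter distribution: positive taps normalized to sum to 1."""
    w = [rng.random() + 1e-3 for _ in range(taps)]
    s = sum(w)
    return [v / s for v in w]

def make_proxy_set(predictions, rng, noise_std):
    """Build (x_tilde, theta_tilde, y_tilde) triples from current predictions."""
    proxy = []
    for x_hat in predictions:
        x_tilde = list(x_hat)  # copy; an autodiff framework would stop gradients here
        k = sample_kernel(rng)  # sampled, and therefore known, parameter
        y_tilde = [v + rng.gauss(0.0, noise_std) for v in circ_blur(x_tilde, k)]
        proxy.append((x_tilde, k, y_tilde))
    return proxy

def proxy_losses(proxy, f, g):
    """Average parameter-estimation and image-estimation errors on the proxy set."""
    l_theta = sum(sq_err(g(y), k) for _, k, y in proxy) / len(proxy)
    l_x = sum(sq_err(f(y), x) for x, _, y in proxy) / len(proxy)
    return l_theta, l_x

rng = random.Random(0)
predictions = [[0.0, 1.0, 0.0, 0.0], [2.0, 2.0, 0.0, 1.0]]  # stand-ins for network outputs
proxy = make_proxy_set(predictions, rng, noise_std=0.0)
f = lambda y: y                  # placeholder image estimator
g = lambda y: [1.0 / 3.0] * 3    # placeholder kernel estimator
l_theta, l_x = proxy_losses(proxy, f, g)
```

Because the parameters are sampled rather than estimated, every proxy triple comes with exact targets for both estimators, regardless of how accurate the underlying image predictions currently are.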

Our overall method randomly initializes the weights of the image and parameter networks $f$ and $g$, and then trains them with a weighted combination of all losses: $L_{\text{swap}} + \gamma L_{\text{self}} + \alpha L_{\text{prox}:x} + \beta L_{\text{prox}:\theta}$, where the scalar weights $\alpha, \beta, \gamma$ are hyper-parameters determined on a validation set. For non-blind training (of blind estimators), only the image estimator $f$ needs to be trained, and $\beta$ can be set to $0$.

4 Experiments

We evaluate our framework on two well-established tasks: non-blind image reconstruction from compressive measurements, and blind deblurring of face images. These tasks were chosen since large training sets of ground-truth images are available in both cases, which allows us to demonstrate the effectiveness of our approach through comparisons to fully supervised baselines. The source code of our implementation is available online.

4.1 Reconstruction from Compressive Measurements

We consider the task of training a CNN to reconstruct images from compressive measurements. We follow the measurement model of kulkarni2016reconnet ; istanet , where all non-overlapping patches in an image are measured individually by the same low-dimensional orthonormal matrix. Like kulkarni2016reconnet ; istanet , we train CNN models that operate on individual patches at a time, and assume ideal observations without noise (the supplementary includes additional results for noisy measurements). We train models for compression ratios of 1%, 4%, and 10% (using corresponding matrices provided by kulkarni2016reconnet ).

Method | Supervised | BSD68 1% | BSD68 4% | BSD68 10% | Set11 1% | Set11 4% | Set11 10%
TVAL3 tval3 | No | - | - | - | 16.43 | 18.75 | 22.99
ReconNet kulkarni2016reconnet | Yes | - | 21.66 | 24.15 | 17.27 | 20.63 | 24.28
ISTA-Net+ istanet | Yes | 19.14 | 22.17 | 25.33 | 17.34 | 21.31 | 26.64
Supervised Baseline (Ours) | Yes | 19.74 | 22.94 | 25.57 | 17.88 | 22.61 | 26.74
Unsupervised Training (Ours) | No | 19.67 | 22.78 | 25.40 | 17.84 | 22.20 | 26.33
Table 1: Performance (PSNR, in dB) of various methods for compressive measurement reconstruction, on BSD68 and Set11 images for different compression ratios.
Figure 2: Images reconstructed by various methods from compressive measurements (at 10% ratio). Panels: ground truth, ReconNet kulkarni2016reconnet , ISTA-Net+ istanet , Supervised Baseline (Ours), Unsupervised Training (Ours). PSNR for the two examples shown, in the same order of methods: 21.89 / 23.61 / 24.34 / 24.03 dB and 21.29 / 23.66 / 24.37 / 24.17 dB.

We generate training and validation sets by taking crops from images in the ImageNet database imagenet . We use a CNN architecture that stacks two U-Nets unet , with a residual connection between the two (see supplementary). We begin by training our architecture with full supervision, using all overlapping patches from the training images, and a loss between the network’s predictions and the ground-truth image patches. For unsupervised training with our approach, we create two partitions of the original image, each containing non-overlapping patches. The partitions themselves overlap, with patches in one partition being shifted from those in the other (see supplementary). We measure patches in both partitions with the same measurement matrix, to yield two sets of measurements. These provide the diversity required by our method, as each pixel is measured within a different patch in the two partitions. Moreover, this measurement scheme can be implemented simply in practice by camera translation. The shifts for each image are randomly selected, but kept fixed throughout training. Since the network operates independently on patches, it can be used on measurements from both partitions. To compute the swap-measurement loss, we take the network’s individual patch predictions from one partition, arrange them to form the image, and then extract the shifted patches corresponding to the other partition and apply the measurement matrix to them. The weight for the self-measurement loss is set to 0.05 based on the validation set.
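The two-partition scheme can be illustrated with a small indexing sketch. The paper defers the exact construction to its supplementary material, so the code below is only one plausible reading: non-overlapping square patches on a regular grid, with the second grid offset by a fixed shift. The function names, the 7 x 7 image, the patch size of 3, and the shift of (1, 1) are all hypothetical.

```python
def patch_origins(height, width, patch, shift=(0, 0)):
    """Top-left corners of non-overlapping patch x patch blocks, offset by shift."""
    dy, dx = shift
    return [(r, c)
            for r in range(dy, height - patch + 1, patch)
            for c in range(dx, width - patch + 1, patch)]

def offset_in_patch(pixel, origins, patch):
    """Position of a pixel inside the block that covers it, or None if uncovered."""
    pr, pc = pixel
    for r, c in origins:
        if r <= pr < r + patch and c <= pc < c + patch:
            return (pr - r, pc - c)
    return None

H = W = 7
P = 3
part_a = patch_origins(H, W, P, shift=(0, 0))  # first partition
part_b = patch_origins(H, W, P, shift=(1, 1))  # shifted second partition

# The same pixel sits at different positions within its patch in each partition,
# so the same measurement matrix mixes it with different neighbours.
pos_a = offset_in_patch((3, 3), part_a, P)
pos_b = offset_in_patch((3, 3), part_b, P)
```

Even though a single measurement matrix is reused, the shift changes which pixels are grouped together, which is the source of measurement diversity in this setting.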

In Table 1, we first compare our fully supervised baseline to existing compressive sensing methods that use supervised training kulkarni2016reconnet ; istanet as well as one that uses a manual regularizer tval3 (numbers are reported from istanet ), and show that it achieves state-of-the-art performance. We then report results for training with our unsupervised framework, and find that this leads to accurate models that lag our supervised baseline by only about 0.4 dB or less in terms of average PSNR on both test sets and, in most cases, actually outperform previous methods. This is despite the fact that these models have been trained without any access to ground-truth images. Figure 2 provides example reconstructions for some images, and we find that results from our unsupervised method are extremely close in visual quality to those of the baseline model trained with full supervision.
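For readers less familiar with the metric, PSNR can be computed as in the sketch below. The peak value of 255 (8-bit intensities) is an assumption, since the evaluation convention is not spelled out in this section, and the function name is illustrative.

```python
import math

def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)

reference = [10.0, 20.0, 30.0, 40.0]
estimate = [11.0, 21.0, 31.0, 41.0]   # every pixel off by one gray level
value = psnr(reference, estimate)

# A PSNR gap of d dB corresponds to an MSE ratio of 10 ** (d / 10).
mse_ratio = 10 ** (0.4 / 10)
```

Under this definition, a 0.4 dB gap corresponds to a mean squared error that is roughly 10% higher, which gives a sense of how close the unsupervised models are to the supervised baseline.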

4.2 Blind Face Image Deblurring

We next consider the problem of blind motion deblurring of face images. Like facedeblur , we consider the problem of restoring aligned and cropped face images that have been affected by motion blur (through convolution with motion blur kernels of bounded size) and by Gaussian noise with a standard deviation of two gray levels. We use all 160k images in the CelebA training set celeba and 1.8k images from the Helen training set helen to construct our training set, and 2k images from CelebA val and 200 from the Helen training set for our validation set. We use a set of 18k and 2k random motion kernels for training and validation respectively, generated using the method described in chakrabarti2016neural . We evaluate our method on the official blurred test images provided by facedeblur (derived from the CelebA and Helen test sets). Note that unlike facedeblur , we do not use any semantic labels for training.

In this case, we use a single U-Net architecture to map blurry observations to sharp images. We again train a model for this architecture with full supervision, generating blurry-sharp training pairs on the fly by pairing random blur kernels from the training set with the sharp images. Then, for unsupervised training with our approach, we choose two kernels for each training image to form a training set of measurement pairs that are kept fixed (including the added Gaussian noise) across all epochs of training. We first consider non-blind training, using the true blur kernels to compute the swap- and self-measurement losses. Here, we consider training with and without the proxy loss for the network. Then, we consider the blind training case where we also learn an estimator for blur kernels, and use its predictions to compute the measurement losses. Instead of training an entirely separate network, we share the initial layers with the image U-Net, and form a separate decoder path going from the bottleneck to the blur kernel. The loss weights are all set to one in this case.

We report results for all versions of our method in Table 2, and compare them to facedeblur , as well as to a traditional deblurring method that is not trained on face images xu2013unnatural . We find that with full supervision, our architecture achieves state-of-the-art performance. With non-blind training, our method comes close to supervised performance when using the proxy loss, but does worse without it, which highlights the utility of the proxy loss even in the non-blind setting. Finally, we note that models derived using blind training with our approach are also able to produce results nearly as accurate as those trained with full supervision, despite lacking access both to ground-truth image data and to knowledge of the blur kernels in their training measurements. Figure 3 illustrates this performance qualitatively, with example deblurred results from various models on the official test images. We also visualize the blur kernel estimator learned during blind training with our approach in Fig. 4 on images from our validation set. Additional results, including those on real images, are included in the supplementary.

Method | Supervised | Helen PSNR | Helen SSIM | CelebA PSNR | CelebA SSIM
Xu et al. xu2013unnatural | No | 20.11 | 0.711 | 18.93 | 0.685
Shen et al. facedeblur | Yes | 25.99 | 0.871 | 25.05 | 0.879
Supervised Baseline (Ours) | Yes | 26.13 | 0.886 | 25.20 | 0.892
Unsupervised Non-blind (Ours) | No | 25.95 | 0.878 | 25.09 | 0.885
Unsupervised Non-blind (Ours) without proxy loss | No | 25.47 | 0.867 | 24.64 | 0.873
Unsupervised Blind (Ours) | No | 25.93 | 0.876 | 25.06 | 0.883

Table 2: Performance of various methods on blind face deblurring on test images from facedeblur .
Figure 3: Blind face deblurring results using various methods. Results from our unsupervised approach, with both non-blind and blind training, nearly match the quality of the supervised baseline. Panels: ground truth, blurred input, Shen et al. facedeblur , Supervised (Ours), Non-blind (Ours), Blind (Ours). Values shown with the four examples, for the last four panels in order: 22.69 / 24.61 / 25.16 / 25.19; 26.83 / 28.18 / 28.27 / 28.16; 26.59 / 28.29 / 27.42 / 26.77; 22.36 / 23.50 / 22.84 / 22.94.
Figure 4: Image and kernel predictions on validation images. We show outputs of our model’s kernel estimator, which is learned as part of blind training to compute swap- and self-measurement losses. Panels: ground truth, blurred input, and predictions, for two sets of examples.

5 Conclusion

We presented an unsupervised method to train image estimation networks from only measurement pairs, without access to ground-truth images and, in blind settings, without knowledge of measurement parameters. In this paper, we validated this approach on well-established tasks where sufficient ground-truth data (for natural and face images) was available, since this allowed us to compare to training with full supervision and to study the performance gap between the supervised and unsupervised settings. But we believe that our method’s real utility will be in opening up the use of CNNs for image estimation to new domains (such as medical imaging and applications in astronomy) where such use has so far been infeasible due to the difficulty of collecting large ground-truth datasets.

Acknowledgments. This work was supported by the NSF under award no. IIS-1820693.


  • (1) Rushil Anirudh, Jayaraman J Thiagarajan, Bhavya Kailkhura, and Timo Bremer. An unsupervised approach to solving inverse problems using generative adversarial networks. arXiv preprint arXiv:1805.07281, 2018.
  • (2) Ashish Bora, Eric Price, and Alexandros G Dimakis. Ambientgan: Generative models from lossy measurements. In International Conference on Learning Representations (ICLR), 2018.
  • (3) Antoni Buades, Bartomeu Coll, and J-M Morel. A non-local algorithm for image denoising. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), volume 2, pages 60–65. IEEE, 2005.
  • (4) Ayan Chakrabarti. A neural approach to blind motion deblurring. In European conference on computer vision, pages 221–235. Springer, 2016.
  • (5) Jen-Hao Rick Chang, Chun-Liang Li, Barnabas Poczos, BVK Vijaya Kumar, and Aswin C Sankaranarayanan. One network to solve them all-solving linear inverse problems using deep projection models. In Proc. ICCV, 2017.
  • (6) Yunjin Chen and Thomas Pock. Trainable nonlinear reaction diffusion: A flexible framework for fast and effective image restoration. IEEE transactions on pattern analysis and machine intelligence, 39(6):1256–1272, 2017.
  • (7) Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Image super-resolution using deep convolutional networks. IEEE transactions on pattern analysis and machine intelligence, 38(2):295–307, 2015.
  • (8) David L Donoho et al. Compressed sensing. IEEE Transactions on information theory, 52(4):1289–1306, 2006.
  • (9) Mário AT Figueiredo, Robert D Nowak, and Stephen J Wright. Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems. IEEE Journal of selected topics in signal processing, 1(4):586–597, 2007.
  • (10) William T Freeman, Thouis R Jones, and Egon C Pasztor. Example-based super-resolution. IEEE Computer graphics and Applications, (2):56–65, 2002.
  • (11) Clément Godard, Oisin Mac Aodha, and Gabriel J Brostow. Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 270–279, 2017.
  • (12) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • (13) Kuldeep Kulkarni, Suhas Lohit, Pavan Turaga, Ronan Kerviche, and Amit Ashok. Reconnet: Non-iterative reconstruction of images from compressively sensed measurements. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 449–458, 2016.
  • (14) Vuong Ba Lê, Jonathan Brandt, Zhe L. Lin, Lubomir D. Bourdev, and Thomas S. Huang. Interactive facial feature localization. In ECCV, 2012.
  • (15) Jaakko Lehtinen, Jacob Munkberg, Jon Hasselgren, Samuli Laine, Tero Karras, Miika Aittala, and Timo Aila. Noise2noise: Learning image restoration without clean data. arXiv preprint arXiv:1803.04189, 2018.
  • (16) Chengbo Li, Wotao Yin, Hong Jiang, and Yin Zhang. An efficient augmented lagrangian method with applications to total variation minimization. Computational Optimization and Applications, 56(3):507–530, 2013.
  • (17) Zhengqi Li and Noah Snavely. Learning intrinsic image decomposition from watching the world. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9039–9048, 2018.
  • (18) Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.
  • (19) Michael Lustig, David L Donoho, Juan M Santos, and John M Pauly. Compressed sensing mri. IEEE signal processing magazine, 25(2):72, 2008.
  • (20) Wei-Chiu Ma, Hang Chu, Bolei Zhou, Raquel Urtasun, and Antonio Torralba. Single image intrinsic decomposition without a single intrinsic image. In Proceedings of the European Conference on Computer Vision (ECCV), pages 201–217, 2018.
  • (21) Seungjun Nah, Tae Hyun Kim, and Kyoung Mu Lee. Deep multi-scale convolutional neural network for dynamic scene deblurring. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 257–265, 2017.
  • (22) Yaniv Romano, Michael Elad, and Peyman Milanfar. The little engine that could: Regularization by denoising (red). SIAM Journal on Imaging Sciences, 10(4):1804–1844, 2017.
  • (23) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-assisted Intervention, 2015.
  • (24) Stefan Roth and Michael J Black. Fields of experts. IJCV, 2009.
  • (25) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 2015.
  • (26) Ziyi Shen, Wei-Sheng Lai, Tingfa Xu, Jan Kautz, and Ming-Hsuan Yang. Deep semantic face deblurring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8260–8269, 2018.
  • (27) Li Xu, Shicheng Zheng, and Jiaya Jia. Unnatural l0 sparse representation for natural image deblurring. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1107–1114, 2013.
  • (28) Lu Yuan, Jian Sun, Long Quan, and Heung-Yeung Shum. Image deblurring with blurred/noisy image pairs. ACM Transactions on Graphics (TOG), 26(3):1, 2007.
  • (29) Jian Zhang and Bernard Ghanem. Ista-net: Interpretable optimization-inspired deep network for image compressive sensing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1828–1837, 2018.
  • (30) Kai Zhang, Wangmeng Zuo, Shuhang Gu, and Lei Zhang. Learning deep cnn denoiser prior for image restoration. In Proc. CVPR, 2017.
  • (31) Tinghui Zhou, Matthew Brown, Noah Snavely, and David G Lowe. Unsupervised learning of depth and ego-motion from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1851–1858, 2017.
  • (32) Daniel Zoran and Yair Weiss. From learning models of natural image patches to whole image restoration. In 2011 International Conference on Computer Vision, pages 479–486. IEEE, 2011.

Appendix A Additional Results

A.1 Reconstruction from Compressive Measurements

We include additional results for the case where the compressive measurements are corrupted by additive white Gaussian noise. When training with full supervision, we generate the noisy measurements on the fly, so that each image yields many different noisy compressed measurements over the course of training. For unsupervised training with our approach, however, we keep the noise values (along with the measurement parameters) for each image fixed across all training epochs. Table 3 reports results for Gaussian noise with different standard deviations at the 10% compression ratio. Again, our unsupervised training approach comes close to matching the accuracy of the fully supervised baseline. Figure 5 shows example reconstructions for this case.
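The difference between the two noise protocols described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names, the per-image seeding scheme, and the use of a flat list for a measurement vector are all assumptions made for the example.

```python
import random

def noisy_measurement(clean, sigma, rng):
    """Add white Gaussian noise with standard deviation `sigma` to each value."""
    return [v + rng.gauss(0.0, sigma) for v in clean]

def supervised_sample(clean, sigma, rng):
    # Supervised baseline: fresh noise is drawn every time an image is visited,
    # so the same image gives a different measurement in every epoch.
    return noisy_measurement(clean, sigma, rng)

def unsupervised_sample(clean, sigma, image_index, base_seed=0):
    # Unsupervised setting: the noise is tied to the image (hypothetical seeding
    # scheme), so every epoch sees exactly the same noisy measurement.
    rng = random.Random(base_seed * 1_000_003 + image_index)
    return noisy_measurement(clean, sigma, rng)

if __name__ == "__main__":
    clean = [0.1, 0.5, 0.9]
    shared = random.Random(1)
    print(supervised_sample(clean, 0.1, shared) == supervised_sample(clean, 0.1, shared))  # False
    print(unsupervised_sample(clean, 0.1, 7) == unsupervised_sample(clean, 0.1, 7))        # True
```

Fixing the noise per image mimics a setting in which each measurement was captured once and cannot be re-acquired.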

| Method | BSD68, σ=0 | BSD68, σ=0.1 | BSD68, σ=0.2 | BSD68, σ=0.3 | Set11, σ=0 | Set11, σ=0.1 | Set11, σ=0.2 | Set11, σ=0.3 |
| Supervised Baseline | 25.57 | 24.60 | 23.49 | 22.57 | 26.74 | 25.24 | 23.67 | 22.30 |
| Unsupervised Training | 25.40 | 24.41 | 23.12 | 21.99 | 26.33 | 24.94 | 23.21 | 21.79 |
Table 3: Performance (PSNR, in dB) of our supervised baseline and proposed unsupervised method for reconstruction from noisy compressive measurements, on BSD68 and Set11 images, for different noise standard deviations σ at a compression ratio of 10%.
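For reference, the PSNR figures reported here follow the standard definition, PSNR = 10 log10(peak² / MSE). The sketch below assumes intensities in [0, 1] (so `peak=1.0`) and flat lists of pixel values; both are assumptions of the example, not details stated in the text.

```python
import math

def psnr(estimate, reference, peak=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(estimate) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((e - r) ** 2 for e, r in zip(estimate, reference)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)

if __name__ == "__main__":
    # A uniform error of 0.1 on a [0, 1] scale gives an MSE of 0.01, i.e. about 20 dB.
    print(round(psnr([0.5, 0.5], [0.4, 0.6]), 2))
```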


Figure 5: Example reconstructions from noisy compressive measurements, with supervised and unsupervised models. PSNR labels on the panels (dB): 31.26, 30.64; 24.67, 24.55; 26.13, 26.00.

A.2 Blind Face Image Deblurring

We show additional face deblurring results on the test set of Shen et al. (26) in Fig. 6. Shen et al. (26) also provide a dataset of real blurred images that are aligned, cropped, and scaled. While no ground-truth image data is available for this set, we include example results from it in Fig. 7 for qualitative evaluation. We again find that results from models trained using our unsupervised approach are close in visual quality to those from our supervised baseline.

Columns: Ground truth, Blurred input, Shen et al. (26), Supervised (Ours), Non-blind (Ours), Blind (Ours)
PSNR labels (dB), one line per example:
25.28, 27.01, 26.35, 26.21
22.06, 23.76, 23.77, 23.96
24.34, 26.38, 26.06, 25.88
24.84, 26.20, 27.05, 26.77
25.87, 27.14, 26.87, 26.67
29.04, 30.10, 29.61, 29.05
27.59, 29.82, 30.62, 30.66
Figure 6: Additional face deblurring results on the test set of Shen et al. (26).
Columns: Blurred input, Shen et al. (26), Supervised (Ours), Non-blind (Ours), Blind (Ours)
Figure 7: Face deblurring results on real blurry face images provided by Shen et al. (26).
Figure 8: Forming pairs of compressive measurements with shifted partitions. We form our measurements by dividing the image into two shifted sets of overlapping patches, where the shifts are sampled randomly for each training image but kept fixed through all epochs of training. All patches, in both partitions, are measured with a common measurement matrix. This provides the measurement diversity our method requires, since each pixel in the image (except those near the boundaries) is measured twice, each time within a different overlapping patch.

Appendix B Network Architectures and Details of Training

Both our compressive measurement reconstruction and face deblurring networks are based on U-Net (23), featuring encoder-decoder architectures with skip connections. We use convolutional layers with stride larger than 1 for downsampling, and transpose convolutional layers for upsampling. Except for the last layer of each network, all layers are followed by batch normalization and ReLU. We use distance as for all losses for compressive measurement reconstruction, and the distance (again, for all losses) in blind face deblurring. All networks are trained with the Adam optimizer (12) and a learning rate of . We drop the learning rate twice by when the loss on the validation set flattens out. Training takes about one to two days on a 1080 Ti GPU.
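A schedule of the kind described above (dropping the learning rate twice once the validation loss flattens out) could be sketched as below. The plateau test (`patience`, `min_delta`), the function name, and the drop `factor` are hypothetical: the text does not specify how flattening is detected, and the values used in the example are placeholders only.

```python
def plateau_schedule(val_losses, lr, factor, patience=3, min_delta=0.0, max_drops=2):
    """Return the learning rate in effect at each epoch.

    The rate is multiplied by `factor` whenever the validation loss has not
    improved for `patience` consecutive epochs, at most `max_drops` times.
    """
    best = float("inf")
    stale = 0   # epochs since the last improvement
    drops = 0   # number of drops applied so far
    history = []
    for loss in val_losses:
        history.append(lr)
        if loss < best - min_delta:
            best, stale = loss, 0
        else:
            stale += 1
        if stale >= patience and drops < max_drops:
            lr *= factor
            drops += 1
            stale = 0
    return history

if __name__ == "__main__":
    losses = [1.0, 0.8, 0.8, 0.8, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7]
    # Placeholder values for the initial rate and factor, chosen for readability.
    print(plateau_schedule(losses, lr=1.0, factor=0.5, patience=2))
```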

Compressive Reconstruction. Our compressive measurement reconstruction network is a stack of two U-Nets, with the detailed configuration of each U-Net shown in Table 4. Given a compressed vector for a single patch and the sensing matrix , we first compute and reshape it to the original size of the patch (i.e., ) and input this to the first U-Net. The second U-Net then takes as input the concatenation of and the output from the first U-Net. Finally, we add the outputs of these two U-Nets to derive the final estimate of the image.
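The data flow through the two stacked U-Nets can be sketched as follows. The networks are replaced by stub callables, and the initial back-projection is assumed to be the product of the transposed sensing matrix with the measurement vector; that choice, along with all names below, is an assumption of this sketch rather than something confirmed by the text.

```python
def back_project(A, y):
    """Compute A^T y for an m x n matrix A (list of rows) and a length-m vector y."""
    n = len(A[0])
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(n)]

def reshape_square(v, side):
    """Reshape a flat list of side*side values into a side x side patch."""
    return [v[r * side:(r + 1) * side] for r in range(side)]

def reconstruct(A, y, side, unet1, unet2):
    x0 = reshape_square(back_project(A, y), side)  # assumed initial estimate
    out1 = unet1([x0])           # first U-Net: one input channel
    out2 = unet2([x0, out1])     # second U-Net: two concatenated channels
    # Final estimate is the sum of the two U-Net outputs.
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(out1, out2)]

if __name__ == "__main__":
    A = [[1, 0, 0, 0], [0, 0, 0, 1]]       # toy 2 x 4 sensing matrix
    stub1 = lambda channels: channels[0]    # stand-in for U-Net-1
    stub2 = lambda channels: channels[1]    # stand-in for U-Net-2
    print(reconstruct(A, [5, 7], 2, stub1, stub2))  # [[10, 0], [0, 14]]
```

The channel counts seen by the two stubs (one, then two) match the "1 or 2" input channels listed for conv1 in Table 4.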

Our approach to deriving measurement pairs during training is visualized in Fig. 8.
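The shifted-partition scheme from the caption of Fig. 8 can be illustrated with a short sketch. It assumes each partition is a regular grid of patches fully contained in the image, and that the two shifts are drawn per image from a seeded generator and forced to differ; the seeding scheme and function names are hypothetical.

```python
import random

def partition_origins(height, width, patch, shift):
    """Top-left corners of the patches in one partition, offset by `shift`."""
    dy, dx = shift
    return [(y, x)
            for y in range(dy, height - patch + 1, patch)
            for x in range(dx, width - patch + 1, patch)]

def sample_shift_pair(patch, image_index, base_seed=0):
    """Draw two distinct shifts for one image; fixed across epochs via the seed."""
    rng = random.Random(base_seed * 1_000_003 + image_index)
    first = (rng.randrange(patch), rng.randrange(patch))
    while True:
        second = (rng.randrange(patch), rng.randrange(patch))
        if second != first:
            return first, second

def covered_pixels(patch, origins):
    """Set of pixel coordinates covered by the patches at `origins`."""
    return {(r, c)
            for y, x in origins
            for r in range(y, y + patch)
            for c in range(x, x + patch)}

if __name__ == "__main__":
    a = covered_pixels(4, partition_origins(12, 12, 4, (0, 0)))
    b = covered_pixels(4, partition_origins(12, 12, 4, (2, 2)))
    # Interior pixels fall inside a patch of each partition; boundary pixels may not.
    print(len(a), len(b), len(a & b))  # 144 64 64
```

Because the two grids are offset, a pixel measured in both partitions sits at a different position within its patch each time, so the common measurement matrix still sees it differently.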

| Input | Output | Kernel Size | # input channels | # output channels | Stride | Output Size |
| network input (⊕ U-Net-1 out for the second U-Net) | conv1 | 2 | 1 or 2 | 32 | 1 | 32 (VALID) |
| conv1 | conv2 | 4 | 32 | 64 | 2 | 16 |
| conv2 | conv3 | 4 | 64 | 128 | 2 | 8 |
| conv3 | conv4 | 4 | 128 | 256 | 2 | 4 |
| conv4 | conv5 | 4 | 256 | 256 | 2 | 2 |
| conv5 | conv6 | 4 | 256 | 256 | 2 | 1 |
| conv6 | upconv1 | 4 | 256 | 256 | 1/2 | 2 |
| conv5 ⊕ upconv1 | upconv2 | 4 | 512 | 256 | 1/2 | 4 |
| conv4 ⊕ upconv2 | upconv3 | 4 | 512 | 128 | 1/2 | 8 |
| conv3 ⊕ upconv3 | upconv4 | 4 | 256 | 64 | 1/2 | 16 |
| conv2 ⊕ upconv4 | upconv5 | 4 | 128 | 32 | 1/2 | 32 |
| conv1 ⊕ upconv5 | upconv6 | 2 | 64 | 32 | 1 | 33 (VALID) |
| upconv6 | end1 | 3 | 32 | 32 | 1 | 33 |
| end1 | end2 | 1 | 32 | 1 | 1 | 33 |
Table 4: Detailed architecture of the U-Net used for compressive measurement reconstruction. We stack two such networks together, and the final image estimate is the sum of their outputs. All “upconv” layers correspond to transpose convolution, ⊕ implies concatenation, and unless indicated with “VALID”, all layers use “SAME” padding.
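The output sizes in Table 4 can be checked with a few lines of arithmetic. The sketch below assumes the usual TensorFlow-style meaning of “SAME” and “VALID” padding and treats a stride of 1/2 as a transpose convolution that upsamples by two; the layer lists are transcribed from the table and the helper names are hypothetical.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution."""
    if padding == "SAME":
        return -(-size // stride)            # ceil(size / stride)
    return (size - kernel) // stride + 1     # VALID

def tconv_out(size, kernel, stride, padding):
    """Spatial output size of a transpose convolution (stride = upsampling factor)."""
    if padding == "SAME":
        return size * stride
    return (size - 1) * stride + kernel      # VALID

ENCODER = [("conv1", 2, 1, "VALID")] + [(f"conv{i}", 4, 2, "SAME") for i in range(2, 7)]
DECODER = [(f"upconv{i}", 4, 2, "SAME") for i in range(1, 6)] + [("upconv6", 2, 1, "VALID")]
HEAD = [("end1", 3, 1, "SAME"), ("end2", 1, 1, "SAME")]

def trace_table4(size):
    """Map each layer name to its spatial output size for a size x size input."""
    sizes = {}
    for name, k, s, p in ENCODER:
        size = conv_out(size, k, s, p)
        sizes[name] = size
    for name, k, s, p in DECODER:
        size = tconv_out(size, k, s, p)
        sizes[name] = size
    for name, k, s, p in HEAD:
        size = conv_out(size, k, s, p)
        sizes[name] = size
    return sizes

if __name__ == "__main__":
    print(trace_table4(33))
```

Under these conventions, the listed sizes imply a 33×33 input: the first “VALID” layer trims it to 32 so that it halves cleanly down to 1, and the last “VALID” transpose convolution restores the 33×33 size.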

Face deblurring. Our face deblurring network is also a U-Net that maps the blurred observation to a sharp image estimate of the same size. For blind training, we have an auxiliary decoder path to produce the kernel estimate (i.e., to act as ). The kernel decoder path has the same number of transpose convolution layers, but only the first few upsample by two and have skip connections, since the kernel is smaller. The remaining transpose convolution layers have stride 1, but still increase the spatial size (as they represent the transpose of a “VALID” convolution). The final output of the kernel decoder path is passed through a “softmax” that is normalized across spatial locations. This yields a kernel with elements that sum to 1 (which matches the constraint that the blur kernel does not change the average intensity, or DC value, of the image). The detailed architecture is presented in Table 5.
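The spatially normalized softmax at the end of the kernel decoder can be sketched as follows for a single-channel map of logits. The function name and the nested-list representation are assumptions of the example.

```python
import math

def spatial_softmax(logits):
    """Softmax taken jointly over all spatial positions of a 2-D map.

    The result is non-negative and sums to 1, so it can serve as a blur
    kernel that preserves the average (DC) intensity of the image.
    """
    peak = max(v for row in logits for v in row)           # for numerical stability
    exps = [[math.exp(v - peak) for v in row] for row in logits]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]

if __name__ == "__main__":
    kernel = spatial_softmax([[0.0, 1.0], [2.0, 3.0]])
    print(sum(sum(row) for row in kernel))  # 1.0 up to floating-point rounding
```

Normalizing over spatial positions, rather than over channels as in a classifier, is what makes the output a valid kernel.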

| Input | Output | Kernel Size | # input channels | # output channels | Stride | Output Size |
| RGB | conv1 | 4 | 3 | 64 | 2 | 64 |
| conv1 | conv2 | 4 | 64 | 128 | 2 | 32 |
| conv2 | conv3 | 4 | 128 | 256 | 2 | 16 |
| conv3 | conv4 | 4 | 256 | 512 | 2 | 8 |
| conv4 | conv5 | 4 | 512 | 512 | 2 | 4 |
| conv5 | conv6 | 4 | 512 | 512 | 2 | 2 |
| conv6 | conv7 | 4 | 512 | 512 | 2 | 1 |
| conv7 | upconv1 | 4 | 512 | 512 | 1/2 | 2 |
| conv6 ⊕ upconv1 | upconv2 | 4 | 1024 | 512 | 1/2 | 4 |
| conv5 ⊕ upconv2 | upconv3 | 4 | 1024 | 512 | 1/2 | 8 |
| conv4 ⊕ upconv3 | upconv4 | 4 | 1024 | 256 | 1/2 | 16 |
| conv3 ⊕ upconv4 | upconv5 | 4 | 512 | 128 | 1/2 | 32 |
| conv2 ⊕ upconv5 | upconv6 | 4 | 256 | 64 | 1/2 | 64 |
| conv1 ⊕ upconv6 | output | 4 | 128 | 3 | 1/2 | 128 |
| conv7 | kupconv1 | 4 | 512 | 512 | 1/2 | 2 |
| conv6 ⊕ kupconv1 | kupconv2 | 4 | 1024 | 512 | 1/2 | 4 |
| conv5 ⊕ kupconv2 | kupconv3 | 4 | 1024 | 512 | 1/2 | 8 |
| conv4 ⊕ kupconv3 | kupconv4 | 4 | 1024 | 256 | 1/2 | 16 |
| conv3 ⊕ kupconv4 | kupconv5 | 4 | 512 | 128 | 1 | 19 (VALID) |
| kupconv5 | kupconv6 | 4 | 128 | 64 | 1 | 22 (VALID) |
| kupconv6 | kupconv7 | 4 | 128 | 64 | 1 | 25 (VALID) |
| kupconv7 | koutput | 3 | 128 | 64 | 1 | 27 (VALID) |
Table 5: Architecture of the U-Net used for blind face deblurring, where ⊕ denotes concatenation. The second decoder path that produces a kernel estimate (koutput) is only used for blind training.
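The spatial sizes along the kernel decoder path in Table 5 follow from two rules: the first four layers double the size, and each later stride-1 transpose of a “VALID” convolution with kernel size k grows the map by k − 1. A minimal check, with hypothetical function names and the kernel sizes transcribed from the table:

```python
def transposed_valid_size(size, kernel, stride=1):
    """Output size of the transpose of a VALID convolution."""
    return (size - 1) * stride + kernel

def kernel_path_sizes(start=1):
    """Spatial sizes of kupconv1..kupconv7 and koutput, starting from conv7 (1x1)."""
    size = start
    sizes = []
    for _ in range(4):               # kupconv1-4: upsample by two
        size *= 2
        sizes.append(size)
    for kernel in (4, 4, 4, 3):      # kupconv5-7 and koutput: stride 1, VALID
        size = transposed_valid_size(size, kernel)
        sizes.append(size)
    return sizes

if __name__ == "__main__":
    print(kernel_path_sizes())  # [2, 4, 8, 16, 19, 22, 25, 27]
```

The last value shows that this path ends in a 27×27 map, which is then normalized by the spatial softmax to form the kernel estimate.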