Image restoration (IR) attempts to reconstruct clean signals from their corrupted observations, which is known to be an ill-posed inverse problem. By accommodating different types of corruption distributions, the same mathematical model applies to problems such as image denoising, super-resolution and deblurring. Recently, deep neural networks (DNNs) and generative adversarial networks (GANs) [10] have shown superior performance in various low-level vision tasks. Nonetheless, most of these methods need paired training data for specific tasks, which limits their generality, scalability and practicality in real-world multimedia applications. In addition, strong supervision may lead to overfitting during training and poorer generalization to real image corruption types.
More recently, domain-transfer-based unsupervised learning methods have attracted much attention due to the great progress [9, 18, 20, 21, 40] achieved in style transfer, attribute editing and image translation, e.g., CycleGAN [40], UNIT [21] and DRIT [18]. Although these methods have been extended to specific restoration tasks, they cannot reconstruct high-quality images because they lose finer details or produce inconsistent backgrounds, as shown in Fig. 1. Unlike DNN-based supervised models, which aim at learning a powerful mapping between noisy and clean images, existing domain-transfer methods are unsuitable for direct application to generalized image inverse problems for the following reasons:
Indistinct Domain Boundary. Image translation aims to learn abstract shared-representations from unpaired data with clear domain characteristics, such as horse-to-zebra, day-to-night, etc. On the contrary, varying noise levels and complicated backgrounds blur domain boundaries between unpaired inputs.
Weak Representation. Unsupervised domain-adaption methods extract high-level representations from unpaired data using a shared-weight encoder and an explicit target-domain discriminator. For slightly noisy signals, this easily causes domain shift problems in the translated images and leads to low-quality reconstruction.
Poor Generalization. Image translation learns a one-to-one mapping between image domains, which hardly captures generalized semantic and texture representations. This also exacerbates the instability of GAN training.
To address these problems, inspired by image sparse representation [24] and domain adaption [7, 8], we attempt to learn invariant representations from unpaired samples via domain adaption and reconstruct clean images, instead of relying on pure unsupervised domain transfer. Different from general image translation methods [18, 21, 40], our goal is to learn a robust intermediate representation free of noise (referred to as the Invariant Representation) and reconstruct clean observations. Specifically, we factorize content and noise representations for corrupted images via disentangled learning; a representation discriminator is then utilized to align features to the expected distribution of the clean domain. In addition, extra self-supervised modules, including background and semantic consistency constraints, further supervise representation learning from the image domain.
In short, the main contributions of the paper can be summarized as follows: 1) We propose a data-driven unsupervised representation learning method for image restoration, which is easily extended to other low-level vision tasks such as super-resolution and deblurring. 2) We disentangle deep representations via dual domain constraints, i.e., in the feature and image domains. Extra self-supervised modules, including semantic meaning and background consistency modules, further improve the robustness of the representations. 3) We build an unsupervised image restoration framework based on cross-domain transfer with more effective training and faster convergence. To our knowledge, this is the first unsupervised representation learning approach that achieves competitive results on both synthetic and real noise removal with end-to-end training.
2 Related Work
2.1 Single Image Restoration
Traditional Methods. Classical methods, including Total Variation [29, 34], BM3D [5], Non-local means [2] and dictionary learning [3, 12], have achieved good performance on general image restoration tasks, such as image denoising, super-resolution and deblurring. In addition, since image restoration is in general an ill-posed problem, some regularization-based methods have also proved effective [11, 42].
Deep Neural Networks. Relying on powerful computing resources, data-driven DNN methods have achieved better performance than traditional methods in the past few years. Vincent et al. [35] proposed the stacked denoising auto-encoder for image restoration. Xie et al. [36] combined sparse coding and pre-trained DNNs for image denoising and inpainting. Mao et al. [26] proposed RedNet with symmetric skip connections for noise removal and super-resolution. Zhang et al. [39] introduced residual learning for Gaussian noise removal. In general, DNN-based methods achieve superior results on synthetic noise removal via effective supervised training, but they are unsuitable for real-world applications.
2.2 Unsupervised Learning for IR
Learning from noisy observations. One interesting direction for unsupervised IR is directly recovering clean signals from noisy observations. Ulyanov et al. [32] proposed the deep image prior (DIP) for IR, which requires a suitable network and interrupts its training process based on a low-level statistical prior; the appropriate stopping point is usually unpredictable for different samples. Using a zero-mean noise distribution prior, Noise2Noise (N2N) [19] directly learns reconstruction between two images with independent noise sampling, which is unsuitable for real-world noise removal, e.g., medical image denoising. To alleviate this problem, Noise2Void [17] predicts a pixel from its surroundings by learning a blind-spot network for corrupted images. Similar to Noise2Self [1], this method not only reduces training efficiency but also decreases denoising performance.
Image Domain Transfer. Another direction solves image restoration by domain transfer, which aims to learn a one-to-one mapping from one domain to another so that the output image lies on the manifold of clean images. Previous works, e.g., CycleGAN [40], DualGAN [37] and BicycleGAN [41], have shown great capacity in image translation. Extended works, including CoupledGAN [22], UNIT [21] and DRIT [18], learn shared-latent representations for diverse image translation. Along this line, Yuan et al. [38] proposed a nested CycleGAN to solve unsupervised image super-resolution. Extending DRIT, Lu et al. [23] decoupled the image content domain and the blur domain to solve image deblurring, referred to as DRNet. However, because these methods aim to learn stronger domain generators, they require an obvious domain boundary and a complicated network structure.
3 The Proposed Method
Our goal is to learn abstract intermediate representations from noisy inputs and reconstruct clean observations. In a certain sense, unsupervised IR can be viewed as a specific domain transfer problem, i.e., from the noise domain to the clean domain. Therefore, our method is embedded into the general domain transfer architecture, as shown in Fig. 2.
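In supervised domain transfer, we are given samples $(x, y)$ drawn from a joint distribution $P_{\mathcal{X},\mathcal{Y}}(x, y)$, where $\mathcal{X}$ and $\mathcal{Y}$ are two image domains. For unsupervised domain translation, samples $(x, y)$ are drawn from the marginal distributions $P_{\mathcal{X}}(x)$ and $P_{\mathcal{Y}}(y)$. In order to infer the joint distribution from the marginal samples, a shared-latent space assumption is made: there exists a shared latent code $z$ in a shared-latent space $\mathcal{Z}$, so that we can recover both images from this code. Given samples $(x, y)$ from the joint distribution, this process is presented by
$$z = E_{\mathcal{X}}(x) = E_{\mathcal{Y}}(y), \qquad x = G_{\mathcal{X}}(z), \quad y = G_{\mathcal{Y}}(z), \tag{1}$$
where $E_{\mathcal{X}}$, $E_{\mathcal{Y}}$ denote the domain encoders and $G_{\mathcal{X}}$, $G_{\mathcal{Y}}$ the domain generators.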
A key step is how to implement this shared-latent space assumption. An effective strategy is to share the high-level representation through a shared-weight encoder, which samples the features from a unified distribution. However, this is unsuitable for IR because the latent representation then contains only semantic meanings, which leads to domain shift in recovered images, e.g., blurred details and inconsistent backgrounds. Therefore, we attempt to learn more generalized representations containing richer texture and semantic features from the inputs, i.e., invariant representations. To achieve this, adversarial domain adaption based discrete representation learning and self-supervised constraint modules are introduced into our method. Details are described in the following subsections.
3.1 Discrete Representation Learning
Discrete representation aims to compute the latent code $z$ from inputs, where $z$ contains as much texture and semantic information as possible. To do so, we use two auto-encoders, i.e., the encoder-generator pairs $\{E_{\mathcal{X}}, G_{\mathcal{X}}\}$ and $\{E_{\mathcal{Y}}, G_{\mathcal{Y}}\}$, to model the two domains separately. Given any unpaired samples $(x, y)$, where $x \in \mathcal{X}$ and $y \in \mathcal{Y}$ denote a noisy and a clean sample from the two domains, Eq. 1 is reformulated as $z_{\mathcal{X}} = E_{\mathcal{X}}(x)$ and $z_{\mathcal{Y}} = E_{\mathcal{Y}}(y)$. Further, IR can be represented as $F^{\mathcal{X}\to\mathcal{Y}}(x) = G_{\mathcal{Y}}(z_{\mathcal{X}})$. However, since noise always adheres to high-frequency signals, directly reconstructing clean images is difficult due to varying noise levels and types, and would require a powerful domain generator and discriminator. Therefore, we introduce disentangled representations into our architecture.
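Disentangling Representation. For a noisy sample $x$, an extra noise encoder $E^{N}_{\mathcal{X}}$ is used to model varying noise levels and types. The self-reconstruction is formulated by $x = G_{\mathcal{X}}(z_{\mathcal{X}} \oplus z^{N}_{\mathcal{X}})$, where $z_{\mathcal{X}} = E_{\mathcal{X}}(x)$ and $z^{N}_{\mathcal{X}} = E^{N}_{\mathcal{X}}(x)$. Assuming that the latent codes $z_{\mathcal{X}}$ and $z_{\mathcal{Y}}$ obey the same distribution in the shared space, i.e., $\{z_{\mathcal{X}}, z_{\mathcal{Y}}\} \in \mathcal{Z}$, unsupervised image restoration can, similar to image translation, be divided into two stages: forward translation and backward reconstruction.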
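Forward Cross Translation. We first extract the representations $\{z_{\mathcal{X}}, z_{\mathcal{Y}}\}$ from $(x, y)$ and the extra noise code $z^{N}_{\mathcal{X}}$. Restoration and degradation can be represented by
$$\tilde{x}^{\mathcal{X}\to\mathcal{Y}} = G_{\mathcal{Y}}(z_{\mathcal{X}}), \qquad \tilde{y}^{\mathcal{Y}\to\mathcal{X}} = G_{\mathcal{X}}(z_{\mathcal{Y}} \oplus z^{N}_{\mathcal{X}}),$$
where $\tilde{x}^{\mathcal{X}\to\mathcal{Y}}$ represents the recovered clean sample and $\tilde{y}^{\mathcal{Y}\to\mathcal{X}}$ denotes the degraded noisy sample. $\oplus$ represents the channel-wise concatenation operation. $G_{\mathcal{X}}$ and $G_{\mathcal{Y}}$ are viewed as specific domain generators.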
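Backward Cross Reconstruction. After performing the first translation, reconstruction is achieved by swapping the inputs $\tilde{x}^{\mathcal{X}\to\mathcal{Y}}$ and $\tilde{y}^{\mathcal{Y}\to\mathcal{X}}$:
$$\hat{x} = G_{\mathcal{X}}\big(E_{\mathcal{Y}}(\tilde{x}^{\mathcal{X}\to\mathcal{Y}}) \oplus E^{N}_{\mathcal{X}}(\tilde{y}^{\mathcal{Y}\to\mathcal{X}})\big), \qquad \hat{y} = G_{\mathcal{Y}}\big(E_{\mathcal{X}}(\tilde{y}^{\mathcal{Y}\to\mathcal{X}})\big),$$
where $\hat{x}$ and $\hat{y}$ denote the reconstructed inputs. To enforce this constraint, we add the cross-cycle consistency loss for the $\mathcal{X}$ and $\mathcal{Y}$ domains: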
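As a sketch under stated assumptions, the cross-cycle consistency loss can be written with an $\ell_1$ reconstruction penalty. The choice of norm is an assumption borrowed from common cross-cycle formulations and is not fixed by the text; $\hat{x}$ and $\hat{y}$ denote the reconstructions of the noisy input $x$ and the clean input $y$ after the forward and backward passes:

```latex
\mathcal{L}^{CC}_{\mathcal{X}} = \mathbb{E}_{x}\big[\lVert \hat{x} - x \rVert_{1}\big],
\qquad
\mathcal{L}^{CC}_{\mathcal{Y}} = \mathbb{E}_{y}\big[\lVert \hat{y} - y \rVert_{1}\big]
```

Under this form, each term is zero exactly when the two-stage translation returns the original sample, which is the constraint the backward pass is meant to enforce.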
Adversarial Domain Adaption. Another factor is how to embed the latent representations $z_{\mathcal{X}}$ and $z_{\mathcal{Y}}$ into the shared space. Inspired by unsupervised domain adaption, we implement this by adversarial learning instead of a shared-weight encoder. Our goal is to make the representations of the inputs obey a similar distribution while preserving the richer texture and semantic information of the inputs. Therefore, a representation discriminator $D_{\mathcal{R}}$ is utilized in our architecture. We express this feature adversarial loss as
3.2 Self-Supervised Constraint
Due to the lack of effective supervised signals for translated images, relying only on feature-domain discriminant constraints would inevitably lead to domain shift problems in the generated images. To speed up convergence while learning more robust representations, self-supervised modules, including the Background Consistency Module (BCM) and the Semantic Consistency Module (SCM), are introduced to provide more reasonable and reliable supervision.
BCM aims to preserve the background consistency between the translated images and the inputs. Similar strategies have been applied to self-supervised image reconstruction tasks [14, 28]. These methods use a gradient error to constrain the reconstructed images after smoothing the input and output images with blur operators, e.g., a Gaussian blur kernel or guided filtering [13]. Different from them, a loss is directly used for the recovered images instead of the gradient error loss in our module, as shown in Fig. 3, which is simple but effective in retaining background consistency while recovering finer texture in our experiments. Specifically, a multi-scale Gaussian-blur operator is used to obtain multi-scale features. Therefore, a background consistency loss can be formulated as:
where denotes the Gaussian-Blur operator with blur kernel , is the hyper-parameter to balance the errors at different Gaussian-Blur levels. and denote original input and the translated output, i.e., and . Based on experimental attempts at image denoising, we set as for respectively.
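The idea behind the background consistency loss can be illustrated with a minimal NumPy sketch. The blur scales `sigmas`, the weights `weights` and the mean-absolute-difference distance are hypothetical placeholders chosen for illustration; they are not the settings used in our experiments.

```python
import numpy as np

def gaussian_kernel1d(sigma):
    """Normalized 1-D Gaussian kernel truncated at about 3 * sigma."""
    radius = int(3 * sigma + 0.5)
    t = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-0.5 * (t / sigma) ** 2)
    return k / k.sum()

def gaussian_blur(img, sigma):
    """Separable Gaussian blur of a 2-D image with reflect padding."""
    k = gaussian_kernel1d(sigma)
    r = len(k) // 2
    padded = np.pad(np.asarray(img, dtype=np.float64), r, mode="reflect")
    # Filter along rows, then along columns; "valid" removes the padding.
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, rows)

def background_consistency_loss(inp, out,
                                sigmas=(1.0, 2.0, 3.0),      # hypothetical blur scales
                                weights=(0.25, 0.5, 1.0)):   # hypothetical weights
    """Weighted sum over scales of the mean absolute difference of blurred images."""
    total = 0.0
    for s, w in zip(sigmas, weights):
        total += w * np.mean(np.abs(gaussian_blur(inp, s) - gaussian_blur(out, s)))
    return total
```

Because blurring averages out zero-mean noise but preserves a global brightness offset, such a loss penalizes a brightness shift between input and output far more than the removal of fine-grained noise, which matches the intended role of the module.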
In addition, inspired by the perceptual loss [15], we observe that features from the deeper layers of a pre-trained model contain only semantic meanings, with little or no noise. Therefore, different from the general feature loss, which aims to recover finer image texture details via similarities among shallow features, we extract only deeper features as semantic representations from the corrupted and recovered images and keep them consistent, referred to as the semantic consistency loss. It can be formulated as
where $\phi_{l}(\cdot)$ denotes the features from layer $l$ of the pre-trained model. In our experiments, we use the conv5-1 layer of the VGG-19 [31] network pre-trained on ImageNet.
3.3 Jointly Optimizing
In addition to the proposed cross-cycle consistency loss, representation adversarial loss and self-supervised losses, we also use other loss functions in our joint optimization.
Target Domain Adversarial Loss. We impose a domain adversarial loss, where the discriminators $D_{\mathcal{X}}$ and $D_{\mathcal{Y}}$ attempt to discriminate the realness of generated images from each domain. For the noise domain, we define the adversarial loss as
Similarly, we define the adversarial loss for the clean image domain as
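Self-Reconstruction Loss. In addition to the cross-cycle reconstruction, we also apply a self-reconstruction loss to facilitate training. This process is represented as $x \approx G_{\mathcal{X}}(E_{\mathcal{X}}(x) \oplus E^{N}_{\mathcal{X}}(x))$ and $y \approx G_{\mathcal{Y}}(E_{\mathcal{Y}}(y))$.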
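KL Loss. In order to model the noise encoder branch, we add a KL divergence loss $D_{KL}\big(p(z^{N}_{\mathcal{X}}) \,\|\, N(0, 1)\big)$ to regularize the distribution of the noise code $z^{N}_{\mathcal{X}}$ to be close to the standard normal distribution, where $D_{KL}(p \,\|\, q) = \int p(z) \log \frac{p(z)}{q(z)}\, dz$.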
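For a diagonal Gaussian code, this regularizer has a well-known closed form. The sketch below assumes, hypothetically, that the noise encoder outputs a mean `mu` and a log-variance `logvar` per code dimension, as in a variational auto-encoder; this parameterization is an assumption for illustration and is not specified in the text.

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over the last axis."""
    mu = np.asarray(mu, dtype=np.float64)
    logvar = np.asarray(logvar, dtype=np.float64)
    # Closed form: 0.5 * sum( sigma^2 + mu^2 - 1 - log sigma^2 )
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar, axis=-1)
```

The value is zero only when the code distribution already equals the standard normal, and it grows as the mean drifts from zero or the variance departs from one.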
The full objective function of our method is summarized as follows:
where the hyper-parameters control the importance of each term.
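Restoration: After learning, we only retain the cross encoder-generator network $\{E_{\mathcal{X}}, G_{\mathcal{Y}}\}$: $E_{\mathcal{X}}$ extracts the domain-invariant representation $z_{\mathcal{X}}$ from the corrupted sample $x$, and $G_{\mathcal{Y}}$ recovers the clean image from $z_{\mathcal{X}}$, i.e., $\tilde{x}^{\mathcal{X}\to\mathcal{Y}} = G_{\mathcal{Y}}(E_{\mathcal{X}}(x))$.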
4 Experiments
In this section, we first give the implementation details of our method for classical image denoising. Traditional metrics, such as Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity (SSIM), are used for evaluation. Detailed results on synthetic and real noise removal tasks are compared with other state-of-the-art methods. For synthetic noise removal, we start with general noise distributions, including additive white Gaussian noise (AWGN) and Poisson noise. Two well-known datasets, BSD68 [27] and Kodak, are used to verify the performance of our method in denoising and texture restoration. Furthermore, real noisy images from the medical Low-Dose Computed Tomography (LDCT) dataset are used to evaluate the generalization capacity of the method. An additional ablation study verifies the effectiveness of the proposed framework.
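For reference, PSNR follows directly from the mean squared error between a reference image and an estimate. The sketch below assumes 8-bit images, so the default `data_range` of 255 is an assumption that should be changed for other intensity scales.

```python
import numpy as np

def psnr(reference, estimate, data_range=255.0):
    """Peak signal-to-noise ratio in dB; infinite when the images are identical."""
    ref = np.asarray(reference, dtype=np.float64)
    est = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((ref - est) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(data_range ** 2 / mse)
```

A uniform error of one tenth of the data range gives an MSE of `(data_range / 10) ** 2` and therefore a PSNR of 20 dB, which is a convenient sanity check.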
| Method | PSNR | SSIM |
|---|---|---|
| Ours | 32.37 ± 1.55 | 0.957 ± 0.01 |
4.1 Implementation Details
We follow a network architecture similar to the one used in [21], the difference being that we introduce an extra noise encoder branch and remove the shared-weight encoder. The representation discriminator is a fully convolutional network, which stacks four convolutional layers with stride two and a global average pooling layer. The proposed framework is implemented in PyTorch [30], and an Nvidia TITAN-V GPU is used in the experiments. During training, we use Adam [16] for optimization with the momentum set to 0.9. The learning rate is initially set to 0.0001 and decays exponentially over 10K iterations. In all experiments, we randomly crop 64×64 patches with a batch size of 16 for training. Hyper-parameters are set to , , and .
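The patch sampling step can be sketched as follows. The function name `random_crop_batch` and the choice of drawing each patch from a uniformly chosen image are illustrative assumptions; only the patch size of 64 and the batch size of 16 come from the text.

```python
import numpy as np

def random_crop_batch(images, patch=64, batch_size=16, rng=None):
    """Stack `batch_size` random patch x patch crops drawn from a list of H x W (x C) arrays."""
    rng = np.random.default_rng() if rng is None else rng
    patches = []
    for _ in range(batch_size):
        img = images[rng.integers(len(images))]   # pick a source image at random
        h, w = img.shape[:2]
        top = rng.integers(0, h - patch + 1)      # top-left corner keeps the crop inside the image
        left = rng.integers(0, w - patch + 1)
        patches.append(img[top:top + patch, left:left + patch])
    return np.stack(patches)
```

Every source image must be at least 64 pixels on each side; smaller images would need padding or resizing first, which this sketch does not handle.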
4.2 Synthetic Noise Removal
We train the model with the images from the Pascal2007 [6] training set. Samples are randomly divided into two non-overlapping parts. We add different noise levels to each sample in the first part, which is viewed as the corrupted set, while the other part serves as the clean set. The proposed method needs to estimate the magnitude of noise while removing it (“blind” image denoising). Several supervised and unsupervised methods are selected for comparison.
We add AWGN with zero mean and a standard deviation randomly drawn from the range 5 to 50 to each training example, and test on BSD68 with. The representative unsupervised methods, including DIP [32], Noise2Noise (N2N) [19], CycleGAN [40], UNIT [21] and DRNet [23], and supervised methods (e.g., RedNet-30 [26] and DnCNN [39]), are selected to compare performance on image denoising. The traditional BM3D is also included for evaluation. For CycleGAN, UNIT and DRNet, we retrain them with the same training data.
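The corruption step for blind Gaussian denoising can be sketched as below. It assumes images on the 8-bit intensity scale, a noise level drawn uniformly from the stated range, and no clipping of the noisy result; the uniform sampling and the absence of clipping are assumptions not spelled out in the text.

```python
import numpy as np

def add_awgn(image, sigma_range=(5.0, 50.0), rng=None):
    """Add zero-mean Gaussian noise with a randomly drawn standard deviation.

    Returns the noisy image (float64) and the sigma that was used.
    """
    rng = np.random.default_rng() if rng is None else rng
    sigma = rng.uniform(*sigma_range)   # one noise level per training example
    noisy = image.astype(np.float64) + rng.normal(0.0, sigma, size=image.shape)
    return noisy, sigma
```

Returning `sigma` alongside the image is only for inspection; a blind denoiser is not given the noise level at training or test time.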
The visual results on the BSD68 dataset are given in Fig. 4. Although all the methods are able to reduce noise, the domain-transfer-based unsupervised methods, including CycleGAN, UNIT and DRNet, have obvious domain shift problems, e.g., inconsistent brightness and undesired artifacts, resulting in worse visual perception. N2N and DIP achieve higher PSNR and SSIM. However, DIP loses fine local details and over-smooths the generated images. Relying on the zero-mean distribution prior, N2N achieves results similar to those of supervised methods such as RedNet-30 and DnCNN. Our approach presents comparable performance on noise removal and texture preservation. Although its PSNR is slightly lower than that of the supervised methods, our method achieves better visual consistency with natural images. Quantitative results for BSD68 are given in Table 1. The proposed method shows a stronger ability for blind image denoising.
Poisson Noise Removal. For corrupted samples, we generate the noise with the Scikit-image library [33], which produces independent Poisson noise based on the number of unique values in the given samples, and test on the Kodak dataset (http://r0k.us/graphics/kodak/). Some representative methods, including DIP, N2N, ANSC [25] and RedNet-30, are selected for our evaluation.
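The following NumPy sketch mirrors, to our understanding, how `skimage.util.random_noise(image, mode='poisson')` scales the image before sampling. It assumes a floating-point image in [0, 1], and the function name `add_poisson_noise` is illustrative.

```python
import numpy as np

def add_poisson_noise(image, rng=None):
    """Poisson noise whose strength is set by the number of unique intensity values."""
    rng = np.random.default_rng() if rng is None else rng
    image = np.asarray(image, dtype=np.float64)
    vals = len(np.unique(image))
    vals = 2 ** np.ceil(np.log2(vals))        # round up to the next power of two
    noisy = rng.poisson(image * vals) / float(vals)
    return np.clip(noisy, 0.0, 1.0)           # keep the result in the valid range
```

Since the scaling factor depends on the image content, the effective noise level differs from image to image, which is one reason why models trained with a differently generated Poisson noise may not transfer well.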
Comprehensive results are shown in Fig. 5 and Table 2. DIP tends to generate more blurred results. The traditional ANSC method first transforms the Poisson noise into Gaussian noise (Anscombe transform), then applies BM3D to remove the noise, and finally inverts the transform, achieving higher PSNR and SSIM. Because their Poisson noise was generated in a different way, the published RedNet-30 and N2N models do not achieve the best results. Our method achieves the highest PSNR and SSIM. In addition, the visual results show that, for slightly noisy signals, the proposed framework generalizes better, removing noise while restoring finer details.
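The variance-stabilizing step used by ANSC can be sketched as follows. Note that the simple algebraic inverse below is a stand-in chosen for brevity: it is not the optimal inversion used in [25], and the Gaussian denoiser that runs between the two transforms is omitted.

```python
import numpy as np

def anscombe(x):
    """Anscombe transform: maps Poisson counts to approximately unit-variance data."""
    return 2.0 * np.sqrt(np.asarray(x, dtype=np.float64) + 3.0 / 8.0)

def inverse_anscombe_algebraic(y):
    """Plain algebraic inverse of the transform (biased at low counts)."""
    return (np.asarray(y, dtype=np.float64) / 2.0) ** 2 - 3.0 / 8.0
```

After the forward transform the noise variance is close to one regardless of the underlying intensity, so a denoiser designed for AWGN, such as BM3D, can be applied before mapping the result back.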
4.3 Real Noise Removal
X-ray computed tomography (CT) is widely used as an important imaging modality in modern clinical diagnosis. Lowering the radiation dose reduces the potential radiation risk to the patient, but it increases the noise and artifacts in reconstructed images, which can compromise diagnostic information. Typically, noise in X-ray photon measurements can be simply modeled as a combination of Poisson quantum noise and Gaussian electronic noise. However, the noise in reconstructed images is more complicated and does not obey any single statistical distribution across the whole image. Therefore, classical image post-processing methods based on a noise statistics prior, e.g., N2N, are unsuitable for low-dose CT (LDCT) denoising.
A real clinical dataset authorized by Mayo Clinic for the 2016 NIH-AAPM-Mayo Clinic Low Dose CT Grand Challenge (http://www.aapm.org/GrandChallenge/LowDoseCT/) is used to evaluate LDCT image reconstruction algorithms. It contains 5936 images at 512×512 resolution from 10 different subjects. We randomly select 4000 images as the training set and use the remaining images as the test set. DIP, BM3D and RedCNN [4], which is an extended version of RedNet, are selected for evaluation in our experiments. Representative results are shown in Fig. 6. BM3D introduces waxy artifacts into the reconstructed image, DIP fails to generate fine local structures, and RedCNN tends to generate smoother images. Our approach achieves a better balance between visual quality and noise removal. Table 3 gives the quantitative results.
4.4 Ablation Study
In this section, we perform an ablation study to analyze the effects of the discrete disentangled representation and the self-supervised modules in the proposed framework. Both quantitative and qualitative results on Gaussian noise removal are shown for the following three variants of the proposed method, in which each component is studied separately: a) remove the noise encoder branch; b) remove the representation adversarial network $D_{\mathcal{R}}$ and learn the representations $z_{\mathcal{X}}$ and $z_{\mathcal{Y}}$ directly from the target domain constraints only; c) remove the background consistency constraint from the self-supervised modules and retain only the semantic consistency constraint.
Representative results are shown in Fig. 7. Compared with the full model, referred to as (d), variant (a), which learns invariant representations directly from noisy images, produces over-smooth results because of the unexpected noise contained in the features, which would require a more powerful domain generator. Although (b) gives better PSNR and SSIM after removing the feature adversarial module, some undesired artifacts adhere to high-frequency signals. Because it fails to provide an effective self-supervised constraint for the recovered images, model (c), although it retains the semantic consistency module, also produces domain shift problems in the generated images, e.g., inconsistent brightness and blurred details, resulting in worse visual perception. Quantitative results are shown in Table 4.
In addition, since DRNet [23] has an architecture similar to ours, extending DRIT [18] with an extra feature loss to solve image deblurring, we select it as a representative domain transfer method to compare convergence on the denoising task. Fig. 8 gives the convergence plots for AWGN removal, where we trained the two models from scratch on the same training set. Although DRNet also uses the idea of disentangled representation to solve image restoration, it differs from ours in essence: varying noise levels and types lead to unstable learning during its training due to the lack of a clear domain boundary. Aiming to learn invariant representations, our method gives faster and more stable convergence.
5 Conclusion
In this paper, we propose an unsupervised learning method for image restoration. Specifically, we aim to learn invariant representations from noisy data via disentangled representations and adversarial domain adaption. Aided by effective self-supervised constraints, our method can reconstruct higher-quality images with finer details and better visual perception. Experiments on synthetic and real image denoising show that our method achieves performance comparable to other state-of-the-art methods, with faster and more stable convergence than other domain adaption methods.
Acknowledgments
This work is supported by the National Natural Science Foundation of China under grant 61871277, and in part by the Science and Technology Project of Sichuan Province of China under grant 2019YFH0193.
References
- [1] (2019) Noise2Self: blind denoising by self-supervision. In ICML.
- [2] (2005) A non-local algorithm for image denoising. 2, pp. 60–65 vol. 2.
- [3] (2009) Clustering-based denoising with locally learned dictionaries. IEEE Transactions on Image Processing 18, pp. 1438–1451.
- [4] Low-dose CT with a residual encoder-decoder convolutional neural network. IEEE Transactions on Medical Imaging 36, pp. 2524–2535.
- [5] (2007) Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Transactions on Image Processing 16, pp. 2080–2095.
- [6] (2009) The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision 88, pp. 303–338.
- [7] Unsupervised domain adaptation by backpropagation. ArXiv abs/1409.7495.
- [8] (2015) Domain-adversarial training of neural networks. J. Mach. Learn. Res. 17, pp. 59:1–59:35.
- [9] (2016) Image style transfer using convolutional neural networks. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2414–2423.
- [10] (2014) Generative adversarial nets. In NIPS.
- [11] (2014) Weighted nuclear norm minimization with application to image denoising. 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2862–2869.
- [12] (2015) Convolutional sparse coding for image super-resolution. 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1823–1831.
- [13] (2013) Guided image filtering. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, pp. 1397–1409.
- [14] (2018) Unsupervised single image deraining with self-supervised constraints. 2019 IEEE International Conference on Image Processing (ICIP), pp. 2761–2765.
- [15] (2016) Perceptual losses for real-time style transfer and super-resolution. In ECCV.
- [16] (2014) Adam: a method for stochastic optimization. CoRR abs/1412.6980.
- [17] (2018) Noise2Void - learning denoising from single noisy images. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2124–2132.
- [18] Diverse image-to-image translation via disentangled representations. ArXiv abs/1808.00948.
- [19] (2018) Noise2Noise: learning image restoration without clean data. ArXiv abs/1803.04189.
- [20] (2019) STGAN: a unified selective transfer network for arbitrary image attribute editing. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3668–3677.
- [21] (2017) Unsupervised image-to-image translation networks. In NIPS.
- [22] (2016) Coupled generative adversarial networks. In NIPS.
- [23] (2019) Unsupervised domain-specific deblurring via disentangled representations. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10217–10226.
- [24] (2008) Sparse representation for color image restoration. IEEE Transactions on Image Processing 17, pp. 53–69.
- [25] (2011) Optimal inversion of the Anscombe transformation in low-count Poisson image denoising. IEEE Transactions on Image Processing 20, pp. 99–109.
- [26] (2016) Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections. In NIPS.
- [27] (2001) A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001 2, pp. 416–423 vol.2.
- [28] (2018) Unsupervised class-specific deblurring. In ECCV.
- [29] (2005) An iterative regularization method for total variation-based image restoration. Multiscale Modeling & Simulation 4, pp. 460–489.
- [30] (2017) Automatic differentiation in PyTorch.
- [31] (2014) Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556.
- [32] (2017) Deep image prior. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9446–9454.
- [33] (2014) Scikit-image: image processing in Python. In PeerJ.
- [34] (2004) Image denoising and decomposition with total variation minimization and oscillatory functions. Journal of Mathematical Imaging and Vision 20, pp. 7–18.
- [35] Extracting and composing robust features with denoising autoencoders. In ICML ’08.
- [36] (2012) Image denoising and inpainting with deep neural networks. In NIPS.
- [37] (2017) DualGAN: unsupervised dual learning for image-to-image translation. 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2868–2876.
- [38] (2018) Unsupervised image super-resolution using cycle-in-cycle generative adversarial networks. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 814–81409.
- [39] (2017) Beyond a Gaussian denoiser: residual learning of deep CNN for image denoising. IEEE Transactions on Image Processing 26, pp. 3142–3155.
- [40] (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2242–2251.
- [41] (2017) Toward multimodal image-to-image translation. ArXiv abs/1711.11586.
- [42] (2011) From learning models of natural image patches to whole image restoration. 2011 International Conference on Computer Vision, pp. 479–486.