Image restoration is a longstanding problem in image processing and computer vision. The need for restoration is deeply rooted in the acquisition, transmission, and application of images. We refer to the process from high-quality to low-quality images as degradation, and to the inverse process as restoration. A restoration method is closely tied to the corresponding degradation process. According to the specific degradation model, restoration can be further divided into deblurring, denoising, and super-resolution, to name a few.
Recent advances in deep learning have greatly improved the state-of-the-art performance in these subareas [Tao2018ScaleRecurrentNF, Zhang2019DeepSH, Zhang2017BeyondAG, Zhang2020MemoryEfficientHN]. However, most previous work reduces to an end-to-end trained neural network with a pixel-wise $\ell_2$ loss. The pixel-wise loss has the well-known drawback of producing fuzzy results in high-variance areas. In other words, the $\ell_2$ norm is not well aligned with perceptual quality. Considering that in most cases the ultimate target of image restoration is better quality as perceived by humans, the pixel-wise distance is not a perfect choice. Recent work has reconsidered this topic from a perceptual-oriented perspective. Most of it leverages advances in generative models such as Generative Adversarial Networks (GAN) [GAN], or perceptual similarity metrics [Dosovitskiy2016GeneratingIW], to obtain a closer approximation to subjective experience [Ledig2017PhotoRealisticSI, Yang2020FromFT]. These methods adopt a combination of the $\ell_2$ and perceptual losses when training models, thus leading to a tradeoff between PSNR and perceptual quality. Despite the improvement, simply putting the pieces together is awkward. The $\ell_2$ loss is still involved in the training process and hampers the restored images from looking realistic.
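As a concrete illustration of such a combined objective, the sketch below mixes a pixel-wise term with a feature-matching term. Note that `toy_features` is a hypothetical stand-in for a pretrained extractor such as VGG, and the weight `lam` is an arbitrary choice, not a value from the cited work.

```python
import numpy as np

def l2_loss(x, y):
    # Pixel-wise squared error, averaged over all pixels.
    return float(np.mean((x - y) ** 2))

def toy_features(x):
    # Stand-in for a pretrained feature extractor (e.g. VGG activations):
    # simple local gradients capture "structure" rather than raw pixels.
    return np.diff(x, axis=0), np.diff(x, axis=1)

def perceptual_loss(x, y):
    fx, fy = toy_features(x), toy_features(y)
    return float(sum(np.mean((a - b) ** 2) for a, b in zip(fx, fy)))

def combined_loss(x, y, lam=0.1):
    # The tradeoff discussed above: the pixel term pulls toward the PSNR
    # optimum, the feature term toward perceptual similarity.
    return l2_loss(x, y) + lam * perceptual_loss(x, y)

rng = np.random.default_rng(0)
x = rng.random((8, 8))
y = x + 0.01 * rng.standard_normal((8, 8))
loss = combined_loss(x, y)
```

Changing `lam` moves the result along the tradeoff curve between distortion and perceptual quality.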
Some researchers have attempted to reformulate the problem to exclude the pixel-wise loss when generating high-quality images. They propose to take the output of generative models as the restored images, since a well-trained generative model ensures the output quality [PCGAN, stylegan, stylegan2]. In this case, the pixel-wise (or perceptual) loss is used only for searching for an embedding in the latent space of the corresponding generative model. Previous work has explored the possibility and means of embedding images into the latent space of GANs [Abdal2019Image2StyleGANHT, Bau2019SeeingWA, Gu2020ImagePU].
Although the transformation from the original image space to the latent space makes it easier to generate images within the natural image manifold, one key issue remains. Most generative models start from a specific prior distribution. Once trained, the visual quality of generated images is guaranteed only when the corresponding latent vector lies in a high-probability region. Again, the problem of how to restrict the feasible region within the given manifold remains. [2020PULSE] proposed a spherical optimization method to deal with the Gaussian prior and an approximation of the mapping network in StyleGAN [stylegan]. In [stylegan2], the signals are renormalized to match the prior distribution after each optimization step. However, these heuristic methods are hand-crafted and apply only to simple prior distributions.
For an arbitrary prior distribution in latent space, the only available information comes from samples. We propose to guide the optimization process with sampled statistics. We treat the embedding to be solved as a set of samples from a specific distribution, and aim to minimize the distance between this distribution and the prior distribution. We adopt the Maximum Mean Discrepancy (MMD) [Gretton2012AKT] for this task, and explicitly incorporate it into the optimization objective using the empirical formulation with a Gaussian kernel. The method is parameter-free and does not involve extra models to capture the prior distribution. We find this constraint very effective; it brings a substantial improvement in perceptual quality.
Another issue is the optimization objective used in the restoration process. For restoration, we should find images that degrade correctly to the given input. However, real-world degradation is complicated; simple additive-noise and downscaling models are far from sufficient. Previous work [2018To, Zhao2018UnsupervisedDL] attempts to learn a degradation model from real-world image pairs. But the learned mapping from high-quality to low-quality images alone is incomplete. We would still have to adopt the $\ell_2$ distance between the input degraded image and the output of the degradation model as the measurement of how "correct" the restored images are. However, the degradation model is trained independently. There is no guarantee by design that a degradation model followed by the $\ell_2$ distance is a good indicator of correctness.
Beyond the "degradation mapping + $\ell_2$ distance" fashion, we propose a different framework that accomplishes this in one step. We directly model the degradation process as a conditional distribution, namely $p(y \mid x)$. This model explicitly gives the probability of the given degraded image conditioned on the restored image. It naturally generalizes to situations where the degradation degree is unknown, i.e., the original image could degrade to a set of images. The "degradation mapping" used in [2018To, Zhao2018UnsupervisedDL] does not consider such situations and assumes a one-to-one mapping. The overall framework is shown in Figure 1.
Our contributions are as follows.
We propose a novel optimization strategy for restoring images with the GAN prior. To fit the prior distribution, previous work has used hand-crafted and simple methods. The proposed method is parameter-free and applies to arbitrary prior distributions in latent space. We show that our strategy greatly improves the perceptual quality of restored images.
We formulate a universal degradation model in the form of a conditional distribution. The model can explicitly assess the correctness of the restoration and guide the optimization process.
With the above two modules, we present a novel framework for perceptual image restoration.
2 Related Work
There is much work regarding image restoration and generative models. We briefly review the work most related to our method.
Perceptual Image Restoration. Image restoration aims to reverse the degradation process and recover the original high-quality images. However, this is in most cases an ill-posed problem [Rani2016ABR]. Taking image super-resolution as an example, many distinct images could downscale to the same low-resolution image. Therefore, a simultaneous enhancement is desired during restoration. In other words, previous work seeks restored images that are both plausible and high-quality [Blau20182018PC]. Most methods adopt perceptual metrics [Dosovitskiy2016GeneratingIW] (such as the VGG [VGG] feature matching loss) and generative models [GAN, pixelcnn, pixelcnn++, Dinh2017DensityEU] to accomplish this. Among them, GAN is the most widely used model due to its superior performance in generating high-quality images. Furthermore, recent work has proposed to take the output of a GAN as the desired result for common image processing problems [2020PULSE, Gu2020ImagePU]. Such a framework has achieved great improvement in subjective quality. The key insight is to exclude any reference-based metric from the restoration process, so as to best preserve the ability of generative models. In consequence, these methods count on the generative model and the degradation model to ensure correctness, which is an inherent defect of the design. Besides, they have to handle the generative model carefully to find a proper embedding.
Editing Images with a GAN Prior. The goal of image restoration and enhancement is to transform low-quality images into high-quality images. However, it is essential to know in advance what high-quality actually means. Early work treats high quality in terms of simple statistics such as color, contrast, and brightness. Recent advances in generative models learn such statistics from real images. With a well-trained GAN, [Abdal2019Image2StyleGANHT] finds that images can be embedded into the latent space. This provides a way of using GANs as a high-quality prior, and [Gu2020ImagePU] generalizes the GAN prior to common image processing problems. To embed images into the latent space, previous work either directly optimizes the latent codes or trains a reverse encoder [Zhu2016GenerativeVM]. The former method requires no additional training and has become the major choice [Abdal2019Image2StyleGANHT, Shen2020InterpretingTL, stylegan2, 2020PULSE, Gu2020ImagePU].
Image Degradation Model. Some subareas of image restoration have rather simple degradation models, such as additive noise and downscaling, while others do not, such as the compression artifacts of JPEG [JPEG], JPEG2000 [Adams2000JPEG2], and learning-based compression [Toderici2017FullRI, Han2020TowardVG]. Although many learning-based methods have advanced the performance in those subareas, they still assume a simple degradation pattern [Zhang2019DeepSH, Zhang2017BeyondAG]. The simple assumption cannot handle the many complicated practical degradations that matter most in real-world applications. Therefore, some researchers have proposed to learn the degradation model from real-world images [2018To, Zhao2018UnsupervisedDL]. They adopt GANs to learn the degradation mapping, and then train the corresponding restoration models with generated images.
3 Method
We start with the framework of the proposed perceptual image restoration. Let $x$ denote the original high-quality image and $y$ the degraded image produced through an arbitrary degradation $f$, i.e., $y = f(x)$.
The restoration from the degraded image seeks a pseudo-inverse model $f^{-1}$ that gives $\hat{x} = f^{-1}(y)$. The inverse problem could be ill-posed, and hence we also expect the restored image to lie in the natural image manifold $\mathcal{M}$ for the sake of realism, $\hat{x} \in \mathcal{M}$.
In many cases, the degradation model may not be deterministic. For example, a Gaussian blurring model may involve stochastic noise. Therefore, a more general degradation model is given by a conditional distribution, $y \sim p(y \mid x)$.
Here $p(y \mid x)$ denotes the probability distribution over all possible images degraded from $x$. The set of degraded images forms a manifold in which images share similar content with the original image. We assume this correspondence and consistency can be fully captured by the conditional distribution $p(y \mid x)$, which is uniquely determined by the degradation model. The degradation manifold for a specific image can consist of, for example, blurred images with Gaussian noise, or compressed images with unknown ratios.
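As a toy illustration of such a non-deterministic degradation (not the paper's actual model), the snippet below draws multiple degraded images from a blur-plus-noise process with a randomly chosen severity, so a single clean image maps to a whole set of degraded ones. The kernel width range and noise level are arbitrary choices.

```python
import numpy as np

def degrade(x, rng):
    # One draw from a stochastic degradation p(y|x): blur with a random
    # severity, then add noise.
    sigma = rng.uniform(0.5, 2.0)          # unknown degradation degree
    radius = 3
    t = np.arange(-radius, radius + 1)
    k = np.exp(-t ** 2 / (2 * sigma ** 2))
    k /= k.sum()
    # Separable Gaussian blur via 1-D convolutions along each axis.
    y = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, x)
    y = np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, y)
    return y + rng.normal(0.0, 0.02, x.shape)

rng = np.random.default_rng(0)
x = rng.random((16, 16))
# The same x degrades to different y's: a set, not a single image.
ys = [degrade(x, rng) for _ in range(3)]
```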
Similarly, we still expect the restored images to be natural. We therefore seek images with high conditional probability within the natural image manifold, $\hat{x} = \arg\max_{x \in \mathcal{M}} p(y \mid x)$.
So far, we have introduced the overall framework of our perceptual image restoration.
3.1 Explicit Constraint on Latent Space Manifold
When an accessible model constrained to the natural image manifold is needed, generative models are the usual choice. There has been much work exploring the latent space of generative models, and we also start from that work. Typically, a pretrained generator $G$ is supposed to output a high-quality natural image $G(z)$ from a latent vector $z \in \mathbb{R}^d$, where $d$ denotes the dimension of the latent embedding. For the restored image to fall into the natural image manifold $\mathcal{M}$, most work tends to find a latent vector $z^*$ and adopt $G(z^*)$ as the desired output. Recent advances show that, by leveraging a generative model, it is possible to find restored images with much better visual quality. However, as noted before, the transformation from the original image space to the latent space does not fully bypass the obstacle. It is general knowledge that the visual quality of generated images is guaranteed only when the corresponding latent embedding lies in a high-probability region.
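The latent-search idea can be sketched with a deliberately simple stand-in: a linear "generator" and plain gradient descent on the latent code. Real systems use StyleGAN and momentum-based optimizers; everything below is an illustrative assumption, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 32
G = rng.standard_normal((n, d))        # toy linear "generator" G(z) = G @ z

def embed(y, steps=2000, lr=0.05):
    # Gradient descent on z to minimize ||G z - y||^2, the latent-space
    # analogue of fitting a generated image to an observation.
    z = np.zeros(d)
    for _ in range(steps):
        grad = 2 * G.T @ (G @ z - y)   # analytic gradient of the loss
        z -= lr * grad / n             # scaled step for stability
    return z

z_true = rng.standard_normal(d)
y = G @ z_true                         # "observation" generated from z_true
z_hat = embed(y)
```

With a linear generator the problem is convex, so the search recovers an embedding whose reconstruction matches the observation closely; for a real GAN the same loop is non-convex and initialization matters.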
Again, the problem of how to restrict the feasible region within the given manifold remains. This is a non-trivial problem, even when the explicit formulation of the prior distribution is known. As noted in [2020PULSE, Bora2017CompressedSU], a direct likelihood loss always forces the latent embedding toward the single point with the highest probability, instead of a feasible region. [2020PULSE] proposed to restrict the search space to a sphere of radius $\sqrt{d}$ for the $d$-dimensional Gaussian, near which most of the mass lies. However, this is a rather rough and hand-crafted solution, applicable only to the Gaussian prior, and not to the $\mathcal{W}$ space of StyleGAN [stylegan, stylegan2]. To our knowledge, no previous work has given a proper constraint on the search process in $\mathcal{W}$ space.
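The spherical trick of [2020PULSE] amounts to renormalizing a Gaussian latent onto the radius-sqrt(d) sphere, where a high-dimensional standard normal concentrates. A minimal sketch:

```python
import numpy as np

def project_to_sphere(z):
    # PULSE-style constraint: keep a d-dimensional Gaussian latent on the
    # sphere of radius sqrt(d), near which most of the prior's mass lies.
    d = z.shape[-1]
    return z * np.sqrt(d) / np.linalg.norm(z, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
z = rng.standard_normal(512)           # a fresh sample is already near the sphere
z_proj = project_to_sphere(z)
```

Applying this projection after every gradient step keeps the search inside the high-probability shell of the Gaussian prior, but, as argued above, the trick has no analogue for an arbitrary prior.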
We start with the StyleGAN model, taking the $\mathcal{W}+$ space as the feasible region. As pointed out by [Shen2020InterpretingTL, Abdal2019Image2StyleGANHT], the $\mathcal{W}+$ space achieves much better performance than the $\mathcal{Z}$ space for embedding images. Specifically, the $\mathcal{W}+$ space consists of several latent vectors $\{w_i\}$ that define the styles of the output image. In the original StyleGAN, these latent vectors are identical, $w_1 = w_2 = \dots$. For generalization to unseen images, they are allowed to vary separately, but we still expect them to fall into the same manifold. We treat $\{w_i\}$ as independent samples from a distribution $Q$, and let $P$ denote the prior distribution. We expect $Q$ to lie as close to $P$ as possible.
This is achieved with the Maximum Mean Discrepancy (MMD) [Gretton2012AKT], a non-parametric distance metric between two distributions,
$\mathrm{MMD}^2(P, Q) = \mathbb{E}_{x, x' \sim P}[k(x, x')] - 2\,\mathbb{E}_{x \sim P,\, y \sim Q}[k(x, y)] + \mathbb{E}_{y, y' \sim Q}[k(y, y')],$
where $k(\cdot, \cdot)$ is the kernel function. Such a non-parametric metric avoids extra training effort. To compute it, we take several samples from the Gaussian prior distribution and forward them through the mapping network to obtain samples $\{w_j^P\}_{j=1}^{n}$ in the $\mathcal{W}$ space. With the Gaussian kernel $k(x, y) = \exp(-\|x - y\|^2 / 2\sigma^2)$, the distance turns into the empirical estimate
$\widehat{\mathrm{MMD}}^2(P, Q) = \frac{1}{n^2}\sum_{j, j'} k(w_j^P, w_{j'}^P) - \frac{2}{nL}\sum_{j}\sum_{i} k(w_j^P, w_i) + \frac{1}{L^2}\sum_{i, i'} k(w_i, w_{i'}),$
where $\sigma$ denotes the bandwidth of the Gaussian kernel. We empirically select the bandwidth according to the sampling statistics in the $\mathcal{W}$ space.
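The empirical estimate fits in a few lines of numpy. The median-heuristic bandwidth shown here is a common default derived from sampling statistics and stands in for the empirically chosen value used in the experiments; the dimensions and sample counts are arbitrary.

```python
import numpy as np

def gaussian_kernel(a, b, sigma):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2)) for all pairs from a and b.
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2(x, y, sigma):
    # Biased empirical estimate of squared MMD between sample sets x and y.
    return (gaussian_kernel(x, x, sigma).mean()
            - 2 * gaussian_kernel(x, y, sigma).mean()
            + gaussian_kernel(y, y, sigma).mean())

rng = np.random.default_rng(0)
prior = rng.standard_normal((200, 16))       # samples from the prior
near = rng.standard_normal((50, 16))         # same distribution
far = rng.standard_normal((50, 16)) + 3.0    # shifted distribution
# Median heuristic: set the bandwidth from pairwise sample statistics.
d2 = ((prior[:, None, :] - prior[None, :, :]) ** 2).sum(-1)
sigma = np.sqrt(np.median(d2[d2 > 0]) / 2)
```

Penalizing `mmd2(prior_samples, embeddings, sigma)` during latent optimization pulls the embeddings toward the prior without any extra trained model.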
3.2 Degradation Estimation
Another essential part of image restoration is the degradation model. Some degradation models are straightforward (like additive noise) while others are not. As noted in [2020PULSE, 2018To, Zhao2018UnsupervisedDL], real-world degradation models are complicated and lack explicit formulations. Previous work [2018To, Zhao2018UnsupervisedDL] attempts to learn a degradation model from real-world images using generative adversarial networks (GAN). However, a degradation model from high-quality to low-quality images is not enough. What we actually need is an indicator of how likely the restored image could degrade to the input image. In [2020PULSE], this is accomplished with the $\ell_2$ norm under a simple downscaling model. The $\ell_2$ norm is direct, but makes little sense with a learned degradation model, because a smaller distance between degraded images does not guarantee a smaller distance (objective or subjective) between the original images.
We model the degradation process $p(y \mid x)$ with an autoregressive model [pixelcnn, pixelcnn++] to accomplish this subtask. The illustration is shown in Figure 1. We adopt the PixelCNN++ framework [pixelcnn++], but with a rather small receptive field. We keep the gated unit and the discretized mixture of univariate Gaussians used in [pixelcnn++], but discard the downsampling and upsampling branches. An accurate estimation of a high-dimensional distribution requires numerous samples. As pointed out in [MultivariateDE], the required sample size grows at least exponentially with the dimension in order to attain an equivalent accuracy in terms of mean integrated squared error (MISE). For full-resolution images, the corresponding sample size would be an enormous number. As a result, we only model the local degradation, which performs better with limited samples. Otherwise, the model would suffer from severe overfitting, as previously noted in [pixelcnn, pixelcnn++]. We emphasize that the local design is not troublesome, since image restoration is usually a low-level problem and does not involve much global information. Many degradations, such as noise, blur, and (block-based) JPEG, are local in origin. We provide the condition (the high-quality image) to each "Conditional Block" in Figure 1, so that the final output has direct access to local pixel information as well as a larger context from previous layers.
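To make the likelihood concrete, here is a sketch of the discretized mixture log-probability for a single 8-bit pixel, following the 256-bin scheme of PixelCNN++ (which uses logistic components; univariate Gaussians are used here, as in the text). The mixture parameters below are arbitrary examples, not learned values.

```python
import math

def discretized_mix_gaussian_logprob(v, weights, means, scales):
    # Log-probability of an 8-bit pixel value v in {0, ..., 255} under a
    # mixture of univariate Gaussians, discretized into 256 bins.
    def cdf(x, m, s):
        # Standard normal CDF, shifted and scaled.
        return 0.5 * (1 + math.erf((x - m) / (s * math.sqrt(2))))
    x = (v / 127.5) - 1.0                  # map [0, 255] to [-1, 1]
    half = 1.0 / 255.0                     # half a bin width in [-1, 1]
    p = 0.0
    for w, m, s in zip(weights, means, scales):
        lo = cdf(x - half, m, s) if v > 0 else 0.0      # open lower edge
        hi = cdf(x + half, m, s) if v < 255 else 1.0    # open upper edge
        p += w * (hi - lo)
    return math.log(max(p, 1e-12))

# A two-component mixture centered near mid-gray and near white.
lp_mid = discretized_mix_gaussian_logprob(128, [0.7, 0.3], [0.0, 0.9], [0.1, 0.1])
lp_dark = discretized_mix_gaussian_logprob(5, [0.7, 0.3], [0.0, 0.9], [0.1, 0.1])
```

In the full model, the mixture parameters for each pixel are predicted by the conditional network, and the image NLL is the sum of these per-pixel terms.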
4 Experiments
4.1 Implementation Details
As noted in [stylegan2], the original StyleGAN model suffers from non-negligible artifacts. Therefore, we replace StyleGAN with the improved version in [stylegan2] as the backbone model. Following previous work, we test our method on the CelebA HQ dataset [celeba-hq] with 100 random samples. We keep the spherical gradient descent and the cross loss proposed in [2020PULSE]. Differently, we do not set a convergence threshold, and instead run 100 steps for all test images. Besides, we optimize the latent code in the $\mathcal{W}+$ space directly instead of the $\mathcal{Z}$ space as in [2020PULSE]. The optimization initializes from the mean embedding $\bar{w}$ of the $\mathcal{W}$ space, the same strategy used in [Abdal2019Image2StyleGANHT, stylegan2]. Previous studies indicate that such initialization works reliably for face images. As for the MMD loss, we take 1000 samples in the $\mathcal{W}$ space using the mapping network. We set the bandwidth to 512. The degradation model contains 6 conditional blocks after the downward and rightward streams, with the channel number set to 100.
4.2 Image Super-resolution
To demonstrate the effectiveness of the MMD loss, we first conduct experiments on image super-resolution. We adopt the same method used in [2020PULSE] for this task. The only difference is that we replace StyleGAN with the improved version for a stronger baseline. Examples are shown in Figure 2, where high-resolution images are restored from their low-resolution counterparts. The results of PULSE are optimized with spherical gradient descent [2020PULSE], while our method incorporates the additional MMD term. All other settings are the same. We can see that even with a better StyleGAN model, PULSE still produces unnatural images. Such failures make it unreliable in practical systems. The failure could originate either from divergence from the prior distribution or from inconsistency between the styles (which are identical in the original model). Whatever the reason, our method fixes it: with the MMD loss restricting the latent embedding, we are able to generate more natural, high-quality images.
4.3 Degradation Estimation
We train the degradation model on a split of the CelebA HQ dataset [celeba-hq] containing 20,000 randomly selected images; the rest serves as the test set. The corresponding degraded images are generated with the JPEG [JPEG] standard. Due to memory constraints, we resize both the conditional and the degraded images to a lower resolution.
Effectiveness. Figure 3 shows the loss curve in terms of the average negative log-likelihood (NLL) computed on the test set. Given the conditional image, the model achieves 0.5 bits/dim after about 20 epochs. To demonstrate the effectiveness of the degradation model in distinguishing matched images from unmatched ones, we compute the average NLL score on paired and unpaired images and report the results in Table 1. As shown, the model achieves 0.39 bits/dim on paired (high-quality, degraded) images and 2.26 bits/dim on unmatched pairs. An example is given in Figure 4. The degradation model can, to some extent, estimate the similarity between the degraded image and the conditional image, and give the explicit probability $p(y \mid x)$. To further test the model's ability to distinguish small differences, we run the baseline algorithm with and without the MMD loss to produce different high-quality images similar to the degraded image, and then compute the NLL for the corresponding pairs. The result is shown in Figure 4. We can easily tell that Figure 4(f) is more likely the original image than Figure 4(e), due to the same "smile" attribute as the reference image. The degradation model also prefers Figure 4(f), giving it a smaller NLL score. We attribute part of this ability to detect small differences to the local property of the degradation model. The local design enables a bottom-up fashion and models the overall probability as an aggregation of local similarities.
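The bits/dim figures are a standard rescaling of the NLL. Assuming the NLL is accumulated in nats over all pixel dimensions, the conversion is as follows; the 64x64 image and the NLL value are made-up numbers for illustration.

```python
import math

def bits_per_dim(nll_nats, num_dims):
    # Convert a total negative log-likelihood in nats into bits per
    # dimension: divide by the dimension count and by ln(2).
    return nll_nats / (num_dims * math.log(2))

# E.g. a hypothetical 64x64 single-channel image with total NLL 11000 nats:
bpd = bits_per_dim(11000, 64 * 64)
print(round(bpd, 2))                   # prints 3.87
```

An uninformative model assigning uniform probability to 8-bit values would score exactly 8 bits/dim, which is why values well below 1 indicate a tight conditional fit.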
Sampling from the Degradation Model. Explicit modeling of the degradation distribution allows sampling from the corresponding process. We show several samples drawn from the trained model in Figure 5. The pixel-by-pixel sampling takes hours for a single image and is only used for visualization. Compared to the compressed images, the sampled images do exhibit similar patterns, such as the degraded color in the faces. Moreover, the original structures and textures are well preserved. Both are desired properties. However, the samples lack spatial consistency and look noisy. On the one hand, the pixel-by-pixel sampling lacks an explicit constraint on continuity; on the other hand, this is a direct result of the local design. Another consequence of pixel-by-pixel sampling is the avalanche-like area shown in the last sample of Figure 5. Sampling a pixel with very low probability can lead to a chain reaction, spreading from a single point to half of the image. We find that sampling from the degradation model gives a very clear illustration of this mechanism, which is not available in previous work [pixelcnn, pixelcnn++]. The phenomenon is not troublesome here, since our model is not intended for generating images.
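The chain-reaction behavior is easy to reproduce with a toy one-dimensional autoregressive sampler; this is purely illustrative and unrelated to the trained model, with made-up transition probabilities.

```python
import random

def sample_row(n, rng):
    # Toy 1-D autoregressive sampler: each pixel's distribution depends on
    # its left neighbor. A single low-probability draw shifts the context
    # and can drag every subsequent pixel with it (the "avalanche" effect).
    row = [128]
    for _ in range(n - 1):
        prev = row[-1]
        if rng.random() < 0.02:
            nxt = rng.randint(0, 255)          # rare low-probability jump
        else:
            # Mostly stay near the previous value.
            nxt = max(0, min(255, prev + rng.randint(-3, 3)))
        row.append(nxt)
    return row

rng = random.Random(0)
row = sample_row(64, rng)
```

Once a rare jump occurs, every later pixel is conditioned on the shifted context, which is exactly the mechanism behind the avalanche-like region described above.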
The NLL Issue. During the experiments, we find that a lower NLL score does not necessarily imply a better degradation model. We test the models trained for 29 epochs and 379 epochs, with NLLs of 0.45 bits/dim and 0.18 bits/dim, respectively. We find the restored images look strange with the latter model, as shown in Figure 3. Moreover, adding the degradation model does not bring a lower NLL. We attribute this to two reasons. First, previous work has observed that the NLL is not consistent with visual quality [pixelcnn++]. Second, as noted before, the autoregressive model suffers from severe overfitting; an overtrained model may not generalize well, especially a conditional one.
[Table 2: ablation over StyleGAN version, mean initialization, MMD, spherical optimization, PixelCNN coefficient, epochs, and steps, reporting mean and median RankIQA scores for both evaluation strategies.]
4.4 Visual Quality
We conduct extensive experiments regarding visual quality to demonstrate the effectiveness of the proposed method. For a thorough comparison, we evaluate the performance from three aspects: qualitative results, a user study, and no-reference image quality assessment. The experiments presented below deal with the compression degradation. The baseline PULSE method is run with the $\ell_2$ loss, while our model incorporates the degradation model detailed above.
4.4.1 Qualitative Image Results
In Figure 6, we show examples of restored images from different methods. We display the results obtained from the PULSE method [2020PULSE] with both the original StyleGAN and the improved version. The original StyleGAN produces artifacts in some regions of the image; the improved StyleGAN successfully removes these artifacts, in accordance with expectation. In general, our method produces the images with the highest quality.
4.4.2 User Study
Following common practice, a quantitative assessment can be obtained from a user study. We therefore conduct a user study to verify the effectiveness of the proposed method. Specifically, we randomly select 100 images, compress them with JPEG, and restore them with the above methods. We ask the raters to select the result that is more natural and of better visual quality. The preference for our method is shown in Table 3. On average, 81% of the users find our results more natural and of higher quality than those of PULSE.
4.4.3 No-reference Image Quality Assessment
Besides the subjective score, we also compare the above methods using recent advances in no-reference image quality assessment (NR-IQA). We choose the state-of-the-art model RankIQA [RankIQA] for this task. Specifically, we adopt the model pretrained on the LIVE [LIVE] dataset. For a given image, we adopt two strategies to compute the quality score. First, we randomly crop 100 patches of size 224x224 and report the average and median scores as the quality index. Second, we downscale the images and re-run the first method. The second strategy deals with the mismatch with the size on which RankIQA is trained, and ensures the model sees the whole image. The score ranges from 0 to 100 on the LIVE dataset, and a smaller index indicates better visual quality. We conduct an extensive ablation study to further demonstrate the improvements of the different modules. The results are shown in Table 2. The proposed method achieves better results under both strategies.
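The first scoring strategy can be sketched as follows. Here `toy_scorer` is a hypothetical stand-in for RankIQA (it simply treats high local variance as "noisy"), and a small patch size keeps the toy fast; the actual experiments use 224x224 crops and the pretrained network.

```python
import numpy as np

def patch_scores(img, scorer, num_patches=100, size=32, rng=None):
    # Strategy 1 from the text: score random crops and aggregate with the
    # mean and median. `scorer` maps a patch to a quality index
    # (smaller = better, as in the LIVE convention).
    rng = rng or np.random.default_rng(0)
    h, w = img.shape
    scores = []
    for _ in range(num_patches):
        i = rng.integers(0, h - size + 1)
        j = rng.integers(0, w - size + 1)
        scores.append(scorer(img[i:i + size, j:j + size]))
    return float(np.mean(scores)), float(np.median(scores))

def toy_scorer(patch):
    # Hypothetical quality proxy, NOT RankIQA: noisy patches score worse.
    return float(np.var(patch)) * 100

rng = np.random.default_rng(0)
clean = np.full((128, 128), 0.5)
noisy = clean + 0.2 * rng.standard_normal((128, 128))
```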