Can we perform face hallucination using limited set of unaligned pairs?
Real-world image super-resolution is a challenging image translation problem. Low-resolution (LR) images are often generated by various unknown transformations rather than by applying simple bilinear down-sampling to high-resolution (HR) images. To address this issue, this paper proposes a novel Style-based Super-Resolution Variational Autoencoder network (SSRVAE) that contains a style Variational Autoencoder (styleVAE) and an SR network. To obtain realistic real-world low-quality images paired with HR images, we design a styleVAE that transfers the complex nuisance factors in real-world LR images to the generated LR images. We also use mutual information (MI) estimation to obtain better style information. For our SR network, we first propose a global attention residual block to learn long-range dependencies in images. We then propose a local attention residual block that steers the attention of the SR network toward local areas of images in which texture detail will be filled. Notably, styleVAE is presented in a plug-and-play manner and can thus help to improve the generalization and robustness of our SR method as well as other SR methods. Extensive experiments demonstrate that our SSRVAE surpasses the state-of-the-art methods, both quantitatively and qualitatively.
Single image super-resolution (SISR) aims to infer a natural high-resolution (HR) image from a degraded low-resolution (LR) input. SISR technology is widely used in a variety of computer vision tasks, e.g., security surveillance, medical image processing, and face recognition. Recently, many deep learning based super-resolution (SR) methods have been developed and have achieved promising results. These methods are mostly trained on paired LR and HR images, where the LR images are usually obtained by applying a predefined degradation to the HR images, e.g., bicubic interpolation. In contrast, we argue that training on such man-made image pairs hinders the generalization and robustness of an SR method, especially when it is confronted with real-world LR images.
In fact, there is a huge difference between LR images obtained by bicubic interpolation and real-world LR images. Various nuisance factors lead to image quality degradation, e.g., motion blur, lens aberration, and sensor noise. Moreover, these nuisance factors are usually unknown and mixed with each other, which makes the real-world SR task blind and thus extremely challenging. Manually generated LR images can only simulate limited patterns, and methods trained on them inherently lack the ability to deal with real SR problems. As presented in the second row of Figure 1, the performance of SRGAN drops dramatically when it is fed real-world LR images, and natural HR images can no longer be produced.
Despite this, relatively little attention has been paid so far to the problem of how to super-resolve real-world LR images. Some generative adversarial network (GAN) based methods [4, 41] have made initial explorations, but the results are preliminary and there is still much room for improvement. A straightforward idea is to capture a large number of images of the same scene at different resolutions simultaneously. However, this is not a reasonable solution considering the human and material costs. To solve this problem efficiently, this paper proposes a generative network based on variational autoencoders (VAEs) to synthesize realistic real-world LR images, and we focus on face hallucination.
The essential idea is derived from the separability of image style and image content, which has been widely explored in image style transfer [8, 18]. It means that one can change the style of an image while preserving its content. Based on this, we propose to consider the aforementioned nuisance factors as a special case of image style. We then design styleVAE to transfer the complex nuisance factors in real-world LR images to HR images. In this manner, realistic LR images as well as LR-HR pairs can be generated automatically. Furthermore, styleVAE is presented as a plug-and-play component and can also be applied to existing SR methods to improve their generalization and robustness.
In addition to styleVAE, we build an SR network for real-world super-resolution. Most CNN-based SR methods treat element-wise features equally, which limits the ability of a deep network to represent different types of information, e.g., low- and high-frequency information. Progress in biology offers the observation that human visual perception follows the principle of global priority. Following this principle, our proposed SR network consists mainly of two modules. On the one hand, we develop a global attention residual block (GARB) to capture long-range dependencies, helping the SR network focus on global topology information. On the other hand, we introduce a local attention residual block (LARB) for better feature learning, which is essential for inferring high-frequency information in images.
Our major contributions in this paper can be summarized as follows:
We propose to generate realistic real-world LR images by transferring the various degradation modes found in reality to HR images, and to obtain paired LR and HR images with a newly designed styleVAE. Notably, styleVAE is presented in a plug-and-play manner, which improves the generalization of existing SR methods.
An SR network is developed for real-world super-resolution. It has two main components. We propose GARB to adaptively rescale features by considering inter-dependencies among feature elements. At the same time, LARB is introduced to maximize the recovery of high-frequency information.
Extensive experiments on real-world LR images demonstrate that styleVAE effectively facilitates SR methods and the proposed SR network achieves state-of-the-art results in terms of visual and numerical quality.
Super-Resolution Neural Networks
In the last few years, machine learning has enjoyed another renaissance based on deep learning, and the super-resolution field has achieved remarkable results by adopting deep learning-based methods.
SRCNN was the first attempt to use deep convolutional neural networks (CNNs) for super-resolution. Subsequent work designed a deeper network using residual learning to gain more network capacity, and calculated the mean square error (MSE) on feature maps to improve the perceptual quality of the output. An up-sampling mode called the sub-pixel convolution layer was proposed to super-resolve given LR images in real time. GAN-based networks were then adopted to reconstruct photo-realistic images. To address the lack of high-frequency textures, a novel application of automated texture synthesis was proposed. The residual dense network (RDN) makes full use of the hierarchical features from the original LR images. A multi-scale residual network was proposed to fully utilize image features, and the feedback mechanism has also been exploited in the super-resolution field.
Face super-resolution Unlike general image super-resolution, face super-resolution focuses more on face-specific information, and the low-resolution face images in the training set are smaller. Some previous methods make use of specific static information of face images obtained by face analysis techniques; for example, dense correspondence field estimation has been used to help recover texture details. Other methods use prior knowledge of face images obtained by convolutional neural networks (CNNs) or GAN-based networks. For example, [34, 35, 36] adopt a GAN-based network to hallucinate face images.
A spatial transformer network has been used to address the problem of face scale and misalignment. [6, 3] utilize geometric priors of the face, such as parsing maps or face landmark heatmaps, to super-resolve low-resolution face images. Moreover, some wavelet-based methods have also been proposed. [15, 16] introduce a method combined with wavelet transformation to predict the corresponding wavelet coefficients.
Variational Autoencoder A Variational Autoencoder (VAE) consists of two networks: the inference network $q_\phi(z|x)$ and the generator network $p_\theta(x|z)$. The inference network $q_\phi(z|x)$ encodes the variable $x$ into a latent code $z$ whose distribution is supposed to approximate the prior $p(z)$. The generator $p_\theta(x|z)$ samples the variable $x$ from the given latent code $z$. A VAE maximizes the variational lower bound (also called the evidence lower bound, ELBO):
$$\log p_\theta(x) \geq \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - D_{KL}\left(q_\phi(z|x)\,\|\,p(z)\right),$$
where the first term denotes the process of reconstructing $x$ from the posterior $q_\phi(z|x)$, and the second term is the Kullback-Leibler divergence between the prior and the posterior distribution.
Figure 2 shows the overall architecture of the proposed SSRVAE, which consists of two stages. In the first stage, a styleVAE is proposed to generate realistic real-world LR images. In the second stage, the generated LR images paired with the corresponding HR images are fed into the SR network for super-resolution. After training, our proposed SSRVAE can produce pleasing reconstructed images for real-world SISR. In this section, we first introduce how to generate real-world LR images using styleVAE and then detail the SR network.
In this section, we describe in detail how to simulate real-world LR images with our proposed styleVAE. Using a VAE rather than a GAN gives a more stable training process. Moreover, with the help of style transfer, both the style of real-world LR images and the content of HR images are well preserved. To better encode style information, we also maximize the mutual information between the original LR images and the LR images generated by styleVAE. Some generated examples are shown in Figure 7.
Style Transfer There are various nuisance factors in real-world LR images, including motion blur, lens aberration, and sensor noise. We wish to transfer these nuisance factors from real-world LR images to the generated LR images. We choose Adaptive Instance Normalization (AdaIN) for this purpose because of its suitability, efficiency, and compact representation. Figure 2 shows the architecture of our proposed styleVAE: there are two inference networks (upper left and lower left corners) and one generator (bottom right corner). The two inference networks project the input real-world LR images and HR images into two latent spaces, representing the style information and the content information, respectively. The two latent codes are combined in a style transfer way (AdaIN) rather than being directly concatenated. Learned affine transformations then specialize the latent code obtained by the style inference network to styles $y$ that control the AdaIN [8, 18] operation after each residual block in the generator. AdaIN is calculated as follows:
$$\mathrm{AdaIN}(x_i, y) = y_{s,i}\,\frac{x_i - \mu(x_i)}{\sigma(x_i)} + y_{b,i},$$
where $x_i$ denotes the $i$-th feature map from the previous layer and $y$ is the style information from the style inference network. The scale $y_{s,i}$ and bias $y_{b,i}$ are obtained from $y$ through a fully connected layer, and $\mu(\cdot)$ and $\sigma(\cdot)$ are the mean and standard deviation of a feature map. Note that AdaIN is calculated independently for each feature map.
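The AdaIN operation described above can be illustrated with a minimal, framework-free sketch. It assumes the common formulation in which each feature map is normalized by its own mean and standard deviation and then rescaled and shifted by a style-derived scale and bias. The function names `adain` and `adain_all`, the flat-list representation of a feature map, and the `eps` stabilizer are assumptions made for illustration, not the authors' implementation.

```python
import math


def adain(feature_map, scale, bias, eps=1e-5):
    """Apply AdaIN to one feature map given as a flat list of activations.

    The map is normalized with its own statistics, then modulated by a
    style-derived scale and bias (hypothetical names for this sketch).
    """
    n = len(feature_map)
    mean = sum(feature_map) / n
    var = sum((v - mean) ** 2 for v in feature_map) / n
    std = math.sqrt(var + eps)  # eps avoids division by zero on flat maps
    return [scale * (v - mean) / std + bias for v in feature_map]


def adain_all(feature_maps, scales, biases):
    """Apply AdaIN to every feature map independently of the others."""
    return [adain(f, s, b) for f, s, b in zip(feature_maps, scales, biases)]
```

After the operation, each feature map has approximately the mean and standard deviation dictated by the style (the bias and the scale), regardless of the statistics of the content features.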
Following VAE, we use the Kullback-Leibler (KL) divergence to regularize the latent space obtained by the inference network. The distribution of the latent space is supposed to approximate a Gaussian distribution, so sampling from the standard Gaussian distribution makes it possible to produce diverse LR images. The branch has two output variables, the mean $\mu$ and the standard deviation $\sigma$. Using the reparameterization trick, we have $z = \mu + \sigma \odot \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$ and $\odot$ denotes the Hadamard product. Given a set of data samples, the posterior distribution produced by the inference network is constrained toward the prior, the standard multivariate Gaussian distribution, through the Kullback-Leibler divergence, which is computed over all dimensions of the latent code $z$.
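As a rough illustration of the regularization just described, the sketch below shows the reparameterization trick and the standard closed-form KL divergence between a diagonal Gaussian posterior and the standard Gaussian prior. The function names and the per-sample (unnormalized) form of the KL term are assumptions; the exact normalization used in the paper is not recoverable from the text.

```python
import math
import random


def reparameterize(mu, sigma, rng):
    """Draw z = mu + sigma * eps element-wise, with eps ~ N(0, 1)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ) for one sample.

    Sums 0.5 * (mu^2 + sigma^2 - log(sigma^2) - 1) over latent dimensions.
    """
    return 0.5 * sum(
        m * m + s * s - math.log(s * s) - 1.0 for m, s in zip(mu, sigma)
    )


# Usage: a 3-dimensional latent code (the paper uses 256 dimensions).
rng = random.Random(0)
z = reparameterize([0.1, -0.2, 0.0], [1.0, 0.5, 2.0], rng)
penalty = kl_to_standard_normal([0.1, -0.2, 0.0], [1.0, 0.5, 2.0])
```

The KL term is zero only when the posterior already matches the prior, which is what allows new latent codes to be sampled from the standard Gaussian at generation time.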
The generator in styleVAE is required to generate LR images from the latent space and the learned distribution. A reconstruction loss ensures that the generated LR images retain the content of the corresponding HR images. It is computed between the generated LR images and the corresponding HR images after resizing so that the two have the same size.
Mutual information maximization The purpose of the style inference network is to extract the style information, which is essential for the subsequent generator. To obtain a better style representation, the mutual information between the original real-world LR images and the generated LR images is maximized. Inspired by prior work on neural mutual information estimation, we estimate the mutual information between two high-dimensional random variables by gradient descent over neural networks. The estimator relies on a deep neural network with its own parameters, whose inputs are empirically drawn from the joint distribution and from the product of the marginals. More details can be found in our supplementary materials.
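One common way to estimate mutual information with a neural network is the Donsker-Varadhan lower bound used by MINE-style estimators. The sketch below assumes that formulation, which may differ from the exact estimator used in the paper. The scores are the outputs of a critic network on samples from the joint distribution and from the product of the marginals; here they are passed in as plain lists, and the function name `dv_lower_bound` is hypothetical.

```python
import math


def dv_lower_bound(joint_scores, marginal_scores):
    """Donsker-Varadhan bound: E_joint[T] - log E_marginal[exp(T)].

    joint_scores: critic outputs on paired (joint) samples.
    marginal_scores: critic outputs on shuffled (product-of-marginals) samples.
    """
    mean_joint = sum(joint_scores) / len(joint_scores)
    # log-mean-exp, shifted by the maximum for numerical stability
    shift = max(marginal_scores)
    mean_exp = sum(math.exp(s - shift) for s in marginal_scores) / len(
        marginal_scores
    )
    return mean_joint - (shift + math.log(mean_exp))
```

In training, the critic parameters would be updated to increase this bound, and the encoder would be updated so that the estimated mutual information grows as well.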
Combining all the losses mentioned above, the overall loss used to optimize the proposed styleVAE network is their weighted combination, in which three trade-off factors balance the individual terms.
As illustrated by TP-GAN, simultaneously perceiving global topology information and local texture information yields pleasing results. Different from that method, our proposed SR network first builds the global topology structure, or a sketch, using global attention residual blocks (GARBs). It then focuses on filling local details into the coarse SR images using local attention residual blocks (LARBs). For training, we choose the loss function of our SR network over its alternative because it yields better performance.
SR Network Architecture The architecture of our proposed SR network is illustrated in Figure 2. It accepts LR images produced by styleVAE as inputs. The SR network mainly consists of five parts: a shallow feature extractor implemented by one convolution layer, a global attention residual network (GARN), a local attention residual network (LARN), an upscale module, and a reconstruction layer. GARN and LARN are composed of 8 GARBs and LARBs, respectively.
Global Attention Residual Block We now describe our proposed global attention residual block (GARB) in detail. Most existing models depend heavily on convolution layers to learn the dependencies between different regions of the input images and thereby capture global topology structure information. However, this can lead to incorrect geometry or structural patterns. We therefore propose a global attention residual block that learns long-range dependencies by adapting the global attention module (GA) [37, 32], which remains computationally and statistically efficient. GARB contains a skip connection, motivated by the success of residual blocks (RBs) (see Figure 4 (a)). The global (non-local) attention module can be formulated as follows:
$$\beta_{j,i} = \frac{\exp(s_{ij})}{\sum_{i=1}^{N}\exp(s_{ij})}, \qquad s_{ij} = f(x_i)^{T} g(x_j),$$
where $f(x) = W_f x$ and $g(x) = W_g x$ project the feature maps $x$ of the former hidden layer into two latent spaces to obtain the attention value, and $\beta_{j,i}$ indicates the degree of attention that the $i$-th position receives when generating the $j$-th area. The output of the attention layer is defined as:
$$o_j = v\Big(\sum_{i=1}^{N}\beta_{j,i}\,h(x_i)\Big),$$
where $h(x_i) = W_h x_i$ and $v(x_i) = W_v x_i$. $W_f$, $W_g$, $W_h$, and $W_v$ are each implemented by a convolution layer with kernel size $1 \times 1$. We connect $x_i$ and $o_i$ in a residual way, so the final output is:
$$y_i = \gamma\, o_i + x_i,$$
where $\gamma$ is a learnable scalar.
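A small, dependency-free sketch may make the attention computation concrete. It assumes the usual self-attention formulation from the works cited for the global attention module: two projections produce attention scores, a softmax over positions produces the weights, a third projection is aggregated with those weights, and a learnable scalar blends the result back into the input. The names `w_f`, `w_g`, `w_h`, `w_v`, and `gamma`, and the representation of positions as short lists, are assumptions for illustration only.

```python
import math


def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector; stands in for a 1x1 conv."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def global_attention(x, w_f, w_g, w_h, w_v, gamma):
    """Self-attention over all positions followed by a residual connection.

    x: list of N position vectors. Returns y_j = gamma * o_j + x_j.
    """
    f = [matvec(w_f, xi) for xi in x]
    g = [matvec(w_g, xi) for xi in x]
    h = [matvec(w_h, xi) for xi in x]
    n = len(x)
    out = []
    for j in range(n):
        # attention that every position i receives when generating position j
        scores = [sum(a * b for a, b in zip(f[i], g[j])) for i in range(n)]
        beta = softmax(scores)
        agg = [sum(beta[i] * h[i][c] for i in range(n)) for c in range(len(h[0]))]
        o_j = matvec(w_v, agg)
        out.append([gamma * o + xv for o, xv in zip(o_j, x[j])])
    return out
```

With `gamma` set to 0 the block reduces to the identity, so a network can start from purely local features and gradually learn how much global context to mix in.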
Local Attention Residual Block SISR can be seen as the process of filling local details into coarse SR images, and we want the SR network to generate as many local details as possible. Convolution variants also perform aggregation within a local neighborhood, but their performance usually saturates once the neighborhood exceeds a certain size. We therefore propose a local attention residual block that captures local details by integrating a local attention module (LA). The local attention module forms local pixel pairs in a flexible bottom-up way, which can efficiently deal with visual patterns of increasing size and complexity. The structure of LARB is similar to that of GARB (see Figure 4 (b)). We use a general method of relational modeling to calculate the local attention. The local attention map obtains a representation at one pixel by computing the composability between that target pixel and each pixel in its visible position range. The transformation functions applied to the two pixels are implemented by convolution layers, and the composability function is chosen to be the squared difference.
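The idea of local attention with a squared-difference composability function can be sketched on a one-dimensional signal. This is a deliberately simplified illustration: it assumes scalar features, uses the negative squared difference as the score so that similar pixels receive higher weight, normalizes with a softmax inside the window, and omits any geometric prior. The function name, the `radius` parameter, and the default identity transforms `q` and `k` are hypothetical.

```python
import math


def local_attention_1d(x, radius=1, q=lambda v: v, k=lambda v: v):
    """Aggregate each pixel from its neighbours within `radius`.

    Weights come from a softmax over -(q(target) - k(neighbour))^2, so
    neighbours that resemble the target pixel dominate the aggregation.
    """
    out = []
    for t in range(len(x)):
        window = range(max(0, t - radius), min(len(x), t + radius + 1))
        scores = [-(q(x[t]) - k(x[p])) ** 2 for p in window]
        shift = max(scores)
        exps = [math.exp(s - shift) for s in scores]
        total = sum(exps)
        out.append(sum(e / total * x[p] for e, p in zip(exps, window)))
    return out
```

Because dissimilar neighbours receive almost no weight, sharp transitions such as edges are preserved rather than blurred, which is the behaviour wanted when filling in local texture.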
In this section, we first introduce the datasets and implementation in detail. We then evaluate our proposed method both qualitatively and quantitatively.
Training dataset Following prior work, we select the following four datasets to build an HR training dataset that contains 180k faces. The first is a subset of VGGFace2 that contains images with 10 large poses for each identity (9k identities). The second is a subset of Celeb-A that contains 60k faces. The third is the whole AFLW dataset, which contains 25k faces originally used for facial landmark localization. The last is a subset of LS3D-W, a large-scale dataset for face alignment, that contains faces with various poses.
We also use WIDER FACE to build the real-world LR dataset. WIDER FACE is a face detection benchmark dataset that consists of more than 32k images affected by various types of noise and degradation. We randomly select 90% of the images as the LR training dataset.
Testing dataset The remaining 10% of the images from WIDER FACE described above are used as the real-world LR testing dataset. We conduct experiments on it to verify the performance of the proposed method. Because no corresponding reference HR images exist, we use the Fréchet Inception Distance (FID) to numerically evaluate the quality of the generated images. In addition, we conduct conventional experiments on the whole LFW dataset [14, 24, 12, 13] to provide PSNR results. The LFW test images are obtained by bicubic interpolation using Matlab.
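For readers unfamiliar with the metric, PSNR can be computed from the mean squared error between a reference image and a reconstruction. The sketch below assumes 8-bit pixel values (peak value 255) stored as flat lists; the function name `psnr` and the `peak` parameter are choices made for this illustration and do not reflect the exact evaluation script used for the reported results.

```python
import math


def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Both images are flat lists of pixel intensities in [0, peak].
    """
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak * peak / mse)
```

Unlike FID, PSNR requires a pixel-aligned ground-truth HR image, which is why it can only be reported on the LFW experiments and not on the real-world WIDER FACE test set.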
Implementation Details Our proposed styleVAE is trained on the unpaired HR and LR training datasets. In styleVAE, all convolutional layers use the same kernel size, and the dimension of the latent code is set to 256. We set the number of residual blocks in the inference and generator networks of styleVAE to 8 and 2, respectively. StyleVAE is trained for 10 epochs. After that, we build the paired dataset during training of the SR network: each HR image is fed into styleVAE to obtain a realistic real-world LR image. The size of the SR network is the same as that of styleVAE. We train styleVAE and the SR network with the Adam optimizer, and the learning rate remains unchanged during training. We use the PyTorch framework to implement our models and train them on NVIDIA Titan Xp GPUs.
In this section, we conduct experiments on real-world images from the real-world LR testing dataset described in Section 4.1. To evaluate the performance of our proposed method, we compare it with other state-of-the-art methods both numerically and qualitatively. SRCNN is the first attempt to use deep convolutional neural networks (CNNs) for SR. SRGAN utilizes a Generative Adversarial Network (GAN) based approach. EDSR received the best paper award of the NTIRE2017 challenge on SISR. RDN is a CNN-based SR method that makes full use of hierarchical features from LR images. For the sake of fairness, we use the publicly released code with the default configurations described in the respective papers and retrain all compared methods on our HR training dataset described in Section 4.1. Note that their training LR images are produced by applying a bicubic kernel to the corresponding HR images.
In numerical terms, we use the Fréchet Inception Distance (FID) to measure the quality of the generated images, since there are no corresponding HR images. The quantitative results of different SR methods on our testing dataset are summarized in Table 2. They show that our proposed method outperforms the other prominent approaches and achieves the best performance on our testing dataset. We also find that the performance of the compared methods, which are trained on bicubic-downsampled LR images, degrades when they are applied to real-world LR images. The main reason is that nuisance factors, e.g., motion blur, lens aberration, and sensor noise, are not captured in synthetic LR images produced by bicubic interpolation. By training on LR images generated by our proposed styleVAE, our method is superior to all of them, reducing FID by 157.17.
In Figure 3, we also show qualitative comparisons on our testing dataset. There are significant artifacts in the HR images generated by shallower networks, e.g., SRCNN and SRGAN, and severe mesh artifacts are found in the images reconstructed by SRCNN. The images generated by EDSR and RDN are usually distorted. In contrast, the images reconstructed by our proposed method are more realistic, since the LR images produced by styleVAE closely resemble real-world LR images.
To verify the performance of the proposed method on realistic LR images with unknown degradation modes, we conduct experiments on synthetic real-world LR images obtained by styleVAE. We use images from LFW [14, 24, 12, 13] as the HR inputs of styleVAE to generate realistic LR images that are unseen during training of the SR network. Table 1 reports the PSNR and SSIM results of different SR methods. We find that the performance of the compared methods is very limited, even lower than that of direct bicubic up-sampling. This also demonstrates that simulating realistic real-world LR images is an effective way to improve performance on real-world LR images.
To further validate the effectiveness of our proposed styleVAE, we design two pipelines with the help of the plug-and-play framework. We can simply plug styleVAE into SR networks to replace the bicubic down-sampled LR images used in many previous SR methods. We choose two of the compared methods as the plugged SR networks: a shallower SR network, SRCNN, and a deeper SR network, EDSR. Thus there are four versions of SR networks: SRCNN-B and EDSR-B, trained on bicubic down-sampled LR images, and SRCNN-S and EDSR-S, trained on LR images generated by styleVAE. FID evaluations on our real-world LR testing dataset are reported in Table 3. As can be seen from Table 3, the FID values of SRCNN-B and EDSR-B (the second row of Table 3) are higher than those of SRCNN-S and EDSR-S (the last row of Table 3). By simulating real-world LR images using styleVAE, SRCNN gains an improvement of 58.3 (the third column of Table 3) and EDSR gains an improvement of 22.3 (the last column of Table 3).
We also show the visual results in Figure 8. Comparing (a) with (b), SRCNN-S effectively eliminates the mesh artifacts in the image generated by SRCNN-B. When trained on LR images generated by styleVAE, EDSR-S produces more pleasing results (d) than the distorted reconstructions (c) of EDSR-B. Comparing (b), (d), and (e), our proposed method generates sharper images than the other SR networks trained on LR images generated by styleVAE.
StyleVAE To investigate the effectiveness of the mutual information estimation (MI) and Adaptive Instance Normalization (AdaIN) used in styleVAE, we train several variants of styleVAE with MI and/or AdaIN removed. To evaluate these variants, we measure the FID between the LR images they generate and real-world LR images from our testing dataset. The FID results are provided in Table 5. When both AdaIN and MI are removed, the FID value is relatively high. Adding either one of the two decreases the FID, and using both MI and AdaIN yields the lowest FID. We also evaluate how similar synthetic LR images obtained by bicubic down-sampling are to real-world LR images from WIDER FACE; their FID is 31.20. These results indicate that AdaIN and MI are essential for styleVAE to generate images that closely resemble real-world LR images.
SR network To investigate the importance of the global attention module and the local attention module, we conduct another study under a setting similar to the previous one. As in the ablation investigation of styleVAE, we train several variants of the proposed SR network with the non-local and/or local attention removed. These variants are trained on LR images produced by performing bicubic interpolation on the corresponding HR images. In Table 4, when both non-local and local attention are removed, the PSNR value on LFW is the lowest. When local attention is added, the PSNR value increases by 0.1 dB. After adding non-local attention, the performance reaches 30.27 dB. When both attention modules are added to the SR network, the performance is the best, with a PSNR of 30.43 dB. These experimental results demonstrate that both attention modules contribute to the SR network and improve its performance.
We propose a novel two-stage process to address the challenging problem of super-resolving real-world LR images. The proposed SSRVAE unifies a style-based Variational Autoencoder (styleVAE) and an SR network. Owing to nuisance factor transfer and the VAE, the proposed styleVAE generates realistic real-world LR images. The generated LR images, paired with the corresponding HR images, are then fed into the SR network. Our SR network first learns long-range dependencies with GARB. The attention of the SR network then moves to local areas of images, in which texture detail is filled in using LARB. Extensive experiments show the superiority of our method over existing state-of-the-art SR methods, as well as the ability of styleVAE to improve the generalization and robustness of SR methods in real-world cases.