Pytorch implementation of Deep Face Super-Resolution with Iterative Collaboration between Attentive Recovery and Landmark Estimation (CVPR 2020)
Recent works based on deep learning and facial priors have succeeded in super-resolving severely degraded facial images. However, the prior knowledge is not fully exploited in existing methods, since facial priors such as landmark and component maps are always estimated by low-resolution or coarsely super-resolved images, which may be inaccurate and thus affect the recovery performance. In this paper, we propose a deep face super-resolution (FSR) method with iterative collaboration between two recurrent networks which focus on facial image recovery and landmark estimation respectively. In each recurrent step, the recovery branch utilizes the prior knowledge of landmarks to yield higher-quality images which facilitate more accurate landmark estimation in turn. Therefore, the iterative information interaction between two processes boosts the performance of each other progressively. Moreover, a new attentive fusion module is designed to strengthen the guidance of landmark maps, where facial components are generated individually and aggregated attentively for better restoration. Quantitative and qualitative experimental results show the proposed method significantly outperforms state-of-the-art FSR methods in recovering high-quality face images.
In recent years, face super-resolution (FSR), also known as face hallucination, has attracted much attention of the computer vision community. FSR aims to restore high-resolution (HR) face images from the low-resolution (LR) counterparts, which plays an important role in many applications such as video surveillance and face enhancement. Moreover, facial analysis techniques including face recognition and face alignment can also benefit a lot from the quality improvement brought by FSR.
FSR is a special case of single image super-resolution (SISR) [44, 29, 34, 28, 35], which is a challenging problem since it is highly ill-posed due to the ambiguity of the super-resolved pixels. Compared to SISR, FSR only considers facial images instead of arbitrary scenes. Therefore, the specific facial configuration can be strong prior knowledge for the generation, so that global structures and local details can be recovered accordingly. Hence FSR methods perform better than SISR on higher upscaling factors (e.g., 8×). A number of methods for face super-resolution [24, 38, 11, 12, 9, 4, 22, 33, 14] have been proposed recently. Furthermore, the advent of deep learning techniques has greatly boosted the performance of face hallucination because of the powerful generative ability of deep convolutional neural networks (DCNNs).
Facial priors have been utilized in existing FSR methods. Zhu et al. use a dense correspondence field to capture the spatial configuration of faces. Yu et al. predict facial component heatmaps to localize facial components and improve the SR quality. The end-to-end trained network of Chen et al. introduces facial landmark heatmaps and parsing maps simultaneously to boost the recovery performance. However, such methods have limitations. On the one hand, they have difficulty in estimating accurate prior information because the localization and alignment processes are applied to LR input images or coarse SR images, which are of low quality and far from the final results. Given inexact priors, the guidance for SR may be erroneous. On the other hand, most methods simply treat recovery and prior prediction as a multi-task learning problem and incorporate the prior information by a simple concatenation operation. Such guidance is not direct and clear enough, since the structural variations of different components may not be fully captured and exploited. Therefore, more powerful schemes to utilize facial priors should be explored.
In this paper, we propose a deep iterative collaboration method for face super-resolution to mitigate the above issues. Firstly, we design a new framework including two branches, one for face recovery and the other for landmark estimation. Different from previous methods, we let the face SR and alignment processes facilitate each other progressively. The idea is inspired by the fact that the SR branch can generate high-fidelity face images with the guidance of accurate landmark maps and the alignment branch also benefits a lot from high-quality input images. To achieve this goal, we build a recurrent architecture instead of very deep generative models for SR while designing a recurrent hourglass network for face alignment, rather than conventional stacked hourglass networks. In each recurrent step, previous outputs of each branch are fed into the other branch in the following step, so that both branches collaborate with each other for better performance. Moreover, the feedback schemes implemented in the two branches both increase the efficiency of the whole framework. Secondly, we propose a new attentive fusion module to integrate the landmark information instead of the concatenation operation. Specifically, we utilize the estimated landmark maps to generate multiple attention maps, each of which reveals the geometric configuration of one facial key component. Benefiting from the component-specific attention mechanism, features for each component can be extracted individually, which can be easily accomplished by group convolutions. Experimental results on two popular benchmark datasets, CelebA and Helen, demonstrate the superiority of our method in super-resolving high-quality face images over state-of-the-art FSR methods.
Face Super-Resolution: Recently, deep learning based methods have achieved remarkable progress in various computer vision tasks including face super-resolution. Yu et al. introduce a deep discriminative generative network that can super-resolve very low-resolution face images. Huang et al. turn to the wavelet domain and propose a network that predicts wavelet coefficients of HR images. Besides, Yu et al. embed attributes in the process of face super-resolution. Zhang et al. introduce a super-identity loss to measure the identity difference. Some face SR methods also divide the solution into global and local parts. Tuzel et al. design a network that contains two sub-networks: the first one reconstructs face images based on global constraints while the second one enhances local details. Cao et al. propose to use reinforcement learning to specify attended regions and use a local enhancement network for recovery sequentially.
Since face hallucination is a domain-specific task, facial priors are utilized in some FSR methods. Yu et al. concatenate facial component heatmaps with features in the middle of the network. Chen et al. concatenate facial landmark heatmaps and parsing maps with features. Kim et al. design a facial attention loss based on facial landmark heatmaps and use it to train a progressive generator. Zhu et al. propose a deep bi-network which conducts face hallucination and face correspondence alternately to refine both processes progressively. However, the architecture of the cascaded framework is redundant and inflexible, restricting the efficiency of the model. Moreover, the lack of ability to estimate accurate dense correspondence fields may also lead to severe distortions.
Single Image Super-Resolution: As a pioneer of using deep networks in single image super-resolution (SISR), Dong et al. propose SRCNN to learn a mapping from bicubic-interpolated images to HR images. Kim et al. propose VDSR by using a 20-layer VGG-net to learn the residual of LR and HR images. The methods mentioned above mainly focus on PSNR and SSIM, and their results are mostly blurry. Recently, the perceptual quality of SR images has been drawing more and more attention. SRGAN is the first to generate photo-realistic images with the adversarial loss and the perceptual loss. Rad et al. extend the perceptual loss with a targeted perceptual loss.
Recently, recurrent networks have also been utilized for SISR. Kim et al.  propose DRCN, a deep recursive CNN, and obtain outstanding performance compared to previous work. Tai et al.  use residual units to build deep and concise networks with recursive blocks. Zhang et al.  follow the idea of DenseNet  and design a residual dense block to fuse hierarchical features. Han et al.  design a dual-state recurrent network that exploits LR and HR signals jointly. Li et al.  introduce a new feedback block where features are iteratively upsampled and downsampled. While recursive networks promote the development of SISR, few methods have employed their generative power in face super-resolution. Hence it remains an attractive direction to exploit the potential ability of recurrent mechanisms for FSR.
In face super-resolution, we aim to recover the facial details of input LR face images and obtain the SR results. We design a deep iterative collaboration network which estimates high-quality SR images and landmark maps iteratively and progressively from the input LR images. In order to enhance the collaboration between the SR and alignment processes, we design a novel attentive fusion module that integrates the two sources of information effectively. Finally, we apply an adversarial loss to supervise the training of the framework and produce enhanced SR faces with high-fidelity details.
Given an LR face image, facial landmarks are important for the recovery procedure. However, prior estimation from LR faces is unreliable since a lot of details are missing, and such information may provide inaccurate guidance for SR. Our method alleviates this issue with an iterative collaborative scheme, as shown in Figure 2. In this framework, face recovery and landmark localization are performed simultaneously and recursively. Accurate landmark maps yield better SR images, and landmarks in turn are estimated more accurately when the input faces have higher quality. Both processes can enhance each other and achieve better performance progressively. Finally, we can get accurate SR results and landmark heatmaps with enough steps.
The recurrent SR branch consists of a low-resolution feature extractor, a recursive block and high-resolution generation layers. The recursive block includes an attentive fusion module and a recurrent SR module. Similar to the SR branch, the recurrent alignment branch includes a pre-processing block, a recursive hourglass block and a post-processing block. In each recurrent step, the SR branch recovers an SR image by using the alignment results (the landmark maps) and the feedback information from the previous step. Besides, LR inputs are also important in each step. Hence the LR features produced by the feature extractor are also fed into the recursive block. The face SR process in each step therefore takes three inputs: the LR features, the feedback features from the previous step and the landmark maps from the previous step. The SR image is then produced by the generation layers together with an upsampling operation. Similarly, the face alignment branch uses its own recurrent features from the previous step, together with the SR features that the pre-processing block extracts from the current SR image, as guidance for estimating landmarks more accurately.
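The two recurrences described above can be written compactly as follows. This is a sketch only: every symbol ($G_1$, $G_R$, $G_2$ for the SR branch, $A_1$, $A_R$, $A_2$ for the alignment branch, $U$ for upsampling, $F^n$ and $H^n$ for the feedback features, $L^n$ for the landmark maps) is a name chosen here for illustration, and adding the upsampled input to the generated image is an assumption based on the description rather than the original notation.

```latex
\begin{aligned}
F^{n} &= G_R\big(G_1(I_{LR}),\, F^{n-1},\, L^{n-1}\big), \\
I_{SR}^{n} &= G_2(F^{n}) + U(I_{LR}), \\
H^{n} &= A_R\big(A_1(I_{SR}^{n}),\, H^{n-1}\big), \\
L^{n} &= A_2(H^{n}), \qquad n = 1, \dots, N.
\end{aligned}
```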
After the final step, we have the SR images and landmark maps of every step, and the outputs become more satisfactory as the step index increases. In the beginning, there are no recurrent features or landmark maps from a previous step. Therefore, before the first step we use an extra, similar SR module which takes only the LR features as input to produce an initialization for the following steps. Meanwhile, the face alignment branch is also initialized before the first step.
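The control flow of this scheme can be sketched in a few lines of Python. The sketch is framework-agnostic and is not the authors' implementation: all six callables (`extract_lr`, `init_sr`, `sr_step`, `generate`, `align_step`) are hypothetical placeholders for the network blocks, and starting the alignment branch with no feedback (`None`) is an assumption.

```python
from typing import Any, Callable, List, Tuple


def iterative_collaboration(
    lr_image: Any,
    extract_lr: Callable[[Any], Any],
    init_sr: Callable[[Any], Any],
    sr_step: Callable[[Any, Any, Any], Any],
    generate: Callable[[Any, Any], Any],
    align_step: Callable[[Any, Any], Tuple[Any, Any]],
    num_steps: int,
) -> Tuple[List[Any], List[Any]]:
    """Alternate the SR and alignment branches for num_steps steps."""
    lr_feat = extract_lr(lr_image)

    # Initialization (assumed): an extra SR module sees only the LR features,
    # and the alignment branch starts without feedback (None).
    sr_feedback = init_sr(lr_feat)
    sr_image = generate(sr_feedback, lr_image)
    landmarks, align_feedback = align_step(sr_image, None)

    sr_images, landmark_maps = [], []
    for _ in range(num_steps):
        # SR branch: LR features + previous feedback + previous landmarks.
        sr_feedback = sr_step(lr_feat, sr_feedback, landmarks)
        sr_image = generate(sr_feedback, lr_image)
        # Alignment branch: current SR image + its own previous feedback.
        landmarks, align_feedback = align_step(sr_image, align_feedback)
        sr_images.append(sr_image)
        landmark_maps.append(landmarks)
    return sr_images, landmark_maps
```

Returning the outputs of every step, rather than only the last one, keeps them available for per-step supervision during training; at inference time only the last SR image is needed.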
To achieve more powerful optimization, we impose loss functions on the output of every step. By this means, SR and alignment are strengthened in every step and inaccurate factors are corrected gradually by mutual supervision. Pixel-wise loss functions are defined for both branches, one for face SR and one for landmark estimation. They compare the outputs of every step with the ground-truth HR images and the ground-truth landmark heatmaps, respectively. We use the SR images of the last step as the final outputs.
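One way to write this per-step supervision is given below. The symbols are assumed for illustration: $I_{SR}^{n}$ and $L^{n}$ are the outputs of step $n$, $I_{HR}$ and $L_{HR}$ the ground truth, and $N$ the number of steps. The norm $\|\cdot\|$ is left unspecified because the text only calls the losses pixel-wise, and averaging over steps is also an assumption (a plain sum differs only by a constant factor).

```latex
\mathcal{L}_{pixel} = \frac{1}{N}\sum_{n=1}^{N} \big\| I_{HR} - I_{SR}^{n} \big\|, \qquad
\mathcal{L}_{align} = \frac{1}{N}\sum_{n=1}^{N} \big\| L_{HR} - L^{n} \big\|, \qquad
I_{SR} = I_{SR}^{N}
```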
In existing methods, the straightforward ways of utilizing facial prior knowledge are to concatenate facial priors with SR features and to treat the whole optimization procedure as a problem of multi-task learning. However, facial structures may not be fully exploited since features of different facial parts are usually extracted by a shared network. Hence the specific structural configuration priors existing in different facial components may be neglected by the networks. Therefore, different facial parts should be recovered separately for better performance. Cao et al. have exploited the global interdependency of facial parts by reinforcement learning. However, the sequential patch reconstruction cannot utilize facial priors explicitly and efficiently, which also limits the specialized generation for different facial components.
In contrast, we achieve the above goals with a new structure-aware attentive fusion module that makes full use of the guidance of landmarks. We assume each landmark heatmap has one channel per landmark, indicating the location of that landmark. The landmarks can be grouped into subsets belonging to facial components, including left eye, right eye, nose, mouth and jawline. Channels in each group are added together to form the heatmap for the corresponding facial component, as shown in Figure 3. The reason for doing so, rather than directly fusing the learned landmarks, is twofold: (1) we explicitly highlight the local structure of each facial part to perform differential recovery; (2) the number of channels is largely reduced by the grouping process, which improves the efficiency of the framework. We then compute the corresponding attention maps by applying the softmax function along the channel dimension of these component heatmaps, independently at every spatial coordinate.
Instead of using multiple models for different facial components, we apply group convolutions to generate individual features for each component. The flow chart is depicted in Figure 3. In order to make each group of convolutions concentrate on the corresponding part, the attentive fusion weights the features of each group by the attention map of its component and aggregates the weighted features.
The aggregated result is the output of the proposed attentive fusion module. Note that the attentive fusion module is a part of the recurrent SR branch, so that the gradients can be back-propagated to both the SR and alignment branches in a recursive manner. Moreover, the landmark estimation is supervised not only by the loss imposed on the recurrent alignment branch, but also by the revision of FSR results through the attentive fusion module.
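The three operations of the module (grouping landmark channels, a per-pixel softmax across components, and attention-weighted aggregation) can be illustrated with a small pure-Python sketch. It rests on simplifying assumptions: each component has a single feature channel stored as a nested list, the weighted features are aggregated by summation, and the grouping `groups` is hypothetical. The real module works on multi-channel tensors produced by group convolutions.

```python
import math
from typing import List, Sequence

Grid = List[List[float]]  # one H x W channel


def group_heatmaps(heatmaps: Sequence[Grid], groups: Sequence[Sequence[int]]) -> List[Grid]:
    """Sum the landmark channels of each group into one component heatmap."""
    h, w = len(heatmaps[0]), len(heatmaps[0][0])
    return [
        [[sum(heatmaps[k][y][x] for k in idx) for x in range(w)] for y in range(h)]
        for idx in groups
    ]


def channel_softmax(components: Sequence[Grid]) -> List[Grid]:
    """Softmax across component channels, evaluated at every pixel."""
    h, w = len(components[0]), len(components[0][0])
    out = [[[0.0] * w for _ in range(h)] for _ in components]
    for y in range(h):
        for x in range(w):
            vals = [c[y][x] for c in components]
            m = max(vals)  # subtract the max for numerical stability
            exps = [math.exp(v - m) for v in vals]
            total = sum(exps)
            for p, e in enumerate(exps):
                out[p][y][x] = e / total
    return out


def attentive_fusion(features: Sequence[Grid], attention: Sequence[Grid]) -> Grid:
    """Weight each component feature by its attention map and sum the results."""
    h, w = len(features[0]), len(features[0][0])
    return [
        [sum(f[y][x] * a[y][x] for f, a in zip(features, attention)) for x in range(w)]
        for y in range(h)
    ]


# Toy example: 3 landmark channels on a 1 x 2 map, grouped into 2 components.
heatmaps = [[[4.0, 0.0]], [[0.0, 1.0]], [[0.0, 3.0]]]
groups = [[0], [1, 2]]  # hypothetical grouping
attn = channel_softmax(group_heatmaps(heatmaps, groups))
fused = attentive_fusion([[[1.0, 1.0]], [[2.0, 2.0]]], attn)
```

In the toy example the first component dominates the left pixel and the second dominates the right pixel, so the fused value on the left stays close to the first component's feature and the value on the right close to the second's.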
Adversarial Loss: Recently, GANs [20, 35, 5] have been successful in generative tasks and have proven effective in recovering high-fidelity images. Hence we introduce an adversarial loss to generate photo-realistic face images. We build a discriminator that is trained to differentiate the ground-truth images from their super-resolved counterparts by minimizing a discriminator loss. Meanwhile, the generator tries to fool the discriminator by minimizing the adversarial loss.
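For orientation, the standard GAN objective with a non-saturating generator loss is one formulation that matches this description. Whether the original uses exactly this variant cannot be read off from the text, so it should be treated as an assumed stand-in. Here $D$ is the discriminator, $G$ the generator, and $I_{HR}$ and $I_{LR}$ the ground-truth and input images.

```latex
\mathcal{L}_{D} = -\mathbb{E}\big[\log D(I_{HR})\big] - \mathbb{E}\big[\log\big(1 - D(G(I_{LR}))\big)\big], \qquad
\mathcal{L}_{adv} = -\mathbb{E}\big[\log D(G(I_{LR}))\big]
```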
Perceptual Loss: We also apply a perceptual loss to enhance the perceptual quality of SR images, similar to [20, 5]. We employ a pretrained face recognition model, LightCNN, to extract features from images. The perceptual loss is defined as the Euclidean distance between the features of the SR images and those of the HR images, so that minimizing it improves their perceptual similarity.
Overall Objective: The generator is optimized by minimizing an overall objective that combines the pixel loss, the alignment loss, the adversarial loss and the perceptual loss.
Trade-off parameters weight the adversarial loss and the perceptual loss. Since the recurrent alignment module is optimized as a part of the whole framework, the overall objective also includes the alignment loss, weighted by its own parameter. For training our PSNR-oriented model DIC, only the pixel loss and the alignment loss are used. The complete objective is then used to obtain the perceptually pleasing model DICGAN.
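Written out with assumed symbols, where $\phi$ stands for the pretrained feature extractor and $\lambda_{align}$, $\lambda_{adv}$ and $\lambda_{pcp}$ for the three weights, the perceptual loss and the overall objective take the following shape. Giving the pixel loss unit weight is an assumption; the PSNR-oriented setting corresponds to $\lambda_{adv} = \lambda_{pcp} = 0$.

```latex
\mathcal{L}_{pcp} = \big\| \phi(I_{SR}) - \phi(I_{HR}) \big\|_2, \qquad
\mathcal{L}_{G} = \mathcal{L}_{pixel} + \lambda_{align}\,\mathcal{L}_{align}
                + \lambda_{adv}\,\mathcal{L}_{adv} + \lambda_{pcp}\,\mathcal{L}_{pcp}
```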
We conduct experiments on two widely used face datasets: CelebA and Helen. For both datasets we use OpenFace [2, 42, 1] to detect 68 landmarks as ground-truth. Based on the estimated landmarks, we crop square regions in each image to remove the background and resize them to 128×128 pixels without any pre-alignment. Then we downsample these HR images into 16×16 LR inputs with bicubic degradation. For the CelebA dataset, we use 168854 images for training and 1000 images for testing. For the Helen dataset, we use 2005 images for training and 50 images for testing.
SR results are evaluated with PSNR and SSIM, both computed on the Y channel after conversion to the YCbCr space. We also use face alignment as a metric to measure the accuracy of face recovery. We use a pretrained HourGlass network to detect the face landmarks and use the Normalized Root Mean Squared Error (NRMSE) to evaluate landmark estimation results. In our experiments, NRMSE is normalized by the width of the face.
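A minimal sketch of the two kinds of metrics is given below, under stated assumptions: images are flat sequences of RGB pixels in [0, 255], the Y channel follows the ITU-R BT.601 conversion, the PSNR peak is taken as 255, and NRMSE is taken to be the mean point-to-point landmark distance divided by the face width. The function names are illustrative and the exact evaluation code of the paper may differ.

```python
import math
from typing import Sequence, Tuple

Pixel = Tuple[float, float, float]  # (R, G, B), each in [0, 255]
Point = Tuple[float, float]         # (x, y) landmark coordinate


def rgb_to_y(pixel: Pixel) -> float:
    """Luma (Y) of the ITU-R BT.601 YCbCr transform, in the range [16, 235]."""
    r, g, b = pixel
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0


def psnr_y(sr: Sequence[Pixel], hr: Sequence[Pixel]) -> float:
    """PSNR in dB between two equally sized images, computed on Y only."""
    mse = sum((rgb_to_y(a) - rgb_to_y(b)) ** 2 for a, b in zip(sr, hr)) / len(hr)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(255.0 ** 2 / mse)


def nrmse(pred: Sequence[Point], gt: Sequence[Point], face_width: float) -> float:
    """Mean point-to-point landmark error divided by the face width."""
    dists = [math.hypot(px - gx, py - gy) for (px, py), (gx, gy) in zip(pred, gt)]
    return sum(dists) / len(dists) / face_width
```

Higher PSNR indicates a closer match to the HR image, while lower NRMSE indicates more accurate landmarks.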
Training Setting: The architecture of the recurrent SR module follows the feedback block of Li et al. We set the number of groups to 6, the number of steps to 4 and the number of feature channels to 48. For Helen, data augmentation is performed on training images, which are randomly rotated and flipped horizontally. We train the PSNR-oriented model with the pixel loss and the weighted alignment loss. For GAN training, we use the pretrained PSNR-oriented parameters as initialization and train the model with the adversarial and perceptual losses enabled. The model is trained with the ADAM optimizer, and the initial learning rate is halved at preset iteration counts. Our experiments are implemented in PyTorch with NVIDIA RTX 2080Ti GPUs.
Comparison with the State-of-the-Arts: We compare our proposed DIC method with state-of-the-art FSR methods. Table 1 tabulates the quantitative results on CelebA and Helen. It can be observed that our DIC method achieves the best PSNR and SSIM performance on both datasets. It is noteworthy that DIC outperforms FSRNet by a large margin. Therefore, our method obtains better inference by the progressive collaboration between the SR and alignment processes. Moreover, DICGAN gets comparable performance with FSRNet which is a PSNR-oriented method. This indicates that our DICGAN method is able to preserve pixel-wise accuracy while increasing perceptual quality of the super-resolved images.
We visualize some SR results of different methods as shown in Figure 4. We see that DIC recovers correct details while other methods fail in giving pleasant results. This indicates that our method is able to produce more stable SR results than other methods. Note that our method has a significant advantage in handling large pose and rotation variations. The reason is that the iterative alignment block can predict progressively more accurate landmarks to guide the reconstruction in each step. Therefore our method performs better in preserving facial structures and generating better details even though faces have large pose and rotation. Furthermore, DICGAN produces more realistic textures of images while other methods yield severe artifacts and distortions. Therefore, the qualitative comparison with state-of-the-art face SR methods demonstrates the powerful generative ability of our methods.
Following prior work, we conduct face alignment as a measurement to evaluate the quality of the super-resolved images. We adopt a pretrained face alignment model with four stacked hourglass modules. The alignment accuracy is reflected by the widely used NRMSE metric. Lower NRMSE values reveal better alignment accuracy and higher quality of SR images. Table 2 shows the NRMSE values of our methods and the other compared SR methods. We can see that our DICGAN method outperforms the other methods on both datasets. While other SR methods also use facial priors such as landmarks and component maps, the prior information is estimated from the input LR face images or coarsely recovered ones, where facial structures are severely unclear and degraded. Hence such facial priors can provide only limited guidance to the reconstruction procedure, and the recovered images may contain corresponding structural errors. In contrast, our method revises the landmark estimation in every step to provide more accurate auxiliary information to the SR branch. Meanwhile, the attentive fusion module integrates the prior guidance effectively to boost the final performance.
User Study: We also conduct a user study as a subjective assessment to further evaluate our SR quality compared to previous face SR methods. Details are described in the supplementary material.
[Table 3: PSNR, SSIM and NRMSE at Step 1 to Step 4 on CelebA.]
[Table 4: PSNR, SSIM and NRMSE at Step 1 to Step 4 on Helen.]
Study of Iterative Learning: To better show the merits of the proposed scheme of iterative collaboration, we also evaluate the quality of the SR outputs. As mentioned above, we use PSNR, SSIM and NRMSE as measurement metrics. Differently, in this experiment, NRMSE is computed by the landmarks estimated by the alignment branch in the corresponding steps. The performance on CelebA and Helen is presented in Table 3 and Table 4, respectively. We can see from step 1 to step 4, the performance gets better progressively. It is noteworthy that the NRMSE values in Table 3 and Table 4 are much lower than those in Table 2. In fact, in our alignment branch, the parameters are much fewer than the stacked hourglass model which is used to estimate landmarks in Table 2. The reason why our model gets more accurate alignment results with fewer parameters is that our model can learn to capture face structures in different-level super-resolved images. Due to this ability, our model can provide relatively accurate landmarks in each step for better collaboration. Therefore, the comparison proves that our method is able to achieve progressively better SR quality and landmark estimation simultaneously.
Furthermore, a visual comparison of different steps is shown in Figure 5. The results show that the generation of facial components is improved step by step. In the last step, our model obtains geometrically pleasing and high-fidelity SR images. From the PSNR, SSIM and NRMSE values in each step, we can also see the consistent improvement of our scheme of iterative collaboration. Moreover, Table 3, Table 4 and Figure 5 suggest that three steps may be a suitable choice for good enough recovery and efficient computation.
Effects of Attentive Fusion: We implement another experiment to better investigate the effectiveness of our proposed attentive fusion module. Since we use group convolution layers to extract specialized representations for different facial parts, we retain only the representation of one part and visualize the SR results, as shown in Figure 6. For a certain component, we remove the features of the other components by setting the corresponding attention maps to 0. By this means, the final outputs only contain accurate information for one facial component. From Figure 6, we can indeed see that different parts can be recovered separately by the representations. The results demonstrate the advantages of the proposed attentive fusion module, which can explicitly guide the component-specialized generation in an efficient and flexible way.
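The masking used in this experiment amounts to zeroing every attention map except the one of the component under study. A minimal sketch, assuming the attention maps are stored as a list of nested-list channels and using the hypothetical helper name `keep_component`:

```python
from typing import List, Sequence

Grid = List[List[float]]  # one H x W attention map


def keep_component(attention: Sequence[Grid], keep: int) -> List[Grid]:
    """Return a copy of the attention maps with every map except `keep` zeroed."""
    return [
        [row[:] for row in amap] if p == keep else [[0.0] * len(row) for row in amap]
        for p, amap in enumerate(attention)
    ]


# Two components on a 1 x 2 map; only the second one is kept.
attn_maps = [[[0.7, 0.2]], [[0.3, 0.8]]]
masked = keep_component(attn_maps, keep=1)
```

Because the function copies the kept map instead of modifying the input, the same attention maps can be reused to visualize each component in turn.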
Ablation Study: We further implement an ablation study to measure the effectiveness of the iterative collaboration framework and the attentive fusion module. On the one hand, in order to validate the effects of facial priors, we remove the alignment branch and the attentive fusion module. This model is called DIC-NL, which is equivalent to a recurrent network for single image super-resolution without the prior information of landmark maps. On the other hand, we remove the attentive fusion module and concatenate landmarks (CL) instead, to evaluate the effects of the proposed fusion module quantitatively. This model is denoted as DIC-CL. PSNR and SSIM performance on CelebA is presented in Table 5. From the table we can see that when the SR network loses the guidance provided by face landmarks, SR quality is degraded severely since its ability to capture the facial structural configuration is weakened. DIC-CL has an advantage over DIC-NL since it incorporates the prior information by concatenation, and a large enhancement can be observed due to this integration. However, the SR performance of DIC-CL is still far from that of the DIC method. The reason is that concatenated landmark maps provide only implicit knowledge to face SR and are limited in providing adequate guidance. In contrast, our DIC method not only integrates the structural knowledge, but also explicitly induces component-specialized feature extraction for more photo-realistic SR images. Hence the results support the superiority of the proposed method.
In this paper, we have proposed a deep iterative collaboration network for face super-resolution. Specifically, a recurrent SR branch collaborates with a recurrent alignment branch to recover high-quality face SR images iteratively and progressively. In each step, the SR process utilizes the estimated landmarks from the alignment branch to produce better face images which are important for the alignment branch to estimate more accurate landmarks. Furthermore, we have proposed a new attentive fusion module to exploit attention maps and extract individual features for each facial component according to the estimated landmarks. Quantitative and qualitative results of face SR on two widely-used benchmark datasets have demonstrated the effectiveness of the proposed method.
This work was supported in part by the National Key Research and Development Program of China under Grant 2017YFA0700802, in part by the National Natural Science Foundation of China under Grant 61822603, Grant U1813218, Grant U1713214, and Grant 61672306, in part by the Shenzhen Fundamental Research Fund (Subject Arrangement) under Grant JCYJ20170412170602564, and in part by Tsinghua University Initiative Scientific Research Program.
Stacked hourglass networks for human pose estimation. In ECCV, pp. 483–499.
Here we describe more details of our recurrent networks. Table 6 shows the detailed architecture of the SR branch. Given input LR images, LR features are extracted by the low-resolution feature extractor and are subsequently concatenated with the feedback features. Then, through the recursive block, which consists of a convolutional layer, an attentive fusion module and a recurrent SR module, the obtained features are used as both the feedback signals and the features for the following generation. Finally, SR images are recovered by the generation layers and the addition operation. The generation part is comprised of a deconvolutional layer with a kernel size of 8 and a convolutional layer.
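As a quick check of the spatial sizes, the output length of a transposed (deconvolutional) layer can be computed with the usual formula. The kernel size of 8 and the 16-to-128 resolution change come from the text; the stride of 8 and the padding of 0 are assumptions chosen because they reproduce that resolution change, not values stated in the paper.

```python
def conv_transpose_out(size: int, kernel: int, stride: int, padding: int = 0,
                       output_padding: int = 0, dilation: int = 1) -> int:
    """Output length of a transposed convolution along one spatial dimension."""
    return (size - 1) * stride - 2 * padding + dilation * (kernel - 1) + output_padding + 1


# Kernel 8 is from the text; stride 8 and padding 0 are assumed.
hr_size = conv_transpose_out(16, kernel=8, stride=8)
```

With these assumed settings a 16-pixel input side maps to a 128-pixel output side, which matches the 8× upscaling used in the experiments.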
except that the batch normalization layers are removed. The recurrent hourglass module has a similar architecture to the single hourglass module in stacked hourglass networks. The difference is that both its input and its output include two components. The input is obtained by concatenating the pre-processing feature with the feedback feature, while the output is split into two parts: a feedback feature and a feature for the final landmark estimation.
We conduct a user study to further evaluate the visual quality of the super-resolved images. We randomly select 30 images from the testing set of CelebA and display the corresponding SR results of our DICGAN, FSRGAN and PFSR, together with the HR images, in a random order. 39 human raters are asked to rank these four versions of each image in terms of perceptual satisfaction. The results are shown in Figure 7. As expected, most of the HR images are regarded as the best among the four versions. Moreover, our DICGAN obtains many more rank-1 and rank-2 votes than FSRGAN and PFSR, which means the proposed method outperforms the state-of-the-art face SR methods by a large margin. We observe that PFSR scores the worst among the three FSR methods. We think the reason is that PFSR mainly focuses on well-aligned face images. Hence when the input faces have large variations of pose and rotation, PFSR fails to produce satisfactory SR results.
In Figure 8 and Figure 9, we present more qualitative comparisons with state-of-the-art FSR methods including RDN, FSRNet, FSRGAN and PFSR. The results demonstrate the effectiveness of our proposed method.
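[Table 6: architecture of the SR branch; surviving row labels: Attentive Fusion, Recurrent SR Module.]
[Architecture table of the alignment branch; surviving row label: Recurrent HourGlass.]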