Deep Face Super-Resolution with Iterative Collaboration between Attentive Recovery and Landmark Estimation

by Cheng Ma, et al.
Tsinghua University

Recent works based on deep learning and facial priors have succeeded in super-resolving severely degraded facial images. However, the prior knowledge is not fully exploited in existing methods, since facial priors such as landmark and component maps are always estimated by low-resolution or coarsely super-resolved images, which may be inaccurate and thus affect the recovery performance. In this paper, we propose a deep face super-resolution (FSR) method with iterative collaboration between two recurrent networks which focus on facial image recovery and landmark estimation respectively. In each recurrent step, the recovery branch utilizes the prior knowledge of landmarks to yield higher-quality images which facilitate more accurate landmark estimation in turn. Therefore, the iterative information interaction between two processes boosts the performance of each other progressively. Moreover, a new attentive fusion module is designed to strengthen the guidance of landmark maps, where facial components are generated individually and aggregated attentively for better restoration. Quantitative and qualitative experimental results show the proposed method significantly outperforms state-of-the-art FSR methods in recovering high-quality face images.



Code Repositories


PyTorch implementation of Deep Face Super-Resolution with Iterative Collaboration between Attentive Recovery and Landmark Estimation (CVPR 2020)


1 Introduction

Figure 1: Data flow of the proposed method. FSR outputs in different steps are shown in the top row while the detected facial landmarks are displayed on HR images accordingly in the bottom row. The pink arrows denote the face alignment process while the blue ones denote the face recovery process with the attentive fusion of landmarks. The black arrows represent the recurrent schemes in two branches. Through our framework, the quality of SR images becomes better progressively and the estimated landmarks (red) also get closer to the ground-truth (green).

In recent years, face super-resolution (FSR), also known as face hallucination, has attracted much attention from the computer vision community. FSR aims to restore high-resolution (HR) face images from their low-resolution (LR) counterparts, which plays an important role in many applications such as video surveillance and face enhancement. Moreover, facial analysis techniques including face recognition and face alignment can also benefit from the quality improvement brought by FSR.

FSR is a special case of the task of single image super-resolution (SISR) [44, 29, 34, 28, 35], which is a challenging problem since it is highly ill-posed due to the ambiguity of the super-resolved pixels. Compared to SISR, FSR only considers facial images instead of arbitrary scenes. Therefore, the specific facial configuration can be strong prior knowledge for the generation, so that global structures and local details can be recovered accordingly. Hence FSR methods perform better than SISR on higher upscaling factors (e.g., 8×). A number of methods for face super-resolution [24, 38, 11, 12, 9, 4, 22, 33, 14] have been proposed recently. Furthermore, the advent of deep learning techniques has greatly boosted the performance of face hallucination because of the powerful generative ability of deep convolutional neural networks (DCNNs).

Facial priors have been utilized in existing FSR methods. A dense correspondence field is used in [46] to capture the information of face spatial configuration. Facial component heatmaps are predicted in [39] to localize facial components and improve the SR quality. An end-to-end trained network [5] introduces facial landmark heatmaps and parsing maps simultaneously to boost the recovery performance. However, such methods have two limitations. On the one hand, they have difficulty estimating accurate prior information because the localization and alignment processes are applied to LR input images or coarse SR images, which are of low quality and far from the final results. Given inexact priors, the guidance for SR may be erroneous. On the other hand, most methods simply treat recovery and prior prediction as a multi-task learning problem and incorporate the prior information by a simple concatenation operation. Such guidance is not direct and clear enough, since the structural variations of different components may not be fully captured and exploited. Therefore, more powerful schemes to utilize facial priors should be explored.

In this paper, we propose a deep iterative collaboration method for face super-resolution to mitigate the above issues. Firstly, we design a new framework including two branches, one for face recovery and the other for landmark estimation. Different from previous methods, we let the face SR and alignment processes facilitate each other progressively. The idea is inspired by the fact that the SR branch can generate high-fidelity face images with the guidance of accurate landmark maps and the alignment branch also benefits a lot from high-quality input images. To achieve this goal, we build a recurrent architecture instead of very deep generative models for SR while designing a recurrent hourglass network for face alignment, rather than conventional stacked hourglass networks [25]. In each recurrent step, previous outputs of each branch are fed into the other branch in the following step, so that both branches collaborate with each other for better performance. Moreover, the feedback schemes implemented in two branches both increase the efficiency of the whole framework. Secondly, we propose a new attentive fusion module to integrate the landmark information instead of the concatenation operation. Specifically, we utilize the estimated landmark maps to generate multiple attention maps, each of which reveals the geometric configuration of one facial key component. Benefiting from the component-specific attention mechanism, features for each component can be extracted individually, which can be easily accomplished by group convolutions. Experimental results on two popular benchmark datasets, CelebA [23] and Helen [19], demonstrate the superiority of our method in super-resolving high-quality face images over state-of-the-art FSR methods.

2 Related Work

Face Super-Resolution: Recently, deep learning based methods have achieved remarkable progress in various computer vision tasks including face super-resolution. Yu et al. [41] introduce a deep discriminative generative network that can super-resolve very low-resolution face images. Huang et al. [10] turn to the wavelet domain and propose a network that predicts wavelet coefficients of HR images. Besides, Yu et al. [40] embed attributes in the process of face super-resolution. Zhang et al. [43] introduce a super-identity loss to measure the identity difference. Some face SR methods also divide the solution into global and local parts. Tuzel et al. [32] design a network that contains two sub-networks: the first one reconstructs face images based on global constraints while the second one enhances local details. Cao et al. [3] propose to use reinforcement learning to specify attended regions and use a local enhancement network for recovery sequentially.

Since face hallucination is a domain-specific task, facial priors are utilized in some FSR methods. Yu et al. [39] concatenate facial component heatmaps with features in the middle of the network. Chen et al. [5] concatenate facial landmark heatmaps and parsing maps with features. Kim et al. [15] design a facial attention loss based on facial landmark heatmaps and use it to train a progressive generator. Zhu et al. [46] propose a deep bi-network which conducts face hallucination and face correspondence alternately to refine both processes progressively. However, the architecture of the cascaded framework is redundant and inflexible, restricting the efficiency of the model. Moreover, the lack of ability to estimate accurate dense correspondence fields may also lead to severe distortions.

Figure 2: Overall framework of the proposed deep iterative collaboration method. The architecture is comprised of two branches, a recurrent SR branch and a recurrent alignment branch. Two branches collaborate with each other and obtain better SR images and more accurate landmarks step by step. “” and “” denote concatenation and addition respectively.

Single Image Super-Resolution: As a pioneer of using deep networks in single image super-resolution (SISR), Dong et al. [6] propose SRCNN to learn a mapping from bicubic-interpolated images to HR images. Kim et al. [16] propose VDSR by using a 20-layer VGG-net [30] to learn the residual of LR and HR images. The methods mentioned above mainly focus on PSNR and SSIM, and their results are mostly blurry. Recently, the perceptual quality of SR images has been drawing more and more attention. SRGAN [20] is the first to generate photo-realistic images with the adversarial loss and the perceptual loss [13]. Rad et al. [27] extend the perceptual loss with a targeted perceptual loss.

Recently, recurrent networks have also been utilized for SISR. Kim et al. [17] propose DRCN, a deep recursive CNN, and obtain outstanding performance compared to previous work. Tai et al. [31] use residual units to build deep and concise networks with recursive blocks. Zhang et al. [45] follow the idea of DenseNet [8] and design a residual dense block to fuse hierarchical features. Han et al. [7] design a dual-state recurrent network that exploits LR and HR signals jointly. Li et al. [21] introduce a new feedback block where features are iteratively upsampled and downsampled. While recursive networks promote the development of SISR, few methods have employed their generative power in face super-resolution. Hence it remains an attractive direction to exploit the potential ability of recurrent mechanisms for FSR.

3 Approach

In face super-resolution, we aim to recover the facial details of input LR face images $I_{LR}$ and get the SR results $I_{SR}$. We design a deep iterative collaboration network which estimates high-quality SR images and landmark maps iteratively and progressively with the input LR images. In order to enhance the collaboration between the SR and alignment processes, we design a novel attentive fusion module that integrates two sources of information effectively. Finally, we apply an adversarial loss to supervise the training of the framework and produce enhanced SR faces with high-fidelity details.

3.1 Deep Iterative Collaboration

Figure 3: The left part illustrates the method to extract attention maps from landmark maps. The right part shows the flowchart of the attentive fusion module. The input feature is expanded by a convolutional layer. Then component-specific features are extracted by a series of group convolutional layers under the guidance of attention maps. We multiply (“”) the features with the attention maps which are broadcast through the channel dimension. Finally, weighted features are added together to form the output.

Given an LR face image $I_{LR}$, facial landmarks are important for the recovery procedure. However, prior estimation from LR faces is unreliable since a lot of details are missing, so the estimated priors may give inaccurate guidance to SR. Our method alleviates this issue by an iterative collaborative scheme as shown in Figure 2. In this framework, face recovery and landmark localization are performed simultaneously and recursively. More accurate landmark maps yield better SR images, and landmarks are in turn estimated more accurately when the input faces have higher quality. Both processes can enhance each other and achieve better performance progressively. Finally, we can get accurate SR results and landmark heatmaps with enough steps.

The recurrent SR branch consists of a low-resolution feature extractor $G_1$, a recursive block $G_R$ and high-resolution generation layers $G_2$. $G_R$ includes an attentive fusion module and a recurrent SR module. Similar to the SR branch, the recurrent alignment branch includes a pre-processing block $A_1$, a recursive hourglass block $A_R$ and a post-processing block $A_2$. For the $n$-th step where $n = 1, \dots, N$, the SR branch recovers SR images $I_{SR}^{n}$ by using the alignment results and the feedback information from the previous step $n-1$, denoted as $L^{n-1}$ and $f_{G_R}^{n-1}$, respectively. Besides, the LR inputs $I_{LR}$ are also important in each step. Hence LR features extracted by $G_1$ are also fed into the recursive block. Therefore, the face SR process can be formulated by:

$$f_{G_R}^{n} = G_R\big(G_1(I_{LR}),\, f_{G_R}^{n-1},\, L^{n-1}\big), \qquad I_{SR}^{n} = U(I_{LR}) + G_2\big(f_{G_R}^{n}\big),$$

where $U$ denotes an upsampling operation. Similarly, the face alignment branch utilizes the recurrent features from the previous step $f_{A_R}^{n-1}$ and the SR features extracted by $A_1$ from the SR images $I_{SR}^{n}$ as the guidance for estimating landmarks more accurately, as follows:

$$f_{A_R}^{n} = A_R\big(A_1(I_{SR}^{n}),\, f_{A_R}^{n-1}\big), \qquad L^{n} = A_2\big(f_{A_R}^{n}\big).$$

After $N$ steps, we get $\{I_{SR}^{n}\}_{n=1}^{N}$ and $\{L^{n}\}_{n=1}^{N}$, where outputs become more satisfactory as $n$ increases. In the beginning, there is no recurrent feature and landmark map from the previous step. Therefore, we use an extra similar SR module which takes only the LR features as input before the first step to get $I_{SR}^{0}$ as an initialization for the following steps. Meanwhile, we make $f_{A_R}^{0} = A_1(I_{SR}^{0})$ to initialize the face alignment branch.

For the purpose of achieving more powerful optimization, we impose loss functions on each output of the $N$ steps. By this means, the SR and alignment are strengthened in every step and the inaccurate factors are corrected gradually by mutual supervision. Here, the pixel-wise loss functions are defined as follows:

$$\mathcal{L}_{pixel} = \frac{1}{N} \sum_{n=1}^{N} \big\| I_{HR} - I_{SR}^{n} \big\|_2^2, \qquad \mathcal{L}_{align} = \frac{1}{N} \sum_{n=1}^{N} \big\| L_{HR} - L^{n} \big\|_2^2,$$

where $\mathcal{L}_{pixel}$ and $\mathcal{L}_{align}$ are the loss functions for the face SR and landmark estimation, respectively. $I_{HR}$ and $L_{HR}$ are the ground-truth HR images and landmark heatmaps. We use SR images in the last step as the final outputs, which can be formulated as $I_{SR} = I_{SR}^{N}$.
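The recurrent scheme of this subsection can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the callables `init_sr`, `sr_step` and `align_step` are hypothetical stand-ins for the extra initial SR module, the recurrent SR branch and the recurrent alignment branch, images and heatmaps are represented as flat lists of floats, and the mean squared error used for the per-step loss is an assumption.

```python
from typing import Callable, List, Optional, Tuple

Vector = List[float]  # toy stand-in for an image or heatmap tensor


def run_iterative_collaboration(
    lr: Vector,
    num_steps: int,
    init_sr: Callable[[Vector], Tuple[Vector, Vector]],
    sr_step: Callable[[Vector, Vector, Vector], Tuple[Vector, Vector]],
    align_step: Callable[[Vector, Optional[Vector]], Tuple[Vector, Optional[Vector]]],
) -> Tuple[List[Vector], List[Vector]]:
    """Alternate SR and alignment for num_steps steps and keep every output."""
    # Warm start: an SR pass that sees only the LR input (hypothetical init_sr).
    sr_image, sr_state = init_sr(lr)
    landmarks, align_state = align_step(sr_image, None)

    sr_outputs: List[Vector] = []
    landmark_outputs: List[Vector] = []
    for _ in range(num_steps):
        # The SR branch uses the LR input, its own feedback state and the
        # landmarks estimated in the previous step.
        sr_image, sr_state = sr_step(lr, sr_state, landmarks)
        # The alignment branch re-estimates landmarks from the new SR image.
        landmarks, align_state = align_step(sr_image, align_state)
        sr_outputs.append(sr_image)
        landmark_outputs.append(landmarks)
    return sr_outputs, landmark_outputs


def mean_step_loss(outputs: List[Vector], target: Vector) -> float:
    """Average a per-step mean squared error over all steps (assumed loss form)."""
    per_step = [
        sum((o - t) ** 2 for o, t in zip(out, target)) / len(target)
        for out in outputs
    ]
    return sum(per_step) / len(per_step)
```

Supervising every step, rather than only the last one, is what lets early and possibly inaccurate landmark estimates be corrected as the loop proceeds; the final SR result is simply the last element of `sr_outputs`.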

3.2 Attentive Fusion Module

In existing methods, the straightforward way of utilizing facial prior knowledge is to concatenate facial priors with SR features and treat the whole optimization procedure as a multi-task learning problem. However, facial structures may not be fully exploited since features of different facial parts are usually extracted by a shared network. Hence the specific structural configuration priors existing in different facial components may be neglected by the networks. Therefore, different facial parts should be recovered separately for better performance. Cao et al. [3] exploit the global interdependency of facial parts by reinforcement learning. However, the sequential patch reconstruction cannot utilize facial priors explicitly and efficiently, which also limits the specialized generation for different facial components.

Differently, we achieve the above goals by a new structure-aware attentive fusion module so as to make full use of the guidance of landmarks $L$. We assume each landmark heatmap has $K$ channels indicating the locations of $K$ landmarks. The landmarks can be grouped into $P$ subsets, belonging to $P$ facial components including left eye, right eye, nose, mouth and jawline. Channels in each group are added together to form the heatmap for the corresponding facial component, denoted as $C_p$ ($p = 1, \dots, P$) and shown in Figure 3. The reason to do so rather than directly fusing the learned landmarks is twofold: (1) we explicitly highlight the local structure of each facial part to perform differential recovery; (2) the number of channels is largely reduced by the grouping process, which improves the efficiency of the framework. Then we can compute the corresponding attention maps $A_p$ by the softmax function along the channel dimension of these heatmaps, as below:

$$A_p(x, y) = \frac{\exp\big(C_p(x, y)\big)}{\sum_{j=1}^{P} \exp\big(C_j(x, y)\big)},$$

where $(x, y)$ represent the spatial coordinates of attention map $A_p$. Instead of using multiple models for different facial components, we apply group convolutions to generate individual features $f_p$. The flowchart is depicted in Figure 3. In order to make each group of convolutions concentrate on the corresponding parts, we define an attentive fusion as:

$$f_{fusion} = \sum_{p=1}^{P} A_p \cdot f_p,$$

where $f_{fusion}$ denotes the output features of the proposed attentive fusion module. Note that the attentive fusion module is a part of the recurrent SR branch, so that the gradients can be back-propagated to both the SR and alignment branches in a recursive manner. Moreover, the landmark estimation can be supervised not only by the loss imposed on the recurrent alignment branch, but also by the revision of FSR results through the attentive fusion module.
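The three operations described above (grouping landmark channels into component heatmaps, a softmax across components at every pixel, and an attention-weighted sum of component features) can be illustrated with nested lists. This is a sketch under stated assumptions: the function names and the index layout `features[component][channel][y][x]` are hypothetical, and the per-component features are taken as given rather than produced by group convolutions.

```python
import math
from typing import List, Sequence

Map2D = List[List[float]]


def component_heatmaps(landmark_maps: Sequence[Map2D],
                       groups: Sequence[Sequence[int]]) -> List[Map2D]:
    """Sum the landmark channels of each group into one heatmap per component."""
    h, w = len(landmark_maps[0]), len(landmark_maps[0][0])
    return [
        [[sum(landmark_maps[k][y][x] for k in group) for x in range(w)]
         for y in range(h)]
        for group in groups
    ]


def attention_maps(heatmaps: Sequence[Map2D]) -> List[Map2D]:
    """Softmax across the component dimension, computed independently per pixel."""
    h, w = len(heatmaps[0]), len(heatmaps[0][0])
    out = [[[0.0] * w for _ in range(h)] for _ in heatmaps]
    for y in range(h):
        for x in range(w):
            values = [hm[y][x] for hm in heatmaps]
            peak = max(values)  # subtract the max for numerical stability
            exps = [math.exp(v - peak) for v in values]
            total = sum(exps)
            for p, e in enumerate(exps):
                out[p][y][x] = e / total
    return out


def attentive_fusion(features: Sequence[Sequence[Map2D]],
                     attentions: Sequence[Map2D]) -> List[Map2D]:
    """Weight each component's features by its attention map and sum them.

    features[p][c][y][x]: feature channel c of component p.
    Each attention map is broadcast over the channel dimension c.
    """
    num_parts, num_channels = len(features), len(features[0])
    h, w = len(attentions[0]), len(attentions[0][0])
    return [
        [[sum(attentions[p][y][x] * features[p][c][y][x] for p in range(num_parts))
          for x in range(w)]
         for y in range(h)]
        for c in range(num_channels)
    ]
```

Because the attention weights at each pixel sum to one, the fused feature is a convex combination of the component features, dominated by whichever component's heatmap is strongest at that location.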

3.3 Objective Functions

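Adversarial Loss: Recently, GANs [20, 35, 5] have been successful in generative tasks and have proven effective in recovering high-fidelity images. Hence we introduce the adversarial loss [20] to generate photo-realistic face images. We build a discriminator $D$ to differentiate the ground-truth images $I_{HR}$ and the super-resolved counterparts by minimizing

$$\mathcal{L}_{dis} = -\mathbb{E}\big[\log D(I_{HR})\big] - \mathbb{E}\big[\log\big(1 - D(G(I_{LR}))\big)\big].$$

Meanwhile, the generator $G$ tries to fool the discriminator and minimizes

$$\mathcal{L}_{adv} = -\mathbb{E}\big[\log D(G(I_{LR}))\big].$$

Perceptual Loss: We also apply a perceptual loss to enhance the perceptual quality of SR images, similar to [20, 5]. We employ a pretrained face recognition model, LightCNN [37], to extract features for images. The loss improves the perceptual similarity by reducing the Euclidean distances between the features of SR and HR images, $\phi(I_{SR})$ and $\phi(I_{HR})$. Hence we define the perceptual loss as:

$$\mathcal{L}_{pcp} = \mathbb{E}\big[\big\| \phi(I_{SR}) - \phi(I_{HR}) \big\|_2\big].$$

Overall Objective: The generator is optimized by minimizing the following overall objective function:

$$\mathcal{L}_{G} = \mathcal{L}_{pixel} + \lambda_{align} \mathcal{L}_{align} + \lambda_{adv} \mathcal{L}_{adv} + \lambda_{pcp} \mathcal{L}_{pcp},$$

where $\mathcal{L}_{pixel}$ and $\mathcal{L}_{align}$ are the pixel-wise SR and alignment losses of Section 3.1, and $\lambda_{adv}$ and $\lambda_{pcp}$ denote the trade-off parameters for the adversarial loss and the perceptual loss, respectively. Since the recurrent alignment module is optimized as a part of the whole framework, the overall objective also includes this term of loss weighted by $\lambda_{align}$. For the training of our PSNR-oriented model DIC, we set $\lambda_{adv} = \lambda_{pcp} = 0$. Then the complete losses are used to obtain the perceptually pleasing model DICGAN.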
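As a rough illustration of how these terms combine, the sketch below computes a discriminator loss, a generator adversarial loss and the weighted overall objective from plain Python floats. The standard log-likelihood GAN form is an assumption here (the text only says the adversarial loss follows [20]), and the function names, the argument names `w_align`, `w_adv`, `w_pcp` and the clamping constant `EPS` are all hypothetical.

```python
import math
from typing import Sequence

EPS = 1e-12  # hypothetical clamp to keep log() finite


def discriminator_loss(d_real: Sequence[float], d_fake: Sequence[float]) -> float:
    """-E[log D(real)] - E[log(1 - D(fake))], with D outputs in (0, 1)."""
    real_term = sum(math.log(max(d, EPS)) for d in d_real) / len(d_real)
    fake_term = sum(math.log(max(1.0 - d, EPS)) for d in d_fake) / len(d_fake)
    return -real_term - fake_term


def adversarial_loss(d_fake: Sequence[float]) -> float:
    """-E[log D(fake)]: small when the discriminator is fooled."""
    return -sum(math.log(max(d, EPS)) for d in d_fake) / len(d_fake)


def generator_objective(pixel: float, align: float, adv: float, pcp: float,
                        w_align: float, w_adv: float, w_pcp: float) -> float:
    """Pixel loss plus weighted alignment, adversarial and perceptual terms."""
    return pixel + w_align * align + w_adv * adv + w_pcp * pcp
```

Setting `w_adv` and `w_pcp` to zero reduces the objective to the pixel and alignment terms only, which corresponds to the PSNR-oriented training described above.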

4 Experiments

4.1 Datasets and Metrics

Method          CelebA PSNR   CelebA SSIM   Helen PSNR   Helen SSIM
Bicubic         23.58         0.6285        23.89        0.6751
SRResNet [20]   25.82         0.7369        25.30        0.7297
URDGN [41]      24.63         0.6851        24.22        0.6909
RDN [45]        26.13         0.7412        25.34        0.7249
PFSR [15]       24.43         0.6991        24.73        0.7323
FSRNet [5]      26.48         0.7718        25.90        0.7759
FSRGAN [5]      25.06         0.7311        24.99        0.7424
DIC             27.37         0.7962        26.69        0.7933
DICGAN          26.34         0.7562        25.96        0.7624
Table 1: Comparison of PSNR and SSIM performance with state-of-the-art FSR methods. The best and second best performance is highlighted in red and blue, respectively.

Figure 4 panels (left to right): Bicubic, RDN [45], FSRNet [5], FSRGAN [5], PFSR [15], DIC, DICGAN, HR.

Figure 4: Visual comparison with state-of-the-art FSR methods. Other FSR methods may either produce structural distortions on key facial parts or present undesirable artifacts. Our proposed DIC and DICGAN methods have a significant advantage in handling large pose and rotation variations. The qualitative comparison indicates the proposed method outperforms other FSR methods. Best viewed on screen.

We conduct experiments on two widely used face datasets: CelebA [23] and Helen [19]. For both datasets we use OpenFace [2, 42, 1] to detect 68 landmarks as ground-truth. Based on the estimated landmarks, we crop square regions in each image to remove the background and resize them to 128×128 pixels without any pre-alignment. Then we downsample these HR images into 16×16 LR inputs with bicubic degradation. For the CelebA dataset, we use 168,854 images for training and 1,000 images for testing. For the Helen dataset, we use 2,005 images for training and 50 images for testing.
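The cropping rule is not spelled out beyond "square regions based on the estimated landmarks", so the following is only one plausible reading: take the landmark bounding box, expand its shorter side to make it square, and clamp the square to the image. The function name `square_crop_box`, the absence of any margin and the clamping behavior are assumptions made for illustration.

```python
from typing import Sequence, Tuple


def square_crop_box(landmarks: Sequence[Tuple[float, float]],
                    image_w: float, image_h: float) -> Tuple[float, float, float]:
    """Return (left, top, side) of a square centered on the landmark bounding box."""
    xs = [x for x, _ in landmarks]
    ys = [y for _, y in landmarks]
    cx = (min(xs) + max(xs)) / 2.0
    cy = (min(ys) + max(ys)) / 2.0
    # Use the longer side of the bounding box, but never exceed the image.
    side = min(max(max(xs) - min(xs), max(ys) - min(ys)), image_w, image_h)
    # Shift the square back inside the image if it sticks out.
    left = min(max(cx - side / 2.0, 0.0), image_w - side)
    top = min(max(cy - side / 2.0, 0.0), image_h - side)
    return left, top, side


HR_SIZE = 128  # HR crop resolution stated in the text
LR_SIZE = 16   # LR input resolution stated in the text
UPSCALE = HR_SIZE // LR_SIZE  # upscaling factor implied by the two sizes
```

The crop would then be resized to `HR_SIZE` and bicubically downsampled to `LR_SIZE` with an image library of choice; that step is omitted here to keep the sketch dependency-free.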

SR results are evaluated with PSNR and SSIM [36]. They are computed on the Y channel of transformed YCbCr space. We also use face alignment as a metric to measure the accuracy of face recovery. We use a pretrained HourGlass network to detect the face landmarks and use Normalized Root Mean Squared Error (NRMSE) to evaluate landmark estimation results. In our experiment, NRMSE is normalized by the width of the face.
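For concreteness, the two kinds of metric can be sketched as follows. The luma conversion uses the ITU-R BT.601 studio-range coefficients for 8-bit RGB, which is a common convention for "Y channel of YCbCr" but is an assumption here, as is the exact NRMSE form (mean point-to-point Euclidean error divided by the face width). All function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def rgb_to_y(r: float, g: float, b: float) -> float:
    """BT.601 luma for 8-bit RGB inputs, output in the range [16, 235]."""
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0


def psnr(reference: Sequence[float], estimate: Sequence[float],
         peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return float("inf")
    return 10.0 * math.log10(peak ** 2 / mse)


def nrmse(predicted: Sequence[Tuple[float, float]],
          ground_truth: Sequence[Tuple[float, float]],
          face_width: float) -> float:
    """Mean Euclidean landmark error, normalized by the face width."""
    distances = [
        math.hypot(px - gx, py - gy)
        for (px, py), (gx, gy) in zip(predicted, ground_truth)
    ]
    return sum(distances) / (len(distances) * face_width)
```

Higher PSNR and lower NRMSE both indicate a better reconstruction; the second metric is sensitive to structural errors in facial components that a pixel-averaged score can hide.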

4.2 Implementation Details

Training Setting The architecture of the recurrent SR module follows the feedback block in [21]. We set the number of groups to 6, the number of steps to 4 and the number of feature channels to 48. For Helen, data augmentation is performed on training images, which are randomly rotated by , , and flipped horizontally. We train the PSNR-oriented model with the pixel loss and the alignment loss weighted by . For GAN training, we use the pretrained PSNR-oriented parameters as initialization and train the model with and . The model is trained by ADAM optimizer [18] with and . The initial learning rate is and is halved at

iterations. Our experiments are implemented in PyTorch [26] with NVIDIA RTX 2080Ti GPUs.

4.3 Results and Analysis

Comparison with the State of the Art: We compare our proposed DIC method with state-of-the-art FSR methods. Table 1 tabulates the quantitative results on CelebA and Helen. It can be observed that our DIC method achieves the best PSNR and SSIM performance on both datasets. It is noteworthy that DIC outperforms FSRNet by a large margin (27.37 dB vs. 26.48 dB on CelebA and 26.69 dB vs. 25.90 dB on Helen). This suggests that the progressive collaboration between the SR and alignment processes improves the reconstruction. Moreover, DICGAN achieves performance comparable to FSRNet, which is a PSNR-oriented method (26.34 dB vs. 26.48 dB on CelebA and 25.96 dB vs. 25.90 dB on Helen). This indicates that our DICGAN method is able to preserve pixel-wise accuracy while increasing the perceptual quality of the super-resolved images.

We visualize some SR results of different methods as shown in Figure 4. We see that DIC recovers correct details while other methods fail in giving pleasant results. This indicates that our method is able to produce more stable SR results than other methods. Note that our method has a significant advantage in handling large pose and rotation variations. The reason is that the iterative alignment block can predict progressively more accurate landmarks to guide the reconstruction in each step. Therefore our method performs better in preserving facial structures and generating better details even though faces have large pose and rotation. Furthermore, DICGAN produces more realistic textures of images while other methods yield severe artifacts and distortions. Therefore, the qualitative comparison with state-of-the-art face SR methods demonstrates the powerful generative ability of our methods.

Figure 5 panels (two examples, each showing Bicubic, Step 1 to Step 4 and HR), with the per-step values printed under the panels:
Example   Metric      Step 1          Step 2          Step 3          Step 4
1         PSNR/SSIM   24.73/0.7605    27.94/0.8667    28.32/0.8782    28.32/0.8791
1         NRMSE       0.0265          0.0211          0.0204          0.0194
2         PSNR/SSIM   26.35/0.7690    27.38/0.8212    27.90/0.8316    28.07/0.8350
2         NRMSE       0.0293          0.0266          0.0260          0.0249

Figure 5: Visual comparison of different steps. With the iterative collaboration, visual quality and quantitative measurement both get better progressively.

Similar to [5], we conduct face alignment as a measurement to evaluate the quality of the super-resolved images. We adopt a pretrained face alignment model with four stacked hourglass modules [25]. The alignment accuracy is reflected by a widely used metric NRMSE. Lower NRMSE values reveal better alignment accuracy and higher quality of SR images. Table 2 shows the NRMSE values of our methods and other compared SR methods. We can see our DICGAN method outperforms other methods on both datasets. While other SR methods also use facial priors such as landmarks and component maps, the prior information is estimated from the input LR face images or coarsely recovered ones where facial structures are severely unclear and degraded. Hence such facial priors can provide limited guidance to the reconstruction procedure. Consequently, recovered images may also contain corresponding structural incorrectness. Differently, our method revises the landmark estimation in every step for providing more accurate auxiliary information to the SR branch. Meanwhile, the attentive fusion module can integrate the prior guidance effectively to boost the final performance.

User Study: We also conduct a user study as a subjective assessment to further evaluate our SR quality compared to previous face SR methods. Details are described in the supplementary material.

Method        CelebA    Helen
Bicubic       0.3385    0.4577
RDN [45]      0.1415    0.4437
PFSR [15]     0.1917    0.3498
FSRNet [5]    0.1430    0.3723
FSRGAN [5]    0.1463    0.3408
DIC           0.1320    0.3674
DICGAN        0.1319    0.3336
Table 2: Comparison of NRMSE performance with state-of-the-art FSR methods. The best and second best performance is highlighted in red and blue, respectively.

Metric   Step 1   Step 2   Step 3   Step 4
PSNR     24.41    25.71    26.30    26.34
SSIM     0.6688   0.7180   0.7521   0.7561
NRMSE    0.0322   0.0306   0.0285   0.0273
Table 3: Quantitative comparison of different steps on CelebA. The best results are highlighted.

Metric   Step 1   Step 2   Step 3   Step 4
PSNR     24.88    25.45    25.96    25.96
SSIM     0.7094   0.7332   0.7587   0.7624
NRMSE    0.1057   0.0854   0.0837   0.0520
Table 4: Quantitative comparison of different steps on Helen. The best results are highlighted.

Study of Iterative Learning: To better show the merits of the proposed scheme of iterative collaboration, we also evaluate the quality of the SR outputs. As mentioned above, we use PSNR, SSIM and NRMSE as measurement metrics. Differently, in this experiment, NRMSE is computed by the landmarks estimated by the alignment branch in the corresponding steps. The performance on CelebA and Helen is presented in Table 3 and Table 4, respectively. We can see from step 1 to step 4, the performance gets better progressively. It is noteworthy that the NRMSE values in Table 3 and Table 4 are much lower than those in Table 2. In fact, in our alignment branch, the parameters are much fewer than the stacked hourglass model which is used to estimate landmarks in Table 2. The reason why our model gets more accurate alignment results with fewer parameters is that our model can learn to capture face structures in different-level super-resolved images. Due to this ability, our model can provide relatively accurate landmarks in each step for better collaboration. Therefore, the comparison proves that our method is able to achieve progressively better SR quality and landmark estimation simultaneously.

Furthermore, visual comparisons of different steps are shown in Figure 5. The results show that the generation of facial components is improved step by step. In the last step, our model obtains geometrically pleasing and high-fidelity SR images. From the PSNR, SSIM and NRMSE values in each step, we can also see the consistent improvement of our scheme of iterative collaboration. Moreover, from Table 3, Table 4 and Figure 5, three steps may be a suitable choice for good enough recovery and efficient computation.

Effects of Attentive Fusion: We implement another experiment to better investigate the effectiveness of our proposed attentive fusion module. Since we use group convolution layers to extract specialized representations for different facial parts, we retain only the representation of one part and visualize the SR results as shown in Figure 6. For a certain component, we remove the features of the other components by setting the corresponding attention maps to 0. By this means, the final outputs only contain accurate information for one facial component. From Figure 6, we can indeed see that different parts can be recovered separately by the representations. The results demonstrate the advantages of the proposed attentive fusion module, which can explicitly guide the component-specialized generation in an efficient and flexible way.

Ablation Study: We further implement an ablation study to measure the effectiveness of the iterative collaboration framework and the attentive fusion module. On the one hand, in order to validate the effects of facial priors, we remove the alignment branch and the attentive fusion module. This model is called DIC-NL, which is equivalent to a recurrent network for single image super-resolution without the prior information of landmark maps. On the other hand, we remove the attentive fusion module and concatenate landmarks (CL) to evaluate the effects of the proposed fusion module quantitatively. This model is denoted as DIC-CL. PSNR and SSIM performance on the dataset of CelebA is presented in Table 5. From the table we can see when the SR network loses the guidance provided by face landmarks, SR quality is degraded severely since its ability to capture facial structural configuration is weakened. Moreover, DIC-CL has an advantage over DIC-NL since it incorporates the prior information by concatenation. A large enhancement can also be observed due to the integration. However, the SR performance of DIC-CL is still far from that of the DIC method. The reason is that concatenating landmark maps is an implicit knowledge to face SR and is limited in providing adequate guidance. Differently, our DIC method not only integrates the structural knowledge, but also explicitly induces the component-specialized feature extraction for more photo-realistic SR images. Hence the results prove the superiority of the proposed method.

Figure 6 panels. First row: left eye attention, right eye attention, nose attention, mouth attention, HR. Second row: left eye image, right eye image, nose image, mouth image, SR.

Figure 6: Visual effects of the proposed attentive fusion module. The first row displays the attention maps and the ground-truth image. The second row presents the SR outputs recovered by the features of the corresponding facial components. The component-specialized generation demonstrates the effectiveness of the proposed attentive fusion module.
Method    PSNR     SSIM
DIC-NL    26.31    0.7526
DIC-CL    26.93    0.7811
DIC       27.37    0.7962
Table 5: Quantitative comparison of different models. The best results are highlighted. (NL: no landmarks, CL: concatenated landmarks.)

5 Conclusion

In this paper, we have proposed a deep iterative collaboration network for face super-resolution. Specifically, a recurrent SR branch collaborates with a recurrent alignment branch to recover high-quality face SR images iteratively and progressively. In each step, the SR process utilizes the estimated landmarks from the alignment branch to produce better face images which are important for the alignment branch to estimate more accurate landmarks. Furthermore, we have proposed a new attentive fusion module to exploit attention maps and extract individual features for each facial component according to the estimated landmarks. Quantitative and qualitative results of face SR on two widely-used benchmark datasets have demonstrated the effectiveness of the proposed method.


This work was supported in part by the National Key Research and Development Program of China under Grant 2017YFA0700802, in part by the National Natural Science Foundation of China under Grant 61822603, Grant U1813218, Grant U1713214, and Grant 61672306, in part by the Shenzhen Fundamental Research Fund (Subject Arrangement) under Grant JCYJ20170412170602564, and in part by Tsinghua University Initiative Scientific Research Program.


References

  • [1] T. Baltrusaitis, P. Robinson, and L. Morency (2013) Constrained local neural fields for robust facial landmark detection in the wild. In ICCVW, pp. 354–361.
  • [2] T. Baltrusaitis, A. Zadeh, Y. C. Lim, and L. Morency (2018) OpenFace 2.0: facial behavior analysis toolkit. In FG, pp. 59–66.
  • [3] Q. Cao, L. Lin, Y. Shi, X. Liang, and G. Li (2017) Attention-aware face hallucination via deep reinforcement learning. In CVPR, pp. 690–698.
  • [4] A. Chakrabarti, A. Rajagopalan, and R. Chellappa (2007) Super-resolution of face images using kernel PCA-based prior. TMM 9 (4), pp. 888–892.
  • [5] Y. Chen, Y. Tai, X. Liu, C. Shen, and J. Yang (2018) FSRNet: end-to-end learning face super-resolution with facial priors. In CVPR, pp. 2492–2501.
  • [6] C. Dong, C. C. Loy, K. He, and X. Tang (2014) Learning a deep convolutional network for image super-resolution. In ECCV, pp. 184–199.
  • [7] W. Han, S. Chang, D. Liu, M. Yu, M. Witbrock, and T. S. Huang (2018) Image super-resolution via dual-state recurrent networks. In CVPR, pp. 1654–1663.
  • [8] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In CVPR, pp. 4700–4708.
  • [9] H. Huang, H. He, X. Fan, and J. Zhang (2010) Super-resolution of human face image using canonical correlation analysis. PR 43 (7), pp. 2532–2543.
  • [10] H. Huang, R. He, Z. Sun, and T. Tan (2017) Wavelet-SRNet: a wavelet-based CNN for multi-scale face super resolution. In ICCV, pp. 1689–1697.
  • [11] K. Jia and S. Gong (2008) Generalized face super-resolution. TIP 17 (6), pp. 873–886.
  • [12] J. Jiang, R. Hu, Z. Wang, and Z. Han (2014) Face super-resolution via multilayer locality-constrained iterative neighbor embedding and intermediate dictionary learning. TIP 23 (10), pp. 4220–4231.
  • [13] J. Johnson, A. Alahi, and L. Fei-Fei (2016) Perceptual losses for real-time style transfer and super-resolution. In ECCV, pp. 694–711.
  • [14] C. Jung, L. Jiao, B. Liu, and M. Gong (2011) Position-patch based face hallucination using convex optimization. SPL 18 (6), pp. 367–370. Cited by: §1.
  • [15] D. Kim, M. Kim, G. Kwon, and D. Kim (2019) Progressive face super-resolution via attention to facial landmark. arXiv preprint arXiv:1908.08239. Cited by: Appendix B, Appendix C, §2, Figure 4, Table 1, Table 2.
  • [16] J. Kim, J. Kwon Lee, and K. Mu Lee (2016) Accurate image super-resolution using very deep convolutional networks. In CVPR, pp. 1646–1654. Cited by: §2.
  • [17] J. Kim, J. Kwon Lee, and K. Mu Lee (2016) Deeply-recursive convolutional network for image super-resolution. In CVPR, pp. 1637–1645. Cited by: §2.
  • [18] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.2.
  • [19] V. Le, J. Brandt, Z. Lin, L. Bourdev, and T. S. Huang (2012) Interactive facial feature localization. In ECCV, pp. 679–692. Cited by: §1, §4.1.
  • [20] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. (2017) Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, pp. 4681–4690. Cited by: §2, §3.3, §3.3, Table 1.
  • [21] Z. Li, J. Yang, Z. Liu, X. Yang, G. Jeon, and W. Wu (2019) Feedback network for image super-resolution. In CVPR, pp. 3867–3876. Cited by: §2, §4.2.
  • [22] C. Liu, H. Shum, and W. T. Freeman (2007) Face hallucination: theory and practice. IJCV 75 (1), pp. 115–134. Cited by: §1.
  • [23] Z. Liu, P. Luo, X. Wang, and X. Tang (2015) Deep learning face attributes in the wild. In ICCV, pp. 3730–3738. Cited by: Appendix B, §1, §4.1.
  • [24] X. Ma, J. Zhang, and C. Qi (2010) Hallucinating face by position-patch. PR 43 (6), pp. 2224–2236. Cited by: §1.
  • [25] A. Newell, K. Yang, and J. Deng (2016) Stacked hourglass networks for human pose estimation. In ECCV, pp. 483–499. Cited by: Appendix A, §1, §4.3.
  • [26] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. Cited by: §4.2.
  • [27] M. S. Rad, B. Bozorgtabar, U. Marti, M. Basler, H. K. Ekenel, and J. Thiran (2019) SROBB: targeted perceptual loss for single image super-resolution. In ICCV, pp. 2710–2719. Cited by: §2.
  • [28] M. S. Sajjadi, B. Scholkopf, and M. Hirsch (2017) Enhancenet: single image super-resolution through automated texture synthesis. In ICCV, pp. 4491–4500. Cited by: §1.
  • [29] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang (2016) Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In CVPR, pp. 1874–1883. Cited by: §1.
  • [30] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §2.
  • [31] Y. Tai, J. Yang, and X. Liu (2017) Image super-resolution via deep recursive residual network. In CVPR, pp. 3147–3155. Cited by: §2.
  • [32] O. Tuzel, Y. Taguchi, and J. R. Hershey (2016) Global-local face upsampling network. arXiv preprint arXiv:1603.07235. Cited by: §2.
  • [33] N. Wang, D. Tao, X. Gao, X. Li, and J. Li (2014) A comprehensive survey to face hallucination. IJCV 106 (1), pp. 9–30. Cited by: §1.
  • [34] X. Wang, K. Yu, C. Dong, and C. Change Loy (2018) Recovering realistic texture in image super-resolution by deep spatial feature transform. In CVPR, pp. 606–615. Cited by: §1.
  • [35] X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, Y. Qiao, and C. C. Loy (2018) Esrgan: enhanced super-resolution generative adversarial networks. In ECCV, pp. 63–79. Cited by: §1, §3.3.
  • [36] Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, et al. (2004) Image quality assessment: from error visibility to structural similarity. TIP 13 (4), pp. 600–612. Cited by: §4.1.
  • [37] X. Wu, R. He, Z. Sun, and T. Tan (2018) A light cnn for deep face representation with noisy labels. TIFS 13 (11), pp. 2884–2896. Cited by: §3.3.
  • [38] C. Yang, S. Liu, and M. Yang (2013) Structured face hallucination. In CVPR, pp. 1099–1106. Cited by: §1.
  • [39] X. Yu, B. Fernando, B. Ghanem, F. Porikli, and R. Hartley (2018) Face super-resolution guided by facial component heatmaps. In ECCV, pp. 217–233. Cited by: §1, §2.
  • [40] X. Yu, B. Fernando, R. Hartley, and F. Porikli (2018) Super-resolving very low-resolution face images with supplementary attributes. In CVPR, pp. 908–917. Cited by: §2.
  • [41] X. Yu and F. Porikli (2016) Ultra-resolving face images by discriminative generative networks. In ECCV, pp. 318–333. Cited by: §2, Table 1.
  • [42] A. Zadeh, Y. Chong Lim, T. Baltrusaitis, and L. Morency (2017) Convolutional experts constrained local model for 3d facial landmark detection. In ICCV, pp. 2519–2528. Cited by: §4.1.
  • [43] K. Zhang, Z. Zhang, C. Cheng, W. H. Hsu, Y. Qiao, W. Liu, and T. Zhang (2018) Super-identity convolutional neural network for face hallucination. In ECCV, pp. 183–198. Cited by: §2.
  • [44] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu (2018) Image super-resolution using very deep residual channel attention networks. In ECCV, pp. 286–301. Cited by: §1.
  • [45] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu (2018) Residual dense network for image super-resolution. In CVPR, pp. 2472–2481. Cited by: Appendix C, §2, Figure 4, Table 1, Table 2.
  • [46] S. Zhu, S. Liu, C. C. Loy, and X. Tang (2016) Deep cascaded bi-network for face hallucination. In ECCV, pp. 614–630. Cited by: §1, §2.

Supplementary Material

Appendix A More Details on Network Architecture

Here we describe more details of our recurrent networks. Table 6 shows the detailed architecture of the SR branch. Given input LR images, LR features are extracted by the feature extraction layers and are subsequently concatenated with the feedback features. The concatenated features then pass through a block consisting of a convolutional layer, an attentive fusion module and a recurrent SR module, and the obtained features are used both as the feedback signals and as the input to the following generation layers. Finally, SR images are recovered by the generation layers and the addition operation. The generation layers comprise a deconvolutional layer with a kernel size of 8 and a convolutional layer.

Besides, Table 7 presents the details of our recurrent alignment branch. The pre-processing and post-processing blocks have the same architecture as those in [25], except that the batch normalization layers are removed. The recurrent hourglass module has an architecture similar to the single hourglass module in [25]. The difference is that its input and output each include two components. The input is obtained by concatenating the pre-processing feature with the feedback feature, while the output is split into two parts: a feedback feature and a feature for the final landmark estimation.
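
The concatenate-then-split data flow of the recurrent hourglass module described above can be sketched as follows. This is only an illustration under stated assumptions, not the actual implementation: features are modeled as plain Python lists, the feedback is assumed to start at zero, and the names `recurrent_alignment`, `toy_hourglass` and `toy_post` are hypothetical stand-ins for the real pre-processing, hourglass and post-processing blocks.

```python
def recurrent_alignment(pre_features, hourglass, post, feedback_size):
    """Run a recurrent module with a concatenated input and a split output.

    pre_features: one pre-processed feature (a list of floats) per recurrent step.
    hourglass:    callable mapping the concatenated feature to an output list.
    post:         callable mapping the landmark feature to a landmark estimate.
    """
    feedback = [0.0] * feedback_size  # assumption: zero feedback at the first step
    estimates = []
    for pre_feature in pre_features:
        # Input: pre-processing feature concatenated with the feedback feature.
        joint = list(pre_feature) + feedback
        out = hourglass(joint)
        # Output: split into a feedback part and a part for landmark estimation.
        feedback = out[:feedback_size]
        landmark_feature = out[feedback_size:]
        estimates.append(post(landmark_feature))
    return estimates


def toy_hourglass(x):
    # Hypothetical stand-in: two feedback values followed by one landmark feature.
    s = sum(x)
    return [0.5 * s, 0.25 * s, s]


def toy_post(feature):
    # Hypothetical stand-in for the post-processing block.
    return feature[0]


estimates = recurrent_alignment(
    [[1.0, 2.0], [2.0, 2.0]], toy_hourglass, toy_post, feedback_size=2
)
print(estimates)
```

The estimate at the second step depends on the feedback produced at the first step, which is the property the recurrent design relies on.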

Appendix B User Study

We conduct a user study to further evaluate the visual quality of the super-resolved images. We randomly select 30 images from the testing set of CelebA [23] and display the corresponding SR results of our DICGAN, FSRGAN [5] and PFSR [15], together with the HR images, in a random order. A total of 39 human raters are asked to rank these four versions of each image in terms of perceptual satisfaction. The results are shown in Figure 7. As expected, most of the HR images are regarded as the best among the four versions. Moreover, our DICGAN obtains many more rank-1 and rank-2 votes than FSRGAN and PFSR, which indicates that the proposed method outperforms the state-of-the-art face SR methods by a large margin. We observe that PFSR scores the worst among the three FSR methods. We attribute this to PFSR mainly focusing on well-aligned face images: when the input faces exhibit large variations in pose and rotation, PFSR fails to produce satisfactory SR results.
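
As an illustration of how rank votes of this kind can be aggregated, the sketch below counts how often each version receives each rank. The function name `tally_ranks` and the toy rankings are hypothetical and are not data from the study; only the four version names come from the text above.

```python
from collections import Counter

METHODS = ("DICGAN", "FSRGAN", "PFSR", "HR")


def tally_ranks(rankings):
    """Count how often each method receives each rank.

    rankings: iterable of orderings (best first), one per rater and image.
    Returns a dict mapping method name to a Counter of rank -> votes.
    """
    counts = {method: Counter() for method in METHODS}
    for ordering in rankings:
        if sorted(ordering) != sorted(METHODS):
            raise ValueError("each ordering must contain every method exactly once")
        for rank, method in enumerate(ordering, start=1):
            counts[method][rank] += 1
    return counts


# Three hypothetical rankings, for illustration only.
toy_rankings = [
    ("HR", "DICGAN", "FSRGAN", "PFSR"),
    ("DICGAN", "HR", "FSRGAN", "PFSR"),
    ("HR", "DICGAN", "PFSR", "FSRGAN"),
]
votes = tally_ranks(toy_rankings)
print(votes["DICGAN"][1], votes["DICGAN"][2])
```

If every rater ranks every image, 30 images and 39 raters give 30 × 39 = 1170 votes at each rank.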

Appendix C Visual Results

In Figure 8 and Figure 9 (the next pages), we present more qualitative comparisons with state-of-the-art FSR methods including RDN [45], FSRNet [5], FSRGAN [5] and PFSR [15]. The results demonstrate the effectiveness of our proposed method.

Layer                     Output size
Conv ()
PixelShuffle ()
Conv ()
Attentive Fusion ()
Recurrent SR Module ()
Deconv ()
Conv ()

Table 6: Detailed architecture of the recurrent SR branch.

Layer                     Output size
Conv ()
Recurrent HourGlass ()

Table 7: Detailed architecture of the recurrent alignment branch.

Figure 7: Results of the user study. Our method performs better than state-of-the-art FSR methods in recovering perceptually pleasing face images.

[Figure 8 panels. Rows: images 201448, 201475, 201589, 202085, 202301 and 201936 from CelebA. Columns: Bicubic, RDN, FSRNet, FSRGAN, PFSR, DIC, DICGAN, HR.]

Figure 8: Qualitative comparison with state-of-the-art face super-resolution methods.

[Figure 9 panels. Rows: images 201937, 201940, 201941 and 201953 from CelebA; 3219692565_1 and 3255054809_1 from Helen. Columns: Bicubic, RDN, FSRNet, FSRGAN, PFSR, DIC, DICGAN, HR.]

Figure 9: Qualitative comparison with state-of-the-art face super-resolution methods.