Face photographs record long-lasting precious memories of individuals and historical moments of human civilization. Yet the limited conditions in the acquisition, storage, and transmission of images inevitably involve complex, heterogeneous degradations in real-world scenarios, including discrete sampling, additive noise, lossy compression, and beyond. With great application and research value, face restoration has attracted wide attention from industry and academia, with a plethora of works (Maru and Parikh, 2017)(Sahu et al., 2019)(Liu et al., 2019a) devoted to addressing specific types of image degradation. Yet under more generalized, unconstrained application scenarios, few existing works can report satisfactory restoration results.
For face restoration, most existing methods typically work in a “non-blind” fashion with specific degradation of prescribed type and intensity, leading to a variety of sub-tasks including super resolution (Yang et al., 2018)(Dong et al., 2014)(Lim et al., 2017)(Wang et al., 2018c), hallucination (Rajput et al., 2018)(Li et al., 2019a), denoising (Anwar and Barnes, 2019)(Yue et al., 2019), deblurring (Sahu et al., 2019)(Kupyn et al., 2017)(Kupyn et al., 2019) and compression artifact removal (Liu et al., 2019a)(Dong et al., 2015)(Mao et al., 2018). However, task-specific methods typically exhibit poor generalization over real-world images with complex and heterogeneous degradations. A case in point, shown in Fig. 1, is a historic group photograph taken at the Solvay Conference in 1927: the super-resolution methods ESRGAN (Wang et al., 2018c) and Super-FAN (Bulat and Tzimiropoulos, 2018) tend to introduce additional artifacts, while the other three task-specific restoration methods barely make any difference in suppressing degradation artifacts or replenishing fine details of hair textures, wrinkles, etc., revealing the impracticality of task-specific restoration methods.
When it comes to blind image restoration (Newman and Hildebrandt, 1987), researchers aim to recover high-quality images from their degraded observations in a “single-blind” manner without a priori knowledge about the type and intensity of the degradation. It is often challenging to reconstruct image contents from artifacts without degradation prior, necessitating additional guidance information such as categorical (Anwar et al., 2017) or structural prior (Chen et al., 2017) to facilitate the replenishment of faithful and photo-realistic details. For blind face restoration (Lin et al., 2018)(Chen et al., 2018), facial landmarks (Bulat and Tzimiropoulos, 2018), parsing maps (Wang et al., 2018b), and component heatmaps (Yu et al., 2018) are typically utilized as external guidance labels. In particular, Li et al. explored the guided face restoration problem (Li et al., 2018)(Li et al., 2020a), where an additional high-quality face is utilized to promote fine-grained detail replenishment. However, such reliance on guidance limits the feasibility of restoring photographs without ground truth annotations. Furthermore, for real-world images with complex background, introducing unnecessary guidance could lead to inconsistency between the quality of renovated faces and unattended background contents.
In this paper, we formally propose “Face Renovation”(FR), an extra challenging, yet more practical task for photorealistic face restoration in a “dual-blind” condition, lifting the requirements of both the degradation and structural prior for training. Specifically, we formulate FR as a semantic-guided face synthesis problem, and propose to tackle this problem with a collaborative suppression and replenishment(CSR) framework. To implement FR, we propose HiFaceGAN, a generative framework with several nested CSR units to implement FR in a multi-stage fashion with hierarchical semantic guidance. Each CSR unit contains a suppression module for extracting layered semantic features with content-adaptive convolution, which are utilized to guide the replenishment of corresponding semantic contents. Extensive experiments are conducted on both the synthetic FFHQ (Karras et al., 2018b) and real-world photographs against competitive degradation-specific baselines, highlighting the challenges in proposed face renovation and the superiority of our proposed HiFaceGAN. In summary, our contributions are threefold:
We present a challenging yet practical task, termed “Face Renovation (FR)”, to tackle the unconstrained face restoration problem in a “dual-blind” fashion, lifting the requirements on both the degradation and structural prior.
We propose a collaborative suppression and replenishment (CSR) framework “HiFaceGAN” with a nested architecture for multi-stage face renovation with hierarchical semantic guidance. Specifically, the extracted semantic hierarchy, the working mechanism of HiFaceGAN, and its advantages over existing restoration methods are thoroughly explained with illustrative examples.
Extensive experiments are conducted on both synthetic and real face images with significant performance gain over a variety of “non-blind” and “single-blind” baselines, verifying the versatility, robustness and generalization capability of our proposed HiFaceGAN.
2. Related Works
2.1. Non-Blind Face Restoration
Image restoration consists of a variety of subtasks, such as denoising (Anwar and Barnes, 2019)(Yue et al., 2019), deblurring (Kupyn et al., 2017)(Kupyn et al., 2019) and compression artifact removal (Dong et al., 2015)(Mao et al., 2018). In particular, image super resolution (Dong et al., 2014)(Lim et al., 2017)(Ledig et al., 2017)(Wang et al., 2018c) and its counterpart for faces, hallucination (Rajput et al., 2018)(Grm et al., 2019)(Shi and Zhao, 2019)(Li et al., 2019a), can be considered as a specific type of restoration against downsampling. However, existing works often operate in a “non-blind” fashion by prescribing the degradation type and intensity during training, leading to dubious generalization ability over real images with complex, heterogeneous degradations. In this paper, we perform face renovation by replenishing facial details based on hierarchical semantic guidance that is more robust against mixed degradations, and achieve superior performance over a wide range of restoration subtasks against state-of-the-art “non-blind” baselines.
2.2. Blind Face Restoration
Blind image restoration (Newman and Hildebrandt, 1987)(Bai et al., 2019)(Li et al., 2020b) aims to directly learn the restoration mapping based on observed samples. However, most existing methods for general natural images are still sensitive to the degradation profile (Elron et al., 2020) and exhibit poor generalization over unconstrained testing conditions. For category-specific (Anwar et al., 2017) (face) restoration, it is commonly believed that incorporating external guidance on facial prior would boost the restoration performance, such as semantic prior (Liu et al., 2019b), identity prior (Grm et al., 2019), facial landmarks (Bulat and Tzimiropoulos, 2018)(Chen et al., 2017) or component heatmaps (Yu et al., 2018). In particular, Li et al. (Li et al., 2018) explored the guided face restoration scenario with an additional high-quality guidance image to help with the generation of facial details. Other works utilize objectives related to subsequent vision tasks to guide the restoration, such as semantic segmentation (Liu et al., 2017) and recognition (Zhang et al., 2011). In this paper, we further explore the “dual-blind” case targeting unconstrained face renovation in real-world applications. Particularly, we reveal the surprising fact that with collaborative suppression and replenishment, the dual-blind face renovation network can even outperform state-of-the-art “single-blind” methods due to the increased capability for enhancing non-facial contents, which provides new insights for tackling the unconstrained face restoration problem from a generative view.
2.3. Deep Generative Models for Face Images
Deep generative models, especially GANs (Goodfellow et al., 2014) have greatly facilitated conditional image generation tasks (Isola et al., 2016)(Zhu et al., 2017a), especially for high-resolution faces (Karras et al., 2018a)(Karras et al., 2018b)(Karras et al., 2019). Existing methods can be roughly summarized into two categories: semantic-guided methods, utilizing parsing maps (Wang et al., 2018b), edges (Wang et al., 2018a), facial landmarks (Bulat and Tzimiropoulos, 2018) or anatomical action units (Pumarola et al., 2018) to control the layout and expression of generated faces, and style-guided generation (Karras et al., 2018b)(Karras et al., 2019), utilizing adaptive instance normalization (Huang and Belongie, 2017) to inject style guidance information into generated images. Also, combining semantic and style guidance together leads to multi-modal image generation (Zhu et al., 2017b), enabling separable pose and appearance control of the output images. Inspired by SPADE (Park et al., 2019) and SEAN (Peihao Zhu and Wonka, 2019) for semantic-guided image generation based on external parsing maps, our HiFaceGAN utilizes the SPADE layers to implement collaborative suppression and replenishment for multi-stage face renovation, which progressively replenishes plausible details based on hierarchical semantic guidance, leading to an automated renovation pipeline without external guidance.
3. Face Renovation
Generally, the acquisition and storage of digitized images involve many sources of degradation, including but not limited to discrete sampling, camera noise and lossy compression, as shown in the introductory figure. Non-blind face restoration methods typically focus on reversing a specific source of degradation, such as super resolution, denoising and compression artifact removal, leading to limited generalization capability over varying degradation types (Fig. 1). On the other hand, blind face restoration often relies on the structural prior or external guidance labels for training, leading to quality inconsistency between foreground and background contents. To resolve the issues in existing face restoration works, we present face renovation to explore the capability of generative models for “dual-blind” face restoration without degradation prior and external guidance. Although it would be ideal to collect authentic low-quality and high-quality image pairs of real persons for better degradation modeling, the associated legal issues concerning privacy and portraiture rights are often hard to circumvent. In this work, we perturb a challenging, yet purely artificial face dataset (Karras et al., 2018b) with heterogeneous degradation in varying types and intensities to simulate the real-world scenes for FR. Thereafter, the methodology and comprehensive evaluation metrics for FR are analyzed in detail.
3.1. Degradation Simulation
With richer facial details, more complex background contents, and higher diversity in gender, age, and ethnic groups, the synthetic dataset FFHQ (Karras et al., 2018b) is chosen for evaluating FR models with sufficient challenges. We simulate real-world image degradation by perturbing the FFHQ dataset with different types of degradation corresponding to the respective face processing subtasks, which will also be evaluated with our proposed framework to demonstrate its versatility. For FR, we superimpose four types of degradation (except 16x mosaic) over clean images in random order with uniformly sampled intensity to replicate the challenge expected for real-world application scenarios. (The Python script will be provided in the supplementary materials.) Fig. 2 displays the visual impact of each type of degradation upon a clean input face. It is evident that mosaic is the most challenging due to the severe corruption of facial boundaries and fine-grained details. Blurring and down-sampling are slightly milder, with the structural integrity of the face almost intact. Finally, JPEG compression and additive noise are the least conceptually obtrusive, where even the smallest details (such as hair bangs) are clearly discernible. As will be evidenced later in Sec. 5.1, the visual impact is consistent with the performance of the proposed face renovation model. Finally, the full degradation for FR is more complex and challenging than all subtasks (except 16x mosaic), with both additive noises/artifacts and detail loss/corruption. We believe the proposed degradation simulation can provide a sufficient yet still manageable challenge towards real-world FR applications.
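The superimposition of degradations in random order with uniformly sampled intensity can be sketched as follows. This is a minimal illustration, not the script from the supplementary materials: the function names, the intensity ranges, the box blur, and the use of coarse quantization as a crude stand-in for JPEG compression are all assumptions made to keep the example dependent on NumPy only.

```python
import numpy as np

def downsample(img, t):
    """Block-average by a factor of 1, 2 or 4, then upsample by pixel repetition.

    Assumes the image height and width are divisible by 4.
    """
    f = (1, 2, 4)[min(int(t * 3), 2)]
    h, w, c = img.shape
    small = img.reshape(h // f, f, w // f, f, c).mean(axis=(1, 3))
    return small.repeat(f, axis=0).repeat(f, axis=1)

def blur(img, t):
    """Box blur with radius 0..3, computed by averaging shifted copies."""
    r = int(round(t * 3))
    if r == 0:
        return img
    h, w, _ = img.shape
    padded = np.pad(img, ((r, r), (r, r), (0, 0)), mode="edge")
    acc = np.zeros_like(img)
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            acc += padded[dy:dy + h, dx:dx + w]
    return acc / (2 * r + 1) ** 2

def add_noise(img, t, rng):
    """Additive Gaussian noise with standard deviation up to 0.1 (assumed range)."""
    return img + rng.normal(0.0, 0.1 * t, size=img.shape)

def quantize(img, t):
    """Coarse quantization: a hypothetical stand-in for lossy compression."""
    levels = int(round(256 - t * 240))  # 256 levels (mild) down to 16 (severe)
    return np.round(img * (levels - 1)) / (levels - 1)

def degrade(img, rng):
    """Apply all four degradations in random order with uniform intensities."""
    ops = [downsample, blur, lambda x, t: add_noise(x, t, rng), quantize]
    out = img.astype(np.float64)
    for k in rng.permutation(len(ops)):
        out = ops[k](out, rng.uniform(0.0, 1.0))
    return np.clip(out, 0.0, 1.0)
```

Here `img` is assumed to be a float array of shape (H, W, C) with values in [0, 1], and `rng` a `numpy.random.Generator`; a faithful simulation would replace `quantize` with a real JPEG encoder.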
With a single dominant type of degradation, existing methods are devoted to fitting an inverse transformation to recover the image content. When it comes to real-world scenes, the low-quality facial images usually contain unidentified heterogeneous degradation, necessitating a unified solution that can simultaneously address common degradations without prior knowledge. Given a severely degraded facial image, the renovation can be reasonably decomposed into two steps: 1) suppressing the impact of degradations and extracting robust semantic features; 2) replenishing fine details in a multi-stage fashion based on extracted semantic guidance. Generally speaking, a facial image can be decomposed into semantic hierarchies, such as structures, textures, and colors, which can be captured within different receptive fields. Also, noise and artifacts to be suppressed need to be adaptively identified according to different scale information. This motivates the design of HiFaceGAN, a multi-stage renovation framework consisting of several nested collaborative suppression and replenishment (CSR) units that is capable of resolving all types of degradation in a unified manner. Implementation details will be introduced in the following section.
3.3. Evaluation Criterion
For real-world applications, the evaluation criterion for face renovation should be more consistent with human perception rather than machine judgement. Therefore, besides the commonly adopted PSNR and SSIM (Wang et al., 2004)(Wang et al., 2003) metrics, the evaluation criterion for FR should also reflect the semantic fidelity and perceptual realism of renovated faces. For semantic fidelity, we measure the feature embedding distance (FED) and landmark localization error (LLE) with a pretrained face recognition model (Kazemi and Sullivan, 2014), where the average L2 norm between feature embeddings is adopted for both metrics. For perceptual realism, we introduce FID (Heusel et al., 2017) and LPIPS (Zhang et al., 2018) to evaluate the distributional and elementwise distance between original and generated samples in the respective perceptual spaces: for FID, it is defined by a pre-trained Inception V3 model (Szegedy et al., 2015), and for LPIPS, an AlexNet (Krizhevsky et al., 2012). Also, the NIQE (Mittal et al., 2013) metric adopted for the 2018 PIRM-SR challenge (Wang et al., 2018c) is introduced to measure the naturalness of renovated results for in-the-wild face images. Moreover, we will explain the trade-off between statistical and perceptual scores with the ablation study detailed in Sec. 5.3.
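Two of the simpler quantities above can be sketched directly. The snippet below is an illustrative sketch, not the evaluation code used for the reported scores: `psnr` follows the standard definition, while `mean_l2_distance` shows the "average L2 norm" aggregation, assuming the embeddings or landmark vectors have already been extracted by an external pretrained model (not included here). Both function names are hypothetical.

```python
import numpy as np

def psnr(reference, restored, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    ref = np.asarray(reference, dtype=np.float64)
    out = np.asarray(restored, dtype=np.float64)
    mse = np.mean((ref - out) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(peak ** 2 / mse))

def mean_l2_distance(a, b):
    """Average L2 norm between paired rows (e.g. embeddings or landmark vectors)."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.mean(np.linalg.norm(a - b, axis=1)))
```

Higher PSNR is better, whereas lower distances indicate higher semantic fidelity.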
4. The Proposed HiFaceGAN
In this section, we detail the architectural design and the working mechanism of the proposed HiFaceGAN. As shown in Fig. 3, the suppression modules aim to suppress heterogeneous degradation and encode robust hierarchical semantic information to guide the subsequent replenishment module to reconstruct the renovated face with corresponding photorealistic details. Further, we will illustrate the multi-stage renovation procedure and the functionality of individual units in Fig. 5 justifying the proposed methodology as well as providing new insights to the face renovation task.
4.1. Network Architecture
We propose a nested architecture containing several CSR units that each attend to a specific semantic aspect. Concretely, we cascade the front-end suppression modules to extract layered semantic features, in an attempt to capture the semantic “hierarchy” of the input image. Accordingly, the corresponding multi-stage renovation pipeline is implemented via several cascaded replenishment modules that each attend to the incoming layer of semantics. Note that the resulting renovation mechanism differs from the commonly perceived “coarse-to-fine” strategy as in progressive GAN (Karras et al., 2018a)(Kim et al., 2019). Instead, we allow the proposed framework to automatically learn a reasonable semantic decomposition and the corresponding face renovation procedure in a completely data-driven manner, maximizing the collaborative effect between the suppression and replenishment modules. More evidence will be provided in Sec. 4.3.
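The nested data flow described above can be sketched as follows. This is a structural illustration under stated assumptions rather than the actual network: `renovate`, `suppressors`, `replenishers` and `seed` are hypothetical names, every module is an arbitrary callable, and the starting input of the replenishment cascade (`seed`) is an assumption, since it is not specified in this section.

```python
from typing import Any, Callable, List, Sequence, Tuple

def renovate(
    image: Any,
    suppressors: Sequence[Callable[[Any], Any]],
    replenishers: Sequence[Callable[[Any, Any], Any]],
    seed: Any,
) -> Tuple[Any, List[Any]]:
    """Run cascaded suppression, then cascaded replenishment in nested order."""
    if len(suppressors) != len(replenishers):
        raise ValueError("each suppression module needs a matching replenishment module")

    # Front end: each suppression module consumes the previous module's output,
    # and every intermediate feature is kept as one layer of semantic guidance.
    features: List[Any] = []
    x = image
    for suppress in suppressors:
        x = suppress(x)
        features.append(x)

    # Back end: the innermost (deepest) feature guides the first replenishment
    # module and the outermost feature guides the last one, hence "nested".
    out = seed
    for replenish, guidance in zip(replenishers, reversed(features)):
        out = replenish(out, guidance)
    return out, features
```

With toy callables, `renovate(3, [lambda x: x + 1, lambda x: x * 2], [lambda y, f: y + f, lambda y, f: y * 10 + f], 0)` collects the features `[4, 8]` and consumes them in reverse order.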
Suppression Module A key challenge for face renovation lies in the heterogeneous degradation mingled within real-world images, where a conventional CNN layer with fixed kernel weights could suffer from the limited ability to discriminate between image contents and degradation artifacts. Specifically, we take a look at a conventional spatial convolution with kernel $W$:

$$F_{conv}(i) = \sum_{j \in \Omega(i)} w_{\Delta_{ji}} f_j \tag{1}$$

where $i, j$ are 2D spatial coordinates, $\Omega(i)$ is the sliding window centering at $i$, and $\Delta_{ji}$ is the offset between $j$ and $i$ that is used for indexing elements in $W$. The key observation from Eqn. (1) is that the conventional CNN layer shares the same kernel weights over the entire image, making the feature extraction pipeline content-agnostic. In other words, both the image content and degradation artifacts will be treated in an equal manner and aggregated into the final feature representation, with potentially negative impacts on the renovated image. Therefore, it is highly desirable to select and aggregate informative features with content-adaptive filters, such as LIP (Gao et al., 2019) or PAC (Su et al., 2019). In this work, we implement the suppression module as shown in Fig. 4 to replace the conventional convolution operation in Eqn. (1), which helps select informative feature responses and filter out degradation artifacts through adaptive kernels. Mathematically,

$$F_{CA}(i) = \sum_{j \in \Omega(i)} C(f_i, f_j)\, w_{\Delta_{ji}} f_j \tag{2}$$

where $C(\cdot, \cdot)$ aims to modulate the weight of convolution kernels with respect to the correlations between neighborhood features. Intuitively, one would expect a correlation metric to be symmetric, i.e. $C(f_i, f_j) = C(f_j, f_i)$, which can be fulfilled via the following parameterized inner-product function:

$$C(f_i, f_j) = g\left(\Psi(f_i)^{\top} \Psi(f_j)\right) \tag{3}$$

where $\Psi$ carries the raw input feature vector $f_i$ into the $D$-dimensional correlation space, which can help reduce the redundancy of raw input features between channels, and $g(\cdot)$ is a non-linear activation layer to adjust the range of the output, such as sigmoid or tanh. In practice, we implement
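To make the contrast between the two operations concrete, the sketch below implements a conventional convolution next to a content-adaptive one in plain NumPy. It is a minimal illustration under stated assumptions, not the suppression module itself: the feature at location i is written f_i, the projection into the correlation space is taken to be a single linear map `proj` (an assumption), the activation is a sigmoid, and zero padding with stride 1 is used.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv2d(feat, weight):
    """Conventional k x k convolution: the same kernel at every location.

    feat: (H, W, C_in); weight: (k, k, C_in, C_out); zero padding, stride 1.
    """
    k = weight.shape[0]
    r = k // 2
    h, w, _ = feat.shape
    padded = np.pad(feat, ((r, r), (r, r), (0, 0)))
    out = np.zeros((h, w, weight.shape[3]))
    for dy in range(k):
        for dx in range(k):
            f_j = padded[dy:dy + h, dx:dx + w]      # neighbours at this offset
            out += f_j @ weight[dy, dx]
    return out

def content_adaptive_conv2d(feat, weight, proj):
    """Convolution whose kernel weights are modulated per location pair.

    The modulation is sigmoid(<psi(f_i), psi(f_j)>), which is symmetric in
    (i, j). psi is assumed to be the linear map `proj` of shape (C_in, D).
    """
    k = weight.shape[0]
    r = k // 2
    h, w, _ = feat.shape
    emb = feat @ proj                               # psi(f_i), shape (H, W, D)
    padded_feat = np.pad(feat, ((r, r), (r, r), (0, 0)))
    padded_emb = np.pad(emb, ((r, r), (r, r), (0, 0)))
    out = np.zeros((h, w, weight.shape[3]))
    for dy in range(k):
        for dx in range(k):
            f_j = padded_feat[dy:dy + h, dx:dx + w]
            e_j = padded_emb[dy:dy + h, dx:dx + w]
            corr = sigmoid(np.sum(emb * e_j, axis=-1, keepdims=True))
            out += (corr * f_j) @ weight[dy, dx]    # down-weight dissimilar neighbours
    return out
```

When `proj` is all zeros, every correlation equals sigmoid(0) = 0.5 and the adaptive output reduces to half of the conventional one, which is a convenient sanity check that the modulation is the only difference between the two functions.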
Replenishment Module Having acquired semantic features from the front-end suppression module, we now focus on utilizing the encoded features for guided detail replenishment. Existing works on semantic-guided generation have achieved remarkable progress with spatially adaptive denormalization (SPADE) (Park et al., 2019), where semantic parsing maps are utilized to guide the generation of details that belong to different semantic categories, such as the sky, sea, or trees. We leverage such progress by incorporating the SPADE block into our cascaded CSR units, allowing effective utilization of encoded semantic features to guide the generation of fine-grained details in a hierarchical fashion. In particular, the progressive generator contains several cascaded SPADE blocks, where each block receives the output from the previous block and replenishes new details following the guidance of the corresponding semantic features encoded with the suppression module. In this way, our framework can automatically capture the global structure and progressively fill in finer visual details at proper locations even without the guidance of additional face parsing information.
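The core of a SPADE-style block is a parameter-free normalization followed by a per-pixel scale and shift predicted from the guidance. The sketch below illustrates that idea only; it is not the SPADE reference implementation. In particular, the per-image channel statistics and the 1x1 linear maps that predict the scale and shift are simplifying assumptions, and all parameter names are hypothetical.

```python
import numpy as np

def spade_modulate(x, guidance, w_gamma, b_gamma, w_beta, b_beta, eps=1e-5):
    """Spatially adaptive denormalization of activations by semantic guidance.

    x: (H, W, C) activations; guidance: (H, W, G) semantic features.
    w_gamma, w_beta: (G, C) matrices; b_gamma, b_beta: (C,) biases.
    """
    # Parameter-free normalization with per-channel statistics.
    mean = x.mean(axis=(0, 1), keepdims=True)
    std = x.std(axis=(0, 1), keepdims=True)
    normalized = (x - mean) / (std + eps)

    # Scale and shift vary per pixel because they are computed from the
    # guidance at that pixel (assumed 1x1 linear maps for brevity).
    gamma = guidance @ w_gamma + b_gamma
    beta = guidance @ w_beta + b_beta
    return gamma * normalized + beta
```

Because `gamma` and `beta` depend on the guidance at each location, the semantic features decide where and how strongly details are injected.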
4.2. Loss Functions
Most face restoration works aim to optimize the mean-square-error (MSE) against target images (Dong et al., 2014)(Kim et al., 2015)(Lim et al., 2017), which often leads to blurry outputs with an insufficient amount of detail (Wang et al., 2018c). Corresponding to the evaluation criterion in Sec. 3.3, it is crucial that the renovated image exhibits high semantic fidelity and visual realism, while slight signal-level discrepancies are often tolerable. To this end, we follow the adversarial training scheme (Goodfellow et al., 2014) with an adversarial loss to encourage the realism of renovated faces. Here we adopt the LSGAN variant (Mao et al., 2016) for better training dynamics.
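For reference, the commonly used least-squares GAN objective with targets 0 and 1 takes the form below. This is the standard formulation and is given as an assumption about the general shape of the loss; the exact expression and any weighting used for HiFaceGAN are not reproduced here. The symbols are notation introduced for this sketch: $x$ is the degraded input, $y$ the high-quality target, $G$ the generator and $D$ the discriminator.

```latex
\begin{aligned}
\mathcal{L}_{D} &= \tfrac{1}{2}\,\mathbb{E}_{y}\big[(D(y) - 1)^2\big]
                 + \tfrac{1}{2}\,\mathbb{E}_{x}\big[D(G(x))^2\big], \\
\mathcal{L}_{G} &= \tfrac{1}{2}\,\mathbb{E}_{x}\big[(D(G(x)) - 1)^2\big].
\end{aligned}
```

Replacing the logarithmic terms of the original GAN loss with squared errors penalizes samples by their distance from the decision boundary, which is the usual argument for more stable gradients.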
Understanding of HiFaceGAN To illustrate what each CSR unit generates at the corresponding stage and how the units work cooperatively to perform face renovation, we provide an illustrative example in Fig. 5(a), where we ablate certain units by replacing the corresponding semantic feature map (blue dots) with a constant tensor (hollow circles), leading to a plain grey background to better isolate the contents generated at each individual stage. Given a 16x down-sampled low-quality facial image, we first utilize single semantic guidance sequentially from the inner stage to the outer stage (upper row of Fig. 5), and then show the results of accumulating semantic guidance (lower row of Fig. 5). Notably, single semantic guidance from a specific stage leads the corresponding replenishment module to generate a hierarchical layer, which from the inner stage to the outer stage focuses on facial landmarks, edges and textures, shades and reflections, tone and illumination, and colorization, respectively. In detail, by progressively adding semantic guidance, it can be found that with a larger receptive field and high-level semantic features, our HiFaceGAN sketches the rough face boundary and localizes facial landmarks, allowing the subsequent CSR units to replenish fine details upon the basic facial structure as the receptive field shrinks and the resolution increases. The step-by-step face renovation process acts like a hierarchical layer-by-layer overlaying of contents generated with replenishment modules in a semantic-guided fashion, which gradually enhances the visual quality and realism of the renovated image. So far, the progressive face renovation process with logically ordered steps has justified our heuristics in the network architecture design and illustrates the efficacy and interpretability of HiFaceGAN.
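The ablation procedure described above (replacing selected semantic feature maps with a constant tensor) can be sketched as follows. The helper names, the fill value, and the convention that index 0 denotes the innermost stage are assumptions made for illustration.

```python
import numpy as np

def ablate_guidance(features, keep, fill=0.0):
    """Replace every semantic feature map whose stage index is not in `keep`
    with a constant tensor of the same shape."""
    return [f if i in keep else np.full_like(f, fill) for i, f in enumerate(features)]

def single_stage_settings(n):
    """One setting per stage: only that stage's guidance is kept (upper row)."""
    return [{i} for i in range(n)]

def accumulated_settings(n):
    """Guidance is accumulated from the innermost stage outwards (lower row)."""
    return [set(range(i + 1)) for i in range(n)]
```

Feeding each ablated feature list to the replenishment modules would then yield one visualization per setting.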
Additionally, the advantages of the proposed HiFaceGAN can be summarized in three aspects:
Versatility Without degradation prior and explicit guidance, our HiFaceGAN works in a “dual-blind” fashion and can resolve the subtasks of face restoration in a unified manner without tweaking the training configurations (Table 1).
Robustness Our HiFaceGAN can withstand severe degradations even more intense than those in the training data (Fig. 7).
Generalization Ability With content-adaptive suppression and hierarchical semantic-guided replenishment, our HiFaceGAN performs well in real-world scenarios (Sec. 5.2).
Comparison with Blind Face Restoration To better clarify the distinctions between our face renovation framework and existing restoration methods, we compare the residual maps generated with our HiFaceGAN and the state-of-the-art blind face restoration network GFRNet (Li et al., 2018). As shown in Fig. 6(b), the residual map generated with GFRNet packs heavier noise and less semantically meaningful details, indicating a higher focus on “suppression” and insufficient attention to “replenishment”. This could be attributed to the PSNR-oriented optimization objective, where additive noises contribute a large proportion of the signal discrepancy. In contrast, HiFaceGAN can simultaneously suppress degradation artifacts and replenish semantic details, leading to semantic-aware residual maps and more refined renovation results. Also, HiFaceGAN can renovate background contents and foreground faces together, leading to consistent quality improvement across the entire image. This justifies the rationale of the “dual-blind” setting towards real-world applications with images containing rich non-facial contents.
| Task | Method | PSNR | SSIM | MS-SSIM | FED | LLE | FID | LPIPS | NIQE |
|---|---|---|---|---|---|---|---|---|---|
| | EDSR (Lim et al., 2017) | 30.188 | 0.824 | 0.961 | 0.0843 | 2.003 | 20.605 | 0.2475 | 13.636 |
| | SRGAN (Ledig et al., 2017) | 27.494 | 0.735 | 0.935 | 0.1097 | 2.269 | 4.396 | 0.1313 | 7.378 |
| Face Super | ESRGAN (Wang et al., 2018c) | 27.134 | 0.741 | 0.935 | 0.1107 | 2.261 | 3.503 | 0.1221 | 6.984 |
| Resolution | SRFBN (Li et al., 2019b) | 29.577 | 0.827 | 0.953 | 0.0984 | 2.066 | 20.032 | 0.2406 | 13.901 |
| (4x, Bicubic) | Super-FAN (Bulat and Tzimiropoulos, 2018) | 25.463 | 0.729 | 0.913 | 0.1416 | 2.333 | 14.811 | 0.2357 | 8.719 |
| | WaveletCNN (Huang et al., 2017) | 28.750 | 0.806 | 0.952 | 0.0964 | 2.072 | 16.472 | 0.2443 | 12.217 |
| Denoising | RIDNet (Anwar and Barnes, 2019) | 25.432 | 0.731 | 0.891 | 0.2128 | 2.465 | 36.515 | 0.3864 | 13.002 |
| 1/3 Poisson, | VDNet (Yue et al., 2019) | 27.718 | 0.797 | 0.928 | 0.1551 | 2.297 | 15.826 | 0.2458 | 14.262 |
| Deblurring | DeblurGAN (Kupyn et al., 2017) | 25.304 | 0.718 | 0.894 | 0.1786 | 3.219 | 14.331 | 0.2574 | 12.697 |
| (1/2 Motion blur | DeblurGANv2 (Kupyn et al., 2019) | 26.908 | 0.773 | 0.913 | 0.1043 | 3.036 | 10.285 | 0.2178 | 13.729 |
| 1/2 Gaussian blur) | HiFaceGAN | 28.928 | 0.793 | 0.954 | 0.0913 | 2.156 | 2.580 | 0.0874 | 7.426 |
| | ARCNN (Dong et al., 2015) | 33.021 | 0.879 | 0.972 | 0.0845 | 1.959 | 9.761 | 0.1551 | 14.827 |
| JPEG artifact | EPGAN (Mao et al., 2018) | 32.780 | 0.882 | 0.976 | 0.0814 | 1.979 | 10.250 | 0.1638 | 13.729 |
| | GFRNet (Li et al., 2018) | 25.227 | 0.686 | 0.854 | 0.2524 | 3.371 | 48.229 | 0.4591 | 20.777 |
In this section, we demonstrate the versatility, robustness and generalization ability of our proposed HiFaceGAN over a wide range of related face restoration sub-tasks, both on synthetic images and real-world photographs. Furthermore, a thorough ablation study is performed to verify our major contributions and stimulate future research directions. Detailed configurations are provided in the supplementary materials to facilitate reproduction.
5.1. Comparison with state-of-the-art methods
We first evaluate our framework on five subtasks, including super resolution, hallucination, denoising, deblurring and compression artifact removal. For each subtask, the dataset is prepared by performing task-specific degradation upon raw images from FFHQ (Karras et al., 2018b) (Fig. 2). Finally, the five most competitive task-specific methods, along with the state-of-the-art blind face restoration baseline (Li et al., 2018), are chosen to compete with our HiFaceGAN over the most challenging and practical FR task.
Comparison with Task-Specific Baselines Overall, HiFaceGAN outperforms all baselines by a large margin on perceptual performance, with 3-10 times gain on FID and 50%-200% gain on LPIPS (Table 1). Furthermore, HiFaceGAN even outperforms real images in terms of naturalness, as reflected by the NIQE metric. Generally, our generative approach is better suited for tasks with heavy structural degradation, such as face hallucination, denoising and deblurring. For super-resolution and JPEG artifact removal, the structural degradation is considerably milder (Fig. 2), leading to narrowed gaps between task-specific solutions and our generalized framework, especially on statistical scores. This is reasonable since the training objectives are more perceptually inclined for FR. Nevertheless, it is still possible to trade off between perceptual and statistical performance, as will be discussed in the ablation study.
For qualitative comparison, we showcase representative results on the corresponding tasks in Fig. 6. For all subtasks, our HiFaceGAN can replenish rich and convincing visual details, such as hair bangs, beards and wrinkles, leading to consistent, photo-realistic renovation results. In contrast, other task-specific methods either produce over-smoothed or color-shifted results (WaveletCNN, RIDNet), or incur severe systematic artifacts during detail replenishment (ESRGAN, Super-FAN). Moreover, our dual-blind setting is equally effective in enhancing details for non-facial contents, such as the interweaving grids on the microphone. In summary, HiFaceGAN can resolve all types of degradation in a unified manner with strong renovation performance, verifying the efficacy of the proposed methodology and architectural design. More results are provided in the supplementary material.
Dual-Blind vs. Single-Blind To discuss the impact of external guidance, we compare our HiFaceGAN with the state-of-the-art “single-blind” baseline GFRNet (Li et al., 2018) over the fully-degraded FFHQ dataset, where the ground truth image is provided as the high-quality guidance during testing. As shown in columns 7-9 of Fig. 6, even with the strongest guidance, GFRNet is still less effective in suppressing noise and replenishing fine-grained details than our network, indicating its limitation in feature utilization and generative capability. Consistent with our observation in Sec. 4.3, the performance gain of GFRNet against other baselines is mainly statistical, where the semantic and perceptual scores are less competitive (Table 1). Our empirical study suggests that 1) the lack of explicit guidance does not necessarily lead to inferior performance of face renovation; 2) the ability to replenish plausible details is most crucial for high-quality face renovation.
5.2. Historic Photograph Renovation
The historic group photograph of famous physicists of that era, taken at the 5th Solvay Conference in 1927, is utilized to evaluate the generalization capability of state-of-the-art models for real-world face renovation (Fig. 1). We crop face patches from the original image and resize them with bicubic interpolation for input. Compared to the others, our HiFaceGAN can successfully suppress complex degradation in real old photos to generate faces with high definition, high fidelity, and fewer artifacts, while replenishing realistic details, such as facial luster, fine hair, clear facial features, and photo-realistic wrinkles. More renovation results are displayed in Fig. 6. Inevitably, the renovated faces contain minor artifacts that mostly occur in shaded regions, where degradation artifacts have severely corrupted the underlying contents. Nevertheless, the renovated high-resolution portraits still possess much better visual and artistic quality than the original input, which simultaneously demonstrates the capability of our model and the challenge in real-world applications.
5.3. Ablation Study
We perform an ablation study over the most challenging 16x face hallucination task. Four ablation methods are designed to verify our major contributions, as described below:
16xFace replaces the semantic parsing map in SPADE with degraded faces containing 16-pixel mosaics.
FixConv retains the nested CSR architecture of HiFaceGAN with the normal content-agnostic convolution layer in Eqn. (1).
L1 adds an additional L1 loss upon default HiFaceGAN to adjust between statistical and perceptual scores.
The evaluation scores are reported in Table 2. Although face parsing maps provide much finer spatial guidance, it is evident that face renovation relies more on semantic features, as reflected by the large performance gap between SPADE and 16xFace. Also, FixConv achieves visible performance gain by extracting hierarchical semantic features and applying multi-stage face renovation, verifying the proposed nested architecture. Moreover, incorporating the content-adaptive suppression module further improves the feature selection and degradation suppression ability, leading to substantial gain over FixConv on perceptual and semantic scores. Finally, adding an L1 loss term makes the model statistically inclined, with superior PSNR/SSIM and inferior FID/LPIPS/NIQE scores, verifying the flexibility of our framework in trading off between statistical and perceptual performance.
Pressure Test To verify the robustness of HiFaceGAN, we conduct two sets of pressure tests targeting the suppression and replenishment modules, respectively. For the suppression test, we add random noise upon clean images with up to 140% peak amplitude (twice the energy) of the training data; for the replenishment test, we evaluate a 16x super-resolution model with images downsampled up to a 64x ratio. Fig. 7 displays the renovation results of our HiFaceGAN under extreme degradations. The proposed suppression module is effective for extracting robust semantic features under heavy noise, and the replenishment module can still recover plausible faces even for input beyond human recognition, where most signal discrepancies are smartly “hidden” in the more stochastic hair regions, thus mitigating the negative impact on the naturalness and perceptual realism of renovated faces. Overall, the pressure test demonstrates the efficacy of the proposed collaborative suppression and replenishment framework.
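The correspondence between 140% amplitude and twice the energy follows from energy scaling with the square of amplitude. The line below is a worked check under the assumption that energy is measured as mean squared amplitude; $A$ and $E$ are notation introduced here.

```latex
\frac{E_{\text{test}}}{E_{\text{train}}}
  = \left(\frac{A_{\text{test}}}{A_{\text{train}}}\right)^{2}
  = 1.4^{2} = 1.96 \approx 2
```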
6. Conclusion and Future Work
In this paper, we present a challenging yet more practical task for real-world photo repair applications, termed "face renovation". In particular, we propose HiFaceGAN, a collaborative suppression and replenishment framework that works in a "dual-blind" fashion, lifting the usual requirement of a degradation prior or structural guidance for training. Extensive experiments on both synthetic face images and real-world historic photographs demonstrate its versatility, robustness and generalization capability over a wide range of face restoration tasks, outperforming the current state of the art by a large margin. Furthermore, the working mechanism of HiFaceGAN and the rationality of the "dual-blind" setting are justified with illustrative examples, bringing fresh insights to the subject matter. In the future, we envision that HiFaceGAN will serve as a solid stepping stone towards meeting the expectations of face renovation. Several limitations remain. Severe degradation often leads to content ambiguity, as with the Afro hairstyle in Fig. 6, which our method misjudged as straight hair; this motivates us to increase the diversity of, and balance between, different ethnic groups during data collection. The renovation of objects with regular geometric shapes (such as glasses) and of partially occluded faces also remains a major challenge, and is a typical case where external structural guidance could be beneficial. Therefore, exploring multi-modal generation networks with both structural and semantic guidance is another possible direction.
References
- Anwar and Barnes (2019) Saeed Anwar and Nick Barnes. 2019. Real Image Denoising With Feature Attention. 2019 IEEE/CVF International Conference on Computer Vision (ICCV) (2019), 3155–3164.
- Anwar et al. (2017) Saeed Anwar, Fatih Murat Porikli, and Cong Phuoc Huynh. 2017. Category-Specific Object Image Denoising. IEEE Transactions on Image Processing 26 (2017), 5506–5518.
- Bai et al. (2019) Yuanchao Bai, Gene Cheung, Xianming Liu, and Wen Gao. 2019. Graph-Based Blind Image Deblurring From a Single Photograph. IEEE Transactions on Image Processing 28 (2019), 1404–1418.
- Bulat and Tzimiropoulos (2018) Adrian Bulat and Georgios Tzimiropoulos. 2018. Super-FAN: Integrated Facial Landmark Localization and Super-Resolution of Real-World Low Resolution Faces in Arbitrary Poses. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (2018), 109–117.
- Chen et al. (2017) Yu Chen, Ying Tai, Xiaoming Liu, Chunhua Shen, and Jian Yang. 2017. FSRNet: End-to-End Learning Face Super-Resolution with Facial Priors. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (2017), 2492–2501.
- Chen et al. (2018) Zhibo Chen, Jianxin Lin, Tiankuang Zhou, and Feng Wu. 2018. Sequential Gating Ensemble Network for Noise Robust Multi-Scale Face Restoration. IEEE transactions on cybernetics (2018).
- Dong et al. (2015) Chao Dong, Yubin Deng, Chen Change Loy, and Xiaoou Tang. 2015. Compression Artifacts Reduction by a Deep Convolutional Network. 2015 IEEE International Conference on Computer Vision (ICCV) (2015), 576–584.
- Dong et al. (2014) Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. 2014. Image Super-Resolution Using Deep Convolutional Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (2014), 295–307.
- Elron et al. (2020) Noam Elron, Shahar Yuval, Dmitry Rudoy, and Noam Levy. 2020. Blind Image Restoration without Prior Knowledge. ArXiv abs/2003.01764 (2020).
- Gao et al. (2019) Ziteng Gao, Limin Wang, and Gangshan Wu. 2019. LIP: Local Importance-Based Pooling. 2019 IEEE/CVF International Conference on Computer Vision (ICCV) (2019), 3354–3363.
- Goodfellow et al. (2014) Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. 2014. Generative Adversarial Nets. In NIPS.
- Grm et al. (2019) Klemen Grm, Walter J Scheirer, and Vitomir Struc. 2019. Face hallucination using cascaded super-resolution and identity priors. IEEE Transactions on Image Processing 29, 1 (2019), 2150–2165.
- Heusel et al. (2017) Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, Günter Klambauer, and Sepp Hochreiter. 2017. GANs Trained by a Two Time-Scale Update Rule Converge to a Nash Equilibrium. CoRR (2017).
- Huang et al. (2017) Huaibo Huang, Ran He, Zhenan Sun, and Tieniu Tan. 2017. Wavelet-SRNet: A Wavelet-Based CNN for Multi-scale Face Super Resolution. 2017 IEEE International Conference on Computer Vision (ICCV) (2017), 1698–1706.
- Huang and Belongie (2017) Xun Huang and Serge J. Belongie. 2017. Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization. CoRR abs/1703.06868 (2017).
- Isola et al. (2016) Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. 2016. Image-to-Image Translation with Conditional Adversarial Networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016), 5967–5976.
- Johnson et al. (2016) Justin Johnson, Alexandre Alahi, and Li Fei-Fei. 2016. Perceptual Losses for Real-Time Style Transfer and Super-Resolution. In ECCV.
- Karras et al. (2018a) Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. 2018a. Progressive Growing of GANs for Improved Quality, Stability, and Variation. In ICLR.
- Karras et al. (2018b) Tero Karras, Samuli Laine, and Timo Aila. 2018b. A Style-Based Generator Architecture for Generative Adversarial Networks. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2018), 4396–4405.
- Karras et al. (2019) Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. 2019. Analyzing and Improving the Image Quality of StyleGAN. ArXiv abs/1912.04958 (2019).
- Kazemi and Sullivan (2014) Vahid Kazemi and Josephine Sullivan. 2014. One millisecond face alignment with an ensemble of regression trees. In Proceedings of the IEEE conference on computer vision and pattern recognition. 1867–1874.
- Kim et al. (2019) Deokyun Kim, Minseon Kim, Gihyun Kwon, and Dae-Shik Kim. 2019. Progressive Face Super-Resolution via Attention to Facial Landmark. In Proceedings of the 30th British Machine Vision Conference (BMVC).
- Kim et al. (2015) Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. 2015. Accurate Image Super-Resolution Using Very Deep Convolutional Networks. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015), 1646–1654.
- Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS.
- Kupyn et al. (2017) Orest Kupyn, Volodymyr Budzan, Mykola Mykhailych, Dmytro Mishkin, and Jiri Matas. 2017. DeblurGAN: Blind Motion Deblurring Using Conditional Adversarial Networks. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (2017), 8183–8192.
- Kupyn et al. (2019) Orest Kupyn, Tetiana Martyniuk, Junru Wu, and Zhangyang Wang. 2019. DeblurGAN-v2: Deblurring (Orders-of-Magnitude) Faster and Better. 2019 IEEE/CVF International Conference on Computer Vision (ICCV) (2019), 8877–8886.
- Ledig et al. (2017) Christian Ledig, Lucas Theis, Ferenc Huszár, José Antonio Caballero, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, and Wenzhe Shi. 2017. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017), 105–114.
- Lee et al. (2019) Cheng-Han Lee, Ziwei Liu, Lingyun Wu, and Ping Luo. 2019. MaskGAN: Towards Diverse and Interactive Facial Image Manipulation. ArXiv abs/1907.11922 (2019).
- Li et al. (2019a) Mengyan Li, Yuechuan Sun, Zhaoyu Zhang, Haonian Xie, and Jun Yu. 2019a. Deep Learning Face Hallucination via Attributes Transfer and Enhancement. In 2019 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 604–609.
- Li et al. (2020a) Xiaoming Li, Wenyu Li, Dongwei Ren, Hongzhi Zhang, Meng Wang, and Wangmeng Zuo. 2020a. Enhanced Blind Face Restoration with Multi-Exemplar Images and Adaptive Spatial Feature Fusion. In CVPR.
- Li et al. (2018) Xiaoming Li, Ming Liu, Yuting Ye, Wangmeng Zuo, Liang Lin, and Ruigang Yang. 2018. Learning Warped Guidance for Blind Face Restoration. In ECCV.
- Li et al. (2020b) Yuelong Li, Mohammad Tofighi, Junyi Geng, Vishal Monga, and Yonina C. Eldar. 2020b. Efficient and Interpretable Deep Blind Image Deblurring Via Algorithm Unrolling. IEEE Transactions on Computational Imaging 6 (2020), 666–681.
- Li et al. (2019b) Zhen Li, Jinglei Yang, Zheng Liu, Xiaomin Yang, Gwanggil Jeon, and Wei Wu. 2019b. Feedback Network for Image Super-Resolution. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019), 3862–3871.
- Lim et al. (2017) Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. 2017. Enhanced Deep Residual Networks for Single Image Super-Resolution. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops.
- Lin et al. (2018) Jianxin Lin, Tiankuang Zhou, and Zhibo Chen. 2018. Multi-Scale Face Restoration with Sequential Gating Ensemble Network. In AAAI.
- Liu et al. (2017) Ding Liu, Bihan Wen, Xianming Liu, and Thomas S. Huang. 2017. When Image Denoising Meets High-Level Vision Tasks: A Deep Learning Approach. In IJCAI.
- Liu et al. (2019a) Jiaying Liu, D. Liu, Wenhan Yang, Sifeng Xia, Xiaoshuai Zhang, and Yuanying Dai. 2019a. A Comprehensive Benchmark for Single Image Compression Artifacts Reduction. ArXiv abs/1909.03647 (2019).
- Liu et al. (2019b) Lu Liu, Shenghui Wang, and Lili Wan. 2019b. Component Semantic Prior Guided Generative Adversarial Network for Face Super-Resolution. IEEE Access 7 (2019), 77027–77036.
- Mao et al. (2018) Qi Mao, Shiqi Wang, Shanshe Wang, Xinfeng Zhang, and Siwei Ma. 2018. Enhanced Image Decoding via Edge-Preserving Generative Adversarial Networks. 2018 IEEE International Conference on Multimedia and Expo (ICME) (2018), 1–6.
- Mao et al. (2016) Xudong Mao, Qing Li, Haoran Xie, Raymond Y. K. Lau, Zhixiang Wang, and Stephen Paul Smolley. 2016. Least Squares Generative Adversarial Networks. 2017 IEEE International Conference on Computer Vision (ICCV) (2016), 2813–2821.
- Maru and Parikh (2017) Monika Maru and Mehul C. Parikh. 2017. Image Restoration Techniques: A Survey.
- Mittal et al. (2013) Anish Mittal, Rajiv Soundararajan, and Alan C. Bovik. 2013. Making a “Completely Blind” Image Quality Analyzer. IEEE Signal Processing Letters 20 (2013), 209–212.
- Newman and Hildebrandt (1987) B. B. Newman and J. Hildebrandt. 1987. Blind Image Restoration. Australian Computer Journal 19 (1987), 126–133.
- Park et al. (2019) Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. 2019. Semantic Image Synthesis With Spatially-Adaptive Normalization. In CVPR.
- Peihao Zhu and Wonka (2019) Peihao Zhu, Rameen Abdal, Yipeng Qin, and Peter Wonka. 2019. SEAN: Image Synthesis with Semantic Region-Adaptive Normalization. ArXiv abs/1911.12861 (2019).
- Pumarola et al. (2018) Albert Pumarola, Antonio Agudo, Aleix M. Martínez, Alberto Sanfeliu, and Francesc Moreno-Noguer. 2018. GANimation: Anatomically-aware Facial Animation from a Single Image. European Conference on Computer Vision (ECCV) 11214 (2018), 835–851.
- Rajput et al. (2018) Shyam Singh Rajput, K. V. Arya, V. Jeba Singh, and Vijay Kumar Bohat. 2018. Face Hallucination Techniques: A Survey. 2018 Conference on Information and Communication Technology (CICT) (2018), 1–6.
- Sahu et al. (2019) Siddhant Sahu, Manoj Kumar Lenka, and Pankaj Kumar Sa. 2019. Blind Deblurring using Deep Learning: A Survey. ArXiv abs/1907.10128 (2019).
- Shi and Zhao (2019) Jingang Shi and Guoying Zhao. 2019. Face Hallucination via Coarse-to-Fine Recursive Kernel Regression Structure. IEEE Transactions on Multimedia 21, 9 (2019), 2223–2236.
- Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR abs/1409.1556 (2014).
- Su et al. (2019) Hang Su, Varun Jampani, Deqing Sun, Orazio Gallo, Erik G. Learned-Miller, and Jan Kautz. 2019. Pixel-Adaptive Convolutional Neural Networks. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019), 11158–11167.
- Szegedy et al. (2015) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2015. Rethinking the Inception Architecture for Computer Vision. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015), 2818–2826.
- Wang et al. (2018b) Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. 2018b. High-Resolution Image Synthesis and Semantic Manipulation With Conditional GANs. In CVPR.
- Wang et al. (2018a) Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. 2018a. Video-to-Video Synthesis. In NeurIPS.
- Wang et al. (2018c) Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy. 2018c. ESRGAN: Enhanced super-resolution generative adversarial networks. In The European Conference on Computer Vision Workshops.
- Wang et al. (2004) Zhou Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. 2004. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13 (2004), 600–612.
- Wang et al. (2003) Zhou Wang, Eero P. Simoncelli, and Alan C. Bovik. 2003. Multi-scale structural similarity for image quality assessment.
- Yang et al. (2018) Wenming Yang, Xuechen Zhang, Yapeng Tian, Wei Wang, Jing-Hao Xue, and Qingmin Liao. 2018. Deep Learning for Single Image Super-Resolution: A Brief Review. IEEE Transactions on Multimedia 21 (2018), 3106–3121.
- Yu et al. (2018) Xin Yu, Basura Fernando, Bernard Ghanem, Fatih Porikli, and Richard Hartley. 2018. Face super-resolution guided by facial component heatmaps. In Proceedings of the European Conference on Computer Vision (ECCV). 217–233.
- Yue et al. (2019) Zongsheng Yue, Hongwei Yong, Qian Zhao, L. M. Zhang, and Deyu Meng. 2019. Variational Denoising Network: Toward Blind Noise Modeling and Removal. In NeurIPS.
- Zhang et al. (2011) Haichao Zhang, Jianchao Yang, Yanning Zhang, Nasser M. Nasrabadi, and Thomas S. Huang. 2011. Close the loop: Joint blind image restoration and recognition with sparse representation prior. 2011 International Conference on Computer Vision (2011), 770–777.
- Zhang et al. (2018) Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. 2018. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In CVPR. 586–595.
- Zhu et al. (2017a) Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. 2017a. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. 2017 IEEE International Conference on Computer Vision (ICCV) (2017), 2242–2251.
- Zhu et al. (2017b) Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A Efros, Oliver Wang, and Eli Shechtman. 2017b. Toward multimodal image-to-image translation. In NeurIPS.