HifiFace: 3D Shape and Semantic Prior Guided High Fidelity Face Swapping

Yuhan Wang, et al.

In this work, we propose a high fidelity face swapping method, called HifiFace, which can well preserve the face shape of the source face and generate photo-realistic results. Unlike existing face swapping works that only use a face recognition model to keep identity similarity, we propose 3D shape-aware identity to control the face shape with geometric supervision from a 3DMM and a 3D face reconstruction method. Meanwhile, we introduce the Semantic Facial Fusion module to optimize the combination of encoder and decoder features and make adaptive blending, which makes the results more photo-realistic. Extensive experiments on faces in the wild demonstrate that our method preserves identity better, especially the face shape, and generates more photo-realistic results than previous state-of-the-art methods.




1 Introduction

Face swapping is the task of generating images with the identity of a source face and the attributes (e.g., pose, expression, lighting, background) of a target image (as shown in Figure 1). It has attracted much interest for its great potential in the film industry [1] and computer games.

In order to generate high-fidelity face swapping results, several issues are critical: (1) the identity of the result face, including the face shape, should be close to the source face; (2) the results should be photo-realistic, faithful to the expression and posture of the target face, and consistent with the details of the target image such as lighting, background, and occlusion.

To preserve the identity of the generated face, previous works [22, 21, 14] generated the inner face region via 3DMM fitting or face-landmark-guided reenactment and blended it into the target image, as shown in Figure 2(a). These methods achieve weak identity similarity because the 3DMM cannot capture identity details and the target landmarks carry the identity of the target image. Also, the blending stage restricts any change of face shape. As shown in Figure 2(b), [19, 4] draw support from a face recognition network to improve identity similarity. However, a face recognition network focuses more on texture and is insensitive to geometric structure. Thus, these methods cannot robustly preserve the exact face shape.

As for generating photo-realistic results, [22, 21] used Poisson blending to fix the lighting, but it tends to cause ghosting and cannot deal with complex appearance conditions. [14, 29, 17] designed an extra learning-based stage to mitigate lighting or occlusion problems, but such pipelines are cumbersome and cannot solve all problems in one model.

To overcome the above defects, we propose a novel and elegant end-to-end learning framework, named HifiFace, to generate high fidelity swapped faces via 3D shape and semantic priors. Specifically, we first regress the coefficients of the source and target faces with a 3D face reconstruction model and recombine them as shape information. Then we concatenate it with the identity vector from a face recognition network. We explicitly use the 3D geometric structure information and take the recombined 3D face model, with the source's identity and the target's expression and posture, as auxiliary supervision to enforce precise face shape transfer. With this dedicated design, our framework achieves more similar identity, especially in face shape.

Furthermore, we introduce a Semantic Facial Fusion (SFF) module to make our results more photo-realistic. Attributes like lighting and background require spatial information, and high image quality requires detailed texture information. The low-level features in the encoder contain spatial and texture information, but they also carry rich identity information from the target image. Hence, to better preserve the attributes without harming the identity, our SFF module integrates the low-level encoder features and the decoder features by learned adaptive face masks. Finally, to overcome the occlusion problem and faithfully preserve the background, we blend the output into the target by the learned face mask as well. Unlike [21], which used the face masks of the target image for direct blending, HifiFace learns the face masks jointly under the guidance of dilated face semantic segmentation, which helps the model focus more on the facial area and make adaptive fusion around the edges. HifiFace handles image quality, occlusion, and lighting problems in one model, making the results more photo-realistic. Extensive experiments demonstrate that our results surpass other State-of-the-Art (SOTA) methods on wild face images with large facial variations.

Our contributions can be summarized as follows:

  1. We propose a novel and elegant end-to-end learning framework, named HifiFace, which can well preserve the face shape of the source face and generate high fidelity face swapping results.

  2. We propose a 3D shape-aware identity extractor, which can generate identity vector with exact shape information to help preserve the face shape of the source face.

  3. We propose a semantic facial fusion module, which can solve occlusion and lighting problems and generate results with high image quality.

2 Related Work

3D-based Methods.

3D Morphable Models (3DMMs) transform the shape and texture of examples into a vector space representation [2]. [27] transferred expressions from the source to the target face by fitting a 3D morphable face model to both faces. [22] transferred the expression and posture with a 3DMM and trained a face segmentation network to preserve target facial occlusions. These 3D-based methods follow a source-oriented pipeline like Figure 2(a), which generates only the face region by 3D fitting and blends it into the target image by the mask of the target face. They suffer from unrealistic texture and lighting because the 3DMM and the renderer cannot simulate complex lighting conditions. Also, the blending stage limits the face shape. In contrast, our HifiFace accurately preserves the face shape via geometric information from the 3DMM and achieves realistic texture and attributes via semantic prior guided recombination of encoder and decoder features.

Figure 2: The pipelines of previous works and our HifiFace. (a) The source-oriented pipeline uses 3D fitting or reenactment to generate the inner face region and blends it into the target image by the mask of the result's face region. (b) The target-oriented pipeline uses a face recognition network to extract identity and combines encoder features with the identity in the decoder. (c) Our pipeline consists of four parts: the encoder, the decoder, the 3D shape-aware identity extractor, and the SFF module. The encoder extracts features from the target image, the decoder fuses the encoder features and the 3D shape-aware identity feature, and the SFF module further improves the image quality.

GAN-based Methods.

GAN has shown great ability in generating fake images since it was proposed by [10]. [13] proposed a general image-to-image translation method, which proved the potential of the conditional GAN architecture for face swapping, although it requires paired data.

GAN-based face swapping methods mainly follow either a source-oriented or a target-oriented pipeline. [21, 14] followed the source-oriented pipeline in Figure 2(a), using face landmarks for face reenactment. But this may bring weak identity similarity, and the blending stage limits the change of face shape. [19, 4, 17] followed the target-oriented pipeline in Figure 2(b), using a face recognition network to extract the identity and a decoder to fuse the encoder features with the identity, but they cannot robustly preserve the exact face shape and are weak in image quality. Instead, HifiFace in Figure 2(c) replaces the face recognition network with a 3D shape-aware identity extractor to better preserve identity, including the face shape, and introduces an SFF module after the decoder to further improve realism.

Among these, FaceShifter [17] and SimSwap [4] follow the target-oriented pipeline and can generate high fidelity results. FaceShifter leveraged a two-stage framework and achieved state-of-the-art identity performance, but it could not perfectly preserve the lighting despite using an extra fixing stage. HifiFace, in contrast, preserves both lighting and identity in one stage and generates photo-realistic results with higher quality than FaceShifter. [4] proposed a weak feature matching loss to better preserve the attributes, but it harms identity similarity, whereas HifiFace better preserves the attributes without harming the identity.

3 Approach

Let I_s and I_t be the source and target images, respectively. We aim to generate a result image I_r with the identity of I_s and the attributes of I_t. As illustrated in Figure 2(c), our pipeline consists of four parts: the encoder, the decoder, the 3D shape-aware identity extractor (Sec. 3.1), and the SFF module (Sec. 3.2). First, we feed I_t into the encoder and use several res-blocks [11] to obtain the attribute features. Then, we use the 3D shape-aware identity extractor to obtain the 3D shape-aware identity. After that, res-blocks with adaptive instance normalization [15] in the decoder fuse the 3D shape-aware identity with the attribute features. Finally, the SFF module raises the resolution and makes the results more photo-realistic.
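To make the identity-injection step concrete, here is a minimal NumPy sketch of adaptive instance normalization as used inside a decoder res-block: the feature map is normalized per channel, then scaled and shifted with parameters predicted from the identity vector. The function names, dimensions, and the linear scale/shift predictors are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def adain(x, v_sid, W_scale, W_shift, eps=1e-5):
    """Adaptive instance normalization: normalize each channel of the
    feature map x (C, H, W), then scale/shift with parameters predicted
    linearly from the (3D shape-aware) identity vector v_sid."""
    mu = x.mean(axis=(1, 2), keepdims=True)       # per-channel mean
    sigma = x.std(axis=(1, 2), keepdims=True)     # per-channel std
    x_norm = (x - mu) / (sigma + eps)
    gamma = W_scale @ v_sid                       # (C,) learned scale
    beta = W_shift @ v_sid                        # (C,) learned shift
    return gamma[:, None, None] * x_norm + beta[:, None, None]

# toy usage: an 8-channel 4x4 feature map and a 16-dim identity vector
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4, 4))
v = rng.normal(size=16)
y = adain(x, v, rng.normal(size=(8, 16)), rng.normal(size=(8, 16)))
print(y.shape)  # (8, 4, 4)
```

In the real model the scale/shift predictors are trained jointly with the generator; here they are random matrices purely to exercise the shapes.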

3.1 3D Shape-Aware Identity Extractor

Most GAN-based methods only use a face recognition model to obtain identity information in the face swapping task. However, a face recognition network focuses more on texture and is insensitive to geometric structure. To obtain more exact face shape features, we introduce a 3DMM and use a pre-trained state-of-the-art 3D face reconstruction model [7] as a shape feature encoder, which represents the face shape S by an affine model:


S = S̄ + B_id α + B_exp β,

where S̄ is the average face shape, B_id and B_exp are the PCA bases of identity and expression, and α and β are the corresponding coefficient vectors for generating a 3D face.
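The affine model can be sketched in a few lines of NumPy. The dimensions here are toy values; a real 3DMM such as the Basel Face Model uses tens of thousands of vertices and dozens of PCA bases.

```python
import numpy as np

# Toy dimensions: a real 3DMM flattens ~35k vertices into N = 3 * 35k
# coordinates; here everything is shrunk for illustration.
N, K_id, K_exp = 9, 4, 3
rng = np.random.default_rng(1)
S_bar = rng.normal(size=N)           # mean face shape (flattened x,y,z)
B_id = rng.normal(size=(N, K_id))    # PCA identity bases
B_exp = rng.normal(size=(N, K_exp))  # PCA expression bases

def face_shape(alpha, beta):
    """S = S_bar + B_id @ alpha + B_exp @ beta (the affine 3DMM model)."""
    return S_bar + B_id @ alpha + B_exp @ beta

S = face_shape(rng.normal(size=K_id), rng.normal(size=K_exp))
print(S.shape)  # (9,)
```

Setting both coefficient vectors to zero recovers the mean shape S̄, which is a quick sanity check on the linear model.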

As illustrated in Figure 3(a), we regress the 3DMM coefficients c_s and c_t, containing identity, expression, and posture of the source and target faces, with the 3D face reconstruction model F_3d. Then, we recombine them into new coefficients c_fuse, with the source's identity and the target's expression and posture, which generate a new 3D face model. Note that the posture coefficients do not determine the face shape, but they may affect the 3D landmark locations when computing the loss. We do not use the texture and lighting coefficients because texture reconstruction remains unsatisfactory. Finally, we concatenate c_fuse with the identity feature extracted by F_id, a pre-trained state-of-the-art face recognition model [12], and obtain the final vector v_sid, called the 3D shape-aware identity. Thus, HifiFace obtains identity information that includes the geometric structure, which helps preserve the face shape of the source image.
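A minimal sketch of the recombination and concatenation steps follows, assuming a hypothetical coefficient layout (80 identity, 64 expression, 3 pose coefficients) and a 512-d recognition embedding; these sizes are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def recombine(c_src, c_tgt):
    """Build c_fuse: source identity coefficients with target expression
    and posture (the dict layout here is a hypothetical split)."""
    return {"id": c_src["id"], "exp": c_tgt["exp"], "pose": c_tgt["pose"]}

def shape_aware_identity(c_fuse, v_id):
    """Concatenate the fused 3DMM coefficients with the recognition
    embedding v_id to obtain the 3D shape-aware identity vector v_sid."""
    coeffs = np.concatenate([c_fuse["id"], c_fuse["exp"], c_fuse["pose"]])
    return np.concatenate([coeffs, v_id])

rng = np.random.default_rng(2)
c_s = {"id": rng.normal(size=80), "exp": rng.normal(size=64), "pose": rng.normal(size=3)}
c_t = {"id": rng.normal(size=80), "exp": rng.normal(size=64), "pose": rng.normal(size=3)}
v_id = rng.normal(size=512)  # e.g. an ArcFace-style 512-d embedding
v_sid = shape_aware_identity(recombine(c_s, c_t), v_id)
print(v_sid.shape)  # (659,)
```

The key property is that v_sid carries the source's shape coefficients but the target's expression and posture, so the decoder receives geometry as well as texture identity.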

Figure 3: Details of the 3D shape-aware identity extractor and the SFF module. (a) The 3D shape-aware identity extractor uses F_3d (the 3D face reconstruction network) and F_id (the face recognition network) to generate the shape-aware identity. (b) The SFF module recombines the encoder and decoder features by M_low and makes the final blending by M_r. Upsample means the upsample module.

3.2 Semantic Facial Fusion Module


The low-level features contain rich spatial information and texture details, which significantly help generate more photo-realistic results. Here, we propose the SFF module to make full use of the low-level encoder and decoder features while avoiding harm to the identity, since the low-level encoder features also carry the target's identity information.

As shown in Figure 3(b), we first predict a face mask M_low when the decoder features reach 1/4 of the target's size. Then, we blend the decoder feature z_dec by M_low and obtain z_fuse, formulated as:

z_fuse = M_low · z_dec + (1 − M_low) · σ(z_enc),

where z_enc means the low-level encoder feature at 1/4 of the original size and σ means a res-block [11].

The key design of SFF is to adjust the attention of the encoder and decoder, which helps disentangle identity and attributes. Specifically, the decoder feature in non-facial area can be damaged by the inserted source’s identity information, thus we replace it with the clean low-level encoder feature to avoid potential harm. While the facial area decoder feature, which contains rich identity information of the source face, should not be disturbed by the target, therefore we preserve the decoder feature in the facial area.
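The feature-level fusion described above can be sketched as a masked convex combination. The res-block σ is replaced by a placeholder identity function here; in the real module it is a learned transformation.

```python
import numpy as np

def sff_feature_fuse(z_dec, z_enc, m_low, res_block=lambda z: z):
    """z_fuse = m * z_dec + (1 - m) * sigma(z_enc): keep decoder features
    (carrying the source identity) inside the predicted face mask, and
    (res-blocked) encoder features (carrying the target background)
    outside it. res_block stands in for the learned sigma."""
    return m_low * z_dec + (1.0 - m_low) * res_block(z_enc)

rng = np.random.default_rng(3)
z_dec = rng.normal(size=(1, 8, 16, 16))
z_enc = rng.normal(size=(1, 8, 16, 16))
m = np.zeros((1, 1, 16, 16))
m[..., 4:12, 4:12] = 1.0                 # toy face mask
z_fuse = sff_feature_fuse(z_dec, z_enc, m)
# inside the mask we keep decoder features, outside encoder features
print(np.allclose(z_fuse[..., 5, 5], z_dec[..., 5, 5]))  # True
```

With a soft (non-binary) mask the two feature streams are smoothly interpolated around the face contour, which is what lets the module blend adaptively at the edges.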

After the feature-level fusion, we generate an auxiliary low-resolution image I_low to compute auxiliary losses for better disentangling the identity and attributes. Then we use an Upsample Module, which contains several res-blocks, to better fuse the feature maps. Based on this design, it is convenient for HifiFace to generate even higher resolution results (e.g., 512×512).


To solve the occlusion problem and better preserve the background, previous works [21, 20] directly used the mask of the target face. However, this brings artifacts because the face shape may change. Instead, we use SFF to learn a slightly dilated mask and embrace the change of face shape. Specifically, we predict a 3-channel image I_r and a 1-channel mask M_r, and blend I_r into the target image by M_r, formulated as:

I_out = M_r · I_r + (1 − M_r) · I_t
In summary, HifiFace can generate photo-realistic results with high image quality and well preserve lighting and occlusion with the help of the SFF module. Note that these abilities still work despite the change of face shape, because the masks have been dilated and our SFF benefits from inpainting around the contour of predicted face.
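The image-level blending with a dilated mask can be sketched as follows. The max-filter dilation below is a simple stand-in for whatever morphological dilation the training pipeline actually uses.

```python
import numpy as np

def dilate(mask, r=1):
    """Binary dilation by a (2r+1) x (2r+1) max filter: a stand-in for
    the mask dilation that lets the blended region extend slightly
    beyond the face, so the face shape is free to change."""
    padded = np.pad(mask, r)
    out = np.zeros_like(mask)
    H, W = mask.shape
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            out = np.maximum(out, padded[dy:dy + H, dx:dx + W])
    return out

def blend(i_r, i_t, m_r):
    """Image-level blending: I_out = M_r * I_r + (1 - M_r) * I_t."""
    return m_r[..., None] * i_r + (1.0 - m_r[..., None]) * i_t

mask = np.zeros((8, 8))
mask[3:5, 3:5] = 1.0
m_r = dilate(mask)                      # slightly enlarged face region
i_r = np.ones((8, 8, 3))                # toy "swapped" image (white)
i_t = np.zeros((8, 8, 3))               # toy target image (black)
out = blend(i_r, i_t, m_r)
print(out[4, 4, 0], out[0, 0, 0])       # 1.0 0.0
```

Pixels inside the dilated mask come from the swapped result, pixels outside come from the target, so occlusions and background survive untouched.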

3.3 Loss Function

3D Shape-Aware Identity (SID) Loss.

The SID loss contains a shape loss and an ID loss. We use 3D landmark keypoints as geometric supervision to constrain the face shape, which is widely used in 3D face reconstruction [7]. First we use a mesh renderer to generate the recombined 3D face model from the coefficients of the source image's identity and the target image's expression and posture. Then, we generate the 3D face models of I_r and I_low by regressing their 3DMM coefficients. Finally we project the 3D facial landmark vertices of the reconstructed face shapes onto the image plane, obtaining landmarks {q_fuse}, {q_r}, and {q_low}, and define:

L_shape = ||q_fuse − q_r||_1 + ||q_fuse − q_low||_1
Also, we use an identity loss to preserve the source image's identity:

L_id = (1 − cos(v_id(I_s), v_id(I_r))) + (1 − cos(v_id(I_s), v_id(I_low)))

where v_id(·) means the identity vector generated by F_id and cos(·, ·) means the cosine similarity of two vectors. Finally, our SID loss is formulated as:

L_sid = λ_shape L_shape + λ_id L_id

where λ_shape and λ_id are the weights balancing the two terms.
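A NumPy sketch of the two SID loss terms is below; the landmark count (68) and embedding size are illustrative assumptions, not values confirmed by the paper.

```python
import numpy as np

def cos(a, b):
    """Cosine similarity of two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def id_loss(v_src, v_result):
    """1 - cosine similarity between source and result embeddings;
    zero when the embeddings point in the same direction."""
    return 1.0 - cos(v_src, v_result)

def shape_loss(q_fuse, q_result):
    """Mean distance between the projected 2D landmarks of the
    recombined 3D face model and those regressed from the result."""
    return float(np.linalg.norm(q_fuse - q_result, axis=-1).mean())

rng = np.random.default_rng(4)
v = rng.normal(size=512)
q = rng.normal(size=(68, 2))   # 68 landmarks is a common convention
# identical inputs give (numerically) zero loss for both terms
zero_id = id_loss(v, v)
zero_shape = shape_loss(q, q)
```

In training, both losses would be computed for I_r and for the auxiliary I_low, and combined with the λ weights from the SID loss above.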

Realism Loss.

The realism loss contains a segmentation loss, reconstruction loss, cycle loss, perceptual loss, and adversarial loss. Specifically, M_low and M_r in the SFF module are both trained under the guidance of a SOTA face segmentation network, HRNet [26]. We dilate the masks of the target image to remove the limitation on face shape change and obtain M_t. The segmentation loss is formulated as:

L_seg = ||Resize(M_t) − M_low||_1 + ||M_t − M_r||_1

where Resize means the resize operation to the low-level feature resolution.

If I_s and I_t share the same identity, the predicted image should be the same as I_t. So we use a reconstruction loss to give pixel-wise supervision on such pairs:

L_rec = ||I_r − I_t||_1 + ||I_low − Resize(I_t)||_1
A cycle process can be conducted in the face swapping task as well. We take I_r as the re-target image and the original target image I_t as the re-source image. In the cycle process, we hope to generate a result with the re-source image's identity and the re-target image's attributes, which means it should be the same as the original target image. The cycle loss is a supplement of pixel-wise supervision and helps generate high-fidelity results:

L_cyc = ||I_t − G(I_t, I_r)||_1

where G(source, target) means the whole generator of HifiFace.

To capture fine details and further improve the realism, we follow the Learned Perceptual Image Patch Similarity (LPIPS) loss in [28] and the adversarial objective in [5]. Thus, our realism loss is formulated as:

L_realism = L_seg + λ_rec L_rec + λ_cyc L_cyc + λ_lpips L_lpips + λ_adv L_adv

where the λ terms are the corresponding loss weights.

Figure 4: Comparison with FSGAN, SimSwap and FaceShifter. Our results can well preserve the source face shape, target attributes and have higher image quality, even when handling occlusion cases.

Overall Loss.

Our full loss is summarized as follows:

L = L_sid + L_realism
4 Experiments

Table 1: Quantitative experiments on FaceForensics++, reporting ID retrieval, pose error, face shape error, MACs, and FPS for FaceSwap, SimSwap, FaceShifter, and Ours. FPS is tested on a V100 GPU.
Figure 5: (a) Comparison with AOT. (b) Comparison with DF.

Implementation Details.

We choose VGGFace2 [3] and Asian-Celeb [6] as the training sets. For our base model Ours-256, we remove images whose resolution is too low for good image quality. For each image, we align the face using landmarks and crop it as in [17], so that the crop contains the whole face and some background regions. For our more precise model Ours-512, we adopt a portrait enhancement network [18] to raise the resolution of the training images to 512×512 as supervision, and correspondingly add another res-block in the Upsample Module of SFF compared to Ours-256. A fixed ratio of the training pairs share the same identity, so that the reconstruction loss can be applied. ADAM [16] is used for optimization, and the model is trained on V100 GPUs.

4.1 Qualitative Comparisons

First, we compare our method with FSGAN [21], SimSwap [4], and FaceShifter [17] in Figure 4, and with AOT [29] and DeeperForensics (DF) [14] in Figure 5.

As shown in Figure 4, FSGAN keeps the face shape of the target face and cannot transfer the lighting of the target image well. SimSwap cannot well preserve the identity of the source image, especially the face shape, because it uses a feature matching loss and focuses more on the attributes. FaceShifter exhibits strong identity preservation, but it has two limitations: (1) attribute recovery, where our HifiFace better preserves all the attributes such as face color, expression, and occlusion; and (2) a complex two-stage framework, whereas HifiFace is a more elegant end-to-end framework with even better recovered images. As shown in Figure 5(a), AOT is specially designed to overcome the lighting problem but is weak in identity similarity and fidelity. As shown in Figure 5(b), DF reduces the bad cases of style mismatch, but is weak in identity similarity too. In contrast, our HifiFace not only preserves the lighting and face style, but also captures the face shape of the source image and generates high quality swapped faces. More results can be found in the supplementary material.

Table 2: Results in terms of AUC and AP on FF++ and DFDC.
Figure 6: Face shape error of HifiFace and FaceShifter on FF++ pairs with large shape differences. Samples are sorted by the shape error of HifiFace; the same column index indicates the same source/target pair.

4.2 Quantitative Comparisons

Next, we conduct quantitative comparisons on the FaceForensics++ (FF++) [23] dataset with respect to the following metrics: ID retrieval, pose error, face shape error, and performance against face forgery detection algorithms, to further demonstrate the effectiveness of HifiFace. For FaceSwap [9] and FaceShifter, we evenly sample frames from each video to compose the test set. For SimSwap and our HifiFace, we generate face swapping results with the same source and target pairs.

For ID retrieval and pose error, we follow the same setting as [17, 4]. As shown in Table 1, HifiFace achieves the best ID retrieval score and is comparable with the others in pose preservation. For the face shape error, we use another 3D face reconstruction model [24] to regress the coefficients of each test face; the error is computed as the distance between the identity coefficients of the swapped face and those of its source face, and HifiFace achieves the lowest face shape error. The parameter and speed comparisons are also shown in Table 1: HifiFace is faster than FaceShifter while delivering higher generation quality.
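The ID retrieval metric can be sketched as a nearest-neighbor search over identity embeddings: each swapped face is embedded and matched against a gallery of source embeddings by cosine similarity. The gallery construction below is a simplified stand-in for the actual evaluation protocol.

```python
import numpy as np

def id_retrieval(swapped_emb, source_gallery):
    """For each swapped-face embedding, retrieve the nearest source
    identity by cosine similarity; return the retrieval accuracy,
    assuming swapped_emb[i] should match source_gallery[i]."""
    a = swapped_emb / np.linalg.norm(swapped_emb, axis=1, keepdims=True)
    b = source_gallery / np.linalg.norm(source_gallery, axis=1, keepdims=True)
    pred = (a @ b.T).argmax(axis=1)      # nearest gallery identity
    return float((pred == np.arange(len(a))).mean())

rng = np.random.default_rng(5)
gallery = rng.normal(size=(10, 512))     # toy source-identity gallery
# perfect swaps: result embeddings are noisy copies of the sources
swapped = gallery + 0.01 * rng.normal(size=gallery.shape)
print(id_retrieval(swapped, gallery))    # 1.0
```

A higher score means the swapped faces are more reliably recognized as their source identities, which is how the ID retrieval column in Table 1 should be read.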

To further illustrate the ability of HifiFace to control the face shape, we visualize the sample-wise shape differences between HifiFace and FaceShifter [17] in Figure 6. The results show that, when the source and target differ greatly in face shape, HifiFace significantly outperforms FaceShifter, with 95% of samples having smaller shape errors.

Besides, we apply the detection models from FF++ [23] and the DeepFake Detection Challenge (DFDC) [8, 25] to examine the realism of HifiFace. The test set contains swapped faces and an equal number of real faces from FF++ for each method. As shown in Table 2, HifiFace achieves the best score, indicating higher fidelity, which can in turn help improve face forgery detection.

Figure 7: Ablation study for 3D shape-aware identity extractor.
Figure 8: Ablation study for SFF module.
Figure 9: Comparison with results using directly mask blending. ‘Blend-T’, ‘Blend-DT’, and ‘Blend-R’ mean blending bare results to the target image by the mask of target, the dilated mask of target and the mask of bare results, respectively.
Figure 10: Difference feature maps of SFF.

4.3 Analysis of HifiFace

3D Shape-Aware Identity.

To verify the effectiveness of the shape supervision on the face shape, we train another model that replaces the shape-aware identity vector with the normal identity vector from F_id. As shown in Figure 7, this model can hardly change the face shape, or produces obvious artifacts, while Ours-256 generates results with a much more similar face shape.

Semantic Facial Fusion.

To verify the necessity of the SFF module, we compare with three baseline models: (1) 'Bare', which removes both the feature-level and image-level fusion; (2) 'Blend', which removes the feature-level fusion; (3) 'Concat', which replaces the feature-level fusion with a concatenation. As shown in Figure 8, 'Bare' cannot preserve the background and occlusion well, 'Blend' lacks legibility, and 'Concat' is weak in identity similarity, which proves that the SFF module helps preserve the attributes and improve the image quality without harming the identity.

Face Shape Preservation in Face Swapping.

Face shape preservation is quite difficult in face swapping, not only because of the difficulty of obtaining shape information, but also because of the challenge of inpainting when the face shape has changed. Blending is a valid way to preserve occlusion and background, but it is hard to apply when the face shape changes. As shown in Figure 9, when the source face is fatter than the target face, Blend-T may limit the change of face shape, while Blend-DT and Blend-R cannot handle the occlusion well. When the source face is thinner than the target, Blend-T and Blend-DT easily bring artifacts around the face, and Blend-R may cause a double face. In contrast, our HifiFace applies the blending without these issues, because our SFF module can inpaint the edge of the predicted mask.

To further illustrate how SFF addresses the problem, we show the difference feature maps at every stage of the SFF module between two inputs: the swapping pair (I_s, I_t), which produces the result of Ours-256, and the reconstruction pair (I_t, I_t), which reproduces the target itself. In Figure 10, the bright areas indicate where the face shape changes or artifacts appear. The SFF module recombines the features between the face region and the non-face area and focuses on the contour of the predicted mask, which greatly benefits the inpainting of areas where the shape changes.

5 Conclusions

In this work, we propose a high fidelity face swapping method, named HifiFace, which can well preserve the face shape of the source face and generate photo-realistic results. A 3D shape-aware identity extractor is proposed to help preserve the identity, including the face shape. An SFF module is proposed to achieve a better combination at the feature level and image level for realistic image generation. Extensive experiments demonstrate that our method generates higher fidelity results than previous SOTA face swapping methods, both quantitatively and qualitatively. Last but not least, HifiFace can also serve as a sharp spear that contributes to the development of the face forgery detection community.


  • [1] O. Alexander, M. Rogers, W. Lambeth, M. Chiang, and P. Debevec (2009) Creating a photoreal digital actor: the digital emily project. In 2009 Conference for Visual Media Production, pp. 176–187. Cited by: §1.
  • [2] V. Blanz and T. Vetter (1999) A morphable model for the synthesis of 3d faces. In Proceedings of the 26th annual conference on Computer graphics and interactive techniques, pp. 187–194. Cited by: §2.
  • [3] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman (2018) Vggface2: a dataset for recognising faces across pose and age. In FG, pp. 67–74. Cited by: §4.
  • [4] R. Chen, X. Chen, B. Ni, and Y. Ge (2020) SimSwap: an efficient framework for high fidelity face swapping. In Proceedings of the 28th ACM International Conference on Multimedia, pp. 2003–2011. Cited by: §1, §2, §2, §4.1, §4.2.
  • [5] Y. Choi, Y. Uh, J. Yoo, and J. Ha (2020) Stargan v2: diverse image synthesis for multiple domains. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8188–8197. Cited by: §3.3.
  • [6] DeepGlint (2020) Http:// Accessed: 2020-12-20. Cited by: §4.
  • [7] Y. Deng, J. Yang, S. Xu, D. Chen, Y. Jia, and X. Tong (2019) Accurate 3d face reconstruction with weakly-supervised learning: from single image to image set. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 0–0. Cited by: §3.1, §3.3.
  • [8] B. Dolhansky, R. Howes, B. Pflaum, N. Baram, and C. C. Ferrer (2019) The deepfake detection challenge (dfdc) preview dataset. arXiv preprint arXiv:1910.08854. Cited by: §4.2.
  • [9] FaceSwap (2020) Https:// tree/master/dataset/faceswapkowalski. Accessed: 2020-12-20. Cited by: §4.2.
  • [10] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §2.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §3.2, §3.
  • [12] Y. Huang, Y. Wang, Y. Tai, X. Liu, P. Shen, S. Li, J. Li, and F. Huang (2020) Curricularface: adaptive curriculum learning loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5901–5910. Cited by: §3.1.
  • [13] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125–1134. Cited by: §2.
  • [14] L. Jiang, R. Li, W. Wu, C. Qian, and C. C. Loy (2020) Deeperforensics-1.0: a large-scale dataset for real-world face forgery detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2886–2895. Cited by: §1, §1, §2, §4.1.
  • [15] T. Karras, S. Laine, and T. Aila (2019) A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4401–4410. Cited by: §3.
  • [16] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.
  • [17] L. Li, J. Bao, H. Yang, D. Chen, and F. Wen (2019) Faceshifter: towards high fidelity and occlusion aware face swapping. arXiv preprint arXiv:1912.13457. Cited by: §1, §2, §2, §4, §4.1, §4.2, §4.2.
  • [18] X. Li, C. Chen, S. Zhou, X. Lin, W. Zuo, and L. Zhang (2020) Blind face restoration via deep multi-scale component dictionaries. In European Conference on Computer Vision, pp. 399–415. Cited by: §4.
  • [19] J. Liu, W. Li, H. Pei, Y. Wang, F. Qu, Y. Qu, and Y. Chen (2019) Identity preserving generative adversarial network for cross-domain person re-identification. IEEE Access 7, pp. 114021–114032. Cited by: §1, §2.
  • [20] R. Natsume, T. Yatagawa, and S. Morishima (2018) Fsnet: an identity-aware generative model for image-based face swapping. In Asian Conference on Computer Vision, pp. 117–132. Cited by: §3.2.
  • [21] Y. Nirkin, Y. Keller, and T. Hassner (2019) Fsgan: subject agnostic face swapping and reenactment. In ICCV, Cited by: §1, §1, §1, §2, §3.2, §4.1.
  • [22] Y. Nirkin, I. Masi, A. T. Tuan, T. Hassner, and G. Medioni (2018) On face segmentation, face swapping, and face perception. In FG, pp. 98–105. Cited by: §1, §1, §2.
  • [23] A. Rossler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner (2019) Faceforensics++: learning to detect manipulated facial images. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1–11. Cited by: §4.2, §4.2.
  • [24] S. Sanyal, T. Bolkart, H. Feng, and M. J. Black (2019) Learning to regress 3d face shape and expression from an image without 3d supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7763–7772. Cited by: §4.2.
  • [25] selimsef (2020) Https:// fake_challenge. Accessed: 2021-01-10. Cited by: §4.2.
  • [26] K. Sun, Y. Zhao, B. Jiang, T. Cheng, B. Xiao, D. Liu, Y. Mu, X. Wang, W. Liu, and J. Wang (2019) High-resolution representations for labeling pixels and regions. arXiv preprint arXiv:1904.04514. Cited by: §3.3.
  • [27] J. Thies, M. Zollhofer, M. Stamminger, C. Theobalt, and M. Nießner (2016) Face2face: real-time face capture and reenactment of rgb videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2387–2395. Cited by: §2.
  • [28] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang (2018) The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 586–595. Cited by: §3.3.
  • [29] H. Zhu, C. Fu, Q. Wu, W. Wu, C. Qian, and R. He (2020) AOT: appearance optimal transport based identity swapping for forgery detection. Advances in Neural Information Processing Systems 33. Cited by: §1, §4.1.

Network Structures

Detailed structures of our HifiFace are given in Figure 12. For all residual units, we use Leaky ReLU (LReLU) as the activation function. Resample means Average Pooling or Upsampling, which changes the size of the feature maps. Res-blocks with Instance Normalization (IN) are used in the encoder, while res-blocks with Adaptive Instance Normalization (AdaIN) are used in the decoder.

More Results

To analyse the specific impacts of the shape information from the 3D face reconstruction model and the identity information from the face recognition model, we adjust the composition of the SID vector to generate interpolated results:

α_i = k · α_s + (1 − k) · α_t,    v_i = k' · v_s + (1 − k') · v_t

where α_s, α_t, and α_i mean the 3D identity coefficients of the source, target, and interpolated images; v_s, v_t, and v_i mean the identity vectors of the source, target, and interpolated images from the recognition model; and k and k' are interpolation factors.
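The interpolation above can be sketched directly; the factor names k_shape and k_id are hypothetical labels for the two interpolation knobs, chosen here for readability.

```python
import numpy as np

def interpolate_sid(alpha_s, alpha_t, v_s, v_t, k_shape, k_id):
    """Interpolate the two parts of the 3D shape-aware identity
    independently: shape coefficients with k_shape, recognition
    embedding with k_id (k = 0 gives the target, k = 1 the source)."""
    alpha_i = k_shape * alpha_s + (1.0 - k_shape) * alpha_t
    v_i = k_id * v_s + (1.0 - k_id) * v_t
    return alpha_i, v_i

rng = np.random.default_rng(6)
a_s, a_t = rng.normal(size=80), rng.normal(size=80)
v_s, v_t = rng.normal(size=512), rng.normal(size=512)
# full source shape, full target identity texture
a_i, v_i = interpolate_sid(a_s, a_t, v_s, v_t, k_shape=1.0, k_id=0.0)
print(np.allclose(a_i, a_s), np.allclose(v_i, v_t))  # True True
```

Sweeping one factor while holding the other fixed reproduces the ablation described next: the shape knob moves the face geometry, the identity knob moves the identity texture.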

As shown in Figure 11, we first fix the identity vector to v_t and interpolate only the shape coefficients: the face shape can still change, but the result lacks identity detail. We then fix the shape coefficients to α_s and interpolate the identity vector, and the identity becomes more similar. The results show that the shape information controls the basic shape and identity, while the identity vector contributes the identity texture.

In the end, we collect many wild face images from the Internet and generate more face swapping results in Figure 13 and Figure 14 to demonstrate the strong capability of our method.

Figure 12: Architectural details of HifiFace.
Figure 13: More results on high resolution wild faces. The face in the target image is replaced by the face in the source image.
Figure 14: Some results on high quality real world photos. The face in the target image is replaced by the face in the source image. This is to show that the faces generated by our method can be very naturally integrated into high-resolution shot real world scenes.