DeepFaceLab: A simple, flexible and extensible face swapping framework

05/12/2020 ∙ by Ivan Perov, et al.

DeepFaceLab is an open-source deepfake system created by iperov for face swapping, with more than 3,000 forks and 13,000 stars on GitHub. It provides an imperative and easy-to-use pipeline that people can use without a comprehensive understanding of deep learning frameworks or model implementation, while remaining a flexible, loosely coupled structure for people who need to extend the pipeline with other features without writing complicated boilerplate code. In this paper, we detail the principles that drive the implementation of DeepFaceLab and introduce its pipeline, every aspect of which can be modified painlessly by users to achieve their customization purposes. Notably, DeepFaceLab can achieve results with high fidelity that are indeed indiscernible by mainstream forgery detection approaches. We demonstrate the advantage of our system by comparing our approach with current prevailing systems. For more information, please visit: https://github.com/iperov/DeepFaceLab/.


1 Introduction

Since deep learning has empowered the realm of computer vision in recent years, the manipulation of digital images, especially human portrait images, has improved rapidly and achieves photorealistic results in most cases. Face swapping is an eye-catching task in generating fake content: it transfers a source face to the destination while maintaining the facial movements and expression deformations of the source.

The key technique behind face manipulation is Generative Adversarial Networks (GANs) [8]. Faces synthesized by StyleGAN [14] and StyleGAN2 [15] are becoming increasingly realistic and are often indistinguishable to the human visual system.

Numerous spoof videos synthesized by GAN-based face swapping methods have been published on YouTube and other video websites. Commercial mobile applications such as ZAO (https://apps.apple.com/cn/app/id1465199127) and FaceApp (https://apps.apple.com/gb/app/faceapp-ai-face-editor/id1180884341), which allow ordinary netizens to create fake images and videos effortlessly, have greatly boosted the spread of these swapping techniques, known as deepfakes. MrDeepFakes, the best-known forum for discussing both cutting-edge progress in deepfake technology and the skills needed to produce delicate face swapping videos, further accelerates the spread of deepfake-made videos on the Internet.

These content generation and modification technologies may affect the quality of public discourse and the safeguarding of human rights especially given that deepfakes may be used maliciously as a source of misinformation, manipulation, harassment, and persuasion. Identifying manipulated media is a technically demanding and rapidly evolving challenge that requires collaborations across the entire tech industry and beyond.

Research on media anti-forgery detection is being invigorated, with growing effort dedicated to detecting forged faces. A typical example is the DFDC (https://deepfakedetectionchallenge.ai/), a million-dollar challenge launched by Facebook and Microsoft in 2019.

However, passive defense is never a good idea for detecting deepfakes. In our view, for both academia and the general public, it is better to know what a deepfake is and how it can produce a photorealistic video with the source person changed to the target person, rather than merely defending against it passively; as the old saying goes, "the best defence is a good offense." Making general netizens aware of the existence of deepfakes and strengthening their ability to identify spoofed media published on social networks is much more important than agonizing over whether a given piece of media is genuine or not.

To the best of our knowledge, Synthesizing Obama [26], FSGAN [20] and FaceShifter [17] are the most representative works on synthesizing facially manipulated video. The problem with these and other related works is that their authors do not fully open-source their code, or release only part of it, and reproductions by the open-source community rarely achieve results as convincing as those demonstrated in the papers. "The devil is in the details" is a motto we all acknowledge when training generative models. Since face swapping algorithms have become increasingly complicated, with more and more intricate processing procedures inserted, it seems unrealistic to fully reproduce an excellent face swapping algorithm based merely on the papers.

Furthermore, these algorithms or systems require difficult hand-picked operations or specific conditions to a greater or lesser extent, which raises the threshold for beginners who want to dig deeper. For example, Synthesizing Obama [26] needs a high-quality 3D model of Obama and a canonical manually drawn mask, which means that when you change to a new video clip or reselect the source person, you need to build a new 3D model and draw a new canonical mask for texture synthesis. Obviously, carrying out such a plan is burdensome.

As a complete pipeline for generating fake digital content, more components than face swapping alone are required to fill out the framework: a face detection module, face recognition module, face alignment module, face parsing module, face blending module, etc. Current works with incomplete pipelines somewhat hinder progress in this area and increase the cost of learning for many novices.

To address these points, DeepFakes [4] introduced a complete production pipeline for replacing a source person’s face with the target person’s while keeping the same facial expression, such as eye movement and facial muscle movement. However, the results produced by DeepFakes are somewhat poor, as are the results of Nirkin’s automatic face swapping [21].

This paper introduces DeepFaceLab, an easy-to-use, open-source system with a clean-slate pipeline design, which can achieve photorealistic face swapping results without painful tuning. DeepFaceLab has turned out to be very popular with the public. For instance, many artists create DeepFaceLab-based videos and publish them on their YouTube channels; the five most popular of these channels average over 200,000 subscribers, and their DeepFaceLab-made videos have accumulated over 100 million views in total.

The contributions of DeepFaceLab can be summarized as follows:

  • A state-of-the-art framework with a mature pipeline is proposed, aiming to achieve photorealistic face swapping results.

  • DeepFaceLab open-sourced its code in 2018 and has kept up with progress in the computer vision area ever since, making a positive contribution to defending against deepfakes both actively and passively; it has drawn broad attention in the open-source community and in VFX circles.

  • Several high-efficiency components and tools are introduced in DeepFaceLab so that users gain more flexibility in the DeepFaceLab workflow and can spot problems in time.

2 Characteristics of DeepFaceLab

DeepFaceLab’s success stems from weaving previous ideas into a design that balances speed and ease of use, and from the rapid progress of computer vision in face recognition, alignment, reconstruction, and segmentation. There are four main characteristics behind our implementation:

Leras

DeepFaceLab now provides a new high-level deep learning framework built on pure TensorFlow [1], which aims to remove the unnecessary restrictions and extra overhead introduced by commonly used high-level frameworks such as Keras [3] and plaidML [29]. iperov named it Leras, an abbreviation of "Lighter Keras". The main advantages of Leras are:

  • Simple and flexible model construction

    Leras alleviates the burden on researchers and practitioners by offering a Pythonic style for model work, similar to PyTorch (i.e., defining layers, composing neural models, writing optimizers), but in graph mode (no eager execution); a minimal sketch in this style follows this list.

  • Performance-focused implementation

    By using Leras instead of Keras, training time is reduced by roughly 10–20% on average.

  • Fine-granularity tensor management

    The motivation for switching to pure TensorFlow is that Keras and plaidML are not flexible enough; in addition, they are largely outdated and do not give full control over how tensors are processed.
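To make the "Pythonic, graph-mode" style concrete, the toy sketch below mimics the flavour described above; it is not Leras' actual API (class names and helpers here are ours), just plain TensorFlow 1.x-style code in which layers are ordinary Python objects and the graph is executed later in a session.

```python
import tensorflow as tf

tf.compat.v1.disable_eager_execution()  # graph mode, no eager execution

class Dense:
    """Toy fully connected layer: a Python object holding TF variables."""
    def __init__(self, in_ch, out_ch, name):
        self.w = tf.compat.v1.get_variable(name + "_w", shape=[in_ch, out_ch])
        self.b = tf.compat.v1.get_variable(name + "_b", shape=[out_ch],
                                           initializer=tf.zeros_initializer())
    def __call__(self, x):
        return tf.matmul(x, self.w) + self.b

class TinyEncoder:
    """Layers are composed imperatively, PyTorch-style, but build a static graph."""
    def __init__(self):
        self.fc1 = Dense(256, 128, "enc_fc1")
        self.fc2 = Dense(128, 64, "enc_fc2")
    def __call__(self, x):
        return self.fc2(tf.nn.relu(self.fc1(x)))

x = tf.compat.v1.placeholder(tf.float32, [None, 256])
z = TinyEncoder()(x)  # defines the graph; it runs later inside a tf.compat.v1.Session
```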

Put users first

DeepFaceLab strives to make every part of its pipeline, including data loading and processing, model training and post-processing, as easy and productive as possible. Unlike other face swapping systems, DeepFaceLab provides a complete set of command-line tools, and every stage of the pipeline can be executed in the way users choose. Notably, the inherent complexity, as well as many hand-picked features for fine-grained control such as the canonical facial landmarks used for alignment, is handled internally and hidden behind DeepFaceLab. In other words, people can achieve smooth and photorealistic face swapping results without hand-picking features if they follow the default workflow settings: all they need are two folders, the source (src) and the destination (dst), without pairing the same facial expressions between src and dst. To some extent, DeepFaceLab functions like a point-and-shoot camera.

Furthermore, according to practical feedback from DeepFaceLab users, a highly flexible and customizable face converter is needed, since many complex situations must be handled: floodlights, rain, faces behind glass, face injuries and many other cases. Hence, an interactive mode has been added to the conversion phase, which relieves the workload of deepfake producers: the interactive preview lets them observe the effect of every change they make while adjusting options and enabling/disabling features.

Engineering support

To exploit the full potential of the CPU and GPU, several pragmatic measures were added to improve performance: multi-GPU support, half-precision training, use of pinned CUDA memory to improve throughput, and use of multiple threads to accelerate graphics operations and data processing.

Extensibility and Scalability

To strengthen the flexibility of the DeepFaceLab workflow and attract the interest of the research community, users are free to replace any component that does not meet the needs or performance requirements of their project, since most of DeepFaceLab’s modules are designed to be interchangeable. For instance, one could plug in a new face detector to achieve high performance on faces with extreme angles or in distant areas. A common case is that experienced DeepFaceLab users customize their network structure and training paradigm, e.g., the progressive training paradigm of PGGAN [13] combined with the loss designs of LSGAN [19] or WGAN-GP [9].

3 Pipeline

In DeepFaceLab (DFL for short), we abstract the pipeline into three main components: Extraction, Training and Conversion, which are presented sequentially below. A noteworthy point is that DFL follows a typical one-to-one face swapping paradigm, which means there are only two data folders, src and dst (the abbreviations for source and destination used throughout the following narrative). Furthermore, unlike prior work, we can generate high-resolution images and generalize to varying input resolutions.

3.1 Extraction

Figure 2: Overview of Extraction in DeepFaceLab (DFL).

Extraction is the first phase in DFL and contains several algorithms and processing steps, i.e., face detection, face alignment, and face segmentation. After Extraction, the user obtains aligned faces with precise masks and facial landmarks from the input data folder; src is used for illustration in this part. In addition, DFL provides several face types (i.e., half face, full face, whole face), which represent the face coverage area of Extraction. Unless stated otherwise, full face is assumed by default.

Face Detection

The first step in Extraction is to find the target face in the given folders, src and dst. DFL uses S3FD [34] as its default face detector. Naturally, any other face detection algorithm can replace S3FD for a specific target, e.g., RetinaFace [5] or MTCNN [33].
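Because the detector is interchangeable, swapping in a different one essentially means supplying an object with the same detection contract. A hypothetical sketch of such an interface is shown below; the class and method names are ours, not DFL's actual extractor classes.

```python
from abc import ABC, abstractmethod
from typing import List, Tuple
import numpy as np

class FaceDetector(ABC):
    """Hypothetical plug-in interface for face detectors."""

    @abstractmethod
    def detect(self, image: np.ndarray) -> List[Tuple[int, int, int, int]]:
        """Return (x1, y1, x2, y2) boxes for every face found in an RGB image."""

class MyCustomDetector(FaceDetector):
    def __init__(self, model):
        self.model = model  # e.g. a user-supplied RetinaFace or MTCNN wrapper

    def detect(self, image):
        # placeholder call: the wrapped model is assumed to return raw boxes
        return [tuple(map(int, box)) for box in self.model(image)]
```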

Face Alignment

The second step is face alignment. After numerous experiments and failures, we concluded that the facial landmark algorithm must remain stable over time, which is essential for producing a successful, temporally consistent footage shot or film.

DFL provides two canonical types of facial landmark extraction algorithm to solve this: (a) the heatmap-based facial landmark algorithm 2DFAN [2] (for faces with normal posture) and (b) PRNet [6] with 3D face prior information (for faces with a large Euler angle (yaw, pitch, roll); e.g., a face with a large yaw angle means one side of the face is out of sight). After the facial landmarks are retrieved, we provide an optional function with a configurable time step to smooth the facial landmarks of consecutive frames within a single shot, sketched below.
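The smoothing idea can be illustrated with a simple moving average over per-frame landmarks; this is only an illustrative sketch (function name and default window are ours), not DFL's exact filter.

```python
import numpy as np

def smooth_landmarks(landmarks_seq: np.ndarray, window: int = 5) -> np.ndarray:
    """Temporally smooth per-frame facial landmarks within one shot.
    landmarks_seq: float array of shape (T, 68, 2); window: temporal window size."""
    T = len(landmarks_seq)
    out = np.empty_like(landmarks_seq)
    half = window // 2
    for t in range(T):
        lo, hi = max(0, t - half), min(T, t + half + 1)
        out[t] = landmarks_seq[lo:hi].mean(axis=0)  # average landmarks of neighbouring frames
    return out
```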

Then we adopt a classical point pattern mapping and transformation method proposed by Umeyama [28] to calculate a similarity transformation matrix used for face alignment.

As the Umeyama [28] method needs standard facial landmark templates to calculate the similarity transformation matrix, DFL provides three canonical aligned facial landmark templates: a front view and two side views (left and right). Notably, DFL automatically determines the Euler angle from the obtained facial landmarks, which helps Face Alignment choose the right template without any manual intervention.
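For reference, a compact NumPy sketch of the Umeyama similarity estimation is given below; the function name and interface are ours, but the algorithm is the standard least-squares estimation of scale, rotation and translation between two point sets [28].

```python
import numpy as np

def umeyama_similarity(src_pts: np.ndarray, dst_pts: np.ndarray) -> np.ndarray:
    """Estimate the 3x3 similarity transform (scale, rotation, translation)
    mapping src_pts -> dst_pts, both of shape (N, 2)."""
    src_mean, dst_mean = src_pts.mean(axis=0), dst_pts.mean(axis=0)
    src_c, dst_c = src_pts - src_mean, dst_pts - dst_mean
    cov = dst_c.T @ src_c / len(src_pts)               # 2x2 cross-covariance
    U, S, Vt = np.linalg.svd(cov)
    d = np.sign(np.linalg.det(U) * np.linalg.det(Vt))  # reflection correction
    D = np.diag([1.0, d])
    R = U @ D @ Vt                                     # optimal rotation
    var_src = (src_c ** 2).sum() / len(src_pts)
    scale = np.trace(np.diag(S) @ D) / var_src         # isotropic scale
    t = dst_mean - scale * R @ src_mean                # translation
    M = np.eye(3)
    M[:2, :2] = scale * R
    M[:2, 2] = t
    return M  # apply with e.g. cv2.warpAffine(image, M[:2], (W, H)); invertible for Conversion
```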

Face Segmentation

After face alignment, a data folder of standard front/side-view aligned src faces is obtained. We employ a fine-grained face segmentation network (TernausNet [10]) on top of aligned src, through which a face with hair, fingers or glasses can be segmented exactly. This step is optional but useful: it is designed to remove irregular occlusions so that the network in the Training phase stays robust to hands, glasses and any other objects that may partially cover the face.

However, since even state-of-the-art face segmentation models fail to generate fine-grained masks in some particular shots, the XSeg model was introduced in DFL. XSeg allows everyone to train their own model for segmenting a specific faceset (aligned src or aligned dst) through a few-shot learning paradigm. For instance, for a faceset of around 2,000 pictures, it is enough to mark the most representative 50-100 samples manually. XSeg is then trained on these manually labeled pairs until it achieves the desired segmentation quality and generalizes to the whole faceset.

To be clear, XSeg (optional) is only necessary when the whole_face type is used, or when obstructions must be removed from the mask with the full_face type. A sketch of XSeg is given in Section 4.

Once the above workflow has been executed sequentially, we have everything DFL needs in the next stage (Training): cropped faces with their corresponding coordinates in the original images, facial landmarks, aligned faces, and pixel-wise segmentation masks from src (the extraction procedure for dst is identical, so there is no need to elaborate it in detail).

3.2 Training

(a) DF structure
(b) LIAE structure
Figure 3: Overview of Training in DeepFaceLab (DFL). The DF and LIAE structures are both shown for illustration; ⊕ represents the concatenation of latent vectors.

Training plays the most vital role in achieving photorealistic face swapping results in DeepFaceLab.

Since the facial expressions of aligned src and aligned dst are not required to be strictly matched, we aim to design an easy and efficient algorithm paradigm that solves this unpaired problem while maintaining high fidelity and perceptual quality of the generated face. As shown in Figure 3(a), DF consists of an Encoder and an Inter whose weights are shared between src and dst, plus a separate Decoder for each of src and dst. Generalization across src and dst is achieved through the shared Encoder and Inter, which readily solves the aforementioned unpaired problem.

The latent codes of src and dst, denoted z_src and z_dst, are both extracted by the shared Inter.

As depicted in Figure 3(b), LIAE is a more complex structure with a shared-weight Encoder and Decoder and two independent Inter models. Another difference compared to DF is that InterAB is used to generate latent codes of both src and dst, while InterB only outputs the latent code of dst. Here, z_AB_src denotes the latent code of src produced by InterAB, and we generalize this notation to z_AB_dst and z_B_dst.

After obtaining all the latent codes from InterAB and InterB, LIAE concatenates these feature maps along the channel dimension: Concat(z_AB_src, z_AB_src) is obtained as the new latent representation of src, and Concat(z_B_dst, z_AB_dst) is formed for dst in the same way.

These concatenated codes are then fed into the Decoder, and we obtain the predicted src (dst) faces together with their masks. The motivation for these concatenations is to shift the latent code in the direction of the class (src or dst) we need, through which InterAB obtains a compact and well-aligned representation of src and dst in the latent space.
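For concreteness, the structural sketch below summarizes the two training graphs as we read them from Figure 3; it is illustrative pseudo-code rather than DFL's actual implementation, and the encoder/inter/decoder callables are assumed to be defined elsewhere.

```python
import tensorflow as tf

def df_forward(x_src, x_dst, encoder, inter, decoder_src, decoder_dst):
    """DF: shared Encoder + Inter, one Decoder per identity."""
    z_src = inter(encoder(x_src))
    z_dst = inter(encoder(x_dst))
    return decoder_src(z_src), decoder_dst(z_dst)

def liae_forward(x_src, x_dst, encoder, inter_ab, inter_b, decoder):
    """LIAE: shared Encoder and Decoder, two Inter models (InterAB, InterB)."""
    z_ab_src = inter_ab(encoder(x_src))
    z_ab_dst = inter_ab(encoder(x_dst))
    z_b_dst = inter_b(encoder(x_dst))
    z_src = tf.concat([z_ab_src, z_ab_src], axis=-1)  # src latent: InterAB output duplicated
    z_dst = tf.concat([z_b_dst, z_ab_dst], axis=-1)   # dst latent: InterB + InterAB outputs
    return decoder(z_src), decoder(z_dst)
```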

Besides the model structure, several tricks are effective for improving the quality of the generated face. Inspired by PRNet [7], and driven by the need to make full use of the face mask and landmarks, a weighted-sum mask can be applied to the general SSIM [30] loss so that each part of the face carries a different weight under the AE training architecture; for example, we put more weight on the eye area than on the cheeks, which pushes the network to concentrate on generating a face with vivid eyes.

As for losses, DFL uses a mixed loss (DSSIM (structural dissimilarity) [18] + MSE) by default. The reason for this combination is to get the benefits of both: DSSIM generalizes human faces faster, while MSE provides better clarity, so the combined loss is a compromise between generalization and clarity; a minimal sketch is given below.
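The following sketch shows one way to implement such a mixed loss in TensorFlow, with an extra MSE weight on the eye region as described above; the function name, relative weights and the eye_mask input are our illustrative assumptions, not DFL's exact code.

```python
import tensorflow as tf

def mixed_face_loss(y_true, y_pred, eye_mask, eye_weight=3.0):
    """DSSIM + MSE with extra weight on the eye region.
    y_true, y_pred: (B, H, W, 3) tensors in [0, 1]; eye_mask: (B, H, W, 1) in {0, 1}."""
    # structural dissimilarity, averaged over the batch
    dssim = tf.reduce_mean((1.0 - tf.image.ssim(y_true, y_pred, max_val=1.0)) / 2.0)
    # per-pixel squared error, with eyes weighted more heavily than the rest of the face
    weights = 1.0 + (eye_weight - 1.0) * eye_mask
    mse = tf.reduce_mean(weights * tf.square(y_true - y_pred))
    return dssim + mse  # the relative weighting of the two terms is illustrative only
```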

To spare users from writing too much boilerplate code, we reduce the burden of designing their own training paradigm or network structure. Specifically, users can add an extra intermediate model to mix the latent representations of src and dst (i.e., LIAE), or, when training with a GAN paradigm, place a self-customized discriminator (e.g., the multi-scale discriminator [12] or the RealnessGAN discriminator [31]) after the decoder to alleviate the semantic gap of the generated face, particularly when the src and dst datasets are small.

In the case of src2dst (Figure 4), we use the optional "true face" mode so that the generated face bears a better likeness to dst in the Conversion phase. For LIAE, it aims to make the distribution of z_AB_src approach that of z_AB_dst; for DF, the corresponding pair is z_src and z_dst.

Also, unlike the fixed-resolution limitations of deepfakes and other face swapping frameworks, we can generate high-resolution images and generalize to varying output resolutions by adjusting the model-definition settings in the Training part, which is rather easy thanks to DFL’s clean and distinct interface.

Both LIAE and DF support the features above, and these features are designed to be pluggable, further improving the flexibility of the DFL framework. For more details about the design of DF and LIAE, please refer to the Appendix.

3.3 Conversion

Figure 4: Take src2dst and DF structure as an example to illustrate Conversion phase of DFL.

Finally, we come to the Conversion phase. As depicted in Figure 4, users can swap the face of src to dst and vice versa.

In the case of src2dst, the first step of the proposed face swapping scheme in Conversion is to transform the generated face, together with its mask produced by the Decoder, back to the original position of the target image, which is possible because the Umeyama [28] similarity transform is invertible.

The next step is blending, whose goal is to make the re-aligned reenacted face Î_r fit seamlessly into the target image I_t along the outer contour of its mask M. To keep the complexion consistent, DFL provides five color transfer algorithms (e.g., Reinhard color transfer, RCT [24], and iterative distribution transfer, IDT [23]) to make Î_r more adaptable to the target image I_t. On top of that, the blending result I_b can be obtained by combining the two images:

I_b = M ⊙ Î_r + (1 − M) ⊙ I_t,    (1)

where ⊙ denotes element-wise multiplication.

Any blending must account for different skin tones, face shapes and illumination conditions, especially at the junction between the masked region of Î_r and I_t. Here we define our Poisson blending [22] optimization as

min_{I_b} ||M ⊙ (∇I_b − ∇Î_r)||²₂ + ||(1 − M) ⊙ (∇I_b − ∇I_t)||²₂,    (2)

with I_b fixed to I_t outside the mask. It is easy to see from Equation 2 that we only need to minimize the facial part weighted by M, since the term weighted by (1 − M) is then a constant.
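As an illustration of how Equation (1) and a Poisson-style blend can be realized, the sketch below combines the masked composition with OpenCV's seamlessClone; DFL's own converter implements its blending differently, so treat this as an approximate stand-in with our own function name and a hard-thresholded mask.

```python
import cv2
import numpy as np

def blend_face(realigned_face, target_img, mask):
    """realigned_face, target_img: HxWx3 uint8 images; mask: HxW float array in [0, 1] (non-empty)."""
    m = mask[..., None]
    # Eq. (1): masked combination of the re-aligned reenacted face and the target frame
    naive = (m * realigned_face + (1.0 - m) * target_img).astype(np.uint8)
    # Poisson blending to hide seams along the mask contour
    hard_mask = (mask > 0.5).astype(np.uint8) * 255
    ys, xs = np.nonzero(hard_mask)
    center = (int(xs.mean()), int(ys.mean()))
    return cv2.seamlessClone(naive, target_img, hard_mask, center, cv2.NORMAL_CLONE)
```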

Then we come to the last regular step of both the DFL pipeline and the Conversion workflow: sharpening. A pre-trained face super-resolution neural network (denoted FaceEnhancer) is applied to sharpen the blended face, since the faces generated by almost all current state-of-the-art face swapping works are, to a greater or lesser extent, smoothed and lacking in fine detail (e.g., moles, wrinkles).

If all goes well, we obtain an HD fake image: the generated face is placed seamlessly onto the designated part of the target face, its skin tone is adjusted to match the target, and it is fitted back into the original picture according to the coordinates recorded in the Extraction phase. The result is hard to distinguish from the real image, even with the help of frequency-domain analysis.

4 Productivity tools in DeepFaceLab

In general, DFL serves as a productivity tool in video production workflows that involve many difficult face swapping conditions. Consequently, the demand for authenticity of the synthesized fake image is far above that of an ordinary consumer-level product, e.g., high resolution, complex occlusion and bad illumination.

To address these issues, we provide several effective tools for achieving HD fake images of very high fidelity and realism.

(a) Manual face detection and facial landmarks extraction
(b) XSeg: face segmentation based on few-shot learning
Figure 5: Two handy tools in Extraction phase of DeepFaceLab.

Fig 5 shows two commonly used tools in the Extraction phase of DFL. Fig 5(a) is a manual face detection and facial landmark extraction tool designed for faces with extreme Euler angles, where commonly used face detectors and facial landmark extractors fail.

Besides, when jitter appears in the synthesized video, this tool helps the user refine/smooth the facial landmarks of the target face by taking reference from adjacent frames.

Similarly, Fig 5(b) is the XSeg manual face segmentation editor, which gives fine-grained control over the facial mask and is designed to avoid interference from occlusions such as hands, hair, etc.

Figure 6: Preview of the Training phase of DeepFaceLab. Columns one and two show the source face and the corresponding output of the src decoder; columns three and four show the target face and the corresponding output of the dst decoder; the last column represents the src2dst paradigm: target face as input, generated face from the src decoder as output.

During Training, we provide detailed previews so that researchers can test their new ideas without writing any extra code. The loss curves are shown as the yellow and blue lines in Fig 6, representing the loss history of src2src and dst2dst, and provide valuable information for judging whether a particular model architecture works well.

5 Evaluation

In this section we compare the performance of DeepFaceLab with several other commonly-used face swapping frameworks and two state-of-the-art works, and find that DFL has competitive performance among them.

5.1 Noise and error analysis

Since most existing face swapping algorithms share the common step of blending a reenacted face into an existing background image, the problem of image discrepancies and discontinuities across the blending boundaries arises inevitably.

The problem is that such discrepancies are amplified in high-resolution images. We illustrate noise analysis (https://29a.ch/photo-forensics/#noise-analysis) and error level analysis in Figure 7.

(a) Non-DFL face swapping
(b) DFL: same skin colors
(c) DFL: different skin colors
Figure 7: Noise analysis (middle column) and error level analysis (right column) of real and fake images. A rectangular region with an uneven noise distribution is clearly visible in the non-DFL face swapping results, whereas such rectangular patterns are hard to find in DFL’s results.

5.2 Qualitative results

Fig 8(a) shows face swapping results of representative open-source projects (DeepFakes [4], Nirkin et al. [21] and Face2Face [27]) taken from the FaceForensics++ dataset [25]. Examples with different expressions, face shapes and illuminations are selected for our experiment. Observing the video clips from FaceForensics++, it is clear that they were not only trained inadequately but also produced by low-resolution models. To be fair in our comparison, the Quick96 mode is used: a lightweight model with the DF structure underneath, which outputs faces at 96×96 resolution (without … and …). The average training time is restricted to within 3 hours. We use Adam (learning rate = 0.00005, β1 = 0.5, β2 = 0.999) to optimize our model. All of our networks were trained on a single NVIDIA GeForce 1080Ti GPU and an Intel Core i7-870 CPU.

(a) Compare DFL with representative open-source face swapping projects.
(b) Compare DFL with the latest state-of-the-art works.
Figure 8: Qualitative face swapping results on FaceForensics++ [25] face images.

5.3 Quantitative results

FaceForensics++ is also used for the quantitative experiments. In practice, the naturalness and realism of face swapping results are hard to describe with specific quantitative indexes. However, pose and expression do carry valuable information about a face swapping result. In addition, SSIM is used to compare structural similarity, and a perceptual loss [11] is adopted to compare high-level differences, such as content and style discrepancies, between the target subject and the swapped subject.

To measure pose accuracy, we calculate the Euclidean distance between the Euler angles (extracted with FSA-Net [32]) of the swapped face and the target face. The accuracy of the facial expression is measured by the Euclidean distance between the 2D landmarks (2DFAN [2]). We use the default face verification method of DLIB [16] to compare identities.
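Concretely, the pose and expression metrics reduce to simple Euclidean distances; the small sketch below shows the computation under our own function names (the evaluation scripts in DFL-related tooling may organize this differently).

```python
import numpy as np

def pose_error(euler_swapped, euler_target):
    """Euclidean distance between (yaw, pitch, roll) triplets, e.g. extracted by FSA-Net."""
    return float(np.linalg.norm(np.asarray(euler_swapped) - np.asarray(euler_target)))

def landmark_error(lmk_swapped, lmk_target):
    """Mean Euclidean distance between corresponding 2D landmarks (e.g. 68x2 from 2DFAN)."""
    return float(np.linalg.norm(lmk_swapped - lmk_target, axis=1).mean())
```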

For statistical significance, we compute the mean and variance of these measurements on 100 frames (uniformly sampled over time) of each of the first 500 videos in FaceForensics++, averaging them across the videos. Here, DeepFakes [4] and Nirkin et al. [21] are chosen as the baselines. Note that all the videos produced by DeepFaceLab follow the same settings as in Section 5.2.

Method          SSIM          perceptual loss  verification  landmarks    pose
DeepFakes       0.71 ± 0.07   0.41 ± 0.05      0.69 ± 0.04   1.15 ± 1.10  4.75 ± 1.73
Nirkin et al.   0.65 ± 0.08   0.50 ± 0.08      0.66 ± 0.05   0.35 ± 0.18  6.01 ± 3.21
DeepFaceLab     0.73 ± 0.07   0.39 ± 0.04      0.61 ± 0.04   0.73 ± 0.36  1.12 ± 1.07
Table 1: Quantitative face swapping results on FaceForensics++ [25] face images.

From the indicators listed in Table 1, DeepFaceLab is more adept at retaining pose and expression than the baselines. Moreover, thanks to the super-resolution step in Conversion, DFL often produces faces with vivid eyes and sharp teeth, but this cannot be reflected clearly in SSIM-like scores because these regions occupy only a small part of the whole face.

5.4 Ablation study

To compare the visual effects of different model choices, GAN settings, etc., we perform several ablation tests. The ablation study is conducted on three key aspects: network structure, training paradigm and latent space constraint.

Aside from DF and LIAE, we enhance them into DFHD and LIAEHD by adding more feature extraction layers and residual blocks compared to the original versions, which enriches the set of model structures for comparison. The qualitative results of different model structures are shown in Figure 9, and the qualitative results of different training paradigms are depicted in Figure 10.

Quantitative ablation results are reported in Table 2; the training settings are almost the same as in Section 5.2 except for the model structure.

The verification results in Table 2 show that source identities are preserved across networks with the same structure. When more shortcut connections are added to the model (i.e., DF to DFHD, LIAE to LIAEHD), the landmarks and pose scores decrease even without …, and the generated results have a better chance of escaping the influence of the source face. In addition, we found that … effectively relieves the instability of …, through which a more photorealistic result is achieved without much degradation. Besides, SSIM progressively increases for networks with more shortcut connections, and … also benefits it to varying degrees.

Figure 9: Ablation experiments with different model structures (with … and …). (Here we show training previews instead of converted faces, which allows a fair comparison of DFL's model architectures while avoiding the impact of post-processing from Conversion.)
Figure 10: Ablation experiments with different training paradigms: non-GAN-based and GAN-based. (The image on the left is the original face; the reconstruction produced by the model trained without GAN is shown next to it; the image on the far right is produced by the model trained with GAN.) It can be seen clearly that GAN training makes the model more sensitive to sharp details, e.g., wrinkles and moles, while significantly reducing blurriness compared to the model trained without GAN.
Method       SSIM          verification  landmarks    pose
DF           0.73 ± 0.07   0.61 ± 0.04   0.73 ± 0.36  1.12 ± 1.07
DFHD         0.75 ± 0.09   0.61 ± 0.04   0.71 ± 0.37  1.06 ± 0.97
DFHD ()      0.72 ± 0.11   0.61 ± 0.04   0.79 ± 0.40  1.33 ± 1.21
DFHD ()      0.77 ± 0.06   0.61 ± 0.04   0.70 ± 0.35  0.99 ± 1.02
LIAE         0.76 ± 0.06   0.58 ± 0.03   0.66 ± 0.32  0.91 ± 0.86
LIAEHD       0.78 ± 0.06   0.58 ± 0.03   0.65 ± 0.32  0.90 ± 0.88
LIAEHD ()    0.79 ± 0.05   0.58 ± 0.03   0.69 ± 0.34  1.00 ± 0.97
LIAEHD ()    0.80 ± 0.04   0.58 ± 0.03   0.65 ± 0.33  0.83 ± 0.81
Table 2: Quantitative ablation results on FaceForensics++ [25] face images.

6 Conclusions

The rapidly evolving DeepFaceLab has become a popular face swapping tool in the deep learning practitioner community by freeing people from laborious, complicated data processing and from trivial, detailed work in the training and conversion stages. While continuing to keep up with the latest trends and advances in computer vision, we plan to keep improving the speed and scalability of DeepFaceLab. As some distinguished researchers in this area have argued, "Suppressing the publication of such methods would not stop their development, but rather make them only available to a limited number of experts and potentially blindside policy makers if it goes without any limits." As a leading, widely recognized, open-source face swapping tool, we feel responsible for formally presenting DeepFaceLab to the academic community.

References

  • [1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. (2016) Tensorflow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283. Cited by: §2.
  • [2] A. Bulat and G. Tzimiropoulos (2017) How far are we from solving the 2d & 3d face alignment problem?(and a dataset of 230,000 3d facial landmarks). In Proceedings of the IEEE International Conference on Computer Vision, pp. 1021–1030. Cited by: §3.1, §5.3.
  • [3] F. Chollet et al. (2015) Keras. Note: https://keras.io Cited by: §2.
  • [4] Deepfakes (2017) Deepfakes. Note: https://github.com/deepfakes/faceswap Cited by: §1, §5.2, §5.3.
  • [5] J. Deng, J. Guo, Y. Zhou, J. Yu, I. Kotsia, and S. Zafeiriou (2019) Retinaface: single-stage dense face localisation in the wild. arXiv preprint arXiv:1905.00641. Cited by: §3.1.
  • [6] Y. Feng, F. Wu, X. Shao, Y. Wang, and X. Zhou (2018) Joint 3d face reconstruction and dense alignment with position map regression network. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 534–551. Cited by: §3.1.
  • [7] Y. Feng, F. Wu, X. Shao, Y. Wang, and X. Zhou (2018) Joint 3d face reconstruction and dense alignment with position map regression network. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 534–551. Cited by: §3.2.
  • [8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §1.
  • [9] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville (2017) Improved training of wasserstein gans. In Advances in neural information processing systems, pp. 5767–5777. Cited by: §2.
  • [10] V. Iglovikov and A. Shvets (2018) Ternausnet: u-net with vgg11 encoder pre-trained on imagenet for image segmentation. arXiv preprint arXiv:1801.05746. Cited by: §3.1.
  • [11] J. Johnson, A. Alahi, and L. Fei-Fei (2016) Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision, pp. 694–711. Cited by: §5.3.
  • [12] A. Karnewar, O. Wang, and R. S. Iyengar (2019) MSG-gan: multi-scale gradient gan for stable image synthesis. CoRR abs/1903.06048. Cited by: §3.2.
  • [13] T. Karras, T. Aila, S. Laine, and J. Lehtinen (2017) Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196. Cited by: §2.
  • [14] T. Karras, S. Laine, and T. Aila (2019) A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4401–4410. Cited by: §1.
  • [15] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila (2019) Analyzing and improving the image quality of stylegan. arXiv preprint arXiv:1912.04958. Cited by: §1.
  • [16] D. E. King (2009) Dlib-ml: a machine learning toolkit. Journal of Machine Learning Research 10 (Jul), pp. 1755–1758. Cited by: §5.3.
  • [17] L. Li, J. Bao, H. Yang, D. Chen, and F. Wen (2019) FaceShifter: towards high fidelity and occlusion aware face swapping. arXiv preprint arXiv:1912.13457. Cited by: §1.
  • [18] A. Loza, L. Mihaylova, N. Canagarajah, and D. Bull (2006) Structural similarity-based object tracking in video sequences. In 2006 9th International Conference on Information Fusion, pp. 1–6. Cited by: §3.2.
  • [19] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. Paul Smolley (2017) Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2794–2802. Cited by: §2.
  • [20] Y. Nirkin, Y. Keller, and T. Hassner (2019) Fsgan: subject agnostic face swapping and reenactment. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7184–7193. Cited by: §1.
  • [21] Y. Nirkin, I. Masi, A. T. Tran, T. Hassner, and G. Medioni (2018) On face segmentation, face swapping, and face perception. In IEEE Conference on Automatic Face and Gesture Recognition, Cited by: §1, §5.2, §5.3.
  • [22] P. Pérez, M. Gangnet, and A. Blake (2003) Poisson image editing. In ACM SIGGRAPH 2003 Papers, pp. 313–318. Cited by: §3.3.
  • [23] F. Pitié, A. C. Kokaram, and R. Dahyot (2007) Automated colour grading using colour distribution transfer. Computer Vision and Image Understanding 107 (1-2), pp. 123–137. Cited by: §3.3.
  • [24] E. Reinhard, M. Adhikhmin, B. Gooch, and P. Shirley (2001) Color transfer between images. IEEE Computer graphics and applications 21 (5), pp. 34–41. Cited by: §3.3.
  • [25] A. Rossler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner (2019) Faceforensics++: learning to detect manipulated facial images. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1–11. Cited by: §5.2, Table 1, Table 2.
  • [26] S. Suwajanakorn, S. M. Seitz, and I. Kemelmacher-Shlizerman (2017) Synthesizing obama: learning lip sync from audio. ACM Transactions on Graphics (TOG) 36 (4), pp. 1–13. Cited by: §1, §1.
  • [27] J. Thies, M. Zollhofer, M. Stamminger, C. Theobalt, and M. Nießner (2016) Face2face: real-time face capture and reenactment of rgb videos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2387–2395. Cited by: §5.2.
  • [28] S. Umeyama (1991) Least-squares estimation of transformation parameters between two point patterns. IEEE Transactions on Pattern Analysis & Machine Intelligence, pp. 376–380. Cited by: §3.1, §3.1, §3.3.
  • [29] Vertex.ai (2017) PlaidML: open source deep learning for every platform. Note: https://www.intel.com/content/www/us/en/artificial-intelligence/plaidml.html Cited by: §2.
  • [30] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13 (4), pp. 600–612. Cited by: §3.2.
  • [31] Y. Xiangli, Y. Deng, B. Dai, C. C. Loy, and D. Lin (2020) Real or not real, that is the question. arXiv preprint arXiv:2002.05512. Cited by: §3.2.
  • [32] T. Yang, Y. Chen, Y. Lin, and Y. Chuang (2019) FSA-net: learning fine-grained structure aggregation for head pose estimation from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1087–1096. Cited by: §5.3.
  • [33] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao (2016) Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters 23 (10), pp. 1499–1503. Cited by: §3.1.
  • [34] S. Zhang, X. Zhu, Z. Lei, H. Shi, X. Wang, and S. Z. Li (2017) S3fd: single shot scale-invariant face detector. In Proceedings of the IEEE International Conference on Computer Vision, pp. 192–201. Cited by: §3.1.

7 Appendix: Dissecting the detailed structure of DF

Figure 11: A detailed overview of DF in DeepFaceLab. The modules (Encoder, Inter and Decoder) of DF are exactly the same as those of LIAE, which means that InterAB and InterB of LIAE share the same structure and settings.

The layout and every submodule of DF are depicted in Fig 11. It is fairly easy to see that the difference between the original DF and the enhanced DFHD lies in DFHD having more feature extraction layers with varied stacking orders. Three typical traits of the structure are:

  • We use pixel shuffle (depth-to-space) for upsampling instead of transposed convolution or bilinear upsampling followed by convolution, which aims to eliminate checkerboard artifacts; a small sketch follows this list.

  • Identity shortcut connections, derived from ResNet, are frequently used in composing the Decoder modules. A model with more shortcut connections has many independent effective paths at the same time, which gives it ensemble-like behaviour.

  • We normalize the images to the range 0 to 1 rather than -1 to 1, and use Sigmoid rather than Tanh as the last layer of the Decoder output.
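To illustrate the first trait, the block below upsamples a feature map with a convolution followed by depth-to-space (pixel shuffle); it is a minimal sketch in plain TensorFlow with our own function name, not DFL's actual layer definition.

```python
import tensorflow as tf

def upscale_block(x, out_ch, name):
    """Upsample x by 2x: a 3x3 convolution to 4*out_ch channels, then pixel shuffle."""
    y = tf.keras.layers.Conv2D(out_ch * 4, 3, padding="same", name=name + "_conv")(x)
    y = tf.nn.leaky_relu(y, alpha=0.1)
    # (H, W, 4*C) -> (2H, 2W, C); avoids the checkerboard artifacts of transposed convolution
    return tf.nn.depth_to_space(y, block_size=2)
```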