1 Introduction
Face swapping is the task of replacing the person-identity of a given face image with that of another, while preserving other attributes such as facial expressions, head poses, and backgrounds. The task has drawn attention for its wide range of real-world applications, such as anonymization for privacy protection and the creation of new characters in the entertainment industry. With the progress made over the years [baoOpenSetIdentityPreserving2018b, natsumeRSGANFaceSwapping2018, nirkinFSGANSubjectAgnostic2019, liFaceShifterHighFidelity2019, chenSimSwapEfficientFramework2020, thiesDeferredNeuralRendering2019, zhuOneShotFace2021, wangHifiFace3DShape2021], state-of-the-art face-swapping models can generate a swapped image of decent quality using a single shot of a new source identity.
Despite the performance improvements, existing models usually adopt complex architectures and numerous loss functions to change the face shape. Changing the face shape is crucial for a faithful identity swap, but it is a nontrivial task; it incurs a dramatic change of pixels, yet no direct guidance can be given to the model because ground-truth swapped images are inherently unavailable. Thus, previous studies have relied on handcrafted components such as mask-based mixing [liFaceShifterHighFidelity2019, wangHifiFace3DShape2021] and 3D face-shape modeling [wangHifiFace3DShape2021]. While such components are effective for changing the face shape and improving the quality of the swapped images, they add hyperparameters and loss functions that require careful tuning for successful training.
In this study, we postulate that approaches based on handcrafted components are not the best way to resolve the difficulty of face swapping. Instead, we propose a new identity embedding model with improved smoothness, which we believe lies at the heart of the problem. An identity embedding model, or an embedder, plays a key role during the training of the swapping model: it provides the generator with gradient information indicating in which direction to move to change the identity. If the embedder has a non-smooth space, these gradients can be erroneous or noisy. In our proposed model, Smooth-Swap, we adopt a new identity embedder trained with supervised contrastive learning [khoslaSupervisedContrastiveLearning2020]. We find that it has a smoother space than the ArcFace embedder [dengArcFaceAdditiveAngular2019] used in most of the existing models and that it enables faster and more stable training.
Thanks to the smooth embedder, Smooth-Swap works without any handcrafted components. It adopts a simple U-Net [ronnebergerUNetConvolutionalNetworks2015]-based generator, which we train using only three basic losses (identity change, target preservation, and adversarial; see Fig. 2). While this setup is simpler than those of existing models, we find that our model still achieves comparable or superior performance by taking a data-driven approach and minimizing inductive bias. We summarize the advantages of Smooth-Swap as follows:
-
Simple architecture: Smooth-Swap uses a simple U-Net [ronnebergerUNetConvolutionalNetworks2015]-based generator, which does not involve the handcrafted components found in existing models.
-
Simple loss functions: The Smooth-Swap generator can be trained using minimal loss functions for face-swapping—identity, pixel-level change, and adversarial loss.
-
Fast training: The smooth identity embedder allows faster training of the generator by providing more stable gradient information.

Figure caption (partial): Such designs induce complex architectures and various loss functions, making training difficult to balance. In contrast, our architecture is very simple, has no task-related heuristics, and is trained with only three typical losses.
2 Related Work
Approaches based on 3D Models and Segmentation
Earlier face-swapping models rely on external modules such as 3D Morphable Models (3DMM) [blanzMorphableModelSynthesis1999] and a face segmentation model. Face2Face [thiesFace2FaceRealtimeFace] and [nirkinFaceSegmentationFace2018] fit the source and the target images to a 3DMM and transfer the expression (and posture) parameters to synthesize the swapped image. RSGAN [natsumeRSGANFaceSwapping2018], FSNet [natsumeFSNetIdentityAwareGenerative2018], and FSGAN [nirkinFSGANSubjectAgnostic2019] use a segmentation model to separate the facial region from the background, generating the swapped image by switching and blending the regions. Despite their early success, these approaches do not produce high-quality images since their performance depends on the non-trainable external modules.
Feature-based GAN models
In contrast with the approaches above, recent models consider end-to-end training, generating a face-swapped image based on learned features. IPGAN [baoOpenSetIdentityPreserving2018b] learns separate embedding vectors for the identity and the target attributes, switching and recombining them to generate a swapped image. FaceShifter [liFaceShifterHighFidelity2019] considers multi-level mixing in an encoder-decoder architecture, which alleviates the information loss of IPGAN. SimSwap [chenSimSwapEfficientFramework2020] proposes weak feature matching to focus more on preserving the facial expression of the target, whereas HifiFace [wangHifiFace3DShape2021] integrates a 3D face-shape model to focus more on active shape change. MegaFS [zhuOneShotFace2021] utilizes a pretrained StyleGAN2 [karrasAnalyzingImprovingImage2020] to generate high-resolution face-swapped images. Although these models have continuously improved face-swapping performance, they tend to show weakness in identity change or involve complexity due to handcrafted components.
3 Problem Formulation & Challenges
We first describe the problem formulation and the main technical challenges of face swapping. Then, we introduce how the smoothness of an identity embedder can alleviate them.
3.1 Problem Formulation
When a source image $x_{\text{src}}$ and a target image $x_{\text{tgt}}$ are given, a face-swapping model needs to generate the swapped image $x_{\text{swap}}$, which satisfies the following conditions:
- C1. It has the identity of the source image $x_{\text{src}}$.
- C2. Other than the identity, it looks the same as the target image $x_{\text{tgt}}$ (having the same background, pose, etc.).
- C3. It looks realistic (indistinguishable from real images).
To meet these requirements, most face-swapping models [liFaceShifterHighFidelity2019, wangHifiFace3DShape2021] consist of three components: an identity embedder for the source image, a generator for the swapped image, and a discriminator to improve fidelity. Fig. 2 shows an overview of these face-swapping models, including our approach. Note that the identity embedder is pre-trained and frozen during the training of the other components, which is indicated by the asterisk in its superscript.
3.2 Challenges for Changing Identity

The main difficulty in training a face-swapping model comes from the conflict between C1 and C2. Satisfying C1 pushes $x_{\text{swap}}$ away from $x_{\text{tgt}}$ to change the identity, whereas satisfying C2 forces it to stay close. If we could accurately extract the identity-irrelevant differences of $x_{\text{swap}}$ from $x_{\text{tgt}}$ and penalize only those in the loss for C2, this conflict would be relaxed. Unfortunately, designing such a loss function is difficult, so a perceptual [zhangUnreasonableEffectivenessDeep2018] or pixel-level loss is employed instead, which keeps the swapped image from deviating too far from $x_{\text{tgt}}$.
As a result, the face shape of $x_{\text{swap}}$ is typically not the same as that of $x_{\text{src}}$, since a shape-wise change, such as turning a round chin into a sharp one, involves a geometric transformation and hence many changes in features and pixels. Previous work put much effort into changing the face shape correctly, because it is an important attribute for identifying a person. In particular, the recently proposed HifiFace [wangHifiFace3DShape2021] model uses a complex 3D model to better capture the shape. Although the 3D face model helps make shape-wise changes more accurate, it introduces additional, complicated components for training the generator. In contrast, we hypothesize that the conflict can be relaxed not by adding new components but by introducing smoothness to the identity embedder. We describe the details in the following section.
3.3 Importance of A Smooth Identity Embedder
Most of the previous face-swapping models use ArcFace [dengArcFaceAdditiveAngular2019] as an identity embedder (embedder for short), since it is one of the state-of-the-art face recognition models. By feeding images into the embedder and comparing the features from its last layer (called embedding vectors), one obtains a decent similarity metric for the person-identities of face images. However, with ArcFace or any other face recognition model, we typically deal with a highly non-smooth embedding space because these models are trained only on a discriminative task.
The smoothness of the embedder, however, is crucial during the training of a face-swapping model. When the model generates $x_{\text{swap}}$ with a wrong identity in the middle of training, the embedder has to provide a good gradient direction to correct it. This gradient has to be accurate and consistent; otherwise $x_{\text{swap}}$ easily falls back to $x_{\text{tgt}}$ due to the loss for C2. If the embedding space is non-smooth, the gradient direction can be erroneous or noisy, since gradients are only well-defined in a continuous space.
4 Method: Smooth-Swap
We now describe our main model, Smooth-Swap. The model introduces a new identity embedder, trained with supervised contrastive learning [khoslaSupervisedContrastiveLearning2020] to improve the smoothness of the embedding space. It also introduces a simple U-Net-style generator architecture, which is well suited to the new identity embedder.
Notations
Our identity embedder $E$ takes an image and outputs the corresponding embedding vector (e.g., $z_{\text{src}} = E(x_{\text{src}})$). The generator $G$ takes a target image $x_{\text{tgt}}$ and a source embedding $z_{\text{src}}$, and produces the swapped image $x_{\text{swap}} = G(x_{\text{tgt}}, z_{\text{src}})$. The discriminator $D$ takes an image and outputs a scalar in $[0, 1]$ (close to 0 for fake and 1 for real images).
4.1 Smooth Identity Embedder
As discussed in Sec. 3.3, we desire a smooth embedder for stable and effective training. To train such an embedder, we adopt a supervised contrastive learning loss [khoslaSupervisedContrastiveLearning2020]:
$$\mathcal{L}_{\text{emb}} = -\,\mathbb{E}_{x}\!\left[\frac{1}{|P(x)|}\sum_{x^{+}\in P(x)}\log\frac{\exp\!\big(E(x)^{\top}E(x^{+})/\tau\big)}{\sum_{x^{\prime}\in P(x)\cup N(x)}\exp\!\big(E(x)^{\top}E(x^{\prime})/\tau\big)}\right],$$
where $x$ denotes a sample (anchor) from the training dataset; $P(x)$ and $N(x)$ denote the sets of positive samples $x^{+}$ (images having the same identity as $x$) and negative samples $x^{-}$ (having a different identity), respectively; and $\tau$ is a temperature.
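As a concrete reference, the following is a minimal PyTorch sketch of a supervised contrastive loss over a batch of identity embeddings; the function name, the batch layout, and the default temperature are illustrative choices and not our exact training code.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(embeddings, labels, temperature=0.07):
    """Supervised contrastive loss (Khosla et al., 2020), sketch version.

    embeddings: (N, D) tensor of identity embeddings.
    labels:     (N,) tensor of integer identity labels.
    """
    z = F.normalize(embeddings, dim=1)                  # ensure unit-length embeddings
    sim = z @ z.t() / temperature                       # pairwise scaled cosine similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))     # exclude each anchor from its own denominator

    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)

    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    num_pos = pos_mask.sum(dim=1).clamp(min=1)          # guard against anchors with no positives
    loss = -(log_prob.masked_fill(~pos_mask, 0.0)).sum(dim=1) / num_pos
    return loss.mean()
```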
An important property of contrastive learning is that it makes the embedding vectors retain maximal information [wangUnderstandingContrastiveRepresentation2020], and this is closely related to our need for a smooth embedder. If we have face images of the same identity but at a different age or with a different face shape (e.g., after a diet), discriminative embedders like ArcFace [dengArcFaceAdditiveAngular2019] aggressively remove this information to align the embedding vectors. While this is beneficial for classifying identities, it incurs a non-smooth embedding space: when changing the identity from elderly to young or from a round face shape to a sharp one in this space, the embedding vectors cannot change smoothly because such information has been removed. For our purpose, embeddings with richer information are more desirable, even if their alignment is compromised, and these can be obtained with contrastive learning. Changing from one identity to another then follows a smooth path, and a good gradient direction can be obtained for training the swapping model (see Fig. 3).
4.2 Generator Architecture
Our generator architecture is adapted from the noise-conditional score network (NCSN++), one of the state-of-the-art architectures in score-based generative modeling [songScoreBasedGenerativeModeling2020] (Fig. 2). While the original use of NCSN++ is far from face-swapping, we find that its U-Net nature [ronnebergerUNetConvolutionalNetworks2015] and conditioning structure are useful for our task. We modify two parts of NCSN++: the time embedding is replaced with the identity embedding, and a direct skip connection from the input to the output is added.
Details on Structure
NCSN++ is basically a U-Net [ronnebergerUNetConvolutionalNetworks2015] with a conditioning structure and modern layer designs such as residual and attention blocks. Its original goal is to take a noisy image and output a score vector with the same dimensionality as the image. Since it has to produce outputs conditioned on varying noise levels controlled by time, it also takes a time-embedding vector, which is added to each residual block after being broadcast over the width and height dimensions. In our design, we replace this embedding vector with the identity embedding, as illustrated in Fig. 2. Also, since the score vector is closer to a difference between images than to an image itself, we add the input image when producing the final output, instead of passing the output directly (i.e., an input-to-output skip connection).
Note that our architecture does not include any task-specific design components, such as a 3D face model or the mask-based mixing of previous work. It is universal and, by design, largely compatible with score-based modeling.
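As a rough illustration of the two modifications above, the wrapper below shows how an NCSN++-style backbone could be reused for face-swapping; the `backbone` module and its `(image, embedding)` interface are hypothetical stand-ins for the actual implementation.

```python
import torch
import torch.nn as nn

class SwapGenerator(nn.Module):
    """U-Net (NCSN++-style) generator sketch: the identity embedding replaces the
    time embedding, and an input-to-output skip connection is added."""

    def __init__(self, backbone: nn.Module):
        super().__init__()
        # `backbone` is assumed to map (image, conditioning vector) -> image-shaped tensor,
        # just as NCSN++ maps (noisy image, time embedding) -> score.
        self.backbone = backbone

    def forward(self, x_tgt: torch.Tensor, z_src: torch.Tensor) -> torch.Tensor:
        # Every residual block is conditioned on the (frozen) identity embedding z_src
        # instead of a time embedding.
        delta = self.backbone(x_tgt, z_src)
        # The backbone output is treated as an image-difference (score-like) term,
        # so the target image is added back via the input-to-output skip.
        return x_tgt + delta
```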
Loss Functions
To train the generator, we use the three most basic loss functions, each corresponding to one of the conditions for $x_{\text{swap}}$ described at the beginning of Sec. 3:
$$\mathcal{L}_{\text{id}} = 1 - \cos\!\big(E(x_{\text{swap}}),\, E(x_{\text{src}})\big), \qquad \mathcal{L}_{\text{chg}} = \frac{1}{\dim(x_{\text{tgt}})}\,\big\lVert x_{\text{swap}} - x_{\text{tgt}} \big\rVert_{1}, \qquad \mathcal{L}_{\text{adv}} = -\log D(x_{\text{swap}}).$$
The total loss is computed by combining these functions and taking the expectation over $(x_{\text{src}}, x_{\text{tgt}})$ pairs:
$$\mathcal{L}_{\text{total}} = \mathbb{E}_{(x_{\text{src}},\, x_{\text{tgt}})}\big[\lambda_{\text{id}}\,\mathcal{L}_{\text{id}} + \lambda_{\text{chg}}\,\mathcal{L}_{\text{chg}} + \lambda_{\text{adv}}\,\mathcal{L}_{\text{adv}}\big].$$
Note that $\cos(\cdot,\cdot)$ stands for cosine similarity and $\dim(x_{\text{tgt}})$ stands for the number of dimensions of $x_{\text{tgt}}$; the discriminator $D$ is trained with the original loss from [goodfellowGenerativeAdversarialNetworks2014] and the R1 regularizer [meschederWhichTrainingMethods2018]. The loss functions are generally the same as in [liFaceShifterHighFidelity2019], except that we use a simpler pixel-level change loss instead of the feature-level loss (denoted as the attribute loss in that paper). For each minibatch, we include one $(x_{\text{src}}, x_{\text{tgt}})$ pair with $x_{\text{src}} = x_{\text{tgt}}$, whose change loss effectively acts as a reconstruction loss.
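For illustration, a minimal PyTorch sketch of the three generator losses is given below; the L1 form of the change loss, the softplus-based non-saturating adversarial term, and all function names are our own assumptions rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def generator_losses(x_swap, x_src, x_tgt, embedder, discriminator,
                     lambda_id=1.0, lambda_chg=1.0, lambda_adv=1.0):
    """Identity / change / adversarial losses for the swapped image (sketch).

    The loss weights default to 1.0 here; the actual values are set separately.
    """
    # Identity loss: push the swapped image's embedding toward the source embedding.
    z_swap = embedder(x_swap)
    z_src = embedder(x_src)
    loss_id = 1.0 - F.cosine_similarity(z_swap, z_src, dim=1).mean()

    # Pixel-level change loss: keep the swapped image close to the target
    # (an L1 distance averaged over all pixel dimensions).
    loss_chg = (x_swap - x_tgt).abs().mean()

    # Non-saturating adversarial loss for the generator: -log(sigmoid(logits)).
    logits_fake = discriminator(x_swap)
    loss_adv = F.softplus(-logits_fake).mean()

    total = lambda_id * loss_id + lambda_chg * loss_chg + lambda_adv * loss_adv
    return total, {"id": loss_id, "chg": loss_chg, "adv": loss_adv}
```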
5 Experiments
Model | VGG↓ | VGG-R↓ | Arc↑ | Arc-R↑ | Shp↓ | Shp-R↓ | Expr↓ | Expr-R↓ | Pose↓ | Pose-R↓ | PoseHN↓ |
---|---|---|---|---|---|---|---|---|---|---|---|
Deepfakes | 120.907 | 0.493 | 0.443 | 0.524 | 0.639 | 0.464 | 0.802 | 0.541 | 0.188 | 0.445 | 4.588 |
FaceShifter | 110.875 | 0.482 | 0.658 | 0.492 | 0.653 | 0.456 | 0.177 | 0.381 | 3.175 | ||
SimSwap | 99.736 | 0.435 | 0.662 | 0.479 | 0.644 | 0.449 | 0.178 | 0.385 | 3.749 | ||
HifiFace | 106.655 | 0.469 | 0.527 | 0.550 | 0.616 | 0.465 | 0.702 | 0.484 | 0.177 | 0.387 | 3.370 |
MegaFS | 110.897 | 0.461 | 0.701 | 0.500 | 0.678 | 0.436 | 0.182 | 0.398 | 5.456 | ||
Smooth-Swap | 101.678 | 0.435 | 0.464 | 0.611 | 0.565 | 0.403 | 0.722 | 0.477 | 0.186 | 0.395 | 4.498 |
107.096 | 0.446 | 0.421 | 0.581 | 0.610 | 0.415 | 0.669 | 0.461 | 0.185 | 0.398 | 4.636 | |
(Arc) | 103.767 | 0.437 | 0.682 | 0.460 | 0.728 | 0.493 | 0.192 | 0.416 | 5.457 | ||
(Arc) | 98.115 | 0.421 | 0.684 | 0.441 | 0.914 | 0.543 | 0.207 | 0.430 | 5.655 |
5.1 Training Details
Datasets
For training the generator, we use the FFHQ dataset [karrasStyleBasedGeneratorArchitecture2018], which contains 70k aligned face images; we hold out 10% of the images for testing. For training the identity embedder, we use the VGGFace2 dataset [caoVGGFace2DatasetRecognising2018], which contains 3.3M identity-labeled images of 9k subjects. We crop and align the VGGFace2 images using the same procedure as FFHQ. All images, including FFHQ, are resized to 256×256.
Architecture Details
Our identity embedder is based on ResNet50 [heDeepResidualLearning2016] architecture. The final, average-pooled feature vector is passed through two fully-connected layers and normalized to unit length. The generator architecture is mostly the same as NCSN++ [songScoreBasedGenerativeModeling2020], except we use half as many channels. The discriminator is set to the same as StyleGAN2 [karrasAnalyzingImprovingImage2020]. The detailed structure of the networks is included in the appendix.
Training
We set the loss weights $\lambda_{\text{id}}$, $\lambda_{\text{chg}}$, and $\lambda_{\text{adv}}$ for training. The discriminator is trained with the non-saturating loss [goodfellowGenerativeAdversarialNetworks2014] along with the R1 regularizer [meschederWhichTrainingMethods2018] to prevent overfitting. We train the generator and the discriminator with the Adam optimizer [kingmaAdamMethodStochastic2017], using learning rates of 0.001 (generator) and 0.004 (discriminator). The generator is trained for 800k steps. The batch size is eight, where one pair in the batch is set to $(x_{\text{src}}, x_{\text{tgt}})$ with $x_{\text{src}} = x_{\text{tgt}}$ to cover the self-reconstruction case (see Sec. 4.2). We also use Adam for training the embedder, with a learning rate of 0.001 that is decreased by a factor of 10 at 60%, 75%, and 90% of the total 101k steps. The batch size is 128 (32 identities, four instances each), and the temperature is 0.07, as suggested in [khoslaSupervisedContrastiveLearning2020].
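A minimal sketch of the embedder's optimization schedule described above (Adam, learning rate 0.001, decayed by a factor of 10 at 60%, 75%, and 90% of the 101k steps) could look as follows; it reuses the `supervised_contrastive_loss` sketch from Sec. 4.1, and the `loader` interface and milestone arithmetic are illustrative assumptions.

```python
import torch

def train_embedder(embedder, loader, total_steps=101_000):
    """Train the identity embedder with supervised contrastive learning (sketch).

    `loader` is assumed to yield (images, identity_labels) batches of
    32 identities x 4 instances (batch size 128).
    """
    optimizer = torch.optim.Adam(embedder.parameters(), lr=1e-3)
    # Learning rate decays by 10x at 60%, 75%, and 90% of the total steps.
    milestones = [int(total_steps * p) for p in (0.60, 0.75, 0.90)]
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones, gamma=0.1)

    for step, (images, labels) in zip(range(total_steps), loader):
        loss = supervised_contrastive_loss(embedder(images), labels, temperature=0.07)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()
```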
5.2 Evaluation Details
Compared Models
We compare our Smooth-Swap models with the latest feature-based face-swapping models: FaceShifter [liFaceShifterHighFidelity2019], MegaFS [zhuOneShotFace2021], HifiFace [wangHifiFace3DShape2021], SimSwap [chenSimSwapEfficientFramework2020], and Neural Textures [thiesDeferredNeuralRendering2019]. We also compare two of the earliest models: Deepfakes [deepfakesDeepFakesHttpsGithub2021] and Faceswap [marekFaceSwapHttpsGithub2021].
Quantitative Evaluations
Since most of the compared models do not release their source code, the current standard for evaluating them is to compare their released generated images (available at https://github.com/ondyari/FaceForensics; some are on the project page of each model) on the FaceForensics++ (FF++) dataset [rosslerFaceForensicsLearningDetect2019], and we follow this protocol.
We evaluate various metrics that can be grouped as follows: identity, shape, expression, and pose. We want $x_{\text{swap}}$ to be close to $x_{\text{src}}$ for the first two, whereas we want it to be close to $x_{\text{tgt}}$ for the other two. To evaluate identity, we use the VGGFace2 [caoVGGFace2DatasetRecognising2018] and ArcFace [dengArcFaceAdditiveAngular2019] embedders and compute the embedding distance and cosine similarity between $x_{\text{swap}}$ and $x_{\text{src}}$. Compared with the retrieval accuracy used in [wangHifiFace3DShape2021, liFaceShifterHighFidelity2019, chenSimSwapEfficientFramework2020], which classifies among fixed candidates, this metric allows a more fine-grained comparison. To evaluate shape, expression, and pose, we follow the evaluation protocol of [wangHifiFace3DShape2021]; i.e., we use the 3D face model of [sanyalLearningRegress3D2019] to obtain the parameters of each attribute and compute the L2 distances.
When applicable, we also compute relative distances and similarities (denoted by '-R'). For example, the relative VGGFace2 embedding distance normalizes the distance between $x_{\text{swap}}$ and $x_{\text{src}}$ against the distance between $x_{\text{swap}}$ and $x_{\text{tgt}}$ (for pose and expression, the roles of $x_{\text{src}}$ and $x_{\text{tgt}}$ in the numerator are swapped). This is to reflect how humans perceive the changes: to our eyes, what matters is not only that the identity of $x_{\text{swap}}$ is close to $x_{\text{src}}$ but also that it is far from $x_{\text{tgt}}$.
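Purely for illustration, the snippet below computes one plausible form of such a relative embedding distance, namely the distance to the source normalized by the sum of the distances to the source and the target; the exact normalization behind the reported '-R' numbers may differ.

```python
import numpy as np

def relative_distance(z_swap, z_src, z_tgt):
    """One plausible relative embedding distance (illustrative assumption).

    A low value means the swapped image is much closer to the source
    than to the target in the embedding space.
    """
    d_src = np.linalg.norm(z_swap - z_src)   # distance to the source identity
    d_tgt = np.linalg.norm(z_swap - z_tgt)   # distance to the target identity
    return d_src / (d_src + d_tgt + 1e-12)
```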

5.3 Basic Face-Swapping Performance
We first conduct face-swapping on the FaceForensics++ dataset and compare the results with those of the other models, as shown in Fig. 4. The figure shows that our model is more aggressive in changing the identity, especially the face shape. For example, in the second and fourth rows, our swapped images show rounder and larger chin shapes reflecting the characteristics of the source identity (more extreme cases can be found in Fig. 5), while the images from the other models are mostly confined to textural changes. We can also observe that other identity-related attributes, such as skin tone or hair color, are matched more closely to the source in our results, making the overall appearance visually closer to the source. Fig. 5 and 6 show the results of Smooth-Swap on FFHQ and on face images in the wild. More samples can be found in the appendix.
The same trend can be seen in the quantitative results summarized in Table 1. In the table, Smooth-Swap shows good identity scores (VGG, Arc, and Shp). While our model is not as good in the other scores, it shows comparable numbers (it is never the worst among the compared models). We note that the model could recover the expression scores to some degree if $\lambda_{\text{id}}$ were reduced to one; however, our focus here is more on the identity change.


5.4 Ablation Study on the Identity Embedder
To see how our identity embedder makes a difference, we also train our models using ArcFace [dengArcFaceAdditiveAngular2019]. As seen in the lower part of Table 1, the models using ArcFace perform worse on most of the metrics.
More importantly, we observe that our embedder enables faster and more stable training. In Fig. 7, the left graph shows that the identity loss of our model converges faster than that of the model using ArcFace. Note that this is not due to the loss scales or the choice of $\lambda_{\text{id}}$: the Arc16 variant, which has a similar rate of identity-loss decrease, shows a significantly worse curve for the change loss.
The same phenomenon can be seen in Fig. 8. When paired with the ArcFace embedder, the models train slowly, rarely changing the identity until they reach 400k training steps. In contrast, the models with our embedder begin to change the identity as early as 100k steps.



Figure 9: Inspection of the smoothness of embedders via interpolation. For two randomly picked images from the FFHQ test split (the leftmost and the rightmost), we compute interpolations in the embedding space. For each of the nine interpolating points, we retrieve the closest images from the train split. Our embedder tends to show continuously changing identities, whereas the others show repeating identities, implying non-smoothness of their spaces. The graph on the right shows that our embedder distributes the identities more uniformly. The distances are normalized by the average over 4k random pairs for each embedder. See Sec. 5.5.
5.5 Identity Embedding Performance
The advantage expected from our identity embedder is smoothness; in particular, smoothly changing identities along an interpolation curve, as shown in Fig. 3. To evaluate this quantitatively, we devise a smoothness score and compare it across baseline embedders.
The score measures the (normalized) gap between the weighted average of two identity embedding vectors, $\bar{z} = r\,z_{1} + (1-r)\,z_{2}$, and the closest valid embedding to it, $z^{*}$ (here, $r$ is the averaging ratio). If the embedding space is smooth, this gap has to be small.
The notion of a valid embedding depends on the setting. When measured using samples, $z_{1}$ and $z_{2}$ are embeddings of samples from the FFHQ dataset, and $z^{*}$ is the embedding of a sample from the same dataset. When measured using a GAN, the embeddings are computed from images generated by a pretrained StyleGAN2 [karrasAnalyzingImprovingImage2020], i.e., each embedding has the form $E(G_{\text{sg}}(w))$, where $G_{\text{sg}}$ is the StyleGAN2 generator and the $w$'s are latent codes (see appendix for details).
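A minimal sketch of the sample-based smoothness score is given below, assuming the closest valid embedding is found by an L2 nearest-neighbor search over the dataset embeddings; the exact normalization of the reported scores may differ.

```python
import torch

def smoothness_score(z1, z2, dataset_embeddings, r=0.5):
    """Sample-based smoothness score (sketch).

    z1, z2:              (D,) identity embeddings of two images.
    dataset_embeddings:  (N, D) embeddings of the valid set (e.g., FFHQ samples).
    r:                   averaging ratio between the two embeddings.
    """
    z_bar = r * z1 + (1.0 - r) * z2                        # averaged embedding
    gaps = torch.norm(dataset_embeddings - z_bar, dim=1)   # L2 gap to every valid embedding
    # A normalization (e.g., by an average pairwise distance) can be applied afterwards.
    return gaps.min()                                      # gap to the closest valid embedding
```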
As seen in Table 2, our model shows substantially better smoothness while maintaining verification performance comparable to ArcFace and VGGFace2. Note that LFW [LFWTech] is one of the standard benchmark datasets for verification; VCHQ is a dataset we derived from VoxCeleb [nagraniVoxCelebLargescaleSpeaker2018] (see appendix for details).
The same trend is also confirmed qualitatively in Fig. 9. The figure shows the retrieved images for each of the nine interpolating points. Our embedder tends to change smoothly while moving along the interpolation curve, whereas the others tend to stick with the same identities repeatedly. To quantify this, we compute the number of unique images retrieved for each interpolation (lower means more repetition, hence worse). Summarizing the results over 64 sample pairs, the numbers were 5.13±1.18 (Ours), 4.42±1.41 (VGGFace2), and 4.25±1.33 (ArcFace).
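The uniqueness count can be sketched as follows, assuming nine interior interpolation points and L2 nearest-neighbor retrieval in the embedding space; function and variable names are illustrative.

```python
import torch

def unique_retrievals(z_a, z_b, train_embeddings, num_points=9):
    """Count distinct images retrieved along an embedding interpolation (sketch)."""
    alphas = torch.linspace(0.0, 1.0, num_points + 2)[1:-1]      # nine interior points
    retrieved = []
    for a in alphas:
        z = (1.0 - a) * z_a + a * z_b
        idx = torch.norm(train_embeddings - z, dim=1).argmin()   # nearest training image
        retrieved.append(int(idx))
    return len(set(retrieved))                                   # lower = more repetition = less smooth
```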
Model | Smooth. (samples, r=0.25)↓ | Smooth. (samples, r=0.5)↓ | Smooth. (GAN, r=0.5)↓ | Verif. AUC (VCHQ)↑ | Verif. AUC (VGG2)↑ | Verif. AUC (LFW)↑ |
---|---|---|---|---|---|---|
CE-Lin | 0.333 | 0.354 | 0.797 | 0.939 | 0.994 | 1.000 |
CE-Arc | 0.404 | 0.430 | 0.914 | 0.925 | 0.997 | 0.998 |
ArcFace | 0.360 | 0.380 | 0.802 | - | - | 0.995 |
Ours | 0.116 | 0.135 | 0.671 | 0.956 | 0.994 | 0.999 |
6 Conclusion
We have proposed a new face-swapping model, Smooth-Swap, which generates high-quality swapped images with active changes of face shape. While many existing models use handcrafted components to tackle the difficulty of the task, our model keeps the simplest generator architecture and instead relies on a smooth identity embedding. By taking this data-driven approach with minimal inductive bias, we observed that Smooth-Swap achieves superior scores on identity-related metrics. Moreover, by virtue of the smooth embedding model, our model receives stable gradient information, which allows much faster training.
We believe this study can open up opportunities to tackle more challenging face-swapping problems by considerably reducing the complexity. With less effort needed to balance the components and a smaller memory footprint for model parameters, one could consider an expanded problem scope, such as modeling face-swapping on videos in an end-to-end manner. A downside of our current model in that regard is some performance drop in preserving the pose and expression. However, we expect that simple fine-tuning or a different hyperparameter choice would be sufficient to meet this goal.
Potential Negative Societal Impact
Face-swapping models, more commonly known to the public as Deepfake technology, have been used maliciously and caused serious negative impacts, including the spread of fake news. Nonetheless, we believe that studying the face-swapping task is important and necessary, because a deep understanding of this task can serve as a good starting point for developing high-quality Deepfake detection algorithms [rosslerFaceForensicsLearningDetect2019]. In addition, we note that face swapping also has many positive applications, including anonymization for privacy protection and the creation of new characters without CGI techniques in the entertainment industry.
References
Appendix A Architecture Details
A.1 Identity Embedding Model
Our identity embedding model is based on the ResNet-50 [heDeepResidualLearning2015] architecture, where we use a different head for contrastive learning, as seen in Fig. A1. Note that the UnitNorm in the final layer makes the output embedding unit-length. The network involves a total of 32.3M parameters.
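The head on top of the backbone can be sketched as below; torchvision's ResNet-50 is used as a stand-in backbone, and the hidden and output dimensions are illustrative assumptions (the actual dimensions are given in Fig. A1).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

class IdentityEmbedder(nn.Module):
    """ResNet-50 backbone + two FC layers + unit-length normalization (sketch)."""

    def __init__(self, embed_dim=512, hidden_dim=2048):
        super().__init__()
        backbone = resnet50(weights=None)
        # Keep everything up to (and including) the global average pooling layer.
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])
        self.head = nn.Sequential(
            nn.Linear(2048, hidden_dim), nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, embed_dim),
        )

    def forward(self, x):
        feat = self.backbone(x).flatten(1)   # (N, 2048) average-pooled features
        z = self.head(feat)
        return F.normalize(z, dim=1)         # UnitNorm: embeddings lie on the unit hypersphere
```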
A.2 Swap-Image Generator
Our generator architecture is mostly the same as NCSN++ [songScoreBasedGenerativeModeling2020], except for the following three differences (as described in Sec. 4.2 of the main manuscript): 1) we use half as many channels, 2) we use the identity embedding instead of the time embedding, and 3) we add an input-to-output skip connection. Fig. A1 (a) shows the detailed structure with dimensional information. The network involves a total of 9.8M parameters.
Up/Down Sampling & Skip-Connections
Note that in each outer block containing multiple ResBlocks, the first ResBlock handles upsampling or downsampling (except for the ResBlock ×5 block, where the second ResBlock handles upsampling). There are 13 skip connections in total (including the input-to-output skip), where the input to each ResBlock in the encoder part (before the Attention Block) is handed over to the decoder part (after the Attention Block). On the decoder side, the first three ResBlocks of each outer block receive the skip connections (except for the ResBlock ×5 block, where the second through the fourth receive them).
Details on the ResBlocks of the Generator
We describe essential details on the ResBlocks of the generator. One can find the complete information in [songScoreBasedGenerativeModeling2020] or from the attached source code.
The overall structure of our ResBlock is not much different from conventional designs [heDeepResidualLearning2015]. However, as shown in Fig. A1 (b), a structure for conditioning on the identity embedding vector is added (similar to [brockLargeScaleGAN2019]). The conditioning is done by 1) projecting the identity embedding onto a $C$-dimensional vector, 2) spatially broadcasting the result, and 3) adding it to the intermediate output of the original path (here, $C$ is the number of output channels of the current block). When upsampling or downsampling is used, the optional components (denoted by the yellow, dash-dotted outlines) are also computed.
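The conditioning path can be sketched as follows, assuming a standard GroupNorm/SiLU residual block with channel counts divisible by 32; only the project-broadcast-add structure is intended to match the description above.

```python
import torch
import torch.nn as nn

class ConditionedResBlock(nn.Module):
    """Residual block conditioned on the identity embedding (sketch)."""

    def __init__(self, channels, embed_dim):
        super().__init__()
        self.norm1 = nn.GroupNorm(32, channels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.norm2 = nn.GroupNorm(32, channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.act = nn.SiLU()
        # 1) project the identity embedding onto a C-dimensional vector
        self.embed_proj = nn.Linear(embed_dim, channels)

    def forward(self, x, z_src):
        h = self.conv1(self.act(self.norm1(x)))
        # 2) spatially broadcast the projected embedding and
        # 3) add it to the intermediate output of the residual path
        h = h + self.embed_proj(z_src)[:, :, None, None]
        h = self.conv2(self.act(self.norm2(h)))
        return x + h
```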
A.3 Discriminator
We use the same discriminator as StyleGAN2 [karrasAnalyzingImprovingImage2020]. The network involves a total of 28.9M parameters.

Appendix B More Image Samples from Smooth-Swap
We show extended sets of swapped-image samples from our Smooth-Swap model. The following three figures, Fig. A2, A3, and A4, present the results of the same experiments as Fig. 4, 5, and 6 in the main manuscript, but with different source and target pairs.


