Smooth-Swap: A Simple Enhancement for Face-Swapping with Smoothness

by   Jiseob Kim, et al.
Seoul National University

In recent years, face-swapping models have progressed in generation quality and drawn attention for their applications in privacy protection and entertainment. However, their complex architectures and loss functions often require careful tuning for successful training. In this paper, we propose a new face-swapping model called Smooth-Swap, which focuses on deriving the smoothness of the identity embedding instead of employing complex handcrafted designs. We postulate that the gist of the difficulty in face-swapping is unstable gradients, which can be resolved by a smooth identity embedder. Smooth-Swap adopts an embedder trained using supervised contrastive learning, and we find that its improved smoothness allows faster and more stable training even with a simple U-Net-based generator and three basic loss functions. Extensive experiments on face-swapping benchmarks (FFHQ, FaceForensics++) and face images in the wild show that our model is also quantitatively and qualitatively comparable or even superior to existing methods in terms of identity change.




1 Introduction

Face swapping is the task of switching the person-identity of a given face image with another while preserving other attributes such as facial expressions, head poses, and backgrounds. The task has been highlighted for its wide range of real-world applications, such as anonymization in privacy protection and the creation of new characters in the entertainment industry. With the progress made over the years [baoOpenSetIdentityPreserving2018b, natsumeRSGANFaceSwapping2018, nirkinFSGANSubjectAgnostic2019, liFaceShifterHighFidelity2019, chenSimSwapEfficientFramework2020, thiesDeferredNeuralRendering2019, zhuOneShotFace2021, wangHifiFace3DShape2021], state-of-the-art face-swapping models can generate a swapped image of decent quality using a single shot of a new source identity.

Despite the performance improvement, the existing models usually adopt complex model architectures and numerous loss functions to change the face shape. Changing the face shape is crucial for faithful identity swapping, but it is a nontrivial task: it incurs a dramatic change of pixels, yet no guidance can be given to the model due to the inherent absence of ground-truth swapped images. Thus, previous studies have focused on using handcrafted components such as mask-based mixing [liFaceShifterHighFidelity2019, wangHifiFace3DShape2021] and 3D face-shape modeling [wangHifiFace3DShape2021]. While such components are effective at changing the face shape and improving the quality of the swapped images, they add complexity in the form of hyperparameters and loss functions, which require careful tuning for successful training.

In this study, we postulate that the approaches based on handcrafted components are not the best way to resolve the difficulty of face-swapping. Instead, we propose a new identity embedding model with improved smoothness, which we assume to be most closely related to the gist of the problem. An identity embedding model, or an embedder, plays a key role during the training of the swapping model. It gives the generator gradient information about the direction in which it has to move to change the identity. If the embedder has a non-smooth space, the gradients can be erroneous or noisy. In our proposed model, Smooth-Swap, we consider a new identity embedder trained with supervised contrastive learning [khoslaSupervisedContrastiveLearning2020]. We find that it has a smoother space than the ArcFace embedder [dengArcFaceAdditiveAngular2019], the one used in most of the existing models, and that it enables faster and more stable training.

Through the smooth embedder, Smooth-Swap works without any handcrafted components. It adopts a simple U-Net [ronnebergerUNetConvolutionalNetworks2015]-based generator, and we train it using only three basic losses—identity change, target preserving, and adversarial (Fig. 2). While this setup is simpler than the existing models, we find that our model can still achieve comparable or superior performance by taking a data-driven approach and minimizing inductive bias. We can summarize the advantages of Smooth-Swap as follows:

  • Simple architecture: Smooth-Swap uses a simple U-Net [ronnebergerUNetConvolutionalNetworks2015]-based generator, which, unlike the existing models, does not involve any handcrafted components.

  • Simple loss functions: The Smooth-Swap generator can be trained using minimal loss functions for face-swapping—identity, pixel-level change, and adversarial loss.

  • Fast training: The smooth identity embedder allows faster training of the generator by providing more stable gradient information.

Figure 2: An illustrative comparison of the generator architectures and the loss functions of face-swapping models. Previous models (FaceShifter [liFaceShifterHighFidelity2019] and HifiFace [wangHifiFace3DShape2021]) have face-swapping-specific designs such as mask-based mixing (hatched in purple) or 3D face modeling. Such designs induce complex architectures and various loss functions, which are difficult to balance during training. In contrast, our architecture is very simple, has no task-related heuristics, and is trained with only three typical losses.

2 Related Work

Approaches based on 3D Models and Segmentation

Earlier face-swapping models rely on external modules such as 3D Morphable Models (3DMM) [blanzMorphableModelSynthesis1999] and a face segmentation model. Face2Face [thiesFace2FaceRealtimeFace] and [nirkinFaceSegmentationFace2018] fit the source and the target images to a 3DMM and transfer the expression (and the posture) parameters to synthesize the swapped image. RSGAN [natsumeRSGANFaceSwapping2018], FSNet [natsumeFSNetIdentityAwareGenerative2018], and FSGAN [nirkinFSGANSubjectAgnostic2019] use a segmentation model to separate the facial region from the background and generate the swapped image by switching and blending the regions. Despite the early success, these approaches do not produce high-quality images since their performance depends on the non-trainable external modules.

Feature-based GAN models

In contrast with the approaches above, recent models consider end-to-end training, generating a face-swapped image based on the learned features. IPGAN [baoOpenSetIdentityPreserving2018b] learns separate embedding vectors for the identity and the target attributes, switching and recombining them to generate a swapped image. FaceShifter [liFaceShifterHighFidelity2019] considers multi-level mixing using an encoder-decoder architecture, which alleviates the information loss in IPGAN. SimSwap [chenSimSwapEfficientFramework2020] proposes a weak feature matching to focus more on preserving the facial expression of the target, whereas HifiFace [wangHifiFace3DShape2021] proposes a method integrating a 3D shape model to focus more on active shape change. MegaFS [zhuOneShotFace2021] utilizes a pretrained StyleGAN2 [karrasAnalyzingImprovingImage2020] to generate high-resolution face-swapped images. Although these models have continuously improved the performance of face-swapping, they tend to show weakness in identity change or involve complexity due to handcrafted components.

3 Problem Formulation & Challenges

We first describe the problem formulation and the main technical challenges of face-swapping. Then, we explain how the smoothness of an identity embedder can alleviate them.

3.1 Problem Formulation

When a source image $x_{src}$ and a target image $x_{tgt}$ are given, a face-swapping model needs to generate the swap image, $x_{swap}$, which satisfies the following conditions:

  C1. It has the identity of the source image.

  C2. Other than the identity, it looks the same as the target image (having the same background, pose, etc.).

  C3. It looks realistic (indistinguishable from real images).

To meet these requirements, most face-swapping models [liFaceShifterHighFidelity2019, wangHifiFace3DShape2021] consist of three components: an identity embedder for the source image, a generator for the swapped image, and a discriminator to improve the fidelity. Fig. 2 shows an overview of these face-swapping models including our approach. Note that the identity embedder is pre-trained and frozen during the training of the other components, so it is marked with an asterisk in the superscript.

3.2 Challenges for Changing Identity

Figure 3: When identity is changed from one to another, the corresponding vector in a smooth embedding space would also change smoothly. In a non-smooth embedding space, however, the vector would make discrete jumps. The space can become non-smooth if the embedder is strongly trained on a discriminative task. In this case, the embedder cannot give a good gradient direction for the generator to change the identity correctly. See Sec. 3.3.

The main difficulty in training a face-swapping model comes from the conflict between C1 and C2. Satisfying C1 makes the swap image $x_{swap}$ move away from the target image $x_{tgt}$ to change the identity, whereas satisfying C2 forces it to stay close to $x_{tgt}$. If we could accurately extract the identity-irrelevant change of $x_{swap}$ from $x_{tgt}$ and use it for the loss of C2, this conflict would be relaxed. Unfortunately, designing such a loss function is difficult, so existing models employ a perceptual [zhangUnreasonableEffectivenessDeep2018] or pixel-level loss, which keeps the swapped image from deviating too much from $x_{tgt}$.

As a result, the face shape of $x_{swap}$ typically does not match that of the source image $x_{src}$, since a shape-wise change, such as from a round to a sharp chin, involves a geometric transformation and therefore changes many features and pixels. Previous work has put much effort into changing the face shape correctly, because it is an important attribute for identifying a person. In particular, the recently proposed HifiFace [wangHifiFace3DShape2021] model uses a complex 3D model to better capture the shape. Though this 3D face model helps make shape-wise changes more accurate, it introduces additional complicated components for training the generator. In contrast, we hypothesize that the conflict can be relaxed not by adding new components but by introducing smoothness to the identity embedder. We describe the details in the following section.

3.3 Importance of a Smooth Identity Embedder

Most of the previous face-swapping models use ArcFace [dengArcFaceAdditiveAngular2019] as an identity embedder (embedder for short) since it is one of the state-of-the-art face recognition models. By feeding images into the embedder and comparing the features from the last layer (called embedding vectors), it provides a decent similarity metric for the person-identities of face images. With ArcFace or any other face recognition model, however, we typically deal with a highly non-smooth embedding space, because these models are trained only on a discriminative task.

The smoothness of the embedder, however, is crucial during the training of a face-swapping model. When a model generates a swap image $x_{swap}$ with a wrong identity during training, the embedder has to give a good gradient direction to correct it. This gradient has to be accurate and consistent; otherwise $x_{swap}$ easily goes back to the target image $x_{tgt}$ because of the loss for C2. If the embedding space is non-smooth, the gradient direction can be erroneous or noisy since gradients are only well-defined in a continuous space.

4 Method: Smooth-Swap

We explain our main model called Smooth-Swap. The model introduces a new identity embedder, trained using supervised contrastive learning [khoslaSupervisedContrastiveLearning2020] to improve the smoothness in the embedding space. It also introduces a simple U-Net style generator architecture, which is well suited to the new identity embedder.


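Our identity embedder $f_{emb}$ takes images and outputs the corresponding embedding vectors (e.g., $z_{src} = f_{emb}(x_{src})$). The generator $f_{gen}$ takes a target image $x_{tgt}$ and a source embedding $z_{src}$, and produces the swap image: $x_{swap} = f_{gen}(x_{tgt}, z_{src})$. The discriminator $f_{dis}$ takes an image and outputs a scalar in $[0, 1]$ (close to 0 for fake and 1 for real).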

4.1 Smooth Identity Embedder

As discussed in Sec. 3.3, we desire a smooth embedder for stable and effective training. To train such an embedder, we use the supervised contrastive learning loss of [khoslaSupervisedContrastiveLearning2020]. In this loss, each sample from the training dataset is compared against positive samples (images having the same identity as the sample) and negative samples (images having a different identity).
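To make the loss described above concrete, the following is a minimal sketch of a supervised contrastive loss in the general form of [khoslaSupervisedContrastiveLearning2020], written in plain Python for a tiny batch. The function name `supcon_loss`, the list-based representation, and the toy 2-D embeddings are assumptions for illustration only; the default temperature of 0.07 is the value reported in Sec. 5.1. It is not the authors' implementation.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def supcon_loss(embeddings, labels, temperature=0.07):
    """Supervised contrastive loss averaged over anchors that have positives.

    embeddings: list of unit-length vectors; labels: identity label per vector.
    For each anchor, positives share its label, and every other sample in the
    batch appears in the softmax denominator.
    """
    total, n_anchors = 0.0, 0
    for i, (zi, yi) in enumerate(zip(embeddings, labels)):
        others = [j for j in range(len(embeddings)) if j != i]
        positives = [j for j in others if labels[j] == yi]
        if not positives:
            continue  # anchors without a positive contribute nothing
        logits = {j: dot(zi, embeddings[j]) / temperature for j in others}
        m = max(logits.values())  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits.values()))
        # Mean over positives of the negative log softmax probability.
        total += -sum(logits[p] - log_denom for p in positives) / len(positives)
        n_anchors += 1
    return total / n_anchors

# Toy batch: two identities, two images each (hypothetical 2-D embeddings).
clustered = [normalize(v) for v in ([1.0, 0.1], [1.0, -0.1], [-0.1, 1.0], [0.1, 1.0])]
mixed = [normalize(v) for v in ([1.0, 0.1], [-0.1, 1.0], [1.0, -0.1], [0.1, 1.0])]
labels = [0, 0, 1, 1]
loss_clustered = supcon_loss(clustered, labels)
loss_mixed = supcon_loss(mixed, labels)
```

In this toy batch, `loss_clustered` is much smaller than `loss_mixed`, because the loss rewards embeddings of the same identity for lying close together relative to the other samples.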

An important property of contrastive learning is that it makes the embedding vectors keep the maximal information [wangUnderstandingContrastiveRepresentation2020], and this is closely related to our need for a smooth embedder. If we have face images of the same identity but of a different age or of a different face shape (e.g., from a diet), discriminative embedders like ArcFace [dengArcFaceAdditiveAngular2019] remove this information aggressively to align the embedding vectors. While this is beneficial for classifying identities, it incurs a non-smooth embedding space. When changing the identity from elderly to young or from a round shape to a sharp one in this space, the embedding vectors cannot change smoothly because such information has been removed. For our purpose, embeddings with richer information are more desirable, even if the alignment is compromised, and these can be obtained from contrastive learning. Then, changing from one identity to another follows a smooth path and a good gradient direction can be obtained for training the swapping model (see Fig. 3).


4.2 Generator Architecture

Our generator architecture is adapted from the noise conditional score network (NCSN++), which is one of the state-of-the-art architectures in score-based generative modeling [songScoreBasedGenerativeModeling2020] (Fig. 2). While the original usage of NCSN++ is far different from face-swapping, we find that its U-Net nature [ronnebergerUNetConvolutionalNetworks2015] and conditioning structure are useful for our task. We modify two parts of NCSN++: the time embedding is replaced with the identity embedding, and a direct skip connection from the input to the output is added.

Details on Structure

NCSN++ is basically a U-Net [ronnebergerUNetConvolutionalNetworks2015] with a conditioning structure and modern layer designs such as residual and attention blocks. Its original goal is to take a noisy image and output a score vector having the same dimensionality as the image. Since it has to output a vector conditioned on varying noise levels controlled by time, it also takes a time embedding vector that is added to each residual block after being broadcast over the width and height dimensions. In our design, we replace this embedding vector with the identity embedding, as illustrated in Fig. 2. Also, since the score vector is closer to a difference between images than to an image itself, we add the input image when making the final output image, instead of directly passing the output (i.e., an input-to-output skip connection).

Note our architecture does not include any task-specific design components such as a 3D face model or mask-based mixing from the previous work. It is universal and mostly compatible with score modeling by design.

Loss Functions

To train this generator, we use the three most basic loss functions, each corresponding to one of the conditions C1 to C3 described in Sec. 3.1: an identity loss, a change loss, and an adversarial loss.

The total loss is computed by combining these functions with weights and taking the expectation over (source, target) pairs.

Here, cosine similarity is used to compare identity embeddings, and the change loss involves the number of dimensions of the image; the discriminator is trained with the original loss from [goodfellowGenerativeAdversarialNetworks2014] and the R1 regularizer [meschederWhichTrainingMethods2018]. The loss functions are generally the same as in [liFaceShifterHighFidelity2019], except that we use a simpler pixel-level change loss instead of the feature-level loss (denoted as attribute loss in that paper). For each minibatch, we include one pair in which the same image serves as both source and target, whose change loss effectively acts as a reconstruction loss.
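The three losses can be sketched as follows. The exact functional forms here are assumptions consistent with the description above, not the authors' equations: the identity loss is taken as one minus the cosine similarity between the swap and source embeddings, the change loss as the squared pixel difference between swap and target divided by the number of dimensions, and the adversarial loss as the non-saturating generator loss. All function names and the weights `w_id`, `w_chg`, and `w_adv` are hypothetical placeholders.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def identity_loss(z_swap, z_src):
    """Assumed form: 1 - cos(z_swap, z_src); zero when the identities match."""
    return 1.0 - cosine_similarity(z_swap, z_src)

def change_loss(x_swap, x_tgt):
    """Assumed form: squared pixel difference divided by the number of dimensions."""
    return sum((a - b) ** 2 for a, b in zip(x_swap, x_tgt)) / len(x_swap)

def adversarial_loss(d_out_swap):
    """Non-saturating generator loss for a discriminator output in (0, 1)."""
    return -math.log(d_out_swap)

def total_loss(z_swap, z_src, x_swap, x_tgt, d_out_swap,
               w_id=1.0, w_chg=1.0, w_adv=1.0):
    """Weighted sum of the three terms; the weights are placeholders."""
    return (w_id * identity_loss(z_swap, z_src)
            + w_chg * change_loss(x_swap, x_tgt)
            + w_adv * adversarial_loss(d_out_swap))

# Self-reconstruction pair: the swap equals the target and the identities agree,
# so only the adversarial term remains.
x = [0.2, 0.5, 0.9]
z = [0.6, 0.8]
loss_self = total_loss(z, z, x, x, d_out_swap=0.5)
```

For the self-reconstruction pair at the end, the identity and change terms vanish, which illustrates why the change loss acts as a reconstruction loss when source and target are the same image.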

5 Experiments

Model            VGG↓     VGG-R↓  Arc↑   Arc-R↑  Shp↓   Shp-R↓ | Expr↓  Expr-R↓  Pose↓  Pose-R↓  PoseHN↓
Deepfakes        120.907  0.493   0.443  0.524   0.639  0.464  | 0.802  0.541    0.188  0.445    4.588
FaceShifter      110.875  0.482   -      -       0.658  0.492  | 0.653  0.456    0.177  0.381    3.175
SimSwap          99.736   0.435   -      -       0.662  0.479  | 0.644  0.449    0.178  0.385    3.749
HifiFace         106.655  0.469   0.527  0.550   0.616  0.465  | 0.702  0.484    0.177  0.387    3.370
MegaFS           110.897  0.461   -      -       0.701  0.500  | 0.678  0.436    0.182  0.398    5.456
Smooth-Swap      101.678  0.435   0.464  0.611   0.565  0.403  | 0.722  0.477    0.186  0.395    4.498
Ablation         107.096  0.446   0.421  0.581   0.610  0.415  | 0.669  0.461    0.185  0.398    4.636
Ablation (Arc)   103.767  0.437   -      -       0.682  0.460  | 0.728  0.493    0.192  0.416    5.457
Ablation (Arc)   98.115   0.421   -      -       0.684  0.441  | 0.914  0.543    0.207  0.430    5.655
Shp: shape, Expr: expression, PoseHN: pose metric with Hopenet [sanyalLearningRegress3D2019], (Arc): trained using ArcFace, -: scores cannot be compared because the model uses ArcFace in training.
Table 1: Quantitative comparison between the models. See Sec. 5.2 for the details on each metric and Sec. 5.3 for the discussion. Note the arrow ↓ (or ↑) denotes that the score is the lower (or the higher) the better; the best two scores are marked as bold. The vertical line divides the metrics into two groups: ones related to the identity change (left) and ones related to keeping the target attributes (right). The lower part of the table reports the scores of the ablation models (see Sec. 5.4).

5.1 Training Details


For training the generator, we use the FFHQ dataset [karrasStyleBasedGeneratorArchitecture2018], which contains 70k aligned face images. We use 10% of the images for testing. For training the identity embedder, we use the VGGFace2 dataset [caoVGGFace2DatasetRecognising2018], which contains 3.3M identity-labeled images of 9k subjects. We crop and align the VGGFace2 images using the same procedure as FFHQ. All images, including those of FFHQ, are resized to 256×256.

Architecture Details

Our identity embedder is based on ResNet50 [heDeepResidualLearning2016] architecture. The final, average-pooled feature vector is passed through two fully-connected layers and normalized to unit length. The generator architecture is mostly the same as NCSN++ [songScoreBasedGenerativeModeling2020], except we use half as many channels. The discriminator is set to the same as StyleGAN2 [karrasAnalyzingImprovingImage2020]. The detailed structure of the networks is included in the appendix.


The weights of the three loss terms are fixed during training. The loss function for the discriminator is the non-saturating loss [goodfellowGenerativeAdversarialNetworks2014] along with the R1 regularizer [meschederWhichTrainingMethods2018] to prevent overfitting. For training the generator and the discriminator, we use the Adam optimizer [kingmaAdamMethodStochastic2017] with learning rates of 0.001 (generator) and 0.004 (discriminator). The generator is trained for 800k steps. The batch size is eight, where one pair in the batch uses the same image as both source and target to cover the self-reconstruction case (see Sec. 4.2). We also use Adam for training the embedder, where the learning rate starts at 0.001 and is decreased by a factor of 10 at 60%, 75%, and 90% of the total 101k steps. The batch size is 128 (32 identities, four instances each), and the temperature is 0.07 as suggested in [khoslaSupervisedContrastiveLearning2020].
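The embedder's learning-rate schedule described above can be sketched as a simple step decay. The function name `embedder_learning_rate` is hypothetical, and whether the drop takes effect exactly at the milestone step is an assumption; the numeric values are those stated in the text.

```python
def embedder_learning_rate(step, total_steps=101_000, base_lr=0.001,
                           milestones=(0.60, 0.75, 0.90), factor=10.0):
    """Step-decay schedule: divide the rate by `factor` at each milestone."""
    # Count how many milestones (as fractions of training) have been reached.
    passed = sum(1 for m in milestones if step >= m * total_steps)
    return base_lr / (factor ** passed)

lr_start = embedder_learning_rate(0)        # before any milestone
lr_mid = embedder_learning_rate(70_000)     # after the 60% milestone only
lr_end = embedder_learning_rate(100_999)    # after all three milestones
```

With these settings the rate goes from 0.001 down to one thousandth of that value by the end of training.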

5.2 Evaluation Details

Compared Models

We compare our Smooth-Swap models with the latest feature-based face-swapping models: FaceShifter [liFaceShifterHighFidelity2019], MegaFS [zhuOneShotFace2021], HifiFace [wangHifiFace3DShape2021], SimSwap [chenSimSwapEfficientFramework2020], and Neural Textures [thiesDeferredNeuralRendering2019]. We also compare two of the earliest models: Deepfakes [deepfakesDeepFakesHttpsGithub2021] and Faceswap [marekFaceSwapHttpsGithub2021].

Quantitative Evaluations

Since most of the compared models do not release their source code to the public, the current standard for evaluating the models is to compare their generated images on the FaceForensics++ (FF++) dataset [rosslerFaceForensicsLearningDetect2019] (some of these images are available on the project page of each model), and we follow this practice.

We evaluate various metrics that can be grouped into the following: identity, shape, expression, and pose. We want the swap image $x_{swap}$ to be close to the source $x_{src}$ for the first two, whereas we want it to be close to the target $x_{tgt}$ for the other two. To evaluate identity, we use the VGGFace2 [caoVGGFace2DatasetRecognising2018] and ArcFace [dengArcFaceAdditiveAngular2019] embedders and compute the embedding distance and cosine similarity between $x_{swap}$ and $x_{src}$. Compared with the retrieval accuracy used in [wangHifiFace3DShape2021, liFaceShifterHighFidelity2019, chenSimSwapEfficientFramework2020], which classifies among fixed candidates, this metric allows a more fine-grained comparison. To evaluate shape, expression, and pose, we follow the evaluation protocol of [wangHifiFace3DShape2021]; i.e., we use the 3D face model of [sanyalLearningRegress3D2019] to get the parameters of each class and compute the L2 distances.

When applicable, we compute relative distances and similarities (denoted by '-R') as well; for example, a relative version of the VGGFace2 embedding distance is computed (for pose and expression, the numerator is changed to the distance from the target image). This is to reflect how humans perceive the changes: to our eyes, what matters is not only that the identity of the swap image is close to that of the source but also that it is far from that of the target.
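One way to realize such a relative score is to divide the distance to the reference image by the sum of the distances to both the source and the target. This normalized form is an assumption made for illustration (it yields values between 0 and 1, in line with the '-R' columns of Table 1), and the function name and the numbers below are hypothetical.

```python
def relative_score(to_reference, to_other):
    """Share of `to_reference` in the sum of both distances (or similarities)."""
    return to_reference / (to_reference + to_other)

# Hypothetical distances for one swapped image.
d_swap_src, d_swap_tgt = 90.0, 110.0
# Identity metrics: the reference is the source, so lower is better.
vgg_r = relative_score(d_swap_src, d_swap_tgt)
# Pose and expression: the reference is the target, so the numerator changes.
pose_r = relative_score(0.2, 0.3)
```

Under this form, a value below 0.5 for a distance-based identity score means the swapped image is closer to the source than to the target.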

Figure 4: Comparison of the face-swapping results of various models on the FaceForensics++ dataset [rosslerFaceForensicsLearningDetect2019]. The results from our models show the most active identity and shape change, reflecting the characteristics of the source identities. Note there are minor frame differences among the results as the images are extracted from videos.

5.3 Basic Face-Swapping Performance

We first conducted face-swapping on the FaceForensics++ dataset and compared the results with those of the other models, as shown in Fig. 4. The figure shows that our model is more aggressive in changing identity, especially in face shape. For example, in the second and the fourth rows, our swapped images show more round and grown chin shapes reflecting the characteristics of the source identity (more extreme cases can be found in Fig. 5); the images from the other models are mostly confined to textural change. Also, we can observe that other identity-related attributes, such as skin tones or hair colors, are matched more closely to the source in our results, making the overall figure visually closer to the source. Fig. 5 and 6 show the results of Smooth-Swap on FFHQ and face images in the wild. More samples can be found in the appendix.

The same trend can be seen from the quantitative results summarized in Table 1. In the table, Smooth-Swap shows good identity scores (VGG, Arc, and Shp). While our model is not as good in the other scores, it shows comparable numbers (it is never the worst among the compared models). We note that the model could recover the expression scores to some degree if the identity-loss weight is reduced to one; however, our focus here is more on the identity change.

Figure 5: The results of Smooth-Swap on the FFHQ test split (uncurated). An active change of identity (e.g., row-1, column-2) is observed, but some artifacts can be also found when the source identity has a complicated hair pattern (column-1).
Figure 6: Face swapping results of Smooth-Swap on wild images. More samples are included in appendix.

5.4 Ablation Study on the Identity Embedder

To see how our identity embedder makes a difference, we train our models using ArcFace [dengArcFaceAdditiveAngular2019] as well. As seen from the lower part of Table 1, the models using ArcFace perform worse in most of the metrics.

More importantly, we observe that our embedder enables faster and more stable training. In Fig. 7, the left graph shows that the identity loss of our model converges faster than that of the model using ArcFace. Note this is not due to the scales or the choice of the identity-loss weight, since Arc16, which has a similar rate of identity-loss drop, shows a significantly worse curve for the change loss.

The same phenomenon can be seen in Fig. 8. When paired with the ArcFace embedder, the models train slowly, rarely changing identity until they reach 400k training steps. In contrast, the models with our embedder begin to change the identity as early as 100k steps.

Figure 7: Ablation study of the identity embedding model: ours (solid) versus ArcFace (dashed) [dengArcFaceAdditiveAngular2019]. The number next to the model name indicates the identity-loss weight used for training. It can be seen that the model learns to change identity much faster with our embedder while being stable in the change loss. See Sec. 5.4 for the discussion.
Figure 8: The progression of model training with different identity embedding models and loss weightings; the generator architecture is fixed to ours. The models with the ArcFace embedder [dengArcFaceAdditiveAngular2019] show slow training, making little identity change until being trained for 400k steps. On the other hand, the models with our embedder show identity change as early as 100k steps. See Sec. 5.4.
Figure 9: Inspection of the smoothness of embedders via interpolation. For two randomly picked images from the FFHQ test split (the leftmost and the rightmost), we compute the interpolations in the embedding space. For each of the nine interpolating points, we retrieve the closest images from the train split. Our embedder tends to show continuously changing identities, whereas others show repeating identities, implying non-smoothness of the space. The graph on the right shows that our embedder distributes the identities more uniformly. The distances are normalized by the average of 4k random pairs for each embedder. See Sec. 5.5.


5.5 Identity Embedding Performance

The advantage expected from our identity embedder is smoothness; in particular, a smooth change of identities along the interpolation curve as shown in Fig. 3. To quantitatively evaluate this, we devised a smoothness score and compared our embedder with other baseline embedders.

The score measures the (normalized) gap between the average point of two identity embedding vectors, $z_r$, and the closest valid embedding to it (here, $r$ is an averaging ratio). If the embedding space is smooth, this gap has to be small.

The notion of a valid embedding depends on the setting. When measured using samples, the two embedding vectors are computed from samples of the FFHQ dataset, and the valid embeddings are those of images in the dataset. When measured using a GAN, the two embedding vectors are computed from samples generated by a pretrained StyleGAN2 [karrasAnalyzingImprovingImage2020], and the valid embeddings are those of images that its generator produces from latent codes (see the appendix for details).
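The sample-based variant of this score can be sketched as follows. The details are assumptions for illustration: the averaged point is projected back to unit length (the embeddings are unit-normalized, see Sec. 5.1), Euclidean distance is used, and `norm_const` stands in for the normalization constant such as the mean distance between random pairs. The function names and the toy 2-D embeddings are hypothetical.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def smoothness_gap(z_a, z_b, valid_embeddings, r=0.5, norm_const=1.0):
    """Gap between the averaged point z_r and its nearest valid embedding."""
    # z_r: weighted average with ratio r, projected back onto the unit sphere.
    z_r = normalize([(1 - r) * x + r * y for x, y in zip(z_a, z_b)])
    return min(distance(z_r, z) for z in valid_embeddings) / norm_const

# Toy 2-D example: a dense set of valid embeddings versus a sparse one.
z_a, z_b = [1.0, 0.0], [0.0, 1.0]
dense = [[math.cos(t), math.sin(t)] for t in (0.0, 0.4, 0.8, 1.2, math.pi / 2)]
sparse = [z_a, z_b]
gap_dense = smoothness_gap(z_a, z_b, dense)
gap_sparse = smoothness_gap(z_a, z_b, sparse)
```

When valid embeddings fill the space between the two endpoints, as in `dense`, the gap is small; when they only cluster at the endpoints, as in `sparse`, the gap is large, which corresponds to a non-smooth space.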

As seen from Table 2, our model shows substantially better smoothness while maintaining verification performance comparable to ArcFace and VGGFace2. Note LFW [LFWTech] is one of the standard benchmark datasets for verification; VCHQ is a dataset we derived from VoxCeleb [nagraniVoxCelebLargescaleSpeaker2018] (see the appendix for the details).

The same trend is also qualitatively confirmed in Fig. 9. The figure shows the retrieved images for each of the interpolating points. Our embedder tends to change smoothly while moving along the interpolation curve; others tend to stick with the same identities repeatedly. To quantify this, we compute the number of unique images for each interpolation (the lower, the more repetition, and the worse). Summarizing the results from 64 sample pairs, the numbers were 5.13±1.18 (Ours), 4.42±1.41 (VGGFace2), and 4.25±1.33 (ArcFace).
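The unique-image count can be illustrated with a small sketch. It assumes linear interpolation between the two embeddings with nine interior points (as in Fig. 9) and nearest-neighbor retrieval by Euclidean distance; the function names and the toy 1-D galleries are hypothetical and much simpler than real embeddings.

```python
import math

def nearest_index(query, gallery):
    """Index of the gallery embedding closest to `query` (Euclidean distance)."""
    return min(range(len(gallery)), key=lambda i: math.dist(query, gallery[i]))

def unique_retrievals(z_a, z_b, gallery, n_points=9):
    """Count distinct gallery items retrieved along a linear interpolation."""
    retrieved = set()
    for k in range(1, n_points + 1):
        r = k / (n_points + 1)  # interior points only, endpoints excluded
        z_r = [(1 - r) * a + r * b for a, b in zip(z_a, z_b)]
        retrieved.add(nearest_index(z_r, gallery))
    return len(retrieved)

# Toy 1-D galleries: evenly spread embeddings versus two tight clusters.
spread = [[i / 10] for i in range(11)]
clustered = [[0.0], [0.01], [0.02], [0.98], [0.99], [1.0]]
n_spread = unique_retrievals([0.0], [1.0], spread)
n_clustered = unique_retrievals([0.0], [1.0], clustered)
```

The evenly spread gallery returns a different item at every interpolating point, whereas the clustered gallery keeps returning the same two items, mirroring the repeated identities observed for the non-smooth embedders.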

          w/smp            w/GAN   Verification AUC
          r=0.25  r=0.5    r=0.5   VCHQ   VGG2   LFW
CE-Lin    0.333   0.354    0.797   0.939  0.994  1.000
CE-Arc    0.404   0.430    0.914   0.925  0.997  0.998
ArcFace   0.360   0.380    0.802   -      -      0.995
Ours      0.116   0.135    0.671   0.956  0.994  0.999
Table 2: Scores of the embedder models. Our model shows far better smoothness scores while maintaining comparable verification scores. CE-Lin and CE-Arc are reproduced versions of VGGFace2 [caoVGGFace2DatasetRecognising2018] and ArcFace [dengArcFaceAdditiveAngular2019], trained on the FFHQ-aligned VGGFace2 dataset. ArcFace is the original model provided in [dengArcFaceAdditiveAngular2019], trained on a larger dataset with a different alignment.

6 Conclusion

We have proposed a new face-swapping model, Smooth-Swap, to generate high-quality swap images with active change of face shapes. While many existing models use handcrafted components to tackle the difficulty, our model stays with the simplest generator architecture and instead considers a smooth identity embedding. By taking this data-driven approach with minimal inductive bias, we observed that Smooth-Swap can achieve superior scores in identity-related metrics. Also, by virtue of the smooth embedding model, our model can get stable gradient information, which allowed much faster training.

We believe this study can open up opportunities for tackling more difficult face-swapping problems by reducing the complexity considerably. With reduced effort for balancing the components and reduced memory usage of the model parameters, one could consider an expanded problem scope, such as modeling face-swapping on videos in an end-to-end manner. A downside of our current model in that regard is some performance drop in preserving the pose and expression. However, we suppose that simple fine-tuning or a different hyperparameter choice would be sufficient to meet the goal.

Potential Negative Societal Impact

Face-swapping models, more commonly known to the public as Deepfake technology, have been used maliciously with serious negative impacts, including the spread of fake news. Nonetheless, we believe that studying the face-swapping task is important and necessary because a deep understanding of this task could set a good starting point for developing high-quality Deepfake detection algorithms [rosslerFaceForensicsLearningDetect2019]. In addition, we remark that face-swapping also has many positive applications, including anonymization for privacy protection and creating new characters without CGI techniques in the entertainment industry.


Appendix A Architecture Details

A.1 Identity Embedding Model

Our identity embedding model is based on ResNet-50 [heDeepResidualLearning2015], where we use a different head for contrastive learning as seen in Fig. A1. Note the UnitNorm in the final layer makes the output embedding unit-length. The network involves 32.3M parameters in total.

A.2 Swap-Image Generator

Our generator architecture is mostly the same as NCSN++ [songScoreBasedGenerativeModeling2020] except for the following three differences (as described in the main manuscript, Sec. 4.2): 1) we use half as many channels, 2) we use the identity embedding instead of the time embedding, and 3) we add an input-to-output skip connection. Fig. A1 (a) shows the detailed structure with dimensional information. The network involves 9.8M parameters in total.

Up/Down Sampling & Skip-Connections

Note that in each of the outer blocks containing multiple ResBlocks, the first ResBlock handles upsampling or downsampling (except for the ResBlock x5, where the second ResBlock handles upsampling). There are 13 skip connections in total, including the input-to-output skip, where the input to each of the ResBlocks in the encoder part (before the Attention Block) is handed over to the decoder part (after the Attention Block). On the decoder side, the first three ResBlocks of each outer block get the skip-connections (except for the ResBlock x5, where the second through the fourth get the skip-connections).

Details on the ResBlocks of the Generator

We describe essential details on the ResBlocks of the generator. One can find the complete information in [songScoreBasedGenerativeModeling2020] or from the attached source code.

The overall structure of our ResBlock is not much different from the conventional designs [heDeepResidualLearning2015]. However, as shown in Fig. A1 (b), a structure for conditioning on the identity embedding vector is added (similar to [brockLargeScaleGAN2019]). The conditioning is done by 1) projecting the embedding onto a $C$-dimensional vector, 2) spatially broadcasting the result, and 3) adding it to the intermediate output of the original path ($C$ is the number of output channels of the current block). When upsampling or downsampling is used, the optional components (denoted by yellow and dash-dotted outline) are also computed.
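The three conditioning steps can be sketched in plain Python as follows. This is an illustrative sketch only: the linear projection with a bias term, the nested-list layout indexed as [channel][row][col], and all names and numbers are assumptions rather than details taken from the attached source code.

```python
def project(z, weight, bias):
    """Linear projection of the embedding to one value per output channel."""
    return [sum(w * x for w, x in zip(row, z)) + b for row, b in zip(weight, bias)]

def condition_feature_map(features, z, weight, bias):
    """Add the projected embedding to every spatial position of each channel.

    features: nested list indexed as [channel][row][col].
    """
    shift = project(z, weight, bias)            # step 1: one value per channel
    return [[[v + shift[c] for v in row]        # steps 2-3: broadcast and add
             for row in channel]
            for c, channel in enumerate(features)]

# Toy example: 2 channels of size 2x2 and a 3-dimensional identity embedding.
features = [[[0.0, 1.0], [2.0, 3.0]],
            [[0.0, 0.0], [0.0, 0.0]]]
z = [1.0, 0.0, -1.0]
weight = [[1.0, 0.0, 0.0],    # hypothetical projection weights (channels x dim of z)
          [0.0, 0.0, 2.0]]
bias = [0.0, 0.5]
conditioned = condition_feature_map(features, z, weight, bias)
```

Each channel receives a single offset that depends only on the identity embedding, so the conditioning shifts feature maps uniformly over the spatial dimensions.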

A.3 Discriminator

We use the same discriminator as the one used in StyleGAN2 [karrasAnalyzingImprovingImage2020]. The network involves 28.9M parameters in total.

Figure A1: Detailed architecture of our Smooth-Swap model; both the embedder and the generator are shown. The intermediate feature-map dimensions are written in the order of (channels × height × width). 'ResBlock x4' in (a) denotes that there are four residual sub-blocks connected sequentially; the structure of the sub-block (i.e., ResBlock) is detailed in (b). Note the multi-line text should be read from the bottom (e.g., Linear(512) to BatchNorm to UnitNorm). See Sec. A.2.

Appendix B More Image Samples from Smooth-Swap

We show extended sets of swapped-image samples from our Smooth-Swap model. The following three figures, Fig. A2, A3, and A4 present the results of the same experiments as Fig. 4, 5, and 6 in the main manuscript, but with different source and target pairs.

Figure A2: Comparison of the face-swapping results of various models on the FaceForensics++ dataset [rosslerFaceForensicsLearningDetect2019] (extension of Fig. 4 in the main manuscript)
Figure A3: More results of Smooth-Swap on the FFHQ test split (extension of Fig. 5 in the main manuscript). Active change of identity is observed. However, in some cases where the source and the target have largely different face shapes (e.g., a child in the rightmost column in the lower-right block), artifacts are noticed. In real-world applications, such cases can be avoided by choosing the swapping pairs from a similar age range.
Figure A4: More face swapping results of Smooth-Swap on wild images (extension of Fig. 6 in the main manuscript).