Python implementation of ADD: Frequency Attention and Multi-View based Knowledge Distillation to Detect Low-Quality Compressed Deepfake Images, AAAI2022.
Despite significant advances in deep learning-based forgery detectors for distinguishing manipulated deepfake images, most detection approaches suffer from moderate to significant performance degradation on low-quality compressed deepfake images. Because of the limited information in low-quality images, detecting low-quality deepfakes remains an important challenge. In this work, we apply frequency domain learning and optimal transport theory in knowledge distillation (KD) to specifically improve the detection of low-quality compressed deepfake images. We explore the transfer learning capability of KD to enable a student network to learn discriminative features from low-quality images effectively. In particular, we propose the Attention-based Deepfake detection Distiller (ADD), which consists of two novel distillations: 1) frequency attention distillation that effectively retrieves the removed high-frequency components in the student network, and 2) multi-view attention distillation that creates multiple attention vectors by slicing the teacher's and student's tensors under different views to transfer the teacher tensor's distribution to the student more efficiently. Our extensive experimental results demonstrate that our approach outperforms state-of-the-art baselines in detecting low-quality compressed deepfake images.
Recently, facial manipulation techniques using deep learning methods such as deepfakes have drawn considerable attention rossler2019faceforensics++; pidhorskyi2020adversarial; richardson2020encoding; nitzan2020face. Deepfakes have become more realistic and sophisticated, making them difficult to distinguish with the human eye siarohin2020first, and generating such realistic deepfakes has become much easier than before. Such advancements and convenience enable even novices to easily create highly realistic fake faces for simple entertainment. However, these fake images raise serious security, privacy, and social concerns, as they can be abused for malicious purposes such as impersonation fraud, revenge pornography cole_2018, and fake news propagation quandt2019fake.
To address such problems arising from deepfakes, there have been immense research efforts in developing effective deepfake detectors dzanic2019fourier; rossler2019faceforensics++; wang2019fakespotter; li2018exposing; khayatkhoei2020spatial; zhang2019detecting. Most approaches are deep learning-based and generally perform well when a large amount of high-resolution training data is available. However, their performance drops dramatically on compressed low-resolution images dzanic2019fourier; rossler2019faceforensics++ due to the lack of available pixel information to sufficiently distinguish fake images from real ones. In other words, compression can remove the subtle differences and artifacts, such as sharp edges in hairs and lips, that could otherwise be leveraged to differentiate deepfakes. Therefore, effectively detecting low-quality compressed deepfakes, which frequently occur on social media and mobile platforms in bandwidth-challenged and storage-limited environments, remains an important challenge.
In this work, we propose the Attention-based Deepfake detection Distiller (ADD). Our primary goal is to detect low-quality (LQ) deepfakes, which are less explored in most previous studies but play a pivotal role in real-world scenarios. First, we assume that high-quality (HQ) images are readily available, similar to the settings in other studies rossler2019faceforensics++; wang2019fakespotter; li2018exposing; khayatkhoei2020spatial; zhang2019detecting; dzanic2019fourier, and we use knowledge distillation (KD) as the overarching backbone architecture to detect low-quality deepfakes. While most existing knowledge distillation methods aim to reduce the student size for model compression applications or to improve the performance of lightweight deep learning models hinton2015distilling; tian2019contrastive; huang2017like; passalis2018learning, we hypothesize that a student can learn the lost distinctive features of low-quality compressed images from a teacher that is pre-trained on high-quality images for deepfake detection. We first lay out the following two major challenges associated with detecting LQ compressed deepfakes, and provide the intuitions behind our approaches to overcome them:
1) Loss of high-frequency information. As discussed, while lossy image compression algorithms make changes visually unnoticeable to humans, they can significantly reduce DNNs' deepfake detection capability by removing the fine-grained artifacts in high-frequency components. To investigate this phenomenon more concretely, we revisit the Frequency Principle (F-Principle) xu2019training, which describes the learning behavior of general DNNs in the frequency domain: DNNs tend to learn dominant low-frequency components first and then capture high-frequency components later during training xu2019training. To illustrate this issue, Fig. 1 indicates that most of the information lost during compression comes from high-frequency components. As a consequence, general DNNs shift their attention in later training epochs to high-frequency components, which now represent intrinsic characteristics of objects in each individual image rather than discriminative features. This learning process increases the variance of DNNs' decision boundaries and induces overfitting, thereby degrading detection performance. A trivial way to tackle the overfitting is early stopping morgan1989generalization; however, fine-grained artifacts of deepfakes can then be omitted, especially when images are highly compressed. To overcome this issue, we propose the novel frequency attention distiller, which guides the student during training to effectively recover, from the teacher, the high-frequency components removed in low-quality compressed images.
2) Loss of correlated information. Under heavy compression, crucial features and pixel correlations, which not only capture intra-class variations but also characterize inter-class differences, are degraded as well. These correlations are essential for CNNs' ability to learn features with their local filters, but they are largely removed from compressed input images.
Recent studies wang2018non; hu2018relation have empirically demonstrated that training DNNs to capture this correlated information can successfully improve their performance. Therefore, in this work, we focus on recovering the lost correlations by proposing a novel multi-view attention, inspired by the work of Bonneel et al. bonneel2015sliced and contrastive distillation tian2019contrastive. Existing approaches rely on the element-wise discrepancy between the teacher's and student's tensors, which ignores relationships within local regions of pixels, or on channel-wise attention, which only considers a single dimension of the backbone features. In contrast, our proposed method attends to output tensors from multiple views (slices) using the Sliced Wasserstein distance (SWD) bonneel2015sliced. Our multi-view attention distiller thus guides the student to mimic its teacher more efficiently through a geometrically meaningful metric based on SWD. In summary, we present our overall Attention-based Deepfake detection Distiller (ADD), which consists of two novel distillations (see Fig. 2): 1) frequency attention distillation that effectively retrieves the removed high-frequency components in the student network, and 2) multi-view attention distillation that creates multiple attention vectors by slicing the teacher's and student's tensors under different views to transfer the teacher tensor's distribution to the student more efficiently.
Our contributions are summarized as follows:
We propose the novel frequency attention distillation, which effectively enables the student to retrieve high-frequency information from the teacher.
We develop the novel multi-view attention distillation with contrastive distillation for the student to efficiently mimic the teacher while maintaining pixel correlations from the teacher to the student through SWD.
We demonstrate that our approach outperforms well-known baselines, including attention-based distillation methods, on different low-quality compressed deepfake datasets.
Deepfake detection. Deepfake detection has recently drawn significant attention, as it is related to protecting personal privacy, and a large number of research works aim to identify such deepfakes rossler2019faceforensics++; li2020face; li2019faceshifter; jeon2020t; rahmouni2017distinguishing; wang2019fakespotter; li2018exposing. Li et al. li2020face exposed the blending boundaries of generated faces and showed the effectiveness of their method when applied to unseen face manipulation techniques. Self-training with L2-starting-point regularization was introduced by Jeon et al. jeon2020t to detect newly generated images. However, the majority of prior works are limited to high-quality (HQ) synthetic images, which are rather straightforward to detect by constructing binary classifiers with a large amount of HQ images.
Knowledge distillation (KD). First introduced by Hinton et al. hinton2015distilling, KD is a training technique that transfers acquired knowledge from a pre-trained teacher model to a student model for model compression applications. Many existing works yim2017gift; tian2019contrastive; huang2017like; passalis2018learning applied different types of distillation to conventional datasets; e.g., zhu2019low used FitNets romero2014fitnets to train a student model able to detect low-resolution images, which is similar to our method in that the teacher and the student learn to detect high- and low-quality images, respectively. However, their approach coerces the student to mimic the penultimate layer's distribution from the teacher, which does not possess the rich features of the lower layers.
In order to encourage the student model to mimic the teacher more effectively, Zagoruyko and Komodakis zagoruyko2016paying proposed activation-based attention transfer, similar to FitNets, but achieving better performance by creating spatial attention maps. Our multi-view attention method inherits from this approach but carries more generalization ability by not only exploiting spatial attention (in the width and height dimensions), but also introducing attention features from random dimensions using the Radon transform helgason2010integral. Thus, our approach pushes the student's backbone features closer to the teacher's.
In addition, inspired by the InfoNCE loss oord2018representation, Tian et al. tian2019contrastive proposed contrastive representation distillation (CRD), which formulates a contrastive learning framework that motivates the student network to drive samples from positive pairs closer and push away those from negative pairs. Although CRD achieves superior performance to previous approaches, it requires a large memory buffer to save the embedding features of each sample, which is restrictive when the training set and embedding space become larger. Instead, we directly sample positive and negative images in the same mini-batch and apply the contrastive loss to embedded features, similar to the Siamese network bromley1994signature.
Frequency domain learning. In the field of media forensics, several approaches jiang2020focal; khayatkhoei2020spatial; dzanic2019fourier showed that discrepancies in the high-frequency Fourier spectrum are effective clues for distinguishing CNN-generated images. Frank et al. frank2020leveraging and Zhang et al. zhang2019detecting utilized the checkerboard artifacts odena2016deconvolution in the frequency spectrum, caused by the up-sampling components of generative adversarial networks (GANs), as effective features for detecting GAN-based fake images. Nevertheless, their detection performance degrades greatly when the training images are compressed into low quality. Qian et al. qian2020thinking proposed an effective frequency-based forgery detection method, named F3-Net, which decomposes an input image into many frequency components and incorporates local frequency statistics in a two-stream network. F3-Net, however, doubles the number of parameters of its backbone.
Wasserstein distance. Induced by optimal transport theory, the Wasserstein distance (WD) villani2008optimal and its variations have been explored for training DNNs to learn a particular distribution, thanks to Wasserstein's underlying geometrically meaningful distance property. WD-based applications cover a wide range of fields, such as improving generative models arjovsky2017wasserstein; deshpande2018generative, learning the distribution of latent space in autoencoders kolouri2018sliced; xu2020learning, and matching features in domain adaptation tasks lee2019sliced. In this work, we utilize the Wasserstein metric to provide the student geometrically meaningful guidance to efficiently mimic the teacher's tensor distribution. Thus, the student can learn the true tensor distribution, even though its input features are partially degraded through high compression.
Our Attention-based Deepfake detection Distiller (ADD) consists of the following two novel distillations (see Fig. 2): 1) frequency attention distillation and 2) multi-view attention distillation.
Let $S$ and $T$ be the student and the pre-trained teacher network, respectively. By forwarding a low-quality compressed input image through $S$ and its corresponding raw image through $T$, we obtain backbone features $F^S$ and $F^T \in \mathbb{R}^{C \times W \times H}$, with $C$ channels, width $W$, and height $H$. To create frequency representations, the Discrete Fourier Transform (DFT) is applied to each channel as follows:

$$\mathcal{F}_c(u, v) = \sum_{w=0}^{W-1} \sum_{h=0}^{H-1} F_c(w, h)\, e^{-2\pi i \left( \frac{uw}{W} + \frac{vh}{H} \right)}$$
where $F_c(w, h)$ denotes the value at the $w$-th and $h$-th slice in the width and height dimensions of the $c$-th channel of $F^S$ or $F^T$. Here, for convenience, we omit the superscript to denote that the function is applied independently to both the student's and teacher's backbone features. Then, the value $\mathcal{F}_c(u, v)$ of each single feature map indicates the coefficient of a basis frequency component. The difference between a pair of corresponding coefficients from the teacher and the student represents the absence of that frequency component in the student. Next, let $d(\cdot, \cdot)$ be a metric that assesses the distance between two complex numbers and supports stochastic gradient descent. Then, the frequency loss between the teacher and student can be defined as follows:

$$\mathcal{L}_{fre} = \frac{1}{CWH} \sum_{u,v} \alpha_{u,v} \sum_{c} d\big(\mathcal{F}^T_c(u, v), \mathcal{F}^S_c(u, v)\big)$$
where $\alpha_{u,v}$ is an attention weight at position $(u, v)$. In this work, we utilize the exponential of the difference across channels between the teacher and student as the weight in the following way:

$$\alpha_{u,v} = \exp\left( \frac{\gamma}{C} \sum_{c} d\big(\mathcal{F}^T_c(u, v), \mathcal{F}^S_c(u, v)\big) \right)$$
where $\gamma$ is a positive hyper-parameter that makes the loss accumulate exponentially as the student's amount of removed frequency content increases. This design of attention weights ensures that the model focuses more on the missing high frequencies, and makes Eq. 2 partly similar to the focal loss lin2017focal. Figure 3 visually illustrates our frequency loss.
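A minimal NumPy sketch of this loss, under our own assumptions about the exact form (the function name, `gamma`, and the per-position weighting are illustrative, not the released implementation):

```python
import numpy as np

def frequency_attention_loss(feat_t, feat_s, gamma=0.01):
    """Sketch of the frequency attention loss: per-channel 2-D DFT, squared-modulus
    distance between corresponding coefficients, and an exponential attention weight
    that grows with the average per-position discrepancy across channels."""
    # feat_*: (C, W, H) backbone features from the teacher / student.
    Ft = np.fft.fft2(feat_t, axes=(-2, -1))
    Fs = np.fft.fft2(feat_s, axes=(-2, -1))
    diff = np.abs(Ft - Fs) ** 2                      # d(z1, z2) = |z1 - z2|^2 per coefficient
    # Weight at each (u, v): exponential of the channel-averaged discrepancy, so
    # positions where the student misses more frequency content weigh more.
    weight = np.exp(gamma * diff.mean(axis=0, keepdims=True))
    return float((weight * diff).mean())

rng = np.random.default_rng(1)
teacher = rng.standard_normal((8, 16, 16))
student_close = teacher + 0.01 * rng.standard_normal((8, 16, 16))
student_far = rng.standard_normal((8, 16, 16))
print(frequency_attention_loss(teacher, student_close))  # small
print(frequency_attention_loss(teacher, student_far))    # much larger
```

The exponential weighting is what gives the loss its focal-like behavior: coefficients the student already matches contribute almost linearly, while badly missed ones are amplified.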
Sliced Wasserstein distance. The p-Wasserstein distance between two probability measures $\mu$ and $\nu$ villani2008optimal, with corresponding probability density functions $p_\mu$ and $p_\nu$ on a probability space $\Omega$, is defined as follows:

$$W_p(\mu, \nu) = \left( \inf_{\pi \in \Pi(\mu, \nu)} \int_{\Omega \times \Omega} c(x, y)^p \, d\pi(x, y) \right)^{1/p}$$
where $\Pi(\mu, \nu)$ is the set of all transportation plans $\pi$ whose marginal densities are $p_\mu$ and $p_\nu$, respectively, and $c(x, y)$ is a transportation cost function. Equation 4 searches for an optimal transportation plan between $\mu$ and $\nu$, which is also known as the Kantorovich formulation kantorovitch1958translocation. In the case of a one-dimensional probability space, i.e., $\Omega \subseteq \mathbb{R}$, the p-Wasserstein distance has the closed-form solution:

$$W_p(\mu, \nu) = \left( \int_0^1 \left| P_\mu^{-1}(t) - P_\nu^{-1}(t) \right|^p dt \right)^{1/p}$$
where $P_\mu$ and $P_\nu$ are the cumulative distribution functions of $\mu$ and $\nu$, respectively.
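For empirical distributions with equal sample counts, the 1-D closed form reduces to sorting both samples and averaging the absolute differences; a small NumPy sketch (our own helper, `w1_1d`):

```python
import numpy as np

def w1_1d(a, b):
    """Closed-form 1-Wasserstein distance between two 1-D empirical distributions
    with equal sample counts: sort both samples, then average absolute differences."""
    return float(np.mean(np.abs(np.sort(a) - np.sort(b))))

# Shifting a distribution by a constant c moves it exactly c in W1.
rng = np.random.default_rng(0)
a = rng.standard_normal(1000)
print(w1_1d(a, a + 3.0))  # 3.0
```

Sorting realizes the optimal monotone coupling, which is why no explicit transportation plan needs to be solved for in one dimension.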
A variation of the Wasserstein distance, inspired by the above closed-form solution, is the Sliced Wasserstein distance (SWD), which deploys multiple projections from a high-dimensional distribution to various one-dimensional marginal distributions and calculates the optimal transportation cost for each projection. In order to construct these one-dimensional marginal distributions, we use the Radon transform helgason2010integral, which is defined as follows:

$$\mathcal{R}p(t, \theta) = \int_{\mathbb{R}^d} p(x)\, \delta(t - \langle x, \theta \rangle)\, dx, \quad \forall \theta \in \mathbb{S}^{d-1}$$
where $\delta$ denotes the Dirac delta function, $\langle \cdot, \cdot \rangle$ is the Euclidean inner product, and $\mathbb{S}^{d-1}$ is the $d$-dimensional unit sphere. Thus, $\mathcal{R}p(\cdot, \theta)$ denotes a 1-D marginal distribution of $p$ under the projection onto $\theta$. The Sliced 1-Wasserstein distance is defined as follows:

$$SW_1(\mu, \nu) = \int_{\mathbb{S}^{d-1}} W_1\big(\mathcal{R}p_\mu(\cdot, \theta), \mathcal{R}p_\nu(\cdot, \theta)\big)\, d\theta$$
Now, we can calculate the Sliced Wasserstein distance by optimizing a series of 1-D transportation problems, each of which has a closed-form solution computable in $\mathcal{O}(n \log n)$ rabin2011wasserstein. In particular, by sorting the projected samples of $\mu$ and $\nu$ in ascending order using two permutation operators $\sigma$ and $\tau$, respectively, the $SW_1$ can be approximated as follows:

$$SW_1(\mu, \nu) \approx \frac{1}{K} \sum_{k=1}^{K} \sum_{i} \left| \mathcal{R}p_\mu(\sigma(i), \theta_k) - \mathcal{R}p_\nu(\tau(i), \theta_k) \right|$$
where $K$ is the number of uniform random samples $\theta_k$ drawn via the Monte Carlo method to approximate the integration of $\theta$ over the unit sphere in Eq. 7.
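The Monte Carlo approximation can be sketched in a few lines of NumPy (our own illustrative helper; `n_proj` plays the role of $K$, the number of random directions):

```python
import numpy as np

def sliced_w1(X, Y, n_proj=64, seed=0):
    """Monte Carlo Sliced 1-Wasserstein distance between two point clouds X, Y of
    shape (n, d): project on random unit directions, sort, and average per slice."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    total = 0.0
    for _ in range(n_proj):
        theta = rng.standard_normal(d)
        theta /= np.linalg.norm(theta)            # uniform direction on the unit sphere
        px, py = np.sort(X @ theta), np.sort(Y @ theta)
        total += np.mean(np.abs(px - py))         # 1-D closed form for this projection
    return total / n_proj

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 3))
print(sliced_w1(X, X))            # 0: identical clouds
print(sliced_w1(X, X + 2.0))      # > 0: shifted cloud
```

Each projection costs one sort, so the whole estimate stays cheap even for high-dimensional point clouds.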
Multi-view attention distillation. Let $A^{(\cdot)}$ be the square of $F^{(\cdot)}$ after being normalized by the Frobenius norm, i.e., $A = F^{\circ 2} / \|F\|_F^2$, where $\circ$ denotes the Hadamard power bocci2016hadamard. Consequently, we can consider $A$ as a discrete probability density function over its index space, where $A_{c,w,h}$ indicates the density value at the $c$-th, $w$-th, and $h$-th slice of the channel, width, and height dimensions, respectively. To avoid replicating element-wise differences, we additionally bin the projected vectors into groups before applying distillation. One important property of our multi-view attention is that different values of $\theta$ provide different attention views (slices) of $A^S$ and $A^T$. For instance, when $\theta$ aligns with the channel axis, we recover the channel-wise attention introduced by Chen et al. chen2017sca; likewise, we can produce attention vectors in the width and height dimensions when $\theta$ becomes close to those axes, respectively. With this general property, a student can pay full attention to its teacher's tensor distribution instead of a few pre-defined constant attention views.
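The axis-aligned special cases of these views can be illustrated with a small NumPy sketch (our own simplification: it only marginalizes along coordinate axes, whereas the method above slices with arbitrary random directions):

```python
import numpy as np

def attention_density(feat):
    """Square the feature tensor and normalize so it sums to 1 (a density over (c, w, h))."""
    a = feat ** 2
    return a / a.sum()

def view_marginal(density, axis):
    """One 'view' of the density: marginalize out every axis except `axis`.
    axis=0 gives the channel-wise attention vector; 1 and 2 give spatial views."""
    axes = tuple(i for i in range(density.ndim) if i != axis)
    return density.sum(axis=axes)

rng = np.random.default_rng(0)
A = attention_density(rng.standard_normal((4, 8, 8)))
ch = view_marginal(A, 0)   # shape (4,) -- channel-wise attention
wd = view_marginal(A, 1)   # shape (8,) -- width view
print(ch.shape, wd.shape, ch.sum())  # each marginal is itself a distribution
```

Random directions $\theta$ interpolate between these extremes, which is what lets the student match the teacher's distribution from many views at once rather than from one fixed attention map.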
Figure 4 pictorially illustrates our overall multi-view attention distillation, and we summarize it in Algorithm 1 in the supplementary materials. In order to encourage the semantic similarity of sample representations from the same class and discourage that of those from different classes, we further apply a contrastive loss for each instance, inspired by the CRD distillation framework of Tian et al. tian2019contrastive. Thus, our overall multi-view attention loss is defined as follows:

$$\mathcal{L}_{mv} = SW_1(A^S, A^T) + \beta_1\, SW_1(A^S, A^+) + \beta_2 \max\big(0,\, m - SW_1(A^S, A^-)\big)$$
where $A^+$ and $A^-$ are the representations of random instances belonging to the same and the opposite class of the input at the teacher, respectively, $m$ is a margin that manages the discrepancy of negative pairs, and $\beta_1$ and $\beta_2$ are scaling hyper-parameters.
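A hedged sketch of the margin mechanism described here (our own minimal form; in the actual loss the two distances would be SWD values between the student's attention tensor and same-class / opposite-class teacher representations drawn from the same mini-batch):

```python
def margin_contrastive(dist_pos, dist_neg, margin=1.0):
    """Pull positive pairs together; push negatives until they are at least
    `margin` apart, after which they stop contributing to the loss."""
    return dist_pos + max(0.0, margin - dist_neg)

print(margin_contrastive(0.1, 2.0))  # negative already far: loss = 0.1
print(margin_contrastive(0.1, 0.3))  # negative too close: extra penalty added
```

The hinge on the negative term is what makes in-batch sampling sufficient: once a negative pair is separated past the margin it no longer drags the optimization, so no cross-batch memory buffer is needed.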
The overall distillation loss in our KD framework is formulated as follows:

$$\mathcal{L} = \mathcal{L}_{cls} + \lambda_{fre}\, \mathcal{L}_{fre} + \lambda_{mv}\, \mathcal{L}_{mv}$$

where $\lambda_{fre}$ and $\lambda_{mv}$ are hyper-parameters that balance the contributions of the frequency attention distiller and the multi-view attention distiller, respectively. Our attention loss is parameter-free and independent of model architecture design, and it can be directly added to any detector model's conventional loss $\mathcal{L}_{cls}$ (e.g., cross-entropy loss). The frequency attention requires a computational complexity of $\mathcal{O}(C \cdot T_{FFT})$ for one backbone feature, where $T_{FFT}$ is the complexity of the 2-D Fast Fourier Transform applied to one channel. The average-case complexity of the multi-view attention is $\mathcal{O}(K(n + T_{1D}))$, where $T_{1D}$ is the complexity of the 1-D closed-form solution mentioned above, $K$ is the number of random samples $\theta_k$, and $n$ is the number of elements in one backbone feature, i.e., $n = C \times W \times H$. Our end-to-end Attention-based Deepfake detection Distiller (ADD) pipeline is presented in Fig. 2.
Our proposed method is evaluated on five different popular deepfake benchmark datasets: NeuralTextures thies2019deferred, Deepfakes deepfakes, Face2Face thies2016face2face, FaceSwap faceswap, and FaceShifter li2019faceshifter. Every dataset has 1,000 videos generated from 1,000 real human face videos by Rössler et al. rossler2019faceforensics++. These videos are compressed into two versions: medium compression (c23) and high compression (c40), using the H.264 codec with constant rate quantization parameters of 23 and 40, respectively. Each dataset is randomly divided into training, validation, and test sets of 720, 140, and 140 videos, respectively. We randomly select 64 frames from each video, obtaining 92,160, 17,920, and 17,920 images for the training, validation, and test sets, respectively. Then, we utilize Dlib king2009dlib to detect the largest face in every single frame and resize it to a fixed-size square image.
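The split and frame-sampling bookkeeping can be sketched as follows (hypothetical helpers; only the 720/140/140 split and the 64-frames-per-video count come from the text above):

```python
import random

def split_videos(video_ids, seed=42):
    """Randomly split 1,000 videos into train/val/test with the 720/140/140 ratio."""
    ids = list(video_ids)
    random.Random(seed).shuffle(ids)
    return ids[:720], ids[720:860], ids[860:1000]

def sample_frames(n_frames_in_video, k=64, seed=0):
    """Randomly pick k distinct frame indices from a video (hypothetical helper)."""
    return sorted(random.Random(seed).sample(range(n_frames_in_video), k))

train, val, test = split_videos(range(1000))
print(len(train), len(val), len(test))   # 720 140 140
print(len(sample_frames(300)))           # 64
```

Splitting at the video level (rather than the frame level) is what keeps frames of the same person out of both train and test.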
In our experiments, we use the Adam optimizer kingma2014adam. The learning rate follows a one-cycle learning rate schedule smith2019super with a mini-batch size of 144. In every epoch, the model is validated 10 times to save the best parameters by validation accuracy, and early stopping is applied when the validation performance does not improve for 10 consecutive validations. We use ResNet50 he2016deep as our backbone to implement the proposed distillation framework. In Eq. 2, we define $d$ as the square of the modulus of the difference between two complex numbers, i.e., $d(z_1, z_2) = |z_1 - z_2|^2$, which is non-negative, symmetric, and zero only for identical inputs. The number of binning groups is set to half the number of channels of the backbone feature. Our hyper-parameter settings are kept the same across experiments, while one hyper-parameter is fine-tuned on each dataset in the range of 16 to 23. The experiments are conducted on two TITAN RTX 24GB GPUs with an Intel Xeon Gold 6230R CPU @ 2.10GHz.
Our experimental results are presented in Table 1. We use the Accuracy score (ACC) and Recall at 1 (R@1), which are described in detail in the supplementary materials. We compare our ADD method with both distillation and non-distillation baselines. For a fair comparison between different methods, the same low input resolution mentioned above is used throughout the experiments.
Non-distillation methods. We reproduce the two highest-scoring deepfake detection benchmark methods faceforensicsbenchmark: 1) the method proposed by Rössler et al. rossler2019faceforensics++, which uses the Xception model, and 2) the approach by Dogonadze et al. dogonadze2020deep, which employs Inception ResNet V1 pre-trained on the VGGFace2 dataset cao2018vggface2. These are the two best performing publicly available deepfake detection methods (http://kaldir.vc.in.tum.de/faceforensics_benchmark/). Additionally, we evaluate F3-Net, a frequency-based deepfake detection method introduced by Qian et al. qian2020thinking, deployed on two streams of XceptionNet as described in the paper. Finally, ResNet50 he2016deep is also included as a baseline for comparison with the distillation methods.
Distillation baseline methods. As there has not been much research that deploys KD for deepfake detection, we further integrate three other well-known distillation architectures into the ResNet50 backbone for comparison: FitNet romero2014fitnets, Attention Transfer zagoruyko2016paying (AT), and Non-local wang2018non (NL). Each of these methods is fine-tuned on the validation set to achieve its best performance.
First, comparing ours with the non-distillation baselines, we observe that our method improves the detection accuracy across all five datasets for both compression types. On average, our approach outperforms the other three distillation methods and is superior on the most highly compressed (c40) datasets. The model with the FitNet loss, though it yields a small improvement, does not produce competitive results due to retaining insufficient frequency information. The attention module and non-local module also provide compelling results; however, they do not surpass our method because of their lower attention dimension and shortage of frequency information.
NeuralTextures:

| Method | ACC (c23) | R@1 (c23) | ACC (c40) | R@1 (c40) |
|---|---|---|---|---|
| Rössler et al. | 76.36 | 57.24 | 56.75 | 51.88 |
| Dogonadze et al. | 78.03 | 77.13 | 61.12 | 48.01 |
| FitNet - ResNet50 | 86.26 | 84.83 | 66.01 | 57.28 |
| AT - ResNet50 | 85.21 | 84.99 | 62.61 | 43.50 |
| NL - ResNet50 | 88.26 | 86.95 | 65.65 | 46.82 |
| ADD - ResNet50 (ours) | 88.48 | 87.53 | 68.53 | 58.42 |

Deepfakes:

| Method | ACC (c23) | R@1 (c23) | ACC (c40) | R@1 (c40) |
|---|---|---|---|---|
| Rössler et al. | 97.42 | 96.96 | 92.43 | 82.39 |
| Dogonadze et al. | 94.67 | 94.39 | 93.97 | 93.52 |
| FitNet - ResNet50 | 97.28 | 97.78 | 93.68 | 93.34 |
| AT - ResNet50 | 97.37 | 98.72 | 95.11 | 94.35 |
| NL - ResNet50 | 98.42 | 98.21 | 93.09 | 94.35 |
| ADD - ResNet50 (ours) | 98.67 | 98.09 | 95.50 | 94.59 |

Face2Face:

| Method | ACC (c23) | R@1 (c23) | ACC (c40) | R@1 (c40) |
|---|---|---|---|---|
| Rössler et al. | 91.83 | 91.02 | 80.21 | 77.42 |
| Dogonadze et al. | 89.34 | 88.73 | 83.44 | 81.00 |
| FitNet - ResNet50 | 95.91 | 96.16 | 83.48 | 78.99 |
| AT - ResNet50 | 96.80 | 96.84 | 83.55 | 78.72 |
| NL - ResNet50 | 96.44 | 96.64 | 83.69 | 82.04 |
| ADD - ResNet50 (ours) | 96.82 | 97.14 | 85.42 | 83.54 |

FaceSwap:

| Method | ACC (c23) | R@1 (c23) | ACC (c40) | R@1 (c40) |
|---|---|---|---|---|
| Rössler et al. | 95.49 | 95.36 | 88.09 | 87.67 |
| Dogonadze et al. | 93.33 | 92.78 | 90.02 | 89.10 |
| FitNet - ResNet50 | 97.29 | 96.29 | 89.16 | 90.13 |
| AT - ResNet50 | 97.66 | 97.27 | 89.75 | 90.41 |
| NL - ResNet50 | 97.34 | 96.95 | 91.86 | 90.78 |
| ADD - ResNet50 (ours) | 97.85 | 97.34 | 92.49 | 92.13 |

FaceShifter:

| Method | ACC (c23) | R@1 (c23) | ACC (c40) | R@1 (c40) |
|---|---|---|---|---|
| Rössler et al. | 93.04 | 93.16 | 89.20 | 87.12 |
| Dogonadze et al. | 89.80 | 89.36 | 82.03 | 79.96 |
| FitNet - ResNet50 | 96.63 | 95.95 | 90.16 | 89.36 |
| AT - ResNet50 | 96.32 | 96.76 | 88.28 | 89.45 |
| NL - ResNet50 | 96.24 | 95.28 | 90.04 | 87.71 |
| ADD - ResNet50 (ours) | 96.60 | 95.84 | 91.64 | 90.27 |
Effects of Attention Modules. We investigate the quantitative impact of the frequency attention and multi-view attention on the final performance. The NeuralTextures (NT) dataset has been shown to be the most difficult to differentiate by both human eyes and DNNs rossler2019faceforensics++; hence, we conduct our ablation study on the highly compressed NT (c40) dataset. The results are presented in Table 2. We observe that frequency attention alone already improves the accuracy. Multi-view attention with the contrastive loss provides a slightly better result than without it (68.14 vs. 67.01). Finally, combining the frequency attention and multi-view attention distillation with the contrastive loss improves the accuracy further, up to 68.53. The results of our ablation study demonstrate that each proposed attention distiller makes a different contribution to the student's ability to mimic the teacher, and that they are compatible when integrated together to achieve the best performance.
| Variant | ACC (NT, c40) |
|---|---|
| Our ResNet (FR) | 67.03 |
| Our ResNet (MV w/o contrastive) | 67.01 |
| Our ResNet (MV w/ contrastive) | 68.14 |
| Our ResNet (FR+MV) | 68.53 |
Sensitivity of attention weights. We conduct an experiment on the sensitivity of the frequency attention weight and the multi-view attention weight on the five different datasets; detailed results are presented in the supplementary materials. The results show that, across varying values of both weights, our method consistently outperforms the baseline results, indicating that our approach has low sensitivity to either weight.
Experiment with other backbones. Table 3 shows the results with three other backbones: ResNet18 and ResNet34 he2016deep, and EfficientNet-B0 tan2019efficientnet. We keep the hyper-parameters of the four DNNs the same as for ResNet50, except for one hyper-parameter that is changed for EfficientNet-B0. Our distilled model improves the detection accuracy on all five datasets at both compression qualities with the ResNet18, ResNet34, and EfficientNet-B0 backbones compared to their respective baselines.
Grad-CAM selvaraju2017grad. Using Grad-CAM, we provide visual explanations of the merits of training an LQ deepfake detector with our ADD framework; the gallery of Grad-CAM visualizations is included in the supplementary material. First, our ADD is able to correct the LQ detector's attention to facial artifacts so that it resembles its teacher trained on raw datasets. Second, ADD instructs the student model to neglect background noise and activate the facial areas, as its teacher does, when encountering facial images in complex backgrounds. Meanwhile, the baseline model, trained solely on LQ datasets, consistently makes wrong predictions with high confidence by activating non-facial areas and being deceived by complex backgrounds.
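For reference, the Grad-CAM map itself is simple to compute once feature maps and their gradients are extracted from a network; a minimal NumPy sketch (our own, using random tensors in place of a real backbone):

```python
import numpy as np

def grad_cam(feature_maps, grads):
    """Minimal Grad-CAM: weight each feature map by the spatial mean of its
    gradient, sum over channels, and keep only positive evidence (ReLU)."""
    # feature_maps, grads: (C, H, W) from the last conv layer for the target class.
    weights = grads.mean(axis=(1, 2))                   # one importance weight per channel
    cam = np.tensordot(weights, feature_maps, axes=1)   # weighted channel sum -> (H, W)
    cam = np.maximum(cam, 0.0)                          # ReLU: keep positive evidence
    if cam.max() > 0:
        cam /= cam.max()                                # normalize to [0, 1] for display
    return cam

rng = np.random.default_rng(0)
fmap = rng.random((8, 7, 7))
grad = rng.standard_normal((8, 7, 7))
heat = grad_cam(fmap, grad)
print(heat.shape)  # (7, 7) heatmap, upsampled onto the input image in practice
```

In the visualizations above, this heatmap is what reveals whether a detector activates facial regions or is distracted by the background.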
In this paper, we proposed the novel Attention-based Deepfake detection Distiller (ADD), exploring frequency attention distillation and multi-view attention distillation in a KD framework to detect highly compressed deepfakes. The frequency attention helps the student retrieve and focus more on high-frequency components from the teacher. The multi-view attention, inspired by the Sliced Wasserstein distance, pushes the student's output tensor distribution toward the teacher's, maintaining correlated pixel features between tensor elements from multiple views (slices). Our experiments demonstrate that our proposed method is highly effective and achieves competitive results in most cases when detecting extremely challenging, highly compressed LQ deepfakes. Our code is available at https://github.com/Leminhbinh0209/ADD.git.
This work was partly supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No.2019-0-00421, AI Graduate School Support Program (Sungkyunkwan University)), (No. 2019-0-01343, Regional strategic industry convergence security core talent training business) and the Basic Science Research Program through National Research Foundation of Korea (NRF) grant funded by Korea government MSIT (No. 2020R1C1C1006004). Also, this research was partly supported by IITP grant funded by the Korea government MSIT (No. 2021-0-00017, Original Technology Development of Artificial Intelligence Industry) and was partly supported by the Korea government MSIT, under the High-Potential Individuals Global Training Program (2020-0-01550) supervised by the IITP.
Algorithm 1 presents the pseudo-code for multi-view attention distillation between two corresponding backbone features from the student and teacher models using the Sliced Wasserstein distance (SWD). For simplicity, we formulate how each single projection contributes to the total SWD. In practice, however, uniform random vectors on the unit sphere can be sampled simultaneously by deep learning libraries, e.g., TensorFlow or PyTorch, and the projection and binning operations can be vectorized.
We describe the five different deepfake datasets used in our experiments:
NeuralTextures. Facial reenactment is an application of the Neural Textures thies2019deferred technique, which is used in video re-rendering. This approach includes learned feature maps stored on top of 3D mesh proxies, called neural textures, and a deferred neural renderer. The NeuralTextures dataset used in our experiments includes facial modifications of the mouth regions, while the other face regions remain unchanged.
DeepFakes. The DeepFakes dataset is generated using two autoencoders with a shared encoder, each trained on the source and target faces, respectively. Fake faces are generated by decoding the source face's embedding representation with the target face's decoder. Note that DeepFakes originally referred to a specific face swapping technique, but is now used to refer to AI-generated facial manipulation methods in general.
Face2Face. Face2Face thies2016face2face is a real-time facial reenactment approach, in which the target person’s expression follows the source person’s, while his/her identity is preserved. Particularly, the identity corresponding to the target face is recovered by a non-rigid model-based bundling approach on a set of key-frames that are manually selected in advance. The source face’s expression coefficients are transferred to the target, while maintaining environment lighting as well as target background.
FaceSwap. FaceSwap faceswap is a lightweight application built upon the graphics structures of the source and target faces. A 3D model is fitted to 68 facial landmarks extracted from the source face. Then, the facial region is projected back onto the target face by minimizing the pairwise landmark errors, and color correction is applied in the final step.
FaceShifter. FaceShifter li2019faceshifter is a two-stage face-swapping framework. The first stage includes an encoder-based multi-level feature extractor for the target face and a generator with Adaptive Attentional Denormalization (AAD) layers. In particular, AAD is able to blend the identity and the target features into a synthesized face. In the second stage, a novel Heuristic Error Acknowledging Refinement Network is developed to handle facial occlusions.
In our experiments, we integrate three well-known distillation losses into the teacher-student training framework for comparison with ours:
FitNet romero2014fitnets. The FitNet method introduced hint-based training, in which the student’s guided layers learn to predict the outputs of the teacher’s hint layers. We apply this hint-based learning approach to the penultimate layers of the teacher and student.
Attention Transfer zagoruyko2016paying (AT). The AT method transfers attention maps, obtained by summing spatial values across the channels of the backbone features, from the teacher to the student.
Non-local wang2018non (NL). The non-local module generates self-attention features from the student’s and teacher’s backbone features. The student’s self-attention tensors then learn to mimic the teacher’s.
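As an illustration, the AT baseline above can be sketched as follows; `attention_map` and `at_loss` are hypothetical helper names, assuming activation-based attention (channel-wise sum of squares) with L2 normalization:

```python
import numpy as np

def attention_map(feature):
    """Spatial attention map of a (C, H, W) backbone feature: sum of
    squared activations across channels, then L2-normalized."""
    amap = (feature ** 2).sum(axis=0).ravel()
    return amap / (np.linalg.norm(amap) + 1e-12)

def at_loss(teacher_feat, student_feat):
    """Attention-transfer loss: squared L2 distance between the
    teacher's and student's normalized attention maps. Sketch only."""
    diff = attention_map(teacher_feat) - attention_map(student_feat)
    return float((diff ** 2).sum())
```

Normalizing each map before comparison keeps the loss invariant to the overall activation scale of either network, which is what makes the transfer stable across architectures of different widths.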
The results in our experiments are evaluated based on the following metrics:
Accuracy (ACC). ACC is widely used to evaluate a classifier’s performance; it calculates the proportion of samples whose true classes are predicted with the highest probability. The ACC of a model tested on a set of $N$ samples is formulated as follows:
$$\mathrm{ACC} = \frac{1}{N}\sum_{i=1}^{N}\delta(\hat{y}_i, y_i),$$
where $y_i$ and $\hat{y}_i$ are the true and predicted labels of the $i$-th sample, and $\delta(\cdot,\cdot)$ is the Kronecker delta function.
Recall at $K$ ($R@K$). $R@K$ indicates the proportion of test samples that have at least one sample from the same class among their $K$ nearest neighbors in a particular feature space. A small $K$ implies small intra-class variation, which usually leads to better accuracy. With $K=1$, $R@1$ is formulated as follows:
$$R@1 = \frac{1}{N}\sum_{i=1}^{N}\delta(y_i, y_{i'}),$$
where $y_{i'}$ is the label of the nearest neighbor of the $i$-th sample, and $\delta(\cdot,\cdot)$ is the Kronecker delta function. We use the Euclidean distance to measure the distances between queried and referenced samples, whose features are the penultimate layer’s outputs, and we adopt $R@1$, which considers only the first nearest neighbor of a test sample.
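Both metrics can be computed in a few lines of numpy; the helper names below are illustrative:

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Proportion of samples whose predicted class equals the true class."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def recall_at_1(features, labels):
    """R@1: fraction of samples whose Euclidean nearest neighbor
    (excluding the sample itself) shares the same class label.
    `features` are, e.g., the penultimate layer's outputs."""
    X = np.asarray(features, dtype=float)
    y = np.asarray(labels)
    # Pairwise Euclidean distances between all samples.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)        # exclude self-matches
    nn = d.argmin(axis=1)              # index of each sample's 1-NN
    return float(np.mean(y == y[nn]))
```

For example, two tight same-class clusters yield `recall_at_1 == 1.0`, since every sample's nearest neighbor lies in its own cluster.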
We examine the sensitivity of our approach to different choices of the two attention-weight hyperparameters. As shown in Fig. 5, our approach (solid lines) is less sensitive to these hyperparameters and almost always outperforms the ResNet50 baselines (dashed lines) across all datasets. In this work, we fix one hyperparameter and fine-tune the other, but both can be further tuned to optimize performance.
Grad-CAM selvaraju2017grad is generated by back-propagating the gradients of the highest-probability class to a preceding convolutional layer’s output and producing a weighted combination with that output. The resulting localization heat map visually indicates each region’s contribution to the final class prediction. Strongly activated regions, shown in red, result from positive layer outputs and large gradient values, whereas negative outputs or small gradients produce weakly activated regions, shown in blue. In our experiments, we utilize a ResNet50 pre-trained on raw data as the teacher, our ADD-ResNet50 trained on low-quality compressed data (c40) as the student, and a ResNet50 trained on the c40 data alone without any KD method as a baseline for comparison. We provide visual explanations of the two benefits of training a low-quality deepfake detector with our ADD framework as follows:
Correctly identifying facial activation regions. Despite being trained on low-quality compressed images, the ResNet50 baseline still makes wrong predictions with high confidence, as shown in the second column of Fig. 6. The red arrows indicate the baseline’s activation regions, which differ from the teacher’s, indicated by green arrows in the third column. After training with our ADD framework, the student produces correct predictions, and its activation regions closely follow the teacher’s.
Resolving background confusion. Figure 7 presents the Grad-CAMs for the fake class of five different datasets, where the selected samples have complex backgrounds. We observe that when the teacher is trained on raw data, it produces predictions with nearly perfect confidence (around 1.00) for the fake class. When encountering highly compressed images with complex backgrounds, the ResNet50 baseline model makes wrong predictions, activating the background rather than the facial regions (red arrows in Fig. 7). On the other hand, although our ADD-ResNet50 model is also trained on low-quality compressed data, it accumulates more distilled knowledge from the teacher and is able to correctly identify and emphasize the actual facial areas, as its teacher does with raw images (green arrows in Fig. 7).
As shown in Fig. 7, we conclude that when a low-quality compressed image has a complex background, a conventional model trained without additional information (the second column) becomes more vulnerable, making incorrect predictions with high confidence and activating background regions. In contrast, with our KD framework, the student is trained under the guidance of its teacher, focuses more on the facial areas, and effectively eliminates background effects for both real and fake faces.
Interestingly, as shown in Fig. 7, the baseline classifier may be susceptible to adversarial attacks, which typically add a small amount of noise to the image. One could thus explore adversarial attacks on compressed images by adding a small amount of noise to, or perturbing, the background. Such changes would be difficult for humans to distinguish yet could easily flip the output of the baseline classifier. Hence, it would be interesting to explore adversarial attacks on complex, compressed images and videos. As a defense mechanism, KD combined with our novel attention mechanism could serve as a future framework to better detect face regions and increase robustness against complex background noise, as shown in Fig. 7.
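The weighted-combination and ReLU steps of Grad-CAM described above can be sketched as follows, assuming the layer activations and the class-score gradients have already been extracted from a convolutional layer:

```python
import numpy as np

def grad_cam(activations, gradients):
    """Compute a Grad-CAM heat map from a convolutional layer's output
    and the gradients of the target class score w.r.t. that output.
    Both inputs have shape (C, H, W). Sketch of the standard method."""
    # Global-average-pool the gradients to get per-channel weights.
    weights = gradients.mean(axis=(1, 2))                        # (C,)
    # Weighted combination of activation maps, then ReLU.
    cam = (weights[:, None, None] * activations).sum(axis=0)
    cam = np.maximum(cam, 0)
    # Normalize to [0, 1]; large values correspond to the red regions.
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam
```

The ReLU keeps only regions that push the score of the chosen class upward, which is why negative outputs or small gradients appear as the weakly activated (blue) areas in the figures.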