
Video Face Manipulation Detection Through Ensemble of CNNs

In the last few years, several techniques for facial manipulation in videos have been successfully developed and made available to the masses (i.e., FaceSwap, deepfake, etc.). These methods enable anyone to easily edit faces in video sequences with incredibly realistic results and very little effort. Despite the usefulness of these tools in many fields, if used maliciously, they can have a significantly negative impact on society (e.g., fake news spreading, cyber bullying through fake revenge porn). The ability to objectively detect whether a face has been manipulated in a video sequence is therefore a task of utmost importance. In this paper, we tackle the problem of face manipulation detection in video sequences, targeting modern facial manipulation techniques. In particular, we study the ensembling of different trained Convolutional Neural Network (CNN) models. In the proposed solution, different models are obtained starting from a base network (i.e., EfficientNetB4) by making use of two different concepts: (i) attention layers; (ii) siamese training. We show that combining these networks leads to promising face manipulation detection results on two publicly available datasets with more than 119,000 videos.



I Introduction

Over the past few years, huge steps forward have been made in the field of automatic video editing techniques. In particular, great interest has been shown towards methods for facial manipulation [39]. Just to name an example, it is nowadays possible to easily perform facial reenactment, i.e., transferring the facial expressions from one video to another [30, 29]. This makes it possible to change the identity of a speaker with very little effort.

Systems and tools for facial manipulations are now so advanced that even users without any previous experience in photo retouching and digital arts can use them. Indeed, code and libraries that work in an almost automatic fashion are more and more often made available to the public for free [10, 12]. On one hand, this technological advancement opens the door to new artistic possibilities (e.g., movie making, visual effect, visual arts, etc.). On the other hand, unfortunately, it also eases the generation of video forgeries by malicious users.

Fake news spreading and revenge porn are just a few of the possible malicious applications of advanced facial manipulation technology in the wrong hands. As the distribution of these kinds of manipulated videos indubitably leads to serious and dangerous consequences (e.g., diminished trust in media, targeted opinion formation, cyber bullying, etc.), the ability to detect whether a face has been manipulated in a video sequence is becoming of paramount importance [2].

Detecting whether a video has been modified is not a novel issue per se. Multimedia forensics researchers have been working on this topic for many years, proposing different kinds of solutions to different problems [24, 21, 27]. For instance, in [5, 33] the authors focus on studying the coding history of videos. The authors of [4, 7] focus on localizing copy-move forgeries with block-based or dense techniques. In [26, 13], different methods are proposed to detect frame duplication or deletion.

Fig. 1: Sample faces extracted from ff++ and dfdc datasets. For each pristine face, we show a corresponding fake sample generated from it.

All the above-mentioned methods work according to a common principle: each non-reversible operation leaves a peculiar footprint that can be exposed to detect the specific editing. However, forensic footprints are often very subtle and hard to detect. This is the case of videos undergoing excessive compression, multiple editing operations at once, or strong downsampling [21]. It is also the case of very realistic forgeries operated through methods that are hard to formally model. For this reason, modern facial manipulation techniques are very challenging to detect from the forensic perspective [34]. As a matter of fact, many different face manipulation techniques exist (i.e., there is no unique model explaining these forgeries). Moreover, they often operate on small video regions only (i.e., the face or part of it, and not the full frame). Finally, these kinds of manipulated videos are typically shared through social platforms that apply resizing as well as coding steps, further hindering the performance of classic forensic detectors.

In this paper, we tackle the problem of detecting facial manipulations operated through modern solutions. In particular, we focus on all the manipulation techniques reported in [25] (i.e., deepfakes, Face2Face, FaceSwap and NeuralTextures) and in the Facebook dfdc challenge started on Kaggle in December 2019 [9]. Within this context, we study the possibility of using an ensemble of different trained cnn models. We consider EfficientNetB4 [28] and propose a modified version of it, obtained by adding an attention mechanism [32]. Moreover, for each network, we investigate two different training strategies, one of which is based on the siamese paradigm.

As one of the big challenges is to be able to run a forensic detector in real-world scenarios, we develop our solution keeping computational complexity at bay. Specifically, we consider the strong hardware and time constraints imposed by the dfdc [9]. This means that the proposed solution must be able to analyze videos in less than 9 hours using at most a single NVIDIA P100 GPU. Moreover, the trained models must occupy less than 1GB of disk space.

Evaluation is performed on two disjoint datasets: ff++ [25], which has been recently proposed as a public benchmark; dfdc [9], which has been released as part of the dfdc Kaggle competition. Fig. 1 depicts a few examples of faces extracted from the two datasets, reporting pristine and manipulated samples. Results show that the proposed attention-based modification as well as the siamese training strategy help the ensemble system in outperforming the baseline reported in ff++ on both datasets. Moreover, the proposed attention-based solution provides interesting insights on which part of each frame drives face manipulation detection, thus enabling a small step forward towards the explainability of the network results.

The rest of the paper is structured as follows. Section II reports a literature review of the latest related work. Section III reports all the details about the proposed method. Section IV details the experimental setup. Section V collects all the achieved results. Finally, Section VI concludes the paper.

II Related work

Multiple video forensics techniques have been proposed for a variety of tasks in the last few years [24, 21, 27]. Moreover, since the forensics community has become aware of the potential social risks introduced by the latest facial manipulation techniques, many detection algorithms have been proposed to detect these kinds of forgeries [34].

Some of the proposed techniques focus on a cnn-based frame-by-frame analysis. For instance, MesoNet is proposed in [1]. This is a relatively shallow cnn with the goal of detecting fake faces. The authors of [25] have shown that this network is outperformed by XceptionNet retrained on purpose.

Alternative techniques exploit also the temporal evolution of video frames through lstm analysis. This is the case of [17] and [14], which first extract a series of frame-based features, and then put them together with a recurrent mechanism.

Other methods leverage specific processing traces. This is the case of [19], where the authors exploit the fact that deepfake donor faces are warped in order to realistically stick to the host video. They therefore propose a detector that captures warping traces.

In order to overcome the limitation of pixel analysis, other techniques are based on a semantic analysis of the frames. In [37], a technique that learns to distinguish natural and fake head pose is proposed. Conversely, the authors of [20] focus on inconsistent lighting effects. Alternatively, [18] reports a methodology based on eye blinking analysis. Indeed, the first generation of deepfake videos was showing some eye artifacts that could be captured with this method. Unfortunately, the more the manipulation techniques produce realistic results, the less semantic methods work.

Finally, other techniques provide additional localization information. The authors of [22] propose a multi-task learning method that provides a detection score together with a segmentation mask. Alternatively, in [8], an attention mechanism is proposed.

Inspired by the state of the art, in this paper we focus on network ensembles, proposing a solution that works on multiple datasets and is sufficiently lightweight according to dfdc competition rules [9].

III Proposed method

In this section, we describe our proposed method for video face manipulation detection, i.e., given a video frame, to detect whether faces are real (pristine) or fake.

The proposed method is based on the concept of ensembling. Indeed, it is well-known that model ensembling may lead to better prediction performance. We therefore focus on investigating whether and how it is possible to train different cnn-based classifiers to capture different high-level semantic information that complement one another, thus positively contributing to the ensemble for this specific problem.

To do so, we consider as a starting point the EfficientNet family of models, proposed in [28] as a novel approach for the automatic scaling of cnns. This set of architectures achieves better accuracy and efficiency with respect to other state-of-the-art cnns, and actually proved to be very useful for fulfilling the hardware and time constraints imposed by dfdc. Given an EfficientNet architecture, we propose to follow two paths to make the model beneficial for the ensembling. On one hand, we propose to include an attention mechanism, which also provides the analyst with a method to infer which portion of the investigated video is more informative for the classification process. On the other hand, we investigate how siamese training strategies can be included into the learning process to extract additional information from the data.

In the following, more details are provided about EfficientNet architecture with the proposed attention mechanism and the network training strategies.

III-A EfficientNet and attention mechanism

Among the family of EfficientNet models, we choose EfficientNetB4 as the baseline for our work, motivated by the good trade-off offered by this architecture in terms of dimensions (i.e., number of parameters), run time (i.e., FLOPS cost) and classification performance. As reported in [28], with 19 million parameters and 4.2 billion FLOPS, EfficientNetB4 reaches 83.8% top-1 accuracy on the ImageNet [11] dataset. On the same dataset, XceptionNet, used as the face manipulation detection baseline by the authors of [25], reaches 79% top-1 accuracy at the expense of 23 million parameters and 8.4 billion FLOPS.

EfficientNetB4 architecture is represented within the blue block in Fig. 2, where all layers are defined using the same nomenclature introduced in [28].

The input to the network is a square color image, i.e., in our experiments, the face extracted from a video frame. As a matter of fact, the authors of [25] recommend tracking face information instead of using the full frame as input to the network in order to increase the classification accuracy. Moreover, faces can be easily extracted from frames using any of the widely available face detectors proposed in the literature [38, 3]. The network output is a feature vector, and the final score related to the face is the result of a classification layer.

The proposed variant of the standard EfficientNetB4 architecture is inspired by the several contributions in the natural language processing and computer vision fields that make use of attention mechanisms. Works such as the transformer [32] and residual attention networks [35] show how it is possible for a neural network to learn which part of its input (be it an image or a sequence of words) is more relevant for accomplishing the task at hand. In the context of video deepfake detection, it would be of great benefit to discover which portion of the input gave the network more information for its decision making process. We thus explicitly implement an attention mechanism similar to the one already exploited by the EfficientNet itself, as well as to the self-attention mechanisms presented in [15, 8]:

  1. we select the feature maps extracted by the EfficientNetB4 up to a certain layer, chosen such that these features provide sufficient information on the input frame without being too detailed or, on the contrary, too coarse. To this purpose, we select the output features at the third MBConv block;

  2. we process the feature maps with a single convolutional layer with kernel size 1 followed by a Sigmoid activation function to obtain a single attention map;

  3. we multiply the attention map by each of the feature maps at the selected layer.

For clarity’s sake, the attention-based module is depicted in the red block of Fig. 2.

On one hand, this simple mechanism enables the network to focus only on the most relevant portions of the feature maps; on the other hand, it provides us with a deeper insight into which parts of the input the network considers the most informative. Indeed, the obtained attention map can be easily mapped to the input sample, highlighting which elements of it have been given more importance by the network. The result of the attention block is finally processed by the remaining layers of EfficientNetB4. The whole training procedure can be executed end-to-end, and we call the resulting network EfficientNetB4Att.
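The three steps above can be sketched numerically. The following toy example (in NumPy, with random weights standing in for the learned 1x1 convolution, and illustrative tensor sizes) shows how a single-channel attention map is obtained with a Sigmoid and broadcast-multiplied over every feature channel:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_block(features, w, b):
    """features: (C, H, W) feature maps from an intermediate conv block.
    w: (C,) weights of a 1x1 convolution collapsing C channels to 1; b: scalar bias.
    Returns (attended_features, attention_map)."""
    # 1x1 convolution across channels -> single-channel map of shape (H, W)
    att = sigmoid(np.tensordot(w, features, axes=([0], [0])) + b)
    # broadcast-multiply the attention map over every feature channel
    return features * att[None, :, :], att

C, H, W = 8, 4, 4  # illustrative sizes, not the real layer dimensions
feats = np.random.default_rng(0).normal(size=(C, H, W))
w = np.random.default_rng(1).normal(size=(C,))
out, att = attention_block(feats, w, 0.0)
print(out.shape, att.shape)  # → (8, 4, 4) (4, 4)
```

Since the Sigmoid bounds the map in (0, 1), low-attention regions are attenuated while high-attention ones pass through almost unchanged, and the map itself can be up-scaled and overlaid on the input face for inspection.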

Fig. 2: Blue block: EfficientNetB4 model. If the red block is embedded into the network, an attention mechanism is included in the model, defining the proposed EfficientNetB4Att architecture.

III-B Network training

We train each model according to two different training paradigms: (i) end-to-end, and (ii) siamese. The former represents a more classical training strategy, also matching the evaluation metric used in the dfdc contest. The latter aims at exploiting the generalization capabilities offered by the networks in order to obtain a feature descriptor that privileges the similarity between samples belonging to the same class. The ultimate goal is to learn a representation in the encoding space of the network's layers that well separates samples (i.e., faces) of the real and fake classes.

III-B1 End-to-end training

We feed the network with a sample face, and the network returns a face-related score $\hat{y}$. Notice that this score is not yet passed through a Sigmoid activation function. The weights update is driven by the commonly used LogLoss function

$$\mathrm{LogLoss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log\left(\sigma(\hat{y}_i)\right) + (1 - y_i) \log\left(1 - \sigma(\hat{y}_i)\right) \right],$$

where $\hat{y}_i$ represents the $i$-th face score and $y_i$ the related face label. Specifically, label $y = 0$ is associated with faces coming from real pristine videos and label $y = 1$ with fake videos. $N$ is the total number of faces used for training and $\sigma(\cdot)$ is the Sigmoid function.
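The end-to-end training objective can be checked numerically. This minimal NumPy sketch (not the paper's training code) applies the Sigmoid to raw scores and evaluates the binary LogLoss, assuming label 0 for real and 1 for fake faces:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def log_loss(scores, labels):
    """scores: raw (pre-Sigmoid) network outputs; labels: 0 = real, 1 = fake."""
    p = sigmoid(np.asarray(scores, dtype=float))
    y = np.asarray(labels, dtype=float)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

# A confident, correct classifier yields a low loss...
low = log_loss([-4.0, 5.0], [0, 1])
# ...while a confident but wrong one is heavily penalized.
high = log_loss([5.0, -4.0], [0, 1])
print(low < high)  # → True
```

This penalization of confident mistakes is exactly why the dfdc competition adopted LogLoss as its evaluation metric.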

III-B2 Siamese training

Inspired by computer vision works that generate local feature descriptors using cnns, we adopt the triplet margin loss, first proposed in [36]. Recalling that $\mathcal{F}(x)$ is the non-linear encoding obtained by the network for an input face $x$ (see Fig. 2), and $\|\cdot\|_2$ being the norm, the triplet margin loss is defined as

$$\mathcal{L}(x_a, x_p, x_n) = \max\left\{ \left\|\mathcal{F}(x_a) - \mathcal{F}(x_p)\right\|_2 - \left\|\mathcal{F}(x_a) - \mathcal{F}(x_n)\right\|_2 + \mu,\; 0 \right\},$$

where $\mu$ is a strictly positive margin. In this case $x_a$, $x_p$ and $x_n$ are, respectively:

  • the anchor sample (i.e., a real face);

  • a positive sample, belonging to the same class as the anchor (i.e., another real face);

  • a negative sample, belonging to a different class than the anchor (i.e., a fake face).

We then finalize the training by fine-tuning a simple classification layer on top of the network, following the end-to-end approach described before.
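The triplet margin loss above can be illustrated with a minimal NumPy sketch; the 2-D encodings and the margin value are illustrative, not the paper's settings:

```python
import numpy as np

def triplet_margin_loss(f_a, f_p, f_n, margin=1.0):
    """Triplet margin loss on encodings: pull the anchor towards the positive
    and push it away from the negative by at least `margin`."""
    d_ap = np.linalg.norm(f_a - f_p)  # anchor-positive distance
    d_an = np.linalg.norm(f_a - f_n)  # anchor-negative distance
    return max(d_ap - d_an + margin, 0.0)

anchor   = np.array([0.0, 0.0])   # encoding of a real face
positive = np.array([0.1, 0.0])   # another real face, close by
negative = np.array([3.0, 0.0])   # a fake face, far away

# Well-separated triplet: the margin is already satisfied, loss is zero.
print(triplet_margin_loss(anchor, positive, negative))  # → 0.0
```

When the negative falls closer to the anchor than the positive, the loss becomes positive and its gradient reshapes the encoding space, which is exactly the class-separating behavior observed later in the t-SNE projections.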

IV Experiments

In this section we report all the details regarding the used datasets and experimental setup.

IV-A Dataset

We test the proposed method on two different datasets: ff++ [25]; dfdc [9].

ff++ is a large-scale facial manipulation dataset generated using automated state-of-the-art video editing methods. In detail, two classical computer graphics approaches are used, i.e., Face2Face [30] and FaceSwap [12], together with two learning-based strategies, i.e., DeepFakes [10] and NeuralTextures [29]. Every method is applied to high quality pristine videos downloaded from YouTube, manually selected to present nearly front-facing subjects without occlusions. All the sequences are at least a few hundred frames long. Eventually, a database of over a million images from manipulated videos is built. In order to simulate a realistic setting, videos are compressed using the H.264 codec; high quality and low quality versions are generated using two different constant rate quantization parameters.

dfdc is the training dataset released for the homologous Kaggle challenge. It is composed of more than 119,000 video sequences, created specifically for this challenge, representing both real and fake videos. The real videos are sequences of actors, selected taking into account diversity along several axes (gender, skin tone, age, etc.) and recorded with arbitrary backgrounds to bring visual variability. The fake videos are created starting from the real ones by applying different DeepFake techniques, e.g., different face swap algorithms. Notice that we do not know the precise algorithms used to generate the fake videos, since for the time being the complete dataset (i.e., with the public and private testing sequences and possibly an explanation of the creation procedure) has not been released yet. The classes are strongly unbalanced towards the fake one.

IV-B Networks

In our experiments, we consider the following networks:

  • XceptionNet, since it is the best performing model used in [25], thus being the natural yardstick for our experimental campaign;

  • EfficientNetB4, as it achieves better accuracy and efficiency than other existing methods [28];

  • EfficientNetB4Att, which should discriminate relevant parts of the face sample from irrelevant ones.

Each model is trained and tested separately over both the considered datasets. Specifically, regarding ff++, we consider only videos compressed at a single constant rate quantization parameter. XceptionNet is trained using the same approach of [25], whereas the two EfficientNet models are trained following the end-to-end as well as the siamese fashion described in Section III-B. In doing so, we end up with four trained models: EfficientNetB4 and EfficientNetB4Att, which are trained with the classical end-to-end approach, together with EfficientNetB4ST and EfficientNetB4AttST, trained using the siamese strategy. All these EfficientNetB4-derived models can contribute to the final ensembling.

IV-C Setup

We adopt a different split policy for each dataset. We split dfdc according to its folder structure, using the first folders for training, the following ones for validation and the last ones for testing. Regarding ff++, we use a similar split as in [25], dividing the pool of original sequences taken from YouTube into training, validation and test videos. The corresponding fake videos are assigned to the same split as their originals. All the results are shown on the test sets.

Fig. 3: Training and validation loss curves for XceptionNet on ff++, while varying the number of frames per video (FPV).

In our experiments, we only consider a limited number of frames for each video. In the training phase, this choice is motivated by two main considerations: (i) when using a really small amount of frames per video, there is a strong tendency to overfit; (ii) increasing the number of frames does not improve performance in a justifiable manner. This phenomenon can be noticed in Fig. 3, which reports training and validation losses as a function of training iterations for a variable amount of frames per video. It is worth noting that the minimum validation loss does not improve when further increasing the number of frames per video, while a moderate number of frames per video helps to prevent overfitting. For testing, we should also take into account the hardware and time constraints imposed by the dfdc challenge. With this in mind, we limit the number of analyzed frames from each sequence in both the training and testing phases. Even in this setting, the dimensions of the datasets remain remarkable, amounting to millions of face images overall.

From this perspective, we can further reduce the amount of data processed by the networks by recalling that not all the frame information is useful for the deepfake detection process [25]. Indeed, we can mainly focus our analysis on the region where the face of the subject is located. Consequently, as a pre-processing step, we extract from each frame the faces of the scene subjects using the BlazeFace extractor [3], which, in our experiments, proved to be faster than the MTCNN detector [38] used by the authors of [25]. In case more than one face is detected, we keep the face with the highest confidence score. The resulting input for the networks is the square color image introduced in Section III.
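The best-confidence selection and square cropping can be sketched as follows. This is a simplified stand-in: the `(x, y, w, h, confidence)` box format, the 224-pixel target size and the nearest-neighbour resize are illustrative assumptions, not the detector's actual API or the paper's exact settings:

```python
import numpy as np

def crop_best_face(frame, detections, size=224):
    """frame: (H, W, 3) image; detections: list of (x, y, w, h, confidence)
    boxes as a face detector such as BlazeFace might return (format assumed
    here for illustration). Keeps the highest-confidence face and returns a
    square crop resized to `size` with nearest-neighbour sampling."""
    x, y, w, h, _ = max(detections, key=lambda d: d[4])
    side = max(w, h)  # enforce a square crop around the detection
    crop = frame[y:y + side, x:x + side]
    # nearest-neighbour resize to the fixed network input size
    ys = (np.arange(size) * crop.shape[0] // size).clip(0, crop.shape[0] - 1)
    xs = (np.arange(size) * crop.shape[1] // size).clip(0, crop.shape[1] - 1)
    return crop[ys][:, xs]

frame = np.zeros((480, 640, 3), dtype=np.uint8)
faces = [(10, 10, 64, 64, 0.70), (200, 100, 96, 96, 0.95)]
face = crop_best_face(frame, faces)
print(face.shape)  # → (224, 224, 3)
```

Restricting the networks to such face crops both discards uninformative background pixels and keeps the per-video processing cost within the dfdc hardware budget.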

During training and validation, to make our models more robust, we perform data augmentation operations on the input faces. In particular, we randomly apply downscaling, horizontal flipping, brightness and contrast changes, hue and saturation shifts, noise addition and finally JPEG compression. Specifically, we resort to Albumentations [6] as our data augmentation library, while we use PyTorch [23] as deep learning framework. We train the models using the Adam optimizer [16].
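A few of the augmentations listed above can be sketched in plain NumPy; this is a simplified stand-in for the Albumentations pipeline, with illustrative probabilities and parameter ranges rather than the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(face):
    """Apply a random subset of simple augmentations to a float image
    in [0, 1] (a toy stand-in for the Albumentations pipeline)."""
    out = face.copy()
    if rng.random() < 0.5:  # horizontal flip
        out = out[:, ::-1]
    if rng.random() < 0.5:  # brightness / contrast jitter
        out = np.clip(out * rng.uniform(0.8, 1.2) + rng.uniform(-0.1, 0.1), 0, 1)
    if rng.random() < 0.5:  # additive Gaussian noise
        out = np.clip(out + rng.normal(0.0, 0.02, out.shape), 0, 1)
    return out

face = rng.uniform(size=(32, 32, 3))
aug = augment(face)
print(aug.shape)  # → (32, 32, 3)
```

Simulating flips, photometric jitter, noise and compression at training time makes the detectors less sensitive to the resizing and re-encoding that social platforms apply when videos are shared.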

Independently of the training strategy used, given the size of the datasets, we never train our networks for a complete epoch. Specifically:

  • for the end-to-end training, we either train for a fixed maximum number of iterations, indicating as iteration the processing of a batch of faces balanced between real and fake samples, taken randomly and evenly across all the videos of the train split, or until reaching a plateau on the validation loss. Validation of the model in this context is performed at regular intervals, on samples taken again evenly and randomly across all videos of the validation set. The initial learning rate is reduced if the validation loss does not decrease after a certain number of validation routines, and the training is stopped when we reach a minimum learning rate;

  • for the siamese training, the feature extractor is trained using the same number of iterations, validation routine and learning rate scheduling of the end-to-end training. The main difference lies in the loss function used (as explained in Section III) and in the composition of the batch, which in this case is made of triplets of samples (6 real-real-fake, 6 fake-fake-real) selected across all videos of the considered set. Regarding the margin parameter in (2), we set it after some preliminary experiments. The fine-tuning of the classification layer is then executed in a successive step following the end-to-end training paradigm with the hyperparameters specified above.

We finally run our experiments on a machine equipped with an Intel Xeon E5-2687W-v4 and a NVIDIA Titan V. The code to replicate our tests is freely available online.

V Results

In this section we collect all the results obtained during our experimental campaign.

Fig. 4: Effect of the attention on faces under analysis. Given some faces to analyze (top row), the attention network tends to select regions like eyes, mouth and nose (bottom row). Faces have been extracted from FF++ dataset.

V-A EfficientNetB4Att explainability

In order to show the effectiveness of the attention mechanism in extracting the most informative content of faces, we evaluate the attention map computed on a few faces of ff++. Referring to Fig. 2, we select the output of the Sigmoid layer in the attention block, which is a 2D map. Then, we up-scale it to the input face size and superimpose it on the input face. Results are reported in Fig. 4. It is worth noting that this simple attention mechanism is able to highlight the most detailed portions of faces, e.g., eyes, mouth, nose and ears. On the contrary, flat regions (where gradients are small) are not informative for the network. As a matter of fact, it has been shown several times that artifacts of deepfake generation methods are mostly localized around facial features [34]. For instance, roughly modeled eyes and teeth, showing excessively white regions, are still the main trademarks of these methods.

V-B Siamese features

In order to understand whether the features produced by the encoding of the network trained in siamese fashion are discriminative for the task, we computed a projection onto a reduced space using the well-known t-SNE algorithm [31]. In Fig. 5 we show the projection obtained by means of EfficientNetB4Att starting from ff++ videos. We can clearly see how frames of the same video cluster into small sub-regions. More importantly, all the real samples cluster into the top region of the chart, whereas the fake samples lie in the bottom region. This justifies the choice to adopt this particular training paradigm in addition to the classical end-to-end approach.

V-C Architecture independence

As we want to understand whether the different networks can be used in an ensemble, we explore whether the scores extracted by each model are independent to some extent.

In Fig. 6, all plots outside of the main diagonal show that different networks provide slightly different scores for each frame. Indeed, the point clouds do not perfectly align on a shape that can be easily described by a simple relation. This motivates us to use the different trained models in an ensemble; if all networks were perfectly correlated, this would not be reasonable.

V-D Face manipulation detection capability

Fig. 5: t-SNE visualization of features obtained by EfficientNetB4Att with siamese training. Faces have been extracted from FF++ dataset.
Networks in ensemble (XceptionNet | EfficientNetB4 | B4ST | B4Att | B4AttST) | AUC ff++ | AUC dfdc | LogLoss ff++ | LogLoss dfdc
0.9273 0.8784 0.3844 0.4897
0.9382 0.8766 0.3777 0.4819
0.9337 0.8658 0.3439 0.5075
0.9360 0.8642 0.3873 0.5133
0.9293 0.8360 0.3597 0.5507
0.9413 0.8800 0.3411 0.4687
0.9428 0.8785 0.3566 0.4731
0.9421 0.8729 0.3370 0.4739
0.9423 0.8760 0.3371 0.4770
0.9393 0.8642 0.3289 0.4977
0.9390 0.8625 0.3515 0.4997
0.9441 0.8813 0.3371 0.4640
0.9432 0.8769 0.3269 0.4684
0.9433 0.8751 0.3399 0.4717
0.9426 0.8719 0.3304 0.4800
0.9444 0.8782 0.3294 0.4658
TABLE I: auc and LogLoss obtained with different network combinations over all the datasets. Top-3 results per column in bold, baseline in italics.

In this section, we report the average results achieved by the baseline network (i.e., XceptionNet) and the proposed models (i.e., EfficientNetB4, EfficientNetB4Att, EfficientNetB4ST and EfficientNetB4AttST). We also verify our intuition behind the use of an ensemble, specifically combining two, three or even all the proposed models. In this case, the final score associated with a face is simply computed as the average of the scores returned by the single models.
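The score-averaging fusion can be sketched in a few lines; the raw scores below are hypothetical values for a single face, not actual model outputs:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ensemble_score(model_scores):
    """model_scores: raw per-model scores for one face (one entry per network).
    The ensemble prediction is simply the average of the individual scores."""
    return float(np.mean(model_scores))

# Hypothetical raw scores from four trained models for a single face
scores = [2.1, 1.4, 3.0, 0.8]
fused = ensemble_score(scores)
print(sigmoid(fused) > 0.5)  # → True: the fused score maps to a fake probability above 0.5
```

Because the individual models are only partially correlated, averaging their scores tends to cancel uncorrelated errors, which is the effect measured by the AUC and LogLoss improvements in Table I.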

In Table I we report the auc (computed by binarizing the network output with different thresholds) and the LogLoss obtained in our experiments. Results are provided in a per-frame fashion.

Analyzing these results, it is worth noting that the strategy of model ensembling generally pays off in terms of performance. As somehow expected, the best top-3 results are always reached by a combination of two or more networks, meaning that network fusion helps both the accuracy of deepfake detection (estimated by means of auc) and the quality of the detection (estimated by means of the LogLoss measure). Indeed, on both datasets, LogLoss and AUC are always better than the baseline.

(a) ff++
(b) dfdc
Fig. 6: Pair-plot showing the score distribution for real (orange ) and fake (blue ) samples for each pair of networks on ff++ (a) and dfdc (b) datasets.

V-E Kaggle results

In order to gain a deeper insight into the proposed solution's performance, we also participated in the dfdc challenge on Kaggle [9] as the ISPL team. The ultimate goal of the competition was to build a system able to tell whether a video is real or fake. The dfdc dataset used in this paper represents the training dataset released by the competition host, while the evaluation is performed over two different testing datasets: (i) the public test dataset; (ii) the private test dataset. Participants were not aware of the composition of those datasets (e.g., the provenance of the sequences, the techniques used for generating fakes, etc.), apart from the rough number of videos in the public test set. The final solution proposed by our team was an ensemble of the proposed models, which led us to rank among the top positions of the leaderboard computed against the public test set. For the time being, the leaderboard computed over the private test set has not been disclosed yet.

VI Conclusions

Being able to detect whether a video contains manipulated content is nowadays of paramount importance, given the significant impact of videos in everyday life and in mass communications. In this vein, we tackle the detection of facial manipulation in video sequences, targeting classical computer graphics as well as deep learning generated fake videos.

The proposed method takes inspiration from the family of EfficientNet models and improves upon a recently proposed solution, investigating an ensemble of models trained using two main concepts: (i) an attention mechanism which generates a human comprehensible inference of the model, increasing the learning capability of the network at the same time; (ii) a triplet siamese training strategy which extracts deep features from data to achieve better classification performances.

Results evaluated over two publicly available datasets containing more than 119,000 videos reveal the proposed ensemble strategy as a valid solution for the goal of facial manipulation detection.

Future work will be devoted to the embedding of temporal information. As a matter of fact, intelligent voting schemes when more frames are analyzed at once might lead to an increased accuracy.


  • [1] D. Afchar, V. Nozick, J. Yamagishi, and I. Echizen (2018) MesoNet: a compact facial video forgery detection network. In IEEE International Workshop on Information Forensics and Security (WIFS), Cited by: §II.
  • [2] S. Agarwal, H. Farid, Y. Gu, M. He, K. Nagano, and H. Li (2019) Protecting world leaders against deep fakes. In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Cited by: §I.
  • [3] V. Bazarevsky, Y. Kartynnik, A. Vakunov, K. Raveendran, and M. Grundmann (2019) BlazeFace: sub-millisecond neural face detection on mobile GPUs. CoRR abs/1907.05047. External Links: Link, 1907.05047 Cited by: §III-A, §IV-C.
  • [4] P. Bestagini, S. Milani, M. Tagliasacchi, and S. Tubaro (2013) Local tampering detection in video sequences. In IEEE International Workshop on Multimedia Signal Processing (MMSP), Cited by: §I.
  • [5] P. Bestagini, S. Milani, M. Tagliasacchi, and S. Tubaro (2016) Codec and GOP identification in double compressed videos. IEEE Transactions on Image Processing (TIP) 25, pp. 2298–2310. External Links: Document Cited by: §I.
  • [6] A. Buslaev and A. A. Kalinin (2018) Albumentations: fast and flexible image augmentations. ArXiv e-prints. External Links: 1809.06839 Cited by: §IV-C.
  • [7] L. D’Amiano, D. Cozzolino, G. Poggi, and L. Verdoliva (2019) A patchmatch-based dense-field algorithm for video copy–move detection and localization. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) 29, pp. 669–682. Cited by: §I.
  • [8] H. Dang, F. Liu, J. Stehouwer, X. Liu, and A. Jain (2019) On the detection of digital face manipulation. External Links: 1910.01717 Cited by: §II, §III-A.
  • [9] (2019) Deepfake Detection Challenge (DFDC). Cited by: §I, §II, §IV-A, §V-E.
  • [10] Deepfakes GitHub. Cited by: §I, §IV-A.
  • [11] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei (2009) ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, Cited by: §III-A.
  • [12] FaceSwap. Cited by: §I, §IV-A.
  • [13] A. Gironi, M. Fontani, T. Bianchi, A. Piva, and M. Barni (2014) A video forensic technique for detecting frame deletion and insertion. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6226–6230. Cited by: §I.
  • [14] D. Güera and E. J. Delp (2019) Deepfake video detection using recurrent neural networks. In IEEE International Conference on Advanced Video and Signal-Based Surveillance (AVSS). External Links: Document, ISBN 9781538692943 Cited by: §II.
  • [15] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, pp. 7132–7141. Cited by: §III-A.
  • [16] D. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv:1412.6980. Cited by: §IV-C.
  • [17] P. Korshunov and S. Marcel (2018) DeepFakes: a new threat to face recognition? assessment and detection. CoRR abs/1812.08685. External Links: 1812.08685 Cited by: §II.
  • [18] Y. Li, M. Chang, and S. Lyu (2018) In ictu oculi: exposing AI created fake videos by detecting eye blinking. In IEEE International Workshop on Information Forensics and Security (WIFS), External Links: Document, ISSN 2157-4774 Cited by: §II.
  • [19] Y. Li and S. Lyu (2019) Exposing deepfake videos by detecting face warping artifacts. In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Cited by: §II.
  • [20] F. Matern, C. Riess, and M. Stamminger (2019) Exploiting visual artifacts to expose deepfakes and face manipulations. In IEEE Winter Applications of Computer Vision Workshops (WACVW), External Links: Document Cited by: §II.
  • [21] S. Milani, M. Fontani, P. Bestagini, M. Barni, A. Piva, M. Tagliasacchi, and S. Tubaro (2012) An overview on video forensics. APSIPA Transactions on Signal and Information Processing 1, pp. e2. External Links: Document Cited by: §I, §I, §II.
  • [22] H. H. Nguyen, F. Fang, J. Yamagishi, and I. Echizen (2019) Multi-task learning for detecting and segmenting manipulated facial images and videos. CoRR abs/1906.06876. External Links: 1906.06876 Cited by: §II.
  • [23] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 8024–8035. External Links: Link Cited by: §IV-C.
  • [24] A. Rocha, W. Scheirer, T. Boult, and S. Goldenstein (2011) Vision of the unseen: current trends and challenges in digital image and video forensics. ACM Computing Surveys 43 (26), pp. 1–42. Cited by: §I, §II.
  • [25] A. Rössler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner (2019) FaceForensics++: learning to detect manipulated facial images. In International Conference on Computer Vision (ICCV), Cited by: §I, §II, §III-A, 1st item, §IV-A, §IV-B, §IV-C.
  • [26] M. C. Stamm, W. S. Lin, and K. J. R. Liu (2012) Temporal forensics and anti-forensics for motion compensated video. IEEE Transactions on Information Forensics and Security (TIFS) 7, pp. 1315–1329. Cited by: §I.
  • [27] M. C. Stamm, Min Wu, and K. J. R. Liu (2013) Information forensics: an overview of the first decade. IEEE Access 1, pp. 167–200. Cited by: §I, §II.
  • [28] M. Tan and Q. V. Le (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research, Vol. 97, pp. 6105–6114. Cited by: §I, §III, §III-A, 2nd item.
  • [29] J. Thies, M. Zollhöfer, and M. Nießner (2019) Deferred neural rendering: image synthesis using neural textures. ACM Transactions on Graphics (TOG) 38 (4), pp. 1–12. Cited by: §I, §IV-A.
  • [30] J. Thies, M. Zollhöfer, M. Stamminger, C. Theobalt, and M. Nießner (2016) Face2Face: real-time face capture and reenactment of RGB videos. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2387–2395. Cited by: §I, §IV-A.
  • [31] L. van der Maaten and G. Hinton (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9, pp. 2579–2605. External Links: Link Cited by: §V-B.
  • [32] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems (NIPS), I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 5998–6008. Cited by: §I, §III-A.
  • [33] D. Vázquez-Padín, M. Fontani, D. Shullani, F. Pérez-González, A. Piva, and M. Barni. Video integrity verification and GOP size estimation via generalized variation of prediction footprint. IEEE Transactions on Information Forensics and Security (TIFS) 15, pp. 1815–1830. Cited by: §I.
  • [34] L. Verdoliva (2020) Media forensics and deepfakes: an overview. External Links: 2001.06564 Cited by: §I, §II, §V-A.
  • [35] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang (2017-07) Residual attention network for image classification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §III-A.
  • [36] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu (2014) Learning fine-grained image similarity with deep ranking. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1386–1393. Cited by: §III-B2.
  • [37] X. Yang, Y. Li, and S. Lyu (2019) Exposing deep fakes using inconsistent head poses. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Cited by: §II.
  • [38] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao (2016) Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters 23 (10), pp. 1499–1503. Cited by: §III-A, §IV-C.
  • [39] M. Zollhöfer, J. Thies, P. Garrido, D. Bradley, T. Beeler, P. Pérez, M. Stamminger, M. Nießner, and C. Theobalt (2018) State of the art on monocular 3d face reconstruction, tracking, and applications. Computer Graphics Forum 37, pp. 523–550. External Links: Document Cited by: §I.