Emotions Don't Lie: A Deepfake Detection Method using Audio-Visual Affective Cues

03/14/2020 ∙ by Trisha Mittal, et al.

We present a learning-based multimodal method for detecting real and deepfake videos. To maximize the information available for learning, we extract and analyze the similarity between the audio and visual modalities from within the same video. Additionally, we extract and compare affective cues corresponding to perceived emotion from the two modalities within a video to infer whether the input video is "real" or "fake". We propose a deep learning network inspired by the Siamese network architecture and the triplet loss. To validate our model, we report the AUC metric on two large-scale audio-visual deepfake detection datasets, DeepFake-TIMIT and DFDC. We compare our approach with several SOTA deepfake detection methods and report per-video AUC of 84.4% on the DFDC and 96.6% on the DF-TIMIT datasets, respectively.




1 Introduction

Recent advances in computer vision and deep learning techniques have enabled the creation of sophisticated and compelling forged versions of social media images and videos (also known as "deepfakes") [11, 12, 15, 14, 16]. Due to the surge in AI-synthesized deepfake content, multiple attempts have been made to release benchmark datasets [25, 36, 17, 9] and algorithms [50, 1, 44, 28, 30, 36, 32, 33, 37, 18] for deepfake detection. Deepfake detection methods classify an input video or image as "real" or "fake". Prior methods exploit only a single modality, i.e., only the facial cues from these videos, either by employing temporal features or by exploring the visual artifacts within frames. Many other video processing applications use and combine multiple modalities, e.g., audio-visual speech recognition [21], emotion recognition [31, 49], and language and vision tasks [22, 5]. These applications show that combining multiple modalities can provide complementary information and lead to stronger inferences. Even for detecting deepfake content, we can extract many such modalities, like facial cues, speech cues, background context, hand gestures, and body posture and orientation, from a video. When combined, multiple cues or modalities can be used to detect whether a given video is real or fake.

In this paper, the key idea used for deepfake detection is to exploit the relationship between the visual and audio modalities extracted from the same video. Prior studies in both the psychology literature and the multimodal machine learning literature [3] have shown evidence of a strong correlation between different modalities of the same subject [41]. More specifically, [51, 38, 35, 40] suggest a positive correlation between the audio-visual modalities, which has been exploited for multimodal emotion recognition. For instance, [2, 19] suggest that when different modalities are modeled and projected into a common space, they should point to similar affective cues. Affective cues are specific features that convey rich emotional and behavioral information to human observers and help them distinguish between different perceived emotions [43]. These affective cues comprise various positional and movement features, such as dilation of the eyes, raised eyebrows, and the volume, pace, and tone of the voice. We exploit this correlation between the modalities and affective cues to classify "real" and "fake" videos.

Main Contribution:

We propose a Siamese network-based architecture for detecting deepfake videos. At training time, we pass a real video along with its deepfake through our network and obtain modality and emotion embedding vectors for the face and speech of the subject. We use these embedding vectors to compute the triplet loss function to minimize the similarity between the modalities from the fake video and maximize the similarity between modalities for the real video.

The novel aspects of our work include:

  1. We propose a Siamese Network-based architecture with a modified triplet loss to model the similarity (or dissimilarity) between the facial and speech modalities, extracted from the input video, to perform deepfake detection.

  2. We also exploit the affect information, i.e., emotion cues from the two modalities to detect the similarity (or dissimilarity) between modality signals, and show that emotion information helps in detecting deepfake content.

We validate our model on two benchmark deepfake detection datasets, DeepFakeTIMIT Dataset [25], and DFDC [9]. We report the Area Under Curve (AUC) metric on the two datasets for our approach and compare with several prior works. We report the per-video AUC score of 84.4%, which is an improvement of about 9% over SOTA on DFDC, and our network performs at-par with prior methods on the DF-TIMIT dataset.

2 Related Work

In this section, we summarize prior work in this domain. We elaborate on unimodal deepfake detection methods in Section 2.1. In Section 2.2, we discuss multimodal approaches for deepfake detection. We summarize deepfake video datasets in Section 2.3. We give references from the psychology literature validating our correlation-based method for deepfake detection in Section 2.4. Finally, in Section 2.5, we motivate the use of Siamese networks and the triplet loss for deepfake detection.

2.1 Unimodal DeepFake Detection Methods

Most prior work in deepfake detection decomposes videos into frames and explores visual artifacts across frames. For instance, Li et al. [28] propose a Deep Neural Network (DNN) to detect fake videos based on artifacts observed during the face-warping step of the generation algorithms. Similarly, Yang et al. [44] look at inconsistencies in the head poses of synthesized videos, and Matern et al. [30] capture artifacts in the eyes, teeth, and facial contours of the generated faces. Prior works have also experimented with a variety of network architectures. For instance, Nguyen et al. [33] explore capsule structures, Rossler et al. [36] use XceptionNet, and Zhou et al. [50] use a two-stream Convolutional Neural Network (CNN) to achieve SOTA in general-purpose image forgery detection. Previous researchers have also observed and exploited the fact that temporal coherence is not enforced effectively in the synthesis process of deepfakes. For instance, Sabir et al. [37] leveraged spatio-temporal features of video streams to detect deepfakes. Likewise, Güera and Delp [18] highlight that deepfake videos contain intra-frame inconsistencies and hence use a CNN with a Long Short-Term Memory (LSTM) network to detect deepfake videos.

Dataset | Released | Real / Fake / Total Videos | Video Source | Fake Generation
UADFV [44] | Nov 2018 | 49 / 49 / 98 | YouTube | FakeApp [12]
DF-TIMIT [25] | Dec 2018 | 0 / 620 / 620 | VidTIMIT [39] | FS-GAN [16]
FaceForensics++ [36] | Jan 2019 | 1,000 / 4,000 / 5,000 | YouTube | FS [11], DF
DFD [17] | Sep 2019 | 361 / 3,070 / 3,431 | YouTube | DF
Celeb-DF [29] | Nov 2019 | 408 / 795 / 1,203 | YouTube | DF
DFDC [9] | Oct 2019 | 19,154 / 99,992 / 119,146 | Actors | Unknown
DeeperForensics-1.0 [24] | Jan 2020 | 50,000 / 10,000 / 60,000 | Actors | DF
Table 1: Benchmark Datasets for DeepFake Video Detection. Our approach is applicable to datasets that include both the audio and visual modalities. Only two datasets (DF-TIMIT and DFDC) satisfy that criterion, and we evaluate the performance on those datasets. Further details in Section 4.1.

2.2 Multimodal DeepFake Detection Methods

While unimodal deepfake detection methods (discussed in Section 2.1) have focused only on the facial features of the subject, there has not been much focus on using the multiple modalities that are part of the same video. Jeon and Bang et al. [23] propose FakeTalkerDetect, a Siamese-based network to detect fake videos generated by neural talking-head models. They perform a classification based on distance; however, the two inputs to their Siamese network are a real and a fake video. Korshunov et al. [26] analyze lip-syncing inconsistencies using two channels: the audio and the visual stream of the moving lips. Krishnamurthy et al. [27] investigated the problem of detecting deception in real-life videos, which is very different from deepfake detection; they use an MLP-based classifier combining video, audio, and text with micro-expression features. Our approach of exploiting the mismatch between the two modalities is quite different from, and complementary to, these methods.

2.3 DeepFake Video Datasets

The problem of deepfake detection has attracted considerable attention, and this research has been stimulated by many datasets. We summarize and analyze benchmark deepfake video detection datasets in Table 1. The newer datasets (DFDC [9] and DeeperForensics-1.0 [24]) are larger and do not disclose details of the AI model used to synthesize the fake videos from the real videos. Also, DFDC is the only dataset that contains a mix of videos with manipulated faces, audio, or both; all the other datasets contain only manipulated faces. Furthermore, only DFDC and DF-TIMIT [25] contain both audio and video, allowing us to analyze both modalities.

2.4 Multimodal Affect Correlation in Psychology Research

Shan et al. [41] state that even if two modalities representing the same emotion vary in appearance, the features detected are similar and should be correlated; hence, if projected into a common space, they are compatible and can be fused to make inferences. Zhu et al. [51] explore the relationship between the visual and auditory human modalities. Based on the neuroscience literature, they suggest that visual and auditory signals are coded together in small populations of neurons within a particular part of the brain.

[38, 35, 40] explored the correlation of lip movements with speech. These studies concluded that our understanding of the speech modality is greatly aided by the sight of lip and facial movements. Subsequently, such correlation among modalities has been explored extensively to perform multimodal emotion recognition [2, 19, 20, 34]. These works have suggested and shown correlations between affect features obtained from the individual modalities (face, speech, eyes, gestures). For instance, Mittal et al. [31] propose a multimodal emotion perception network, where they use the correlation among modalities to differentiate between effectual and ineffectual modality features. Our approach is motivated by these developments in psychology research.

2.5 Siamese Network Architecture and Triplet Loss

The Siamese network [6] architecture consists of two neural networks that share the same weights and are trained together. Each network typically takes in a different pattern (e.g., two views of an image, or two samples of a voice), and the end output is a value representing the similarity between the two patterns. The overall network is trained using variants of the triplet loss or the contrastive loss, which are designed to maximize the distance between features learned from dissimilar patterns and minimize the distance between features learned from similar patterns. With this training objective, Siamese network-based architectures have been used extensively in applications such as face recognition [8], face verification [42], and speaker identification [7]. In our work, we develop a Siamese network-based architecture and a variant of the triplet loss to maximally separate the features learned from real and fake videos.
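As a concrete illustration, the hinge form of the triplet loss described above can be sketched in a few lines of numpy. The vectors and margin below are toy values for illustration, not taken from the paper:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: pull the positive toward the anchor and
    push the negative at least `margin` farther away than the positive."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

# A well-separated triplet incurs zero loss.
a = np.array([0.0, 0.0])   # anchor
p = np.array([0.1, 0.0])   # similar pattern
n = np.array([5.0, 0.0])   # dissimilar pattern
loss = triplet_loss(a, p, n)
```

When the roles of `p` and `n` are swapped, the loss becomes positive, which is the gradient signal that drives the two branches of the Siamese network apart.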

(a) Training Routine: (left) We extract facial and speech features from the raw videos (each subject has a real and fake video pair) using OpenFace and pyAudioAnalysis, respectively. (right) The extracted features are passed to the training network, which consists of two modality embedding networks and two emotion embedding networks.
(b) Testing Routine: At runtime, given an input video, our network predicts the label (real or fake).
Figure 1: Overview Diagram:

We present an overview diagram for both the training and testing routines of our model. The networks consist of 2D convolutional layers (purple), max-pooling layers (yellow), fully-connected layers (green), and normalization layers (blue). M_f and M_s are the modality embedding networks, and E_f and E_s are the emotion embedding networks, for face and speech, respectively.
Symbol | Description
f_f, f_s | Face and speech features extracted using OpenFace and pyAudioAnalysis, respectively.
Superscripts r, f | Indicate whether a feature or embedding comes from a real or a fake video. E.g., f_f^r denotes the face features extracted from a real video using OpenFace.
m, e | Denote a modality embedding and an emotion embedding, respectively.
Subscripts f, s | Denote face and speech cues. E.g., m_f^r denotes the face modality embedding generated from a real video.
L1 | Modality embedding similarity loss (used in training).
L2 | Emotion embedding similarity loss (used in training).
d1 | Face/speech modality embedding distance (used in testing).
d2 | Face/speech emotion embedding distance (used in testing).
Table 2: Notation: We highlight the notation and symbols used in the paper.

3 Our Approach

In this section, we present our multimodal approach to detecting deepfake videos. We describe the problem statement and give an overview of our approach in Section 3.1, where we also explain how our approach resembles a Siamese network architecture. We elaborate on the two main components, the modality embeddings and the emotion embeddings, in Section 3.2 and Section 3.3, respectively. We explain the similarity scores and modified triplet losses used for training the network in Section 3.4, and finally, in Section 3.5, we explain how they are used to classify real and fake videos. We list all notation used throughout the paper in Table 2.

3.1 Problem Statement and Overview

Given an input video with both audio and visual modalities present, our goal is to detect whether it is a deepfake video. Overviews of our training and testing routines are given in Figure 1(a) and Figure 1(b), respectively. During training, we select one "real" and one "fake" video containing the same subject. We extract the visual face features, f_f^r, as well as the speech features, f_s^r, from the real input video. In similar fashion, we extract the face and speech features, f_f^f and f_s^f, from the fake video (using OpenFace [4] and pyAudioAnalysis [13]). More details about feature extraction from the raw videos are presented in Section 4.3. The extracted features, f_f^r, f_s^r, f_f^f, and f_s^f, form the inputs to the networks M_f, M_s, E_f, and E_s. We train these networks using a combination of two triplet loss functions designed from the similarity scores L1 and L2. L1 represents the similarity between the facial and speech modality embeddings, and L2 the similarity between the affect cues (specifically, emotion) from the modalities of both real and fake videos.

Our training method is similar to a Siamese network in that we also use the same network weights (M_f, M_s, E_f, E_s) to operate on two different inputs: a real video and a fake video of the same subject. Unlike regular classification-based neural networks, which compute and back-propagate a classification loss, we instead use similarity-based metrics to distinguish real from fake videos. We model the similarity between the modalities using the triplet loss (explained in detail in Section 3.4).

During testing, we are given a single input video, from which we extract the face and speech feature vectors, f_f and f_s, respectively. We pass f_f into M_f and E_f, and pass f_s into M_s and E_s; the resulting embeddings are used to compute the distance metrics d1 and d2. We use a threshold τ, learned during training, to classify the video as real or fake.

3.2 M_f and M_s: Modality Embeddings

M_f and M_s are the neural networks that we use to learn unit-normalized embeddings for the face and speech modalities, respectively. In Figure 1, we depict M_f and M_s in both the training and testing routines. They are composed of 2D convolutional layers (purple), max-pooling layers (yellow), and fully-connected layers (green), with ReLU non-linearities between all layers. The last layer is a unit-normalization layer (blue). For both the face and speech modalities, M_f and M_s return unit-normalized embeddings.
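The final unit-normalization layer can be sketched as a simple L2 projection onto the unit hypersphere, which makes distances between embeddings comparable. This is a minimal sketch under that assumption, not the authors' exact layer:

```python
import numpy as np

def unit_normalize(x, eps=1e-12):
    """Project an embedding vector onto the unit hypersphere.
    `eps` guards against division by zero for an all-zero input."""
    return x / (np.linalg.norm(x) + eps)

emb = unit_normalize(np.array([3.0, 4.0]))  # a toy raw embedding
```

After normalization, Euclidean distance between two embeddings is bounded and directly comparable across samples, which the similarity losses below rely on.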


3.3 E_f and E_s: Emotion Embeddings

E_f and E_s are the neural networks that we use to learn unit-normalized affect embeddings for the face and speech modalities, respectively. Both are based on the Memory Fusion Network (MFN) [47], which is reported to have SOTA performance on both face and speech emotion recognition. MFN is a recurrent neural network architecture with three main components: a system of LSTMs, a Memory Attention Network, and a Gated Memory component. The system of LSTMs takes in different views of the input data; in our case, E_f takes in different views of the video and E_s takes in different views of the audio. We construct the different views by cropping, trimming, and sub-sampling the input streams. The memory attention network learns the underlying patterns across the different views processed by the system of LSTMs. The gated memory component stores the most informative cross-view patterns, which the network subsequently fuses to predict emotion labels. We pre-trained the MFN on video and on audio from the CMU-MOSEI dataset [48]. The CMU-MOSEI dataset describes the emotion space with six discrete emotions following the Ekman model [10]: happy, sad, angry, fearful, surprise, and disgust, plus a "neutral" label to denote the absence of any of these emotions. For the face and speech modalities in our network, we use unit-normalized features constructed from the cross-view patterns learned by E_f and E_s, respectively.
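The multi-view construction can be illustrated with a toy sub-sampling scheme over a frame (or audio) sequence. The strides and view count here are illustrative assumptions, not the authors' settings:

```python
import numpy as np

def make_views(frames, n_views=3):
    """Construct multiple 'views' of a stream by sub-sampling it at
    different temporal strides, one view per LSTM in the system of LSTMs."""
    return [frames[::stride] for stride in range(1, n_views + 1)]

clip = np.arange(12)        # stand-in for a sequence of frame indices
views = make_views(clip)    # full-rate, half-rate, and third-rate views
```

Cropping and trimming would be analogous slicing operations over the spatial or temporal axes of the real input tensors.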


3.4 Training Routine

At training time, we use a fake and a real video of the same subject as the input. After passing the features extracted from the raw videos (f_f^r, f_s^r, f_f^f, f_s^f) through M_f, M_s, E_f, and E_s, we obtain the unit-normalized modality and emotion embeddings:

m_f^r = M_f(f_f^r), m_s^r = M_s(f_s^r), m_f^f = M_f(f_f^f), m_s^f = M_s(f_s^f),   (1)
e_f^r = E_f(f_f^r), e_s^r = E_s(f_s^r), e_f^f = E_f(f_f^f), e_s^f = E_s(f_s^f).   (2)

Given the real and fake input pair, we first compare m_f^r with m_f^f, and m_s^r with m_s^f, to determine which modality was manipulated more in the fake video. Suppose we identify the face modality as the more manipulated one. Based on these embeddings, we compute the first similarity score between the real and fake face and speech embeddings as follows:

ρ1 = d(m_f^r, m_s^r) − d(m_f^f, m_s^r),

where d(·, ·) denotes the Euclidean distance.

In simpler terms, ρ1 compares the distances of two pairs, (m_f^r, m_s^r) and (m_f^f, m_s^r). We expect m_f^r and m_s^r to be closer to each other than m_f^f and m_s^r, since the latter pair contains a fake face modality; hence, we wish to maximize this difference. To use this correlation metric as a loss function to train our model, we formulate it using the notation of the triplet loss:

L1 = max( d(m_f^r, m_s^r) − d(m_f^f, m_s^r) + μ1, 0 ),

where μ1 is the margin used for convergence of training.

If we had instead observed that speech is the more manipulated modality in the fake video, we would formulate L1 as follows:

L1 = max( d(m_f^r, m_s^r) − d(m_f^r, m_s^f) + μ1, 0 ).

Similarly, we compute the second similarity score as the difference in affective cues extracted from the modalities of both the real and fake videos:

ρ2 = d(e_f^r, e_s^r) − d(e_f^f, e_s^r).

As per prior psychology studies, we expect similar, un-manipulated modalities to point towards similar affective cues. Hence, because the input here has a manipulated face modality, we expect e_f^r and e_s^r to be closer to each other than e_f^f and e_s^r. To use this as a loss function, we again formulate it as a triplet loss:

L2 = max( d(e_f^r, e_s^r) − d(e_f^f, e_s^r) + μ2, 0 ),

where μ2 is the margin.

Again, if speech were the more manipulated modality in the fake video, we would formulate L2 as follows:

L2 = max( d(e_f^r, e_s^r) − d(e_f^r, e_s^f) + μ2, 0 ).

We use the sum of the two similarity losses, L = L1 + L2, as the cumulative loss and propagate it back through the network.
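Putting the two triplet losses together, a toy sketch of the cumulative training loss for the case where the face is the more-manipulated modality might look like the following. Variable names, margins, and vectors are illustrative assumptions:

```python
import numpy as np

def hinge_triplet(anchor, pos, neg, margin):
    """max(0, d(anchor, pos) - d(anchor, neg) + margin)."""
    d = np.linalg.norm
    return max(0.0, d(anchor - pos) - d(anchor - neg) + margin)

def training_loss(m_s_real, m_f_real, m_f_fake,
                  e_s_real, e_f_real, e_f_fake,
                  margin1=1.0, margin2=1.0):
    """Cumulative loss when the face is the more-manipulated modality:
    the real speech embeddings anchor both triplets, the real face
    embeddings are the positives, and the fake face embeddings the negatives."""
    L1 = hinge_triplet(m_s_real, m_f_real, m_f_fake, margin1)  # modality similarity
    L2 = hinge_triplet(e_s_real, e_f_real, e_f_fake, margin2)  # emotion similarity
    return L1 + L2

# Toy embeddings: real face close to real speech, fake face far away.
z = np.zeros(2)
p = np.array([0.1, 0.0])
n = np.array([3.0, 0.0])
total = training_loss(z, p, n, z, p, n)
```

When the real and fake pairs are already well separated, the cumulative loss is zero and no gradient flows; otherwise the hinge terms push the fake-modality embeddings away from the real anchor.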


3.5 Testing Inference

At test time, we have only a single input video that is to be labeled real or fake. After extracting the features f_f and f_s from the raw video, we perform a forward pass through M_f, M_s, E_f, and E_s, as depicted in Figure 1(b), to obtain the modality and emotion embeddings.

To make an inference about real and fake, we compute the following two distance values:

d1 = d(M_f(f_f), M_s(f_s)),   d2 = d(E_f(f_f), E_s(f_s)).

To distinguish between real and fake, we compare d1 and d2 with a threshold τ that is empirically learned during training: if the distances exceed τ, we label the video as a fake video.

Computation of τ: To compute τ, we take the best-trained model and run it on the training set. We compute d1 and d2 for both the real and fake videos of the train set. We average these values and find the equidistant number between the real and fake averages, which serves as a good threshold value. Based on our experiments, the computed value of τ was almost consistent and did not vary much between datasets.
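One simple reading of this threshold computation, sketched with hypothetical distance values (the midpoint rule and the single combined distance are assumptions for illustration, not the authors' exact procedure):

```python
import numpy as np

def learn_threshold(real_dists, fake_dists):
    """Midpoint between the mean embedding distance of real videos and
    that of fake videos on the training set: one reading of the
    'equidistant number' described above."""
    return (np.mean(real_dists) + np.mean(fake_dists)) / 2.0

def classify(dist, tau):
    """Label a test video: a distance above the threshold means the
    modalities disagree, so the video is flagged as fake."""
    return "fake" if dist > tau else "real"

# Hypothetical training-set distances (real videos cluster lower).
tau = learn_threshold([0.2, 0.3, 0.25], [0.9, 1.1, 1.0])
```

At test time, the same rule is applied independently to d1 and d2 (or to their combination) to produce the final label.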

4 Implementation and Evaluation

4.1 Datasets

We perform experiments on the DF-TIMIT [25] and DFDC [9] datasets, as only these two datasets contain both the face and speech modalities (Table 1). We used the entire DF-TIMIT dataset and a subset of the videos from the DFDC dataset due to computational overhead. Both datasets are split into training and testing sets.

4.2 Training Parameters

On the DFDC dataset, we trained our models with a fixed batch size and number of epochs; due to the significantly smaller size of the DF-TIMIT dataset, we used a smaller batch size and fewer epochs for it. We used the Adam optimizer. All our results were generated on an NVIDIA GeForce GTX 1080 Ti GPU.

4.3 Feature Extraction

In our approach (see Figure 1), we first extract the face and speech features from the real and fake input videos, using existing SOTA methods for this purpose. In particular, we use OpenFace [4] to extract facial features, including 2D landmark positions, head-pose orientation, and gaze features. To extract speech features, we use pyAudioAnalysis [13] to compute Mel-Frequency Cepstral Coefficient (MFCC) features.

5 Results and Analysis

In this section, we elaborate on some quantitative and qualitative results of our methods.

5.1 Comparison with SOTA Methods

We report and compare per-video AUC Scores of our method against prior deepfake video detection methods on DF-TIMIT and DFDC. We have summarized these results in Table 3. The following are the prior methods used to compare the performance of our approach on the same datasets.

S.No. | Method | DF-TIMIT (LQ) [25] | DF-TIMIT (HQ) [25] | DFDC [9]
1 | Capsule [33] | 78.4 | 74.4 | 53.3
2 | Multi-task [32] | 62.2 | 55.3 | 53.6
3 | HeadPose [44] | 55.1 | 53.2 | 55.9
4 | Two-stream [50] | 83.5 | 73.5 | 61.4
5 | VA-MLP [30] | 61.4 | 62.1 | 61.9
5 | VA-LogReg [30] | 77.0 | 77.3 | 66.2
6 | MesoInception4 [1] | 80.4 | 62.7 | 73.2
6 | Meso4 [1] | 87.8 | 68.4 | 75.3
7 | Xception-raw [36] | 56.7 | 54.0 | 49.9
7 | Xception-c40 [36] | 75.8 | 70.5 | 69.7
7 | Xception-c23 [36] | 95.9 | 94.4 | 72.2
8 | FWA [28] | 99.9 | 93.2 | 72.7
8 | DSP-FWA [28] | 99.9 | 99.7 | 75.5
  | Our Method | 96.3 | 94.9 | 84.4
Table 3: AUC Scores. Our model improves the SOTA by approximately 9% on the DFDC dataset and achieves accuracy similar to the SOTA on the DF-TIMIT dataset.
  1. Two-stream [50]: uses a two-stream CNN to achieve SOTA performance in image-forgery detection, training the model with standard CNN architectures.

  2. MesoNet [1]: a CNN-based detection method that targets the mesoscopic properties of images. AUC scores are reported on two variants.

  3. HeadPose [44]: captures inconsistencies in head-pose orientation across frames to detect deepfakes.

  4. FWA [28]: uses a CNN to expose the face-warping artifacts introduced by the resizing and interpolation operations.

  5. VA [30]: focuses on capturing visual artifacts in the eyes, teeth, and facial contours of synthesized faces. Results are reported on two standard variants of this method.

  6. Xception [36]: a baseline model based on XceptionNet trained on the FaceForensics++ dataset. AUC scores are reported on three variants of the network.

  7. Multi-task [32]: uses a CNN to simultaneously detect manipulated images and segment the manipulated areas as a multi-task learning problem.

  8. Capsule [33]: uses capsule structures based on a standard DNN.

  9. DSP-FWA: an improved version of FWA [28] with a spatial pyramid pooling module to better handle variations in the resolutions of the original target faces.
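The per-video AUC used throughout Table 3 can be computed with the rank (Mann-Whitney) formulation: the probability that a randomly chosen fake video receives a higher "fakeness" score than a randomly chosen real one. This is a generic sketch, not the authors' evaluation code:

```python
def auc_score(real_scores, fake_scores):
    """AUC via pairwise comparisons: count the fraction of (fake, real)
    pairs where the fake video scores higher, with ties counted as 0.5."""
    wins = 0.0
    for f in fake_scores:
        for r in real_scores:
            if f > r:
                wins += 1.0
            elif f == r:
                wins += 0.5
    return wins / (len(real_scores) * len(fake_scores))
```

An AUC of 1.0 means the score separates every fake video from every real one; 0.5 is chance level.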

Figure 2: Qualitative Results: We show results of our model on the DFDC and DF-TIMIT datasets. Our model uses the subjects’ audio-visual modalities as well as their emotions to distinguish between real and deepfake videos. The emotions from the speech and facial cues in fake videos are different; however in the case of real videos, the emotions from both modalities are the same.
(a) Correct Classification: Our model correctly classified this popular deepfake video as fake.
(b) Misclassification: Our model was not able to detect this deepfake video as fake because the network was not able to detect any emotion from the subject.
Figure 3: In the Wild Deepfake Videos: Our model succeeds in the wild. We collect several popular deepfake videos from online social media and our model achieves reasonably good results. We will present additional results in the supplementary video.

5.2 Qualitative Results

We show some selected frames of videos from both datasets in Figure 2, along with the labels (real/fake). For the qualitative result shown for DFDC, both the speech and face modalities of the real video were assigned a "neutral" emotion label, whereas in the fake video the face was assigned "surprise" and the speech "neutral". This result is indeed interpretable, because the fake video was generated by manipulating only the face modality and not the speech modality. We see a similar emotion label mismatch for the DF-TIMIT sample as well.

5.3 Interpreting the Correlations

To better understand the learned embeddings, we plot the distance between the unit-normalized face and speech embeddings learned from M_f and M_s on randomly chosen points from the DFDC train set in Figure 4(a), with real videos in blue and fake videos in orange. It is interesting to see that the majority of the subjects from real videos have a smaller separation between their embeddings than those from fake videos. We also plot the number of videos, both fake and real, with a mismatch between the emotion labels extracted using E_f and E_s in Figure 4(b). A far larger fraction of the fake videos than of the real videos showed a mismatch between the labels extracted from the face and speech modalities.
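The mismatch counts behind Figure 4(b) amount to a simple label comparison per video; a toy sketch (the labels are illustrative, not from the datasets):

```python
def mismatch_fraction(face_labels, speech_labels):
    """Fraction of videos whose face-derived and speech-derived emotion
    labels disagree, as tallied separately for real and fake videos."""
    pairs = list(zip(face_labels, speech_labels))
    return sum(1 for f, s in pairs if f != s) / len(pairs)

# Hypothetical per-video predicted labels for a handful of videos.
frac = mismatch_fraction(["happy", "sad", "neutral", "surprise"],
                         ["happy", "sad", "neutral", "neutral"])
```

Computing this fraction separately over the real and the fake subsets yields the two bar groups plotted in the figure.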

(a) Modality Embedding Distances: We plot the percentage of subject videos versus the distance between the face and speech modality embeddings. The figure shows that the distribution of real videos (blue curve) is centered around a lower modality embedding distance than that of the fake videos (orange curve). Conclusion: the audio-visual modalities are more similar in real videos than in fake videos.
(b) Emotion Embeddings in Real and Fake Videos: The blue and orange bars represent the number of videos where the emotion labels obtained from the face and speech modalities do not match and match, respectively. Far more fake videos than real videos were found to contain a mismatch between emotion labels. Conclusion: the emotions of subjects, across multiple modalities, are strongly similar in real videos and often mismatched in fake videos.
Figure 4: Embedding Interpretation: We provide an intuitive interpretation of the learned embeddings with visualizations. These results support our hypothesis that emotions are more highly correlated across modalities in real videos than in fake videos.
Method | DF-TIMIT (LQ) [25] | DF-TIMIT (HQ) [25] | DFDC [9]
Our Method w/o Modality Similarity (L1) | 92.5 | 91.7 | 78.3
Our Method w/o Emotion Similarity (L2) | 94.8 | 93.6 | 82.8
Our Method | 96.3 | 94.9 | 84.4
Table 4: Ablation Experiments. To motivate our model, we perform ablation studies where we remove one similarity at a time during training and report the AUC scores.
Figure 5: Misclassification Results: We show one sample each from DFDC and DF-TIMIT where our model predicted the two fake videos as real due to incorrect emotion embeddings.

5.4 Ablation Experiments

As explained in Section 3.5, we use two distances, based on the modality embedding similarity and the emotion embedding similarity, to detect fake videos. To understand and motivate the contribution of each similarity, we perform an ablation study where we train the model using only one similarity at a time. We summarize the results of the ablation experiments in Table 4. Removing the modality embedding similarity degrades the AUC scores more than removing the emotion embedding similarity, indicating that the modality similarity contributes more to the final performance.

5.5 Failure Cases

Our approach models the correlation between the two modalities and the associated affective cues to distinguish between real and fake videos. However, there are multiple instances where a deepfake video does not contain such a mismatch in the emotions inferred from the different modalities, partly because humans express emotions differently; our model fails to classify such videos as fake. Conversely, because both face and speech are modalities that are easy to fake, a real video can also exhibit such a mismatch, and our method may then classify it as fake. In Figure 5, we show one such video from each dataset where our model failed.

5.6 Results on Videos in the Wild

We tested the performance of our model on two popular deepfake videos obtained from an online social platform [45, 46]. Some frames from these videos are shown in Figure 3.

6 Conclusion, Limitations, and Future Work

We present a learning-based method for detecting fake videos. We use the similarity between audio-visual modalities and the similarity between the affective cues of the two modalities to infer whether a video is “real” or “fake”. We evaluated our method on two benchmark audio-visual deepfake datasets, DFDC, and DF-TIMIT.

Our approach has some limitations. First, when a fake video does not alter the perceived emotion relative to the real video, our approach can misclassify it on both datasets. Given the different ways in which people express emotions, our approach can also find a mismatch in the modalities of real videos and (incorrectly) classify them as fake. Furthermore, many of the deepfake datasets contain videos with more than one person; we may need to extend our approach to take into account the emotional states of multiple persons in a video and devise a suitable scheme for deepfake detection.

In the future, we would like to incorporate more modalities, and even context, to infer whether a video is a deepfake. We would also like to combine our approach with existing ideas for detecting visual artifacts across frames for better performance. Additionally, we would like to develop better methods for using audio cues.


  • [1] D. Afchar, V. Nozick, J. Yamagishi, and I. Echizen (2018) Mesonet: a compact facial video forgery detection network. In 2018 IEEE International Workshop on Information Forensics and Security (WIFS), pp. 1–7. Cited by: §1, item 2, Table 3.
  • [2] T. Balomenos, A. Raouzaiou, S. Ioannou, A. Drosopoulos, K. Karpouzis, and S. Kollias (2004) Emotion analysis in man-machine interaction systems. In International Workshop on Machine Learning for Multimodal Interaction, pp. 318–328. Cited by: §1, §2.4.
  • [3] T. Baltrušaitis, C. Ahuja, and L. Morency (2018) Multimodal machine learning: a survey and taxonomy. IEEE transactions on pattern analysis and machine intelligence 41 (2), pp. 423–443. Cited by: §1.
  • [4] T. Baltrušaitis, P. Robinson, and L. Morency (2016) OpenFace: an open source facial behavior analysis toolkit. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1–10. Cited by: §3.1, §4.3.
  • [5] J. P. Bigham, C. Jayant, H. Ji, G. Little, A. Miller, R. C. Miller, R. Miller, A. Tatarowicz, B. White, S. White, et al. (2010) VizWiz: nearly real-time answers to visual questions. In Proceedings of the 23nd annual ACM symposium on User interface software and technology, pp. 333–342. Cited by: §1.
  • [6] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah (1994) Signature verification using a "siamese" time delay neural network. In Advances in Neural Information Processing Systems, pp. 737–744. Cited by: §2.5.
  • [7] K. Chen and A. Salman (2011) Extracting speaker-specific information with a regularized siamese deep network. In Advances in Neural Information Processing Systems, pp. 298–306. Cited by: §2.5.
  • [8] S. Chopra, R. Hadsell, and Y. LeCun (2005) Learning a similarity metric discriminatively, with application to face verification. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), Vol. 1, pp. 539–546. Cited by: §2.5.
  • [9] B. Dolhansky, R. Howes, B. Pflaum, N. Baram, and C. C. Ferrer (2019) The deepfake detection challenge (dfdc) preview dataset. arXiv preprint arXiv:1910.08854. Cited by: §1, §1, §2.3, Table 1, §4.1, Table 3, Table 4.
  • [10] P. Ekman, W. V. Freisen, and S. Ancoli (1980) Facial signs of emotional experience.. Journal of personality and social psychology 39 (6), pp. 1125. Cited by: §3.3.
  • [11] () Faceswap: deepfakes software for all. Note: https://github.com/deepfakes/faceswap(Accessed on 02/16/2020) Cited by: §1, Table 1.
  • [12] () FakeApp 2.2.0. Note: https://www.malavida.com/en/soft/fakeapp/#gref(Accessed on 02/16/2020) Cited by: §1, Table 1.
  • [13] T. Giannakopoulos (2015) Pyaudioanalysis: an open-source python library for audio signal analysis. PloS one 10 (12). Cited by: §3.1, §4.3.
  • [14] () GitHub - dfaker/df: larger resolution face masked, weirdly warped, deepfake,. Note: https://github.com/dfaker/df(Accessed on 02/16/2020) Cited by: §1.
  • [15] () GitHub - iperov/deepfacelab: deepfacelab is the leading software for creating deepfakes.. Note: https://github.com/iperov/DeepFaceLab(Accessed on 02/16/2020) Cited by: §1.
  • [16] () GitHub - shaoanlu/faceswap-gan: a denoising autoencoder + adversarial losses and attention mechanisms for face swapping. Note: https://github.com/shaoanlu/faceswap-GAN(Accessed on 02/16/2020) Cited by: §1, Table 1.
  • [17] () Google ai blog: contributing data to deepfake detection research. Note: https://ai.googleblog.com/2019/09/contributing-data-to-deepfake-detection.html(Accessed on 02/16/2020) Cited by: §1, Table 1.
  • [18] D. Güera and E. J. Delp (2018) Deepfake video detection using recurrent neural networks. In 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 1–6. Cited by: §1, §2.1.
  • [19] H. Gunes and M. Piccardi (2007) Bi-modal emotion recognition from expressive face and body gestures. Journal of Network and Computer Applications 30 (4), pp. 1334–1345. Cited by: §1, §2.4.
  • [20] H. Gunes and M. Piccardi (2007) Bi-modal emotion recognition from expressive face and body gestures. Journal of Network and Computer Applications 30 (4), pp. 1334–1345. Cited by: §2.4.
  • [21] M. Gurban, J. Thiran, T. Drugman, and T. Dutoit (2008) Dynamic modality weighting for multi-stream hmms in audio-visual speech recognition. In Proceedings of the 10th international conference on Multimodal interfaces, pp. 237–240. Cited by: §1.
  • [22] M. Hodosh, P. Young, and J. Hockenmaier (2013) Framing image description as a ranking task: data, models and evaluation metrics. Journal of Artificial Intelligence Research 47, pp. 853–899. Cited by: §1.
  • [23] H. Jeon, Y. Bang, and S. S. Woo (2019) FakeTalkerDetect: effective and practical realistic neural talking head detection with a highly unbalanced dataset. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 0–0. Cited by: §2.2.
  • [24] L. Jiang, W. Wu, R. Li, C. Qian, and C. C. Loy (2020) DeeperForensics-1.0: a large-scale dataset for real-world face forgery detection. arXiv preprint arXiv:2001.03024. Cited by: §2.3, Table 1.
  • [25] P. Korshunov and S. Marcel (2018) Deepfakes: a new threat to face recognition? assessment and detection. arXiv preprint arXiv:1812.08685. Cited by: §1, §1, §2.3, Table 1, §4.1, Table 3, Table 4.
  • [26] P. Korshunov and S. Marcel (2018) Speaker inconsistency detection in tampered video. In 2018 26th European Signal Processing Conference (EUSIPCO), pp. 2375–2379. Cited by: §2.2.
  • [27] G. Krishnamurthy, N. Majumder, S. Poria, and E. Cambria (2018) A deep learning approach for multimodal deception detection. arXiv preprint arXiv:1803.00344. Cited by: §2.2.
  • [28] Y. Li and S. Lyu (2018) Exposing deepfake videos by detecting face warping artifacts. arXiv preprint arXiv:1811.00656. Cited by: §1, §2.1, item 4, item 9, Table 3.
  • [29] Y. Li, X. Yang, P. Sun, H. Qi, and S. Lyu (2019) Celeb-df: a new dataset for deepfake forensics. arXiv preprint arXiv:1909.12962. Cited by: Table 1.
  • [30] F. Matern, C. Riess, and M. Stamminger (2019) Exploiting visual artifacts to expose deepfakes and face manipulations. In 2019 IEEE Winter Applications of Computer Vision Workshops (WACVW), pp. 83–92. Cited by: §1, §2.1, item 5, Table 3.
  • [31] T. Mittal, U. Bhattacharya, R. Chandra, A. Bera, and D. Manocha (2019) M3ER: multiplicative multimodal emotion recognition using facial, textual, and speech cues. arXiv preprint arXiv:1911.05659. Cited by: §1, §2.4.
  • [32] H. H. Nguyen, F. Fang, J. Yamagishi, and I. Echizen (2019) Multi-task learning for detecting and segmenting manipulated facial images and videos. arXiv preprint arXiv:1906.06876. Cited by: §1, item 7, Table 3.
  • [33] H. H. Nguyen, J. Yamagishi, and I. Echizen (2019) Capsule-forensics: using capsule networks to detect forged images and videos. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2307–2311. Cited by: §1, §2.1, item 8, Table 3.
  • [34] M. Pantic, N. Sebe, J. F. Cohn, and T. Huang (2005) Affective multimodal human-computer interaction. In Proceedings of the 13th annual ACM international conference on Multimedia, pp. 669–676. Cited by: §2.4.
  • [35] L. A. Ross, D. Saint-Amour, V. M. Leavitt, D. C. Javitt, and J. J. Foxe (2007) Do you see what i am saying? exploring visual enhancement of speech comprehension in noisy environments. Cerebral cortex 17 (5), pp. 1147–1153. Cited by: §1, §2.4.
  • [36] A. Rossler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner (2019) Faceforensics++: learning to detect manipulated facial images. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1–11. Cited by: §1, §2.1, Table 1, item 6, Table 3.
  • [37] E. Sabir, J. Cheng, A. Jaiswal, W. AbdAlmageed, I. Masi, and P. Natarajan (2019) Recurrent convolutional strategies for face manipulation detection in videos. Interfaces (GUI) 3, pp. 1. Cited by: §1, §2.1.
  • [38] D. A. Sanders and S. J. Goodrich (1971) The relative contribution of visual and auditory components of speech to speech intelligibility as a function of three conditions of frequency distortion. Journal of Speech and Hearing Research 14 (1), pp. 154–159. Cited by: §1, §2.4.
  • [39] C. Sanderson (2002) The vidtimit database. Technical report IDIAP. Cited by: Table 1.
  • [40] J. Schwartz, F. Berthommier, and C. Savariaux (2004) Seeing to hear better: evidence for early audio-visual interactions in speech identification. Cognition 93 (2), pp. B69–B78. Cited by: §1, §2.4.
  • [41] C. Shan, S. Gong, and P. W. McOwan (2007) Beyond facial expressions: learning human emotion from body gestures.. In BMVC, pp. 1–10. Cited by: §1, §2.4.
  • [42] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf (2014) Deepface: closing the gap to human-level performance in face verification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1701–1708. Cited by: §2.5.
  • [43] C. Wang, Z. Zhou, X. Jin, Y. Fang, and M. K. Lee (2017) The influence of affective cues on positive emotion in predicting instant information sharing on microblogs: gender as a moderator. Information Processing & Management 53 (3), pp. 721–734. Cited by: §1.
  • [44] X. Yang, Y. Li, and S. Lyu (2019) Exposing deep fakes using inconsistent head poses. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8261–8265. Cited by: §1, §2.1, Table 1, item 3, Table 3.
  • [45] () YouTube video 1. Note: https://www.youtube.com/watch?time_continue=1&v=cQ54GDm1eL0&feature=emb_logo(Accessed on 02/16/2020) Cited by: §5.6.
  • [46] () YouTube video 2. Note: https://www.youtube.com/watch?time_continue=14&v=aPp5lcqgISk&feature=emb_logo(Accessed on 02/16/2020) Cited by: §5.6.
  • [47] A. Zadeh, P. P. Liang, N. Mazumder, S. Poria, E. Cambria, and L. Morency (2018) Memory fusion network for multi-view sequential learning. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §3.3.
  • [48] A. B. Zadeh, P. P. Liang, S. Poria, E. Cambria, and L. Morency (2018) Multimodal language analysis in the wild: cmu-mosei dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2236–2246. Cited by: §3.3.
  • [49] A. B. Zadeh, P. P. Liang, S. Poria, E. Cambria, and L. Morency (2018) Multimodal language analysis in the wild: cmu-mosei dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2236–2246. Cited by: §1.
  • [50] P. Zhou, X. Han, V. I. Morariu, and L. S. Davis (2017) Two-stream neural networks for tampered face detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 1831–1839. Cited by: §1, §2.1, item 1, Table 3.
  • [51] L. L. Zhu and M. S. Beauchamp (2017) Mouth and voice: a relationship between visual and auditory preference in the human superior temporal sulcus. Journal of Neuroscience 37 (10), pp. 2697–2708. Cited by: §1, §2.4.