Not made for each other- Audio-Visual Dissonance-based Deepfake Detection and Localization

We propose detection of deepfake videos based on the dissimilarity between the audio and visual modalities, termed the Modality Dissonance Score (MDS). We hypothesize that manipulation of either modality will lead to disharmony between the two modalities, e.g., loss of lip-sync, unnatural facial and lip movements, etc. MDS is computed as an aggregate of dissimilarity scores between audio and visual segments in a video. Discriminative features are learnt for the audio and visual channels in a chunk-wise manner, employing the cross-entropy loss for individual modalities, and a contrastive loss that models inter-modality similarity. Extensive experiments on the DFDC and DeepFake-TIMIT datasets show that our approach outperforms the state-of-the-art by up to 7%. Our technique also identifies the manipulated video segments.




1. Introduction

The combination of ubiquitous multimedia and high performance computing resources has inspired numerous efforts to manipulate audio-visual content for both benign and sinister purposes. Recently, there has been a lot of interest in the creation and detection of high-quality videos containing facial and auditory manipulations, popularly known as deepfakes (Dolhansky et al., 2019; Korshunov and Marcel, 2018). Since fake videos are often indistinguishable from their genuine counterparts, detection of deepfakes is challenging but topical given their potential for denigration and defamation, especially against women, and for propagating misinformation (Yadlin-Segal and Oppenheim; Floridi, 2018).

Part of the challenge in detecting deepfakes via artificial intelligence (AI) approaches is that deepfakes are themselves created via AI techniques. Neural network-based architectures like Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) and Autoencoders (Vincent et al., 2008) are used for generating fake media content, and due to their ‘learnable’ nature, output deepfakes become more naturalistic and adept at cheating fake detection methods over time. Improved deepfakes necessitate novel fake detection (FD) solutions; FD methods have primarily looked at frame-level visual features (Nguyen et al., 2019a) for statistical inconsistencies, but temporal characteristics (Güera and Delp, 2018) have also been examined of late. Recently, researchers have induced audio-based manipulations to generate fake content, and therefore, corruptions can occur in both the visual and audio channels.

Deepfakes tend to be characterized by visual inconsistencies such as a lack of lip-sync, unnatural facial and lip appearance/movements, or asymmetry between facial regions such as the left and right eyes (see Fig. 5 for an example). Such artifacts tend to capture user attention. The authors of (Grimes, 1991) performed a psycho-physical experiment to study the effect of dissonance, i.e., lack of sync between the audio and visual channels, on user attention and memory. Three different versions of four TV news stories were shown to users, one having perfect audio-visual sync, a second with some asynchrony, and a third with no sync. The study concluded that out-of-sync or dissonant audio-visual channels induced a high user cognitive load, while in-sync audio and video (no dissonance condition) were perceived as a single stimulus as they ‘belonged together’.

We adopt the dissonance rationale for deepfake detection, and since a fake video would contain either an altered audio or visual channel, one can expect higher audio-visual dissonance for fake videos than for real ones. We measure the audio-visual dissonance in a video via the Modality Dissonance Score (MDS), and use this metric to label it as real/fake. Specifically, audio-visual dissimilarity is computed over 1-second video chunks to perform a fine-grained analysis, and then aggregated over all chunks for deriving MDS, employed as figure-of-merit for video labeling.

MDS is modeled based on the contrastive loss, which has traditionally been employed for discovering lip-sync issues in video (Chung and Zisserman, 2017). Contrastive loss enforces the video and audio features to be closer for real videos, and farther for fake ones. Our method also works with videos involving only the audio/visual channel, as our neural network architecture includes video and audio-specific sub-networks, which seek to independently learn discriminative real/fake features via the imposition of a cross-entropy loss. Experimental results confirm that these unimodal loss functions facilitate better real-fake discriminability over modeling only the contrastive loss. Experiments on the DFDC (Dolhansky et al., 2019) and DF-TIMIT (Sanderson and Lovell, 2009) datasets show that our technique outperforms the state-of-the-art by up to 7%, which we attribute to the following factors: (1) Modeling unimodal losses in addition to the contrastive loss which measures modality dissonance, and (2) Learning discriminative features by comparing 1-second audio-visual chunks to compute MDS, as against directly learning video-level features. Chunk-wise learning also enables localization of transient video forgeries (i.e., where only some frames in a sequence are corrupted), while past works have only focused on the real-fake labeling problem.

The key contributions of our work are: (a) We propose a novel multimodal framework for deepfake video detection based on modality dissonance computed over small temporal segments; (b) To our knowledge, our method is the first to achieve temporal forgery localization; and (c) Our method achieves state-of-the-art results on the DFDC dataset, improving the AUC score by up to 7%.

2. Literature Review

Recently, considerable research efforts have been devoted to detecting fake multimedia content automatically and efficiently. Most video faking methods focus on manipulating the video modality; audio-based manipulation is relatively rare. Techniques typically employed for generating fake visual content include 3D face modeling, computer graphics-based image rendering, GAN-based face image synthesis, image warping, etc. Artifacts are synthesized either via face swapping while keeping the expressions intact (e.g., DeepFakes, FaceSwap) or via facial expression transfer, i.e., facial reenactment (e.g., Face2Face (Thies et al., 2016)). Approaches for fake video detection can be broadly divided into three categories as follows:

2.1. Image-based

These methods employ image/frame-based statistical inconsistencies for real/fake classification. For example, (Matern et al., 2019) uses visual artifacts such as missing reflections and incomplete details in the eye and teeth regions, inconsistent illumination and heterogeneous eye colours as cues for fake detection. Similarly, (Yang et al., 2019) relies on the hypothesis that, when the face is synthetically generated, the 3-D head pose estimated from the entire face’s landmark points will be very different from the one computed using only the central facial landmarks. The authors of (Li and Lyu, 2018) hypothesize that fake videos contain artifacts due to resolution inconsistency between the warped face region (which is usually blurred) and the surrounding context, and model the same via the VGG and ResNet deep network architectures. A capsule network architecture is proposed in (Nguyen et al., 2019a) for detecting various kinds of spoofs, such as video replays with embedded images and computer-generated videos.

A multi-task learning framework for fake detection-cum-segmentation of manipulated image (or video frame) regions is presented in (Nguyen et al., 2019b). It is based on a convolutional neural network, comprising an encoder and a Y-shaped decoder, where information gained by one of the detection/segmentation tasks is shared with the other so as to benefit both tasks. A two-stream network is proposed in (Zhou et al., 2018), which leverages information from local noise residuals and camera characteristics. It employs a GoogLeNet-based architecture for one stream, and a patch-based triplet network as the second stream. The authors of (Afchar et al., 2018) train a CNN, named MesoNet, to classify real and fake faces generated by the DeepFake and Face2Face techniques. Given the similar nature of falsifications achieved by these methods, identical network structures are trained for both problems by focusing on mesoscopic properties (intermediate-level analysis) of images.

2.2. Video-based

Video-based fake detection methods also use temporal features for classification, since deepfake videos often contain realistic frames while the warping used for manipulation is temporally inconsistent. For instance, variations in eye-blinking patterns are utilized in (Li et al., 2018) to distinguish real and fake videos. Correlations between different pairs of facial action units across frames are employed for forgery detection in (Agarwal et al., 2019). The authors of (Güera and Delp, 2018) extract frame-level CNN features, and use them to train a recurrent neural network for manipulation detection.

2.3. Audio-visual features based

Aforementioned approaches exploit only the visual modality for identifying deepfake videos. However, examining other modalities such as audio signal in addition to the video can also be helpful. As an example, authors of (Mittal et al., 2020) propose a Siamese network-based approach, which compares the multimedia as well as emotion-based differences in facial movements and speech patterns to learn differences between real and fake videos. Lip-sync detection in unconstrained settings is achieved by learning the closeness between the audio and visual channels for in-sync vs out-of-sync videos via contrastive loss modeling in (Chung and Zisserman, 2017). While this technique is not meant to address deep-fake detection per se, lip-sync issues can also be noted from a careful examination of deepfake videos, and the contrastive loss is useful for tying up genuine audio-visual pairs.

2.4. Bimodal Approaches

While it may be natural to see audio and video as the two main information modes that indicate the genuineness/fakeness of a video, one can also employ multiple cues from the visual modality for identifying fake videos. Multimodal cues are especially useful while tackling sophisticated visual manipulations. In (Güera and Delp, 2018), both intra-frame and inter-frame visual consistencies are modeled by feeding in frame-level CNN features to train a recurrent neural network. Statistical differences between the warped face area and the surrounding regions are learned via the VGG and ResNet architectures in (Li and Lyu, 2018). Hierarchical relations in image (video frame) content are learned via a Capsule Network architecture in (Nguyen et al., 2019a).

2.5. Analysis of Related Work

Upon examining related literature, we make the following remarks to situate our work with respect to existing works.

Figure 1. MDS-based fake video detection: Features extracted from 1-second audio-visual segments are input to the MDS network. The MDS network comprises the audio and visual sub-networks, whose description is provided in Table 1. Descriptors learned by the video and audio sub-networks are tuned via the cross-entropy loss, while the contrastive loss is employed to enforce higher dissimilarity between audio-visual chunks arising from fake videos. MDS is computed as the aggregate audio-visual dissonance over the video length, and employed as a figure of merit for labeling a video as real/fake.
  • While frame-based methods that learn spatial inconsistencies have been proposed to detect deepfakes, temporal-based approaches are conceivably more powerful to this end, as even if the manipulation looks natural in a static image, achieving temporally consistent warping even over a few seconds of video is difficult. Our approach models temporal characteristics, as we consider 1-second long audio-visual segments to distinguish between real and fake videos. Learning over tiny video chunks allows for a fine-grained examination of temporal differences, and also enables our method to temporally localize manipulation in cases where the forgery targets only a small portion of the video. To our knowledge, prior works have only focused on assigning a real/fake video label.

  • Very few approaches have addressed deepfake detection where the audio/video stream may be corrupted. In this regard, two works very related to ours are (Chung and Zisserman, 2017) and (Mittal et al., 2020). In (Chung and Zisserman, 2017), the contrastive loss function is utilized to enforce a smaller distance between lip-synced audio-visual counterparts; our work represents a novel application of the contrastive loss, employed for lip-sync detection in (Chung and Zisserman, 2017), to deepfake detection. Additionally, we show that learning audio-visual dissonance over small chunks and aggregating these measures over the video duration is more beneficial than directly learning video-level features.

  • Both multimedia and emotional audio-visual features are learned for FD in (Mittal et al., 2020). We differ from (Mittal et al., 2020) in three respects: (a) We do not explicitly learn emotional audio-visual coherence, since forged videos need not be emotional in nature; audio-visual consistency is instead enforced via the contrastive loss in our approach. (b) The training framework in (Mittal et al., 2020) requires a real-fake video pair. Our approach does not constrain the training process to involve video pairs, and adopts the traditional training protocol. (c) Whilst (Mittal et al., 2020) perform a video-level comparison of audio-visual features to model dissonance, we compare smaller chunks and aggregate chunk-wise measures to obtain the MDS. This enables our approach to localize transient forgeries.

  • Given that existing datasets primarily involve visual manipulations (a number of datasets do not have an audio component), our architecture also includes audio and visual sub-networks which are designed to learn discriminative unimodal features via the cross-entropy loss. Our experiments show that additionally including the cross-entropy loss is more beneficial than employing only the contrastive loss. Enforcing the two losses in conjunction enables our approach to achieve state-of-the-art performance on the DFDC dataset.

3. MDS-based fake video detection

As discussed in Section 1, our FD technique is based on the hypothesis that deepfake videos have higher dissonance between the audio and visual streams as compared to real videos. The dissimilarity between the audio and visual channels for a real/fake video is obtained via the Modality Dissonance Score (MDS), which is obtained as the aggregate dissimilarity computed over 1-second visual-audio chunks. In addition, our network enforces learning of discriminative visual and auditory features even if the contrastive loss is not part of the learning objective; this enables FD even if the input video is missing the audio/visual modality, in which case the contrastive loss is not computable. A description of our network architecture for MDS-based deepfake detection follows.

3.1. Overview

Given an input video, our aim is to classify it as real or fake. We begin with a training dataset of labeled videos, where the label of each video indicates whether it is real or fake. The training procedure is depicted in Fig. 1. We extract the audio signal from the input video using the ffmpeg library, and split it into fixed-length segments (1 second each, as noted in Section 1). Likewise for the visual modality, we divide the input video into segments of the same duration, and perform face tracking on these video segments using the S3FD face detector (Zhang et al., 2017) to extract the face crops. This pre-processing gives us a sequence of visual-frame segments along with the corresponding audio segments, one pair per temporal segment of the input video.
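The segmentation step can be sketched as follows. This is only an illustration under stated assumptions, not the pipeline used in the paper: the helper `segment_bounds` and its parameters are hypothetical, a trailing partial segment is assumed to be dropped, and audio extraction (ffmpeg) and face tracking (S3FD) are left out.

```python
def segment_bounds(num_frames, fps, num_samples, sample_rate, seg_seconds=1.0):
    """Return aligned (frame_range, sample_range) pairs, one per segment.

    Assumption: a trailing partial segment is dropped, so every segment
    has the same number of frames and the same number of audio samples.
    """
    frames_per_seg = int(round(fps * seg_seconds))
    samples_per_seg = int(round(sample_rate * seg_seconds))
    # The segment count is limited by whichever modality is shorter.
    n = min(num_frames // frames_per_seg, num_samples // samples_per_seg)
    return [
        ((t * frames_per_seg, (t + 1) * frames_per_seg),
         (t * samples_per_seg, (t + 1) * samples_per_seg))
        for t in range(n)
    ]


# Illustrative values: a 10.5-second clip at 30 fps with 16 kHz audio.
bounds = segment_bounds(num_frames=315, fps=30,
                        num_samples=168000, sample_rate=16000)
print(len(bounds), bounds[0], bounds[-1])
```

Each returned pair indexes one visual segment and its matching audio segment, which is the unit on which the dissimilarity is later computed.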

We employ a bi-stream network, comprising the audio and video streams, for deepfake detection. Each video segment is passed through the visual stream, and the corresponding audio segment is passed through the audio stream. These streams are described in Sections 3.2 and 3.3. The network is trained using a combination of the contrastive loss and the cross-entropy loss. The contrastive loss is meant to tie up the audio and visual streams; it ensures that the video and audio features are closer for real videos, and farther apart for fake videos. Consequently, one can expect a low MDS for real, and a high MDS for fake videos. If either the audio or visual stream is missing in the input video, in which case the contrastive loss is not computable, the video and audio streams will still learn discriminative features as constrained by the unimodal cross-entropy losses. These loss functions are described in Sec. 3.4.

3.2. Visual Stream

The input to the visual stream is a video segment whose size is determined by the 3 RGB color channels of each frame, the frame height and width, and the number of frames in the segment, i.e., the segment length in seconds multiplied by the video frame rate. Table 1 depicts the architecture of the video and audio sub-networks. The visual stream architecture is inspired by the 3D-ResNet of (Hara et al., 2017). 3D-CNNs have been widely used for action recognition in videos, and ResNet is one of the most popular architectures for image classification. The feature representation learnt by the visual stream, in particular the fc8 output, is used to compute the contrastive loss. We also add a 2-neuron classification layer at the end of this stream, which outputs the visual-based prediction label. The output of this layer is used in the unimodal cross-entropy loss for the visual stream.

Visual Stream        | Audio Stream
conv1                | conv_1, 3×3, 1, 64
                     | batch_norm_1, 64
                     | pool_1, 1×1, MaxPool
conv2_x              | conv_2, 3×3, 64, 192
                     | batch_norm_2, 192
                     | pool_2, 3×3, MaxPool
conv3_x              | conv_3, 3×3, 192, 384
                     | batch_norm_3, 384
conv4_x              | conv_4, 3×3, 384, 256
                     | batch_norm_4, 256
conv5_x              | conv_5, 3×3, 256, 256
                     | batch_norm_5, 256
                     | pool_5, 3×3, MaxPool
average pool         | conv_6, 5×4, 256, 512
                     | batch_norm_6, 512
fc7, 256×7×7, 4096   | fc7, 512×21, 4096
batch_norm_7, 4096   | batch_norm_7, 4096
fc8, 4096, 1024      | fc8, 4096, 1024
batch_norm_8, 1024   | batch_norm_8, 1024
dropout              | dropout
fc10, 1024, 2        | fc10, 1024, 2
Table 1. Structure of the audio and visual streams (initial layers of the visual stream are the same as in the 3D ResNet architecture (Hara et al., 2017)).

3.3. Audio Stream

Mel-frequency cepstral coefficient (MFCC) features are input to the audio stream. MFCC features (Mogran et al., 2004) are widely used for speaker and speech recognition (Martinez et al., 2012), and have been the state-of-the-art for over three decades. For each audio segment, MFCC values are computed and passed through the audio stream. 13 mel frequency bands are used at each time step. Overall, audio is encoded as a heat-map image representing MFCC values for each time step and each mel frequency band. We base the audio stream architecture on convolutional neural networks designed for image recognition. The contrastive loss uses the fc8 output of the audio stream. Similar to the visual stream, we add a classification layer at the end of the audio stream, and its output is incorporated in the cross-entropy loss for the audio modality.

Figure 2. Effect of loss functions: The graphs show the effect of audio and visual cross-entropy loss functions (Section 4.3). The top and bottom graphs show the MDS distribution for test samples, when the contrastive loss alone and combined losses are used in the training of the MDS network, respectively.

3.4. Loss functions

Inspired by (Chung and Zisserman, 2017), we use contrastive loss as the key component of the objective function. Contrastive loss enables maximization of the dissimilarity score for manipulated video sequences, while minimizing the MDS for real videos. This consequently ensures separability of the real and fake videos based on MDS (see Fig. 2). The loss function is represented by Equation 1; it depends on the label of each video and on a margin hyper-parameter. The dissimilarity score (Equation 2) is the Euclidean distance between the segment-based feature representations of the visual and audio streams.
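For reference, one plausible form of the contrastive objective is written out below. The notation is an assumption introduced here for illustration: $y^i$ is the label of video $i$, with $y^i = 1$ taken to mean fake and $y^i = 0$ real; $f_v$ and $f_a$ are the fc8 features of the visual and audio streams for segment $t$; and $d_t^i$ is the segment dissimilarity. The expression is the standard margin-based contrastive loss, arranged to match the behavior described above (small distances for real videos, distances of at least the margin for fake ones).

```latex
\begin{align}
L_{c} &= \frac{1}{N}\sum_{i=1}^{N}\Big[(1 - y^{i})\,\big(d_t^{i}\big)^{2}
        + y^{i}\,\max\!\big(\mathrm{margin} - d_t^{i},\,0\big)^{2}\Big],\\
d_t^{i} &= \big\lVert f_v - f_a \big\rVert_{2}.
\end{align}
```

With this convention, a real segment is penalized by its squared distance, while a fake segment is penalized only when its distance falls below the margin.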

In addition, we employ the cross-entropy loss for the visual and audio streams to learn feature representations in a robust manner. These loss functions are defined in Equations 3 (visual) and 4 (audio). The overall loss is a weighted sum of these three losses, i.e., the contrastive loss and the two cross-entropy losses, as in Eq. 5. The values of the three weights are fixed in our design.
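A minimal sketch of how the three terms might be combined is given below. Several details are assumptions made for illustration: the label convention (0 = real, 1 = fake), the use of a binary cross-entropy on a single fake probability in place of the 2-neuron output layer, the equal default weights, and all function names. Only the margin value of 0.99 is taken from Section 4.2.

```python
import math


def contrastive_loss(distances, labels, margin=0.99):
    """Margin-based contrastive loss over segment distances.

    Assumed label convention: 0 = real (pull features together),
    1 = fake (push features at least `margin` apart).
    """
    total = 0.0
    for d, y in zip(distances, labels):
        total += (1 - y) * d ** 2 + y * max(margin - d, 0.0) ** 2
    return total / len(distances)


def cross_entropy(prob_fake, labels, eps=1e-12):
    """Binary cross-entropy on predicted fake probabilities."""
    total = 0.0
    for p, y in zip(prob_fake, labels):
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(labels)


def total_loss(distances, p_visual, p_audio, labels, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the contrastive, visual and audio losses.

    The equal default weights are an illustrative assumption.
    """
    w_c, w_v, w_a = weights
    return (w_c * contrastive_loss(distances, labels)
            + w_v * cross_entropy(p_visual, labels)
            + w_a * cross_entropy(p_audio, labels))
```

Setting a weight to zero switches the corresponding term off, which is one way to express the single-stream and contrastive-only configurations examined in the ablation study.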

3.5. Test Inference

During test inference, the visual segments and corresponding audio segments of a video are passed through the visual and audio streams, respectively. For each such segment pair, the dissimilarity score (Equation 2) is computed, and the scores are aggregated over all segments of the video to obtain the MDS.

To label the test video, we compare its MDS with a threshold: the video is labeled fake if the MDS exceeds the threshold, and real otherwise. The threshold is determined on the training set. We compute MDS for both real and fake videos of the train set, and the midpoint between the average values for the real and fake videos is used as a representative value for the threshold.
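The inference rule can be sketched in a few lines. This is an illustration under assumptions: the aggregate is taken to be the mean over segments, and the function names and toy scores are hypothetical.

```python
def mds(segment_distances):
    """Aggregate segment dissimilarities into a video-level score.

    Assumption: the aggregate is the mean over segments.
    """
    return sum(segment_distances) / len(segment_distances)


def midpoint_threshold(real_scores, fake_scores):
    """Midpoint between the average MDS of real and fake training videos."""
    mean_real = sum(real_scores) / len(real_scores)
    mean_fake = sum(fake_scores) / len(fake_scores)
    return 0.5 * (mean_real + mean_fake)


def predict_fake(segment_distances, tau):
    """Label a video as fake (True) when its MDS exceeds the threshold."""
    return mds(segment_distances) > tau


# Toy training scores (made-up numbers, for illustration only).
tau = midpoint_threshold(real_scores=[0.2, 0.4], fake_scores=[0.8, 1.0])
print(tau, predict_fake([0.9, 0.7, 0.8], tau), predict_fake([0.1, 0.3], tau))
```

A midpoint threshold needs no extra tuning data, at the cost of ignoring how spread out the two score distributions are.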

4. Experiments

4.1. Dataset Description

As our method is multimodal, we use two public audio-visual deepfake datasets in our experiments. Their description is as follows:

Deepfake-TIMIT (Korshunov and Marcel, 2018): This dataset contains videos of 16 similar-looking pairs of people, which are manually selected from the publicly available VIDTIMIT (Sanderson and Lovell, 2009) database and manipulated using an open-source GAN-based approach. There are two different models for generating fake videos, one Low Quality (LQ) and the other High Quality (HQ), which differ in their input/output size. Each of the 32 subjects has 10 videos per model, resulting in a total of 640 face-swapped videos in the dataset; each video has a frame rate of 25 fps. However, the audio channel is not manipulated in any of the videos.

DFDC dataset (Dolhansky et al., 2019): The preview dataset, comprising 5214 videos, was released in Oct 2019 and the complete one, with 119,146 videos, in Dec 2019. Details of the manipulations have not been disclosed, in order to represent the real adversarial space of facial manipulation. The manipulations can be present in the audio channel, the visual channel, or both. For a fair comparison, we used 18,000 videos in our experiments (the same videos as used in (Mittal et al., 2020)). Each video has a frame rate of 30 fps.

4.2. Training Parameters

For both datasets, we used a 1-second segment duration, and the margin hyper-parameter described in Equation 1 was set to 0.99. This resulted in a (3 x 30 x 224 x 224) dimensional input for the visual stream in the case of the DFDC dataset; for the other dataset, Deepfake-TIMIT, the input dimension to the visual stream was (3 x 25 x 224 x 224). On DFDC we trained our model for 100 epochs with a batch size of 8, whereas for Deepfake-TIMIT the model only required 50 epochs with a batch size of 16 for convergence, as the dataset is small. We used the Adam optimizer with a learning rate of 0.001, and all the results were generated on an Nvidia Titan RTX GPU with 32 GB system RAM. For the evaluation, we use the video-wise Area Under the Curve (AUC) metric.
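For readers unfamiliar with the metric, video-wise AUC can be computed directly from per-video scores via the rank-sum identity, as sketched below. The function name and the choice of the MDS as the ranking score are assumptions for illustration; in practice a library routine would normally be used.

```python
def auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity.

    `labels` use 1 for fake (positive) and 0 for real; ties count as half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one real and one fake video")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    # Fraction of (fake, real) pairs in which the fake video scores higher.
    return wins / (len(pos) * len(neg))
```

The value is the probability that a randomly chosen fake video receives a higher score than a randomly chosen real one, so it does not depend on the decision threshold.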

Method         | Reference               | DFDC        | DFTIMIT LQ  | DFTIMIT HQ  | Modality
Capsule        | (Nguyen et al., 2019a)  | 53.3        | 78.4        | 74.4        | V
Multi-task     | (Nguyen et al., 2019b)  | 53.6        | 62.2        | 55.3        | V
HeadPose       | (Yang et al., 2019)     | 55.9        | 55.1        | 53.2        | V
Two-stream     | (Zhou et al., 2018)     | 61.4        | 83.5        | 73.5        | V
VA-MLP         | (Matern et al., 2019)   | 61.9        | 61.4        | 62.1        | V
VA-LogReg      | (Matern et al., 2019)   | 66.2        | 77.0        | 77.3        | V
Meso4          | (Afchar et al., 2018)   | 75.3        | 87.8        | 68.4        | V
Xception-raw   | (Rossler et al., 2019)  | 49.9        | 56.7        | 54.0        | V
Xception-c40   | (Rossler et al., 2019)  | 69.7        | 75.8        | 70.5        | V
Xception-c23   | (Rossler et al., 2019)  | 72.2        | 95.9        | 94.4        | V
FWA            | (Li and Lyu, 2018)      | 72.7        | 99.9        | 93.2        | V
DSP-FWA        | (Li and Lyu, 2018)      | 75.5        | 99.9        | 99.7        | V
Siamese-based  | (Mittal et al., 2020)   | 84.4        | 96.3        | 94.9        | AV
Our Method     |                         | 91.5 (91.6) | 97.9 (98.3) | 96.8 (94.7) | AV
Table 2. Comparison of our method with other techniques on the DFDC and DFTIMIT datasets using the AUC metric (V: visual only, AV: audio-visual). For our method, frame-wise results are reported inside the brackets.
Audio Stream | Contrastive Only | Visual Stream | Combined Loss
50.0         | 86.1             | 89.7          | 91.7
Table 3. AUC-based comparison of the audio stream, visual stream, contrastive loss only and combined loss configurations (details in Section 4.3).

4.3. Ablation Studies

To decide the structure of the network and study the effect of its different components, we chose 5800 videos (4000 real and 1800 fake) from the DFDC dataset, divided them into an 85:15 train-test split, and conducted the following experiments, evaluated with video-wise AUC:

Effect of Loss Functions: We evaluated the contribution of the cross-entropy loss functions of the audio and visual streams. The hypothesis behind adding these two loss functions to the network is that the feature representations learnt by the audio and visual streams will be more discriminative for the task of deepfake detection. This should further assist in computing a segment dissimilarity score that disambiguates between fake and real videos. This hypothesis was tested by training the network on the DFDC dataset in two different settings. In the first setting, the MDS network is trained with the contrastive loss only. In the second setting, the MDS network is trained with the combination of the three loss functions. Figure 2 shows two graphs generated using the MDS scores predicted by the networks from the two settings above. It is observed that the distributions of real and fake videos overlap less in the case of the network trained with all three losses. Overall, the networks trained with the combined loss and with the contrastive loss only achieved AUC scores of 92.7% and 86.1%, respectively.

The difference is attributable to the observation that the gap between the average MDS for real and fake videos widens when the cross-entropy loss functions are also used. Hence, there is a clearer distinction between the dissonance scores for real and fake videos.

Audio and Visual Stream Performance: We analysed the individual ability of the audio and the visual streams to discriminate between fake and real videos. In this case, the cross-entropy loss alone was used for training the streams. The audio-stream-only and visual-stream-only deepfake classifiers achieved AUC scores of 50.0% and 89.7%, respectively. Note that the audio stream performs worse because minimal manipulation is performed on the audio signal of the videos in the DFDC dataset.

Figure 3. Discriminative-ability of individual streams: The graphs show the distribution of the mean L2 distance, when a fake segment and corresponding real segment are passed through only the audio and visual streams.
Figure 4. Grad-CAM results: Three frames are overlaid with the important attention regions highlighted using Grad-CAM. Note that the forehead, face region and neck region are highlighted in the 1st, 2nd and 3rd frames, respectively.

For the four configurations (audio stream only, contrastive loss only, visual stream only and combined loss), the weights of the loss terms in Equation 6 are set so that only the corresponding terms are active. In the case of audio stream and visual stream based classification, the real and fake probabilities produced by the cross-entropy-trained classifier are generated segment-wise. We take the maximum over these segment-level probabilities to obtain the final fake and real probabilities for each of these two streams individually.
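The segment-to-video pooling for a single stream can be illustrated as follows. The function names are hypothetical, and the final decision rule (fake when the pooled fake probability exceeds the pooled real probability) is an assumption, since the text only specifies the maximum over segments.

```python
def video_level_probs(segment_probs):
    """Max-pool per-segment (p_real, p_fake) pairs into video-level values."""
    p_real = max(p[0] for p in segment_probs)
    p_fake = max(p[1] for p in segment_probs)
    return p_real, p_fake


def stream_predicts_fake(segment_probs):
    """Assumed decision rule: fake if the pooled fake probability is larger."""
    p_real, p_fake = video_level_probs(segment_probs)
    return p_fake > p_real


# Made-up segment outputs for one video.
example = [(0.9, 0.1), (0.3, 0.7), (0.6, 0.4)]
print(video_level_probs(example), stream_predicts_fake(example))
```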

To further assess the individual performance of the audio and visual streams, we generated the plots shown in Figure 3. The process is as follows: a fake video segment and the corresponding real video segment are passed through the streams, and the L2 distance is computed between the fc8 outputs of the visual/audio sub-network. The average of these L2 distances for a video pair is then plotted. In the audio stream plot, most of the videos are centered close to an inter-pair L2 distance of 0. This is because the audio has been modified in only a few cases in the DFDC dataset. On the contrary, in the plot for the visual stream, the inter-pair L2 scores are spread across the dataset. This supports the argument that the visual stream is more discriminative due to the added cross-entropy loss.
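The quantity plotted in Figure 3 can be sketched as below, assuming the fc8 features of the fake and real segments are available as equal-length lists of vectors. The function names and the toy vectors are hypothetical.

```python
import math


def l2(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def mean_pair_distance(fake_feats, real_feats):
    """Average L2 distance between corresponding fake/real segment features."""
    if len(fake_feats) != len(real_feats):
        raise ValueError("fake and real videos must have the same segment count")
    dists = [l2(f, r) for f, r in zip(fake_feats, real_feats)]
    return sum(dists) / len(dists)


# Two segments with 2-D toy features (real fc8 features are 1024-D).
print(mean_pair_distance([[0.0, 0.0], [1.0, 1.0]], [[3.0, 4.0], [1.0, 1.0]]))
```

A value near zero means the stream produces almost identical features for the fake video and its real source, which is the pattern reported for the audio stream.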

In Table 3, we show the AUC of the four configurations mentioned above. Note that these numbers are on a smaller subset of DFDC, which is used for ablation studies only. The contrastive-loss-only configuration, which uses both streams, achieves 86.1%. Adding the cross-entropy losses to the final MDS network (all three terms of Equation 6 active) achieves 92.7%. This backs up the argument that comparing (using the contrastive loss) features learned through supervised channels enhances the performance of the MDS network.

Segment Duration: To decide the optimal length of the temporal segment, we conducted an empirical analysis with several segment durations. From the evaluation on the DFDC dataset, we observed that a segment duration of 1 second is optimal. This can be attributed to the larger number of segments representing each video, which allows a fine-grained comparison of the audio and the visual data of a video.

4.4. Evaluation on DFDC Dataset

We compare the performance of our method on the DFDC dataset with other state-of-the-art works (Mittal et al., 2020; Nguyen et al., 2019a; Afchar et al., 2018; Rossler et al., 2019; Matern et al., 2019; Li et al., 2018; Yang et al., 2019; Zhou et al., 2018). A total of 18,000 videos are used in this experiment. (Please note that some earlier methods in Table 2 are trained on the DFDC preview (5000 videos), which is no longer available.) In Table 2, it is observed that the MDS network approach outperforms the other visual-only and audio-visual based approaches by achieving 91.54%. This is a relative improvement of about 8% over the other audio-visual approach (Mittal et al., 2020). Please note that the result of 91.54% is achieved with a test set containing 3281 videos, out of which 281 videos have two subjects each. For these videos, we chose the larger face and passed it through the visual stream. The performance of the network without these multiple-subject videos is 93.50%. We also report the frame-wise AUC (91.60%), shown in brackets in Table 2. This is computed by assigning each frame of a video the same label as predicted for the video by our method.

We argue that the gain in performance here is due to: (a) The temporal segmentation of the video into 1-second segments, which helps in fine-grained audio-visual feature comparison; (b) The extraction of task-tuned features from the audio and visual segments, where task-tuned means that the features are learnt to discriminate between real and fake with the visual and audio cross-entropy loss functions; and (c) The visual stream’s input, which is the face and an extra margin (see Figure 1) around it, accounting for some rigid (head) movement along with the non-rigid (facial) movements. We visualise the important regions using the Gradient-weighted Class Activation Mapping (Grad-CAM) method (Selvaraju et al., 2017). Figure 4 shows the important regions localised by Grad-CAM on a few frames of a video. Note that the region around the face is highlighted in the middle frame. Also, the forehead and the neck regions are highlighted in the first and third frames, respectively. This supports our argument that the disharmony between the non-rigid and rigid movements is also discriminative for the visual stream when classifying real and fake videos.

4.5. Evaluation on DFTIMIT Dataset

The DeepFake-TIMIT (DFTIMIT) dataset is smaller than the DFDC dataset. We trained the MDS network in two resolution settings: LQ and HQ. Table 2 compares our method with the other state-of-the-art methods. Our method achieves results (LQ: 97.92% and HQ: 96.87%) comparable to the top-performing method (Li and Lyu, 2018). The DFTIMIT test set contains only 96 videos in total, which implies that the penalty of each mis-classification on the overall AUC is high. It is interesting to note that our method mis-classified just 3 video samples in the HQ experiments and 2 videos in the LQ experiments. Li and Lyu (2018) achieve state-of-the-art results (LQ: 99.9% and HQ: 99.7%) on DFTIMIT, but a relatively lower AUC (72.7%) on the larger DFDC dataset. In comparison, our method achieves about 18% more than Li and Lyu (2018) on the DFDC dataset. This could also mean that the DFTIMIT dataset, owing to its smaller size, is now saturated, similar to the popular ImageNet dataset (Deng et al., 2009). We also report the frame-wise AUC (LQ: 98.3% and HQ: 94.7%) for DFTIMIT, shown in brackets in Table 2.
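To make the test-set size argument concrete, the short calculation below shows how much a single mis-classified video moves a per-video score when only 96 videos are evaluated. It is an illustration only: it treats the score as the plain fraction of correctly classified videos, which is a simplifying assumption and not necessarily how the reported numbers were computed.

```python
TEST_VIDEOS = 96  # size of the DFTIMIT test set stated in the text


def percent_correct(misclassified, total=TEST_VIDEOS):
    """Percentage of correctly classified videos (simplifying assumption)."""
    return 100.0 * (total - misclassified) / total


# Each single error costs a little over one percentage point.
per_video_penalty = 100.0 / TEST_VIDEOS

lq_percent = percent_correct(2)  # 2 errors, as stated for the LQ setting
hq_percent = percent_correct(3)  # 3 errors, as stated for the HQ setting
print(per_video_penalty, lq_percent, hq_percent)
```

Under this assumption, two and three errors give values that agree with the LQ and HQ figures quoted above up to rounding, which shows how coarse the score resolution is on a test set of this size.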

Figure 5. Forgery Localization results: In the two examples above, the red part of the curve is detected as fake and blue as real by our method. The segment of the video with segment wise dissimilarity above the threshold is labelled as fake and below the threshold is labelled as real. On the x-axis, we have the original segment labels and on the y-axis the segment-level dissimilarity score (Section 4.6).

4.6. Temporal Forgery Localization

With the advent of sophisticated forgery techniques, it is possible that an entire video or smaller portions of the video are manipulated to deceive the audience. If only parts of the original video are corrupted, it would be useful from the FD perspective to be able to locate the timestamps corresponding to the corrupted segments. In an interesting work, Wu et al. (2018) proposed a CNN which detects forgery along with the forged locations in images. However, their method is only applicable to copy-move image forgery. As we process the video by dividing it into temporal segments, a fine-grained analysis of the input video is possible, thereby enabling forgery localization. In contrast to the MDS network, earlier techniques (Mittal et al., 2020; Li and Lyu, 2018) computed features over the entire video. We argue that if a forgery has been performed on a small segment of a video, the forgery signatures in that segment may get diluted due to pooling across the whole video. A similar phenomenon is also observed in prior work relating to pain detection and localization (Sikka et al., 2014). As the pain event could be short and its location is not labeled, the authors divide the video into temporal segments for better pain detection.
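The dilution argument can be illustrated numerically. The sketch below uses made-up per-segment dissimilarity scores (hypothetical values, not measurements from our experiments) for a video in which only 2 of 20 segments are manipulated. Pooling is modelled as a plain average of segment scores, which is a simplification of pooling features over the whole video.

```python
# Hypothetical per-segment dissimilarity scores: low for real, high for fake.
REAL_SCORE, FAKE_SCORE = 0.2, 1.0
THRESHOLD = 0.6  # hypothetical decision threshold

scores = [REAL_SCORE] * 20
scores[8] = scores[9] = FAKE_SCORE  # only two manipulated segments

# Whole-video view: the two forged segments barely move the average
# (roughly 0.28 here), so the video falls below the threshold.
video_mean = sum(scores) / len(scores)
video_flagged = video_mean > THRESHOLD

# Segment-level view: the forged segments stand out individually.
flagged_segments = [i for i, s in enumerate(scores) if s > THRESHOLD]

print(video_mean, video_flagged, flagged_segments)
```

In this toy case the pooled score misses the forgery, while the segment-level scores both detect it and indicate where it occurs.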

Most of the datasets, including the ones used in this paper, have data manipulation performed on practically the entire video. To demonstrate the effectiveness of our method for forgery localization in fake videos, we joined segments from real and corresponding fake videos of the same subject at random locations. In Figure 5, we show the outputs on two videos created by mixing video segments of the same subject from the DFDC dataset. Here, the segment-wise score is shown on the y-axis. The segments for which the score is above a threshold are labelled as fake (red on the curve), and those below the threshold (blue on the curve) are labelled as real. Together with Figs. 2 and 4, forgery localization makes the working of the MDS-based fake detection framework more explainable and interpretable.
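The thresholding step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `localize`, the example scores and the threshold value are hypothetical, and the scores stand in for the segment-level dissimilarity produced by the network.

```python
def localize(scores, threshold):
    """Label each segment and collect contiguous runs of fake segments.

    Returns (labels, runs), where runs is a list of (start, end) segment
    indices, both inclusive.
    """
    labels = ["fake" if s > threshold else "real" for s in scores]
    runs = []
    start = None
    for i, label in enumerate(labels):
        if label == "fake" and start is None:
            start = i                      # a fake run begins
        elif label != "fake" and start is not None:
            runs.append((start, i - 1))    # the run ended at the previous segment
            start = None
    if start is not None:                  # run extends to the last segment
        runs.append((start, len(labels) - 1))
    return labels, runs


# Hypothetical segment-wise dissimilarity scores for one spliced video.
scores = [0.21, 0.18, 0.95, 1.10, 0.88, 0.25, 0.19, 0.97]
labels, runs = localize(scores, threshold=0.6)
print(runs)  # [(2, 4), (7, 7)]
```

If each segment covers a fixed duration, the run indices translate directly into timestamps of the suspected manipulation.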

5. Conclusion and Future Work

We propose a novel bimodal deepfake detection approach based on the modality dissonance score (MDS), which captures the similarity between audio and visual streams for real and fake videos thereby facilitating separability. The MDS is modeled via the contrastive loss computed over segment-level audiovisual features, which constrains genuine audio-visual streams to be closer than fake counterparts. Furthermore, cross-entropy loss is enforced on the unimodal streams to ensure that they independently learn discriminative features. Experiments show that (a) the MDS-based FD framework can achieve state-of-the-art performance on the DFDC dataset, and (b) the unimodal cross-entropy losses provide extra benefit on top of the contrastive loss to enhance FD performance. Explainability and interpretability of the proposed approach are demonstrated via audio-visual distance distributions obtained for real and fake videos, Grad-CAM outputs denoting attention regions of the MDS network, and forgery localization results.
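As a compact recap of the two central quantities, the sketch below computes a contrastive loss over segment-level distances and an MDS-style aggregate. It is written under explicit assumptions: the loss uses the standard contrastive form, the margin value is hypothetical, the aggregate is taken to be the mean of per-segment Euclidean distances, and the unimodal cross-entropy terms are omitted. The feature vectors are toy values.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(distances, is_real, margin=1.0):
    """Standard contrastive loss (assumed form): pull real pairs together,
    push fake pairs at least `margin` apart."""
    total = 0.0
    for d, real in zip(distances, is_real):
        if real:
            total += d ** 2
        else:
            total += max(margin - d, 0.0) ** 2
    return total / len(distances)


def dissonance_score(visual_feats, audio_feats):
    """Mean audio-visual distance over the segments of one video
    (mean aggregation is an assumption)."""
    dists = [euclidean(v, a) for v, a in zip(visual_feats, audio_feats)]
    return sum(dists) / len(dists)


# Toy segment features: matching streams for "real", swapped for "fake".
visual = [(1.0, 0.0), (0.0, 1.0)]
audio_real = [(1.0, 0.0), (0.0, 1.0)]
audio_fake = [(0.0, 1.0), (1.0, 0.0)]

mds_real = dissonance_score(visual, audio_real)
mds_fake = dissonance_score(visual, audio_fake)
loss = contrastive_loss([0.5, 0.5], [True, False], margin=1.0)
print(mds_real, mds_fake, loss)
```

A video would then be called fake when its score exceeds a threshold chosen on held-out data; the choice of threshold is outside the scope of this sketch.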

Future work would focus on (a) incorporating human assessments (acquired via EEG and eye-gaze sensing) in addition to content analysis adopted in this work; (b) exploring algorithms such as multiple instance learning for transient forgery detection, and (c) achieving real-time forgery detection (accomplished by online intrusions) given the promise of processing audio-visual information over 1-second segments.

6. Acknowledgement

We are grateful to all the brave frontline workers who are working hard during this difficult COVID-19 situation.


  • D. Afchar, V. Nozick, J. Yamagishi, and I. Echizen (2018) Mesonet: a compact facial video forgery detection network. In 2018 IEEE International Workshop on Information Forensics and Security (WIFS), pp. 1–7. Cited by: §2.1, §4.4, Table 2.
  • S. Agarwal, H. Farid, Y. Gu, M. He, K. Nagano, and H. Li (2019) Protecting world leaders against deep fakes. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. Cited by: §2.2.
  • J. S. Chung and A. Zisserman (2017) Out of time: automated lip sync in the wild. pp. 251–263. External Links: ISBN 978-3-319-54426-7, Document Cited by: §1, item 2., §2.3, §3.4.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. Cited by: §4.5.
  • B. Dolhansky, R. Howes, B. Pflaum, N. Baram, and C. C. Ferrer (2019) The deepfake detection challenge (dfdc) preview dataset. External Links: 1910.08854 Cited by: §1, §1, §4.1.
  • L. Floridi (2018) Artificial intelligence, deepfakes and a future of ectypes. Philosophy & Technology 31 (3), pp. 317–321. Cited by: §1.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.), pp. 2672–2680. Cited by: §1.
  • T. Grimes (1991) Mild auditory‐visual dissonance in television news may exceed viewer attentional capacity. Cited by: §1.
  • D. Güera and E. J. Delp (2018) Deepfake video detection using recurrent neural networks. In 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 1–6. Cited by: §1, §2.2, §2.4.
  • K. Hara, H. Kataoka, and Y. Satoh (2017) Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet?. CoRR abs/1711.09577. External Links: Link, 1711.09577 Cited by: §3.2, Table 1.
  • P. Korshunov and S. Marcel (2018) DeepFakes: a new threat to face recognition? assessment and detection. CoRR abs/1812.08685. External Links: Link, 1812.08685 Cited by: §1, §4.1.
  • Y. Li, M. Chang, and S. Lyu (2018) In ictu oculi: exposing ai created fake videos by detecting eye blinking. In 2018 IEEE International Workshop on Information Forensics and Security (WIFS), pp. 1–7. Cited by: §2.2, §4.4.
  • Y. Li and S. Lyu (2018) Exposing deepfake videos by detecting face warping artifacts. In CVPR Workshops, Cited by: §2.1, §2.4, §4.5, §4.6, Table 2.
  • J. Martinez, H. Perez, E. Escamilla, and M. M. Suzuki (2012) Speaker recognition using mel frequency cepstral coefficients (mfcc) and vector quantization (vq) techniques. In Proceedings of the International Conference on Electrical Communications and Computers, pp. 248–251. Cited by: §3.3.
  • F. Matern, C. Riess, and M. Stamminger (2019) Exploiting visual artifacts to expose deepfakes and face manipulations. In 2019 IEEE Winter Applications of Computer Vision Workshops (WACVW), pp. 83–92. Cited by: §2.1, §4.4, Table 2.
  • T. Mittal, U. Bhattacharya, R. Chandra, A. Bera, and D. Manocha (2020) Emotions don’t lie: a deepfake detection method using audio-visual affective cues. External Links: 2003.06711 Cited by: item 2., item 3., §2.3, §4.4, §4.6, Table 2, footnote 4.
  • N. Morgan, H. Bourlard, and H. Hermansky (2004) Automatic speech recognition: an auditory perspective. In Speech Processing in the Auditory System, pp. 309–338. External Links: ISBN 978-0-387-21575-4, Document, Link Cited by: §3.3.
  • H. H. Nguyen, J. Yamagishi, and I. Echizen (2019a) Capsule-forensics: using capsule networks to detect forged images and videos. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2307–2311. Cited by: §1, §2.1, §2.4, §4.4, Table 2.
  • H. H. Nguyen, F. Fang, J. Yamagishi, and I. Echizen (2019b) Multi-task learning for detecting and segmenting manipulated facial images and videos. CoRR abs/1906.06876. External Links: Link, 1906.06876 Cited by: §2.1, Table 2.
  • A. Rossler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner (2019) Faceforensics++: learning to detect manipulated facial images. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1–11. Cited by: §4.4, Table 2.
  • C. Sanderson and B. C. Lovell (2009) Multi-region probabilistic histograms for robust and scalable identity inference. In Advances in Biometrics, M. Tistarelli and M. S. Nixon (Eds.), Berlin, Heidelberg, pp. 199–208. External Links: ISBN 978-3-642-01793-3 Cited by: §1, §4.1.
  • R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017) Grad-cam: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pp. 618–626. Cited by: §4.4.
  • K. Sikka, A. Dhall, and M. S. Bartlett (2014) Classification and weakly supervised pain localization using multiple segment representation. Image and vision computing 32 (10), pp. 659–670. Cited by: §4.6.
  • J. Thies, M. Zollhöfer, M. Stamminger, C. Theobalt, and M. Nießner (2016) Face2Face: real-time face capture and reenactment of rgb videos. In Proceedings of the IEEE Computer Vision and Pattern Recognition (CVPR), IEEE, Cited by: §2.
  • P. Vincent, H. Larochelle, Y. Bengio, and P. Manzagol (2008) Extracting and composing robust features with denoising autoencoders. In International Conference on Machine Learning, ICML ’08, pp. 1096–1103. Cited by: §1.
  • Y. Wu, W. Abd-Almageed, and P. Natarajan (2018) Busternet: detecting copy-move image forgery with source/target localization. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 168–184. Cited by: §4.6.
  • A. Yadlin-Segal and Y. Oppenheim (0) Whose dystopia is it anyway? deepfakes and social media regulation. Convergence 0 (0), pp. 1354856520923963. External Links: Document, Link Cited by: §1.
  • X. Yang, Y. Li, and S. Lyu (2019) Exposing deep fakes using inconsistent head poses. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8261–8265. Cited by: §2.1, §4.4, Table 2.
  • S. Zhang, X. Zhu, Z. Lei, H. Shi, X. Wang, and S. Z. Li (2017) S3fd: single shot scale-invariant face detector. In Proceedings of the IEEE International Conference on Computer Vision, pp. 192–201. Cited by: §3.1.
  • P. Zhou, X. Han, V. I. Morariu, and L. S. Davis (2018) Two-stream neural networks for tampered face detection. CoRR abs/1803.11276. External Links: Link, 1803.11276 Cited by: §2.1, §4.4, Table 2.