DF-TransFusion: Multimodal Deepfake Detection via Lip-Audio Cross-Attention and Facial Self-Attention

09/12/2023
by   Aaditya Kharel, et al.

With the rise of manipulated media, deepfake detection has become an imperative task for preserving the authenticity of digital content. In this paper, we present a novel multi-modal audio-video framework designed to jointly process audio and video inputs for deepfake detection. Our model capitalizes on lip synchronization with the input audio through a cross-attention mechanism while extracting visual cues via a fine-tuned VGG-16 network. Subsequently, a transformer encoder network performs facial self-attention. We conduct multiple ablation studies highlighting different strengths of our approach. Our multi-modal methodology outperforms state-of-the-art multi-modal deepfake detection techniques in terms of F1 and per-video AUC scores.
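The lip-audio cross-attention described above can be illustrated with a minimal single-head sketch: query vectors derived from one modality (lip features) attend over key/value vectors from the other (audio features), so each lip frame is re-expressed as a weighted mix of audio frames. All shapes, dimensions, and names here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention (single head, no projections).

    queries: (T_q, d) embeddings from one modality, e.g. lip frames
    keys/values: (T_k, d) embeddings from the other, e.g. audio frames
    Returns (T_q, d): each query re-expressed over the other modality.
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)          # (T_q, T_k) similarities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ values                         # attention-weighted mix

# Hypothetical feature streams: 16 lip-frame and 40 audio-frame embeddings.
rng = np.random.default_rng(0)
lip = rng.standard_normal((16, 64))
audio = rng.standard_normal((40, 64))
fused = cross_attention(lip, audio, audio)
print(fused.shape)  # (16, 64)
```

In a full model this would sit inside a multi-head attention layer with learned query/key/value projections; the sketch keeps only the core alignment computation.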


