Do You Really Mean That? Content Driven Audio-Visual Deepfake Dataset and Multimodal Method for Temporal Forgery Localization

by Zhixi Cai et al.

Due to its high societal impact, deepfake detection is receiving active attention in the computer vision community. Most deepfake detection methods rely on identity-, facial-attribute-, and adversarial-perturbation-based spatio-temporal modifications applied to the whole video or at random locations, while keeping the meaning of the content intact. However, a sophisticated deepfake may contain only a small segment of manipulated video/audio, through which the meaning of the content can be, for example, completely inverted from a sentiment perspective. To address this gap, we introduce a content-driven audio-visual deepfake dataset, termed Localized Audio Visual DeepFake (LAV-DF), explicitly designed for the task of learning temporal forgery localization. Specifically, the content-driven audio-visual manipulations are performed at strategic locations in order to change the sentiment polarity of the whole video. Our baseline method for benchmarking the proposed dataset is a 3DCNN model, termed Boundary Aware Temporal Forgery Detection (BA-TFD), which is guided by contrastive, boundary matching, and frame classification loss functions. Extensive quantitative analysis demonstrates the strong performance of the proposed method on both the temporal forgery localization and deepfake detection tasks.
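The abstract names three training signals guiding BA-TFD: a contrastive loss over audio-visual embeddings, a boundary matching loss for fake-segment boundaries, and a per-frame real/fake classification loss. The following is a minimal sketch of how such a combined multi-task objective could look; the function name, tensor shapes, loss weights, and margin value are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def combined_loss(audio_emb, video_emb, boundary_pred, boundary_target,
                  frame_prob, frame_labels, same_source,
                  w_c=1.0, w_b=1.0, w_f=1.0, margin=0.99):
    """Hypothetical weighted sum of the three signals named in the abstract.

    audio_emb, video_emb : (batch, dim) modality embeddings
    boundary_pred/target : (batch, T) boundary confidence maps
    frame_prob/labels    : (batch, T) per-frame fake probabilities / labels
    same_source          : (batch,) 1 if audio and video match, else 0
    All weights and the margin are assumed values for illustration.
    """
    # Contrastive term: pull matching audio/video embeddings together,
    # push manipulated (mismatched) pairs at least `margin` apart.
    d = np.linalg.norm(audio_emb - video_emb, axis=1)
    l_contrast = np.mean(same_source * d ** 2
                         + (1 - same_source) * np.maximum(margin - d, 0.0) ** 2)

    # Boundary matching term: regress predicted boundary confidence maps
    # toward the ground-truth fake-segment boundaries (MSE as a stand-in).
    l_boundary = np.mean((boundary_pred - boundary_target) ** 2)

    # Frame classification term: per-frame binary cross-entropy.
    eps = 1e-7
    p = np.clip(frame_prob, eps, 1 - eps)
    l_frame = -np.mean(frame_labels * np.log(p)
                       + (1 - frame_labels) * np.log(1 - p))

    return w_c * l_contrast + w_b * l_boundary + w_f * l_frame
```

Each term is non-negative, so the combined objective is as well; in practice the weights would be tuned so that no single signal dominates training.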


