NPVForensics: Jointing Non-critical Phonemes and Visemes for Deepfake Detection

06/12/2023
by Yu Chen, et al.

Deepfake technologies empowered by deep learning are evolving rapidly, raising new security concerns for society. Existing multimodal detection methods typically expose Deepfake videos by capturing audio-visual inconsistencies. More worryingly, advanced Deepfake techniques can now align audio and video within the critical phoneme-viseme regions, achieving a more realistic tampering effect and posing new challenges. To address this problem, we propose NPVForensics, a novel Deepfake detection method that mines the correlation between non-critical phonemes and visemes. First, we propose a Local Feature Aggregation block with Swin Transformer (LFA-ST) to effectively construct the non-critical phoneme-viseme and corresponding facial feature streams. Second, we design a loss function over the fine-grained motion of the talking face to measure the evolutionary consistency of non-critical phonemes and visemes. Third, we design a phoneme-viseme awareness module for cross-modal feature fusion and representation alignment, which narrows the modality gap and better exploits the intrinsic complementarity of the two modalities. Finally, a self-supervised pre-training strategy is leveraged to thoroughly learn the audio-visual correspondences in natural videos, so that our model can be adapted to downstream Deepfake datasets with simple fine-tuning. Extensive experiments on existing benchmarks demonstrate that the proposed approach outperforms state-of-the-art methods.
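To make the evolutionary-consistency idea concrete, below is a minimal PyTorch sketch of one plausible form of such a loss: it compares the frame-to-frame changes (first-order temporal differences) of temporally aligned phoneme and viseme feature streams via cosine similarity. The function name, tensor shapes, and exact formulation are illustrative assumptions, not the authors' implementation.

    # A hypothetical sketch of an "evolutionary consistency" loss in the
    # spirit of NPVForensics: it compares the frame-to-frame evolution of
    # the phoneme and viseme feature streams. Names, shapes, and the exact
    # formulation are assumptions, not the paper's implementation.
    import torch
    import torch.nn.functional as F

    def evolutionary_consistency_loss(phoneme_feats: torch.Tensor,
                                      viseme_feats: torch.Tensor) -> torch.Tensor:
        """Both inputs: (batch, time, dim), temporally aligned streams."""
        # First-order temporal differences approximate fine-grained motion,
        # i.e. how each stream evolves from one frame to the next.
        d_p = phoneme_feats[:, 1:] - phoneme_feats[:, :-1]
        d_v = viseme_feats[:, 1:] - viseme_feats[:, :-1]
        # Penalize divergent evolution: 1 - cosine similarity per step.
        cos = F.cosine_similarity(d_p, d_v, dim=-1)  # (batch, time - 1)
        return (1.0 - cos).mean()

    # Usage: on random streams the loss sits near 1.0 (uncorrelated motion);
    # genuinely aligned audio-visual streams should drive it toward 0.
    loss = evolutionary_consistency_loss(torch.randn(2, 16, 128),
                                         torch.randn(2, 16, 128))

The design intuition is that even when critical phoneme-viseme regions have been carefully calibrated by a forger, the subtle co-evolution of the non-critical regions is hard to fake, so a motion-level consistency signal remains discriminative.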


Related research:

- 08/14/2023: Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder. "In recent research, slight performance improvement is observed from auto..."
- 05/29/2020: Not made for each other - Audio-Visual Dissonance-based Deepfake Detection and Localization. "We propose detection of deepfake videos based on the dissimilarity betwe..."
- 09/01/2021: CTAL: Pre-training Cross-modal Transformer for Audio-and-Language Representations. "Existing audio-language task-specific predictive approaches focus on bui..."
- 08/26/2021: Multi-Modulation Network for Audio-Visual Event Localization. "We study the problem of localizing audio-visual events that are both aud..."
- 06/16/2021: LiRA: Learning Visual Speech Representations from Audio through Self-supervision. "The large amount of audiovisual content being shared online today has dr..."
- 04/16/2021: Multimodal Deception Detection in Videos via Analyzing Emotional State-based Feature. "Deception detection is an important task that has been a hot research to..."
- 08/11/2022: Hybrid Transformer Network for Deepfake Detection. "Deepfake media is becoming widespread nowadays because of the easily ava..."
