Unmasking Deepfakes: Masked Autoencoding Spatiotemporal Transformers for Enhanced Video Forgery Detection

06/12/2023
by Sayantan Das, et al.

We present a novel approach for detecting deepfake videos using a pair of vision transformers pre-trained with a self-supervised masked autoencoding setup. Our method consists of two distinct components: one learns spatial information from individual RGB frames of the video, while the other learns temporal consistency information from optical flow fields generated from consecutive frames. Unlike most approaches, where pre-training is performed on a large generic corpus of images, we show that strong results can be obtained by pre-training on smaller face-related datasets, namely Celeb-A for the spatial component and YouTube Faces for the temporal component. We evaluate our method on commonly used benchmarks, namely FaceForensics++ (Low Quality and High Quality, along with a new highly compressed version named Very Low Quality) and Celeb-DFv2. Our experiments show that our method sets a new state of the art on FaceForensics++ (LQ, HQ, and VLQ) and obtains competitive results on Celeb-DFv2. Moreover, our method outperforms other methods in a cross-dataset setup, where we fine-tune our model on FaceForensics++ and test on Celeb-DFv2, pointing to its strong cross-dataset generalization ability.
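To make the two-stream design more concrete, below is a minimal PyTorch-style sketch of how such an architecture could be assembled from the description above. The class names (StreamEncoder, TwoStreamDeepfakeDetector), the 2-channel optical-flow input, and the concatenation-plus-MLP fusion head are illustrative assumptions; the authors' actual implementation, their masked-autoencoding pre-training stage, and their fusion strategy may differ.

import torch
import torch.nn as nn

class StreamEncoder(nn.Module):
    # A ViT-style encoder for one stream; in the paper each stream is
    # pre-trained with masked autoencoding before fine-tuning for detection.
    def __init__(self, in_channels, embed_dim=768, depth=12, num_heads=12, patch=16, img=224):
        super().__init__()
        self.patch_embed = nn.Conv2d(in_channels, embed_dim, kernel_size=patch, stride=patch)
        num_patches = (img // patch) ** 2
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads,
                                           dim_feedforward=4 * embed_dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, x):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2) + self.pos_embed
        return self.encoder(tokens).mean(dim=1)  # global average-pooled feature

class TwoStreamDeepfakeDetector(nn.Module):
    # Fuses the spatial (RGB) and temporal (optical-flow) features into a
    # real/fake prediction. The concatenation + MLP fusion here is an
    # assumption, not taken from the paper.
    def __init__(self, embed_dim=768):
        super().__init__()
        self.rgb_stream = StreamEncoder(in_channels=3)   # spatial stream (pre-trained on Celeb-A)
        self.flow_stream = StreamEncoder(in_channels=2)  # temporal stream (pre-trained on YouTube Faces flow)
        self.head = nn.Sequential(nn.Linear(2 * embed_dim, 256), nn.GELU(), nn.Linear(256, 1))

    def forward(self, rgb_frame, flow_field):
        fused = torch.cat([self.rgb_stream(rgb_frame), self.flow_stream(flow_field)], dim=-1)
        return self.head(fused)  # logit: > 0 suggests "fake"

# Example usage: one RGB frame and one 2-channel optical-flow field
model = TwoStreamDeepfakeDetector()
logit = model(torch.randn(1, 3, 224, 224), torch.randn(1, 2, 224, 224))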

