Rethinking Vision Transformer and Masked Autoencoder in Multimodal Face Anti-Spoofing

02/11/2023
by   Zitong Yu, et al.

Recently, vision transformer (ViT) based multimodal learning methods have been proposed to improve the robustness of face anti-spoofing (FAS) systems. However, no work has yet explored the fundamental properties (e.g., modality-aware inputs, suitable multimodal pre-training, and efficient finetuning) of vanilla ViT for multimodal FAS. In this paper, we investigate three key factors (i.e., inputs, pre-training, and finetuning) in ViT for multimodal FAS with RGB, Infrared (IR), and Depth. First, in terms of the ViT inputs, we find that leveraging local feature descriptors benefits the ViT on the IR modality but not on the RGB or Depth modalities. Second, observing that directly finetuning the whole or partial ViT is inefficient, we design an adaptive multimodal adapter (AMA), which can efficiently aggregate local multimodal features while freezing the majority of ViT parameters. Finally, given the task gap (FAS vs. generic object classification) and the modality gap (multimodal vs. unimodal), ImageNet pre-trained models might be sub-optimal for the multimodal FAS task. To bridge these gaps, we propose the modality-asymmetric masked autoencoder (M^2A^2E) for multimodal FAS self-supervised pre-training without costly annotated labels. Compared with the previous modality-symmetric autoencoder, the proposed M^2A^2E is able to learn more intrinsic task-aware representations and is compatible with modality-agnostic (e.g., unimodal, bimodal, and trimodal) downstream settings. Extensive experiments in both unimodal (RGB, Depth, IR) and multimodal (RGB+Depth, RGB+IR, Depth+IR, RGB+Depth+IR) settings on multimodal FAS benchmarks demonstrate the superior performance of the proposed methods. We hope these findings and solutions can facilitate future research on ViT-based multimodal FAS.
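
The seven evaluation settings named in the abstract are the non-empty subsets of the three modalities. As a small illustration only, the sketch below enumerates them; the function name `modality_settings` and the `+`-joined labels are assumptions made for this example and are not taken from the authors' code.

```python
from itertools import combinations

# The three modalities named in the abstract.
MODALITIES = ("RGB", "Depth", "IR")


def modality_settings(modalities=MODALITIES):
    """Return every non-empty subset of the modalities, smallest subsets first.

    Hypothetical helper for illustration; labels are joined with '+'.
    """
    settings = []
    for size in range(1, len(modalities) + 1):
        for combo in combinations(modalities, size):
            settings.append("+".join(combo))
    return settings


if __name__ == "__main__":
    # Prints the three unimodal settings, then the bimodal ones, then the trimodal one.
    for name in modality_settings():
        print(name)
```

With three modalities this yields 3 unimodal, 3 bimodal, and 1 trimodal setting, which is the range of downstream configurations the abstract describes as modality-agnostic.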

research
07/26/2023

Visual Prompt Flexible-Modal Face Anti-Spoofing

Recently, vision transformer based multimodal learning methods have been...
research
02/16/2022

Flexible-Modal Face Anti-Spoofing: A Benchmark

Face anti-spoofing (FAS) plays a vital role in securing face recognition...
research
12/12/2021

Self-Supervised Modality-Aware Multiple Granularity Pre-Training for RGB-Infrared Person Re-Identification

While RGB-Infrared cross-modality person re-identification (RGB-IR ReID)...
research
07/11/2022

A Closer Look at Invariances in Self-supervised Pre-training for 3D Vision

Self-supervised pre-training for 3D vision has drawn increasing research...
research
01/30/2023

M3FAS: An Accurate and Robust MultiModal Mobile Face Anti-Spoofing System

Face presentation attacks (FPA), also known as face spoofing, have broug...
research
05/23/2023

Training Transitive and Commutative Multimodal Transformers with LoReTTa

Collecting a multimodal dataset with two paired modalities A and B or B ...
research
12/08/2021

Unimodal Face Classification with Multimodal Training

Face recognition is a crucial task in various multimedia applications su...
