ViTransPAD: Video Transformer using convolution and self-attention for Face Presentation Attack Detection

03/03/2022
by Zuheng Ming, et al.

Face Presentation Attack Detection (PAD) is an important measure for preventing spoofing attacks against face biometric systems. Many works based on Convolutional Neural Networks (CNNs) formulate face PAD as an image-level binary classification task without considering context. Alternatively, Vision Transformers (ViT), which use self-attention to attend to the context of an image, have become mainstream in face PAD. Inspired by ViT, we propose a Video-based Transformer for face PAD (ViTransPAD) with short/long-range spatio-temporal attention, which can not only focus on local details with short-range attention within a frame but also capture long-range dependencies across frames. Instead of using coarse, single-scale image patches as in ViT, we propose the Multi-scale Multi-Head Self-Attention (MsMHSA) architecture, which assigns multi-scale patch partitions of the Q, K, V feature maps to the heads of the transformer in a coarse-to-fine manner, enabling the model to learn a fine-grained representation for pixel-level discrimination in face PAD. Because pure transformers lack the inductive biases of convolutions, we also introduce convolutions into ViTransPAD, integrating the desirable properties of CNNs through convolutional patch embedding and convolutional projection. Extensive experiments show the effectiveness of ViTransPAD, which offers a favorable accuracy-computation balance and can serve as a new backbone for face PAD.
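
The multi-scale idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each head receives one single-channel Q, K, V feature map per frame, that patches are non-overlapping, and that tokens from all frames attend to each other within a head. The names `partition`, `attention`, and `ms_mhsa` are hypothetical, and the convolutional patch embedding and convolutional projection that would produce the Q, K, V maps are omitted.

```python
import math

def partition(frames, p):
    """Split every H x W frame into non-overlapping p x p patches.

    Returns one flattened token per patch, collected over all frames,
    so a later attention step can relate patches across time.
    """
    tokens = []
    for frame in frames:
        h, w = len(frame), len(frame[0])
        assert h % p == 0 and w % p == 0, "patch size must divide the frame"
        for i in range(0, h, p):
            for j in range(0, w, p):
                tokens.append([frame[i + di][j + dj]
                               for di in range(p) for dj in range(p)])
    return tokens

def softmax(xs):
    m = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Plain scaled dot-product attention over lists of token vectors."""
    scale = math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / scale for kj in k]
        weights = softmax(scores)
        out.append([sum(wt * vj[c] for wt, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out

def ms_mhsa(q_maps, k_maps, v_maps, patch_sizes):
    """Assumed layout: q_maps[h] is the list of frames for head h.

    Head h partitions its Q, K, V maps with patch_sizes[h], so heads
    listed with decreasing patch sizes go from coarse to fine.
    """
    return [attention(partition(q, p), partition(k, p), partition(v, p))
            for q, k, v, p in zip(q_maps, k_maps, v_maps, patch_sizes)]

# Toy input: 2 frames of 4 x 4 values, reused for Q, K, and V in both heads.
frames = [[[(t * 16 + i * 4 + j) / 32.0 for j in range(4)]
           for i in range(4)] for t in range(2)]
heads = ms_mhsa([frames, frames], [frames, frames], [frames, frames],
                patch_sizes=[4, 2])
for p, head in zip([4, 2], heads):
    print(f"patch {p}x{p}: {len(head)} tokens of length {len(head[0])}")
```

With these toy sizes, the coarse head sees one token per frame while the fine head sees four, which is the sense in which finer patch partitions give attention access to more local detail at the cost of more tokens.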


