
Facial Expression Recognition using Vanilla ViT backbones with MAE Pretraining

by Jia Li, et al.
Hefei University of Technology

Humans usually convey emotions, voluntarily or involuntarily, through facial expressions. Automatically recognizing basic expressions (such as happiness, sadness, and neutral) from a facial image, i.e., facial expression recognition (FER), is extremely challenging and attracts much research interest. Large-scale datasets and powerful inference models have been proposed to address the problem. Though considerable progress has been made, most state-of-the-art methods, which employ convolutional neural networks (CNNs) or elaborately modified Vision Transformers (ViTs), depend heavily on upstream supervised pretraining. Transformers are displacing CNNs as the dominant architecture in more and more computer vision tasks, but they usually need much more data to train, since they have fewer inductive biases than CNNs. To explore whether a vanilla ViT without extra training samples from upstream tasks is able to achieve competitive accuracy, we use a plain ViT with MAE pretraining to perform the FER task. Specifically, we first pretrain the original ViT as a Masked Autoencoder (MAE) on a large facial expression dataset without expression labels. Then, we fine-tune the ViT on popular facial expression datasets with expression labels. The presented method is quite competitive, reaching 90.22% on RAF-DB and 61.73% on AffectNet, and can serve as a simple yet strong ViT-based baseline for FER studies.
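
The MAE pretraining step mentioned in the abstract hides a random subset of image patches and trains the model to reconstruct them. The abstract gives no implementation details, so the following is only a minimal sketch of the patch-masking idea in plain Python. The function name `random_masking`, the 75% mask ratio, the 224x224 input size, and the 16x16 patch size are all assumptions chosen for illustration, not values taken from this paper.

```python
import random


def random_masking(num_patches, mask_ratio=0.75, seed=None):
    """Choose which patches stay visible to the encoder (MAE-style sketch).

    Returns (keep_ids, mask), where keep_ids are the sorted indices of the
    visible patches and mask[i] is 1 if patch i is hidden, else 0.
    The default mask_ratio is an illustrative assumption.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    num_keep = int(num_patches * (1.0 - mask_ratio))
    # Shuffle all patch indices and keep the first num_keep of them.
    order = list(range(num_patches))
    rng.shuffle(order)
    keep_ids = sorted(order[:num_keep])
    kept = set(keep_ids)
    mask = [0 if i in kept else 1 for i in range(num_patches)]
    return keep_ids, mask


# Assumed setup: a 224x224 image cut into 16x16 patches gives 14 * 14 patches.
num_patches = (224 // 16) ** 2
keep_ids, mask = random_masking(num_patches, mask_ratio=0.75, seed=0)
print(len(keep_ids), "visible patches,", sum(mask), "masked patches")
```

In a full pipeline, only the visible patches would be passed to the ViT encoder during pretraining, and a lightweight decoder would predict the pixels of the masked ones; for fine-tuning on labeled expression data, the masking is dropped and the encoder sees every patch.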




Learning Vision Transformer with Squeeze and Excitation for Facial Expression Recognition

As various databases of facial expressions have been made accessible ove...

CLIPER: A Unified Vision-Language Framework for In-the-Wild Facial Expression Recognition

Facial expression recognition (FER) is an essential task for understandi...

Facial Expression Recognition using Convolutional Neural Networks: State of the Art

The ability to recognize facial expressions automatically enables novel ...

Facial Expression Recognition Using a Hybrid CNN-SIFT Aggregator

Deriving an effective facial expression recognition component is importa...

Generating near-infrared facial expression datasets with dimensional affect labels

Facial expression analysis has long been an active research area of comp...

Facial Expressions Recognition with Convolutional Neural Networks

Over the centuries, humans have developed and acquired a number of ways ...

AU-Supervised Convolutional Vision Transformers for Synthetic Facial Expression Recognition

The paper describes our proposed methodology for the six basic expressio...