Learning Vision Transformer with Squeeze and Excitation for Facial Expression Recognition

07/07/2021
by   Mouath Aouayeb, et al.

As various databases of facial expressions have been made accessible over the last few decades, the Facial Expression Recognition (FER) task has received a lot of interest. The multiple sources of the available databases raise several challenges for the facial expression recognition task. These challenges are usually addressed by Convolutional Neural Network (CNN) architectures. Different from CNN models, a Transformer model based on an attention mechanism has recently been presented to address vision tasks. One of the major issues with Transformers is their need for large amounts of training data, while most FER databases are limited compared to other vision applications. Therefore, in this paper we propose to learn a vision Transformer jointly with a Squeeze and Excitation (SE) block for the FER task. The proposed method is evaluated on different publicly available FER databases including CK+, JAFFE, RAF-DB and SFEW. Experiments demonstrate that our model outperforms state-of-the-art methods on CK+ and SFEW and achieves competitive results on JAFFE and RAF-DB.


1 Introduction

Year after year, human life is increasingly intertwined with Artificial Intelligence (AI)-based systems. As a result, there is growing interest in technologies that can understand and interact with humans, or that can improve contact between humans. To that end, more researchers are involved in developing automated FER methods, which can be grouped into three categories: handcrafted, deep learning and hybrid. The main handcrafted solutions [506414, 7026204, 9378702] are based on techniques like LBP, HOG and OF. They present good results on lab-made databases (CK+ [5543262] and JAFFE [670949]); in contrast, they perform modestly on wild databases (SFEW [6130508] and RAF-DB [li2017reliable]). Some researchers [9191181, NEURIPS2020_a51fb975, Farzaneh_2021_WACV] have taken advantage of advances in deep learning techniques, especially in CNN architectures, to outperform previous handcrafted solutions. Others [article1, 9084763] propose solutions that mix handcrafted techniques with deep learning techniques to address specific challenges in FER.

Impressive results [NIPS2017_3f5ee243, devlin-etal-2019-bert, liu2019roberta] from Transformer models on NLP tasks have motivated the vision community to study the application of Transformers to computer vision problems. The idea is to represent an image as a sequence of patches, in analogy to the sequence of words in a sentence in the NLP domain. Transformers are made to learn, in parallel, relations between the sequence inputs through an attention mechanism, which makes them theoretically suitable for both NLP and image processing tasks. The Transformer was first introduced by Vaswani et al. [NIPS2017_3f5ee243] as a machine translation model, and multiple variants [NIPS2017_3f5ee243, devlin-etal-2019-bert, liu2019roberta] were then proposed to increase the model accuracy and overcome various NLP challenges. Recently, the ViT has been presented for different computer vision tasks, from image classification [dosovitskiy2020] and object detection [carion2020end] to image data generation [jiang2021transgan]. The Transformer has proven its capability and surpassed state-of-the-art performance in different NLP applications as well as in vision applications. However, these attention-based architectures are computationally more demanding than CNNs and are data-hungry during training.

In this paper, we propose to alleviate the problem caused by the lack of training data that the ViT faces on FER with an SE block. We also provide an analysis of the internal representations of the ViT on facial expressions. The contributions of this paper are four-fold:

  • Introduction of an SE block to optimize the learning of the ViT.

  • Fine-tuning of the ViT on FER-2013 [carrier2013fer] database for FER task.

  • Test of the model on four different databases (CK+ [5543262], JAFFE [670949], RAF-DB [li2017reliable], SFEW [6130508]).

  • Analysis of the attention mechanism of the ViT and the effect of the SE block.

The remainder of this paper is organized as follows. Section 2 reviews the related work. Section 3 first gives an overview of the proposed method and then describes the details of the ViT and the SE block. Section 4 presents the experimental results. Finally, Section 5 concludes the paper.

2 Related Works

In this section, we briefly review some related works on ViT and facial expression recognition solutions.

2.1 Vision Transformer (ViT)

The ViT was first proposed by Dosovitskiy et al. [dosovitskiy2020] for image classification. The main part of the model is the encoder of the Transformer, as first introduced for machine translation by Vaswani et al. [NIPS2017_3f5ee243]. To transform the images into a sequence of patches they use a linear projection, and for the classification they use only the class token vector. The model achieves state-of-the-art performance on ImageNet [5206848] classification thanks to pre-training on JFT-300M [8237359]. From that, and from the fact that this model contains many more parameters (about 100M) than CNNs, we can say that ViTs are data-hungry models. To address this heavy reliance on large-scale databases, Touvron et al. [touvron2020deit] proposed the DEIT model. It is a ViT with two classification tokens: the first one is fed to an MLP head for the classification, and the other one is used in a distillation process with a CNN teacher model pretrained on ImageNet [5206848]. The DEIT was only trained on ImageNet and outperforms both the ViT model and the teacher model. Yuan et al. [yuan2021tokens] overcome the same limitation of the ViT using a novel tokenization process. The proposed T2T-ViT [yuan2021tokens] model has two modules: 1) the T2T tokenization module, which consists of two steps, re-structurization and soft split, to model the local information and reduce the length of tokens progressively, and 2) the Transformer encoder module. It achieves state-of-the-art performance on ImageNet [5206848] classification without pre-training on JFT-300M [8237359].

2.2 Facial Expression Recognition

The FER task has progressed from handcrafted [506414, 7026204, 9378702] solutions to deep learning [9191181, Otberdout2018DeepCD, Farzaneh_2021_WACV, Wang2020RegionAN] and hybrid [article1, 9084763, Ma2021RobustFE] solutions. In 2014, Turan et al. [7026204] proposed a region-based handcrafted system for FER. They extracted features from the eye and mouth regions using LPQ and PHOG, used PCA for feature selection, fused the two groups of features with CCA and, finally, applied an SVM as a classifier. More recent work [9378702] proposed an automatic FER system based on LBP and HOG as feature extractors, a local linear embedding technique to reduce the feature dimensionality, and an SVM for the classification. They reached state-of-the-art performance for handcrafted solutions on JAFFE [670949], KDEF [kdef51] and RafD [rafd52]. Recently, more challenging and richer data have been made publicly available, and with the progress of deep learning architectures many deep learning solutions based on CNN models have been proposed. Otberdout et al. [Otberdout2018DeepCD] proposed to use SPC to replace the fully connected layer in a CNN architecture for facial expression classification. Wang et al. [Wang2020RegionAN] proposed a region-based solution with a CNN model and two attention blocks. They compute different crops of the same image and apply a CNN to each patch; a self-attention module is then applied, followed by a relation-attention module. In the self-attention block, they use a loss function such that one of the cropped images may receive a weight larger than the weight given to the input image. More recently, Farzaneh et al. [Farzaneh_2021_WACV] integrated an attention block to estimate the weights of features, with a sparse center loss to achieve intra-class compactness and inter-class separation. Deep learning based solutions have widely outperformed handcrafted solutions, especially on wild databases like RAF-DB [li2017reliable], SFEW [6130508], AffectNet [8013713] and others.

Other researchers have thought about combining deep learning techniques with handcrafted techniques into a hybrid system. Levi et al. [article1] proposed to apply a CNN to the image, its LBP and the LBP mapped to a 3D space using MDS. Xu et al. [9084763] proposed to fuse CNN features with LBP features and used PCA as a feature selector. Recently, many Transformer models have been introduced for different computer vision tasks, and in that context Ma et al. [Ma2021RobustFE] proposed a convolutional vision Transformer. They extract features from the input image as well as from its LBP map using a ResNet18, fuse the extracted features with an attentional selective fusion module, and feed the output to a Transformer encoder with an MLP head to perform the classification. To our knowledge, [Ma2021RobustFE] is the first solution based on a Transformer architecture for FER. However, our proposed solution differs in applying the Transformer encoder directly to the image rather than to extracted features, which may reduce the complexity of the system and helps to study and analyse the application of the ViT to the FER problem as one of the interesting vision tasks.

Table 8 (presented in the Supplementary Material) summarizes some state-of-the-art approaches with details on the architectures and databases used. We can notice that different databases are used to address different issues and challenges. From these databases, we selected four to study our proposed solution and compare it with state-of-the-art works. The selected databases are described in the experiments and comparison Section 4. The next section describes our proposed solution.

3 Proposed Method

In this section, we introduce the proposed solution in three separate paragraphs: an overview, then some details of the ViT architecture and the attention mechanism, and finally the SE block.

3.1 Architecture overview

The proposed solution contains two main parts: a vision Transformer to extract local attention features and an SE block to extract global relations from the extracted features, which may optimize the learning process on small facial expression databases.

Figure 1: Overview of the proposed solution. The used ViT is the base version with 14 layers of Transformer encoder and a patch size of 16×16. The ViT is already trained on the JFT-300M [8237359] database and fine-tuned on the ImageNet-1K [5206848] database.

3.2 Vision Transformer

The vision Transformer consists of two steps: the tokenization and the Transformer encoder. In the tokenization step, the image is split into patches of equal size, each of which is flattened into a vector. An extra learnable vector, called "cls_tkn", is added as the classification token. Each vector is marked with a position value. To summarize, the input of the Transformer encoder is the sequence of patch vectors plus the classification token, all of the same length.
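As an illustration, a minimal PyTorch sketch of this tokenization step is given below. The 16×16 patch size, 224×224 input and 768-dimensional embedding follow the ViT-B16-224 configuration used in Section 4.2; the rest is illustrative and not our exact implementation.

```python
# Sketch of the ViT tokenization step: split the image into fixed-size patches,
# flatten and linearly project them, prepend a learnable classification token
# (cls_tkn), and add learnable position embeddings. Illustrative only.
import torch
import torch.nn as nn

class PatchTokenizer(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is a common way to flatten + project the patches.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_tkn = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_emb = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                       # x: (B, 3, 224, 224)
        x = self.proj(x)                        # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)        # (B, 196, 768) patch vectors
        cls = self.cls_tkn.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)          # prepend the classification token
        return x + self.pos_emb                 # mark each vector with its position
```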

As shown in Figure 1, the Transformer encoder is a sequence of blocks built around the attention module. The main part of the attention block is the MHA. The MHA is built from several heads of self-attention, also called intra-attention. According to [NIPS2017_3f5ee243], the idea of self-attention is to relate different positions of a single sequence in order to compute a representation of that sequence. For a given sequence, three layers are used: a Q-layer, a K-layer and a V-layer, and the self-attention function is a mapping of a query (Q, or Q-layer) and a set of key-value (K, or K-layer; V, or V-layer) pairs to an output. The self-attention function is summarized by Equation (1):

\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \qquad (1)

The MHA of Equation (2) is then:

\mathrm{MHA}(Q,K,V)=\mathrm{Concat}(\mathrm{head}_1,\ldots,\mathrm{head}_h)\,W^{O},\quad \mathrm{head}_i=\mathrm{Attention}(QW_i^{Q},\,KW_i^{K},\,VW_i^{V}) \qquad (2)

where the projections $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$ and $W^{O}$ are parameter matrices.
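For clarity, the following PyTorch sketch illustrates Equations (1) and (2). The embedding dimension and number of heads correspond to the standard ViT-Base configuration and are given only as an example; this is not our exact implementation.

```python
# Sketch of the self-attention of Eq. (1) and the multi-head attention of Eq. (2),
# following [NIPS2017_3f5ee243]; shapes are illustrative.
import math
import torch.nn as nn

def self_attention(q, k, v):
    # Eq. (1): softmax(Q K^T / sqrt(d_k)) V
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    return scores.softmax(dim=-1) @ v

class MultiHeadAttention(nn.Module):
    def __init__(self, dim=768, num_heads=12):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.q_layer = nn.Linear(dim, dim)   # W^Q
        self.k_layer = nn.Linear(dim, dim)   # W^K
        self.v_layer = nn.Linear(dim, dim)   # W^V
        self.out = nn.Linear(dim, dim)       # W^O

    def forward(self, x):                    # x: (B, N, dim)
        B, N, _ = x.shape
        split = lambda t: t.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        q, k, v = split(self.q_layer(x)), split(self.k_layer(x)), split(self.v_layer(x))
        heads = self_attention(q, k, v)                  # one attention per head
        heads = heads.transpose(1, 2).reshape(B, N, -1)  # concatenate the heads
        return self.out(heads)                           # Eq. (2)
```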

3.3 Squeeze and Excitation (SE)

The Squeeze and Excitation block, shown on the right of Figure 1, is also an attention mechanism. It contains far fewer parameters than a self-attention block, as shown by Equation (3), where two fully connected layers are used with only one pointwise multiplication. It was first introduced in [iandola2016squeezenet] to optimize CNN architectures as a channel-wise attention module. Concretely, we use only the excitation part, since the squeeze part is a pooling layer built to reduce the dimension of the 2D CNN feature maps.

\mathrm{SE}(cls\_tkn)=cls\_tkn\otimes\sigma\big(FC_2(\delta(FC_1(cls\_tkn)))\big) \qquad (3)

where $FC_1$ and $FC_2$ are fully connected layers with respectively $C/r$ ($r$ being a reduction ratio) and $C$ neurons, $C$ is the length of cls_tkn, the classification token vector, $\delta$ and $\sigma$ are the intermediate and gating activations, and $\otimes$ is a pointwise multiplication. The idea of using the SE in our architecture is to optimize the learning of the ViT by learning more global attention relations between the extracted local attention features. Thus, the SE is introduced on top of the Transformer encoder, more precisely on the classification token vector. Unlike the self-attention block, which is used inside the Transformer encoder to encode the input sequence and extract features through cls_tkn, the SE is applied to recalibrate the feature responses by explicitly modelling inter-dependencies among the cls_tkn channels.
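A minimal sketch of the SE block applied to cls_tkn, together with a classification head, is given below. The reduction ratio, the choice of ReLU/sigmoid activations and the final classifier layer are illustrative assumptions rather than our exact hyper-parameters.

```python
# Sketch of Eq. (3): the excitation part of the SE block applied to the
# classification token produced by the Transformer encoder. Illustrative only;
# reduction ratio, activations and classifier are assumptions.
import torch.nn as nn

class SEHead(nn.Module):
    def __init__(self, dim=768, reduction=16, num_classes=7):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim // reduction)  # FC_1: C/r neurons (assumed ratio)
        self.fc2 = nn.Linear(dim // reduction, dim)  # FC_2: C neurons
        self.act = nn.ReLU(inplace=True)             # delta (assumed)
        self.gate = nn.Sigmoid()                     # sigma (assumed)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, cls_tkn):                      # cls_tkn: (B, C)
        w = self.gate(self.fc2(self.act(self.fc1(cls_tkn))))
        recalibrated = cls_tkn * w                   # pointwise multiplication of Eq. (3)
        return self.classifier(recalibrated)
```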

4 Experiments and Comparison

In this section, we first describe the databases used, then provide an ablation study of the different contributions with further details on the proposed solution, together with an analysis of additional visualisations for an in-depth understanding of the ViT applied to the FER task. Finally, we present a comparison with state-of-the-art works.

4.1 FER Databases

CK+ [5543262]: published in 2010, it is an extended version of the CK database. It contains 593 sequences taken in a lab environment in two image formats. It covers the 7 basic expressions, which are Angry, Disgust, Fear, Happy, Neutral, Sad and Surprise, plus the Contempt expression. In our case, we only work on the 7 basic expressions to allow a fair comparison with the other databases and with most state-of-the-art solutions.
JAFFE [670949]: the JAFFE database consists of 213 gray scale images of acted Japanese female facial expressions, all resized to 256×256. It contains the 7 basic expressions.
FER-2013 [carrier2013fer]: the FER-2013 database, sometimes referred to as FERPlus, contains almost 35k facial expression images covering the 7 basic expressions. It was published in 2013 for a challenge on the Kaggle platform (https://www.kaggle.com/msambare/FER-2013). The images were collected from the web, converted to gray scale and resized to 48×48. This database could suffer from mislabeling, since a limited human accuracy is reported on it. However, since it is a large database of spontaneous facial expressions, we use it as pre-training data for our model.
SFEW [6130508]: SFEW is a very challenging database with images captured from different movies. It contains 1,766 RGB images and is also labeled with the 7 basic expressions.
RAF-DB [li2017reliable]: RAF-DB is a recent database with nearly 30K mixed RGB and gray scale images collected from different internet websites. It contains two separate subsets: one with the 7 basic expressions and the other with 12 compound facial expressions. In the experiments, we use the 7 basic expression version.

Table 7 (presented in the Supplementary Material) summarizes the previously presented databases, with the year, the publication venue and some other details. For the FER task, there are other publicly available databases that address different issues, but we restricted our choice to these databases because they are the main focus of most state-of-the-art solutions.

4.2 Architecture and training parameters

In all experiments, we use a pretrained ViT-B16-224 (weights: https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/vision_transformer.py), the base version of the ViT with a 16×16 patch size and a 224×224 input image size. Since ViT training needs large data to reach good performance, we use the following data augmentations: random horizontal flip, random gray scale conversion, and different values of brightness, contrast and saturation. All images are converted to 3 channels, resized to 224×224 and normalized. The regularisation methods used in this work are Cutout [devries2017cutout] and Mixup [zhang2018mixup]. The training is performed with categorical cross entropy as the loss function and AdamW [Loshchilov2019DecoupledWD] as the optimizer. The learning rate is kept fixed during training, with a batch size of 16. When training on the FER-2013 database, the number of epochs is fixed to 8; for the rest of the databases it is fixed to 10. The training is carried out on a Tesla K80 TPU with 8 cores using PyTorch 1.7.
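The following sketch summarizes this training setup. It assumes timm's "vit_base_patch16_224" as a stand-in for the pretrained weights linked above; the learning-rate value, gray-scale probability, jitter strengths and normalization statistics are placeholders, and Cutout/Mixup are omitted for brevity.

```python
# Sketch of the training setup described above, under stated assumptions.
import timm
from torch import nn, optim
from torchvision import transforms

train_tf = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomGrayscale(p=0.2),                       # probability is an assumption
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),  # assumed strengths
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),       # assumed statistics
])

model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=7)
criterion = nn.CrossEntropyLoss()                            # categorical cross entropy
optimizer = optim.AdamW(model.parameters(), lr=1e-4)         # lr value is a placeholder
# Train for 8 epochs on FER-2013, then 10 epochs on the target database,
# with a batch size of 16, as described in the text.
```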

4.3 Ablation Study

In the ablation study, we assess the performance of the ViT architecture, the added SE block and the use of FER-2013 [carrier2013fer] as pre-training data. Table 1 shows the results of the different experiments on CK+, JAFFE, RAF-DB and SFEW.

Model Pre-train CK+ [5543262] JAFFE [670949] RAF-DB [li2017reliable] SFEW [6130508]
ViT * 0.9857 0.8823 0.8595 0.3828
ViT + SE * 0.9949 0.9061 0.8618 0.4084
ViT FER-2013[carrier2013fer]* 0.9817 0.9483 0.8703 0.5035
ViT + SE FER-2013[carrier2013fer]* 0.9980 0.9292 0.8722 0.5429

* The used ViT model is already trained on ImageNet [5206848].

Table 1: Ablation Study

From the first line, we can notice that the ViT can reach state-of-the-art performance on lab-made databases like CK+ [5543262] and JAFFE [670949]; however, on SFEW [6130508] the Transformer is less effective. In all cases, we can notice a benefit from using the SE and from the pre-training phase on FER-2013 [carrier2013fer]. The two contributions may not be complementary on lab-made data (CK+ [5543262] and JAFFE [670949]). For example, on CK+ [5543262] we can notice that the pre-training improves the performance only when combined with the SE. On JAFFE [670949], the best solution is the one that relies on pre-training without the SE. However, on wild databases (RAF-DB [li2017reliable] and SFEW [6130508]) the added value of both contributions is more noticeable; in particular, on SFEW [6130508] we obtain a 16% gain in accuracy compared to the ViT without either the SE or the pre-training on FER-2013 [carrier2013fer].

Figure 2: Confusion matrices of ViT+SE on the validation set of RAF-DB (left) and the validation set of SFEW (right).

The confusion matrices of the proposed ViT+SE pre-trained on FER-2013 are reported in Figure 2; the left plot corresponds to the validation set of RAF-DB [li2017reliable] and the right plot to the validation set of SFEW [6130508]. The Happy and Neutral expressions are the best recognized on the SFEW [6130508] database, with accuracies of 85% and 69% respectively. For RAF-DB [li2017reliable], the Happy expression has the best accuracy with 96%, followed by the Angry expression with 92%. On both confusion matrices, we can notice that our model has difficulties recognizing the Fear expression, which may be due to the smaller amount of data provided for that expression compared to the others.

4.4 Transformer visualisation and analysis

In this section, we conduct various experiments on the RAF-DB database. Specifically, we evaluate the classification outputs of the model through t-SNE and provide a visual analysis of the performance of the ViT model with the SE in comparison with a CNN.

Figure 3: t-SNE plots of the 768-dimensional features from the ViT and from ViT+SE (before and after the SE block), and of the 512-dimensional features from the ResNet50. The features correspond to the RAF-DB images. The accuracies of ResNet50, ViT and ViT+SE on RAF-DB are respectively 0.8061, 0.8595 and 0.8618.

Figure 3 shows the t-SNE of the features extracted from the ViT model without SE and of the features of the ViT+SE before and after the SE block, compared with the t-SNE of features from a ResNet50 [7780459] also trained on RAF-DB. Based on the t-SNE, the ViT architectures enable a better separation of the classes compared to the baseline CNN architecture (ResNet50).
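The t-SNE plots can be reproduced with a short script such as the sketch below, assuming that `features` is an (N, 768) array of cls_tkn vectors (or (N, 512) ResNet50 features) extracted from the RAF-DB validation images and `labels` holds the 7 expression classes; the perplexity value is illustrative.

```python
# Sketch of the t-SNE analysis of Figure 3 (illustrative, not our exact script).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features: np.ndarray, labels: np.ndarray, title: str):
    # Project the high-dimensional features to 2D for visual inspection.
    emb = TSNE(n_components=2, perplexity=30, init="pca",
               random_state=0).fit_transform(features)
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab10", s=5)
    plt.title(title)
    plt.show()
```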

Figure 4: GRAD-CAM, Score-CAM, Eigen-CAM maps of the last layer before the classification block for the Happy expression (image from the validation set of RAF-DB [li2017reliable]).

In addition, the SE block enhances the robustness of the ViT model, as the distances between clusters are maximized. Interestingly, the features before the SE form more compact clusters, with a lower inter-cluster distance than the features after the SE, which may suggest that the features before the SE are more robust than those after the SE. However, when we tried to use the before-SE features directly for the classification task, no performance gain was observed. Figure 4 shows different attention maps of the ViT, the ViT+SE and the ResNet50, using the Grad-CAM [8237336], Score-CAM [9150840] and Eigen-CAM [Muhammad2020EigenCAMCA] tools. This visualisation shows that the ViT architectures succeed in focusing more locally, which confirms the interest of using self-attention blocks for computer vision tasks. Once again, we can notice the gain of using the SE block with the different tools, but mostly with Eigen-CAM [Muhammad2020EigenCAMCA].

Other investigations of the ViT architecture are presented in the Supplementary Material. Figure 5 shows the evolution of the attention from the first attention block to the deeper attention blocks; we can notice that the focus of the ViT goes from global attention to more local attention. This particular behaviour of the ViT on the FER task motivates the use of the SE block on top of it, to build a calibrated relation between the different local focuses. In Figure 6 (Supplementary Material), we show the focus of the ViT compared to the ViT+SE for different facial expressions; it shows how the SE can rectify the local attention features extracted by the ViT by searching for global attention relations.

4.5 Comparison with state-of-the-art

In this section, we compare our proposed model, ViT+SE pre-trained on the FER-2013 [carrier2013fer] database, with state-of-the-art solutions on 2 lab-made databases (CK+ [5543262] and JAFFE [670949]) and 2 wild databases (RAF-DB [li2017reliable] and SFEW [6130508]). Table 2 shows that we obtain the highest accuracy on CK+ [5543262] with 99.80% using a 10-fold cross-validation protocol. Table 5 shows that we set a new state-of-the-art performance for single models on SFEW [6130508] with 54.29% accuracy; however, a higher accuracy (56.4%) is reported in [Wang2020RegionAN] using ensemble models. Furthermore, Table 3 shows that the proposed solution has a good 10-fold cross-validation accuracy on JAFFE [670949] with 92.92%. To our knowledge, it is the highest performance of a deep learning based solution, but it is still almost 3% below the highest accuracy obtained with a recently proposed handcrafted solution [9378702]. Table 4 shows that our solution obtains a good result on RAF-DB [li2017reliable] with an accuracy of 87.22%, which positions it as the third best solution among the state-of-the-art on this database, nearly 3% below the best record.

Ref. Model Type Accuracy
[7026204] 2014 Handcrafted 0.9503
[NEURIPS2020_a51fb975] 2020 Deep Learning 0.9759
[Minaee2021DeepEmotionFE] 2021 Deep Learning 0.9800
ViT + SE Deep Learning 0.9980
Table 2: Comparison on CK+ [5543262] with 10-fold cross validation.
Ref. Model Type Accuracy
[6998925] 2015 Handcrafted 0.9180
[9378702] 2020 Handcrafted 0.9600
[Minaee2021DeepEmotionFE] 2021 Deep Learning 0.9280
ViT + SE Deep Learning 0.9292
Table 3: Comparison on JAFFE [670949] with 10-fold cross validation.
Ref. Model Type Accuracy
[Wang2020RegionAN] 2020 Deep Learning 0.8690
[Ma2021RobustFE] 2021 Hybrid 0.8814
[Shi2021LearningTA] 2021 Deep Learning 0.9055
ViT + SE Deep Learning 0.8722
Table 4: Comparison on the validation set of RAF-DB [li2017reliable].
Ref. Model Type Accuracy
[Otberdout2018DeepCD] 2018 Deep Learning 0.4918
[Cai2018IslandLF] 2018 Deep Learning 0.5252
[Wang2020RegionAN] 2020 Deep Learning 0.5419
ViT + SE Deep Learning 0.5429
Table 5: Comparison on the validation set of SFEW [6130508].

5 Conclusion

In this work, we introduced ViT+SE, a simple scheme that optimizes the learning of the ViT with an attention block called Squeeze and Excitation. It performs impressively well at improving the performance of the ViT on the FER task. Furthermore, it also improves the robustness of the model, as shown in the t-SNE representation of the extracted features and in the attention maps. We have presented the classification performance on lab-made databases (CK+ and JAFFE) and wild databases (RAF-DB and SFEW) to evaluate the gain of the SE block and of the use of FER-2013 as a pre-training database. By comparison with different state-of-the-art solutions, we have shown that our proposed solution achieves the highest performance with a single model on CK+ and SFEW, and competitive results on JAFFE and RAF-DB. As future work, we aim to extend the ViT architecture to address the temporal aspect for a more challenging task such as micro-expression recognition.

References

1 Cross-database evaluation and visual illustrations

Cross-database evaluation: To verify the generalisation ability of our model, we conduct a cross-database evaluation on CK+. The results are summarized in Table 6. They show that the ViT generalizes better than a baseline CNN (ResNet50), and that the proposed ViT+SE model achieves the best generalization across the different training databases when tested on CK+. However, the generalization ability is still modest and we aim to improve it in future work. A sketch of the evaluation protocol is given after Table 6.

Model Train Test Accuracy
ResNet50 CK+ CK+ 0.9488
RAF-DB CK+ 0.3517
SFEW CK+ 0.2905
FER2013 CK+ 0.3456
ViT CK+ CK+ 0.9817
RAF-DB CK+ 0.5443
SFEW CK+ 0.3812
FER2013 CK+ 0.4098
ViT+SE CK+ CK+ 0.9980
RAF-DB CK+ 0.5576
SFEW CK+ 0.5341
FER2013 CK+ 0.6514
Table 6: Cross-database evaluation on CK+.
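A minimal sketch of this cross-database protocol is given below: a model trained on one database (e.g. RAF-DB) is evaluated unchanged on CK+. The ImageFolder-style directory layout and the paths in the usage comment are hypothetical.

```python
# Sketch of the cross-database evaluation of Table 6 (illustrative only).
import torch
from torch.utils.data import DataLoader
from torchvision import datasets

@torch.no_grad()
def evaluate(model, data_dir, transform, device="cuda"):
    # data_dir is assumed to contain one sub-folder per expression class.
    loader = DataLoader(datasets.ImageFolder(data_dir, transform), batch_size=16)
    model.eval().to(device)
    correct = total = 0
    for images, targets in loader:
        preds = model(images.to(device)).argmax(dim=1).cpu()
        correct += (preds == targets).sum().item()
        total += targets.size(0)
    return correct / total

# e.g. accuracy on CK+ of a ViT+SE model trained on RAF-DB (paths are hypothetical):
# acc = evaluate(model, "data/ckplus", eval_transform)
```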

Attention Maps: In this work, we used Grad-CAM [8237336], Score-CAM [9150840] and Eigen-CAM [Muhammad2020EigenCAMCA] as tools to provide a visual analysis of the proposed deep learning architectures (code available at https://github.com/jacobgil/pytorch-grad-cam); a minimal Grad-CAM sketch is given after the list below.
Grad-CAM [8237336]: the Gradient-weighted CAM uses the gradients of any target flowing into the selected layer of the model to generate a heat map that highlights the regions of the image that are important for predicting the target.
Score-CAM [9150840]: the Score-weighted CAM is a linear combination of weights and activation maps, where the weights are obtained from the score that each activation map yields for the target class in a forward pass.
Eigen-CAM [Muhammad2020EigenCAMCA]: it computes the principal components of the features learned by the model layers.
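As a minimal, self-contained stand-in for the linked pytorch-grad-cam library, the sketch below implements plain Grad-CAM with forward/backward hooks. It assumes `target_layer` is a convolutional layer (e.g. the last block of the ResNet50); for the ViT, the library additionally reshapes token activations into a 2D map, which is omitted here.

```python
# Minimal Grad-CAM via hooks (illustrative sketch, not the library's API).
import torch.nn.functional as F

def grad_cam(model, target_layer, image, class_idx):
    # image: preprocessed tensor of shape (3, H, W); class_idx: target expression.
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    model.zero_grad()
    score = model(image.unsqueeze(0))[0, class_idx]
    score.backward()                                       # gradients of the target class
    h1.remove(); h2.remove()
    weights = grads["g"].mean(dim=(2, 3), keepdim=True)    # global-average-pooled gradients
    cam = F.relu((weights * acts["a"]).sum(dim=1))         # weighted sum of activation maps
    cam = F.interpolate(cam.unsqueeze(1), size=image.shape[-2:],
                        mode="bilinear", align_corners=False)[0, 0]
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalized heat map
```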

Figure 5: Score-CAM maps and the guided back-propagation (GBP) at different attention layers of the ViT for the Fear expression (image from the validation set of RAF-DB).
Figure 6: Attention maps based on GRAD-CAM for different expressions (images from the validation set of RAF-DB).

Confusion matrices: Figure 7 shows the confusion matrices on the validation set of RAF-DB for ResNet50, ViT and ViT+SE. ViT and ViT+SE perform better than ResNet50 on all expressions except the Happy expression. Although ViT+SE is 0.19% more accurate than ViT overall, it only outperforms it on 4 of the 7 basic expressions, namely Fear, Happy, Sad and Surprise; the ViT performs better on the Angry, Disgust and Neutral expressions.

Figure 7: Confusion Matrices of RAF-DB for ResNet50 (0.8061), ViT (0.8703) and ViT+SE (0.8722).

2 State-of-the-art

Survey on the used databases: Table 7 gives an overview of the facial expression databases used in our experiments.
Summary of state-of-the-art: In Table 8 we summarize different solutions proposed in the literature, grouped into 3 approaches: Handcrafted, Hybrid and Deep Learning. The table gives details about the year, the core of the proposed architecture and the databases used for the evaluation.

Database Publ. Year Annotation Condit. Data format Classes
CK+ [5543262] CVPRW 2010 593 sequences Lab Gray scale, RGB 7 BE + Contempt
JAFFE [670949] FG 1998 213 images Lab Gray scale 7 BE
FER-2013 [carrier2013fer] ICONIP 2013 35,887 images Web Gray scale 7 BE
SFEW [6130508] ICCV 2011 1,766 images Movie RGB 7 BE
RAF-DB [li2017reliable] CVPR 2017 29,672 images Internet Gray scale, RGB 7 BE, 12 CE

Table 7: Survey on databases of Macro-Expressions. BE: Basic Expressions, CE: Compound Expressions, Publ.: Publication, Condit.: Conditions.
Methods Publ. Year Architecture Databases

Handcrafted

[506414] TPAMI 1996 OF Private database
[7026204] ICIP 2014 PHOG, LPQ CK+ [5543262]
[6998925] Trans. AC. 2015 LBP JAFFE [670949], CK+ [5543262]
[9378702] IHSH 2020 LBP, HOG JAFFE [670949], KDEF [kdef51], RafD [rafd52]

Deep learning

[Otberdout2018DeepCD] BMVC 2018 CNN Oulu-CASIA [4761697], SFEW [6130508]
[Wang2020RegionAN] Trans. IP. 2020 CNN FER-2013 [carrier2013fer], RAF-DB [li2017reliable], SFEW [6130508], AffectNet [8013713]
[Farzaneh_2021_WACV] WACV 2021 CNN RAF-DB [li2017reliable], AffectNet [8013713]
[Minaee2021DeepEmotionFE] Sensors 2021 CNN FER-2013 [carrier2013fer], CK+ [5543262], FERG [aneja2016modeling], JAFFE [670949]
[Shi2021LearningTA] arXiv 2021 CNN FER-2013 [carrier2013fer], RAF-DB [li2017reliable], AffectNet [8013713]

Hybrid

[article1] ICMI 2015 LBP, CNN EmotiW 2015 [SFEW2015]
[9084763] ITNEC 2020 LBP, CNN FER-2013 [carrier2013fer]
[Ma2021RobustFE] arXiv 2021 LBP, CNN, ViT FERPlus [Barsoum2016TrainingDN], RAF-DB [li2017reliable], AffectNet [8013713], CK+ [5543262]
Table 8: Summary of representative approaches for facial expression recognition.