Year after year, human life is increasingly intertwined with Artificial Intelligence (AI)-based systems. As a result, there is growing interest in technologies that can understand and interact with humans, or that can improve communication between humans. To that end, more and more researchers are developing automated FER methods, which can be grouped into three categories: handcrafted, deep learning and hybrid. The main handcrafted solutions [506414, 7026204, 9378702] are based on techniques such as LBP, HOG and OF. They achieve good results on lab-made databases (CK+ and JAFFE), but perform modestly on wild databases (SFEW and RAF-DB [li2017reliable]). Some researchers [9191181, NEURIPS2020_a51fb975, Farzaneh_2021_WACV] have taken advantage of advances in deep learning, especially in CNN architectures, to outperform the earlier handcrafted solutions. Others [article1, 9084763] propose solutions that combine handcrafted techniques with deep learning to address specific challenges in FER.
Impressive results [NIPS2017_3f5ee243, devlin-etal-2019-bert, liu2019roberta] from Transformer models on NLP tasks have motivated the vision community to study the application of Transformers to computer vision problems. The idea is to represent an image as a sequence of patches, in analogy to a sequence of words in a sentence in the NLP domain. Transformers learn relations between sequence inputs in parallel through an attention mechanism, which makes them theoretically suitable for both NLP and image processing tasks. The Transformer was first introduced by Vaswani et al. [NIPS2017_3f5ee243] as a machine translation model, and multiple variants [devlin-etal-2019-bert, liu2019roberta] were then proposed to increase model accuracy and overcome various NLP challenges. Recently, the ViT has been applied to different computer vision tasks, from image classification [dosovitskiy2020] and object detection [carion2020end] to image generation [jiang2021transgan]. The Transformer has proven its capability and surpassed state-of-the-art performance in different NLP applications as well as in vision applications. However, these attention-based architectures are computationally more demanding than CNNs and hungry for training data.
In this paper, we propose to alleviate the problem that the ViT has for FER, caused by the lack of training data, with an SE block. We also provide an analysis of the internal representations of the ViT on facial expressions. The contributions of this paper can be summarized in four points:
Introduction of a SE block to optimize the learning of the ViT.
Fine-tuning of the ViT on FER-2013 [carrier2013fer] database for FER task.
Test of the model on four different databases (CK+ , JAFFE , RAF-DB [li2017reliable], SFEW ).
Analysis of the attention mechanism of the ViT and the effect of the SE block.
2 Related Works
In this section, we briefly review some related works on ViT and facial expression recognition solutions.
2.1 Vision Transformer (ViT)
The ViT was first proposed by Dosovitskiy et al. [dosovitskiy2020] for image classification. The main part of the model is the encoder of the Transformer as first introduced for machine translation by Vaswani et al. [NIPS2017_3f5ee243]. To transform the images into a sequence of patches they use a linear projection, and for the classification they use only the class token vector. The model achieves state-of-the-art performance on ImageNet classification by fine-tuning after pre-training on JFT-300M. From that, and from the fact that this model contains many more parameters (about 100M) than CNNs, we can say that ViTs are data-hungry models. To address this heavy reliance on large-scale databases, Touvron et al. [touvron2020deit]
proposed the DEIT model, a ViT with two classification tokens. The first is fed to an MLP head for classification, and the other is used in a distillation process with a CNN teacher model pretrained on ImageNet. The DEIT was trained only on ImageNet and outperforms both the ViT model and the teacher model. Yuan et al. [yuan2021tokens] overcome the same limitation of the ViT using a novel tokenization process. The proposed T2T-ViT [yuan2021tokens] model has two modules: 1) the T2T tokenization module, which consists of two steps, re-structurization and soft split, to model local information and progressively reduce the length of the tokens, and 2) the Transformer encoder module. It achieves state-of-the-art performance on ImageNet classification without pre-training on JFT-300M.
2.2 Facial Expression Recognition
The FER task has progressed from handcrafted [506414, 7026204, 9378702] solutions to deep learning [9191181, Otberdout2018DeepCD, Farzaneh_2021_WACV, Wang2020RegionAN] and hybrid [article1, 9084763, Ma2021RobustFE] solutions. In 2014, Turan et al. proposed a region-based handcrafted system for FER. They extracted features from the eye and mouth regions using LPQ and PHOG, used PCA for feature selection, fused the two groups of features with CCA, and finally applied an SVM as a classifier. More recent work proposed an automatic FER system based on LBP and HOG as feature extractors, with a locally linear embedding technique to reduce feature dimensionality and an SVM for the classification part. It reached state-of-the-art performance among handcrafted solutions on JAFFE, KDEF [kdef51] and RafD [rafd52]. Recently, more challenging and rich data have been made publicly available, and with the progress of deep learning architectures, many deep learning solutions based on CNN models have been proposed. Otberdout et al. [Otberdout2018DeepCD] proposed to use SPC to replace the fully connected layer in a CNN architecture for facial expression classification. Wang et al. [Wang2020RegionAN]
proposed a region-based solution using a CNN model with two attention blocks. They take different crops of the same image and apply a CNN to each patch; a self-attention module is then applied, followed by a relation-attention module. In the self-attention block, they use a loss function that allows one of the cropped images to receive a weight larger than the weight given to the input image. More recently, Farzaneh et al. [Farzaneh_2021_WACV] integrated an attention block to estimate the weights of features with a sparse center loss, to achieve intra-class compactness and inter-class separation. Deep learning based solutions have widely outperformed handcrafted solutions, especially on wild databases like RAF-DB [li2017reliable], SFEW, AffectNet and others.
Other researchers have thought about combining deep learning techniques with handcrafted techniques into hybrid systems. Levi et al. [article1] proposed to apply a CNN to the image, to its LBP, and to the LBP mapped to a 3D space using MDS. Xu et al. proposed to fuse CNN features with LBP features, using PCA as a feature selector. Recently, many Transformer models have been introduced for different computer vision tasks, and in that context Ma et al. [Ma2021RobustFE] proposed a convolutional vision Transformer. They extract features from the input image as well as from its LBP using a ResNet18, fuse the extracted features with an attentional selective fusion module, and feed the output to a Transformer encoder with an MLP head to perform the classification. To our knowledge, [Ma2021RobustFE] is the first Transformer-based solution for FER. However, our proposed solution differs in applying the Transformer encoder directly to the image rather than to extracted features, which may reduce the complexity of the system and helps to study and analyse the application of the ViT to the FER problem as one of the interesting vision tasks.
Table 8 (presented in the Supplementary Material) summarizes some state-of-the-art approaches with details on the architectures and databases used. We can notice that different databases are used to address different issues and challenges. From these databases we selected four to study our proposed solution and compare it with state-of-the-art works. The selected databases are described in the experiments and comparison Section 4. In the next section we describe our proposed solution.
3 Proposed Method
In this section, we introduce the proposed solution in three parts: an overview, then some details of the ViT architecture and the attention mechanism, and finally the SE block.
3.1 Architecture overview
The proposed solution contains two main parts: a vision Transformer to extract local attention features, and an SE block to extract global relations from the extracted features, which may optimize the learning process on small facial expression databases.
3.2 Vision Transformer
The vision Transformer consists of two steps: tokenization and the Transformer encoder. In the tokenization step, the image is split into patches of equal dimensions, each of which is then flattened into a vector. An extra learnable vector, called "cls_tkn", is added as a classification token. Each vector is marked with a position value. To summarize, the input of the Transformer encoder is a sequence of vectors of length .
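The tokenization step above can be sketched as follows. This is a minimal illustration, not the authors' code: the patch size (16) and embedding dimension (768) are taken from the ViT-B16 configuration used later in the paper, and the variable names are ours.

```python
import torch

# Split a 224x224 image into non-overlapping 16x16 patches, flatten and
# linearly project each patch, prepend the learnable cls_tkn, and add
# position embeddings. Shapes follow the ViT-B16-224 configuration.
patch_size, embed_dim = 16, 768
img = torch.randn(1, 3, 224, 224)                       # (B, C, H, W)

# unfold extracts patches: (B, C*P*P, N) with N = (224/16)^2 = 196
patches = torch.nn.functional.unfold(img, kernel_size=patch_size, stride=patch_size)
patches = patches.transpose(1, 2)                       # (1, 196, 3*16*16)

proj = torch.nn.Linear(3 * patch_size * patch_size, embed_dim)
tokens = proj(patches)                                  # (1, 196, 768)

# prepend the learnable classification token, then add position embeddings
cls_tkn = torch.nn.Parameter(torch.zeros(1, 1, embed_dim))
tokens = torch.cat([cls_tkn.expand(1, -1, -1), tokens], dim=1)   # (1, 197, 768)
pos_emb = torch.nn.Parameter(torch.zeros(1, tokens.shape[1], embed_dim))
tokens = tokens + pos_emb                               # encoder input
print(tokens.shape)                                     # torch.Size([1, 197, 768])
```

The encoder therefore receives 197 vectors: 196 patch tokens plus the classification token.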
As shown in Figure 1, the Transformer encoder is a sequence of attention blocks. The main part of the attention block is the MHA, which is built with $h$ heads of self-attention, also called intra-attention. According to [NIPS2017_3f5ee243], the idea of self-attention is to relate different positions of a single sequence in order to compute a representation of that sequence. For a given sequence, three layers are used: a Q-layer, a K-layer and a V-layer, and the self-attention function is a mapping of a query (Q, from the Q-layer) and a set of key-value pairs (K, from the K-layer; V, from the V-layer) to an output. The self-attention function is summarized by Equation eq:attention:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
And so the MHA, given by Equation eq:mha, will be:

$$\mathrm{MHA}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O}, \qquad \mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V})$$

where the projections $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$ and $W^{O}$ are parameter matrices.
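The two equations above can be sketched directly in code. This is an illustrative implementation, not the authors' code; the dimensions ($d_{model} = 768$, $h = 12$ heads) follow the ViT-B configuration.

```python
import math
import torch

def attention(Q, K, V):
    # Eq. (attention): softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    return torch.softmax(scores, dim=-1) @ V

B, N, d_model, h = 1, 197, 768, 12
d_k = d_model // h
x = torch.randn(B, N, d_model)

# the per-head projections W_i^Q, W_i^K, W_i^V (packed into one matrix each)
# and the output projection W^O
Wq, Wk, Wv, Wo = (torch.nn.Linear(d_model, d_model) for _ in range(4))

def split_heads(t):
    # (B, N, d_model) -> (B, h, N, d_k)
    return t.view(B, N, h, d_k).transpose(1, 2)

# Eq. (mha): run attention per head, concatenate, and project with W^O
heads = attention(split_heads(Wq(x)), split_heads(Wk(x)), split_heads(Wv(x)))
out = Wo(heads.transpose(1, 2).reshape(B, N, d_model))
print(out.shape)  # torch.Size([1, 197, 768])
```

Each head attends over all 197 tokens in parallel, which is the property the paper exploits to relate distant facial regions.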
3.3 Squeeze and Excitation (SE)
The Squeeze and Excitation block, shown on the right of Figure 1, is also an attention mechanism. It contains far fewer parameters than the self-attention block, as shown by Equation eq:se, where two fully connected layers are used with only one pointwise multiplication:

$$\mathrm{SE}(x) = x \odot \sigma\!\left(FC_2\!\left(\delta\!\left(FC_1(x)\right)\right)\right)$$

It was first introduced in [iandola2016squeezenet] to optimize CNN architectures as a channel-wise attention module; concretely, we use only the excitation part, since the squeeze part is a pooling layer built to reduce the dimension of the 2D CNN layers.
where $FC_1$ and $FC_2$ are fully connected layers with $C/r$ and $C$ neurons respectively (with $r$ a reduction ratio), $C$ is the length of cls_tkn, the classification token vector, and $\odot$ is a pointwise multiplication. The idea of using SE in our architecture is to optimize the learning of the ViT by learning more global attention relations between the extracted local attention features. Thus, the SE is introduced on top of the Transformer encoder, more precisely on the classification token vector. Unlike the self-attention block, which is used inside the Transformer encoder to encode the input sequence and extract features through cls_tkn, the SE is applied to recalibrate the feature responses by explicitly modelling inter-dependencies among the cls_tkn channels.
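A minimal sketch of the excitation block applied to the classification token follows. The reduction ratio `r = 16` and the sigmoid/ReLU activations are assumptions drawn from the standard SE formulation, not values stated in this paper.

```python
import torch

# Excitation-only SE block on cls_tkn: two fully connected layers
# (C -> C/r -> C) followed by a sigmoid gate and a pointwise multiplication.
C, r = 768, 16                       # C = cls_tkn length; r is assumed
fc1 = torch.nn.Linear(C, C // r)     # FC_1: C/r neurons
fc2 = torch.nn.Linear(C // r, C)     # FC_2: C neurons

def se(cls_tkn):
    # recalibrate channels: x * sigmoid(FC2(ReLU(FC1(x))))
    w = torch.sigmoid(fc2(torch.relu(fc1(cls_tkn))))
    return cls_tkn * w               # pointwise multiplication

cls_tkn = torch.randn(1, C)
out = se(cls_tkn)
print(out.shape)  # torch.Size([1, 768])
```

With `r = 16` this gate adds only about 74k parameters, far fewer than one self-attention block, which is the point made above.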
4 Experiments and Comparison
In this section, we first describe the databases used, then provide an ablation study of the different contributions together with other details of the proposed solution and an analysis of additional visualisations for an in-depth understanding of the ViT applied to the FER task. Finally, we present a comparison with state-of-the-art works.
4.1 FER Databases
CK+ : published in 2010, it is an extended version of the CK database. It contains 593 sequences taken in a lab environment with two data formats,  and . It encompasses the 7 basic expressions, which are Angry, Disgust, Fear, Happy, Neutral, Sad and Surprise, plus the Contempt expression. In our case, we only worked on the 7 basic expressions to allow a fair comparison with the other databases and with most state-of-the-art solutions.
JAFFE : The JAFFE database consists of 213 gray scale images of acted Japanese female facial expressions, all resized to . It contains the 7 basic expressions.
FER-2013 [carrier2013fer]: The FER-2013 database, sometimes referred to as FERPlus, contains almost 35k facial expression images over the 7 basic expressions. It was published in 2013 in a challenge on the Kaggle platform111https://www.kaggle.com/msambare/FER-2013. The images were collected from the web, converted to gray scale and resized to . Theoretically, this database could suffer from mislabeling, since a human accuracy is reported. However, since it is a large database of spontaneous facial expressions, we used it as pre-training data for our model.
SFEW : The SFEW is a very challenging database with images captured from different movies. It contains 1,766 RGB images of size . It is also labeled with the 7 basic expressions.
RAF-DB [li2017reliable]: The RAF-DB is a recent database with nearly 30K mixed RGB and gray scale images collected from different internet websites. It contains two separate subsets: one with the 7 basic expressions and the other with 12 compound facial expressions. In the experiments, we used the 7 basic expressions version.
Table 7 (presented in the Supplementary Material) summarizes the previously presented databases with the year, the publication venue and some other details. For the FER task there are other publicly available databases that address different issues, but we restricted our choice to these databases because they are at the center of interest of major state-of-the-art solutions.
4.2 Architecture and training parameters
In all experiments, we use a pretrained ViT-B16-224 (weights222https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/vision_transformer.py), the base version of the ViT with patch size  and input image size . Since ViT training needs large amounts of data to reach good performance, we used the following data augmentations: random horizontal flip, random gray scale conversion, and different values of brightness, contrast and saturation. All images are converted to 3 channels, resized to  and normalized. The regularisation methods used in this work are Cutout [devries2017cutout] and Mixup [zhang2018mixup]. The training is performed with categorical cross entropy as the loss function and AdamW [Loshchilov2019DecoupledWD] as the optimizer. The learning rate is fixed to  with a batch size of 16. When training on the FER-2013 database, the number of epochs is fixed to 8, and for the rest of the databases it is fixed to 10. The training process is carried out on a Tesla K80 GPU with 8 cores using PyTorch 1.7.
4.3 Ablation Study
In the ablation study, we assess the performance of the ViT architecture, the added SE block, and the use of FER-2013 [carrier2013fer] as pre-training data. Table 1 shows the results of the different experiments on CK+, JAFFE, RAF-DB and SFEW.
|Model||Pre-train||CK+ ||JAFFE ||RAF-DB [li2017reliable]||SFEW |
|ViT + SE||*||0.9949||0.9061||0.8618||0.4084|
|ViT + SE||FER-2013[carrier2013fer]*||0.9980||0.9292||0.8722||0.5429|
* The used ViT model is already trained on ImageNet .
From the first line, we can notice that the ViT can reach state-of-the-art performance on lab-made databases like CK+ and JAFFE; however, on SFEW the Transformer is less effective. In all cases, there is a benefit to using the SE block and the pre-training phase on FER-2013 [carrier2013fer]. The two contributions may not be complementary on lab-made data (CK+ and JAFFE). For example, on CK+ the pre-training improves the performance only when combined with the SE, while on JAFFE the best solution is the one that relies on pre-training without the SE. However, on wild databases (RAF-DB [li2017reliable] and SFEW) the added value of both contributions is more noticeable; in particular, on SFEW we obtain a 16% gain in accuracy compared to the ViT without either the SE or the pre-training on FER-2013 [carrier2013fer].
The confusion matrices of the proposed ViT+SE pre-trained on FER-2013 are reported in Figure 2: the left plot is for the validation set of RAF-DB [li2017reliable] and the right plot for the validation set of SFEW. The Happy and Neutral expressions are the best recognized on the SFEW database, with accuracies of 85% and 69% respectively. For RAF-DB [li2017reliable], the Happy expression has the best accuracy with 96%, followed by the Angry expression with 92%. On the two confusion matrices, we notice that our model has difficulty recognizing the Fear expression, which may be due to the smaller amount of data provided for that expression compared to the rest of the expressions.
4.4 Transformer visualisation and analysis
In this section, we conduct a varied set of experiments on the RAF-DB database. Specifically, we evaluate the classification outputs of the model through t-SNE, and we provide a visual analysis of the performance of the ViT model with the SE in comparison with a CNN.
Figure 3 shows the t-SNE of the features extracted from the ViT model without SE, and of the features of the ViT + SE both after and before the SE block, compared with the t-SNE of the features of a ResNet50 also trained on RAF-DB. Based on the t-SNE, the ViT architectures enable better separation of the classes compared to the CNN baseline architecture (ResNet50).
In addition, the SE block enhances the robustness of the ViT model, as the inter-cluster distances are maximized. Interestingly, the features before the SE form more compact clusters, with lower inter-cluster distances than the features after the SE, which may suggest that the features before the SE are more robust than those after it. However, we tried to use the before-SE features directly in the classification task and no performance gain was observed. Figure 4 shows different attention maps of the ViT, the ViT+SE and the ResNet50, using the Grad-CAM, Score-CAM and Eigen-CAM [Muhammad2020EigenCAMCA] tools. This visualisation shows that the ViT architectures succeed in focusing more locally, which confirms the interest of using self-attention blocks for computer vision tasks. Once again, we notice the gain of using the SE block with the different tools, but mostly with Eigen-CAM [Muhammad2020EigenCAMCA].
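The feature-space analysis above can be sketched with scikit-learn's t-SNE. The random features here are stand-ins for the cls_tkn features a trained model would produce; the dimensions and perplexity value are illustrative assumptions.

```python
import numpy as np
from sklearn.manifold import TSNE

# Project high-dimensional features to 2-D and inspect class clusters.
# In the paper, `features` would be the cls_tkn vectors (or ResNet50
# penultimate features) on the RAF-DB validation set.
rng = np.random.default_rng(0)
features = rng.normal(size=(300, 768))          # 300 samples, C = 768
labels = rng.integers(0, 7, size=300)           # 7 basic expressions

emb = TSNE(n_components=2, perplexity=30, init="pca",
           random_state=0).fit_transform(features)
print(emb.shape)                                # (300, 2)
```

The 2-D embedding is then scatter-plotted with one colour per expression; better-separated colour clusters indicate more discriminative features.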
Other investigations of the ViT architecture are presented in the Supplementary Material: Figure 5 shows the evolution of the attention from the first attention block to the deeper attention blocks, where we can notice that the focus of the ViT goes from global attention to more local attention. This particular behaviour of the ViT on the FER task is the motivation for using the SE block on top of it, to build a calibrated relation between the different local focuses. In Figure 6 (Supplementary Material), we show the focus of the ViT compared to the ViT + SE for different facial expressions; it shows how the SE can rectify the local attention features extracted by the ViT by searching for global attention relations.
4.5 Comparison with state-of-the-art
We compare our proposed model, the ViT+SE pre-trained on the FER-2013 [carrier2013fer] database, with state-of-the-art solutions on 2 lab-made databases (CK+ and JAFFE) and 2 wild databases (RAF-DB [li2017reliable] and SFEW). Table 3 shows that we obtain the highest accuracy on CK+, with 99.80% using a 10-fold cross-validation protocol. Table 5 shows that we set a new state-of-the-art performance for single models on SFEW with 54.29% accuracy; a higher accuracy (56.4%) is reported in [Wang2020RegionAN], but using ensemble models. Furthermore, Table 3 shows that the proposed solution has a good 10-fold cross-validation accuracy on JAFFE with 92.92%. To our knowledge, this is the highest performance for a deep learning based solution, but it remains almost 3% below the highest accuracy obtained with a recently proposed handcrafted solution. Table 5 shows that our solution performs well on RAF-DB [li2017reliable] with an accuracy of 87.22%, positioning it as the third best solution among the state of the art on this database, nearly 3% below the best record.
CK+:
| Method | Approach | Accuracy |
| [NEURIPS2020_a51fb975] 2020 | Deep Learning | 0.9759 |
| [Minaee2021DeepEmotionFE] 2021 | Deep Learning | 0.9800 |
| ViT + SE | Deep Learning | 0.9980 |

JAFFE:
| Method | Approach | Accuracy |
| [Minaee2021DeepEmotionFE] 2021 | Deep Learning | 0.9280 |
| ViT + SE | Deep Learning | 0.9292 |

RAF-DB [li2017reliable]:
| Method | Approach | Accuracy |
| [Wang2020RegionAN] 2020 | Deep Learning | 0.8690 |
| [Shi2021LearningTA] 2021 | Deep Learning | 0.9055 |
| ViT + SE | Deep Learning | 0.8722 |

SFEW:
| Method | Approach | Accuracy |
| [Otberdout2018DeepCD] 2018 | Deep Learning | 0.4918 |
| [Cai2018IslandLF] 2018 | Deep Learning | 0.5252 |
| [Wang2020RegionAN] 2020 | Deep Learning | 0.5419 |
| ViT + SE | Deep Learning | 0.5429 |
5 Conclusion
In this work, we introduced the ViT+SE, a simple scheme that optimizes the learning of the ViT with an attention block called Squeeze and Excitation. It performs impressively well at improving the performance of the ViT on the FER task. Furthermore, it also improves the robustness of the model, as shown by the t-SNE representation of the extracted features and by the attention maps. We presented the classification performance on lab-made databases (CK+ and JAFFE) and wild databases (RAF-DB and SFEW) to evaluate the gain of the SE block and of the use of FER-2013 as a pre-training database. By comparing with different state-of-the-art solutions, we showed that our proposed solution achieves the highest single-model performance on CK+ and SFEW, and competitive results on JAFFE and RAF-DB. As future work, we aim to extend the ViT architecture to address the temporal aspect for more challenging tasks such as micro-expression recognition.
1 Cross-database evaluation and visual illustrations
Cross-database evaluation: To verify the generalisation ability of our model, we conducted a cross-database evaluation on CK+. The results are summarized in Table 6. They show that the ViT generalizes better than a baseline CNN (ResNet50), and that the proposed ViT+SE model gives the best generalization from the different training databases when tested on CK+. However, the generalization ability is still modest, and we aim to improve it in future work.
Attention Maps: In this work, we used Grad-CAM, Score-CAM and Eigen-CAM [Muhammad2020EigenCAMCA] as tools to provide a visual analysis of the proposed deep learning architectures (code available at333https://github.com/jacobgil/pytorch-grad-cam).
Grad-CAM : the Gradient-weighted CAM uses the gradient of any target flowing into the selected layer of the model to generate a heat map that highlights the regions of the image that are important for predicting the target.
Score-CAM : the Score-weighted CAM is a linear combination of weights and activation maps, where the weight of each activation map is obtained from the forward-pass score of that map on the target class.
Eigen-CAM [Muhammad2020EigenCAMCA] : it computes the principal components of the learned features from the model layers.
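As an illustration of the Grad-CAM computation described above, here is a from-scratch sketch (not the linked pytorch-grad-cam package) on a toy one-layer model; the layer sizes and target class are arbitrary.

```python
import torch

# Grad-CAM in miniature: weight each activation map of the chosen layer by
# the spatially averaged gradient of the target score, sum, then ReLU.
torch.manual_seed(0)
conv = torch.nn.Conv2d(3, 8, 3, padding=1)      # the "selected layer"
head = torch.nn.Linear(8, 7)                    # 7 expression classes

x = torch.randn(1, 3, 32, 32)
acts = conv(x)                                  # (1, 8, 32, 32)
acts.retain_grad()                              # keep grads of a non-leaf
score = head(acts.mean(dim=(2, 3)))[0]          # GAP + classifier -> (7,)
score[3].backward()                             # gradient w.r.t. one target

weights = acts.grad.mean(dim=(2, 3), keepdim=True)   # alpha_k: GAP of grads
cam = torch.relu((weights * acts).sum(dim=1))        # heat map, (1, 32, 32)
print(cam.shape)  # torch.Size([1, 32, 32])
```

The resulting map is upsampled to the input resolution and overlaid on the face image, as in Figure 4.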
Confusion matrices: Figure 7 shows the confusion matrices on the validation set of RAF-DB for ResNet50, ViT and ViT+SE. ViT and ViT+SE perform better than ResNet50 on all expressions except the Happy expression. Although the ViT+SE is 0.19% more accurate than the ViT overall, it only outperforms it on 4 of the 7 basic expressions, namely Fear, Happy, Sad and Surprise; the ViT performs better on the Angry, Disgust and Neutral expressions.
Survey of the used databases: Table 7 gives an overview of the facial expression databases used in our experiments.
Summary of state-of-the-art: In Table 8 we summarize different solutions proposed in the literature, grouped into 3 approaches: Handcrafted, Hybrid and Deep Learning. The table gives details about the year, the core of the proposed architecture and the databases used for evaluation.
| Database | Venue | Year | Size | Source | Expressions |
| CK+ | CVPRW | 2010 | 593 sequences | Lab | 8 BE (7 BE + Contempt), gray scale and RGB |
| JAFFE | FG | 1998 | 213 images | Lab | 7 BE |
| FER-2013 [carrier2013fer] | ICONIP | 2013 | 35,887 images | Web | 7 BE |
| SFEW | ICCV | 2011 | 1,766 images | Movie | 7 BE |
| RAF-DB [li2017reliable] | CVPR | 2017 | 29,672 images | Internet | 7 BE |
| Reference | Venue | Year | Architecture | Databases |
| | Trans. AC. | 2015 | LBP | JAFFE, CK+ |
| | IHSH | 2020 | LBP, HOG | JAFFE, KDEF [kdef51], RafD [rafd52] |
| [Otberdout2018DeepCD] | BMVC | 2018 | CNN | Oulu-CASIA, SFEW |
| [Wang2020RegionAN] | Trans. IP. | 2020 | CNN | FER-2013 [carrier2013fer], RAF-DB [li2017reliable], SFEW, AffectNet |
| [Minaee2021DeepEmotionFE] | Sensors | 2021 | CNN | FER-2013 [carrier2013fer], CK+, FERG [aneja2016modeling], JAFFE |
| [Shi2021LearningTA] | arXiv | 2021 | CNN | FER-2013 [carrier2013fer], RAF-DB [li2017reliable], AffectNet |
| [article1] | ICMI | 2015 | LBP, CNN | EmotiW 2015 [SFEW2015] |
| | ITNEC | 2020 | LBP, CNN | FER-2013 [carrier2013fer] |
| [Ma2021RobustFE] | arXiv | 2021 | LBP, CNN, ViT | FERPlus [Barsoum2016TrainingDN], RAF-DB [li2017reliable], AffectNet, CK+ |