
Noisy Student Training using Body Language Dataset Improves Facial Expression Recognition

by   Vikas Kumar, et al.
Penn State University

Facial expression recognition from videos in the wild is a challenging task due to the lack of abundant labelled training data. Large DNN (deep neural network) architectures and ensemble methods have resulted in better performance, but their gains soon saturate because of data inadequacy. In this paper, we use a self-training method that utilizes a combination of a labelled dataset and an unlabelled dataset (Body Language Dataset - BoLD). Experimental analysis shows that training a noisy student network iteratively helps in achieving significantly better results. Additionally, our model isolates different regions of the face and processes them independently using a multi-level attention mechanism, which further boosts the performance. Our results show that the proposed method achieves state-of-the-art performance on the benchmark datasets CK+ and AFEW 8.0 when compared to other single models.





1 Introduction

Automatic facial expression recognition from images/videos has many applications, such as human-computer interaction (HCI), bodily expressed emotions, and human behaviour understanding, and has thus gained a lot of attention in academia and industry. Although there has been extensive research on this subject, facial expression recognition in the wild remains a challenging problem because of several factors such as occlusion, illumination, motion blur, and subject-specific facial variations, along with the lack of extensive labelled training datasets. Following a similar line of research, our task aims to classify a given video in the wild into one of seven broad categorical emotions. We propose an efficient model that addresses the challenges posed by videos in the wild while tackling the issue of labelled data inadequacy. The input data used for facial expression recognition can be multi-modal, i.e. it may have visual information as well as audio information. However, the scope of this paper is limited to emotion classification using only visual information.

Most of the recent research on the publicly-available AFEW 8.0 (Acted Facial Expressions in the Wild) [10] dataset has focused on improving accuracy without regard to computational complexity, architectural complexity, energy & policy considerations, generality, and training efficiency. Several state-of-the-art methods [13, 33, 50] on this dataset have originated from the EmotiW [11] challenge with no clear computational-cost analysis. Fan et al. [13] achieved the highest validation accuracy based on visual cues, but they used a fusion of five different architectures with more than 300 million parameters. In contrast, our proposed method uses a single model with approximately 25 million parameters and comparable performance.

While previous work focused on improving performance by increasing model capacity, our method focuses on better pre-processing, feature selection, and adequate training. Prior research [29, 42, 24, 46] uses a simple aggregation or averaging operation on features from multiple frames to form a fixed-dimensional feature vector. However, such methods do not account for the fact that a few principal frames in a video can be used to identify the target emotion, while the rest of the frames have a negligible contribution. Frame-attention has been used [37] for selectively processing frames in a video, but it can further be coupled with spatial-attention, which can identify the most discriminative regions in a particular frame. We use a three-level attention mechanism in our model: a) a spatial-attention block that helps to selectively process the feature maps of a frame, b) a channel-attention block that focuses on the face regions at a local and a global level, i.e. eyes region (upper face), mouth region (lower face) and whole face, and c) a frame-attention block that helps to identify the most important frames in a video.

AFEW 8.0 [10] has several limitations (Sec. 2) that restrict the generalization capabilities of deep learning models. To overcome these limitations, we use an unlabelled subset of the BoLD dataset [35] for semi-supervised learning. Inspired by Xie et al. [53], we use a teacher-student learning method where the training process is iterated by using the trained student as the teacher for the next iteration. During the training of the student, noise is injected into the student model to force it to generalize better than the teacher. Results show that the student performs better with each iteration, hence improving the overall accuracy on the validation set.

The rest of the paper is organized as follows. Sec. 3 explains the datasets (AFEW 8.0 [10], CK+ [34] and BoLD [35]) that are used for training our model, along with the pre-processing pipeline used for face detection, alignment and illumination correction. Sec. 4.1 explains the backbone network and covers the three types of attention and their importance in detail. Sec. 4.2 covers the use of the BoLD dataset for iterative training and the experimental results of semi-supervised learning. Sec. 5.3 compares the results of our methods to other state-of-the-art methods on the AFEW 8.0 dataset. Additionally, we use another benchmark dataset, CK+ [34] (posed conditions), and perform ablation studies (Sec. 5.4) to validate our model and training procedure.

2 Related Work

Facial Expression Recognition: A number of methods have been proposed on the AFEW 8.0 dataset [10] since the first EmotiW [11] challenge in 2013. Earlier approaches include non-deep learning methods such as multiple kernel learning [43], least-square regression on the Grassmannian manifold [31], and feature fusion with kernel learning [8], whereas recent approaches include deep-learning methods such as frame-attention networks [37], multiple spatial-temporal learning [33], and deeply supervised emotion recognition [13]. Although several methods [13, 33, 50, 30] have achieved impressive results on the AFEW 8.0 dataset, many have used ensemble (fusion) based methods and considered multiple modalities without commenting on the resources and time required to train such models. Spatial-temporal methods [50, 12] aim to model motion information or temporal coherency in the videos using 3D convolution [47] or LSTM (long short-term memory) [17]. However, owing to computational efficiency and the ability to treat sequential information with a global context, several studies [37, 2] related to facial expression recognition have successfully implemented attention-based methods by assigning a weight to each timestep in the video. Similarly, spatial self-attention has been used [2, 14, 28] as a means to guide the process of feature extraction and find the importance of each local image feature. Our model builds upon the spatial self-attention mechanism and additionally uses a channel-attention mechanism to exploit the differential effects of facial feedback signals from the upper-face and lower-face regions [51, 56].

Training Datasets: Despite being a long-established dataset, AFEW 8.0 [10] has several shortcomings. Firstly, the dataset contains significantly fewer training examples for fear, surprise and disgust categories which makes the dataset imbalanced. Secondly, the videos are extracted from mainstream cinema, and scenes depicting fear are often shot in the dark, which again makes the model biased towards other categories [33, 1]. Such limitations warrant the use of additional datasets for better generalization. However, not many in-the-wild labelled video datasets are publicly available for facial expression recognition. Several related datasets [34, 48, 36] are captured in posed conditions and are restricted to a certain country or community. Aff-Wild2 [25] is another popular dataset, but it contains per-frame annotations, and thus cannot be used in our work which performs video-level classification based on facial expressions. We use an unlabelled portion of the BoLD dataset [35] since the videos are of the desired length and are captured from movies similar to our labelled dataset.

Semi-Supervised Learning: The semi-supervised approach is effective in classification problems when the labelled training data is not sufficient. We use noisy student training [53] for semi-supervised learning, in which the student is deliberately noised while it trains on the combined labelled and unlabelled dataset. Input noise is added to the student model in the form of data augmentations, which enforces that different alterations of the same video receive the same emotion label, making the student model more robust. Additionally, model noise is added in the form of dropout, which forces the student (single model) to match the performance of an ensemble model. Other techniques for semi-supervised learning include self-training [55, 40], data-distillation [38] and consistency training [4, 39]. Self-training is similar to noisy student training, but it does not use or justify the role of noise in training a powerful student. Data-distillation strengthens the teacher using ensembles instead of weakening the student; however, a smaller student makes it difficult to mimic the teacher. Consistency training adds regularization parameters to the teacher model during training to induce invariance to input and model noise, resulting in confident pseudo-labels. However, such constraints lead to lower accuracy and a less powerful teacher [53].

Figure 1: The pre-processing steps mainly include face detection and alignment (MTCNN [58]), illumination correction (Enlighten-GAN [19]) and landmark-based cropping. Examples from labelled dataset (AFEW 8.0) and unlabelled dataset (BoLD dataset) are shown. As seen in the figure, only videos with a close shot of the face are selected from the BoLD dataset.

3 Dataset

In this section, we first describe the datasets that we use in our experiments, followed by the pre-processing pipeline.

Labelled Sets: AFEW 8.0 (Acted Facial Expressions in the Wild) [10] contains videos with seven emotion labels, i.e. anger (197 samples), neutral (207 samples), sad (179 samples), fear (127 samples), surprise (120 samples), happiness (212 samples), and disgust (114 samples) from different movies. The train set consists of 773 video samples (46,080 frames), and the validation set consists of 383 video samples (21,157 frames). The results are reported on the validation set since the test set labels are only available to EmotiW challenge [11] participants. Some of the example frames are shown in Fig. 1. CK+ (Cohn Kanade Extended) [34] contains 327 video sequences (5,878 frames) divided into seven categories, i.e. anger (45 samples), disgust (59 samples), fear (25 samples), happy (69 samples), sad (28 samples), surprise (83 samples), and contempt (18 samples). The motivation behind testing our method on a posed dataset is to establish the robustness of our model and semi-supervised learning method irrespective of the data source. Since CK+ does not have a testing set, we report the average accuracy obtained using 10-fold cross-validation, as in other studies [37, 57, 20, 7, 44].

Unlabelled Set: BoLD (Body Language Dataset) [35] contains videos selected from the publicly available AVA dataset [15], which contains a list of YouTube movie IDs. While the gathered videos are annotated based on body language, the videos having a close shot of the face instead of the whole or partially-occluded body are unlabelled. To create an AFEW-like subset from the BoLD dataset, we impose two conditions to automatically validate a video. Firstly, a video should contain a run of consecutive frames in which only one actor’s face is detected by MTCNN (Multi-task Cascaded Convolutional Networks) [58]. Secondly, the bounding box of the face detected using MTCNN should exceed an occupied-area threshold for the majority of those frames. If the video satisfies the above two conditions, a shorter clip consisting of those frames is added to the unlabelled dataset. Using this procedure, we create a subset of 3,450 videos (224,258 frames) from the original BoLD dataset. Some of the examples gathered are shown in Fig. 1.
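The two validation conditions can be sketched as a small filter over per-frame face detections. This is only an illustration under stated assumptions: the detections are taken as given (any detector output would do), boxes are normalised to the unit square, and the names `select_clip`, `min_len` and `min_area` as well as their default values are hypothetical, since the text does not state the thresholds that were used.

```python
def box_area(box):
    """Area of an (x1, y1, x2, y2) box given in normalised [0, 1] coordinates."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def select_clip(frames, min_len=16, min_area=0.10):
    """frames: one list of face boxes per video frame.

    Returns (start, end) of the longest run of consecutive frames that contain
    exactly one face, provided the face is large enough in the majority of
    those frames; returns None otherwise. The thresholds are assumed values.
    """
    best, start = None, None
    # The trailing empty frame is a sentinel that closes a run ending at the last frame.
    for i, faces in enumerate(frames + [[]]):
        if len(faces) == 1:
            if start is None:
                start = i
        elif start is not None:
            if best is None or i - start > best[1] - best[0]:
                best = (start, i)
            start = None
    if best is None or best[1] - best[0] < min_len:
        return None
    run = frames[best[0]:best[1]]
    large = sum(box_area(faces[0]) >= min_area for faces in run)
    return best if large > len(run) / 2 else None
```

The returned index pair marks the frames that would be cut out as the shorter clip; a video with no qualifying run is simply skipped.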

Pre-Processing: Previous works [33, 37] have used the CNN-based detector provided by dlib [23] for face alignment. However, face alignment is highly dependent on accurate detection of facial landmarks, and the CNN-based detector provided by dlib is not reliable in ‘in-the-wild’ conditions (especially for non-frontal faces). We therefore use MTCNN [58] for face detection and alignment. If MTCNN detects multiple faces in a frame, the face with the largest bounding box is selected. After obtaining the facial landmarks, the alignment of the face is corrected using the angle between the line connecting the eye landmarks and the horizontal line. After detection and alignment, the cropped face is resized to 224×224 pixels, which is the input size of our model.
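As a minimal sketch of the two geometric steps in this paragraph, the snippet below picks the largest detected face and computes the in-plane rotation from the eye landmarks. The function names are hypothetical, boxes are assumed to be (x1, y1, x2, y2) in pixels, and the detector itself is not shown.

```python
import math


def largest_face(boxes):
    """Pick the (x1, y1, x2, y2) box with the largest area."""
    return max(boxes, key=lambda b: (b[2] - b[0]) * (b[3] - b[1]))


def roll_angle_degrees(left_eye, right_eye):
    """Angle between the line joining the eyes and the horizontal axis.

    Points are (x, y) in image coordinates, so a positive angle means the
    right eye sits lower in the image than the left eye.
    """
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.degrees(math.atan2(dy, dx))
```

Rotating the frame by the negative of this angle about the midpoint of the eyes would bring the eye line back to horizontal.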

We use the landmarks given by MTCNN to isolate the mouth (lower face) and eyes (upper face) regions. The upper face is isolated using the eye landmarks, with the desired normalized coordinates of the left eye being (0.2, 0.6) and those of the right eye being (0.8, 0.6) in the new frame, which is enough to occlude the lower half of the face in almost all frames (Fig. 1). A similar procedure is used for occluding the upper half of the face and isolating the mouth region using the left-mouth and right-mouth landmarks. All landmark-based crops are again resized to 224×224 pixels.
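One way to realise such a landmark-based crop is a similarity transform (rotation, uniform scale and translation) that sends the two detected eye points to the target positions. The sketch below computes only the point mapping, using complex arithmetic; the target coordinates and the 224-pixel size come from the text, whereas the function name and the choice of a similarity transform are assumptions, and the actual image warping (which needs an image library) is left out.

```python
def eye_alignment_map(left_eye, right_eye, size=224,
                      left_target=(0.2, 0.6), right_target=(0.8, 0.6)):
    """Return a function mapping source pixels to the size x size crop.

    A 2D similarity transform is z -> a * z + b over complex numbers,
    where a encodes rotation and uniform scale and b the translation.
    """
    s_l, s_r = complex(*left_eye), complex(*right_eye)
    t_l = complex(*left_target) * size
    t_r = complex(*right_target) * size
    a = (t_r - t_l) / (s_r - s_l)
    b = t_l - a * s_l

    def warp(point):
        z = a * complex(*point) + b
        return (z.real, z.imag)

    return warp
```

Passing the left-mouth and right-mouth landmarks with suitable targets gives the corresponding mapping for the mouth crop.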

As addressed earlier, some of the categories of emotions are often captured in the dark in movies, which requires an illumination correction step. Several methods have been suggested for illumination normalization such as gamma correction [3, 32], Difference of Gaussians (DoG) [52] and histogram equalization [6, 21] which are effective for facial expression recognition. However, these methods tend to amplify noise, tone distortion, and other artefacts. Hence, we use a state-of-the-art pre-trained deep learning model, i.e. Enlighten-GAN [19] (U-Net [41] as generator) which provides appropriate results (Fig. 1) with uniform illumination and suppressed noise.

4 Methodology

Our proposed methodology is divided into two phases, i.e. a) architecture implementation that defines the backbone network with the three-level attention mechanism, and b) semi-supervised learning.

Figure 2: The backbone network (ResNet-18) and the three-level attention mechanism. Inputs are first processed via Spatial-Attention, followed by Channel-Attention and finally by Frame-Attention.

4.1 Architecture

4.1.1 Backbone Network:

We use ResNet-18 [16] architecture as our backbone network, with minor modifications to increase its computational efficiency. Features from each residual block are combined to form the final feature vector (see Fig. 2). Hence, the final vector has a “multi-level knowledge” from all the residual blocks, ensuring more diverse and robust features. The model is first pre-trained on the FERPlus dataset [5]. Our input at frame-level is an image with nine channels (RGB channels from the face, eyes, and mouth region). To process them independently, the model uses group convolution [26] (groups = 3), i.e. it uses a different set of filters for each of the three regions to get the final output feature maps. Group convolution results in a lower computational cost since each kernel filter does not have to convolve on all the feature maps of the previous layer. Simultaneously, it allows data parallelism where each filter group is learning a unique representation and forms a global (face region) or local (eyes and mouth region) context vector from each frame of a video. To allow more filters per group, we increase the number of filters in each residual block, as shown in Fig. 2.
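The saving from group convolution can be checked with a short weight count. The sketch below ignores biases; the nine input channels and groups = 3 come from the text, while the 7×7 kernel and the 96 output filters are hypothetical values chosen only for illustration.

```python
def conv_weight_count(c_in, c_out, kernel, groups=1):
    """Number of weights in a 2D convolution layer (biases ignored).

    With grouping, each output filter only sees c_in / groups input channels.
    """
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    return (c_in // groups) * c_out * kernel * kernel


# Nine input channels (RGB for face, eyes and mouth); 96 filters is an assumed width.
dense = conv_weight_count(9, 96, 7, groups=1)
grouped = conv_weight_count(9, 96, 7, groups=3)
```

For the same number of output filters, three groups use one third of the weights, which is what leaves room for more filters per group at a similar cost.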

Figure 3: This figure shows how multi-level attention works in the proposed method. Spatial-attention (from last residual block) chooses the dominant feature maps from each region. Channel-attention picks the most important region that most clearly shows the target emotion. Frame-attention assigns the salient frames a higher weight.

4.1.2 Spatial-Attention:

A common approach in previous methods is a simple aggregation or average pooling of feature maps to form a fixed-dimensional feature vector. However, we use spatial-attention [28], which combines the feature maps based on the attention weights they have been assigned. Let us assume the output from a residual block is of shape $H \times W \times C$, where $H$ and $W$ are the output height and width, and $C$ is the number of output filters. This 3D tensor is reshaped to a 2D matrix $A$ of shape $R \times C$, where $R = H \times W$. The spatial-attention mechanism takes the input matrix $A$ and outputs a weight matrix $M$ of shape $K \times R$ ($K$ is for multiple hops of attention). Each row of the output matrix represents a different hop of attention, and the weights of each hop are normalized over the $R$ positions by the softmax (see Equation 1). The objective is to find the weighted average of the $R$ spatial descriptors to obtain a vector of length $C$ (or $KC$ with multiple hops).

$$M = \mathrm{softmax}\left(W_2 \tanh\left(W_1 A^{\top}\right)\right) \qquad (1)$$

$$v = \mathrm{flatten}(M A) \qquad (2)$$

Equation 1 represents multi-head spatial-attention, where $W_1$ is of shape $U \times C$ and $W_2$ is of shape $K \times U$ ($U$ can be set arbitrarily). From this, we obtain the flattened vector $v$ using Equation 2. The spatial-attention module is applied on each residual block (see Fig. 2) and the output vectors are aggregated to obtain a final vector of length $D$ each for the face ($v_f$), eyes ($v_e$) and mouth ($v_m$) regions. The advantages of spatial attention can be seen in Fig. 3. While the feature vector from the face is encoded with a global context, the feature maps from the eyes and mouth regions have additional information regarding minute expressions such as a furrowed brow or flared nostrils.
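For illustration, the multi-hop spatial attention can be written out on plain nested lists. The sketch assumes the structured self-attention form of [28], i.e. a weight matrix softmax(W2 tanh(W1 Aᵀ)) with A of shape R × C (one row per spatial position), W1 of shape U × C and W2 of shape K × U; the helper names are hypothetical and the code is not the authors' implementation.

```python
import math


def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def softmax(row):
    m = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def spatial_attention(A, W1, W2):
    """A: R x C, W1: U x C, W2: K x U. Returns (M, v) with M: K x R, len(v) = K * C."""
    hidden = [[math.tanh(v) for v in row] for row in matmul(W1, transpose(A))]  # U x R
    M = [softmax(row) for row in matmul(W2, hidden)]  # each hop sums to 1 over R positions
    MA = matmul(M, A)  # K x C: one attention-weighted average of the rows of A per hop
    return M, [value for row in MA for value in row]
```

With an all-zero W2 every hop degenerates to uniform weights, so the output reduces to plain average pooling, the baseline that the attention is meant to improve on.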

4.1.3 Channel-Attention:

Let $v_f$, $v_e$, and $v_m$ be the feature vectors (each of length $D$) obtained from the face, the eyes, and the mouth region respectively. We model the cross-channel interactions using a lightweight attention module. We use two fully-connected layers to obtain a weight (Equation 3) for each channel group, using which we obtain a weighted average (Equation 4) of the three feature vectors. ReLU (Rectified Linear Unit) activation is used after the first layer to capture non-linear interactions among the channels.

$$\alpha_i = \sigma\left(q_c^{\top}\,\mathrm{ReLU}\left(P_c v_i\right)\right), \quad i \in \{f, e, m\} \qquad (3)$$

$$z = \frac{\sum_{i} \alpha_i v_i}{\sum_{i} \alpha_i} \qquad (4)$$

Here $\sigma$ is the sigmoid activation function, $q_c$ is a vector of length $L$ (set arbitrarily), and $P_c$ is a matrix of shape $L \times D$. In Fig. 3, we see that the model assigns more weight to the mouth region than to the eyes region for an expression depicting happiness, which is consistent with our finding that the mouth region is more prominent for the happy category (Fig. 5).
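A small sketch of this weighting step is given below, under the assumption that the two fully-connected layers are a matrix followed by ReLU and a vector followed by a sigmoid, and that the weighted average is normalised by the sum of the weights. The names `attention_weight` and `attend` are hypothetical, and biases are omitted.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def attention_weight(v, P, q):
    """Two-layer scoring: sigmoid(q . ReLU(P v)), with P of shape L x D and q of length L."""
    hidden = [max(0.0, sum(p * x for p, x in zip(row, v))) for row in P]
    return sigmoid(sum(qi * h for qi, h in zip(q, hidden)))


def attend(vectors, P, q):
    """Weighted average of equally sized vectors, normalised by the sum of the weights."""
    weights = [attention_weight(v, P, q) for v in vectors]
    total = sum(weights)
    fused = [sum(w * v[d] for w, v in zip(weights, vectors)) / total
             for d in range(len(vectors[0]))]
    return weights, fused
```

Calling `attend` on the face, eyes and mouth vectors of one frame gives the fused frame vector; since frame-attention (Sec. 4.1.4) is described with the same two-layer form, the same function could be applied to the per-frame vectors of a video.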

4.1.4 Frame-Attention:

For a video having $n$ frames, we obtain a vector $z_j$ of length $D$ from each frame ($j = 1, \dots, n$) after the channel-attention module. Finally, we use frame-attention to assign the most discriminative frames a higher weight. Following a similar intuition as in channel-attention, we use two fully-connected layers to obtain a weight (Equation 5) for each frame, using which we find a weighted average (Equation 6) of the frame features.

$$\beta_j = \sigma\left(q_t^{\top}\,\mathrm{ReLU}\left(P_t z_j\right)\right) \qquad (5)$$

$$\bar{z} = \frac{\sum_{j=1}^{n} \beta_j z_j}{\sum_{j=1}^{n} \beta_j} \qquad (6)$$

where $q_t$ is a vector of length $L$ (set arbitrarily), and $P_t$ is a matrix of shape $L \times D$. Fig. 3 shows how the model assigns a higher weight to the frames that distinctly contain an expression depicting happiness. The feature vector $\bar{z}$ is passed through a fully-connected layer to obtain the final 7-dimensional output.

4.1.5 Implementation Details:

We use weighted cross-entropy as our loss function, where class weights are assigned based on the number of training samples to alleviate the problem of unbalanced data. Additionally, $M$ (Equation 1) is regularized by adding the Frobenius norm of the matrix $(M M^{\top} - I)$ to the loss function, which encourages the multiple hops of spatial-attention to focus on different regions [28]. We use the Adam optimizer with an initial learning rate of 1e-5 (reduced by 40% after every 30 epochs) and the model is trained for 100 epochs. Training takes around 8 minutes per epoch on the AFEW 8.0 training set with two NVIDIA Tesla K80 cards.
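The two numeric ingredients of this setup can be sketched in a few lines. The step schedule follows the text (learning rate 1e-5, multiplied by 0.6 every 30 epochs); the inverse-frequency formula for the class weights is an assumption, since the text only says that the weights depend on the number of training samples. The per-class counts are those listed in Sec. 3, which sum to 1,156 and therefore cover the train and validation videos together; they are used here only as illustrative input.

```python
def class_weights(counts):
    """Inverse-frequency weights, scaled so that a balanced dataset gives 1.0 everywhere."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * k) for label, k in counts.items()}


def learning_rate(epoch, base=1e-5, drop=0.40, every=30):
    """Step decay: the rate is reduced by `drop` after every `every` epochs."""
    return base * (1.0 - drop) ** (epoch // every)


afew_counts = {"anger": 197, "neutral": 207, "sad": 179, "fear": 127,
               "surprise": 120, "happiness": 212, "disgust": 114}
weights = class_weights(afew_counts)
schedule = [learning_rate(e) for e in range(100)]
```

Under this weighting the rarest classes (disgust, surprise, fear) receive weights above 1 and the most frequent ones weights below 1.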

4.2 Noisy Student Training [53]

Once the model is trained on the labelled set and the best possible model is obtained, we use it as a teacher model to create pseudo-labels on the subset of the BoLD dataset that we collected. After generating the pseudo-labels, a student model (same size or larger than the teacher) is trained on the combination of the labelled and unlabelled datasets. While training the student model, we deliberately add noise in the form of random data augmentations and dropout (with 0.5 probability at the final hidden layer). Random data augmentations (using RandAugment [9]) include transformations such as brightness change, contrast change, translation, sharpness change and flips. RandAugment automatically applies random operations with a random magnitude. After the noisy student is trained on the combined data, the trained student becomes the new teacher that generates new pseudo-labels for the unlabelled dataset. The iterative training continues until we observe a saturation in performance. Fig. 4 shows how noisy training helps the student become more robust. While the teacher may give different predictions for different alterations of the same video, the student is more accurate and stable with its predictions.
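The iterative procedure can be summarised as a short loop. This is a schematic sketch, not the authors' code: `train` and `predict` are hypothetical callables supplied by the caller, `noisy=True` stands for the augmentation and dropout described above, and a fixed iteration count replaces the stop-at-saturation rule used in the text.

```python
def noisy_student(train, predict, labelled, unlabelled, iterations=4):
    """train(examples, noisy) -> model; predict(model, x) -> label.

    labelled is a list of (x, y) pairs and unlabelled is a list of x.
    Returns the last student, which is also the final teacher.
    """
    teacher = train(labelled, noisy=False)
    for _ in range(iterations):
        # Pseudo-labels come from the teacher, which is not noised at inference.
        pseudo = [(x, predict(teacher, x)) for x in unlabelled]
        # The student sees labelled and pseudo-labelled data and is trained with noise.
        student = train(labelled + pseudo, noisy=True)
        teacher = student
    return teacher
```

In practice the loop would also track validation accuracy after each iteration and stop once it no longer improves.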

Figure 4: Semi-supervised algorithm is presented in the flow-chart. We also show an example video from AFEW 8.0 dataset where the frames underwent different augmentations. Predictions without iterative training are shown in red and predictions after iterative training are shown in black.

5 Results

In this section, we show the results obtained with and without iterative self-training, followed by comparison with state-of-the-art methods and ablation studies.

Figure 5: This figure shows the confusion matrices, the accuracies, and the macro f1 scores achieved on the AFEW 8.0 dataset using different regions of the face. The proposed model (Face + Eyes + Mouth) achieves the highest accuracy. An=Angry, Sa=Sad, Ne=Neutral, Ha=Happy, Su=Surprise, Fe=Fear, Di=Disgust.

5.1 Without Student Training

Fig. 5 shows the results of processing individual regions (without group convolution and channel attention) on the AFEW 8.0 dataset, along with the proposed methodology. Our objective is to explore a) whether the upper-face and lower-face regions have different feedback signals that dominate different categories of emotions, and b) whether isolating the regions and processing them independently leads to an increase in accuracy. As seen in the confusion matrix (Fig. 5), the eyes region is better than the mouth region in the prediction of the sadness and disgust categories. Intuitively, the squinted eyes in disgust and the droopy eyelids or furrowed eyebrows in sadness make the eyes region more prominent. On the other hand, the mouth region is comparatively better with categories that require lip movements, like happiness, anger, and surprise. Overall, 52.50% accuracy is achieved using the proposed model, which is slightly better than the model that only uses faces. Furthermore, we see a significant increase in the macro f1 score when we include the eyes and mouth regions along with faces, indicating that the predictions are comparatively less biased across the seven categories (an advantage for noisy student training). The proposed model is still biased against the fear, surprise, and disgust categories, but performs better than several existing methods [33, 1, 54] where the reported accuracies for these categories are close to 0%.

5.2 With Iterative Training

Using noisy student training, we report our experimental results for four iterations on the AFEW 8.0 dataset and two iterations on the CK+ dataset.

Figure 6: This figure shows the experimental results of noisy student training for four iterations using the AFEW 8.0 and BoLD datasets.

Data Balancing: Since the model is biased, the number of pseudo-labels in the unlabelled dataset for some categories is smaller than in other categories. We try to match the distribution of the training set by duplicating images of fear, disgust, and surprise categories. Additionally, images of angry, happy, and neutral classes are filtered out based on confidence scores. Fig. 6 shows that balancing the pseudo-labels leads to better accuracy in each iteration compared to the student model without data balancing. The same trend is not observed for the CK+ dataset since the pseudo-labels roughly have the same distribution as the training set.
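A sketch of this balancing step is shown below, assuming each pseudo-labelled item carries the teacher's confidence and that a target count per class has been derived from the labelled training distribution. The function name, the tuple layout and the use of a hard per-class target are assumptions made for illustration.

```python
from collections import defaultdict


def balance_pseudo_labels(items, targets):
    """items: (sample_id, label, confidence) tuples; targets: label -> desired count.

    Over-represented classes keep only their most confident items;
    under-represented classes are padded by repeating items.
    """
    by_class = defaultdict(list)
    for item in items:
        by_class[item[1]].append(item)
    balanced = []
    for label, target in targets.items():
        ranked = sorted(by_class[label], key=lambda it: it[2], reverse=True)
        if not ranked:
            continue  # nothing to duplicate for this class
        if len(ranked) >= target:
            balanced.extend(ranked[:target])  # filter by confidence
        else:
            balanced.extend(ranked[i % len(ranked)] for i in range(target))  # duplicate
    return balanced
```

Duplicated items are exact repeats here; with random augmentation applied during student training, each repeat would still be seen in a different altered form.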

Unlabelled Dataset Size: As stated in the original paper [53], using a large amount of unlabelled data leads to better accuracy. After data balancing, we use a fraction of the BoLD dataset and report the accuracy after several iterations of training until the performance saturates (see Fig. 6). For both CK+ and AFEW 8.0 dataset, we observe that using the whole unlabelled training set is better as opposed to using just a fraction of the dataset. Fig. 6 shows a steady increase in all categories and overall accuracy with an increase in data size after four iterations of training on the AFEW 8.0 dataset.

Importance of Noise: Noise helps the student to be more robust than the teacher, as addressed in Sec. 2. The accuracy only reaches 53.5% on the AFEW 8.0 dataset without noise in student training, and no improvement is seen on the CK+ dataset. However, we achieve an accuracy of 55.17% after noisy training, which shows that input and model perturbations are vital while training the student. Additionally, Fig. 6 shows that it is better when the pseudo-labels are generated without noise, i.e. the teacher remains as powerful as possible.

Batch Size Ratio: When training on combined data, a batch of labelled images and a batch of unlabelled images are concatenated for each training step. If the batch sizes of the labelled and unlabelled sets are equal, the model will complete several epochs of training on the labelled data before completing one epoch of training on the BoLD dataset due to its larger size. To balance the number of epochs of training on both datasets, the batch size of the unlabelled set is kept higher than that of the labelled set. Fig. 6 shows that a batch size ratio of 2:1 or 3:1 is ideal for training when AFEW 8.0 is used as the labelled training set. Similarly, a batch size ratio of 5:1 is ideal for the CK+ dataset.
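The epoch imbalance described here is simple arithmetic. The sketch below counts how many passes over the labelled set are completed during one pass over the unlabelled set for a given batch size ratio; the dataset sizes (773 labelled and 3,450 unlabelled videos) come from Sec. 3, while the labelled batch size of 16 is a hypothetical value.

```python
import math


def labelled_epochs_per_unlabelled_epoch(n_labelled, n_unlabelled, batch_labelled, ratio):
    """ratio is the unlabelled : labelled batch size ratio (e.g. 3 for 3:1)."""
    steps = math.ceil(n_unlabelled / (batch_labelled * ratio))  # steps for one unlabelled pass
    return steps * batch_labelled / n_labelled


equal = labelled_epochs_per_unlabelled_epoch(773, 3450, 16, ratio=1)
three_to_one = labelled_epochs_per_unlabelled_epoch(773, 3450, 16, ratio=3)
```

With equal batch sizes the labelled set is revisited more than four times per pass over the unlabelled clips, and a 3:1 ratio brings this down to fewer than two.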

AFEW 8.0

| Model | Acc. |
|---|---|
| CNN-RNN (2016) [12] | 45.43% |
| DSN-HoloNet (2017) [18] | 46.47% |
| DSN-VGGFace (2018) [13] | 48.04% |
| VGG-Face + LSTM (2017) [50] | 48.60% |
| VGG-Face (2019) [2] | 49.00% |
| ResNet-18 (2018) [49] | 49.70% |
| FAN (2019) [37] | 51.18% |
| DenseNet-161 (2018) [30] | 51.44% |
| Our Model (w/o iter. training) | 52.49% |
| VGG-Face + BLSTM (2018) [33] | 53.91% |
| Our Model (iter. training) | 55.17% |

CK+

| Model | Acc. |
|---|---|
| Lomo (2016) [44] | 92.00% |
| CNN + Island Loss (2018) [7] | 94.35% |
| FAN (2019) (Fusion) [37] | 94.80% |
| Hierarchical DNN (2019) [22] | 96.46% |
| DTAGN (2015) [20] | 97.25% |
| MDSTFN (2019) [45] | 98.38% |
| Compact CNN (2018) [27] | 98.47% |
| ST Network (2017) [57] | 98.47% |
| Our Model (w/o iter. learning) | 98.77% |
| FAN (2019) [37] | 99.69% |
| Our Model (iter. learning) | 99.69% |

Table 1: We compare our results to the top-performing single models evaluated on the AFEW 8.0 dataset and state-of-the-art models evaluated on the CK+ dataset.

5.3 Comparison with other methods

We evaluate our model on the labelled datasets and show a comparison with the existing state-of-the-art methods (Table 1). On the AFEW 8.0 dataset, we achieve an accuracy of 52.5% without iterative training and 55.17% with iterative training. Compared to the existing best single model [33], our proposed method improves accuracy by 1.26%. Compared to static-based CNN methods that aim to combine frame scores for video-level recognition, we achieve a significant improvement of 3.73% over the previous baseline [30]. We also compare the performance and speed of the existing state-of-the-art models, including fusion methods (only visual modality), with our proposed model. Several methods that show higher validation accuracy have significantly higher computational demand, which may be impractical for real-time, real-world applications. For instance, [49] uses an ensemble of 50 models with the same architecture and yet attains only 52.2% validation accuracy. Similarly, [33, 30] use a combination of multiple deep learning models where each model has a higher computational cost than ours. We measure the computational complexity of state-of-the-art methods using FLOPs (floating-point operations), and the results show that our method offers the best trade-off between performance and speed (Fig. 7).

Figure 7: Comparison of performance (accuracy) vs computational cost (FLOPs, i.e. floating-point operations) of state-of-the-art models evaluated on the AFEW 8.0 dataset. FLOPs for the models are estimated values based on the backbone network unless explicitly specified by the authors. The best trade-offs lie closer to the top-left corner.

On the CK+ dataset, our method achieves an on par 10-fold cross-validation accuracy when compared to other state-of-the-art methods. While our model achieves an accuracy of only 98.77% without iterative learning, the accuracy improves by 0.92% when training data of each fold is combined with the unlabelled dataset for two iterations. This confirms our premise that self-training using noisy student is a robust procedure and can be used to increase the performance of a model on several other labelled data sources. Additionally, our results show that one can achieve better performance on a posed dataset when trained with an unlabelled in-the-wild dataset in a semi-supervised manner, which can be an effective alternative to labour-intensive tasks like gathering additional posed samples or labelling data.

Component Importance

| Component | Acc. |
|---|---|
| ResNet-18 (Baseline) | 47.5% |
| + MTCNN, Enlighten-GAN (Sec. 3) | 48.3% |
| + Features from all blocks (Sec. 4.1.1) | 49.3% |
| + Spatial-Attention (Sec. 4.1.2) | 50.3% |
| + Multiple Regions (Sec. 4.1.1) | 51.2% |
| + Channel-Attention (Sec. 4.1.3) | 51.7% |
| + Frame-Attention (Sec. 4.1.4) | 52.5% |
| + Iteration 1 - Self-training (Sec. 4.2) | 53.5% |
| + Iteration 2 - Self-training (Sec. 4.2) | 54.6% |
| + Iteration 3 - Self-training (Sec. 4.2) | 54.9% |
| + Iteration 4 - Self-training (Sec. 4.2) | 55.2% |

Noisy Student Training

| Iteration | Student | Acc. |
|---|---|---|
| 0 | - | 52.5% |
| 1 | ResNet-18 | 53.5% |
| 1 | ResNet-34 | 53.5% |
| 2 | ResNet-18 | 54.6% |
| 2 | ResNet-34 | 54.5% |
| 3 | ResNet-18 | 54.9% |
| 3 | ResNet-34 | 54.8% |
| 4 | ResNet-18 | 55.2% |
| 4 | ResNet-34 | 55.2% |
| 5 | ResNet-18 | 55.2% |
| 5 | ResNet-34 | 55.2% |

Table 2: This table shows the ablation studies conducted with AFEW 8.0 dataset. Component Importance shows the increase in accuracy with the addition of each component separately. Noisy Student Training shows the increase in accuracy with each loop of iterative learning and the effect of using a larger student.

5.4 Ablation Studies

Our baseline model is a ResNet-18 in which the video-level feature vector is an unweighted average of all the frame-level feature vectors. Without sophisticated pre-processing, the baseline achieves an accuracy of 47.5%. To better understand the significance of each component, we record our results after every change to the baseline model (Table 2). Significant improvements are observed when features are concatenated from multiple residual blocks using spatial-attention, and when frame features are combined from multiple regions using group convolution and channel-attention.
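The difference between the baseline's unweighted averaging and a weighted aggregation of frames can be sketched in a few lines. This is only an illustration: the softmax-weighted form and the per-frame `scores` are assumptions made for the example (in a trained model the scores would come from a learned attention module), and the function names are hypothetical.

```python
import math


def average_pool(frame_features):
    """Baseline: unweighted mean of the frame-level feature vectors."""
    n = len(frame_features)
    dim = len(frame_features[0])
    return [sum(f[d] for f in frame_features) / n for d in range(dim)]


def attention_pool(frame_features, scores):
    """Weighted mean of frames, with weights from a softmax over per-frame scores.

    The softmax form is an assumption for illustration only.
    """
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(frame_features[0])
    return [sum(w * f[d] for w, f in zip(weights, frame_features)) for d in range(dim)]


# Three frames with two-dimensional features (toy values).
frames = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]

video_avg = average_pool(frames)                        # every frame counts equally
video_uniform = attention_pool(frames, [0.0, 0.0, 0.0])  # equal scores give equal weights
video_peaked = attention_pool(frames, [0.0, 0.0, 50.0])  # last frame dominates
```

With equal scores the weighted pool reduces to the baseline average, so the baseline can be viewed as the special case in which no frame is preferred over another.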

Additionally, Table 2 shows the increase in validation accuracy with each loop of iterative learning. As suggested by [53], noisy student learning may perform better if the student is larger than the teacher. Since ResNet-34 [16] has a comparatively larger capacity, we report its results alongside ResNet-18 as the student model for each iteration. As seen in Table 2, our results do not show improvement when ResNet-18 in our student model is replaced with a larger backbone. A possible explanation is that the unlabelled dataset used by [53] is a hundred times larger than the labelled dataset, so a student with higher capacity may have resulted in better performance. In contrast, our unlabelled dataset is only four times larger than the labelled dataset. Gathering additional unlabelled samples and using a larger student may result in a further increase in accuracy on the AFEW 8.0 dataset.
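The iterative loop discussed above can be sketched with a deliberately tiny stand-in model. This is a toy under stated assumptions, not the training code used in this work: a nearest-centroid classifier on 2-D points replaces the attention network, Gaussian input jitter replaces the noise applied to the student, and all names (`fit_centroids`, `noisy_student`, and so on) are hypothetical. Model capacity is not represented, so the sketch says nothing about teacher versus student size.

```python
import random


def fit_centroids(points, labels):
    """Train a nearest-centroid classifier: one mean vector per class."""
    sums, counts = {}, {}
    for (x, y), c in zip(points, labels):
        sx, sy = sums.get(c, (0.0, 0.0))
        sums[c] = (sx + x, sy + y)
        counts[c] = counts.get(c, 0) + 1
    return {c: (sums[c][0] / counts[c], sums[c][1] / counts[c]) for c in sums}


def predict(centroids, points):
    """Assign each point to the class of its closest centroid."""
    def sq_dist(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    return [min(centroids, key=lambda c: sq_dist(p, centroids[c])) for p in points]


def add_noise(points, rng, sigma=0.3):
    """Input noise for the student (an assumed stand-in for augmentation)."""
    return [(x + rng.gauss(0, sigma), y + rng.gauss(0, sigma)) for x, y in points]


def noisy_student(labelled, labels, unlabelled, iterations, rng):
    """Return the list of models: the initial teacher plus one student per iteration."""
    teacher = fit_centroids(labelled, labels)  # iteration 0: labelled data only
    history = [teacher]
    for _ in range(iterations):
        pseudo = predict(teacher, unlabelled)            # teacher labels clean inputs
        train_x = add_noise(labelled + unlabelled, rng)  # student trains on noisy inputs
        student = fit_centroids(train_x, labels + pseudo)
        teacher = student                                # student becomes the next teacher
        history.append(teacher)
    return history


rng = random.Random(0)
labelled = [(0.0, 0.0), (0.5, 0.2), (4.0, 4.0), (3.6, 4.3)]
labels = [0, 0, 1, 1]
# Unlabelled pool four times the size of the labelled set, mirroring the ratio above.
unlabelled = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(8)]
unlabelled += [(rng.gauss(4, 0.5), rng.gauss(4, 0.5)) for _ in range(8)]

models = noisy_student(labelled, labels, unlabelled, iterations=2, rng=rng)
final_predictions = predict(models[-1], labelled)
```

The key design point is that the teacher predicts pseudo-labels on clean inputs while the student is trained on perturbed ones, so the student has to learn a model that is robust to the noise rather than simply copying the teacher.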

6 Conclusion

We propose a multi-level attention model for video-based facial expression recognition, which is trained using a semi-supervised approach. Our contribution is a cost-effective single model that achieves performance on par with state-of-the-art models using two strategies. First, we use attention with multiple sources of information to capture spatially and temporally important features, which is a computationally economical alternative to the fusion of multiple learning models. Second, we use self-training to overcome the lack of labelled video datasets for facial expression recognition. The proposed training scheme can be extended to other related tasks in the field of affective computing.

7 Acknowledgements

The authors acknowledge the support of Professor James Wang for providing the opportunity to work on this project during his course on Artificial Emotion Intelligence at the Pennsylvania State University.

References


  • [1] D. Acharya, Z. Huang, D. Pani Paudel, and L. Van Gool (2018) Covariance pooling for facial expression recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 367–374. Cited by: §2, §5.1.
  • [2] M. Aminbeidokhti, M. Pedersoli, P. Cardinal, and E. Granger (2019) Emotion recognition with spatial attention and temporal softmax pooling. In International Conference on Image Analysis and Recognition, pp. 323–331. Cited by: §2, Table 1.
  • [3] S. Anila and N. Devarajan (2012) Preprocessing technique for face recognition applications under varying illumination conditions. Global Journal of Computer Science and Technology. Cited by: §3.
  • [4] P. Bachman, O. Alsharif, and D. Precup (2014) Learning with pseudo-ensembles. In Advances in neural information processing systems, pp. 3365–3373. Cited by: §2.
  • [5] E. Barsoum, C. Zhang, C. C. Ferrer, and Z. Zhang (2016) Training deep networks for facial expression recognition with crowd-sourced label distribution. In Proceedings of the 18th ACM International Conference on Multimodal Interaction, pp. 279–283. Cited by: §4.1.1.
  • [6] R. I. Bendjillali, M. Beladgham, K. Merit, and A. Taleb-Ahmed (2019) Improved facial expression recognition based on dwt feature for deep cnn. Electronics 8 (3), pp. 324. Cited by: §3.
  • [7] J. Cai, Z. Meng, A. S. Khan, Z. Li, J. O’Reilly, and Y. Tong (2018) Island loss for learning discriminative features in facial expression recognition. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pp. 302–309. Cited by: §3, Table 1.
  • [8] J. Chen, Z. Chen, Z. Chi, and H. Fu (2014) Emotion recognition in the wild with feature fusion and multiple kernel learning. In Proceedings of the 16th International Conference on Multimodal Interaction, pp. 508–513. Cited by: §2.
  • [9] E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le (2019) RandAugment: practical data augmentation with no separate search. arXiv preprint arXiv:1909.13719. Cited by: §4.2.
  • [10] A. Dhall, R. Goecke, S. Lucey, and T. Gedeon (2012) Collecting large, richly annotated facial-expression databases from movies. IEEE multimedia (3), pp. 34–41. Cited by: §1, §1, §1, §2, §2, §3.
  • [11] A. Dhall (2019) EmotiW 2019: automatic emotion, engagement and cohesion prediction tasks. In 2019 International Conference on Multimodal Interaction, pp. 546–550. Cited by: §1, §2, §3.
  • [12] Y. Fan, X. Lu, D. Li, and Y. Liu (2016) Video-based emotion recognition using cnn-rnn and c3d hybrid networks. In Proceedings of the 18th ACM International Conference on Multimodal Interaction, pp. 445–450. Cited by: §2, Table 1.
  • [13] Y. Fan, J. C. Lam, and V. O. Li (2018) Video-based emotion recognition using deeply-supervised neural networks. In Proceedings of the 20th ACM International Conference on Multimodal Interaction, pp. 584–588. Cited by: §1, §2, Table 1.
  • [14] Y. Fang, J. Gao, C. Huang, H. Peng, and R. Wu (2019) Self multi-head attention-based convolutional neural networks for fake news detection. PloS one 14 (9). Cited by: §2.
  • [15] C. Gu, C. Sun, D. A. Ross, C. Vondrick, C. Pantofaru, Y. Li, S. Vijayanarasimhan, G. Toderici, S. Ricco, R. Sukthankar, et al. (2018) Ava: a video dataset of spatio-temporally localized atomic visual actions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6047–6056. Cited by: §3.
  • [16] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §4.1.1, §5.4.
  • [17] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §2.
  • [18] P. Hu, D. Cai, S. Wang, A. Yao, and Y. Chen (2017) Learning supervised scoring ensemble for emotion recognition in the wild. In Proceedings of the 19th ACM international conference on multimodal interaction, pp. 553–560. Cited by: Table 1.
  • [19] Y. Jiang, X. Gong, D. Liu, Y. Cheng, C. Fang, X. Shen, J. Yang, P. Zhou, and Z. Wang (2019) Enlightengan: deep light enhancement without paired supervision. arXiv preprint arXiv:1906.06972. Cited by: Figure 1, §3.
  • [20] H. Jung, S. Lee, J. Yim, S. Park, and J. Kim (2015) Joint fine-tuning in deep neural networks for facial expression recognition. In Proceedings of the IEEE international conference on computer vision, pp. 2983–2991. Cited by: §3, Table 1.
  • [21] M. Karthigayan, M. R. M. Juhari, R. Nagarajan, M. Sugisaka, S. Yaacob, M. R. Mamat, and H. Desa (2007) Development of a personified face emotion recognition technique using fitness function. Artificial Life and Robotics 11 (2), pp. 197–203. Cited by: §3.
  • [22] J. Kim, B. Kim, P. P. Roy, and D. Jeong (2019) Efficient facial expression recognition algorithm based on hierarchical deep neural network structure. IEEE Access 7, pp. 41273–41285. Cited by: Table 1.
  • [23] D. E. King (2009) Dlib-ml: a machine learning toolkit. Journal of Machine Learning Research 10 (Jul), pp. 1755–1758. Cited by: §3.
  • [24] B. Knyazev, R. Shvetsov, N. Efremova, and A. Kuharenko (2017) Convolutional neural networks pretrained on large face recognition datasets for emotion classification from video. arXiv preprint arXiv:1711.04598. Cited by: §1.
  • [25] D. Kollias and S. Zafeiriou (2018) Aff-wild2: extending the aff-wild database for affect recognition. arXiv preprint arXiv:1811.07770. Cited by: §2.
  • [26] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §4.1.1.
  • [27] C. Kuo, S. Lai, and M. Sarkis (2018) A compact deep learning model for robust facial expression recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 2121–2129. Cited by: Table 1.
  • [28] Z. Lin, M. Feng, C. N. d. Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio (2017) A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130. Cited by: §2, §4.1.2, §4.1.5.
  • [29] G. Littlewort, M. S. Bartlett, I. Fasel, J. Susskind, and J. Movellan (2004) Dynamics of facial expression extracted automatically from video. In 2004 Conference on Computer Vision and Pattern Recognition Workshop, pp. 80–80. Cited by: §1.
  • [30] C. Liu, T. Tang, K. Lv, and M. Wang (2018) Multi-feature based emotion recognition for video clips. In Proceedings of the 20th ACM International Conference on Multimodal Interaction, pp. 630–634. Cited by: §2, §5.3, Table 1.
  • [31] M. Liu, R. Wang, Z. Huang, S. Shan, and X. Chen (2013) Partial least squares regression on grassmannian manifold for emotion recognition. In Proceedings of the 15th ACM on International conference on multimodal interaction, pp. 525–530. Cited by: §2.
  • [32] Y. Liu, Y. Li, X. Ma, and R. Song (2017) Facial expression recognition with fusion features extracted from salient facial areas. Sensors 17 (4), pp. 712. Cited by: §3.
  • [33] C. Lu, W. Zheng, C. Li, C. Tang, S. Liu, S. Yan, and Y. Zong (2018) Multiple spatio-temporal feature learning for video-based emotion recognition in the wild. In Proceedings of the 20th ACM International Conference on Multimodal Interaction, pp. 646–652. Cited by: §1, §2, §2, §3, §5.1, §5.3, Table 1.
  • [34] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews (2010) The extended cohn-kanade dataset (ck+): a complete dataset for action unit and emotion-specified expression. In 2010 ieee computer society conference on computer vision and pattern recognition-workshops, pp. 94–101. Cited by: §1, §2, §3.
  • [35] Y. Luo, J. Ye, R. B. Adams, J. Li, M. G. Newman, and J. Z. Wang (2020) Arbee: towards automated recognition of bodily expression of emotion in the wild. International Journal of Computer Vision 128 (1), pp. 1–25. Cited by: §1, §1, §2, §3.
  • [36] M. J. Lyons, S. Akamatsu, M. Kamachi, J. Gyoba, and J. Budynek (1998) The japanese female facial expression (jaffe) database. In Proceedings of third international conference on automatic face and gesture recognition, pp. 14–16. Cited by: §2.
  • [37] D. Meng, X. Peng, K. Wang, and Y. Qiao (2019) Frame attention networks for facial expression recognition in videos. In 2019 IEEE International Conference on Image Processing (ICIP), pp. 3866–3870. Cited by: §1, §2, §3, §3, Table 1.
  • [38] I. Radosavovic, P. Dollár, R. Girshick, G. Gkioxari, and K. He (2018) Data distillation: towards omni-supervised learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4119–4128. Cited by: §2.
  • [39] A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and T. Raiko (2015) Semi-supervised learning with ladder networks. In Advances in neural information processing systems, pp. 3546–3554. Cited by: §2.
  • [40] E. Riloff (1996) Automatically generating extraction patterns from untagged text. In Proceedings of the national conference on artificial intelligence, pp. 1044–1049. Cited by: §2.
  • [41] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §3.
  • [42] C. Shan, S. Gong, and P. W. McOwan (2009) Facial expression recognition based on local binary patterns: a comprehensive study. Image and vision Computing 27 (6), pp. 803–816. Cited by: §1.
  • [43] K. Sikka, K. Dykstra, S. Sathyanarayana, G. Littlewort, and M. Bartlett (2013) Multiple kernel learning for emotion recognition in the wild. In Proceedings of the 15th ACM on International conference on multimodal interaction, pp. 517–524. Cited by: §2.
  • [44] K. Sikka, G. Sharma, and M. Bartlett (2016) Lomo: latent ordinal model for facial analysis in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5580–5589. Cited by: §3, Table 1.
  • [45] N. Sun, Q. Li, R. Huan, J. Liu, and G. Han (2019) Deep spatial-temporal feature fusion for facial expression recognition in static images. Pattern Recognition Letters 119, pp. 49–61. Cited by: Table 1.
  • [46] Y. Tang (2013) Deep learning using linear support vector machines. arXiv preprint arXiv:1306.0239. Cited by: §1.
  • [47] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri (2015) Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pp. 4489–4497. Cited by: §2.
  • [48] M. Valstar and M. Pantic (2010) Induced disgust, happiness and surprise: an addition to the mmi facial expression database. In Proc. 3rd Intern. Workshop on EMOTION (satellite of LREC): Corpora for Research on Emotion and Affect, pp. 65. Cited by: §2.
  • [49] V. Vielzeuf, C. Kervadec, S. Pateux, A. Lechervy, and F. Jurie (2018) An occam’s razor view on learning audiovisual emotion recognition with small training sets. In Proceedings of the 20th ACM International Conference on Multimodal Interaction, pp. 589–593. Cited by: §5.3, Table 1.
  • [50] V. Vielzeuf, S. Pateux, and F. Jurie (2017) Temporal multimodal fusion for video emotion classification in the wild. In Proceedings of the 19th ACM International Conference on Multimodal Interaction, pp. 569–576. Cited by: §1, §2, Table 1.
  • [51] K. Wang, X. Peng, J. Yang, D. Meng, and Y. Qiao (2020) Region attention networks for pose and occlusion robust facial expression recognition. IEEE Transactions on Image Processing 29, pp. 4057–4069. Cited by: §2.
  • [52] S. Wang, W. Li, Y. Wang, Y. Jiang, S. Jiang, and R. Zhao (2012) An improved difference of gaussian filter in face recognition. Journal of Multimedia 7 (6), pp. 429–433. Cited by: §3.
  • [53] Q. Xie, E. Hovy, M. Luong, and Q. V. Le (2019) Self-training with noisy student improves imagenet classification. arXiv preprint arXiv:1911.04252. Cited by: §1, §2, §4.2, §5.2, §5.4.
  • [54] J. Yan, W. Zheng, Z. Cui, C. Tang, T. Zhang, and Y. Zong (2018) Multi-cue fusion for emotion recognition in the wild. Neurocomputing 309, pp. 27–35. Cited by: §5.1.
  • [55] D. Yarowsky (1995) Unsupervised word sense disambiguation rivaling supervised methods. In 33rd annual meeting of the association for computational linguistics, pp. 189–196. Cited by: §2.
  • [56] X. Zeng, Q. Wu, S. Zhang, Z. Liu, Q. Zhou, and M. Zhang (2018) A false trail to follow: differential effects of the facial feedback signals from the upper and lower face on the recognition of micro-expressions. Frontiers in psychology 9, pp. 2015. Cited by: §2.
  • [57] K. Zhang, Y. Huang, Y. Du, and L. Wang (2017) Facial expression recognition based on deep evolutional spatial-temporal networks. IEEE Transactions on Image Processing 26 (9), pp. 4193–4203. Cited by: §3, Table 1.
  • [58] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao (2016) Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters 23 (10), pp. 1499–1503. Cited by: Figure 1, §3, §3.