Facial expression is the external reflection of emotion, and in essence, the movement of subcutaneous muscles. In interpersonal communication, the facial expression is one of the most important media for people to convey feelings and attitudes to each other. Therefore, facial expression recognition is of great significance for human-computer interaction.
With GPU performance growing drastically, deep learning has come a long way in just over a decade. Gradually, deep learning has become the mainstream method to promote FER. Relying on the famous neural networks in the classification era, some scholars put forward their algorithms to deal with the challenges of FER. Wang[scn] propose Self-Cure Network based on ResNet-18 [resnet] to suppress the uncertainties of data annotation in network training.
The well-known neural networks applied to FER have achieved certain effects, whereas they cannot fully give play to the learning advantages of the deep network because of the limitation to the scale of the FER dataset. Creating larger datasets brings prohibitive annotation costs. How to improve the recognition accuracy from the limited datasets has become a major challenge for FER.
Not only that, but we find that certain intrinsic properties of the convolutional neural network limit and hinder its ability. To encode information accurately on the image, the edge information cannot be ignored. However, the convolution kernel perceives the edges of the image, especially the corners, less than the inside. The way the convolutional kernel works makes the weight of the information on the edges lower than that in the middle. We call the negligent property of convolutional kernel as perception bias. Therefore, it is accustomed to utilize padding, a simple process of adding zeros to the periphery of the images, to balance the perceptual frequency of each pixel. The same is true for the feature map. When the network is fairly shallow, it seems unnecessary to consider the influence of a few zeros on the formation of features, because the image pixels are relatively huge. Normally, as the network deepens, the channels increase exponentially. Conversely, the area of the feature map keeps shrinking. The pixel information becomes scarce and refined. That makes the deep convolutional layers show less and less receptive to the fade of the edges, as edge pixels account for a considerable part of the total. The only countermeasure is to keep padding to ensure the capture of key information. Since then, the network has fallen into the padding trap. The essence of convolution is to calculate the sum of the products of the pixel values in each region and those of the kernel. Excessive zeros involved will inevitably lead to serious information distortion. This is a serious issue that we have been neglecting in the past. As shown in Figure 1, just one convolution operation with padding, the periphery of the image becomes blurred visible to the naked eye. Layer by layer, padding gradually erodes the feature map from outside to inside. We call the information that gravely impaired in this way albino features. Actually, the unprocessed albino features directly affect the representations downward.
Hence, in order to avoid the problems mentioned above, we need to consider not only the internal correlation hide in the FER datasets but also how to eliminate the erosion of representation. In this work, we propose a module that can be embedded in any network with a pooling layer, called Amend Representation Module (ARM), which consists of an auxiliary block and two functional blocks.
To protect representation from erosion, we remove the pooling layer and replace it with De-albino (DA) block. The Feature Arrangement (FA) block, customized to assist the DA block to work better, arranges the raw feature map into a single channel under the premise of minimum position change of feature pixels. As shown in Figure 2, all the pixels that are severely eroded are concentrated in the outermost part of the feature map. Then we take advantage of the perception bias of the convolutional kernel to subtly reduce the weight of the albino features. The perception bias is positively correlated with the size of the convolutional layer kernel. It is worth noting that padding is resolutely denied throughout the module. Because we regard it as the culprit of the problem.
Secondly, inspired by the nature of facial expressions, which are variants of the human face, we consider that there are certain affinities among expression features. We explore the role of this relationship and obtain a benign response of FER performance. Sharing Affinity (SA) block combines a certain proportion of global average features for each expression, which improves the feature learning of the classic network on the FER dataset.
Most of the previous FER strategies are to extend the network structure or supplement the algorithm on the classic CNNs. While our ARM is an internal modification of the Convolutional Neural Network (CNN). By training the ARM variant of ResNet-18 [resnet] on RAF-DB, we promote the performance of pre-trained ResNet-18 from 86.4 to 90.55, which is superior to the-state-of-art methods.
Our contribution is mainly divided into the following parts:
(1) We put forward for the first time the Padding Challenge, which is to eliminate the albino erosion existing in the padding convolution.
(2) We propose a de-albino strategy, which serves all the convolutional neural networks, to protect representations from albino features.
(3) We elaborately design an auxiliary block to rearrange the feature map, enhancing the de-albino effect.
(4) We make the first attempt to optimize the FER with affinity features among different facial expressions.
(5) We design a minimal random resampling scheme to deal with data imbalance.
2 Related Work
2.1 Convolutional Neural Networks
In the late 1990s, Yann LeCun [lecun]
, in order to solve the task of handwritten digit recognition, proposed LeNet, in which he first applied BP algorithm to the training of neural network structure. This feat established the modern structure of CNN: convolutional layer, pooling layer, and fully connected layer. However, ascribe to the lack of data and computing power, LeNet is difficult to deal with complex problems. Despite the effectiveness of identifying numbers, LeNet is not as good as machine learning algorithms in practical tasks.
Gradually, with the rapid progress of the GPU unit and the completion of large-scale datasets, CNN came back to the attention of researchers. In the 2012 ImageNet Image Recognition Contest, AlexNet, a brand new structure introduced by Hinton[alexnet]
, was the winner by an absolute advantage of 10.9 percentage points over the runner-up. It came to use padding to balance the weights of the pixels between the edges and the inside. Instead of Sigmoid, they applied the ReLU[relu]activation function to solve the gradient dispersion of the deep network. The proposed Dropout and Data Augmentation methods [alexnet] are effective in preventing overfitting. It was the first time that people realized that the structure of neural networks had a lot of room for improvement. In the years that followed, various CNNs sprang up.
In the 2014 ILSVRC Competition, GoogLeNet [googlenet], the winner, was characterized by the application of the Network in Network structure, that is, the original node is also a Network. Karen Simonyan and Andrew Zisserman [vgg]
who proposed the VGG architecture won second place. They tended to use smaller kernels to deepen the network. This attempt effectively improved the performance of the model. It was also proved that VGG had a good ability to generalize on other datasets and outperformed GoogLeNet in multiple transfer learning tasks. In 2015, He Kaiming[resnet] from MSRA solved the degradation problem of the extremely deep network through the cross-layer transmission of Identity. They deepened the network to hundreds or even thousands of layers. The error rate also dropped to 3.6%, lower than the human rate of 5.1%. In the last ILSVRC competition, Hu J [senet] proposed SENet, which recalibrates the channel dimension by squeezing features in the spatial dimension and extracting the correlation between channels, thus further reducing the error rate.
2.2 Facial Expression Recognition
Facial Expression Recognition is a classification task of identifying the facial changes in response to human emotions from visual information. This technology enables machines to understand human emotions and make judgments in response to mental states. The research on facial expressions developed earlier in psychology and medicine. As early as the 19th century, Charles Darwin and Phillip Prodger [darwin] explained the difference and connection of facial expressions between humans and animals. In 1971, Ekman and Friesen [ekman] defined the six basic human expressions (that is happiness, sadness, surprise, fear, anger, and disgust), then systematically established a facial expression image library, which describes the details of each expression. In 1978, Suwa [suwa] proposed automatic expression analysis in facial video. However, due to the limitations of contemporary computer capabilities, facial expression research in the computer field started relatively late. In the 1990s, Kenji Mase [mase] proposed to leverage the optical flow method for automatic expression analysis, opening the computer era of facial expression recognition. In 2001, Tian [tian]
utilized action units (AU) instead of the traditional expression images to classify facial expressions, creating a precedent for automatic expression analysis.
Early feature learning methods (, non-negative matrix factorization (NMF) [nmf], local binary patterns (LBP) [lbp], histograms of oriented gradients (HOG) [hog], scaled-invariant feature transform (SIFT) [sift], etc.) are costly and inefficient, remaining at the level of manual annotation and calculation. As datasets expand into the wild, such as FER2013 [fer2013], AffectNet [affectnet], EmotioNet [emotionet], SFEW 2.0 [sfew2.0], and RAF-DB [raf-db]
, the engineered learning methods are difficult to cope with data extremely large-scale and complex. With the rapid development of GPU units, such algorithms have gradually faded and been replaced by deep learning. Not only learning deep features, but convolutional neural networks can also dynamically adjust neuron weights based on the deviation between the prediction and the target. Heechul Jung[jung] design an integration network to process facial images and landmark points jointly. Huiyuan Yang [yang] believe that expression is a superposition of neutral components and expression components. They exploit a network that generates neutral expressions to isolate residual expression components. Feifei Zhang [zhangfeifei] combine different poses and expressions to train GAN for pose-invariant expression recognition. Kai Wang [scn] propose the Self-Cure algorithm to relabel uncertain expressions, which inhibited the network from overfitting incorrectly labeled samples. In order to recognize the occluded expressions, Bowen Pan [pan] apply a network pre-trained on an unoccluded data set to guide the fine-tuning of the occluded network. To solve the problem of label inconsistency when multiple datasets are merged, Jiabei Zeng [zeng] generate multiple pseudo-labels for images and then train the model in this environment to learn the underlying truth. Shikai Chen [chenshikai] generate multi-labels for label distribution learning by classifying action units and facial landmarks separately.
The padding is crucial to convolutional layers because of its functionality. Although boosting the performance of the network [lecun, alexnet], it may be harmful to the representation. Sticking to the regular pooling layer does nothing about the padding’s negative impacts. Conjunctively, considering the particularity of the Facial Expression Recognition (FER) task, that is, the affinity features between facial expressions, we propose the Amend Representation Module (ARM). In this section, we overview the ARM and then describe the function of each component in detail.
Given a dataset of facial expressions, firstly, we extract the feature of each image with the front part of the ResNet-18 [resnet], which is bounded by Global Average Pooling (GAP). Then, the learned feature map is fed into our ARM, which outputs high-quality representations.
As shown in Figure 3, our ARM consists of three crucial blocks, namely Feature Arrangement (FA) block, De-albino (DA) block, and Sharing Affinity (SA) block. The FA block is an auxiliary block to amplify the function of the DA block. The latter realizes the weight distribution of features by means of convolution. Naturally, the weight with a higher degree of albino erosion is lower, and vice versa. The final SA block extracts global average representation (GAR) over mini-batch and then fuses it with individual representation according to their affinities.
3.2 Feature Regularization
Due to convolutional padding, the formation of albino features is inevitable. With reference to the position of the padding, it can be inferred that the edges and corners are where the de-albino features gather. The albino pixels are evenly dispersed in each channel, which is not conducive to “targeted therapy” because a slight expansion of the de-albino range may damage the key information. We rearrange the feature pixels into a single channel on the premise of the minimum relative position change. So that all pixels with the same degree of the albino are concentrated from the outside to the inside. The specific implementation is shown in Figure 2. Then a no-padding block is used to weaken the weight of the albino information at the edge and neutralize the baneful influence of padding.
3.3 Assigning Weights by Means of Convolution
To eliminate the impact of albino features, we elaborately designed a dedicated convolutional layer to dilute the weight of the peripheral pixels. It is characterized by no padding, large convolutional kernel, and big stride.
As shown in Figure 2, the most eroded points are concentrated at the edge of the rearranged feature map. We subtly exploit the perception bias illustrated in Figure 4 to reduce the weights of albino points. The pixels at the edge are perceived less frequently by the convolutional kernel than the internal ones. Considering the absolutely large scale of the concentrated albino points, a regular-sized convention kernel is not suitable for our method. And it can be concluded that the larger the convolutional kernel, the more information is ignored, which happens to favor the network in this case. Because of abundant channel information flowing into space, the convolutional kernel that we take should be several times larger than the regular one. Obviously, the convolutional kernel has to be single-channel, consistent with the feature map. The stride of the kernel also has an effect on the perception of peripheral points. A shorter stride means less neglect, which limits the effectiveness of our de-albino block. In the trade-off between ignoring the albino features and retaining the key information, we set the convolution stride to about half of the kernel size.
3.4 Sharing Affinity
Different from other classification tasks, facial expression recognition (FER) is a task based on face variants. Since the face is the main carrier, there are affinity features between the facial expressions. Convolutional neural networks tend to assign weights to features that favor evaluation indicators, which makes the learning short-sighted. We force the network to learn the global representation of the face to improve the recognition of unpopular categories.
We introduce the SA block to optimize the representation learning of the network. Firstly, given a batch of facial expressions, we extract the representations with FA and DA blocks, froming a set of representations , where and denote the batch size and the original representation of the -th image, respectively. Affinity is extracted by averaging global representations, generating the global average representation as follows:
And then we incorporate it into the representation of each expression as follows:
where denotes the synthetic representation output by SA block. The affinity is denoted as , which controls the contribution of the global average representation to . is a tunable parameter, which can be either a fixed value or a ratio that learns as the network iterates.
The confusion matrixes on the test set for the RAF-DB, AffectNet (7cls & 8cls), and FER2013.
In this section, the ARM, proposed deep learning architecture, is evaluated on three FER datasets. We first explain why we choose these datasets and what they are. A description of metrics followed closely. We introduce the overall details of the experiment and indicate the fine-tuning on the AffectNet dataset. Then we conduct a verification experiment to demonstrate the adverse effect of albino features. To evaluate each block of the ARM, an ablation study is performed on RAF-DB dataset. Finally, our ARM is compared with the state-of-the-art methods.
To evaluate our module, we select three public FER datasets collected in the wild that fully reflect real-world emotions. Compared with those produced in the laboratory, there is no deliberate pose before generating. They are more complex, variable, and challenging for feature learning. The details are as follows.
RAF-DB [raf-db] is a crowdsourced dataset. Each image was individually labeled by 40 taggers for the sake of rigor. There are two distinct subsets, the basic one being single-tagged and the compound one being double-tagged. Only the former is used in our experiment. The basic dataset contains seven categories: surprise, fear, disgust, happiness, sadness, anger, and neutral. The number of the training set and testing set is 12,271 and 3,068 respectively, and the expressions of them have near-identical distribution.
AffectNet [affectnet] is the largest facial expression dataset so far, with a total of more than one million images. The main collection methods are searching through three search engines and downloading from the Internet. Images are annotated in two ways: manual and machine. Additionally, it contains ten categories, eight of which are basic expressions. In the manually labeled subset, excluding the extra contempt, the same seven basic expressions as in RAF-DB are used, namely 283,901 training images and 3,500 validation images. We also conduct eight types of basic expression experiments for better comparison. Because of the lack of a testing set, we fill this void with the validation set. Remarkably, while the categories are balanced on the validation set, they are extremely unbalanced on the training set, with a maximum gap of over 35 times.
FER2013 [fer2013] contains 28,709 training samples, 3,589 validation samples and 3,589 test samples, all of which are 48x48 pixel grayscale faces. These facial images also fall into seven basic categories and have close distribution in three sets.
4.2 Metrics and Results
The overall sample accuracy is the most common and straightforward metric, namely accuracy or weighted accuracy [psr]. In simple terms, it is calculated by dividing the number of correctly predicted samples by that of testing samples. Unlike the former, mean class accuracy, which is also called unweighted accuracy [psr], is the average of the accuracy of each category. The sum of the accuracy of each class is first calculated and then divided by the number of classes.
Ascribe to the learning bias of CNN, categories that are easy to distinguish or have a large number of samples will gain more weight in neurons [imbalance]. Under the circumstances, the overall accuracy determined by a biased category may still be at a high level, whereas its performance is certainly not as good as the accuracy appears. To prevent overall sample accuracy from masking the illusion of good performance, we introduced unweighted accuracy as a backup metric.
Figure 5 shows the confusion matrixes of the ARM (ResNet-18) on RAF-DB, AffectNet (7cls & 8cls), and FER2013. It’s rarely seen in other strategies that neutral hits the highest accuracy of 98% on RAF-DB, slightly exceeding the happiness.
4.3 Implementation Details
Prior to training the network, all the images are resized to 224x224x3 pixels. Specifically, the images of FER2013 are single-channel. We duplicate its channel twice to meet the requirements. The backbone network is part of ResNet-18, where the layers before the GAP are retained and initialized with weights pre-trained on ImageNet[imagenet]. We leverage this part of the network to extract the albino features, which are in the size of 512x7x7.
Based on the rearrangement principle described in Section 3.2, only the feature map of two channels can be obtained because the channel number is not an integer multiple of 4. As the channels of the feature map decrease, the spatial area of the feature map increases to 112x112 pixels. It is essential to emphasize that the feature map does not ideally lie flat in a single channel, whereas its channels should still be considered parallel. Considering the correlation of the two channels, the kernel of the De-albino block must be single-channel. The point set composed of features that are from the same position of each channel is called feature clusters. It’s in the size of 16x16x1.
First, there is no doubt that padding will not be in our ARM. Then a proper convolutional kernel is definitely crucial to the experimental results. Obviously, if the size of the convolution kernel is slightly larger than that of the feature cluster, sometimes a feature cluster plays an absolute role within the sphere of convolution kernel, which weakens the diversity of the representation. So we set the convolutional kernel as 32x32x1 pixels, much larger than the feature cluster. Meanwhile, the stride of the convolutional kernel is set to half of the cluster size, which is 8. The two channels output from the DA block are averaged as the representation of each expression.
As for the SA block, the gain hyperparametervaries depending on the affinity within the dataset. By default, it is initialized to 1.0 and updated with the gradient.
According to the outputs of these three blocks, we can efficiently obtain high-quality representations from the variant of ResNet-18. To optimize the parameters of the network, we adopt the Adam solver [adam] with mini-batch size 256 on RAF-DB. The learning rate is initialized as 0.001, and we apply a decay learning rate strategy with a coefficient of 0.9.
The Amend Representation Module (ARM) is implemented using the PyTorch toolbox[pytorch]. All of the reported results are obtained by running the Python code on an NVIDIA GeForce RTX 3090 GPU.
4.4 Fine-tuning on the AffectNet Dataset
As shown in Table 1, the sample numbers in each category in AffectNet [affectnet] is extremely unbalanced. Intuitively speaking, in the 8 basic categories, the ratio of the largest to smallest sample number is as high as 35.
For superior performance, we prudently designed a minimal random resampling (MRR) strategy to balance the numbers of categories. The implementation details are as follows: First, we assign a probability of selection to each expression, which is the reciprocal of the sample size. In this case, each image has the same probability of being picked after a random selection. During training, the sample size of each category is uniformly set to be consistent with the minimum category. Although the size of our training set has shrunk greatly from the original, we retain the scale advantage of the original dataset beacause the training set is different for each epoch. After multiple epochs of training, even the largest category can basically be traversed. It should be noted that we did not discard the massive amount of original data like Undersampling because the training set of each epoch is randomly re-sampled. The Oversampling strategy is not adopted because Oversampling will duplicate the minimum category dozens of times, which inevitably leads to serious overfitting of the model. The samples inside the superclasses may be completely different in different epochs of training. After all, these classes are only sampled at a minimum of one-35th, so we reduced the learning rate by adjusting its attenuation rate from 0.9 to 0.78.
4.5 The Adverse Effect of Albino Features
For the sake of scientific rigor, we conduct an independent experiment to demonstrate the negative effect of albino features. For the network part, ResNet-18 [resnet] was modified to meet experimental requirements. Specific implementation details are as follows: De-albino block is inserted to replace the GAP to carry out convolution processing on the output 512x7x7 feature map. In the De-albino block, the kernel size is denoted as , adjustable and the stride is fixed as 1. Finally, keep the fully connected layer and make it adaptive.
The experimental results are shown in Figure 6. Viewpoints can be analyzed from the figure. With the increase of , the gain effect presents a hump-like trend which fits with the fluctuation of the perception frequency ratio between the center and the edge. Experiments on RAF-DB show beyond doubt that, by reducing the weight of albino features through our De-albino block, the quality of representation can be effectively improved, thus slightly boosting the performance of the model. On the other hand, albino features do erode the representation of images.
4.6 Ablation Study
The effectiveness of three blocks in ARM. In order to better understand the respective function of each block, we conducted an ablation study to evaluate them on RAF-DB. We show the experiment results in Table 2. According to the original intention of the design, the Feature Arrangement block is auxiliary, so we did not conduct a separate experimental evaluation for it. Analyzing the data, we can draw the following points. First of all, the main functional blocks, namely the DA and SA blocks, have obvious effects, and they can improve the network performance separately. Secondly, the DA block and the SA block react differently with the FA block. With the assistance of the FA block, the DA block can increase performance by 1% again, but the SA block does not seem to benefit from it. Observing the complete coordination performance of the three blocks, it can be seen that the SA block performs significantly on high-quality representation purified by the FA block and DA block. Therefore, it is better to use the SA block when the current representation is sufficiently accurate.
4.7 Method Comparison
In Table 3, we compare the proposed method with a series of state-of-the-art baselines, which are briefly described below. IPA2LT [zeng] introduces a method of enriching dataset, which is creating a multi-label dataset by manual annotation and model prediction, to train an end-to-end network that can discover the latent truth. RAN [ran] proposes a cascaded attention network based on the part and the whole of the face. SCN [scn] relabels low-quality images to suppress the influence of annotation errors. PSR [psr] develops a multi-branch pyramid network, which can split images of different sizes for processing, to solve the problem of different image sizes.
The additional network parameters and running time triggered by our method are negligible. Our ARM outperforms current state-of-the-art methods by a large margin, with 90.55%, 64.49%, and 71.38% on RAF-DB, AffectNet, and FER2013, respectively.
For feature extraction, properly we are not able to reject padding directly because of data loss. But we can eliminate its side effects before representation by replacing the pooling layer with the De-albino block. Not only that, we also designed an auxiliary Feature Arrangement block to assist it. The Sharing Affinity block shares the global representation with each expression, which strengthens the representation learning. Of course, this is limited to tasks similar to facial expression recognition, in which, all samples are variants of human faces. There is no special customization for the de-albino method, just the perception bias of convolution. More targeted methods will be explored in our later work.