MicroExpNet: An Extremely Small and Fast Model For Expression Recognition From Frontal Face Images
This paper is aimed at creating extremely small and fast convolutional neural networks (CNNs) for the problem of facial expression recognition (FER) from frontal face images. We show that, for this problem, translation invariance (achieved through max-pooling layers) degrades performance, especially when the network is small, and that the knowledge distillation method can be used to obtain extremely compressed CNNs. Extensive comparisons on two widely used FER datasets, CK+ and Oulu-CASIA, demonstrate that our largest model sets the new state of the art, improving over the previous best results on both datasets (by 1.8% on CK+). In addition, our smallest model (MicroExpNet), obtained using knowledge distillation, is less than 1MB in size and works at 1408 frames per second on an Intel i7 CPU. Although slightly less accurate than our largest model, MicroExpNet still achieves an 8.3% improvement on the Oulu-CASIA dataset over the previous state-of-the-art, much larger network; and on the CK+ dataset, it performs on par with a previous state-of-the-art network while using 154x fewer parameters.
Expression recognition from frontal face images is an important aspect of human-computer interaction and has many potential applications, especially on mobile devices. Face detection models have long been deployed on mobile devices, and relatively recently, face recognition models are also being used, e.g. for face-based authentication. Arguably, one of the next steps is the mobile deployment of facial expression recognition models. Therefore, creating small and fast models is an important goal. To get a sense of the current situation, we looked at the sizes and runtime speeds of two representative, current state-of-the-art models, namely PPDN and FN2EN. In terms of the total number of parameters, both models are in the order of millions (PPDN has 6M and FN2EN has 11M). In terms of runtime speed, both models run at a few milliseconds per image on a GTX GPU; on an Intel i7 CPU, however, both are considerably slower. Further details on size and speed can be found in Tables 6 and 7.
The central question that motivated the present work was how far we could push the size and speed limits while still ending up with a compact expression recognition model that works reasonably well. To this end, we first explored training large models on two widely used benchmark FER datasets, CK+ and Oulu-CASIA. Then, using the “knowledge distillation” method, we were able to create a family of small and fast, yet reasonably accurate models. Our smallest model, which we dub MicroExpNet (Fig. 1), is extremely compact (less than 1MB in size) and fast (1408 frames per second on an Intel i7 CPU). Such a model would enable deployment on mobile devices. MicroExpNet is slightly less accurate than our largest model; however, compared to much larger previous state-of-the-art networks, it still yields better or on-par results while using a much smaller number of parameters.
In this work, we focused only on frontal face images. First, we trained a large network (called the teacher or TeacherNet), and then used the “knowledge distillation” method 
to train compact networks. In this method, the softmax output of the teacher network is used to “guide” the training of the student network via a hyperparameter called the “temperature.” We conducted comprehensive temperature and network-size experiments. Compared to the teacher network, MicroExpNet (i.e. the smallest student network) is orders of magnitude smaller in size, has far fewer parameters, and is substantially faster (see Tables 6 and 7). We also hypothesized that invariance to translation – typically achieved using max-pooling layers – would not be useful for the facial expression recognition problem, as expressions are sensitive to small, pixel-wise changes around the eyes and the mouth. In our experiments, we found that, especially for small networks, max-pooling indeed hurts performance.
Our contributions in this work are three-fold:
We show that it is possible to create extremely small and fast, yet accurate, models for the FER problem. Our smallest model, MicroExpNet, has on the order of thousands of parameters (cf. 6M in PPDN and 11M in FN2EN) and runs at 1408 frames per second on an Intel i7 CPU, far faster than both PPDN and FN2EN.
We show that translation invariance degrades the performance in FER – especially when the network is small (see Section 4.1). To the best of our knowledge, all current state-of-the-art methods use a mix of max and average pooling. Our finding could guide the design of future models on this problem.
We show that the effect of “knowledge distillation” (compared to training from scratch) increases as the network size gets smaller. To the best of our knowledge, this kind of analysis has not been done before. Whether this effect is specific to the FER problem is yet to be seen (left as future work).
In order to support our contributions, we provide standard classification performance comparisons on the CK+ and Oulu-CASIA datasets, a parameter-count comparison, a runtime-speed comparison, a max-pooling vs. no-pooling analysis, and extensive temperature and size combinations to answer “what if” questions.
We categorize the previous work as image- (or frame-) based and sequence-based. While image-based methods analyze individual images independently, sequence-based methods exploit the spatio-temporal information between frames.
There are three groups of work: models that use 1) hand-crafted features (HCFs), 2) deep representations, and 3) both. Our work falls into the second group.
We do not focus on HCF models [2, 7, 32, 38] here because they have become obsolete with the emergence of deep models and, in general, do not achieve competitive results. These works typically extract Local Binary Patterns (LBP), SIFT, or Gabor features and use an SVM or AdaBoost on top of these features as the classifier.
An earlier work proposed a loopy boosted deep belief network framework consisting of bottom-up unsupervised feature learning and a “boosted top-down supervised feature strengthening” process; the final features are combined to form an AdaBoost classifier. Our approach contains no feature selection or “strengthening” operation, and we do not employ an AdaBoost classifier. Mollahosseini et al. introduced an inception network for expression recognition. They preprocessed face images by applying an affine transformation using facial landmark points. Their model is much larger than ours, considering the two large fully connected layers at the end of their network; in addition, we use no inception layers or preprocessing step for face images. Zhao et al. proposed a peak-piloted deep network architecture in which both peak and non-peak expression images are used as pairs during training. The architecture uses GoogLeNet as its basis and employs the peak-piloted approach in the last two fully connected layers. Training peak and non-peak images in pairs naturally requires their proposed back-propagation algorithm, which adds implementation complexity compared to our work. FaceNet2ExpNet employs a multi-stage training procedure for expression recognition: the authors first train five convolutional layers under the supervision of a pre-trained FaceNet’s pool5 layer outputs, then append a fully connected layer to the trained convolutional layers and train the network to classify expressions, similar to the FitNet approach. In our work, the teacher network is trained in isolation and its guidance is applied only through its soft targets (see Section 3.1 for details). Recently, Kim et al. introduced a deep generative contrastive model for facial expression recognition, combining encoder-decoder networks and CNNs into a unified network that simultaneously learns to generate, compare, and classify samples in a dataset.
We can categorize sequence-based facial expression classifiers into the same three groups as image-based classifiers.
We do not focus on HCF-based sequence models [11, 9, 30, 31] for the following reasons: 1) we do not use any HCFs, and 2) HCF-based models became obsolete with the emergence of deep models and do not yield competitive results.
Deep representations are the core ingredient of [21, 5]. Liu et al. identified the facial expression recognition task as a combination of temporal alignment and semantics-aware dynamic representation problems, and proposed a manifold modeling of videos based on mid-level deep representations (expressionlets). These representations are gathered via learned spatio-temporal filters. Kahou et al. fused CNNs with recurrent neural networks (RNNs) to build a hybrid model working on videos: a CNN is used on static images to gather high-level representations, and the final classification is done by an RNN trained with those representations. They also used an SVM to process audio, which is beyond the scope of this paper.
In a study by Jung et al., both HCFs and deep representations are used to form a hybrid approach. They presented two deep network models: first, a 3D-CNN that extracts temporal appearance features from image sequences; second, a fully connected deep neural network that captures geometrical information about the motion of facial landmark points. By combining these two networks, they built a model that uses both landmark points and image sequences.
Romero et al. built their FitNets using the “knowledge distillation” method to produce deep and thin student networks with comparable or better performance than the teacher. They built student networks that are thinner but deeper than their teacher, training some layers of the student beforehand with the teacher’s supervision for better initialization, and then training the whole student network using knowledge distillation to finalize the model. They applied their model to object recognition, handwriting recognition, and face recognition, where the FitNet failed to outperform the state-of-the-art solutions but achieved superior performance against its teacher. To the best of our knowledge, we are the first to apply knowledge distillation to the facial expression recognition (FER) problem. In addition, we choose a model that is much shallower than the teacher and avoid any pre-training of the student, to keep the overall training procedure simple. Another important point is that Romero et al. did not give much information on the selection of the temperature parameter, for which we provide a systematic analysis.
Iandola et al.  proposed a CNN with no fully connected layers to reduce the model size, and preserved the classification performance via their fire modules. Like FitNets, they also did not test their model on FER.
Knowledge distillation was introduced by Hinton et al.  in 2015. The main idea is to have a cumbersome network called the teacher to supervise the training of a much smaller network called the student
via soft outputs. The algorithm is as follows: first, a large teacher network is trained for the task using an empirical loss calculated with respect to the one-hot vectors of the true labels. Then, a much smaller student network is trained using both the one-hot vectors of the true labels and the softened softmax outputs of the teacher network (Eq. 2). The aim is to increase the information about the target classes by introducing uncertainty into the probability distributions. Since these distributions contain similarity information on different classes, Hinton et al. further used this similarity information coming from the teacher to correctly classify a target class intentionally removed from the student’s training set. Additionally, in order to prevent the teacher’s strong predictions from dominating the similarity information, the softmax logits of the teacher are softened using a hyperparameter called the temperature, denoted T in Eq. 1.
Formally, let σ_T denote the softmax with temperature T (Eq. 1), p be the softened output of the teacher’s softmax (Eq. 2), z be the logits of the teacher, q be the hard and q_T be the soft output of the student’s softmax, v be the logits of the student, λ be the weight of distillation, y be the ground-truth labels, n be the batch size, and H denote the cross-entropy. Then:

    σ_T(x)_i = exp(x_i / T) / Σ_j exp(x_j / T)            (Eq. 1)
    p = σ_T(z),   q = σ_1(v),   q_T = σ_T(v)              (Eq. 2)

and the loss becomes

    L = (1/n) Σ [ λ · H(y, q) + (1 − λ) · H(p, q_T) ]     (Eq. 3)
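As a concrete illustration, the softened softmax and the combined distillation loss can be sketched in a few lines of NumPy (the temperature T and weight λ are hyperparameters; the variable names below are ours):

```python
import numpy as np

def softened_softmax(logits, T):
    """Softmax with temperature T (Eq. 1); larger T yields a softer distribution."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(teacher_logits, student_logits, y_onehot, T=8.0, lam=0.5):
    """Weighted sum of hard-label and soft-target cross-entropies (Eq. 3)."""
    p = softened_softmax(teacher_logits, T)       # teacher soft targets
    q_soft = softened_softmax(student_logits, T)  # student softened predictions
    q_hard = softened_softmax(student_logits, 1.0)
    eps = 1e-12  # avoid log(0)
    hard_ce = -(y_onehot * np.log(q_hard + eps)).sum(axis=-1)
    soft_ce = -(p * np.log(q_soft + eps)).sum(axis=-1)
    return float(np.mean(lam * hard_ce + (1.0 - lam) * soft_ce))
```

In practice the soft term is often additionally scaled by T² so that its gradient magnitude stays comparable across temperatures; we omit that detail here for clarity.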
In this work, we use two convolutional networks, namely the teacher and the student. The teacher is deep and large, whereas the student is shallow and small. There are several versions of the student network with different numbers of parameters; we call our smallest network MicroExpNet.
The student networks consist of convolutional layers with ReLU as the activation function and a final fully connected layer as a bridge to the softmax. We hypothesized that translation invariance would not be useful for the FER problem, as facial expressions seem to be sensitive to pixel-wise location changes. We report a detailed examination of max-pooling vs. no pooling in Section 4.1. After this analysis, we decided not to have any pooling layers. Next, we squeezed the student network by reducing the size of its last fully connected layer to obtain a fairly compact CNN, and we used the knowledge distillation method to retain high performance. We created four student models, from largest to smallest: M, S, XS and XXS, to determine the most suitable size-performance balance for our final proposal. Table 1 presents the architectures of these four models. We compare their classification performances in sections 4.2 and 4.3, speeds in Section 4.6, and memory requirements in Section 4.5.
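The size-performance trade-off from shrinking the last fully connected layer can be illustrated with simple parameter counting. The layer widths below are hypothetical and for illustration only, not the actual MicroExpNet configuration:

```python
def conv2d_params(in_channels, out_channels, kernel_size):
    """Weights plus biases of a square convolutional layer."""
    return in_channels * out_channels * kernel_size ** 2 + out_channels

def fc_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

def student_param_count(fc_width, num_classes=8, feat_dim=16 * 12 * 12):
    """Total parameters of a hypothetical conv-conv-fc-fc student.

    Shrinking `fc_width` (the penultimate FC layer) dominates the total,
    which is how M/S/XS/XXS-style variants would differ in size.
    """
    total = conv2d_params(1, 8, 5)        # conv1: grayscale input
    total += conv2d_params(8, 16, 5)      # conv2
    total += fc_params(feat_dim, fc_width)
    total += fc_params(fc_width, num_classes)
    return total
```

Since the flattened feature dimension feeds the first FC layer, almost all of the parameter budget sits in that layer, so halving its width roughly halves the model size.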
We choose to use the Deep Residual Learning 
network (ResNet) as the teacher network for its proven record of success on classification tasks. ResNets won first place in ImageNet classification, ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation. Specifically, we employed the 50-layer ResNet due to memory constraints. Although ResNet-50 is far from being the largest of its genre, it meets our requirement for a large and deep teacher network. It is denoted as TeacherExpNet throughout the paper. We present its classification performance in sections 4.2 and 4.3, speed in Section 4.6, and memory usage in Section 4.5.
| Model | # of Parameters | Architecture |
For each image in CK+, we apply the Viola-Jones face detector; for Oulu-CASIA, we use the already-cropped versions. All images are converted to grayscale. Then, in order to augment the data, we extract crops from the corners and sides of each image; the students and the teacher use different input resolutions. There is no difference in hyperparameter selection between the two datasets; therefore, the settings explained below apply to both the CK+ and Oulu-CASIA experiments.
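For illustration, a corner-crop augmentation routine of this kind can be sketched as follows. The exact crop count and sizes used in the paper are not reproduced here; this sketch assumes the common five-crop scheme (four corners plus center):

```python
import numpy as np

def corner_and_center_crops(img, crop_h, crop_w):
    """Five-crop augmentation: four corner crops plus the center crop.

    `img` is a 2-D grayscale array; returns a list of five sub-images,
    each of shape (crop_h, crop_w).
    """
    h, w = img.shape[:2]
    assert crop_h <= h and crop_w <= w
    tops = [0, 0, h - crop_h, h - crop_h, (h - crop_h) // 2]
    lefts = [0, w - crop_w, 0, w - crop_w, (w - crop_w) // 2]
    return [img[t:t + crop_h, l:l + crop_w] for t, l in zip(tops, lefts)]
```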
We employ a 50-layer ResNet trained on the ImageNet training set and fine-tune it on a FER dataset. The base learning rate is kept constant throughout the iterations, and the same weight decay, momentum, mini-batch size, and number of epochs are used for all fine-tuning runs on ResNet-50. As previously done in the literature, the results we report for this model are obtained by averaging the 10-fold cross-validation performances.
We use the same hyperparameters across all of the different model sizes for both vanilla and student trainings. “Vanilla” training means that the network is trained from scratch without any teacher guidance. Weights and biases are initialized using Xavier initialization
. Network architectures are implemented in TensorFlow, and the Adam optimizer is adopted as the optimization algorithm. The base learning rate, dropout rate, mini-batch size, and distillation weight (see Section 3.1) are shared by all student models. The selected model sizes are produced by decreasing the size of the last fully connected layer (see Table 1). Training is finalized after the same number of epochs for all models, and an additional XXS student model, saved after further epochs, is our proposed model, denoted MicroExpNet, which can be considered a graduated student model. Empirical results are given in Table 4 and Table 5; note that for student networks we only report the best performers across different temperatures (selected using cross-validation). Student models are also used in the temperature selection tests (see Section 4.4 for a detailed explanation). The results we report for these models are obtained by averaging the 10-fold cross-validation performances.
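Since Adam is the optimizer used throughout, its per-parameter update rule can be sketched as follows (a minimal NumPy version of Kingma & Ba’s algorithm; the step size and decay rates here are illustrative defaults, not the paper’s settings):

```python
import numpy as np

def adam_step(w, grad, state, lr=1e-4, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update; `state` holds the moment estimates (m, v) and step t."""
    m, v, t = state
    t += 1
    m = b1 * m + (1 - b1) * grad          # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2     # second-moment (uncentered var) estimate
    m_hat = m / (1 - b1 ** t)             # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, (m, v, t)
```

On a simple quadratic, repeatedly applying this update drives the parameter toward the minimum, which is the behavior the training loops above rely on.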
Facial expressions are located mostly around the eyes and the mouth, and they form only a small fraction of a frontal face image. The idea is to capture these subtle indicators of an emotion by preserving pixel information across layers. Therefore, our starting point was a CNN with no pooling layers. However, in order to validate our intuition, we built three variations containing max-pooling layers for each student, all using the same pooling filter size and stride. All hyperparameters mentioned in Section 3.3 apply to these variations as well. We call them candidate expression networks; these candidates are explained in Table 2.
From the results in Table 2, we draw the following conclusions. When models are large enough, the provided learning capacity dominates pooling effects: for size M, the classification performances of the candidates are very close to each other. For size S, pooling in later layers drops the performance, but early pooling is still the most profitable. Beyond this point (XS and XXS), we begin to see the advantage of not having any pooling layers, with significant gains in performance. Combining this observation with our intention to reduce the model size, we decided to employ the architecture with no pooling layers as the foundation of our student networks.
Note that adding a pooling layer drops the number of parameters and thus prevents a proper performance comparison. Therefore, we made two modifications to increase the model size and make the comparison fair. First, when we add a pooling layer after the first convolutional layer, we decrease the stride of the first convolutional layer, which directly recovers all of the parameters that were lost. Second, when we add a pooling layer after the second convolutional layer, we increase the number of outputs of the first fully connected layer accordingly. This results in slightly more parameters than the original (CandidateExpNet).
CK+ is a widely used benchmark database for facial expression recognition. It is composed of image sequences with eight emotion labels: anger, contempt, disgust, fear, happiness, sadness, surprise, and neutral. As done in previous work, we extract the last three frames and the first frame of each expression sequence when the images are labeled; when unlabeled, we only extract the first frame as neutral. The total number of images is given in Table 3, and the data is split into 10 folds. Each fold contains an equal number of frames for each emotion, except the last fold, which also contains the leftover frames resulting from the division by 10. Viola and Jones’ face detection algorithm is used to crop the smallest rectangular area containing the face from each image.
We evaluate the pre-trained ResNet-50 via fine-tuning on CK+. Then, we train the four vanilla models (VanillaExpNet-M, -S, -XS, and -XXS) from scratch; at this stage, we do not employ knowledge distillation. All models are trained for the same number of epochs, and the classification performances are shown in Table 4. In the light of these results, we choose ResNet-50 as the teacher for the knowledge distillation stage.
We evaluate the four students (StudentExpNet-M, -S, -XS, and -XXS) via knowledge distillation on CK+. At this stage, we use the teacher’s supervision to improve the learning. As explained in Section 3.3, we need to tune the temperature for each student since it is regarded as correlated with model size. Therefore, we conducted an extensive experiment on classification performance over a wide range of temperatures. The results are reported in Figure 2. According to these results, performance fluctuations increase as the models get smaller, suggesting that large networks are more tolerant to changes in the temperature. The best performers across different temperatures, in terms of average 10-fold cross-validation performance, are then used for the performance comparison in Table 4. Our findings (see Fig. 3) show that knowledge distillation can be used to regain some of the performance lost by decreasing the model size.
| Method | Accuracy | # of Classes |
Oulu-CASIA has 480 image sequences taken under dark, weak, and strong illumination conditions. In this experiment, as also done in previous work, we use only the videos captured under the strong illumination condition by a VIS camera. In total, there are 80 subjects and six expressions: anger, disgust, fear, happiness, sadness, and surprise. Similar to CK+, the first frame is always neutral while the last frame has the peak expression. All studies we have encountered on the Oulu-CASIA database use only the last three frames of the sequences, so we use the same frames; the total number of images is therefore 1440. A 10-fold cross-validation is performed with random splits.
We fine-tune the pre-trained ResNet-50 on the Oulu-CASIA dataset. Then, we train the four vanilla models (VanillaExpNet-M, -S, -XS, and -XXS) from scratch; at this stage, we do not employ knowledge distillation. All models are trained for the same number of epochs, and the classification performances are shown in Table 5. Since we want the best performer for supervision, we again choose ResNet-50 as the teacher for the knowledge distillation stage.
We evaluate the four students (StudentExpNet-M, -S, -XS, and -XXS) via knowledge distillation on Oulu-CASIA. At this stage, we use the teacher’s supervision to improve the learning. As explained for CK+, we again need to tune the temperature for each student, so we conducted an extensive temperature selection experiment for the Oulu-CASIA dataset as well. The results are reported in Figure 4, in which we observe a similar fluctuating behavior to that seen in the CK+ experiments. Once again, large networks are more tolerant to changes in the temperature than smaller ones. The best performers across different temperatures are then used for the performance comparison in Table 5. We again observe that the student models perform better than the vanilla models (trained from scratch without any teacher supervision) for facial expression recognition.
Temperature is a tool for letting the uncertainty of the teacher network emerge. This uncertainty may be used as similarity information between different classes to enhance training. However, there is no formula for selecting the most effective temperature; it is set empirically. We performed a grid search over the temperatures [2, 4, 8, 16, 20, 32, 64, 128] with 10-fold cross-validation across all of our student networks using both the CK+ (see Figure 2) and Oulu-CASIA (see Figure 4) datasets. According to the results, smaller models are more sensitive to temperature changes in general, and performances for a given temperature appear rather stochastic. Nevertheless, when calibrated adequately, the temperature improves the overall classification performance, as can be seen in Figures 3 and 5.
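Such a selection procedure amounts to a simple grid search over candidate temperatures. In the sketch below, `cv_score` is a placeholder callable standing in for the 10-fold cross-validation pipeline (its name is ours):

```python
def select_temperature(temperatures, cv_score):
    """Grid-search the distillation temperature.

    `cv_score` maps a temperature to a mean k-fold cross-validation
    accuracy. Returns the best temperature together with the full
    score table, so the fluctuation across temperatures can be inspected.
    """
    scores = {T: cv_score(T) for T in temperatures}
    best = max(scores, key=scores.get)
    return best, scores
```

A real pipeline would retrain the student at every temperature, which is why the search is expensive and why the paper reports only the best performer per model size.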
One of the most important benefits of a small neural network is its modest need for memory. Table 6 compares the model sizes in megabytes. Our final facial expression recognition model, MicroExpNet, takes less than 1MB to store, which is many times smaller than our teacher network (ResNet-50). In addition, MicroExpNet has far fewer parameters than the teacher.
Note that the results reported for PPDN are our own estimations, as they were not available. Moreover, since FN2EN is deployed with the Caffe framework, its model has a smaller memory requirement even though it has more parameters than PPDN.
Another important benefit of a small neural network is its speed. In order to measure speed, we ran each model repeatedly on a single input image and measured the average run time. Table 7 compares the elapsed times to process one image in milliseconds. According to the table, MicroExpNet achieves the best performance, classifying the facial expression in an image in under a millisecond on an Intel i7-7700HQ CPU. All of the students achieve speeds well above the requirements of real-time processing. Ultimately, compared to our teacher network ResNet-50, our final model MicroExpNet is many times faster on the Intel i7-7700HQ CPU, the GTX1050 GPU, and the Tesla K40 GPU.
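The averaging protocol described above can be sketched as a small timing harness (the warm-up call and run count are our assumptions, since the paper’s exact run count was not preserved):

```python
import time

def mean_latency_ms(model_fn, image, n_runs=100):
    """Average single-image inference latency in milliseconds.

    `model_fn` is any callable taking one image. One warm-up call is
    made first and excluded from the measurement, so one-time costs
    (caching, lazy initialization) do not skew the average.
    """
    model_fn(image)  # warm-up, not timed
    t0 = time.perf_counter()
    for _ in range(n_runs):
        model_fn(image)
    return (time.perf_counter() - t0) / n_runs * 1000.0
```

Using `time.perf_counter` matters here: it is a monotonic, high-resolution clock, so sub-millisecond latencies like MicroExpNet’s remain measurable.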
Note that the numbers reported for PPDN and FN2EN are obtained from our own experiments (as they were not available). We used a custom TensorFlow implementation for PPDN and the authors’ Caffe-based implementation for FN2EN.
We presented an extremely small (less than 1MB in size) and fast (1408 frames per second) model, called the MicroExpNet, for facial expression recognition (FER) from frontal face images.
From our experimental work, we have drawn the following conclusions. (1) Translation invariance – achieved via max-pooling – degrades performance, especially when the network is small. (2) The knowledge distillation method works very well for the FER problem. (3) The effect of knowledge distillation (compared to training from scratch) becomes more prominent as the network size decreases; if this effect generalizes to other problems and datasets (yet to be seen in future work), knowledge distillation could become a mainstream training method when the goal is to produce small networks. (4) The temperature hyperparameter in knowledge distillation should be tuned carefully for optimal performance; especially when the network is small, the final performance fluctuates with temperature.
We are curious about (1) how incorporating spatio-temporal information (i.e. sequence-based modeling) would change our results, and (2) how our model would perform on non-frontal face images. We leave these two items as future work.
Our code is available on GitHub.
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used for this research.
Recognizing facial expression: machine learning and application to spontaneous behavior. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), volume 2, pages 568–573. IEEE, 2005.
A decision-theoretic generalization of on-line learning and an application to boosting. In European Conference on Computational Learning Theory, pages 23–37. Springer, 1995.
Geometric feature-based facial expression recognition in image sequences using multi-class AdaBoost and support vector machines. Sensors, 13(6):7714–7734, 2013.
Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.
Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.