Automatic facial expression analysis is important for detecting, recognising and interpreting the emotional state underlying a given facial image. One of the most common descriptors of facial expressions is the set of Action Units (AUs, [ekman02]). The Facial Action Coding System (FACS, [ekman02]) defines Action Units as atomic, non-overlapping facial muscle actions that, when combined in different configurations, can describe any facial expression. There are 32 AUs in total. The Facial Action Coding System also establishes a six-point ordinal ranking of intensities which ranges from 0 to 5, with 0 denoting the absence of a specific Action Unit, and 5 referring to the maximum level of expressivity.
Action Units are in many ways inherently correlated: in time, as facial actions vary smoothly within a sequence; in their co-occurrence, as Action Units are often activated in certain meaningful combinations; and spatially, as they adhere to anatomically defined local and global geometric structure. In the field of automatic facial expression analysis, each of these correlations has given rise to a line of research, pursued either alone or in combination with the others. For example, co-occurrence correlation is exploited to perform joint prediction of multiple AUs, either through shared feature representations [zhang2014, zhu2014] or through methods that impose correlations among labels, usually by employing graphs [walecki16, walecki17]. Furthermore, a significant number of works attempt to exploit spatial correlations of AUs by extracting local representations in the facial regions where AUs are known to produce appearance changes, namely Regions of Interest (ROIs) [jaiswal2016, li2017b, li2017c]. Typically, spatially-aware approaches employ a two-stage pipeline where facial landmarks are first detected in order to define AU locations; then, local features for each AU are extracted, and adaptation or fusion mechanisms are introduced to jointly predict AU intensity levels. Our method significantly simplifies the aforementioned two-stage pipeline by localising AUs and estimating their intensity, while also modelling their inter-dependencies, in a single step.
In this paper, we first make the simple observation that AU recognition should be treated as a localisation problem where the task is to both localise and classify the Action Units. Motivated by this observation, we propose a new task that consists of jointly localising and estimating the intensity of Action Units, and we formulate this problem using Heatmap Regression. Heatmap Regression is arguably the most successful approach to landmark (keypoint) localisation [newell2016, bulat2017]. It boils down to a pixel-wise regression task that indicates the likelihood of a landmark being present at the corresponding spatial location. However, unlike landmarks, AUs can be present or absent in a given image, with their intensity spanning from 0 to 5. Hence, their amplitude modelling cannot be treated with the standard probabilistic approach of Heatmap Regression (i.e. with the heatmaps relating to the confidence of a detected landmark). To overcome this limitation, we extend Heatmap Regression to include maps that are modelled according to the corresponding AU intensity. In particular, motivated by the fact that Action Units produce appearance changes around the facial region where they are known to occur, we propose to model the size of a heatmap (i.e. its amplitude and extent) according to the intensity of a given Action Unit. Under this setting, our idea boils down to a simple yet efficient Heatmap Regression approach. This approach is efficient in that it not only merges the co-occurrence and the spatial correlation of AUs in a single task, but also bypasses the complexities associated with the typical two-stage pipeline, consisting of registration followed by local feature extraction and classification, often found in AU-related works.
By jointly tackling the problem of AU intensity estimation and localisation using Heatmap Regression, one could choose to dismiss the commonly required step of face alignment. Rather than doing so, in this work we further propose to integrate this task into our system using transfer learning. In particular, on the one hand, it is known that AU annotations are scarce and hence it is difficult to train a system for AU recognition that can work well across all types of facial variability (e.g. facial pose, illumination, occlusion). On the other hand, there is a large pool of facial landmark annotations available for all types of facial variability. Since heatmap regression can be used to tackle both tasks, we investigate how and to what extent one can transfer knowledge from a network trained for landmark localisation into a network for AU localisation and intensity estimation. We explore several alternatives for transfer learning through a) fine-tuning, b) adaptation layers, c) attention maps, and d) reparameterisation (see Fig. 2). We show that our approach allows for robust AU modelling across a wide range of poses and illumination conditions.
This paper reformulates our previous manuscript on Action Unit intensity estimation using Heatmap Regression [sanchez2018] and devises a more robust approach using transfer learning. We show that this approach benefits from the robustness of the facial features given by a similar network trained for the task of facial landmark localisation, yielding state of the art results in three different datasets (FERA 2015, DISFA, FERA 2017). The contributions of this paper are as follows:
We propose to reformulate the problem of Action Unit intensity estimation in a way that absorbs their localisation. In particular, we are the first to propose to jointly localise and estimate the intensity of Action Units using Heatmap Regression. The use of variable-size heatmaps allows the joint modelling of AU localisation and intensity estimation in a simple yet efficient way.
We propose the use of transfer learning to exploit the knowledge of a network trained for a similar task, that of face alignment, on a large-scale set of images spanning a wide variety of poses, expressions, and illumination, conditions that are often hard to find in AU datasets. To this end, we explore several variants and identify an incremental learning approach which significantly reduces the number of weights to be trained, and increases robustness against different views.
We extensively validate our approach in three challenging datasets, namely BP4D [valstar2015], DISFA [mavadati2013], and FERA 2017 [valstar2017], yielding state of the art results with an approach that requires little complexity.
2 Related Work
We firstly review the closely related work in the area of facial Action Unit detection and intensity estimation, and then we provide some insight into existing approaches to transfer learning.
2.1 Action Unit detection and intensity estimation
Facial Action Unit modelling is a longstanding problem in Computer Vision and Affective Computing, which is often split into works that either target their detection, i.e. estimating whether an AU is present or not in a given facial image [almaev13, almaev2015, Baltrusaitis2015, chu13, chu2017, ertugrul2019, he2017, jaiswal2016, jiang12, li2017b, li2017c, Niu_2019_CVPR, shao2018, wu16, yang2019], or the more challenging task of estimating their intensity [eleftheriadis2016, jeni13, rudovic12, rudovic13a, rudovic13b, rudovic2015, sandbach13, tran2017, walecki16, walecki17], given as a value that ranges from 0 (i.e. absence of an AU) to 5 (maximum intensity). Regardless of the task, many methods partially share the underlying methodology: some works attempt to leverage co-occurrence and static dependencies among AUs [walecki17, ertugrul2019], some exploit their geometric structure [zhao2016, li2017c] or their temporal correlation [jaiswal2016], and others combine different means of correlation [ming2015, li2017b, chu2019, jaiswal2016].
One of the most exploited means of AU correlation refers to their spatial structure. Action Units have a geometric structure, i.e. they are spatially correlated to specific facial regions. Early works on AU modelling targeted the design of handcrafted features that can inform about local appearance variations that are ultimately related to each AU [almaev13]. With the development of Convolutional Neural Nets this design was no longer needed, and other techniques to extract local features appeared. A simple approach to extracting local features is that of [jaiswal2016], where the face is first registered according to some landmark detection, and each part is cropped independently. Then, CNN-based features can be extracted independently. In a similar fashion, [zhao2016] proposed to incorporate an intermediate region-specific layer into a CNN to extract separate features at different facial sub-areas, while [li2017c] added two extra layers to a pre-trained CNN, coined the enhancing and the cropping layer, to force the network to pay more attention to spatial regions with high AU correlation. With such locally-based modelling, [chu2017] proposed to introduce a temporal model in a hybrid manner, to jointly exploit the local and temporal correlation of AUs. Building on top of region layers, [li2017b] introduced the CNN-based Region of Interest (ROI) detection, which was then incorporated into the local modelling of AUs. In particular, [li2017b] proposed independent ROI networks to learn separate filters for different facial regions, which were later used to feed a fully-connected LSTM network to also exploit the temporal correlation. More recently, [tran2017] proposed to model AU regions by incorporating a Variational Autoencoder framework (VAE, [kingma2013auto]), and [yang2019] combined a 2D with a 3D CNN for frame-level AU detection in an attempt to leverage spatio-temporal dependencies. All of these works require, however, a good pre-processing step that consists of registering the input image according to some detected facial landmarks. In this paper, we observe that both tasks can be performed together in a rather simple way.
While the methodology proposed in this paper is completely new, some works have attempted to jointly detect facial landmarks and perform AU modelling in a unified framework. In an early work, [wu16] proposed an iterative framework whereby a cascaded regression approach was used to detect facial landmarks, and where a Restricted Boltzmann Machine was used to detect the Action Units. Recently, [shao2018] proposed to jointly perform facial landmark localisation and AU intensity estimation through a hierarchical, multi-scale region learning pipeline that employs attention map refinement to ease the learning process. In [Niu_2019_CVPR] the landmarks are instead used to regularise the features extracted by a CNN, towards driving these to be person-specific. Both [Niu_2019_CVPR] and [shao2018] observe that landmarks carry important information, either to regularise the features or to generate attention maps. In a more recent approach, [shao2019] proposed the use of attention maps that are landmark-free, showing how locally-based features can better model the AU occurrence. In this paper, we propose a computationally less expensive method that can deal with estimating the AU intensity by first re-formulating the problem in a way that includes their localisation, and then by incorporating the rich features acquired by a network trained to detect facial landmarks. Our approach offers a significantly less complex yet efficient method that delivers state-of-the-art results in the more challenging problem of estimating the intensity of Action Units.
Finally, it is worth mentioning the recent appearance of methods that work in a weakly supervised or even unsupervised manner [li2019, Zhang_2018_CVPR, Zhang_2018_CVPR_BN, wang_2019, Zhang_2018_CVPR_weakly, Zhang_2019_ICCV, niu2019multi]. While these works have shown some interesting advances, they still lag behind the performance achieved by fully supervised methods. Although it is out of the scope of this paper, the low computational complexity of our method suggests that it could also be a good approach for learning with scarce labels. We leave this problem for future work.
2.2 Transfer Learning
The goal of transfer learning, often found in the literature as incremental learning, is to adapt the knowledge acquired for a strong task, for which a large pool of labels is available, to learn a set of potentially unrelated tasks [rosenfeld2018incremental, rebuffi2017learning, rebuffi2018efficient]. One of the simplest approaches to transfer learning consists of fine-tuning [huh2016makes], often used in face analysis works. For example, it is common practice to initialise a network with the weights of the pre-trained VGG-Face2 [cao2018vggface2], trained with thousands of images for the task of face recognition. Some examples of this can be found in [li2017c, li2017b, Niu_2019_CVPR]. Other works have proposed to add knowledge incrementally, i.e. by extracting features from a model trained on a specific task and using them to train another model for a new task. This method advances over fine-tuning in the sense that a model is trained for a new task without forgetting old representations. Examples are progressive networks [rusu2016progressive] and adaptive filters [rebuffi2017learning, rebuffi2018efficient].
In this paper, we explore different alternatives for transfer feature learning, as well as an approach based on network reparameterisation, which consists of applying a transformation over the existing weights of the pre-trained network. Network reparameterisation is often found in incremental learning approaches, where new tasks are added sequentially to a strong core. For instance, [kossaifi2019t] applies a tensor decomposition that allows each of the tensor dimensions to adapt to a new task. A simple approach that has been shown to perform well in practice consists of projecting the given weights in a convolutional network with a simple projection matrix, learned over the new task [rosenfeld2018incremental]. This approach was also recently applied to the unsupervised adaptation of object landmark detectors [sanchez2019]. We observe that we can use such a framework in a supervised manner to transfer the knowledge of a face alignment network to our proposed AU intensity estimation network, and that this approach offers significant computational benefits whilst yielding state of the art results.
3 Joint AU localisation and intensity estimation
In this Section, we present a novel approach to AU intensity estimation using heatmap regression. The main novelty of this approach lies in the fact that heatmap regression allows joint localisation and intensity estimation of the AUs, as the machine learning task gains a spatial aspect. By using an encoder-decoder approach, we are able to gather features at different spatial levels to yield a dense pixel-wise prediction, facilitating inference, and allowing the network to learn both the spatial relation and co-occurrence of AUs. Our approach differs from previous works that apply multi-task learning through joint estimation of AU intensities and landmark localisation, in the sense that the AUs' spatial relation and co-occurrence are inherently embedded in the heatmap regression method. In addition, Heatmap Regression does not make use of attention maps to predict the score; it just regresses the heatmaps. Our approach offers significant advances: it yields state of the art results without requiring any complex pre-processing or face alignment, thus effectively reducing the computational cost of inference.
3.1 Problem formulation
Our goal is to train a network that can predict the intensity of a set of AUs. To do so, we propose to reformulate the training process in a way that makes the network jointly detect the location of the AUs, as well as their intensity. This way, we can formulate the joint problem using Heatmap Regression, thus relaxing the training process. This relaxation comes from the fact that Heatmap Regression boils down to local pixel-wise regression, making the network penalise local errors more efficiently, rather than in a global way, as is the case with direct regression. While some works use the facial landmarks to perform multi-task learning or to generate attention maps, we want to formulate a joint training, where the set of images and AUs is augmented by the locations of the latter. Formally, let be a set of RGB images, for which the corresponding AU intensities are known, with the number of annotated AUs in the dataset. Our first goal is to extend the training annotations with a set of locations , where each corresponds to the coordinates of the locations where the corresponding AU is known to produce appearance changes when it is activated. Each AU will often be placed at two locations, symmetric with respect to the facial axis of symmetry. Some of the AUs will also produce changes at a third location, placed along the symmetry axis. Given that there are no annotations available to place the ground-truth locations, we place them using a priori knowledge about their location and with respect to the facial landmarks used for landmark localisation. The positions of the AUs as derived from the given landmarks are shown in Fig. 3. The exact correspondences are listed in Table I. After having generated the AU positions, our training set is now defined as .
| AU 4 | 21 | 22 | 21, 22, 27 |
3.2 Heatmap Regression
Once the training set has been augmented with the AU locations, we can define the training procedure. Inspired by the success of Heatmap Regression for facial landmark localisation, we propose to formulate the training problem in a similar way. However, locating the AUs alone would not solve the problem of estimating their intensity, and thus we need to accommodate the latter into the localisation problem. To do so, we propose to attach each AU to a corresponding heatmap. Each heatmap will contain one, two or three Gaussians, according to the number of points defined in (see Table I for all different correspondences). Following existing works in Heatmap Regression, we will work with heatmaps that are a quarter of the size of the input image, i.e. . Each Gaussian is defined as a 2D map having the same size as that of the input image, where the value at each position is given by:
with the intensity of the -th AU for image . The heatmap for AU is thus defined as . Under this representation, the heatmaps form an tensor, with the number of AUs, and and the width and height of the images, respectively ( in our setting).
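As a rough illustration of the intensity-dependent heatmaps described above, the following sketch builds a target map for a single AU. It is not the exact Gaussian used in this work: it assumes, purely for illustration, that the amplitude equals the AU intensity, that the standard deviation grows linearly with it (`sigma_per_level` is a hypothetical parameter), that overlapping Gaussians are merged with a pixel-wise maximum, and that an intensity of 0 yields an empty map.

```python
import math

def au_heatmap(centres, intensity, height, width, sigma_per_level=1.0):
    """Target map for one AU: one Gaussian per centre, with peak equal to `intensity`.

    Assumptions (not taken from the paper): amplitude = intensity,
    sigma = sigma_per_level * intensity, Gaussians merged by pixel-wise max.
    """
    heatmap = [[0.0] * width for _ in range(height)]
    if intensity <= 0:
        return heatmap  # absent AU: empty map
    sigma = sigma_per_level * intensity
    for cx, cy in centres:
        for y in range(height):
            for x in range(width):
                d2 = (x - cx) ** 2 + (y - cy) ** 2
                value = intensity * math.exp(-d2 / (2.0 * sigma ** 2))
                heatmap[y][x] = max(heatmap[y][x], value)
    return heatmap

# Two symmetric (x, y) locations for an AU of intensity 3 on a 32x32 map.
hm = au_heatmap([(10, 12), (21, 12)], intensity=3, height=32, width=32)
peak = max(max(row) for row in hm)
empty = au_heatmap([(10, 12)], intensity=0, height=32, width=32)
```

Under these assumptions both the height and the spatial extent of each bump grow with the intensity, which is the property the text relies on.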
Our goal is then to train a network that, given an image , regresses a set of heatmaps . The network is parametrised by the parameters . The learning is formulated in a standard heatmap regression fashion, i.e. as finding the weights that minimise the squared loss between the output and the ground-truth maps:
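The squared-loss objective described above can be sketched as follows. The notation is hypothetical and introduced only for this illustration: $\Phi(\cdot;\theta)$ stands for the network with parameters $\theta$, $\mathbf{I}_i$ for the $i$-th of $N$ training images, and $\mathbf{H}_i$ for its ground-truth heatmaps.

```latex
\theta^{*} = \arg\min_{\theta} \sum_{i=1}^{N} \left\| \Phi(\mathbf{I}_i; \theta) - \mathbf{H}_i \right\|_2^2
```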
For the sake of clarity, we introduce the network description and training herein, as it will constitute what we refer to as the backbone throughout the next section. In particular, the network follows a similar architecture as that of the Face Alignment Network (FAN, [bulat2017]), with small differences. In our setting, the network receives an input image , and first applies a downsampling convolutional filter to it, halving its resolution and increasing the number of channels to . Then, a set of Convolutional Blocks (referred to as [bulat1027b], see Fig. 4), are used to bring the number of channels to and the spatial resolution to . We will refer to these layers as , , and , respectively. Then, the features after the layer are fed into a single Hourglass network [newell2016] (Fig. 4), which is an encoder-decoder network, composed of several and skip connections that aggregate the features at different scales. While [newell2016, bulat2016] used a set of stacked Hourglass with -channel , we opt for a lighter version that consists of a single Hourglass with of channels. The output of the Hourglass is finally passed through an extra and a convolutional layer to bring the number of channels to the target described above, i.e. it outputs the desired tensor. With such a lightweight model, the network comprises only parameters. We empirically validate that such a simple network yields competitive results whilst being computationally efficient.
Inference in this setting is straightforward. To get the AU intensities, one simply needs to crop the face image according to some face detection and forward it to the trained network. The network returns a set of heatmaps, from which the AU intensities can be inferred by just finding the maximum of each map. Note that our method does not require registering the face image before passing it to the network.
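The read-out step described above can be sketched in a few lines. The helper name `intensities_from_heatmaps` and the clamping to the range 0 to 5 are assumptions added for illustration; the heatmaps are plain nested lists standing in for the network output.

```python
def intensities_from_heatmaps(heatmaps, max_intensity=5.0):
    """Read one intensity per AU as the maximum value of its heatmap.

    Clamping to [0, max_intensity] is an assumption, not part of the paper.
    """
    scores = []
    for hm in heatmaps:
        peak = max(max(row) for row in hm)
        scores.append(min(max(peak, 0.0), max_intensity))
    return scores

# Two toy 2x3 maps standing in for the heatmaps predicted for two AUs.
predicted = [
    [[0.0, 0.2, 0.1], [0.3, 2.7, 0.4]],
    [[0.0, -0.1, 0.0], [0.05, 0.0, 0.0]],
]
scores = intensities_from_heatmaps(predicted)
```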
4 Incremental Heatmap Regression for AU localisation
Using heatmap regression for AU recognition and localisation allows us to make use of the great progress that we have witnessed recently for the problem of facial landmark localisation. More specifically, we propose to transfer knowledge from a network trained for face alignment with hundreds of thousands of in-the-wild images spanning a large set of poses, expressions, and illumination into the proposed network for AU intensity estimation and localisation. This in turn allows us to overcome, to some extent, the limitations of existing AU datasets related to facial variability (e.g. number of subjects, facial pose, occlusions etc.).
In contrast to the previous works that have attempted to exploit the correlation between facial expressions and localisation [wu16, shao2018] through a multi-task learning framework, we propose to use transfer learning to learn AU intensities from rich facial features retrieved from a pre-trained face alignment model.
The first and simplest variant of the proposed transfer learning approach consists of fine-tuning the pre-trained network for the target task. Besides fine-tuning, we propose and explore three different alternatives to accomplish the task of transfer learning, which are described below. First, we briefly describe the architecture and pre-training of the face alignment network (Sec. 4.1) and then, we explain how we fine-tune an AU estimation model from a face alignment one (Sec. 4.2). Finally, we describe our three alternatives, namely that of adaptation layers (Sec. 4.3), attention maps (Sec. 4.4) and network reparametrisation (Sec. 4.5). In what follows, we will refer to the face alignment network as FAN, whereas the corresponding part of the network targeted with the AU heatmap regression will be referred to, for simplicity, as AU-Net.
4.1 FAN pre-training
Our in-house implementation of FAN follows that of the AU-Net described in Sec. 3.3. The output of the network is a set of heatmaps, each corresponding to a single landmark. The FAN is trained for epochs on the LS3D-W [bulat2017] training set, which is the largest and most challenging facial landmark dataset to date (approximately images). The network yields a validation accuracy of point-to-point Euclidean distance, on par with that reported in [bulat2017].
4.2 Method 1: Fine-tuning
Our first and simplest approach to transfer learning consists of fine-tuning the pre-trained network. In particular, we observe that one can depart from the face alignment network and fine-tune it for the proposed AU localisation and intensity estimation by using a small learning rate. This will allow the network to slightly move from a very similar problem (that of facial landmark localisation) to our target task. We experimentally validate that, in line with existing works that suggest fine-tuning as a strong adaptation mechanism, such a simple technique already improves performance over training the network from scratch.
4.3 Method 2: Adaptation layers
Our second approach to incremental learning consists of transferring the features generated from the Face Alignment Network (FAN) to a second network, targeted with producing the Action Unit heatmaps. In this paper, we conjecture that the early features produced by a strong FAN provide rich facial representations, and we thus propose to inject this knowledge into a second learnable network. In addition to the early features, we also inject the produced heatmaps, as these are nothing but a geometric representation of the face. The heatmaps consolidate the spatial configuration of all landmarks, and hence encode information regarding location, pose, shape and expression of a face in an image. Finally, given that they are probabilistic maps, they provide both coordinate and confidence information which can be useful for understanding spatial context and modelling part relationships. Overall, we posit that the generated landmark heatmaps encode rich facial geometry representations that could operate as an attention mechanism that drives focus on regions of the face that are very informative for the task of AU prediction. Hence, it is reasonable to attempt to incorporate this rich facial geometry information from face alignment into the new task of AU estimation. We study the impact of these heatmaps on the task of AU intensity estimation through several ablation studies in Sec. 6.
The AU network has a structure similar to that of the Hourglass described in Sec. 3.3. However, rather than using as input the facial image used to extract the facial landmarks, we use as input a combination of the features produced after the block of the FAN, and the produced heatmaps ().
In order to inject the early features and produced heatmaps into the AU network, we use an adaptation layer, as depicted in Fig. 6, a). This adaptation layer is composed of a branch that processes the early features coming from the FAN, and another branch that processes the generated landmark heatmaps. Let be the output heatmaps corresponding to the facial landmarks, and let be the features from the layer of the FAN. In order to integrate these two tensors, we apply to each of them a filter, followed by a Batch Normalisation layer and a ReLU activation layer. The filters produce, for both cases, a tensor. The outputs of both branches are then added and sent to the AU network. The AU network then receives the combined features, and passes them through an Hourglass network with all filters set to have a kernel size of , rather than the of the original FAN network. The output of this network is then a set of heatmaps. The training is done through the classical heatmap regression described in Sec. 3. With the filters, and the removal of the first convolutional blocks, the new network comprises only parameters.
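The two-branch adaptation layer described above can be summarised in a single expression. The notation is hypothetical and introduced only for this sketch: $\mathbf{H}$ denotes the landmark heatmaps produced by the FAN, $\mathbf{F}$ the early FAN features, $g_H$ and $g_F$ the convolutional filters of the two branches, and $\mathbf{Z}$ the tensor passed to the AU network.

```latex
\mathbf{Z} = \mathrm{ReLU}\big(\mathrm{BN}(g_H(\mathbf{H}))\big) + \mathrm{ReLU}\big(\mathrm{BN}(g_F(\mathbf{F}))\big)
```

For the sum to be defined, both branches must produce tensors with the same number of channels and the same spatial resolution, as stated in the text.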
4.4 Method 3: Attention maps
A different alternative consists of generating attention maps from the generated heatmaps. This approach is depicted in Fig. 6, b). In particular, it is important to recall, from Sec. 3, that at training time the target heatmaps, in the original setting, are located according to the ground-truth landmarks. In other words, there is a clear relation between the facial landmarks and the location of the Action Unit heatmaps. Thus, if we are to transfer the knowledge from the FAN network to the AU network, it is natural to explore the use of attention maps, generated from the heatmaps produced for the facial landmarks.
In this setting, we use the corresponding heatmaps from the FAN, and extract the corresponding landmarks by applying an argmax operator. Then, using the method described in Sec. 3.2, we generate a new set of heatmaps . The heatmaps are then forwarded to the corresponding branch described above. However, it is worth noting that we do not have the ground-truth labels to produce a set of heatmaps that vary according to the Action Unit intensities. Instead, we are interested in generating an attention map, i.e. a heatmap that locates the Action Units, regardless of whether a given AU is actually present or not. To this end, the heatmaps in are generated using a fixed intensity .
Note that the attention maps are generated on the fly, i.e. during training the heatmaps produced by the FAN are used to generate the attention maps. After having acquired the attention maps, we follow the process described in Sec. 4.3. The new network is again parameters.
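The on-the-fly generation of attention maps can be sketched as below. This is a simplified illustration under stated assumptions: the landmark positions are used directly as centres (the mapping from landmarks to AU locations is omitted), and the values `fixed_intensity=5.0` and `sigma=2.0` are hypothetical placeholders, not the settings used in this work.

```python
import math

def argmax_2d(heatmap):
    """Return the (x, y) position of the largest value in a 2D list."""
    best, best_xy = float("-inf"), (0, 0)
    for y, row in enumerate(heatmap):
        for x, v in enumerate(row):
            if v > best:
                best, best_xy = v, (x, y)
    return best_xy

def attention_map(centres, height, width, fixed_intensity=5.0, sigma=2.0):
    """Gaussian bumps of a fixed amplitude at each centre (pixel-wise max)."""
    out = [[0.0] * width for _ in range(height)]
    for cx, cy in centres:
        for y in range(height):
            for x in range(width):
                d2 = (x - cx) ** 2 + (y - cy) ** 2
                value = fixed_intensity * math.exp(-d2 / (2.0 * sigma ** 2))
                out[y][x] = max(out[y][x], value)
    return out

def one_hot_map(x, y, size=8):
    """Toy landmark heatmap with a single peak at (x, y)."""
    m = [[0.0] * size for _ in range(size)]
    m[y][x] = 1.0
    return m

# Stand-ins for two landmark heatmaps predicted by the frozen FAN.
landmark_heatmaps = [one_hot_map(2, 3), one_hot_map(5, 3)]
landmarks = [argmax_2d(h) for h in landmark_heatmaps]
att = attention_map(landmarks, height=8, width=8)
```

Because the amplitude is fixed, the resulting map only says where an AU would appear, not whether it is active.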
4.5 Method 4: Reparametrisation of FAN
A different alternative consists of using the reparametrisation approach of [rosenfeld2017, sanchez2019]. A visual representation is depicted in Fig. 2, iv). In particular, we depart from the FAN network , with parameters , trained to detect facial landmarks, as described in Sec. 4.1. We now wish to adapt the model for the task of AU intensity estimation using heatmap regression, to yield a new set of weights , so that would produce the desired AU heatmaps. As previously mentioned, is chosen to be an Hourglass, which is uniquely parametrised by convolutional and batch normalisation layers. The network is modified to return heatmaps, rather than the heatmaps of facial landmarks, i.e. replicates the same structure for all layers but the very last one. Under this setting, the adaptation method proposed in [sanchez2019] boils down to reparameterising the convolutional layers by learning a series of weights that are projected onto the original filters, to yield a new set of weights for the target task. Let us denote the weights of the convolutional layer of the original network as , with and the number of input and output channels, respectively, and the kernel size. Then, following [sanchez2019] we use the following reparametrisation of the weights :
where is the learnable projection matrix, and denotes the -mode product of tensors. The set of weights are of the same size as those of , and can thus be substituted into the original network. Then, the learning is formulated in a heatmap regression fashion, although now the weights remain frozen, and only the weights are to be learned. This approach, besides advancing the field of unsupervised adaptation, offers significant computational savings, as now the learnable weights have only parameters, contrary to the original set of parameters. Considering that the majority of filters in the Hourglass are of , the computational saving is, for , about times the number of parameters. We observe that, while the original FAN network comprises parameters, the new set of learnable weights reduces to only parameters. This method allows the knowledge of the pre-trained network to be transferred efficiently to the target one.
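To make the idea of projecting frozen filters concrete, the sketch below applies a square projection matrix along the output-channel mode of a flattened convolutional weight tensor. The choice of mode, the square shape of the projection, and the toy sizes are assumptions made for illustration; they are not taken from the formulation above.

```python
def mode1_product(weights, projection):
    """Mix frozen filters along the output-channel mode.

    weights:    C_out rows, each a flat list of C_in * k * k values (kept frozen).
    projection: C_out x C_out matrix (the only learnable part in this sketch).
    """
    c_out = len(weights)
    n_cols = len(weights[0])
    return [
        [sum(projection[o][p] * weights[p][c] for p in range(c_out)) for c in range(n_cols)]
        for o in range(c_out)
    ]

c_out, c_in, k = 2, 2, 3
frozen = [[float(o + 1)] * (c_in * k * k) for o in range(c_out)]  # toy pre-trained filters
identity = [[1.0, 0.0], [0.0, 1.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]

same = mode1_product(frozen, identity)  # identity projection leaves the filters unchanged
swapped = mode1_product(frozen, swap)   # this projection swaps the two output filters

learnable = c_out * c_out               # parameters in the projection matrix
original = c_out * c_in * k * k         # parameters in the frozen layer
```

Initialising the projection to the identity reproduces the pre-trained layer exactly, and in this toy configuration the projection holds 4 parameters against 36 in the layer it adapts.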
5 Training and implementation details
In the following Sections, we evaluate each of the discussed alternatives for transfer learning, and compare them to training a model from scratch, as presented in Sec. 3. We then compare our method against existing works reporting AU intensity estimation. Note that, while we show qualitatively how our method is capable of localising the Action Units across a wide span of poses and expressions, we are primarily interested in demonstrating the superiority of our method at the task of estimating the intensity of AUs, and we are not interested in the precise localisation error.
We evaluate our models on three benchmark databases - FERA 2015 [valstar2015], DISFA [mavadati2013] and FERA 2017 [valstar2017]. All these datasets contain a set of videos each showing an individual responding to emotion-elicitation tasks.
FERA2015 [valstar2015]: The corpus of the FERA2015 challenge is based on the BP4D dataset [zhang2014], which is composed of subjects performing tasks, plus an extra test set of subjects, performing additional tasks. The original corpus was released as part of the training and development partitions of the FERA 2015 challenge, whereas the test set, which is not publicly available, was used to rank participants. The training and development partitions are split into and subjects. In total, there are videos corresponding to the training/validation partitions, and videos corresponding to the test partition. In this paper, we use the official partitions, and report results on both the validation and test set. Given that the test set is not publicly available, we compare our results with those of the challenge winners. All partitions are annotated with Action Units intensity levels. The training set comprises frames, whereas the validation and test set contain and frames, respectively.
DISFA [mavadati2013]: The DISFA dataset contains video recordings of 27 subjects watching YouTube videos. Each clip is length, and has been manually annotated with the intensity of AU. Given that no official partitions are defined for DISFA, we follow existing works and perform a three-fold cross-validation evaluation, where we train a model for each fold, and we report the ICC measured on the aggregated predictions for the whole dataset, each returned by its corresponding model.
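The three-fold protocol can be sketched as follows. This is an illustrative sketch only: it assumes that the folds are formed at the subject level, so that no subject appears in both the training and test part of a fold, and the function name and the interleaved assignment of subjects to folds are hypothetical.

```python
def three_fold_subject_split(subject_ids, n_folds=3):
    """Split subjects into n_folds disjoint test folds.

    Returns a list of (train_subjects, test_subjects) pairs. Each subject is
    held out exactly once, so the per-fold predictions can be aggregated
    into one set of predictions covering the whole dataset.
    """
    folds = [subject_ids[i::n_folds] for i in range(n_folds)]
    splits = []
    for i, test in enumerate(folds):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        splits.append((train, test))
    return splits


# 27 subjects, as in DISFA; the integer identifiers are placeholders.
splits = three_fold_subject_split(list(range(27)))
for train, test in splits:
    print(len(train), len(test))
```

The metric is then computed once over the concatenated held-out predictions, rather than averaged over the per-fold scores.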
FERA2017 [valstar2017]: The FERA2017 corpus extends that of FERA2015 by augmenting the existing videos with 3D models that are synthesised in different views. FERA2017 incorporates in its official training and development partitions the subjects from the test set of the FERA2015 challenge. An additional set of videos, emanating from the BP4D+ dataset [zhang2016], was added as the official test set. This dataset poses a great challenge in multi-view facial expression recognition, which we show can be efficiently performed with a computationally simple model. The AU intensity annotations were extended to cover a total number of . For FERA 2017, we again use the official partitions and report the predictions against the corresponding annotations. FERA 2017 contains roughly frames for training, frames for validation, and for testing.
5.2 Evaluation metrics
We use two standard measures to evaluate AU intensity estimation models. The first is the intra-class correlation (ICC(3,1), [shrout79]), commonly used in behavioural sciences to measure agreement between annotators, and used to rank participants in the FERA challenges. The second is the mean squared error (MSE), mainly used for prediction problems.
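Both measures can be computed from paired predictions and labels. The sketch below treats the model and the annotator as the two "raters" of ICC(3,1) and uses the standard two-way ANOVA decomposition; the function names are ours and the toy labels are hypothetical.

```python
def mse(pred, target):
    """Mean squared error between predictions and labels."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def icc_3_1(pred, target):
    """ICC(3,1) with two raters: the model and the ground-truth annotation.

    ICC(3,1) = (BMS - EMS) / (BMS + (k - 1) * EMS), where BMS is the
    between-target mean square and EMS the residual mean square.
    """
    n, k = len(pred), 2
    rows = list(zip(pred, target))
    grand = (sum(pred) + sum(target)) / (n * k)
    row_means = [(p + t) / k for p, t in rows]
    col_means = [sum(pred) / n, sum(target) / n]

    ss_total = sum((v - grand) ** 2 for row in rows for v in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_err = ss_total - ss_rows - ss_cols

    bms = ss_rows / (n - 1)
    ems = ss_err / ((n - 1) * (k - 1))
    denom = bms + (k - 1) * ems
    # Undefined when predictions and labels are all one constant value.
    return float("nan") if denom == 0 else (bms - ems) / denom


labels = [0, 1, 2, 3, 4, 5]             # toy intensity labels
shifted = [value + 1 for value in labels]
print(icc_3_1(labels, labels), mse(labels, labels))
print(icc_3_1(shifted, labels), mse(shifted, labels))
```

The second call illustrates why the two measures are complementary: ICC(3,1) measures consistency and is insensitive to a constant offset, whereas the MSE penalises it.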
All experiments are carried out using the PyTorch library for Python [paszke2017]. The adaptation layers along with all versions of the AU estimation network (AU-NET) are trained from scratch with the Adam optimiser [kingma2014] and batch size . The weight decay is set to and momentum to . We additionally use a cosine annealing scheduler with step . Note that during training, FAN weights remain frozen. The ground-truth target heatmaps are generated according to the method described in Sec. 3. For BP4D and DISFA, we use the landmarks extracted from the publicly available code of iCCR [sanchez2016, sanchez2017], whereas for FERA2017 we use the official implementation of FAN [bulat2017]. These landmarks are used to define the heatmaps for training. Note that the landmarks are not needed at test time. The facial images are then tightly cropped to resolution to be passed through the corresponding networks. In addition, we use random augmentation, consisting of flipping, rotation (from ° to °), colour jittering, scale noise (from to ) and random occlusion. In order to ensure a fair comparison, we re-implemented and re-evaluated the Heatmap Regression model proposed in Sec. 3 and [sanchez2018]. Our new results account for the stronger augmentation and training strategy applied herein.
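For reference, the shape of a cosine annealing schedule can be sketched in closed form, which is the same expression documented for PyTorch's `CosineAnnealingLR` scheduler. The learning rates and the number of steps below are hypothetical and are not the values used in our experiments.

```python
import math


def cosine_annealing_lr(step, total_steps, lr_max, lr_min=0.0):
    """Learning rate after `step` updates of a cosine annealing schedule."""
    cosine = math.cos(math.pi * step / total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cosine)


# Hypothetical settings, for illustration only.
LR_MAX, TOTAL_STEPS = 1e-3, 100
for step in (0, 25, 50, 75, 100):
    print(step, cosine_annealing_lr(step, TOTAL_STEPS, LR_MAX))
```

The rate starts at its maximum, passes through the midpoint halfway through the schedule, and decays smoothly to the minimum at the final step.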
6 In-house evaluation
In this Section we analyse and evaluate all our proposed approaches for AU prediction. To do so, we experimentally evaluate the performance of all our methods using the ICC score and the Mean Squared Error on all three aforementioned datasets (FERA2015, FERA2017 and DISFA). Results on FERA 2015 and DISFA are shown in Table II, while results on FERA 2017 are shown in Table III. To further allow for a comprehensive evaluation of the strengths and weaknesses of each method, we also include a thorough evaluation of their complexity, including the capacity of each model, the number of floating point operations (FLOPs), and the average time per forward pass. The summary of complexity is given in Table V, along with the performance each method reports on the validation set of BP4D, often used as the reference benchmark for comparison in existing works. In addition to our proposed methods, we include a strong baseline based on a ResNet-18 [he2016], which is a rather deep network of approximately parameters. For the task of AU intensity estimation, we modify the last layer to generate predictions that match the number of Action Units. Then, we simply regress AU intensity levels. All models are trained under the same training configuration, i.e. the same learning rate, optimiser and augmentation.
6.1 Heatmap regression vs. regression
We approach the task of AU intensity estimation from a geometric perspective. We train our models in a Heatmap Regression fashion, which allows us to jointly localise the AUs and estimate their intensity. This approach is simple and straightforward: an image is passed forward through the AU-net, which generates a set of heatmaps from which simple retrieval of the maximum value gives the AU predictions. To evaluate the impact of heatmap regression, we train a model to simply regress AU intensity levels rather than regressing heatmaps. This model is the ResNet-18 that also serves as our baseline. As shown in Table II and Table III, Heatmap Regression significantly outperforms ResNet-18 both in terms of ICC score and Mean Squared Error on all three datasets. Notably, our method achieves better results than direct regression with a model of much lower capacity.
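The retrieval step can be illustrated with a minimal sketch. It assumes a Gaussian target whose amplitude equals the AU intensity and whose spread grows linearly with it; this particular mapping, the function names and the toy sizes are hypothetical, and the exact construction used in our experiments is the one described in Sec. 3.

```python
import math


def make_heatmap(height, width, cx, cy, intensity, base_sigma=1.0):
    """Gaussian heatmap centred at (cx, cy), scaled by the AU intensity."""
    if intensity == 0:
        # An absent AU yields an empty heatmap.
        return [[0.0] * width for _ in range(height)]
    sigma = base_sigma * intensity  # assumed: spread grows with intensity
    return [
        [intensity * math.exp(-((x - cx) ** 2 + (y - cy) ** 2)
                              / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]


def decode_heatmap(heatmap):
    """Return (peak value, (x, y) of the peak) for one AU heatmap."""
    best_value, best_xy = heatmap[0][0], (0, 0)
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > best_value:
                best_value, best_xy = value, (x, y)
    return best_value, best_xy


target = make_heatmap(16, 16, cx=5, cy=9, intensity=3)
estimated_intensity, location = decode_heatmap(target)
print(estimated_intensity, location)
```

In this sketch a single maximum lookup returns both quantities of interest: the peak value is read as the intensity and its position as the AU location.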
| Num. of parameters | FLOPs | secs/img | FERA 2015 | DISFA | FERA 2017 |
6.2 Heatmap regression vs transfer learning
We extend our approach of AU intensity estimation through Heatmap Regression by proposing methods that absorb the knowledge of a pre-existing network for face alignment. Thus, we further investigate whether our four different methods that leverage information from FAN, i.e. fine-tuning, adaptation layers, reparametrisation and attention maps, benefit the task of Heatmap Regression for AU estimation. To evaluate their impact, we first train all methods under the very same training scenario, i.e. the same data augmentation and the same training configuration, and test the performance of each in terms of ICC and MSE. Then, we further evaluate their computational requirements with reference to the results in Table V. We refer to the Heatmap Regression method presented in Sec. 3 as trained from scratch, and to each of the transfer learning methods by their corresponding technique.
As shown in Table II and Table III, transfer learning improves over training from scratch on most of the datasets in terms of both ICC and MSE. For FERA 2015, all transfer learning methods yield an ICC score that ranges between and , while the model trained from scratch achieves an ICC score of . The same behaviour is observed for DISFA, where transfer learning improves over training from scratch, with ICC scores ranging from to for the former, vs. the given by the latter. However, we observe that for FERA 2017 both training from scratch and applying transfer learning appear to deliver similar results, which we attribute to the fact that FERA 2017 is a large-scale dataset that includes a large variety of poses. With such a large pool of videos, training the model from scratch suffices to yield competitive results.
Regarding complexity (see Table V), we observe that all our methods deploy models with a small number of parameters, roughly ranging between to . Moving from a model trained from scratch to transfer learning requires no extra parameters when fine-tuning, and a negligible number of extra parameters for the reparametrisation approach. The use of adaptation layers and attention maps incurs only an extra number of parameters. We can observe that this increase is negligible compared to the original Hourglass of [bulat2017], which comprises . Similarly, the number of slightly increases in the case of adaptation layers and attention maps.
| Dataset | FERA 2017 (development set) | FERA 2017 (test set) |
6.3 Comparison between transfer learning methods
We now turn our analysis to the comparison between the proposed methods for transfer learning. While the discussed approaches deliver state of the art results (see Sec. 7), it is worth discussing the pros and cons of each, according to their performance and complexity.
The first proposed approach to transfer learning, that of fine-tuning, is undoubtedly the simplest in terms of complexity, which matches that of training the network from scratch. We observe that fine-tuning brings a considerable gain in performance w.r.t. training from scratch, especially on DISFA, which shows it to be an effective and efficient form of transfer learning.
Along the same lines, we can observe that, while the reparametrisation approach is even more efficient than fine-tuning (far fewer learnable parameters, with the same inference complexity), its performance is slightly worse.
Arguably, the best gain in performance comes from the adaptation layers and the attention maps. While the latter includes an extra step to convert the facial landmark heatmaps into AU-attention maps, the complexity can be said to be the same. However, we observe that using the adaptation layer directly from the heatmaps returned by the FAN outperforms the results given by the attention maps. We attribute this to the fact that the detected heatmaps provide some confidence, which can be more effectively used by the network to automatically infer the attention.
In summary, while fine-tuning and reparametrisation are the methods with the least complexity, the adaptation layers method yields the best performance. It is worth noting that, despite this method being the most complex of those proposed, its complexity and number of parameters compared to those of the ResNet-18 suggest that it is an efficient method for AU localisation, and thus we choose it for comparison w.r.t. state of the art works.
6.4 Core task with random weights
The proposed transfer learning methods start from a core network pre-trained for the task of Face Alignment, and include an extra set of learnable weights to perform the target task of AU localisation and intensity estimation. Given the success of the Heatmap Regression method proposed in Sec. 3, it is natural to explore whether the gain in performance comes from having a more constrained network, or from the actual features inherited from the FAN network. To validate that the contribution of the transfer learning methods does not come from the small capacity added to the core network, we study the performance of the adaptation layers when the core network is a FAN-like network that is initialised with random weights and remains frozen with these randomly initialised weights. The results of this study are those referred to as random backbone in Table II and Table III. It can be seen that, while learning only the extra network still produces competitive results, having the rich representations given by the FAN is crucial to achieve state of the art results. Note that, despite the network receiving the features from a randomly initialised FAN, the generated features after the layer are still conditioned on the input image, through some fixed random non-linear projections.
6.5 Fine-tuning after transfer learning
In addition to the aforementioned studies, we also explored an alternative approach that consists of fine-tuning the whole pipeline after the transfer learning step. In particular, we unfreeze the FAN network and we fine-tune the whole pipeline. We, however, observed no improvement in the performance.
7 Comparison with state of the art
In this Section, we report the results of our proposed approach w.r.t. state of the art results on both the validation and test partitions of FERA 2015 and FERA 2017, as well as on the 3-fold cross-validation experiment on DISFA. As in Sec. 6, we use as a baseline a ResNet-18 [he2016] which is trained to directly regress AU intensity levels. In addition, we report the results of the network trained from scratch, herein referred to simply as HR. Finally, for the sake of clarity, we report the results of the best performing method among the proposed transfer learning approaches, that of the Adaptation Layers (Sec. 4.3), herein simply referred to as Ours.
7.1 FERA 2015 dataset
The results for FERA 2015 - Development are shown in Table IV, whereas those regarding the test partition are shown in Table VII. It is important to recall that, given that the FERA 2015 test set is not publicly available, current works report on the development set, hence the lack of up-to-date results on the test partition. Despite the recent advances and the improved results on the development set, our method outperforms state of the art results by a considerable margin. We observe that the transfer learning approach is crucial to attain a new state of the art result on both partitions, proving the effectiveness of such an approach when working with small datasets.
7.2 DISFA dataset
We report the results of the 3-fold cross-validation experiment on DISFA in Table IV. We can observe that both Heatmap Regression and transfer learning outperform existing methods in such a challenging dataset. We attribute this gain to the fact that localising the AUs is more effective than resorting to complex Autoencoder networks such as the one proposed in the 2DC [tran2017]. With Heatmap Regression, the network returns a structured representation that already captures the AU dependencies in a geometric way, and thus no additional dependencies need to be learned.
7.3 FERA2017 dataset
Results on the FERA 2017 validation and test sets are given in Table VI. Note that for FERA 2017 we choose to report the Root Mean Squared Error (RMSE), as this was the measure of choice for the challenge. Our approach achieves an ICC score of on the validation set, which is by better than that of the FERA 2017 challenge winners, who attained an ICC score of . Similarly, in terms of RMSE our method outperforms the challenge winners by a margin. The same pattern is also found on the test set, where our method reports an ICC score of , which surpasses the ICC reported by the challenge winners. We can also observe that both HR-Scratch and our transfer learning approach deliver similar results, which we attribute to the fact that FERA 2017 is already a large-scale dataset.
Qualitative evaluation: In addition to the reported results, we show the capabilities of our method to infer both the location and the intensity of Action Units in Fig. 7. We observe that for FERA 2017, which spans a large set of poses, our method is capable of estimating the location and intensity accurately, thus supporting the efficacy of Heatmap Regression for the task of AU intensity estimation.
We have proposed a simple yet efficient approach for the problem of facial Action Unit intensity estimation: that of joint localisation and intensity estimation through heatmap regression. To accommodate the varying AU levels in the framework of heatmap regression, we modify the ground-truth heatmaps by changing their size and amplitude according to the corresponding AU intensity. Then, motivated by the similarities of our approach with those of the face alignment task, along with the fact that the task of face alignment is equipped with rich annotations, we reformulate the task of AU heatmap regression with an incremental learning approach. To do so, we incorporate into our setting a pre-trained facial landmark network that provides us with rich face-related features across a variety of poses and illuminations. We conducted extensive experiments illustrating how the proposed approach systematically improves the Intra Class Correlation (ICC) and thus achieves state of the art results on three benchmark datasets: FERA2015, DISFA and FERA2017.
The work of Ioanna Ntinou was supported by the Horizon Centre for Doctoral Training, School of Computer Science, University of Nottingham. The work of Michel Valstar was co-funded by the NIHR Nottingham Biomedical Research Centre.