Action Units Recognition by Pairwise Deep Architecture

10/01/2020, by Junya Saito, et al.

In this paper, we propose a new automatic Action Units (AUs) recognition method used in a competition, Affective Behavior Analysis in-the-wild (ABAW). Our method tackles the problem of AUs label inconsistency among subjects by using a pairwise deep architecture. While the baseline score is 0.31, our method achieved 0.65 on the validation dataset of the competition.

I Introduction

Automatic Action Units (AUs) recognition is useful and important in facial expression analysis [zhi2020comprehensive, martinez2017automatic]. AUs are defined as atomic facial muscle actions; for example, AU04 indicates brow lowerer and AU06 indicates cheek raiser. AUs are scored by occurrence or intensity. AUs occurrence is described on a binary scale, and AUs intensity is described as neutral or on a five-point ordinal scale, A-B-C-D-E, where E refers to maximum evidence. AUs intensity represents the facial muscle contraction level. The AUs occurrence or intensity (the AUs label) is determined by human experts, called coders, based on the facial appearance of target subjects.

Recently, a competition including an automatic AUs recognition task, Affective Behavior Analysis in-the-wild (ABAW), was held at FG2020 [kollias2020analysing, kollias2019expression, kollias2018aff, kollias2018multi, kollias2019deep, zafeiriou2017aff, kollias2017recognition]. In the competition, training and validation datasets that include multiple videos and AUs occurrence annotations for each frame image of the videos are provided. Participants are required to submit AUs occurrence recognition results for each frame image of the test dataset videos and are compared based on an evaluation metric composed of F1 and accuracy. In this paper, we explain a new automatic AUs recognition method used in the competition.

There are several problems in automatic AUs recognition, and many methods to solve them have been proposed [zhi2020comprehensive]. In this paper, we assume that there is a problem with AUs label criteria, and we propose a method to tackle it. The problem is that the AUs label criteria for facial appearance are inconsistent across different videos, and this makes it difficult to recognize the AUs label. A simple previous method that predicts the AUs label from only a single image [niinuma2019unmasking] implicitly supposes that there is a correspondence between facial appearance and AUs label and that the criteria are consistent; thus, the problem might degrade the performance of such a method.

We assume that the mechanism of coders' determination causes the problem. AUs are defined as muscle actions; however, a coder cannot observe the muscles directly and can observe only facial appearance to determine the AUs label. Thus, we assume that a coder first observes the whole video of a target subject and understands the temporal facial appearance change in the video. The facial appearance indicates subject-independent features such as inner brow position or frown line depth in the case of AU04. Then the coder infers a mapping from facial appearance change level to AUs intensity level based on the temporal change. Finally, the coder determines the AUs label for each frame image of the video according to the mapping. As mentioned above, the mapping changes with the target subject video. This means that the AUs label criteria for facial appearance change across different videos.

Our new automatic AUs recognition method tackles this problem. The method trains a model to output a pseudo-intensity, which represents the subject-independent facial appearance change level, by using a pairwise deep architecture similar to a Siamese network [doughty2018s], and trains a mapping model that converts the pseudo-intensity to the AUs label based on the temporal facial appearance change.

The contributions of this paper are as follows.

  • We present the problem of AUs label criteria change.

  • We propose a new automatic AUs recognition method that tackles the problem by using a pairwise deep architecture.

Fig. 1: Pairwise deep architecture for training pseudo-intensity model

II Related Works

In this section, we introduce related previous methods and explain their relation to our method.

II-A Methods using temporal features

Baltrušaitis et al. proposed a method to normalize the feature vector by the median of the temporally varying feature vector in the target video [baltruvsaitis2015cross] and a method to normalize the recognition result value by the n-th percentile of the temporally varying recognition result value [baltruvsaitis2016openface]. These methods can capture the neutral face by using the median or n-th percentile, but they cannot capture various features of the temporal facial appearance change such as its range or distribution.
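
As a rough illustration (a minimal sketch with assumed array shapes and function names, not the implementations of [baltruvsaitis2015cross, baltruvsaitis2016openface]), such per-video normalization could look like:

import numpy as np

def normalize_by_median(features):
    # features: (num_frames, feature_dim) array for one video.
    # The per-dimension median acts as an estimate of the neutral face.
    return features - np.median(features, axis=0, keepdims=True)

def normalize_by_percentile(scores, q=5):
    # scores: (num_frames,) array of per-frame recognition result values.
    # Shifting by the q-th percentile removes a per-video offset.
    return scores - np.percentile(scores, q)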

He et al. and Chu et al. proposed methods that can capture temporal features by using an RNN or LSTM [he2017multi, chu2017learning]. If we set the sequence length long enough, employ a temporally bidirectional network (specifically a bidirectional LSTM), and train with a sufficiently large amount of AUs data, then the RNN approach may conceptually be able to solve the problem of AUs label criteria change. However, neither method satisfies both the sequence-length condition and the use of a bidirectional network, and it is difficult to prepare a large amount of AUs data because coders' annotation takes a very long time [martinez2017automatic]. Thus, it is practically difficult to solve the problem with the RNN approach.

II-B Methods using pairwise architecture

Baltrušaitis et al. and Liang et al. proposed methods that calculate a pseudo-intensity based on a global ranking in the target video and convert the pseudo-intensity to a label by using a normalization function or an RNN [baltruvsaitis2017local, liang2018multimodal]. The global ranking is calculated by merging local rankings in the target video, where a local ranking indicates the intensity ranking of an image pair. The normalization function is similar to our mapping model but does not capture temporal changes of facial appearance features, and the RNN approach has the problems mentioned above. Thus, it is difficult to solve the problem with these methods. Additionally, their pseudo-intensity is based only on comparative relationships within the target video, so the methods do not perform well for videos in which the intensity stays high throughout. In contrast, our pseudo-intensity is based on subject-independent features, so our method is expected to perform well in such cases.

III Methodology

Fig. 2: Architecture for training mapping model

In this section we explain our new method for automatic AUs recognition to tackle the problem of AUs label criteria change.

The method consists of two steps in the training phase. First, the method trains a model to output a pseudo-intensity, which represents the subject-independent facial appearance change level. The training dataset for this model consists of image pairs from the same video, with labels, created from videos of various subjects. The label is the AUs intensity ranking of the image pair. The model is trained to make the pseudo-intensity ranking agree with the intensity ranking by using this training dataset and a pairwise deep architecture, on the basis that the AUs label criteria for facial appearance are consistent within the same video.

Second, the method trains a mapping model that converts the pseudo-intensity, together with temporal features of the pseudo-intensities, to the AUs label. In this paper, the temporal features are composed of range and distribution features of the pseudo-intensities in the video. In the prediction phase, the method receives a target image and the frame images of the video containing it, calculates the pseudo-intensities, and converts them to AUs labels by using the mapping model.

In the rest of this section, we explain the details of the training phase.

Step 1. Training Pseudo-intensity Model

Given a set of input images $\{x_{s,f}\}$ with their corresponding labels $\{y_{s,f}\}$, where $s$ is the subject id and $f$ is the video frame id of that subject, we construct a training dataset $D$ of size $N$ for the pseudo-intensity model. We define $D$ as:

$D = \{(x_{s,i},\, x_{s,j},\, c_{s,i,j})\}, \quad c_{s,i,j} = \operatorname{sign}(y_{s,i} - y_{s,j})$   (1)

The training dataset is made by sampling from the set of input images and labels.
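
A minimal sketch of how such a pairwise training set could be assembled (the data layout and the sampling scheme are assumptions for illustration, not the paper's exact procedure):

import random

def build_pair_dataset(images, labels, pairs_per_subject=1000):
    # images[s][f] / labels[s][f]: frame image and AUs intensity label of
    # subject s at frame f. Pairs are drawn only within the same video so
    # that both images share the same label criteria.
    dataset = []
    for s in images:
        frames = list(images[s].keys())
        for _ in range(pairs_per_subject):
            i, j = random.sample(frames, 2)
            diff = labels[s][i] - labels[s][j]
            if diff == 0:
                continue  # keep only pairs with a clear intensity ranking
            c = 1 if diff > 0 else -1
            dataset.append((images[s][i], images[s][j], c))
    return dataset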

We next construct a pairwise deep architecture for training the pseudo-intensity model, as shown in Fig. 1. Let the model be a Convolutional Neural Network (CNN). The model takes image $x_{s,i}$ and outputs pseudo-intensity $p_{s,i}$, and another model whose weights are shared takes image $x_{s,j}$ and outputs pseudo-intensity $p_{s,j}$. The loss function consists of $p_{s,i}$, $p_{s,j}$, and $c_{s,i,j}$ as:

$L = \max\bigl(0,\; m - c_{s,i,j}\,(p_{s,i} - p_{s,j})\bigr)$   (2)

where $m > 0$ is a margin. The loss function is similar to that in [doughty2018s]. The pseudo-intensity model is trained with this training dataset and the pairwise deep architecture.
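
A minimal PyTorch sketch of this pairwise, shared-weight architecture, assuming a VGG16 backbone and a margin ranking loss in the spirit of [doughty2018s]; the margin value and the single-output head are assumptions for illustration:

import torch
import torch.nn as nn
from torchvision import models

class PseudoIntensityModel(nn.Module):
    # CNN mapping a face image to a scalar pseudo-intensity.
    def __init__(self):
        super().__init__()
        self.backbone = models.vgg16(pretrained=True)
        # Replace the 1000-way ImageNet classifier with a single output.
        self.backbone.classifier[6] = nn.Linear(4096, 1)

    def forward(self, x):
        return self.backbone(x).squeeze(-1)

model = PseudoIntensityModel()
criterion = nn.MarginRankingLoss(margin=0.5)  # margin is an assumed value

def pair_loss(img_i, img_j, c):
    # c: tensor of +1 if img_i has the higher AUs intensity, -1 otherwise.
    p_i = model(img_i)  # both branches use the same weights
    p_j = model(img_j)
    return criterion(p_i, p_j, c)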

Step 2. Training Mapping Model

We generate pseudo-intensities by using the trained pseudo-intensity model and train a mapping model with the pseudo-intensities and the AUs labels. We present the architecture for training the mapping model in Fig. 2. Let the training dataset for the mapping model be $D' = \{(p_{s,f},\, g_s,\, y_{s,f})\}$ of size $\sum_s N_s$, where $N_s$ is the number of video frames of subject $s$ and $g_s$ is the feature extracted from the pseudo-intensities of the video of subject $s$. The feature extractor generates a feature that represents the range and distribution of the pseudo-intensities. Specifically, the feature consists of percentile features (0-th percentile, 10-th percentile, ...) and frequency features (the frequency of pseudo-intensities falling within each of a set of value ranges). The training dataset is made by sampling from the set of input images and labels.

We compose the mapping model of a Fully Connected Network (FCN) and use the cross-entropy loss.
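
A minimal sketch of the percentile/frequency features and an FCN mapping model (the number of histogram bins, the layer widths, and the use of a multi-label cross-entropy are assumptions for illustration):

import numpy as np
import torch
import torch.nn as nn

def temporal_features(pseudo_intensities, num_bins=10):
    # Range and distribution features of one video's pseudo-intensities:
    # percentiles (0th, 10th, ..., 100th) and normalized bin frequencies.
    p = np.asarray(pseudo_intensities, dtype=np.float32)
    percentiles = np.percentile(p, np.arange(0, 101, 10))
    freqs, _ = np.histogram(p, bins=num_bins)
    freqs = freqs / max(len(p), 1)
    return np.concatenate([percentiles, freqs]).astype(np.float32)

class MappingModel(nn.Module):
    # FCN converting (pseudo-intensity, video-level features) to AU logits.
    def __init__(self, feature_dim, num_aus):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1 + feature_dim, 4096), nn.ReLU(), nn.Dropout(),
            nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(),
            nn.Linear(4096, num_aus),
        )

    def forward(self, pseudo_intensity, video_features):
        x = torch.cat([pseudo_intensity.unsqueeze(-1), video_features], dim=-1)
        return self.net(x)

# Trained with a cross-entropy-style loss, e.g. nn.BCEWithLogitsLoss() for
# multi-label AU occurrence.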

IV Experiment

In this section, we explain experimental results on the competition dataset.

IV-A Datasets

We used the dataset provided in the competition, called Aff-Wild2, and several additional datasets. The additional datasets are BP4D [zhang2014bp4d, zhang2013high] with AUs intensity. From BP4D, nine different face orientations were created for FERA2017 [valstar2017fera], and we used them. Moreover, we created additional face orientations, namely 60 and 80 degrees of yaw, together with mirrored images of these, and used them as well. As the training dataset for the pseudo-intensity model, we used BP4D with AUs intensity and Aff-Wild2 with AUs occurrence. As the training dataset for the mapping model, we used Aff-Wild2 with AUs occurrence. As the validation dataset, we used Aff-Wild2 with AUs occurrence.

IV-B Settings

The CNN of the pseudo-intensity model is a VGG16 network pre-trained on ImageNet [simonyan2015very], and the FCN of the mapping model is configured like the classifier layers of VGG16. As pre-processing, we applied Procrustes analysis to the images, following [niinuma2019unmasking]. We selected the best result based on the validation score among five training trials under the same conditions, because randomness at initialization or during training may change the performance.
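
A minimal sketch of Procrustes alignment on 2D facial landmarks (only the landmark alignment step; the full pre-processing in [niinuma2019unmasking] may differ, and the reference shape is assumed given):

import numpy as np

def procrustes_align(landmarks, reference):
    # Align (num_points, 2) landmarks to a reference shape by removing
    # translation, scale, and rotation (orthogonal Procrustes solution).
    X = landmarks - landmarks.mean(axis=0)
    Y = reference - reference.mean(axis=0)
    X = X / np.linalg.norm(X)
    Y = Y / np.linalg.norm(Y)
    U, S, Vt = np.linalg.svd(X.T @ Y)
    R = U @ Vt                  # optimal rotation (may include reflection)
    return S.sum() * (X @ R)    # aligned, reference-scaled landmarks

# scipy.spatial.procrustes provides an equivalent reference implementation.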

IV-C Evaluation Metric

In the competition, an evaluation metric is defined. The metric is:

$\text{Metric} = 0.5 \times \text{F1} + 0.5 \times \text{Accuracy}$   (3)

where F1 is the unweighted mean F1 score over the AUs and Accuracy is the total accuracy.
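
A sketch of computing this metric with scikit-learn (assuming macro-averaged F1 over AUs and element-wise accuracy over all frame/AU decisions, a reading that matches the baseline numbers in Table I):

import numpy as np
from sklearn.metrics import f1_score

def competition_metric(y_true, y_pred):
    # y_true, y_pred: binary arrays of shape (num_frames, num_aus).
    f1 = f1_score(y_true, y_pred, average="macro")                 # unweighted mean over AUs
    accuracy = (np.asarray(y_true) == np.asarray(y_pred)).mean()   # total accuracy
    return 0.5 * f1 + 0.5 * accuracy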

IV-D Result

Table I presents the results of the baseline and our method on the validation dataset. The baseline is from [kollias2020analysing]. The test dataset has not been released, so we evaluated on the validation dataset. The result indicates that our method outperforms the baseline. However, this does not confirm that the problem certainly exists or that our method performs as expected; analyzing this is our future work.

Method                              Average F1   Total Accuracy   Competition Metric
Baseline [kollias2020analysing]        0.22          0.40              0.31
Ours                                   0.34          0.95              0.65
TABLE I: Results on the validation dataset

V Conclusion

We proposed a new automatic AUs recognition method used in the ABAW competition. Our method uses a pairwise deep architecture to tackle the problem of AUs label criteria changing across different videos. Moreover, we compared our method with the baseline under the competition evaluation metric, and the result shows that our method outperforms the baseline. As future work, we will analyze whether the problem certainly exists and whether our method performs as expected.

References