Spotting Macro- and Micro-expression Intervals in Long Video Sequences

Ying He et al. ∙ December 18, 2019

This paper presents baseline results for the Third Facial Micro-Expression Grand Challenge (MEGC 2020). Both macro- and micro-expression intervals in CAS(ME)^2 and SAMM Long Videos are spotted by employing the method of Main Directional Maximal Difference Analysis (MDMD). The MDMD method uses the maximal difference in magnitude along the main direction of optical flow features to spot facial movements. The single-frame predictions of the original MDMD method are post-processed into reasonable video intervals. The baseline results are evaluated with the F1-score metric: for CAS(ME)^2, the F1-scores are 0.1196 and 0.0082 for macro- and micro-expressions respectively, and the overall F1-score is 0.0376; for SAMM Long Videos, the F1-scores are 0.0629 and 0.0364 for macro- and micro-expressions respectively, and the overall F1-score is 0.0445. The baseline project code is publicly available at https://github.com/HeyingGithub/Baseline-project-for-MEGC2020_spotting.

I Introduction

Facial expressions are important non-verbal cues that convey emotions. Macro-expressions are the familiar facial expressions of daily life. There is also a special type of expression, the “micro-expression”, first reported by Haggard and Isaacs [5]. Micro-expressions (MEs) are involuntary facial movements occurring spontaneously when a person attempts to conceal an experienced emotion in a high-stakes environment. The duration of MEs is very short, generally less than 500 milliseconds (ms) [21, 10]. The close connection between MEs and deception gives this research great significance for applications such as medical care [3] and law enforcement [4].

Expression spotting aims to find the moments when expressions occur in whole video sequences. In the Second Micro-Expression Spotting Challenge (MEGC 2019) [14], methods for spotting ME intervals in long videos were explored [7]. In the past decades, several approaches for spotting MEs have been proposed [12, 20, 18, 17, 16, 24, 9, 19, 8, 11]. However, MEs are often accompanied by macro-expressions, and both types of expressions are valuable for affect analysis. Therefore, developing methods to spot both macro- and micro-expressions is the main theme of MEGC 2020.

In this paper, we provide the baseline method and results for the Third Facial Micro-Expression Grand Challenge (MEGC 2020): spotting macro- and micro-expression intervals in long video sequences from the CAS(ME)^2 and SAMM Long Videos datasets. The main method is the Main Directional Maximal Difference Analysis (MDMD) [19]. The original MDMD only predicts whether a frame belongs to a facial movement. To obtain target intervals, adjacent frames consistently predicted to be macro- or micro-expressions form an interval, and intervals that are too long or too short are removed. Parameters are adjusted to the specific expression types of the specific datasets. The F1-score is used as the performance metric for the evaluation on the two long video datasets.

The rest of the paper is organized as follows: Section II presents the methodology and performance metrics. Section III introduces the detailed experiment results. Section IV concludes the paper.

II Methodology

This section describes the benchmark datasets, the baseline method, and the performance metrics.

II-A Datasets

CAS(ME)^2 [13]: In part A of the CAS(ME)^2 database, there are 22 subjects and 98 long videos. The facial movements are classified as macro- and micro-expressions. A video sample may contain multiple macro- or micro-expressions. The onset, apex and offset indices of these expressions are given in an Excel file. In addition, eye blinks are labeled with onset and offset times.

SAMM Long Videos [22]: The original SAMM dataset [2] contains 159 micro-expressions and was used for the past two micro-expression recognition challenges [23, 14]. Recently, the authors [22] released the SAMM Long Videos dataset, which consists of 147 long videos containing 343 macro-movements and 159 micro-movements. The onset, apex and offset frame indices of micro- and macro-movements are given in the ground-truth Excel file.

More detailed and comparative information of these two datasets is presented in Table I.

Dataset CAS(ME)^2 SAMM Long Videos
Participants 22 32
Video samples 98 147
Macro-expressions 300 343
Micro-expressions 57 159
Resolution 640×480 2040×1088
FPS 30 200
TABLE I: A comparison between CAS(ME)^2 and SAMM Long Videos.

II-B Baseline method

1) Preprocessing

Expression spotting focuses on facial regions, so we preprocess every video sample by cropping and resizing the facial region in all frames. For each video, we locate the rectangular box that exactly bounds the facial region in the first frame, and then all frames of the video are cropped and resized according to this box. We locate the bounding box according to facial landmarks detected with the corresponding function in the “Dlib” toolkit [6], as we found that applying a face-detection algorithm directly does not behave well. The preprocessing details are as follows.

Firstly, we use the landmark-detecting function in the “Dlib” toolkit to obtain 68 facial landmarks on the face in the first frame of the video, as illustrated in Fig. 1(a) for the first frame of s23_0102 in CAS(ME)^2. The landmarks are numbered in the order of the list returned by the landmark-detecting function in “Dlib”, and the coordinate of the j-th landmark is denoted $(x_j, y_j)$. The coordinate system is consistent with the one in the OpenCV toolkit [1], i.e. the x-axis is the horizontal direction from left to right, and the y-axis is the vertical direction from top to bottom. The green dots in Fig. 1 are the landmarks, and some of the serial numbers are marked by the red text.

Secondly, in order to form a rectangular box that bounds the facial region exactly, the leftmost, rightmost, topmost and bottommost landmarks are identified, with coordinates $(x_l, y_l)$, $(x_r, y_r)$, $(x_t, y_t)$ and $(x_b, y_b)$, respectively. Rather than forming the box directly from these coordinates, we form two corner points to obtain the box B, with $(x_l, y'_t)$ as the upper-left corner and $(x_r, y_b)$ as the lower-right corner, where $y'_t$ lies above $y_t$: the upper edge of the box is moved up a relative distance to retain more of the region around the eyebrows. In Fig. 1(a), the box B is illustrated by the blue rectangle.

Thirdly, as shown in Fig. 1(b), the region in B, we found that for several subjects in the two datasets the crop contains too much area at the bottom because of inaccurate landmark detection. We therefore detect landmarks again on the region of the first frame inside B in order to crop the face more precisely, as shown in Fig. 1(c). This yields a new bottommost landmark with vertical coordinate $y'_b$, and the bottom of the box is updated to the smaller of $y_b$ and $y'_b$. A new rectangular box B' is then formed with the same upper-left corner and the updated lower-right corner. In Fig. 1(c), the box B' is illustrated by the blue rectangle, and the region of the first frame in B' is shown in Fig. 1(d), in which the facial region is located better.

Finally, after obtaining the box B', we crop all frames of the video to the rectangular box B', obtaining the facial regions, and the cropped regions are then resized to a fixed size.

Fig. 1: Diagram of how we obtain facial regions in the preprocessing step: (a) detect facial landmarks and form the rectangular box B; (b) the region in B; (c) detect facial landmarks in the region in B and form the rectangular box B'; (d) the region in B'.
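As an illustration of this preprocessing step, the following is a minimal sketch of the cropping procedure described above, not the released baseline code: the predictor file name, the upward offset ratio and the output size are assumptions, since their exact values are not stated here.

```python
import cv2
import dlib
import numpy as np

# Assumed predictor file; the offset ratio and output size below are illustrative.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def landmarks(gray):
    """Return a (68, 2) array of landmark coordinates for the first detected face."""
    rect = detector(gray, 1)[0]
    shape = predictor(gray, rect)
    return np.array([[p.x, p.y] for p in shape.parts()])

def crop_box(first_frame, up_ratio=0.15):
    """Form the box B from the first frame, then refine its bottom edge (box B')."""
    gray = cv2.cvtColor(first_frame, cv2.COLOR_BGR2GRAY)
    pts = landmarks(gray)
    x_l, x_r = pts[:, 0].min(), pts[:, 0].max()
    y_t, y_b = pts[:, 1].min(), pts[:, 1].max()
    # Move the upper edge up by a relative distance to keep the eyebrow region.
    y_t = max(int(y_t - up_ratio * (y_b - y_t)), 0)
    # Re-detect landmarks inside B and keep the smaller of the two bottom coordinates.
    sub = gray[y_t:y_b, x_l:x_r]
    y_b2 = landmarks(sub)[:, 1].max() + y_t
    return x_l, y_t, x_r, min(y_b, y_b2)

def crop_video(frames, size=(227, 227)):
    """Crop every frame of a video with the box located in the first frame, then resize."""
    x1, y1, x2, y2 = crop_box(frames[0])
    return [cv2.resize(f[y1:y2, x1:x2], size) for f in frames]
```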

2) MDMD

The Main Directional Maximal Difference Analysis (MDMD) method was proposed in [19]. The main idea is that when an expression happens, the face goes through a process of producing the expression and returning to a neutral face, and the main movement directions are opposite in these two phases. By analyzing this, expressions can be spotted. Here we review the MDMD method.

Given a video with $n$ frames, the current frame is denoted $F_i$. $F_{i-k}$ is the $k$-th frame before $F_i$, and $F_{i+k}$ is the $k$-th frame after $F_i$. The robust local optical flow (RLOF) [15] between the frame $F_{i-k}$ (Head Frame) and the frame $F_i$ (Current Frame) is computed. We denote this optical flow by $(u_{HC}, v_{HC})$, the displacement of any point. Similarly, the optical flow between the frame $F_{i-k}$ (Head Frame) and the frame $F_{i+k}$ (Tail Frame) is denoted by $(u_{HT}, v_{HT})$. Then, $(u_{HC}, v_{HC})$ and $(u_{HT}, v_{HT})$ are converted from Euclidean coordinates to polar coordinates $(\rho_{HC}, \theta_{HC})$ and $(\rho_{HT}, \theta_{HT})$, where $\rho$ and $\theta$ represent, respectively, the magnitude and direction.

Based on the directions $\theta_{HC}$, all the optical flow vectors $(\rho_{HC}, \theta_{HC})$ are divided into $a$ directions. Fig. 2 illustrates the condition when $a = 4$. The Main Direction $\Theta$ is the direction that contains the largest number of optical flow vectors among the $a$ directions. The main directional optical flow vectors $(\rho^{M}_{HC}, \theta^{M}_{HC})$ are the optical flow vectors that fall in the Main Direction $\Theta$.

Fig. 2: Four directions in the polar coordinates.
The Main Direction is determined as
$\Theta = \arg\max_{j \in \{1,\dots,a\}} \#\{\, (\rho_{HC}, \theta_{HC}) \mid \theta_{HC} \in \text{direction } j \,\} \qquad (1)$

For each main directional vector, the optical flow vector between frame $F_{i-k}$ and frame $F_{i+k}$ computed at the same point is denoted as $(\rho^{M}_{HT}, \theta^{M}_{HT})$, and the set of magnitude differences in the Main Direction is
$S_i = \{\, \rho^{M}_{HC} - \rho^{M}_{HT} \mid (\rho^{M}_{HC}, \theta^{M}_{HC}) \text{ falls in } \Theta \,\} \qquad (2)$

After the differences in $S_i$ are sorted into descending order, the maximal difference $\bar{d}_i$ is defined as the mean of the first third of the differences, characterizing the frame as in the formula:

$\bar{d}_i = \frac{1}{\lceil |S_i|/3 \rceil} \sum_{d \in \Gamma_i} d \qquad (3)$

where $|S_i|$ is the number of elements in the subset $S_i$, and $\Gamma_i$ denotes the set comprised of the first $\lceil |S_i|/3 \rceil$ maximal elements of $S_i$.

Since MDMD is a block-based analysis, the cropped facial region of each frame is divided into blocks, as shown in Fig. 3, and we calculate the maximal difference for each block in the frame. For frame $F_i$ there is therefore one maximal difference per block. We arrange these block-wise maximal differences in descending order, and the first (largest) maximal difference characterizes the frame:

$d_i = \max_{b} \bar{d}_{i,b} \qquad (4)$

where $\bar{d}_{i,b}$ is the maximal difference of block $b$ in frame $F_i$.
Fig. 3: Examples of facial block structure.
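The per-frame feature computation described above can be sketched as follows. This is a simplified illustration, not the baseline implementation: it uses OpenCV's Farneback dense optical flow as a stand-in for RLOF [15] (dense RLOF is only available in opencv-contrib), and the 6×6 block count is an assumed value, since the exact number is taken from [19].

```python
import cv2
import numpy as np

def mdmd_feature(head, cur, tail, num_dirs=4, blocks=6):
    """Maximal difference d_i for the current frame, as described above.

    head, cur, tail: grayscale frames F_{i-k}, F_i, F_{i+k}.
    Farneback flow is used here as a stand-in for RLOF [15].
    """
    flow_hc = cv2.calcOpticalFlowFarneback(head, cur, None, 0.5, 3, 15, 3, 5, 1.2, 0)
    flow_ht = cv2.calcOpticalFlowFarneback(head, tail, None, 0.5, 3, 15, 3, 5, 1.2, 0)

    mag_hc, ang_hc = cv2.cartToPolar(flow_hc[..., 0], flow_hc[..., 1])
    mag_ht, _ = cv2.cartToPolar(flow_ht[..., 0], flow_ht[..., 1])

    h, w = cur.shape
    bh, bw = h // blocks, w // blocks
    d_max = 0.0
    for by in range(blocks):
        for bx in range(blocks):
            sl = (slice(by * bh, (by + 1) * bh), slice(bx * bw, (bx + 1) * bw))
            # Assign each head-to-current vector in the block to one of num_dirs bins.
            bins = (ang_hc[sl] / (2 * np.pi / num_dirs)).astype(int) % num_dirs
            main_dir = np.bincount(bins.ravel(), minlength=num_dirs).argmax()
            mask = bins == main_dir
            # Magnitude differences at points in the Main Direction, mean of the top third.
            diff = np.sort((mag_hc[sl] - mag_ht[sl])[mask])[::-1]
            top = max(len(diff) // 3, 1)
            d_max = max(d_max, float(diff[:top].mean()))
    return d_max
```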

If a person maintains a neutral expression at $F_{i-k}$, her/his emotional expression, such as disgust, starts at an onset frame between $F_{i-k}$ and $F_i$, is repressed at an offset frame between $F_i$ and $F_{i+k}$, and then the face recovers a neutral expression at $F_{i+k}$, as presented in Fig. 4(a). In this circumstance, the movement between $F_{i-k}$ and $F_i$ is more intense than the movement between $F_{i-k}$ and $F_{i+k}$ because the expression is neutral at both $F_{i-k}$ and $F_{i+k}$; therefore, the value $d_i$ will be large. Another situation is that a person maintains a neutral expression from $F_{i-k}$ to $F_{i+k}$. The movement between $F_{i-k}$ and $F_i$ is then similar to the movement between $F_{i-k}$ and $F_{i+k}$; thus, the value $d_i$ will be small. In a long video, sometimes an emotional expression starts at an onset frame before $F_{i-k}$ and is repressed at an offset frame after $F_{i+k}$, as presented in Fig. 4(b). In this case, the value $d_i$ will also be small if $k$ is set to a small value. However, $k$ cannot be set to a large value because this would influence the accuracy of the computed optical flow.

Fig. 4: Two situations: (a) an emotional expression starting at an onset frame between $F_{i-k}$ and $F_i$ is repressed at an offset frame between $F_i$ and $F_{i+k}$ and the face recovers a neutral expression at $F_{i+k}$; (b) an emotional expression starting at an onset frame before $F_{i-k}$ is repressed at an offset frame after $F_{i+k}$.

We employ a relative difference value $r_i$ for eliminating the background noise, which is computed by:

(5)

Therefore, the frame $F_i$ is characterized by $r_i$. A threshold is used to obtain the frames that have peaks representing the facial movements in a video:

(6)

where $p$ is a variable parameter in the range $[0, 1]$. The frames with $r_i$ larger than the threshold are the frames where expressions appear.
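A minimal sketch of this peak selection is shown below. The exact form of Eq. (6) is not reproduced above, so the mean + p × (max − mean) rule used here is our assumption, chosen only because it is consistent with $p$ lying in $[0, 1]$.

```python
import numpy as np

def spot_frames(r, p):
    """Return a boolean mask of frames whose value exceeds the threshold.

    r: per-video array of per-frame values (e.g., the relative differences above).
    The threshold form mean + p * (max - mean) is an assumption, not taken from Eq. (6).
    """
    r = np.asarray(r, dtype=float)
    threshold = r.mean() + p * (r.max() - r.mean())
    return r > threshold
```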

3) Parameter settings and post-processing

In [19], several parameter combinations were explored for spotting micro-expressions on the CAS(ME)^2 dataset. For spotting both macro- and micro-expressions on the two MEGC 2020 datasets, i.e. CAS(ME)^2 and SAMM Long Videos, we select the best combination of blocks and directions explored in [19], and we set the other parameters according to the FPS of the two datasets. Moreover, since the original MDMD only predicts whether a frame belongs to a facial movement, a post-processing step is added in order to output the target intervals required by MEGC 2020. The details are as follows.

The number of blocks follows the best combination explored in [19], and the number of directions is set to $a = 4$. In the CAS(ME)^2 dataset, the frame interval $k$ is set to 12 for micro-expressions and 39 for macro-expressions; in the SAMM Long Videos dataset, $k$ is set to 80 for micro-expressions and 260 for macro-expressions. Concerning the threshold, $p$ varies from 0.01 to 0.99 with a step size of 0.01, and the final results are reported under the setting $p = 0.01$. The original MDMD only predicts whether a frame belongs to a facial movement. To output target intervals, adjacent frames consistently predicted to be macro- or micro-expressions form an interval, and intervals that are too long or too short are removed: the number of micro-expression frames is limited to between 7 and 16 for the CAS(ME)^2 dataset and between 47 and 105 for the SAMM Long Videos dataset, while the number of macro-expression frames must be larger than 16 for the CAS(ME)^2 dataset and larger than 105 for the SAMM Long Videos dataset.
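The interval-forming post-processing can be sketched as follows; the function name and signature are ours, and the frame-count limits in the commented usage example are the CAS(ME)^2 values given above.

```python
def frames_to_intervals(mask, min_len, max_len=None):
    """Group consecutive spotted frames into (onset, offset) intervals and
    filter them by length; max_len=None means no upper bound."""
    intervals, start = [], None
    for idx, flag in enumerate(mask):
        if flag and start is None:
            start = idx
        elif not flag and start is not None:
            intervals.append((start, idx - 1))
            start = None
    if start is not None:
        intervals.append((start, len(mask) - 1))
    keep = []
    for onset, offset in intervals:
        length = offset - onset + 1
        if length >= min_len and (max_len is None or length <= max_len):
            keep.append((onset, offset))
    return keep

# Example for CAS(ME)^2: micro intervals of 7-16 frames, macro intervals longer than 16.
# micro_intervals = frames_to_intervals(micro_mask, 7, 16)
# macro_intervals = frames_to_intervals(macro_mask, 17)
```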

II-C Performance metrics

In order to avoid inaccuracies caused by annotation, we propose to evaluate the spotting results per interval in MEGC 2020.

1. Definition of a true positive in one video

The true positive (TP) per interval in one video is first defined based on the intersection between the spotted interval and the ground-truth interval. The spotted interval $W_{\mathrm{spotted}}$ is considered as a TP if it fits the following condition:

$\frac{|W_{\mathrm{spotted}} \cap W_{\mathrm{groundTruth}}|}{|W_{\mathrm{spotted}} \cup W_{\mathrm{groundTruth}}|} \ge k_{\mathrm{IoU}} \qquad (7)$

where $k_{\mathrm{IoU}}$ is set to 0.5 and $W_{\mathrm{groundTruth}}$ represents the ground-truth macro- or micro-expression interval (onset to offset). If the condition is not fulfilled, the spotted interval is regarded as a false positive (FP).
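A small sketch of this overlap check, assuming intervals are given as inclusive (onset, offset) frame indices:

```python
def is_true_positive(spotted, ground_truth, k_iou=0.5):
    """Check the overlap condition of Eq. (7) for two frame-index intervals."""
    s_on, s_off = spotted
    g_on, g_off = ground_truth
    # Number of frames in the intersection and in the union of the two intervals.
    inter = max(0, min(s_off, g_off) - max(s_on, g_on) + 1)
    union = (s_off - s_on + 1) + (g_off - g_on + 1) - inter
    return inter / union >= k_iou
```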

2. Result evaluation in one video

Suppose there are $m$ ground-truth intervals in the video and $n$ intervals are spotted. According to the overlap evaluation, the TP amount in one video is counted as $a$ (with $a \le m$ and $a \le n$); therefore $FP = n - a$ and $FN = m - a$. The spotting performance in one video can be evaluated by the following metrics:

$\mathrm{recall} = \frac{a}{m} \qquad (8)$

$\mathrm{precision} = \frac{a}{n} \qquad (9)$

Yet, videos in real life have some complicated situations which influence the evaluation on a single video:

  • There might be no macro- or micro-expression in the test video. In this case, $m = 0$ and the denominator of recall would be zero.

  • If there are no spotted intervals in the video, the denominator of precision would be zero since $n = 0$.

  • It is impossible to compare two spotting methods when both TP amounts are zero, as the metric values (recall, precision and F1-score) all equal zero. However, one method outperforms another if it spots fewer (spurious) intervals.

Thus, to avoid these situations, for the evaluation of a single video we only record the amounts of TP, FP and FN; other metrics are not considered for one video.

3. Evaluation for entire database

Suppose that in the entire dataset:

  • there are $M$ macro-expression (MaE) sequences and $M'$ micro-expression (ME) sequences over all videos, where $M = \sum m_{\mathrm{MaE}}$ and $M' = \sum m_{\mathrm{ME}}$;

  • the method spots $N$ MaE intervals and $N'$ ME intervals in total, where $N = \sum n_{\mathrm{MaE}}$ and $N' = \sum n_{\mathrm{ME}}$;

  • there are $A$ TPs for MaE and $A'$ TPs for ME in total, where $A = \sum a_{\mathrm{MaE}}$ and $A' = \sum a_{\mathrm{ME}}$.

The dataset can be considered as one long video. The results are first evaluated for MaE spotting and ME spotting separately, and then the overall result for macro- and micro-spotting is evaluated. The recall and precision for the entire dataset can be calculated by the following formulas:

  • for macro-expressions:

    $\mathrm{recall}_{\mathrm{MaE}} = \frac{A}{M}, \quad \mathrm{precision}_{\mathrm{MaE}} = \frac{A}{N} \qquad (10)$

  • for micro-expressions:

    $\mathrm{recall}_{\mathrm{ME}} = \frac{A'}{M'}, \quad \mathrm{precision}_{\mathrm{ME}} = \frac{A'}{N'} \qquad (11)$

  • for the overall evaluation:

    $\mathrm{recall}_{\mathrm{overall}} = \frac{A + A'}{M + M'}, \quad \mathrm{precision}_{\mathrm{overall}} = \frac{A + A'}{N + N'} \qquad (12)$

Then, the values of F1-score for all these three evaluations are obtained based on:

$\mathrm{F1\text{-}score} = \frac{2 \times \mathrm{recall} \times \mathrm{precision}}{\mathrm{recall} + \mathrm{precision}} \qquad (13)$
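Below is a minimal sketch of the dataset-level evaluation implied by Eqs. (10)-(13); the function and argument names are ours. The commented example reproduces the CAS(ME)^2 macro- and micro-expression numbers from Table III (the predicted-interval counts are TP + FP).

```python
def f1(recall, precision):
    """F1-score as in Eq. (13), with a guard against division by zero."""
    total = recall + precision
    return 0.0 if total == 0 else 2 * recall * precision / total

def evaluate(total_mae, total_me, spotted_mae, spotted_me, tp_mae, tp_me):
    """Dataset-level recall, precision and F1 per expression type and overall.

    total_*: ground-truth sequence counts; spotted_*: predicted interval counts;
    tp_*: true positive counts, all summed over the whole dataset.
    """
    results = {}
    for name, m, n, a in [("macro", total_mae, spotted_mae, tp_mae),
                          ("micro", total_me, spotted_me, tp_me),
                          ("overall", total_mae + total_me,
                           spotted_mae + spotted_me, tp_mae + tp_me)]:
        recall = a / m if m else 0.0
        precision = a / n if n else 0.0
        results[name] = {"recall": recall, "precision": precision,
                         "F1": f1(recall, precision)}
    return results

# Example with the CAS(ME)^2 baseline numbers from Table III:
# evaluate(300, 57, 109 + 1414, 21 + 5014, 109, 21)
```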

The champion of the challenge will be the participant with the best overall score for spotting macro- and micro-expressions.

III Results and Discussion

For the parameter $p$, we have studied the evaluation results by varying $p$ from 0.01 to 0.99 with a step size of 0.01; the 20 results from 0.01 to 0.20 are shown in Table II, which lists the TPs and F1-scores for macro- and micro-expression spotting respectively. We observe that, for both types of expressions in the two datasets, the number of TPs decreases as $p$ increases. The F1-score also shows a decreasing trend in SAMM Long Videos. Yet, in CAS(ME)^2, the F1-score increases at first and then begins to decrease. The initial increase of the F1-score in CAS(ME)^2 is mainly because the total number of predicted intervals ($N$ and $N'$) becomes smaller as $p$ increases, making the precision ($A/N$ and $A'/N'$) increase.

Dataset CAS(ME)^2 SAMM Long Videos
Expression macro-expression micro-expression macro-expression micro-expression
p (%) TP F1-score TP F1-score TP F1-score TP F1-score
1 109 0.1196 21 0.0082 22 0.0629 29 0.0364
2 107 0.1408 18 0.0093 20 0.0627 25 0.0356
3 96 0.1455 15 0.0100 18 0.0627 19 0.0309
4 92 0.1573 14 0.0115 16 0.0588 17 0.0306
5 91 0.1738 12 0.0121 16 0.0626 14 0.0282
6 88 0.1857 10 0.0120 14 0.0574 11 0.0245
7 81 0.1879 10 0.0142 12 0.0510 11 0.0266
8 74 0.1876 8 0.0131 10 0.0443 9 0.0239
9 73 0.1984 8 0.0155 9 0.0407 7 0.0201
10 68 0.1954 8 0.0176 8 0.0371 7 0.0214
11 61 0.1863 6 0.0150 8 0.0378 7 0.0228
12 61 0.2013 6 0.0173 8 0.0382 7 0.0245
13 57 0.1949 6 0.0190 7 0.0337 6 0.0219
14 56 0.2007 6 0.0214 7 0.0340 6 0.0227
15 50 0.1859 5 0.0197 6 0.0299 5 0.0200
16 50 0.1927 5 0.0214 6 0.0301 5 0.0210
17 48 0.1886 5 0.0236 6 0.0304 5 0.0222
18 46 0.1855 5 0.0253 6 0.0305 4 0.0183
19 43 0.1795 5 0.0275 6 0.0310 3 0.0146
20 42 0.1783 3 0.0179 6 0.0313 3 0.0152
TABLE II: Baseline results in CAS(ME)^2 and SAMM Long Videos with $p$ varying from 0.01 to 0.20 with a step size of 0.01.

Since the amount of TP is an important metric for the spotting result evaluation, we select the results under the condition $p = 0.01$ as the final baseline results. The details of the final baseline results for spotting macro- and micro-expressions are shown in Table III. For CAS(ME)^2, the F1-scores are 0.1196 and 0.0082 for macro- and micro-expressions respectively, and 0.0376 for the overall result. For SAMM Long Videos, the F1-scores are 0.0629 and 0.0364 for macro- and micro-expressions respectively, and 0.0445 for the overall result. More details about the number of true labels, TP, FP, FN, precision, recall and F1-score for the various situations are shown in Table III.

Dataset CAS(ME)^2 SAMM Long Videos
Expression macro-expression micro-expression overall macro-expression micro-expression overall
Total number 300 57 357 343 159 502
TP 109 21 130 22 29 51
FP 1414 5014 6428 334 1407 1741
FN 191 36 335 314 130 451
Precision 0.0716 0.0042 0.0198 0.0618 0.0202 0.0285
Recall 0.3633 0.3684 0.3641 0.0641 0.1824 0.1016
F1-score 0.1196 0.0082 0.0376 0.0629 0.0364 0.0445
TABLE III: Baseline results for macro- and micro-spotting ($p = 0.01$) in CAS(ME)^2 and SAMM Long Videos.

IV Conclusions

This paper addresses the challenge of spotting macro- and micro-expressions in long video sequences and provides the baseline method and results for the Third Facial Micro-Expression Grand Challenge (MEGC 2020). The Main Directional Maximal Difference Analysis (MDMD) [19] is employed as the baseline method, and the parameter settings are adjusted to CAS(ME)^2 and SAMM Long Videos for the spotting challenge in MEGC 2020. A slight modification is made in the post-processing of results to predict more reasonable intervals. The predicted results were evaluated using the MEGC 2020 metrics. The results show that the MDMD method can produce reasonable performance, but reducing the number of FPs remains a major challenge.

References

  • [1] G. Bradski. The OpenCV Library. Dr. Dobb’s Journal of Software Tools, 2000.
  • [2] A. K. Davison, C. Lansley, N. Costen, K. Tan, and M. H. Yap. SAMM: A spontaneous micro-facial movement dataset. IEEE Transactions on Affective Computing, 9(1):116–129, 2018.
  • [3] J. Endres and A. Laidlaw. Micro-expression recognition training in medical students: a pilot study. BMC Medical Education, 9(1):47, 2009.
  • [4] M. Frank, D. Kim, S. Kang, A. Kurylo, and D. Matsumoto. Improving the ability to detect micro expressions in law enforcement officers. 2014.
  • [5] E. A. Haggard and K. S. Isaacs. Micromomentary facial expressions as indicators of ego mechanisms in psychotherapy. In Methods of Research in Psychotherapy, pages 154–165. 1966.
  • [6] D. E. King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10(3):1755–1758, 2009.
  • [7] J. Li, C. Soladie, R. Sguier, S. Wang, and M. H. Yap. Spotting micro-expressions on long videos sequences. In IEEE International Conference on Automatic Face and Gesture Recognition, pages 1–5, 2019.
  • [8] X. Li, X. Hong, A. Moilanen, X. Huang, T. Pfister, G. Zhao, and M. Pietikäinen. Towards reading hidden emotions: A comparative study of spontaneous micro-expression spotting and recognition methods. IEEE Transactions on Affective Computing, 9(4):563–577, 2018.
  • [9] S.-T. Liong, J. See, K. Wong, A. C. Le Ngo, Y.-H. Oh, and R. Phan. Automatic apex frame spotting in micro-expression database. In IAPR Asian Conference on Pattern Recognition, pages 665–669, 2015.
  • [10] D. Matsumoto and H. S. Hwang. Evidence for training the ability to read microexpressions of emotion. Motivation and Emotion, 35(2):181–191, 2011.
  • [11] A. Moilanen, G. Zhao, and M. Pietikäinen. Spotting rapid facial movements from videos using appearance-based feature difference analysis. In International Conference on Pattern Recognition, pages 1722–1727, 2014.
  • [12] S. Polikovsky, Y. Kameda, and Y. Ohta. Facial micro-expression detection in hi-speed video based on facial action coding system (FACS). IEICE Transactions on Information and Systems, 96(1):81–92, 2013.
  • [13] F. Qu, S. J. Wang, W. J. Yan, H. Li, S. Wu, and X. Fu. CAS(ME)^2: A database for spontaneous macro-expression and micro-expression spotting and recognition. IEEE Transactions on Affective Computing, 9(4):424–436, 2017.
  • [14] J. See, M. H. Yap, J. Li, X. Hong, and S.-J. Wang. MEGC 2019 - the second facial micro-expressions grand challenge. In IEEE International Conference on Automatic Face and Gesture Recognition, pages 1–5, 2019.
  • [15] T. Senst, V. Eiselein, and T. Sikora. Robust local optical flow for feature tracking. IEEE Transactions on Circuits and Systems for Video Technology, 22(9):1377–1387, 2012.
  • [16] M. Shreve, J. Brizzi, S. Fefilatyev, T. Luguev, D. Goldgof, and S. Sarkar. Automatic expression spotting in videos. Image Vision Computing, 32(8):476–486, 2014.
  • [17] M. Shreve, S. Godavarthy, D. Goldgof, and S. Sarkar. Macro- and micro-expression spotting in long videos using spatio-temporal strain. In IEEE International Conference on Automatic Face and Gesture Recognition, pages 51–56, 2011.
  • [18] M. Shreve, S. Godavarthy, V. Manohar, D. Goldgof, and S. Sarkar. Towards macro- and micro-expression spotting in video using strain patterns. In Workshop on Applications of Computer Vision, pages 1–6, 2009.
  • [19] S.-J. Wang, S. Wu, X. Qian, J. Li, and X. Fu. A main directional maximal difference analysis for spotting facial movements from long-term videos. Neurocomputing, 230:382–389, 2017.
  • [20] Q. Wu, X. Shen, and X. Fu. The machine knows what you are hiding: an automatic micro-expression recognition system. In International Conference on Affective Computing and Intelligent Interaction, pages 152–162, 2011.
  • [21] W.-J. Yan, Q. Wu, J. Liang, Y.-H. Chen, and X. Fu. How fast are the leaked facial expressions: The duration of micro-expressions. Journal of Nonverbal Behavior, 37(4):217–230, 2013.
  • [22] C. H. Yap, C. Kendrick, and M. H. Yap. SAMM Long Videos: A spontaneous facial micro- and macro-expressions dataset. arXiv preprint arXiv:1911.01519, 2019.
  • [23] M. H. Yap, J. See, X. Hong, and S.-J. Wang. Facial micro-expressions grand challenge 2018 summary. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pages 675–678. IEEE, 2018.
  • [24] Z. Zhang, T. Chen, H. Meng, G. Liu, and X. Fu. SMEConvNet: A convolutional neural network for spotting spontaneous facial micro-expression from long videos. IEEE Access, 6:71143–71151, 2018.