Spotting Micro-Expressions on Long Videos Sequences

by Jingting Li, et al.

This paper presents baseline results for the first Micro-Expression Spotting Challenge 2019 by evaluating local temporal patterns (LTP) on SAMM and CAS(ME)². The proposed LTPs are extracted by applying PCA in a temporal window over several local facial regions. The micro-expression sequences are then spotted by a local classification of LTPs and a global fusion. The performance is evaluated by Leave-One-Subject-Out cross validation. Furthermore, we define the criterion for determining true positives in one video by overlap rate and adopt the F1-score as the metric for spotting performance over the whole database. The F1-scores of the baseline results for SAMM and CAS(ME)² are 0.0316 and 0.0179, respectively.






I Introduction

Facial micro-expression (ME) is a brief, local facial movement that can be triggered under high emotional pressure; its duration is less than 500 ms [1]. It is an important non-verbal communication cue, and its involuntary nature makes it possible to analyze a person's genuine emotional state. ME analysis has many potential applications in national security [2], medical care [3], educational psychology [4], and political psychology [5]. Due to the growing importance of MEs, researchers [6] have worked collaboratively to solicit work in this area by conducting challenges on datasets and methods for MEs. This year, the theme has been extended to spotting challenges [7].

The main idea of most methods for ME spotting is to compare the feature differences between the first frame and the other frames in a time window. Meanwhile, the utilized features are diverse, including LBP [8, 9], HOG [10], optical flow [11, 12, 13, 14, 15, 16, 17], integral projection [18], the Riesz pyramid [19], and frequency-domain features [20].

These approaches make comparisons between frames over a time window of the size of an ME. However, the movements spotted between frames are not specifically ME movements. This is why the ability to distinguish MEs from other movements (such as blinking or subtle head movements) remains weak.
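The comparison scheme shared by these methods can be sketched as follows. The feature extractor itself (LBP, HOG, optical flow, ...) is abstracted into a per-frame feature matrix, and the function name and window handling are illustrative assumptions rather than any specific paper's implementation:

```python
import numpy as np

def feature_difference_curve(features, window):
    """Score each frame by its feature distance to the first frame of any
    sliding window that contains it -- the common spotting idea.

    features: (n_frames, d) array of per-frame features (e.g. LBP or HOG).
    window:   window length in frames, roughly one ME duration.
    Peaks in the returned per-frame score suggest candidate movements.
    """
    features = np.asarray(features, dtype=float)
    n = len(features)
    scores = np.zeros(n)
    for start in range(n - window + 1):
        head = features[start]  # the window's first frame is the reference
        d = np.linalg.norm(features[start:start + window] - head, axis=1)
        scores[start:start + window] = np.maximum(scores[start:start + window], d)
    return scores
```

Thresholding this curve yields candidate intervals, but any facial motion (a blink, a head shift) produces a peak, which is exactly the weakness the LTP approach targets.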

In this paper, we spot the micro-expression clips in two recently published databases and establish the baseline method for the ME spotting challenge by directly using a temporal pattern extracted from local regions [21]. Frames within an ME duration are taken into account to obtain a real local temporal pattern (LTP), and the LTPs are then recognized by a classifier. Even though the spatial pattern is not studied, the spotted facial motions are differentiated by a fusion process from local to global. This improves the ability to distinguish MEs from other movements. Furthermore, it allows finding the spatial local region and the temporal onset index of an ME.

The rest of the paper is organized as follows: Section II presents the baseline methodology. Section III introduces the evaluation standard and the detailed experimental results. Section IV concludes the paper.

II Methodology

II-A Databases

Two spontaneous micro-expression databases, SAMM [22] and CAS(ME)² [23], are used for the ME spotting challenge. Both databases contain long videos recorded in a strictly controlled laboratory environment. Detailed information about these two databases is presented in the following two subsections, and Table I outlines their differences.

Database Participants Samples Resolution FPS
SAMM 32 79 2040×1088 200
CAS(ME)² 22 87 640×480 30
TABLE I: A comparison between SAMM and CAS(ME)².

II-A1 SAMM Long Videos Database

The SAMM database consists of 32 subjects, each with 7 videos [22]. The average video length is 35.3 s. The original release of SAMM consists of micro-movement clips labelled with Action Units. Recently, the authors [10] introduced objective classes and emotion classes for the database. The spotting challenge focuses on 79 videos, each containing one or multiple micro-movements, with a total of 159 micro-movements. The indices of the onset, apex and offset frames of the micro-movements are provided as ground truth. A micro-movement interval spans from the onset frame to the offset frame. In this database, all micro-movements are labeled; thus, the spotted frames can indicate not only micro-expressions but also other facial movements, such as eye blinks.

II-A2 CAS(ME)² Database

Part A of the CAS(ME)² database [23] contains 22 subjects and 87 long videos. The average duration is 148 s. The facial movements are classified as macro- and micro-expressions. A video sample may contain multiple macro- or micro-expressions. The onset, apex and offset indices for these expressions are provided by the authors in an Excel file. In addition, eye blinks are labeled with onset and offset times.

II-B Baseline method

The baseline method is developed based on the LTP-ML (local temporal pattern - machine learning) method proposed in [21]. The method is extended to long videos by employing a sliding temporal window. The main idea and the modifications of the LTP-ML method are presented in the following paragraphs.

II-B1 Pre-processing

As a micro-expression is a local facial movement, we analyze MEs only on selected regions of interest (ROIs). First, as shown in Figure 1, 84 facial landmarks are tracked in the video sequence using Genfacetracker (©Dynamixyz) [24]. The size of each square ROI is then determined proportionally to the distance between the inner corners of the left and right eyes. 12 square ROIs are chosen based on the regions where MEs occur most frequently, i.e. the corners of the eyebrows and of the mouth. Two ROIs in the nose region are chosen as reference, because the nose is the most rigid facial region.

Fig. 1: Facial landmarks tracking and ROI selection. On the left: an example from SAMM; on the right: an example from CAS(ME)².

Since the average duration of an ME is around 300 ms, and subjects barely move within one second, the long videos in these two databases are processed with a temporal sliding window of length 1 s. The overlap is set to 300 ms to avoid missing any possible ME movement. The video is thus separated into an ensemble of short sequences by the sliding temporal window, as shown in Figure 2. The positions of the 12 chosen ROIs for all frames in one sequence are determined by the landmarks detected in the first frame of the window.
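The segmentation step can be sketched as follows; only the 1 s window length and 300 ms overlap come from the text, while the function name and boundary handling are assumptions:

```python
def split_into_sequences(n_frames, fps, win_s=1.0, overlap_s=0.3):
    """Split a long video into overlapping short sequences.

    Returns (start, end) frame-index pairs (end exclusive); consecutive
    windows share overlap_s seconds so no candidate ME is cut in half.
    """
    win = int(round(win_s * fps))              # e.g. 200 frames at 200 fps
    step = win - int(round(overlap_s * fps))   # advance by win - overlap
    seqs = []
    start = 0
    while start < n_frames:
        seqs.append((start, min(start + win, n_frames)))
        if start + win >= n_frames:            # last window reached the end
            break
        start += step
    return seqs
```

For a 500-frame SAMM clip (200 fps) this gives windows (0, 200), (140, 340), (280, 480) and (420, 500).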

Fig. 2: PCA process analysis. The long video is divided into short sequences by a sliding window; PCA is then performed on the time axis of each of the 12 ROI sequences in one short clip.

II-B2 Feature Extraction

In this part, local temporal patterns (LTPs) [21] are analyzed in local regions to distinguish MEs from other movements. They are extracted from each of the 12 ROIs in every short sequence. Consider a sequence S_k, as illustrated in the lower part of Figure 2: PCA is performed on the temporal axis of each ROI sequence to conserve the principal variation in this region, and the first two components of each ROI frame are used to analyze the variation pattern of the local movement. The PCA process for ROI sequence R_j (j = 1, …, 12) in S_k can be presented as in (1):

(x_n, y_n) = PCA(I_n),  n = 1, …, N    (1)

where I_n represents the pixels of the n-th ROI frame, (x_n, y_n) are its first two PCA components, and n is the frame index in the ROI sequence R_j. Hence, each frame in R_j can be represented by a point P_n = (x_n, y_n). Then, a sliding window is set depending on the average duration of an ME (300 ms), and the distances d_n^m = ‖P_(n+m) − P_n‖ between the first frame of the window and the other frames are calculated. The window goes through each frame in the sequence S_k, yielding the distance set D_(j,k) = {d_n^m}, as shown in Figure 3.

Fig. 3: Distance calculation for one ROI sequence in a video clip.

The distance values are then normalized over the entire video to avoid the influence of the different movement magnitudes of different videos. Hence, the feature of frame n for ROI sequence R_j can be represented as F_n = (d̂_n^1, …, d̂_n^(M−1)), where d̂_n^m = d_n^m / α is the normalized distance value, α is the normalization coefficient and M is the window length in frames. A more detailed derivation can be found in [21]. The feature for one ROI sequence of the entire long video is the concatenation of the features of all the separated sequences.
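A minimal sketch of the per-ROI computation, assuming grayscale ROI crops; PCA is done via SVD, and normalizing by the sequence-wide maximum is a simplification of the normalization coefficient described in [21]:

```python
import numpy as np

def ltp_features(roi_frames, me_win):
    """LTP extraction for one ROI sequence (shapes are assumptions).

    roi_frames: (n_frames, h, w) grayscale ROI crops of one short sequence.
    me_win:     sliding-window length in frames (~300 ms of ME).
    """
    n = roi_frames.shape[0]
    X = roi_frames.reshape(n, -1).astype(float)  # one row of pixels per frame
    X -= X.mean(axis=0)                          # center before PCA
    # PCA on the temporal axis: project every frame on the first two components
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    pts = X @ vt[:2].T                           # (n, 2): the points P_n
    # distance of each in-window frame to the window's first frame
    feats = np.array([np.linalg.norm(pts[s:s + me_win] - pts[s], axis=1)
                      for s in range(n - me_win + 1)])
    return feats / (feats.max() + 1e-8)          # normalized distance patterns
```

Each row of the result is one candidate LTP; rows whose shape matches the local ME movement pattern are what the classifier of the next subsection must recognize.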

II-B3 Local Classification

As presented in the above paragraph, one video yields 12 feature ensembles from the 12 ROIs. Li et al. showed in [21] that the LTP patterns are similar across all chosen ROIs for all kinds of ME. The patterns that represent local ME movements can be recognized by a local classification. A supervised SVM classifier is employed with Leave-One-Subject-Out (LOSO) cross validation. The feature selection and label annotation are presented in [21].
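The LOSO protocol can be sketched as follows; to keep the sketch dependency-free, a nearest-class-centroid rule stands in for the SVM used in the paper, and all names are illustrative:

```python
import numpy as np

def loso_predict(X, y, subjects):
    """Leave-One-Subject-Out cross validation: every subject is held out
    in turn and classified by a model trained on all other subjects.
    (A nearest-centroid rule replaces the paper's SVM in this sketch.)
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    subjects = np.asarray(subjects)
    preds = np.zeros_like(y)
    for s in np.unique(subjects):
        test = subjects == s                     # hold out one whole subject
        train = ~test
        centroids = {c: X[train & (y == c)].mean(axis=0)
                     for c in np.unique(y[train])}
        for i in np.where(test)[0]:
            preds[i] = min(centroids,
                           key=lambda c: np.linalg.norm(X[i] - centroids[c]))
    return preds
```

Grouping by subject rather than by sample is what prevents a subject's own expressions from leaking into their training folds.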
II-B4 Global Fusion

After the LTPs that fit the local ME movement pattern are recognized, a global fusion is performed to eliminate the false positives caused by other movements and the false negatives caused by our recognition process. As introduced in [21], there are three steps: a local qualification, a spatial fusion and a merge process.

III Baseline Results

III-A Performance Metrics

There are three evaluation methods used to compare the performance of the spotting tasks:

1. True positive definition in one video. Supposing there are m micro-expressions in the video and s intervals are spotted, a spotted interval W_spotted is considered a true positive (TP) if it fits the following condition:

|W_spotted ∩ W_groundTruth| / |W_spotted ∪ W_groundTruth| ≥ k

where k is set to 0.5 and W_groundTruth represents the micro-expression interval (onset to offset). Otherwise, the spotted interval is regarded as a false positive (FP).
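The overlap condition above (intersection over union of the spotted and ground-truth frame intervals, with k = 0.5) translates directly into code; the function name and the inclusive-interval convention are assumptions:

```python
def is_true_positive(spotted, ground_truth, k=0.5):
    """Return True when the spotted (onset, offset) interval overlaps the
    ground-truth interval with intersection-over-union of at least k.
    Intervals are inclusive frame-index pairs.
    """
    on_s, off_s = spotted
    on_g, off_g = ground_truth
    inter = max(0, min(off_s, off_g) - max(on_s, on_g) + 1)
    union = (off_s - on_s + 1) + (off_g - on_g + 1) - inter
    return inter / union >= k
```

For example, a spotted interval (10, 19) against ground truth (12, 21) overlaps on 8 of 12 frames (IoU ≈ 0.67), so it counts as a TP; against ground truth (15, 30) the IoU drops to 5/21 and it becomes an FP.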

2. Result evaluation in one video. Supposing the number of TPs in one video is a (with a ≤ m and a ≤ s), then FP = s − a and false negatives (FN) = m − a, and the metrics are defined as:

Recall = a / m,  Precision = a / s,  F1-score = 2a / (m + s)
In practice, these metrics might not be suitable for some videos, since the following situations can occur for a single video:

  • The test video does not contain micro-expression sequences; thus m = 0, and the denominator of the recall will be zero.

  • The spotting method does not spot any intervals; the denominator of the precision will be zero since s = 0.

  • Suppose two spotting methods: Method 1 spots p intervals and Method 2 spots q intervals, with p < q. If the number of TPs is 0 for both methods, the metric values (recall, precision and F1-score) are all zero. However, Method 1 in fact performs better than Method 2, since it produces fewer false positives.

Considering these situations, we propose that, for each video, the result be recorded in terms of TP, FP and FN. For performance comparison, the other metrics are then computed once over the entire database.

3. Evaluation for the entire database. Supposing that the entire database contains V videos and M micro-expression sequences, and that the method spots S intervals in total, the database can be considered as one long video; the metrics for the entire database are then calculated as:

Recall = TP_total / M,  Precision = TP_total / S,  F1-score = 2 × TP_total / (M + S)

where TP_total is the total number of true positives over all videos.
The final results of different methods are evaluated by F1-score, since it considers both recall and precision.
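As a sanity check, the database-level metrics can be recomputed from per-video TP/FP counts; the helper below is an illustrative sketch (name and signature are my own), applied to the SAMM figures of Table III:

```python
def database_metrics(results, n_me_total):
    """Aggregate per-video (tp, fp) counts into database-level metrics,
    treating the whole database as one long video.

    results:    list of (tp, fp) pairs, one per video.
    n_me_total: total number of ground-truth ME intervals (M).
    Returns (precision, recall, f1) with f1 = 2*TP / (M + S).
    """
    tp = sum(t for t, _ in results)
    spotted = sum(t + f for t, f in results)   # S: total spotted intervals
    precision = tp / spotted if spotted else 0.0
    recall = tp / n_me_total if n_me_total else 0.0
    f1 = 2 * tp / (n_me_total + spotted) if (n_me_total + spotted) else 0.0
    return precision, recall, f1
```

With the SAMM counts of Table III (TP = 34, FP = 1958, and 34 + 125 = 159 ground-truth intervals), this returns precision ≈ 0.0171, recall ≈ 0.2138 and F1 ≈ 0.0316, matching the reported row.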

III-B Results and discussion

As introduced in Section II, SAMM and CAS(ME)² have different FPS and resolutions. Hence, the sliding-window lengths and ROI sizes differ. Table II lists the experimental parameters for the two databases.

Database FPS ME window (frames) ROI size (pixels)
SAMM 200 60 15
CAS(ME)² 30 9 10
TABLE II: Parameter configuration.

After performing the LTP-ML method on the two databases, the spotting results for the whole database are listed in Table III. For the CAS(ME)² database, there are 97 videos, but only 32 videos contain micro-expressions. Thus, results are given under two conditions: one considering only the 32 videos that contain MEs (CAS(ME)²-ME), the other treating all 97 videos as the entire dataset. Since the raw videos of the SAMM database are too large to download (700 GB), only 79 videos (full frame: 270 GB; cropped face: 11 GB) were provided for the challenge. In this work, we used the cropped videos provided by the authors, obtained with the method in [25]; the spotting process is performed on the cropped-face version.

The F1-scores for SAMM and CAS(ME)² are 0.0316 and 0.0179, respectively. The values are low because of the large number of FPs. Both datasets contain many irrelevant facial movements; CAS(ME)² in particular also contains macro-expression samples. The ability to distinguish MEs from other movements still needs to be enhanced.

Database SAMM CAS(ME)²-ME CAS(ME)²
nb_vid 79 32 97
TP 34 16 16
FP 1958 1711 5742
FN 125 41 41
Precision 0.0171 0.0093 0.0028
Recall 0.2138 0.2807 0.2807
F1-score 0.0316 0.0179 0.0055
TABLE III: Baseline results for micro-expression spotting. CAS(ME)²-ME denotes the subset of CAS(ME)² videos that all contain ME sequences.

IV Conclusion

This paper addresses the challenge of spotting MEs in long video sequences using two recent databases, SAMM and CAS(ME)². We proposed LTP-ML for spotting MEs and provided a set of performance metrics as a guideline for evaluating ME spotting results. The baseline results on these two databases are reported. Whilst the method was able to produce a reasonable number of TPs, a huge challenge still lies ahead due to the large number of FPs. Further research will focus on enhancing the ability to distinguish MEs from other facial movements in order to reduce FPs.


  • [1] P. Ekman and W. V. Friesen, “Nonverbal leakage and clues to deception,” Psychiatry, vol. 32, no. 1, p. 88–106, 1969.
  • [2] P. Ekman, “Lie catching and microexpressions,” The philosophy of deception, p. 118–133, 2009.
  • [3] J. Endres and A. Laidlaw, “Micro-expression recognition training in medical students: a pilot study,” BMC medical education, vol. 9, no. 1, p. 47, 2009.
  • [4] M.-H. Chiu, H. L. Liaw, Y.-R. Yu, and C.-C. Chou, “Facial micro-expression states as an indicator for conceptual change in students’ understanding of air pressure and boiling points,” British Journal of Educational Technology.
  • [5] P. A. Stewart, B. M. Waller, and J. N. Schubert, “Presidential speechmaking style: Emotional response to micro-expressions of facial affect,” Motivation and Emotion, vol. 33, no. 2, p. 125, 2009.
  • [6] M. H. Yap, J. See, X. Hong, and S.-J. Wang, “Facial micro-expressions grand challenge 2018 summary,” in Automatic Face & Gesture Recognition (FG 2018), 2018 13th IEEE International Conference on.   IEEE, 2018, pp. 675–678.
  • [7] “The second facial micro-expressions grand challenge,” 2018.
  • [8] A. Moilanen, G. Zhao, and M. Pietikäinen, “Spotting rapid facial movements from videos using appearance-based feature difference analysis,” in Pattern Recognition (ICPR), 2014 22nd International Conference on.   IEEE, 2014, p. 1722–1727.
  • [9] X. Li, X. Hong, A. Moilanen, X. Huang, T. Pfister, G. Zhao, and M. Pietikäinen, “Towards reading hidden emotions: A comparative study of spontaneous micro-expression spotting and recognition methods,” IEEE Transactions on Affective Computing, 2017.
  • [10] A. Davison, W. Merghani, C. Lansley, C.-C. Ng, and M. H. Yap, “Objective micro-facial movement detection using FACS-based regions and baseline evaluation,” in Automatic Face & Gesture Recognition (FG 2018), 2018 13th IEEE International Conference on.   IEEE, 2018, pp. 642–649.
  • [11] W.-J. Yan, X. Li, S.-J. Wang, G. Zhao, Y.-J. Liu, Y.-H. Chen, and X. Fu, “CASME II: An improved spontaneous micro-expression database and the baseline evaluation,” PloS one, vol. 9, no. 1, p. e86041, 2014.
  • [12] S.-T. Liong, J. See, K. Wong, A. C. Le Ngo, Y.-H. Oh, and R. Phan, “Automatic apex frame spotting in micro-expression database,” in Pattern Recognition (ACPR), 2015 3rd IAPR Asian Conference on.   IEEE, 2015, p. 665–669.
  • [13] D. Patel, G. Zhao, and M. Pietikäinen, “Spatiotemporal integration of optical flow vectors for micro-expression detection,” in International Conference on Advanced Concepts for Intelligent Vision Systems.   Springer, 2015, p. 369–380.
  • [14] S.-T. Liong, J. See, R. C.-W. Phan, Y.-H. Oh, A. C. Le Ngo, K. Wong, and S.-W. Tan, “Spontaneous subtle expression detection and recognition based on facial strain,” Signal Processing: Image Communication, vol. 47, p. 170–182, 2016.
  • [15] X. Li, J. Yu, and S. Zhan, “Spontaneous facial micro-expression detection based on deep learning,” in Signal Processing (ICSP), 2016 IEEE 13th International Conference on.   IEEE, 2016, p. 1130–1134.
  • [16] S.-J. Wang, S. Wu, and X. Fu, “A main directional maximal difference analysis for spotting micro-expressions,” in Asian Conference on Computer Vision.   Springer, 2016, p. 449–461.
  • [17] H. Ma, G. An, S. Wu, and F. Yang, “A region histogram of oriented optical flow (rhoof) feature for apex frame spotting in micro-expression,” in Intelligent Signal Processing and Communication Systems (ISPACS), 2017 International Symposium on.   IEEE, 2017, pp. 281–286.
  • [18] H. Lu, K. Kpalma, and J. Ronsin, “Micro-expression detection using integral projections,” 2017.
  • [19] C. Duque, O. Alata, R. Emonet, A.-C. Legrand, and H. Konik, “Micro-expression spotting using the Riesz pyramid,” in WACV 2018, 2018.
  • [20] Y. Li, X. Huang, and G. Zhao, “Can micro-expression be recognized based on single apex frame?” in 2018 25th IEEE International Conference on Image Processing (ICIP).   IEEE, 2018, pp. 3094–3098.
  • [21] J. Li, C. Soladié, and R. Séguier, “LTP-ML: Micro-expression detection by recognition of local temporal pattern of facial movements,” in Automatic Face & Gesture Recognition (FG 2018), 2018 13th IEEE International Conference on.   IEEE, 2018, pp. 634–641.
  • [22] A. K. Davison, C. Lansley, N. Costen, K. Tan, and M. H. Yap, “SAMM: A spontaneous micro-facial movement dataset,” IEEE Transactions on Affective Computing, vol. 9, no. 1, pp. 116–129, 2018.
  • [23] F. Qu, S.-J. Wang, W.-J. Yan, H. Li, S. Wu, and X. Fu, “CAS(ME)²: a database for spontaneous macro-expression and micro-expression spotting and recognition,” IEEE Transactions on Affective Computing, 2017.
  • [24] Dynamixyz, “Genfacetracker.”
  • [25] A. K. Davison, M. H. Yap, and C. Lansley, “Micro-facial movement detection using individualised baselines and histogram-based descriptors,” in 2015 IEEE International Conference on Systems, Man, and Cybernetics.   IEEE, 2015, p. 1864–1869.