Extended Local Binary Patterns for Efficient and Robust Spontaneous Facial Micro-Expression Recognition

07/22/2019 ∙ by Chengyu Guo, et al.

Facial Micro-Expressions (MEs) are spontaneous, involuntary facial movements that occur when a person experiences an emotion but deliberately or unconsciously attempts to conceal it. Recently, ME recognition has attracted increasing attention due to its potential applications such as clinical diagnosis, business negotiation, interrogations and security. However, it is expensive to build large scale ME datasets, mainly due to the difficulty of naturally inducing spontaneous MEs. This limits the application of deep learning techniques, which require large amounts of training data. In this paper, we propose a simple, efficient yet robust descriptor called Extended Local Binary Patterns on Three Orthogonal Planes (ELBPTOP) for ME recognition. ELBPTOP consists of three complementary binary descriptors: LBPTOP and two novel ones, Radial Difference LBPTOP (RDLBPTOP) and Angular Difference LBPTOP (ADLBPTOP), which explore the local second order information along the radial and angular directions contained in ME video sequences. ELBPTOP is a novel ME descriptor inspired by the unique and subtle facial movements. It is computationally efficient and only marginally increases the cost of computing LBPTOP, yet is extremely effective for ME recognition. In addition, by introducing Whitened Principal Component Analysis (WPCA) to ME recognition for the first time, we obtain more compact and discriminative feature representations and achieve significant computational savings. Extensive experimental evaluation on three popular spontaneous ME datasets, SMIC, CASME II and SAMM, shows that the proposed ELBPTOP approach significantly outperforms the previous state of the art on all three datasets. ELBPTOP achieves 73.94% on CASME II, which is 6.6% higher than the state of the art on this dataset. More impressively, ELBPTOP increases recognition accuracy from 44.7% to 63.44% on the SAMM dataset.




I Introduction

Facial Micro-Expressions (MEs) are spontaneous, involuntary facial movements that occur when a person experiences an emotion but deliberately or unconsciously attempts to conceal it [1, 2, 3]. MEs are more likely to occur in high-risk environments, where showing true emotions carries greater risk [4]. Recently, automatic facial ME analysis has attracted increasing attention from affective computing researchers and psychologists because of its potential applications such as clinical diagnosis, business negotiation, interrogations, and security [5, 6]. The study of facial MEs is a well-established field in psychology; from the computer vision perspective, however, it is a relatively new area with many unsolved and challenging problems [7, 8]. There are three main challenges in automatic ME analysis.

(1) MEs have very short duration and involve local, subtle facial movements. Compared to ordinary facial expressions, the duration of an ME is very short, typically no more than 500 ms [9]. Besides short duration, MEs also have other unique characteristics such as local and subtle facial movements [10]. Because of these characteristics, it is very difficult for human beings to recognize MEs.

(2) Lack of large scale spontaneous ME datasets. Datasets have played a key role in visual recognition problems, especially in the era of deep learning which requires large scale datasets for training [11]. ME analysis is not an exception. However, another challenging issue faced by automatic facial ME analysis is the lack of benchmark datasets (especially large scale ME datasets) due to the difficulties in inducing spontaneous MEs [1, 7]. Existing popular spontaneous ME datasets like SMIC[1], CASME II[2], and SAMM[3] are small. Besides, the emotion categories of the collected samples in these datasets are unevenly distributed, because some emotions are easier to elicit hence they have more samples.

(3) Lack of efficient and discriminative feature representations. The above challenges make ME analysis much harder and more demanding than ordinary facial analysis tasks. Therefore, the extraction of efficient and discriminative feature representations becomes especially important for automatic ME analysis.

In automatic ME analysis, there are mainly two tasks: ME spotting and ME recognition. The former refers to the problem of automatically and accurately locating the temporal interval of a micro-movement in a video sequence; while the latter is to classify the ME in the video into one of the predefined emotion categories (such as Happiness, Sadness, Surprise, Disgust, etc). ME recognition is the focus of this paper.

Like ordinary facial expression recognition, ME recognition consists of three steps: preprocessing, feature representation and classification [7]. As we discussed previously, the development of powerful feature representations plays a very important role in ME recognition, and thus has been one main focus of research [12]. Representative feature representation approaches for ME recognition are mainly based on Local Binary Patterns (LBP)  [13, 12], Local Phase Quantization (LPQ)  [14], Histogram of Oriented Gradients (HOG)  [15] and Optical Flow (OF) [16].

Despite these efforts, there is still significant room for improvement. The small scale of existing ME datasets and the imbalanced distribution of samples are the primary obstacles to applying data-hungry deep convolutional neural networks, which have brought significant breakthroughs in various visual recognition problems thanks to their ability to learn powerful feature representations directly from raw data. Therefore, state of the art methods for ME recognition are still dominated by traditional handcrafted features like Local Binary Patterns on Three Orthogonal Planes (LBPTOP) [17], 3D Gradient Oriented Histogram (HOG 3D) [18] and Histograms of Oriented Optical Flow (HOOF) [19].

Due to its prominent advantages such as theoretical simplicity, computational efficiency, and robustness to monotonic grey scale changes, the texture descriptor LBP [20] has emerged as one of the most prominent features for face recognition [21]. Its 3D extension LBPTOP [17] is widely used for facial expression and ME recognition [22]. Many variants of LBP have been proposed to improve robustness and discriminative power, as summarized in recent surveys [23, 24]. However, most LBP variants [25, 26] have not been explored for ME recognition. In other words, in contrast to LBP-based face recognition, LBPTOP-type ME recognition is surprisingly underexplored. Moreover, current state-of-the-art ME features like LBPTOP and its variants LBP-SIP [27], LBP-MOP [28], STLBP-IP [29], and STRBP [30] suffer from drawbacks such as the limited representation power of a single type of binary feature, limited robustness, and increased computational complexity.

In this paper, in order to build more discriminative features that inherit the advantages of LBP-type features without suffering from the main shortcoming of using filters as complementary features [20] (i.e., the expensive computational cost), we propose a novel binary feature descriptor named Extended Local Binary Patterns on Three Orthogonal Planes (ELBPTOP) for ME recognition. ELBPTOP is a descriptor that, we argue, nicely balances three concerns: high distinctiveness, good robustness and low computational cost. In addition, LBPTOP can be considered a special case of the proposed ELBPTOP descriptor. The contributions of this paper are summarized as follows.

  • Inspired by the unique texture information of human faces and the subtle intensity variations of local facial movements, the novel ELBPTOP encodes not only the first order information, i.e. the pixel difference information between a central pixel and its neighbours (called Center Pixel Difference Vector, CPDV), but also the second order discriminative information in two directions: the radial direction (Radial Pixel Difference Vector, RPDV) and the angular direction (Angular Pixel Difference Vector, APDV). The corresponding descriptors are named RDLBPTOP and ADLBPTOP respectively. The proposed ELBPTOP is more effective at capturing local, subtle intensity changes and thus delivers stronger discriminative power.

  • To achieve our goal of being computationally efficient while preserving distinctiveness, we apply Whitened Principal Component Analysis (WPCA) to obtain a more compact, robust, and discriminative global descriptor. WPCA has already proven effective in face recognition. However, to the best of our knowledge, we are the first to apply WPCA to ME recognition, which has its own unique challenges compared to the extensively studied face recognition problem.

  • We provide extensive experimental evaluation on three popular spontaneous ME datasets CASME II, SMIC, and SAMM to test the effectiveness of the proposed approach, and find that our proposed ELBPTOP approach significantly outperforms previous state of the art on all three evaluated datasets. Our proposed ELBPTOP achieves 73.94% on CASMEII, which is 6.6% higher than state of the art on this dataset. More impressively, ELBPTOP increases recognition accuracy from 44.7% to 63.44% on the SAMM dataset.

Although our method is simple and handcrafted, the strong results obtained on three popular ME datasets, together with its low computational complexity, demonstrate the effectiveness and efficiency of our approach for ME recognition.

The remainder of the paper is organized as follows. Section II reviews related work in micro-expression recognition and gives a brief outline of LBP and LBPTOP. The main model is presented in detail in Section III, including the proposed ADLBPTOP and RDLBPTOP descriptors and our ME recognition scheme. Experimental results are presented in Section IV, leading to conclusions in Section V.

II Related works

Feature representation approaches of ME recognition can be divided into two distinct categories: geometric-based and appearance-based [31] methods. Specifically, geometric-based features describe the face geometry such as the shapes and locations of facial landmarks, so they need precise landmarking and alignment procedures. By contrast, appearance-based features describe intensity and textural information such as wrinkles and shading changes, and they are more robust to illumination changes and alignment error. Thus, appearance-based feature representation methods, including LBPTOP [17], HOG 3D [18], HOOF [19] and deep learning, have been more popular in ME recognition [7].

LBPTOP variants: Since the pioneering work by Pfister et al. [6], LBPTOP has emerged as the most popular approach for spontaneous ME analysis, and quite a few variants have been proposed. LBP Six Intersection Points (LBP-SIP) [27] is based on three intersecting lines crossing over the center point. LBP Mean Orthogonal Planes (LBP-MOP) [28] first computes an average plane for each of the three orthogonal planes, and then computes LBP on the three orthogonal average planes. By reducing redundant information, LBP-SIP and LBP-MOP achieved better performance. [32] explores two effective binary face descriptors, Hot Wheel Patterns [32] and Dual-Cross Patterns [33], and makes use of abundant labelled micro-expressions. Besides computing the sign of pixel differences, Spatio-Temporal Completed Local Quantized Patterns (STCLQP) [34] also exploits the complementary components of magnitudes and orientations. Decorrelated Local Spatiotemporal Directional Features (DLSTD) [35] uses Robust Principal Component Analysis (RPCA) [36] to extract subtle emotion information and divides the face into 16 Regions of Interest (ROIs) to utilize the Action Unit (AU) information. Spatio-Temporal Local Radon Binary Pattern (STRBP) [30] uses the Radon Transform to obtain robust shape features, while Spatiotemporal Local Binary Pattern with Integral Projection (STLBP-IP) [29] turns to integral projections to preserve shape attributes.

HOOF variants: Histograms of Oriented Optical Flow (HOOF) [19] is one of the baseline methods that make use of optical flow in ME recognition. Facial Dynamics Map (FDM) [37] describes local facial dynamics by extracting the principal OF direction of each cuboid. Similarly, [38] designs Main Directional Mean Optical Flow (MDMO) features that utilize the AU information by partitioning the facial area into 36 ROIs. Different from these methods, Consistent Optical Flow Maps [39] estimates consistent OF to characterize facial movements; the flow is calculated from 25 ROIs, and the OF of each ROI can point in multiple directions. Recently, Bi-Weighted Oriented Optical Flow (BI-WOOF) [40] makes use of only the apex frame and the onset frame. The majority of OF-based methods need to partition the face area precisely to make use of AU information. This improves performance but increases the complexity of preprocessing.

HOG 3D variants: HOG 3D [18] was first used to recognize posed MEs and later served as a baseline on spontaneous MEs. Its variant, the Histogram of Image gradient Orientation (HIGO) [41], ignores the magnitude weighting and can hence suppress the influence of illumination. This makes HIGO one of the most accurate descriptors at present. However, it is worth noting that HOG is an edge-based gradient descriptor. It is sensitive to noise when the input is not filtered, and the use of low pass filters can lead to the loss of subtle motion information in ME recognition. Besides, the computation is time-consuming and cumbersome, resulting in slow speed.

Deep learning methods: [42] adopts a shallow network with Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM). Other neural networks are explored in Dual Temporal Scale Convolutional Neural Network (DTSCNN) [43], 3D Flow Convolutional Neural Network (3DFCNN) [44] and Micro-Expression Recognition algorithm using Recurrent CNNs (MER-RCNN) [45]. These methods achieve some improvements in ME recognition, but they are still significantly below state of the art handcrafted features, mainly due to the lack of large scale ME data.


LBP was first proposed in [13], and a completed version was developed in [12]. It was later introduced to face recognition in [21], and its 3D extension LBPTOP was proposed in [17] with application to facial expression analysis.

Fig. 1: (a) LBP pattern: the sampled neighborhood consists of the center pixel $x_c$ and $p$ equally spaced pixels on a circle of radius $r$. The binary code is calculated by thresholding the differences between the neighbors and the center pixel. An example is given in the figure. (b) The process of LBPTOP.

LBP characterizes the spatial structure of a local image patch through $p$ pixels $\{x_{r,p,n}\}_{n=0}^{p-1}$ that are evenly distributed in angle on a circle of radius $r$ centered at pixel $x_c$. Specifically, as shown in Figure 1(a), for a central pixel $x_c$ and its $p$ equally spaced neighbors on the circle of radius $r$, the LBP pattern is computed via:

$$LBP_{r,p}(x_c) = \sum_{n=0}^{p-1} s(x_{r,p,n} - x_c)\,2^n, \qquad s(x) = \begin{cases} 1, & x \geq 0 \\ 0, & x < 0 \end{cases}$$

where $s(\cdot)$ is the sign function. The gray values of points that do not fall exactly in the center of pixels are estimated by interpolation. The decimal value of the LBP pattern is given by the binary sequence of the circular neighborhood, as in the example of Figure 1(a). LBP is gray scale invariant and is able to encode important local patterns like lines, edges and blobs because it measures the differences between the center pixel and its neighbors.

Given an N*M texture image, an LBP pattern can be computed at each pixel, so that the textured image can be characterized by the distribution of LBP values, i.e. the whole image is represented by an LBP histogram vector. By altering $r$ and $p$, one can compute LBP features for any spatial resolution and for any quantization of the angular space.
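The LBP operator and its histogram can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: the function names (`ring_samples`, `lbp`, `lbp_histogram`), the use of NumPy, the bilinear interpolation scheme, the rounding to six decimals and the convention that bit $n$ corresponds to the angle $2\pi n/p$ are all assumptions made for this example.

```python
import numpy as np

def ring_samples(img, radius, p, margin):
    """Bilinearly interpolated samples of p points on a circle of the given
    radius around every pixel that is at least `margin` pixels from the border.
    Returns an array of shape (p, H - 2*margin, W - 2*margin)."""
    h, w = img.shape
    ys, xs = np.mgrid[margin:h - margin, margin:w - margin].astype(float)
    out = []
    for n in range(p):
        a = 2.0 * np.pi * n / p
        # Offsets are rounded so that axis-aligned neighbors hit pixel centers exactly.
        y = ys - np.round(radius * np.sin(a), 6)
        x = xs + np.round(radius * np.cos(a), 6)
        y0 = np.clip(np.floor(y).astype(int), 0, h - 2)
        x0 = np.clip(np.floor(x).astype(int), 0, w - 2)
        fy, fx = y - y0, x - x0
        val = ((1 - fy) * (1 - fx) * img[y0, x0] + (1 - fy) * fx * img[y0, x0 + 1]
               + fy * (1 - fx) * img[y0 + 1, x0] + fy * fx * img[y0 + 1, x0 + 1])
        # Rounding keeps ties (e.g. flat regions) stable under floating point noise.
        out.append(np.round(val, 6))
    return np.stack(out)

def lbp(img, r=1, p=8):
    """LBP code of every interior pixel: threshold ring samples against the center."""
    img = np.asarray(img, dtype=float)
    m = int(np.ceil(r))
    center = img[m:img.shape[0] - m, m:img.shape[1] - m]
    bits = ring_samples(img, r, p, m) >= center
    weights = (2 ** np.arange(p)).reshape(p, 1, 1)
    return (bits * weights).sum(axis=0)

def lbp_histogram(img, r=1, p=8):
    """Normalized histogram over all 2**p patterns (the image-level descriptor)."""
    hist = np.bincount(lbp(img, r, p).ravel(), minlength=2 ** p).astype(float)
    return hist / hist.sum()
```

Under the bit ordering assumed here, the center of the 3 × 3 patch `[[1, 2, 3], [4, 5, 6], [7, 8, 9]]` receives the code 225, because only the right, lower-left, lower and lower-right neighbors are at least as large as the center value 5.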

LBPTOP [17] is the 3D extension of LBP by extracting LBP patterns separately from three orthogonal planes: the spatial plane (XY) similar to the regular LBP, the vertical spatiotemporal plane (YT) and the horizontal spatiotemporal plane (XT), as illustrated in Figure 1(b).

Clearly, LBPTOP encodes both appearance and temporal changes. A video can be represented by concatenating the LBP histograms of the three orthogonal planes (TOP). Although slightly more complex than static LBP, LBPTOP can achieve real time processing speed, depending on the size of the local sampling neighborhood. The dimensionality of LBPTOP is higher than that of LBP. Since the introduction of LBPTOP, extracting features from TOP has become a popular way of extending 2D spatial appearance descriptors to the spatiotemporal domain.
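The three-plane idea can be illustrated with a short Python sketch. It is a simplification under stated assumptions, not the reference LBPTOP implementation: the video is assumed to be a NumPy array of shape (T, H, W), the same integer radius is used on all three planes, ring positions are rounded to the nearest pixel instead of being interpolated, and codes are accumulated over every slice of each plane. The names `lbp_codes` and `lbptop_histogram` are hypothetical.

```python
import numpy as np

def lbp_codes(plane, r=1, p=8):
    """LBP codes of a 2D slice using nearest-pixel ring sampling (integer r)."""
    plane = np.asarray(plane, dtype=np.int64)
    h, w = plane.shape
    center = plane[r:h - r, r:w - r]
    code = np.zeros_like(center)
    for n in range(p):
        a = 2.0 * np.pi * n / p
        dy = int(np.rint(-r * np.sin(a)))
        dx = int(np.rint(r * np.cos(a)))
        neighbor = plane[r + dy:h - r + dy, r + dx:w - r + dx]
        code += (neighbor >= center).astype(np.int64) << n
    return code

def lbptop_histogram(video, r=1, p=8):
    """Concatenate the normalized LBP histograms of the XY, XT and YT planes."""
    video = np.asarray(video)
    t, h, w = video.shape
    planes = {
        "XY": [video[k] for k in range(t)],        # spatial appearance
        "XT": [video[:, y, :] for y in range(h)],  # horizontal motion
        "YT": [video[:, :, x] for x in range(w)],  # vertical motion
    }
    hists = []
    for name in ("XY", "XT", "YT"):
        counts = np.zeros(2 ** p)
        for plane in planes[name]:
            counts += np.bincount(lbp_codes(plane, r, p).ravel(), minlength=2 ** p)
        hists.append(counts / counts.sum())
    return np.concatenate(hists)
```

With $p = 8$ the resulting vector has $3 \times 256 = 768$ entries, which shows why the dimensionality of LBPTOP is three times that of a single LBP histogram.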

III Proposed Approach

In this section, we first introduce the proposed novel binary descriptor ELBPTOP and then present how to use it for ME recognition.

III-A ELBPTOP

Fig. 2: (a) Local circularly symmetric neighbor sampling of ELBP. Two circles of neighbor points surround the central pixel $x_c$. The radius of the inner circle is $r_1$, and the radius of the outer circle is $r$. (b) An illustration of the process to calculate the ELBP pattern.

LBPTOP has emerged as one of the dominant descriptors for ME recognition. Despite this fact, it has several limitations.

  • Currently, LBPTOP [17] usually exploits only the uniform patterns for ME representation. This results in information loss, since the proportion of uniform patterns may be too small to capture the variations.

  • It encodes the difference between each pixel and its neighboring pixels only. It is common to combine complementary features like Gabor filters to improve discriminative power. However, this brings significant computational burden.

  • A large sampling size is helpful since it encodes more local information and provides better representation power. However, increasing the number of sampling points of LBPTOP increases its feature dimensionality significantly.

The above analysis leads us to propose novel binary descriptors that are not meant to compete with LBPTOP, but to complement it and extend the set of binary feature candidates.

We propose to explore the second order discriminative information in two directions of a local patch: the radial differences (RDLBPTOP) and the angular differences (ADLBPTOP), as complement to the differences between a pixel and its neighbors (LBPTOP). The proposed RDLBPTOP and ADLBPTOP preserve the advantages of LBP, such as computational efficiency and gray scale invariance.

(1) Radial Difference Local Binary Pattern (RDLBP) As described in Section II, LBP is computed by thresholding the neighboring pixel values on a ring against the center pixel value. It therefore only encodes the relationship between the center pixel and its neighbors on a single ring (i.e. a single scale), and fails to capture the second order information between neighboring pixels on different rings (different scales). For every pixel $x_c$ in the image, we consider two rings of radii $r$ and $r_1$ ($r > r_1$) centered on the pixel, with $p$ pixels distributed evenly on each ring, as shown in Figure 2. To produce the RDLBP codes, we first compute the radial differences between corresponding pixels on the two rings and then threshold them against 0. The formal definition of the RDLBP code is as follows:

$$RDLBP_{r,r_1,p}(x_c) = \sum_{n=0}^{p-1} s(x_{r,p,n} - x_{r_1,p,n})\,2^n,$$

where $s(\cdot)$ is the sign function, $x_{r,p,n}$ is the $n$-th of the $p$ samples on the ring of radius $r$, and $r$ and $r_1$ denote the outer ring and the inner ring respectively. As can be seen from Figure 3, the LBP values of two different pixels can be the same in some cases, whereas their RDLBP values differ. This is because RDLBP encodes radial pixel difference information.
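The radial difference coding can be sketched in Python as follows. The sketch assumes that differences are taken as outer ring minus inner ring and that bit $n$ corresponds to the angle $2\pi n/p$; the names `ring_samples` and `rdlbp`, the parameter names `r_outer` and `r_inner`, and the bilinear interpolation with rounding are choices made for this example rather than details confirmed by the paper.

```python
import numpy as np

def ring_samples(img, radius, p, margin):
    """Bilinearly interpolated samples of p points on a circle of the given
    radius; output shape is (p, H - 2*margin, W - 2*margin)."""
    h, w = img.shape
    ys, xs = np.mgrid[margin:h - margin, margin:w - margin].astype(float)
    out = []
    for n in range(p):
        a = 2.0 * np.pi * n / p
        y = ys - np.round(radius * np.sin(a), 6)
        x = xs + np.round(radius * np.cos(a), 6)
        y0 = np.clip(np.floor(y).astype(int), 0, h - 2)
        x0 = np.clip(np.floor(x).astype(int), 0, w - 2)
        fy, fx = y - y0, x - x0
        val = ((1 - fy) * (1 - fx) * img[y0, x0] + (1 - fy) * fx * img[y0, x0 + 1]
               + fy * (1 - fx) * img[y0 + 1, x0] + fy * fx * img[y0 + 1, x0 + 1])
        out.append(np.round(val, 6))  # stabilizes ties against float noise
    return np.stack(out)

def rdlbp(img, r_outer=2, r_inner=1, p=8):
    """Radial difference code: threshold (outer ring - inner ring) against 0."""
    img = np.asarray(img, dtype=float)
    m = int(np.ceil(r_outer))
    outer = ring_samples(img, r_outer, p, m)
    inner = ring_samples(img, r_inner, p, m)  # same margin keeps shapes aligned
    bits = (outer - inner) >= 0
    weights = (2 ** np.arange(p)).reshape(p, 1, 1)
    return (bits * weights).sum(axis=0)
```

Two properties follow directly from this sketch: with `r_inner=0` the inner ring collapses onto the center pixel and the code reduces to ordinary LBP, and on an intensity bowl that grows away from a pixel every radial difference is positive, so that pixel receives the code 255.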

Fig. 3: The two patterns on the left would be considered equivalent by LBP, although they are in fact quite different from each other. These underlying differences can be revealed via angular and radial differences.
Fig. 4: Proportions of the uniform LBPs (see footnote 1) for the ELBP descriptors (LBP, ADLBP, and RDLBP) on three planes (XY, XT, YT) from the CASME II dataset. The first 9 bins of each histogram are the uniform patterns, and the others are the nonuniform patterns. We can observe that the uniform patterns may not account for the major proportion of overall patterns. This is especially obvious in the case of ADLBP.
Footnote 1: For clear illustration, we transform the "full" pattern into the "rotation invariant (ri)" pattern [12]. Accordingly, the "uniform (u2)" pattern is transformed into the "rotation invariant uniform (riu2)" pattern. The proportion of the "u2" pattern in the "full" pattern is equal to the proportion of the "riu2" pattern in the "ri" pattern, so the transformation has no effect on our conclusion.

(2) Angular Difference Local Binary Pattern (ADLBP) LBP also fails to encode the second order information between pixels on the same ring. ADLBP is therefore composed of comparisons between neighboring pixels in the angular (e.g. clockwise) direction, for all pixels except the center pixel.

Formally, it can be calculated as follows:

$$ADLBP_{r,p}(x_c) = \sum_{n=0}^{p-1} s(x_{r,p,n} - x_{r,p,\mathrm{mod}(n+1,p)})\,2^n,$$

where $s(\cdot)$ is the sign function and $x_{r,p,n}$ is the $n$-th of the $p$ samples on the ring of radius $r$ around $x_c$.

Similarly, Figure 3 shows that ADLBP encodes angular difference information, which is different from what the original LBP descriptor encodes. It is very compact and provides useful information. Both RDLBP and ADLBP are gray scale invariant and computationally efficient. They can also benefit from the rotation invariant, uniform and 3D extensions of LBP.
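A Python sketch of the angular difference coding is given below. The direction of the comparison (each ring sample minus its successor in angular order) and the bit ordering are assumptions of this example, as are the names `ring_samples` and `adlbp` and the bilinear interpolation with rounding.

```python
import numpy as np

def ring_samples(img, radius, p, margin):
    """Bilinearly interpolated samples of p points on a circle of the given
    radius; output shape is (p, H - 2*margin, W - 2*margin)."""
    h, w = img.shape
    ys, xs = np.mgrid[margin:h - margin, margin:w - margin].astype(float)
    out = []
    for n in range(p):
        a = 2.0 * np.pi * n / p
        y = ys - np.round(radius * np.sin(a), 6)
        x = xs + np.round(radius * np.cos(a), 6)
        y0 = np.clip(np.floor(y).astype(int), 0, h - 2)
        x0 = np.clip(np.floor(x).astype(int), 0, w - 2)
        fy, fx = y - y0, x - x0
        val = ((1 - fy) * (1 - fx) * img[y0, x0] + (1 - fy) * fx * img[y0, x0 + 1]
               + fy * (1 - fx) * img[y0 + 1, x0] + fy * fx * img[y0 + 1, x0 + 1])
        out.append(np.round(val, 6))  # stabilizes ties against float noise
    return np.stack(out)

def adlbp(img, r=1, p=8):
    """Angular difference code: compare each ring sample with the next one."""
    img = np.asarray(img, dtype=float)
    m = int(np.ceil(r))
    ring = ring_samples(img, r, p, m)
    nxt = np.roll(ring, -1, axis=0)  # nxt[n] == ring[(n + 1) % p]
    bits = (ring - nxt) >= 0
    weights = (2 ** np.arange(p)).reshape(p, 1, 1)
    return (bits * weights).sum(axis=0)
```

Because the center pixel never enters the comparison, adding a constant to the whole patch leaves the code unchanged, and the code depends only on how intensity varies along the ring.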

(3) Extended LBP (ELBP) We use ELBP to represent the combination of all three binary descriptors: LBP, RDLBP and ADLBP. The three operators LBP, RDLBP and ADLBP can be combined in two ways, jointly or independently. Because the joint way (3D joint histogram) leads to huge dimension, we use the latter way.

For ME recognition, we extend ELBP to ELBPTOP in the same way that LBP is extended to LBPTOP in Figure 1(b). Most LBPTOP based ME descriptors use uniform LBP patterns and group the nonuniform patterns into one bin. However, this leads to considerable information loss because uniform LBPs may not be the majority of LBPs, as illustrated in Figure 4. This is more obvious in the case of ADLBP, where the nonuniform patterns are the dominant patterns. Therefore, in this paper, we use all patterns, rather than uniform patterns only.
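How much histogram mass the uniform patterns carry can be checked with a few lines of Python. The sketch below uses the usual "u2" definition of uniformity (at most two 0/1 transitions in the circular bit string); the helper names `transitions`, `is_uniform` and `uniform_proportion` are hypothetical and the input histogram is assumed to have one bin per full pattern.

```python
def transitions(code, p=8):
    """Number of 0/1 transitions in the circular p-bit representation of code."""
    bits = [(code >> n) & 1 for n in range(p)]
    return sum(bits[n] != bits[(n + 1) % p] for n in range(p))

def is_uniform(code, p=8):
    """A pattern is uniform (u2) if it has at most two circular transitions."""
    return transitions(code, p) <= 2

def uniform_proportion(hist, p=8):
    """Fraction of the total histogram mass that falls on uniform patterns."""
    total = sum(hist)
    uniform_mass = sum(hist[c] for c in range(2 ** p) if is_uniform(c, p))
    return uniform_mass / total

# For p = 8 only 58 of the 256 possible patterns are uniform.
UNIFORM_8 = [c for c in range(256) if is_uniform(c)]
```

If every one of the 256 patterns occurred equally often, the uniform ones would account for only 58/256 of the mass (about 23%), so a uniform-only encoding keeps most of the information only when the code distribution is strongly concentrated on those 58 patterns.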

III-B ELBPTOP for ME recognition

In this section, the ME representation is addressed using our proposed ELBPTOP approach to explicitly handle the encountered challenges.

To enhance the discriminative power, we propose to fuse the information extracted by the three binary descriptors LBPTOP, RDLBPTOP and ADLBPTOP. The ME feature representation algorithm is illustrated in Figure 6(a). For each binary descriptor (LBPTOP, RDLBPTOP or ADLBPTOP), an ME video sequence is represented by the concatenated spatiotemporal histograms of the binary codes. Specifically, the video sequence is divided into blocks, and every block contributes one histogram of $2^p$ bins for each of the three orthogonal planes, so the histogram dimension of a single descriptor is $3 \times 2^p$ times the number of blocks. For instance, with $p = 8$ and 128 blocks in total, the histogram dimension of a single descriptor is $128 \times 3 \times 2^8 = 98304$, the value reported in Table III.

An efficient and effective feature representation scheme is as important for ME recognition as a good local descriptor. For each binary code (LBPTOP, RDLBPTOP or ADLBPTOP), the dimension of the feature representation of an ME video sequence is very high, which places a computational burden on the later classification stage. Therefore, to improve efficiency and preserve distinctiveness, Whitened Principal Component Analysis (WPCA) [46, 47] is applied for dimensionality reduction before feature fusion.

The idea behind WPCA is that discriminative information is equally distributed along all principal components. The whitening transformation is applied to normalize the contribution of each principal component. Specifically, given a feature representation $x \in \mathbb{R}^d$, standard PCA is first used to obtain the projected feature $y = U^T x$, where $U \in \mathbb{R}^{d \times k}$ is the projection matrix with $k$ orthonormal columns. Then, the eigenvectors $u_1, \dots, u_k$, sorted in descending order of the eigenvalues $\lambda_1 \geq \dots \geq \lambda_k$ of the first $k$ principal components, are rescaled to $u_i / \sqrt{\lambda_i}$, so that every projected component has a variance equal to 1.
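A minimal NumPy sketch of WPCA is shown below. It computes the principal directions through an SVD of the centered training matrix, which is one common way to do PCA when the feature dimension far exceeds the number of samples; the function names `wpca_fit` and `wpca_transform` and the choice of the SVD route are assumptions of this example, not the authors' implementation.

```python
import numpy as np

def wpca_fit(X, k):
    """Fit WPCA on X of shape (n_samples, n_features).

    Returns the training mean and a (n_features, k) projection matrix whose
    columns are principal directions divided by the square root of their
    eigenvalue, so that projected training data has unit variance per axis."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    # Rows of Vt are the principal directions, sorted by decreasing singular value.
    _, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
    eigvals = S[:k] ** 2 / (X.shape[0] - 1)  # variances along each direction
    W = Vt[:k].T / np.sqrt(eigvals)          # whitening rescales every column
    return mean, W

def wpca_transform(X, mean, W):
    """Project (and whiten) features with a previously fitted WPCA model."""
    return (np.asarray(X, dtype=float) - mean) @ W
```

After the transform, the covariance of the projected training features is the identity matrix, which is exactly the "equal contribution of each principal component" property described above. Note that `k` should not exceed the rank of the centered data (at most the number of samples minus one), otherwise the division by near-zero eigenvalues amplifies noise.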

In summary, Figure 6(a) gives an overview of the proposed feature extraction framework. First, the video sequences are spatially divided into multiple nonoverlapping subblocks, and from each subblock three histograms are extracted via the three binary codes. Each subblock histogram is normalized to sum to one. Then, for each binary code, the histograms of the different subblocks are concatenated and projected by WPCA for dimensionality reduction. Finally, the three low-dimensional feature vectors from LBPTOP, RDLBPTOP and ADLBPTOP are concatenated into a single vector, which is used as the final ME feature representation.
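The block-wise histogram stage can be sketched as follows. The sketch assumes that the binary codes of one plane are already available as an integer volume of shape (T, H, W) with values in [0, 256), and it uses a hypothetical 2 × 8 × 8 grid (time × height × width), chosen only because its 128 blocks reproduce the dimension 98304 listed in Table III; the function name `block_histograms` is likewise an assumption.

```python
import numpy as np

def block_histograms(codes, grid=(2, 8, 8), n_bins=256):
    """Split a code volume into nonoverlapping blocks and concatenate the
    per-block histograms, each normalized to sum to one."""
    codes = np.asarray(codes)
    feats = []
    for t_block in np.array_split(codes, grid[0], axis=0):
        for y_block in np.array_split(t_block, grid[1], axis=1):
            for block in np.array_split(y_block, grid[2], axis=2):
                hist = np.bincount(block.ravel(), minlength=n_bins).astype(float)
                feats.append(hist / max(hist.sum(), 1.0))
    return np.concatenate(feats)

# Toy example: random codes stand in for the XY, XT and YT code volumes
# of a single descriptor (e.g. LBPTOP) computed on a 10-frame clip.
rng = np.random.default_rng(0)
plane_codes = [rng.integers(0, 256, size=(10, 64, 64)) for _ in range(3)]
feature = np.concatenate([block_histograms(c) for c in plane_codes])
```

In this toy setting `feature` has $128 \times 3 \times 256 = 98304$ entries, which makes the need for a dimensionality reduction step such as WPCA before classification quite tangible.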

Fig. 5: Illustration of the ME classification problem. Sample frames are from CASME II [2].
Fig. 6: Overview of the proposed ME recognition framework.

III-C The ME recognition pipeline

The ME recognition problem is illustrated in Figure 5. The proposed overall pipeline for ME classification is shown in Figure 6(b). Following [41], a raw ME video sequence is processed by the following steps: face alignment, motion magnification, temporal interpolation, feature extraction and classification.

Our main contribution in this work is the feature representation step, which is presented in detail in previous sections. Below, we give a very brief introduction to other involved steps. Readers are referred to [41, 6, 3] for more information.

Face Alignment: For CASME II [2], SMIC [1] datasets, we use the given cropped images so that face alignment is not required. For SAMM [3] dataset, Active Shape Model [48] is used to detect 77 facial landmarks and then all the facial images are normalized using affine transformation and cropped into the same size according to the eye center points and the outermost points.

Motion Magnification: Since the local intensity changes and facial movements in MEs are subtle, effective ME characteristics are difficult to capture. To tackle this issue, following [41, 49], we use Eulerian Video Magnification (EVM) [50] to magnify the subtle motions in videos. EVM considers the time series of intensity values at each spatial location (pixel) and amplifies the variation in a given temporal frequency band of interest. The filtered spatial bands are then amplified by a given magnification factor, added back to the original signal, and collapsed to generate the output video.
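The temporal part of this idea can be illustrated with a deliberately simplified Python sketch. It is not EVM itself: the spatial pyramid decomposition is omitted and an ideal FFT band-pass filter stands in for the second-order band-pass filter used in the experiments. The function name `magnify` and the parameter names `fps`, `f_lo`, `f_hi` and `alpha` are hypothetical.

```python
import numpy as np

def magnify(video, fps, f_lo, f_hi, alpha):
    """Amplify temporal intensity variations inside [f_lo, f_hi] Hz.

    video: array of shape (T, H, W); every pixel is treated as an
    independent time series, filtered, scaled by alpha and added back."""
    video = np.asarray(video, dtype=float)
    n_frames = video.shape[0]
    freqs = np.fft.rfftfreq(n_frames, d=1.0 / fps)
    spectrum = np.fft.rfft(video, axis=0)
    keep = (freqs >= f_lo) & (freqs <= f_hi)
    spectrum[~keep] = 0                      # ideal band-pass in the frequency domain
    band = np.fft.irfft(spectrum, n=n_frames, axis=0)
    return video + alpha * band

# Toy example: a 5 Hz flicker of amplitude 0.1 on a constant background,
# sampled at 100 fps, is amplified tenfold (1 + alpha with alpha = 9).
t = np.arange(100) / 100.0
clip = (10.0 + 0.1 * np.sin(2 * np.pi * 5 * t))[:, None, None] * np.ones((1, 2, 2))
magnified = magnify(clip, fps=100, f_lo=4.0, f_hi=6.0, alpha=9.0)
```

The constant background is untouched because the zero-frequency component lies outside the pass band, while the subtle variation inside the band becomes ten times larger.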

Temporal Interpolation: To address the issue that ME clips are short and have varied duration, we use the Temporal Interpolation Model (TIM)  [51] and the code provided by [6]. The model first seeks a low-dimensional manifold where visual features extracted from the frames of a video can be projected onto a continuous deterministic curve embedded in a path graph. Moreover, it can map arbitrary points on the curve back into the image space, making it suitable for temporal interpolation.


For classification, we use a Linear Support Vector Machine (LSVM) [52] as the classifier. Leave-one-subject-out cross-validation (LOSOCV) is adopted to determine the penalty parameter $C$ of the SVM. For each test subject, LOSOCV is applied to the training samples: in each fold, the samples belonging to one subject serve as the validation set and the remaining samples form the new training set. The best $C$ is selected in this way and then used for testing.

IV Experiments

IV-A Datasets

The three most popular spontaneous datasets, CASME II [2], SMIC [1] and SAMM [3], are used to evaluate the performance of the proposed method. The dataset statistics are summarized in Table I.

SMIC [1]: SMIC consists of 164 sample video clips of 16 subjects belonging to 3 different classes, i.e., Positive (51 samples), Negative (70 samples) and Surprise (43 samples). The SMIC data has three versions: a high-speed camera (HS) version at 100 fps, a normal visual camera (VIS) version at 25 fps and a near-infrared camera (NIR) version at 25 fps. The HS camera was used to record all data, while the VIS and NIR cameras were only used for recording the last eight subjects' data. In this paper, we use the HS samples for experiments, and the average face size is 160 × 130 pixels.

CASME II [2]: CASME II contains 247 ME video clips from 26 subjects. All samples are recorded by a high speed camera at 200 fps. The resolution of the samples is 640 × 480 pixels and the cropped area has 340 × 280 pixels. These samples are categorized into five ME classes: Happiness (32 samples), Surprise (25 samples), Disgust (64 samples), Repression (27 samples) and Others (99 samples). These classes are used in the whole parameter evaluation and for the comparison in Table VII. To remove the bias of human reporting, [53] reorganized the classes based on AUs instead of the originally estimated emotion classes. Performance on the reorganized classes is also reported in Table VIII.

SAMM [3]: The SAMM database contains 159 ME video clips from 29 subjects. All samples are recorded by a high speed camera at 200 fps. The resolution of the samples is 2040 × 1088 pixels and the cropped facial area has about 400 × 400 pixels. These samples are categorized into seven AU based classes. Classes I-VI are linked with Happiness (24 samples), Surprise (13 samples), Anger (20 samples), Disgust (8 samples), Sadness (3 samples), and Fear (7 samples). Class VII (84 samples) relates to contempt and other AUs that have no emotional link in EMFACS [54]. We carry out experiments on SAMM with classes I-V, and the results are shown in Table VIII.

Feature          SMIC-HS [1]   CASME II [2]   SAMM [3]
No. of Samples   164           247            159
No. of Subjects  16            26             29
Resolution       640 × 480     640 × 480      2040 × 1088
Facial Area      160 × 130     340 × 280      400 × 400
FPS              100           200            200
FACS Coded       No            Yes            Yes
Classes          3             5              7
TABLE I: A summary of the different features of the SMIC, CASME II and SAMM databases.

IV-B Implementation Details

Following [41], leave one subject out (LOSO) strategy is used to calculate accuracy. For each fold, all samples from one subject are used as testing set and the rest for training. The final accuracy is obtained by averaging all folds.
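The LOSO protocol described above can be sketched in Python as follows. The sketch only illustrates the splitting and averaging logic; a simple nearest-centroid rule is used as a stand-in classifier so that the example stays self-contained, whereas the paper uses a linear SVM. The names `loso_accuracy` and `nearest_centroid` are hypothetical.

```python
import numpy as np

def loso_accuracy(X, y, subjects, fit_predict):
    """Leave-one-subject-out evaluation.

    For every subject, all of that subject's samples form the test set and
    the remaining samples form the training set. The final score is the
    mean of the per-fold accuracies."""
    X, y, subjects = np.asarray(X), np.asarray(y), np.asarray(subjects)
    fold_acc = []
    for s in np.unique(subjects):
        test = subjects == s
        pred = fit_predict(X[~test], y[~test], X[test])
        fold_acc.append(np.mean(pred == y[test]))
    return float(np.mean(fold_acc))

def nearest_centroid(X_train, y_train, X_test):
    """Stand-in classifier (NOT the paper's linear SVM): assign each test
    sample to the class whose training mean is closest."""
    labels = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in labels])
    dist = ((X_test[:, None, :] - centroids[None]) ** 2).sum(axis=2)
    return labels[dist.argmin(axis=1)]
```

Splitting by subject rather than by clip matters because clips of the same person are highly correlated; mixing them across training and test sets would overestimate the accuracy. The same loop, applied once more inside each training set, gives the nested procedure used to select the SVM penalty parameter.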

Parameters: For block division parameters (), is for CASME II and SMIC, and is for SAMM. For EVM [49], we choose the second-order bandpass filter with cutoff frequencies , and spatial frequency cutoff . Magnification value is set to for CASME II and SAMM, while is chosen for SMIC. TIM [51] is used to interpolate all ME sequences to the same length of 10 frames, following [41]. The number of neighboring pixels, the outer ring radius and the inner ring radius used in each experiment are given in the tables. The WPCA dimension is $N - 1$, where $N$ is the number of video clips of each dataset, e.g., 163 for SMIC and 246 for CASME II.

IV-C Parameter evaluation

The effect of encoding scheme: Table II compares the performance of two encoding schemes, full patterns (all patterns) and uniform patterns, on SMIC. Results of single binary descriptors without WPCA are reported. From Table II, we can see that the histogram representations generated by the full patterns outperform those generated by the uniform patterns for all binary descriptors, by a margin of 2.22% to 6.46%, clearly demonstrating the insufficiency of the uniform patterns for representing ME videos. As a result, we conduct the remaining experiments using the full patterns encoding scheme.

Method      full patterns      uniform patterns
            Acc. (%)           Acc. (%)
LBPTOP      52.07 (3,8)        49.85 (3,8)
ADLBPTOP    53.11 (3,8)        49.89 (3,8)
RDLBPTOP    53.26 (3,8,2)      46.80 (3,8,2)
TABLE II: ME recognition accuracy (%) of single descriptors on SMIC using two different encoding schemes: full patterns and uniform patterns. The values in parentheses give the outer ring radius, the number of neighboring points and, for RDLBPTOP, the inner ring radius. All experiments are conducted without WPCA and EVM.

Method      original (h)               WPCA (h)
            Acc. (%)         Dim.      Acc. (%)         Dim.
LBPTOP      51.09 (2,8)      98304     52.29 (2,8)      163
ADLBPTOP    55.11 (2,8)      98304     58.45 (2,8)      163
RDLBPTOP    52.61 (2,8,1)    98304     52.61 (2,8,1)    163
TABLE III: ME recognition accuracy (%) of different binary descriptors on SMIC with or without WPCA. is set to . Experiments are conducted without EVM.

The effect of WPCA: Table III illustrates the effect of WPCA dimensionality reduction on SMIC. WPCA never lowers the accuracy: it improves LBPTOP and ADLBPTOP and leaves RDLBPTOP unchanged. Besides, because the feature dimensionality is much lower (163 compared with 98304), WPCA can lead to great computational savings. Therefore, further experiments are conducted using WPCA.
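The dimensionality reduction discussed above can be sketched as follows. This is a minimal illustration of the standard whitened PCA formulation (projection onto the principal axes followed by scaling each component to unit variance), not the authors' code; the function names, the random toy data and the default of keeping one component fewer than the number of samples are assumptions made for the example.

```python
import numpy as np


def wpca_fit(X, n_components=None, eps=1e-12):
    """Fit whitened PCA on X (one row per video clip, one column per histogram bin)."""
    n_samples = X.shape[0]
    if n_components is None:
        # Assumption: centred data has rank at most n_samples - 1.
        n_components = n_samples - 1
    mean = X.mean(axis=0)
    # A thin SVD of the centred data avoids forming the D x D covariance matrix.
    _, sing, vt = np.linalg.svd(X - mean, full_matrices=False)
    sing, vt = sing[:n_components], vt[:n_components]
    std = sing / np.sqrt(n_samples - 1)  # standard deviation along each axis
    projection = vt.T / (std + eps)      # principal axes scaled to unit variance
    return mean, projection


def wpca_transform(X, mean, projection):
    """Project features onto the whitened principal axes."""
    return (X - mean) @ projection


# Toy data: 20 "clips" with 500-dimensional features (hypothetical sizes).
rng = np.random.default_rng(0)
features = rng.random((20, 500))
mean, projection = wpca_fit(features)
reduced = wpca_transform(features, mean, projection)
print(reduced.shape)
```

In this sketch the reduced features have an identity covariance matrix on the training set, so every retained component contributes equally to the distances seen by the classifier.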

Dataset | ADLBPTOP, Acc. (%) | LBPTOP, Acc. (%) | RDLBPTOP, Acc. (%)
SMIC | 62.27 (1,8) | 52.19 (1,8) | 52.55 (1,8,0)
SMIC | 58.45 (2,8) | 52.29 (2,8) | 52.61 (2,8,1)
SMIC | 53.11 (3,8) | 52.07 (3,8) | 50.67 (3,8,1)
SMIC | 53.11 (3,8) | 52.07 (3,8) | 53.26 (3,8,2)
SMIC | 47.95 (4,8) | 54.50 (4,8) | 55.97 (4,8,1)
SMIC | 47.95 (4,8) | 54.50 (4,8) | 52.89 (4,8,2)
SMIC | 47.95 (4,8) | 54.50 (4,8) | 55.45 (4,8,3)
CASME II | 48.35 (1,8) | 50.15 (1,8) | 49.14 (1,8,0)
CASME II | 56.45 (2,8) | 52.79 (2,8) | 50.89 (2,8,1)
CASME II | 44.36 (3,8) | 50.92 (3,8) | 49.49 (3,8,1)
CASME II | 44.36 (3,8) | 50.92 (3,8) | 55.10 (3,8,2)
CASME II | 47.23 (4,8) | 49.49 (4,8) | 40.64 (4,8,1)
CASME II | 47.23 (4,8) | 49.49 (4,8) | 43.19 (4,8,2)
CASME II | 47.23 (4,8) | 49.49 (4,8) | 45.60 (4,8,3)
TABLE IV: ME recognition accuracy (%) of the single binary descriptors on SMIC and CASME II under various parameter settings. is set to . The WPCA dimension for SMIC is 163, and 246 for CASME II. Experiments are conducted without EVM.

Evaluation of single binary descriptor: To explore the characteristics of different binary descriptors, we conduct experiments under various settings. As shown in Table IV, the radius has a great impact on the performance of the three descriptors. The best accuracy often exceeds the second best by a large gap. Therefore, the choice of the best radius is of great importance. Similarly, the inner ring radius is very important for the performance of RDLBP. Comparing the best results of ADLBPTOP, LBPTOP and RDLBPTOP, we find that the proposed ADLBPTOP and RDLBPTOP outperform LBPTOP on both SMIC and CASME II, which shows the importance of radial and angular difference information. In particular, ADLBPTOP performs much better than LBPTOP (3.66% and 8.27% higher on two datasets respectively).

Method | SMIC, Acc. (%) | CASME II, Acc. (%)
ADLBPTOP | 62.27 (1,8) | 56.45 (3,8)
ADLBPTOP+EVM | 63.73 (1,8) | 69.12 (3,8)
ADLBPTOP+EVM | 54.61 (4,4) | 70.20 (2,4)
LBPTOP | 54.50 (4,8) | 52.97 (2,8)
LBPTOP+EVM | 60.83 (3,8) | 67.08 (4,8)
LBPTOP+EVM | 65.16 (3,4) | 71.55 (3,4)
RDLBPTOP | 55.97 (4,8,1) | 55.10 (3,8,2)
RDLBPTOP+EVM | 61.04 (4,8,3) | 67.62 (3,8,2)
RDLBPTOP+EVM | 62.57 (4,4,3) | 69.24 (3,4,1)
TABLE V: ME recognition accuracy using different numbers of neighbors as well as with or without EVM. The parameters of and the WPCA dimensions are the same as Table IV.
Fig. 7: ME recognition accuracy(%) of different feature fusion schemes on SMIC and CASME II. In the boxes, we show the accuracy of the best fused descriptor and three single binary descriptors.

Evaluation of EVM and the number of neighboring pixels: The effects of the number of neighboring pixels and of EVM are summarized in Table V. Note that all results are reported with their best radii. We can see that EVM generally increases the recognition accuracy, sometimes significantly (such as for ADLBPTOP and LBPTOP). Table V also indicates that, for each single ELBPTOP descriptor with EVM, 4 neighboring pixels perform better than 8, with ADLBPTOP on SMIC being an exception.

Descriptor | Planes | SMIC, Acc. (%) | CASME II, Acc. (%)
ADLBP | TOP | 63.73 (1,8) | 69.12 (3,8)
ADLBP | XYOT | 55.91 (1,8) | 64.12 (3,8)
ADLBP | XOT | 55.92 (1,8) | 61.47 (3,8)
ADLBP | YOT | 60.22 (1,8) | 62.87 (3,8)
ADLBP | XY | 55.69 (1,8) | 56.46 (3,8)
LBP | TOP | 60.83 (3,8) | 67.08 (4,8)
LBP | XYOT | 60.47 (3,8) | 65.04 (4,8)
LBP | XOT | 57.24 (3,8) | 61.38 (4,8)
LBP | YOT | 55.47 (3,8) | 67.65 (4,8)
LBP | XY | 45.97 (3,8) | 60.26 (4,8)
RDLBP | TOP | 61.04 (4,8,3) | 67.62 (3,8,2)
RDLBP | XYOT | 57.84 (4,8,3) | 68.85 (3,8,2)
RDLBP | XOT | 56.06 (4,8,3) | 62.91 (3,8,2)
RDLBP | YOT | 58.76 (4,8,3) | 66.56 (3,8,2)
RDLBP | XY | 48.85 (4,8,3) | 57.14 (3,8,2)
TABLE VI: ME recognition accuracy (%) of three binary descriptors on different combinations of planes. TOP, XYOT, XOT, YOT and XY are abbreviations for XY+XT+YT, XT+YT, XT, YT and original spatial plane XY respectively. The parameters and are the same as Table IV. Experiments are conducted with EVM.

Evaluation of orthogonal planes: Table VI illustrates the performance of the three binary features (LBP, ADLBP, RDLBP) on five combinations of planes. It can be observed that TOP generally yields the best performance, which indicates that the dynamic information along the time dimension represents the most important information for ME recognition. In contrast, the results on the XY plane are almost always the worst. This is possibly because the XY plane contains much redundant information about the facial appearance, and not all facial areas may contain useful discriminative information for ME recognition.
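As a rough illustration of the plane combinations compared above, the sketch below slices the three orthogonal planes out of a video volume and concatenates per-plane histograms according to the abbreviations defined in the caption of Table VI. The (T, H, W) axis order, the use of central slices only, the placeholder histograms and all function names are assumptions made for the example.

```python
import numpy as np

# Plane sets as defined in the caption of Table VI.
PLANE_SETS = {
    "TOP": ("XY", "XT", "YT"),
    "XYOT": ("XT", "YT"),
    "XOT": ("XT",),
    "YOT": ("YT",),
    "XY": ("XY",),
}


def center_planes(video):
    """Return the central XY, XT and YT slices of a (T, H, W) video volume."""
    t, h, w = video.shape
    return {
        "XY": video[t // 2, :, :],  # spatial appearance at one time instant
        "XT": video[:, h // 2, :],  # horizontal motion over time
        "YT": video[:, :, w // 2],  # vertical motion over time
    }


def combine(histograms, name):
    """Concatenate per-plane histograms for the chosen plane set."""
    return np.concatenate([histograms[p] for p in PLANE_SETS[name]])


video = np.arange(10 * 6 * 8).reshape(10, 6, 8)  # toy clip: 10 frames of 6 x 8
planes = center_planes(video)
# Placeholder histograms with 256 bins per plane (full patterns, 8 neighbors).
histograms = {name: np.ones(256) for name in planes}
print({k: v.shape for k, v in planes.items()})
print(combine(histograms, "TOP").shape, combine(histograms, "XYOT").shape)
```

Dropping the XY plane (XYOT) removes one third of the feature vector in this sketch while keeping both temporal planes.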

Feature Fusion: In order to find a good fusion of LBP*, RDLBP*, and ADLBP* (here, * represents one of TOP, XYOT, XOT, YOT and XY), we test all 215 () possible feature fusion schemes on SMIC and CASME II. All results are shown in Figure 7 in descending order. We can see that the highest accuracy is achieved by combining the three types of binary codes. The best result on SMIC-HS is 69.06%, given by , and that on CASME II is 73.94%, given by .
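One counting that reproduces the figure of 215 is sketched below: each of the three descriptors is either left out or used with one of the five plane sets, and the empty selection is discarded. This reading is an assumption that happens to match the stated total; the variable names are hypothetical.

```python
from collections import Counter
from itertools import product

PLANE_SETS = ("TOP", "XYOT", "XOT", "YOT", "XY")
DESCRIPTORS = ("LBP", "RDLBP", "ADLBP")

# Each descriptor is either absent (None) or paired with one plane set.
options = (None,) + PLANE_SETS
schemes = [
    dict(zip(DESCRIPTORS, choice))
    for choice in product(options, repeat=len(DESCRIPTORS))
    if any(c is not None for c in choice)  # discard the empty selection
]

# Number of schemes grouped by how many descriptors they use.
by_size = Counter(sum(v is not None for v in s.values()) for s in schemes)
print(len(schemes), dict(sorted(by_size.items())))
```

Under this assumption the total splits into 15 single-descriptor schemes, 75 pairs and 125 schemes that use all three descriptors, i.e. 6^3 - 1 = 215.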

As can be seen from Figure 7, the fused feature increases the accuracy by 3.90%, 5.33% and 6.49% respectively compared with using LBPTOP, ADLBPTOP or RDLBPTOP alone on SMIC. Similarly, the accuracy is improved by 2.39%, 3.74% and 4.70% over the three binary codes respectively on CASME II. The strong performance improvement shows that the fused approach indeed captures complementary information.
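The quoted gains appear to be measured against the best single-descriptor accuracies with EVM in Table V. The short check below recomputes them from those table entries and the fused accuracies; pairing each gain with these particular baselines is an inference from the numbers, not something the text states explicitly.

```python
# Best single-descriptor accuracies (%) with EVM, taken from Table V.
best_single = {
    "SMIC": {"LBPTOP": 65.16, "ADLBPTOP": 63.73, "RDLBPTOP": 62.57},
    "CASME II": {"LBPTOP": 71.55, "ADLBPTOP": 70.20, "RDLBPTOP": 69.24},
}
# Best fused accuracies (%) reported in the text.
fused = {"SMIC": 69.06, "CASME II": 73.94}

# Gain of the fused feature over each single descriptor, in percentage points.
gains = {
    dataset: {name: round(fused[dataset] - acc, 2) for name, acc in accs.items()}
    for dataset, accs in best_single.items()
}
for dataset, values in gains.items():
    print(dataset, values)
```

The recomputed differences match the six values given in the paragraph above.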

IV-D Comparative evaluation

Methods | Classifier | Year | CASME II, Acc. (%) | SMIC, Acc. (%)
LBPTOP [1] | SVM | 2013 | - | 48.78
LBP-MOP [28] | SVM | 2015 | 44.13 | 50.61
FDM [37] | SVM | 2017 | 45.93 | 54.88
LBP-SIP [27] | SVM | 2014 | 46.56 | 44.51
3DFCNN [44] | Softmax | 2018 | 59.11 | 55.49
STCLQP [34] | SVM | 2016 | 58.39 | 64.02
STLBP-IP [29] | SVM | 2015 | 59.51 | 57.93
CNN+LSTM [42] | Softmax | 2016 | 60.98 | -
BiWOOF + Phase [55] | SVM | 2017 | 62.55 | 68.29
Hierarchical STLBP-IP [56] | KGSL | 2018 | 63.83 | 60.78
STRBP [30] | SVM | 2017 | 64.37 | 60.98
Discriminative STLBP-IP [57] | SVM | 2017 | 64.78 | 63.41
OF Maps [39] | SVM | 2017 | 65.35 | -
HIGOTOP [41] | SVM | 2018 | 67.31 | 68.29
ELBPTOP | SVM | - | 73.94 | 69.06
TABLE VII: Comparison between ELBPTOP and previous state of the art methods on CASME II (with original classes) and SMIC.

In this section, we compare the best results achieved by our ELBPTOP with state of the art results on CASME II (with both original and reorganized classes), SMIC and SAMM.

From Tables VII and VIII, we can observe that our proposed approach consistently gives the best results on all three datasets, significantly outperforming the state of the art. As illustrated in Table VII, our proposed method produces the highest accuracy (73.94%), which is 6.63% higher than the second best on CASME II (with original classes). Our ELBPTOP also achieves the best accuracy on SMIC (69.06%). The recent work of [53] reorganized the classes based on AUs instead of estimated emotion classes to remove the bias of human reporting. We also compare our results based on their reorganized classes in Table VIII. Our method surpasses all other methods on CASME II (with reorganized classes) significantly, improving from 69.64% to 79.55% (a margin of 9.91%). The effectiveness of our method is further demonstrated by the large improvement on SAMM, with an increase from 44.70% to 63.44% (a margin of 18.74%). The strong performance on all ME datasets clearly shows that our proposed ELBPTOP is effective for ME recognition.

Methods | SAMM, Acc. (%) | CASME II (reorganized classes), Acc. (%)
LBPTOP [53] | 44.70 | 67.80
HOOF [53] | 42.17 | 69.64
HOG 3D [53] | 34.16 | 69.53
ELBPTOP | 63.44 | 79.55

TABLE VIII: Comparison between ELBPTOP and previous state of the art methods on SAMM and CASME II (with reorganized classes).

V Conclusion

In this paper, we proposed a simple, efficient yet robust descriptor, ELBPTOP, for ME recognition. ELBPTOP consists of three complementary binary descriptors: LBPTOP and two novel ones, RDLBPTOP and ADLBPTOP, which explore the local second order information along radial and angular directions contained in ME video sequences. For dimension reduction, WPCA is used to obtain efficient and discriminative features. Extensive experiments on three benchmark spontaneous ME datasets, SMIC, CASME II and SAMM, have shown that our proposed approach surpasses the state of the art by a large margin. In our future work, we plan to learn binary codes directly from data for ME recognition.


  • [1] X. Li, T. Pfister, X. Huang, G. Zhao, and M. Pietikäinen, “A spontaneous micro-expression database: Inducement, collection and baseline,” in 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG).   IEEE, 2013, pp. 1–6.
  • [2] W.-J. Yan, X. Li, S.-J. Wang, G. Zhao, Y.-J. Liu, Y.-H. Chen, and X. Fu, “Casme ii: An improved spontaneous micro-expression database and the baseline evaluation,” PloS one, vol. 9, no. 1, p. e86041, 2014.
  • [3] A. K. Davison, C. Lansley, N. Costen, K. Tan, and M. H. Yap, “Samm: A spontaneous micro-facial movement dataset,” IEEE Transactions on Affective Computing, vol. 9, no. 1, pp. 116–129, 2018.
  • [4] P. Ekman, “Darwin, deception, and facial expression,” Annals of the New York Academy of Sciences, vol. 1000, no. 1, pp. 205–221, 2003.
  • [5] Q. Wu, X. Shen, and X. Fu, “Micro-expression and its applications,” Advances in Psychological Science, vol. 18, no. 9, pp. 1359–1368, 2010.
  • [6] T. Pfister, X. Li, G. Zhao, and M. Pietikäinen, “Recognising spontaneous facial micro-expressions,” in 2011 international conference on computer vision.   IEEE, 2011, pp. 1449–1456.
  • [7] Y.-H. Oh, J. See, A. C. Le Ngo, R. C.-W. Phan, and V. M. Baskaran, “A survey of automatic facial micro-expression analysis: Databases, methods and challenges,” Frontiers in psychology, vol. 9, p. 1128, 2018.
  • [8] B. Martinez and M. F. Valstar, “Advances, challenges, and opportunities in automatic facial expression recognition,” in Advances in face detection and facial image analysis.   Springer, 2016, pp. 63–100.
  • [9] W.-J. Yan, Q. Wu, J. Liang, Y.-H. Chen, and X. Fu, “How fast are the leaked facial expressions: The duration of micro-expressions,” Journal of Nonverbal Behavior, vol. 37, no. 4, pp. 217–230, 2013.
  • [10] S. Porter and L. Ten Brinke, “Reading between the lies: Identifying concealed and falsified emotions in universal facial expressions,” Psychological science, vol. 19, no. 5, pp. 508–514, 2008.
  • [11] L. Liu, J. Chen, P. Fieguth, G. Zhao, R. Chellappa, and M. Pietikäinen, “From bow to cnn: Two decades of texture representation for texture classification,” International Journal of Computer Vision, vol. 127, no. 1, pp. 74–109, 2019.
  • [12] T. Ojala, M. Pietikäinen, and T. Mäenpää, “Multiresolution gray-scale and rotation invariant texture classification with local binary patterns,” IEEE Transactions on Pattern Analysis & Machine Intelligence, no. 7, pp. 971–987, 2002.
  • [13] T. Ojala, M. Pietikäinen, and D. Harwood, “A comparative study of texture measures with classification based on featured distributions,” Pattern recognition, vol. 29, no. 1, pp. 51–59, 1996.
  • [14] S. ul Hussain and B. Triggs, “Visual recognition using local quantized patterns,” in European conference on computer vision.   Springer, 2012, pp. 716–729.
  • [15] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” 2005.
  • [16] B. K. Horn and B. G. Schunck, “Determining optical flow,” Artificial intelligence, vol. 17, no. 1-3, pp. 185–203, 1981.
  • [17] G. Zhao and M. Pietikainen, “Dynamic texture recognition using local binary patterns with an application to facial expressions,” IEEE Transactions on Pattern Analysis & Machine Intelligence, no. 6, pp. 915–928, 2007.
  • [18] S. Polikovsky, Y. Kameda, and Y. Ohta, “Facial micro-expressions recognition using high speed camera and 3d-gradient descriptor,” 2009.
  • [19] R. Chaudhry, A. Ravichandran, G. Hager, and R. Vidal, “Histograms of oriented optical flow and binet-cauchy kernels on nonlinear dynamical systems for the recognition of human actions,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition.   IEEE, 2009, pp. 1932–1939.
  • [20] L. Liu, P. Fieguth, Y. Guo, X. Wang, and M. Pietikäinen, “Local binary features for texture classification: Taxonomy and experimental study,” Pattern Recognition, vol. 62, pp. 135–160, 2017.
  • [21] T. Ahonen, A. Hadid, and M. Pietikainen, “Face description with local binary patterns: Application to face recognition,” IEEE Transactions on Pattern Analysis & Machine Intelligence, no. 12, pp. 2037–2041, 2006.
  • [22] T.-H. Oh, R. Jaroensri, C. Kim, M. Elgharib, F. Durand, W. T. Freeman, and W. Matusik, “Learning-based video motion magnification,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 633–648.
  • [23] D. Huang, C. Shan, M. Ardabilian, Y. Wang, and L. Chen, “Local binary patterns and its application to facial image analysis: a survey,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 41, no. 6, pp. 765–781, 2011.
  • [24] A. Fernández, M. X. Álvarez, and F. Bianconi, “Texture description through histograms of equivalent patterns,” Journal of mathematical imaging and vision, vol. 45, no. 1, pp. 76–102, 2013.
  • [25] L. Liu, P. Fieguth, G. Zhao, M. Pietikäinen, and D. Hu, “Extended local binary patterns for face recognition,” Information Sciences, vol. 358, pp. 56–72, 2016.
  • [26] L. Liu, S. Lao, P. W. Fieguth, Y. Guo, X. Wang, and M. Pietikäinen, “Median robust extended local binary pattern for texture classification,” IEEE Transactions on Image Processing, vol. 25, no. 3, pp. 1368–1381, 2016.
  • [27] Y. Wang, J. See, R. C.-W. Phan, and Y.-H. Oh, “Lbp with six intersection points: Reducing redundant information in lbp-top for micro-expression recognition,” in Asian conference on computer vision.   Springer, 2014, pp. 525–537.
  • [28] ——, “Efficient spatio-temporal local binary patterns for spontaneous facial micro-expression recognition,” PloS one, vol. 10, no. 5, p. e0124674, 2015.
  • [29] X. Huang, S.-J. Wang, G. Zhao, and M. Pietikäinen, “Facial micro-expression recognition using spatiotemporal local binary pattern with integral projection,” in Proceedings of the IEEE international conference on computer vision workshops, 2015, pp. 1–9.
  • [30] X. Huang and G. Zhao, “Spontaneous facial micro-expression analysis using spatiotemporal local radon-based binary pattern,” in 2017 International Conference on the Frontiers and Advances in Data Science (FADS).   IEEE, 2017, pp. 159–164.
  • [31] Z. Zeng, M. Pantic, G. I. Roisman, and T. S. Huang, “A survey of affect recognition methods: Audio, visual, and spontaneous expressions,” IEEE transactions on pattern analysis and machine intelligence, vol. 31, no. 1, pp. 39–58, 2008.
  • [32] X. Ben, X. Jia, R. Yan, X. Zhang, and W. Meng, “Learning effective binary descriptors for micro-expression recognition transferred by macro-information,” Pattern Recognition Letters, vol. 107, pp. 50–58, 2018.
  • [33] C. Ding, J. Choi, D. Tao, and L. S. Davis, “Multi-directional multi-level dual-cross patterns for robust face recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 38, no. 3, pp. 518–531, 2015.
  • [34] X. Huang, G. Zhao, X. Hong, W. Zheng, and M. Pietikäinen, “Spontaneous facial micro-expression analysis using spatiotemporal completed local quantized patterns,” Neurocomputing, vol. 175, pp. 564–578, 2016.
  • [35] S.-J. Wang, W.-J. Yan, G. Zhao, X. Fu, and C.-G. Zhou, “Micro-expression recognition using robust principal component analysis and local spatiotemporal directional features,” in European Conference on Computer Vision.   Springer, 2014, pp. 325–338.
  • [36] J. Wright, A. Ganesh, S. Rao, Y. Peng, and Y. Ma, “Robust principal component analysis: Exact recovery of corrupted low-rank matrices via convex optimization,” in Advances in neural information processing systems, 2009, pp. 2080–2088.
  • [37] F. Xu, J. Zhang, and J. Z. Wang, “Microexpression identification and categorization using a facial dynamics map,” IEEE Transactions on Affective Computing, vol. 8, no. 2, pp. 254–267, 2017.
  • [38] Y.-J. Liu, J.-K. Zhang, W.-J. Yan, S.-J. Wang, G. Zhao, and X. Fu, “A main directional mean optical flow feature for spontaneous micro-expression recognition,” IEEE Transactions on Affective Computing, vol. 7, no. 4, pp. 299–310, 2016.
  • [39] B. Allaert, I. M. Bilasco, and C. Djeraba, “Consistent optical flow maps for full and micro facial expression recognition,” 2017.
  • [40] S.-T. Liong, J. See, K. Wong, and R. C.-W. Phan, “Less is more: Micro-expression recognition from video using apex frame,” Signal Processing: Image Communication, vol. 62, pp. 82–92, 2018.
  • [41] X. Li, X. Hong, A. Moilanen, X. Huang, T. Pfister, G. Zhao, and M. Pietikäinen, “Towards reading hidden emotions: A comparative study of spontaneous micro-expression spotting and recognition methods,” IEEE Transactions on Affective Computing, vol. 9, no. 4, pp. 563–577, 2018.
  • [42] D. H. Kim, W. J. Baddar, and Y. M. Ro, “Micro-expression recognition with expression-state constrained spatio-temporal feature representations,” in Proceedings of the 24th ACM international conference on Multimedia.   ACM, 2016, pp. 382–386.
  • [43] M. Peng, C. Wang, T. Chen, G. Liu, and X. Fu, “Dual temporal scale convolutional neural network for micro-expression recognition,” Frontiers in psychology, vol. 8, p. 1745, 2017.
  • [44] J. Li, Y. Wang, J. See, and W. Liu, “Micro-expression recognition based on 3d flow convolutional neural network,” Pattern Analysis and Applications, pp. 1–9, 2018.
  • [45] Z. Xia, X. Feng, X. Hong, and G. Zhao, “Spontaneous facial micro-expression recognition via deep convolutional network,” in 2018 Eighth International Conference on Image Processing Theory, Tools and Applications (IPTA).   IEEE, 2018, pp. 1–6.
  • [46] M. Turk and A. Pentland, “Eigenfaces for recognition,” Journal of cognitive neuroscience, vol. 3, no. 1, pp. 71–86, 1991.
  • [47] H. V. Nguyen, L. Bai, and L. Shen, “Local gabor binary pattern whitened pca: A novel approach for face recognition from single image per person,” in International conference on biometrics.   Springer, 2009, pp. 269–278.
  • [48] T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham, “Active shape models-their training and application,” Computer vision and image understanding, vol. 61, no. 1, pp. 38–59, 1995.
  • [49] Y. Wang, J. See, Y.-H. Oh, R. C.-W. Phan, Y. Rahulamathavan, H.-C. Ling, S.-W. Tan, and X. Li, “Effective recognition of facial micro-expressions with video motion magnification,” Multimedia Tools and Applications, vol. 76, no. 20, pp. 21665–21690, 2017.
  • [50] H.-Y. Wu, M. Rubinstein, E. Shih, J. Guttag, F. Durand, and W. Freeman, “Eulerian video magnification for revealing subtle changes in the world,” 2012.
  • [51] Z. Zhou, G. Zhao, and M. Pietikäinen, “Towards a practical lipreading system,” in CVPR 2011.   IEEE, 2011, pp. 137–144.
  • [52] C.-C. Chang and C.-J. Lin, “Libsvm: a library for support vector machines,” ACM transactions on intelligent systems and technology (TIST), vol. 2, no. 3, p. 27, 2011.
  • [53] A. Davison, W. Merghani, and M. Yap, “Objective classes for micro-facial expression recognition,” Journal of Imaging, vol. 4, no. 10, p. 119, 2018.
  • [54] P. Ekman and W. V. Friesen, Facial action coding system: Investigator’s guide.   Consulting Psychologists Press, 1978.
  • [55] S.-T. Liong and K. Wong, “Micro-expression recognition using apex frame with phase information,” in 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC).   IEEE, 2017, pp. 534–537.
  • [56] Y. Zong, X. Huang, W. Zheng, Z. Cui, and G. Zhao, “Learning from hierarchical spatiotemporal descriptors for micro-expression recognition,” IEEE Transactions on Multimedia, vol. 20, no. 11, pp. 3160–3172, 2018.
  • [57] H. Xiaohua, S.-J. Wang, X. Liu, G. Zhao, X. Feng, and M. Pietikainen, “Discriminative spatiotemporal local binary pattern with revisited integral projection for spontaneous facial micro-expression recognition,” IEEE Transactions on Affective Computing, 2017.