1 Introduction
In this paper, the problem is recognizing facial actions in a face video, given action categories at the granularity of either the holistic expression (emotion, see Fig. 1) or action units (AUs, see Fig. 2). The widely-used six basic emotions defined by Paul Ekman are surprise, sadness, disgust, anger, fear and happiness. Ekman also defined the Facial Action Coding System (FACS), with which almost any expression can be coded. Recently, feature learning [1, 2, 3] using autoencoders or adversarial training has been shown to disentangle facial expression from identity and pose. Unlike face recognition, only limited labelled training data are available for facial expressions, and for AUs in particular.
As shown in Fig. 1, an expressive face can be separated into a principal component, the neutral face encoding identity cues, and an action component encoding motion cues such as the highlighted brow, cheek, lip, lid and nose, which relate to AUs in FACS. Since recognition is always broken down into measuring similarity [4], similarity in identity can be confused with similarity in action. To decouple them, [5] first rules out the neutral face explicitly and then discriminates between action components [6] instead of raw faces [7]. The first step is based on the observation that the underlying neutral face stays the same: if we stack the vectors of neutral faces over the duration of an action as a matrix, it should be low-rank, ideally with rank 1. While the low-rank Principal Component Pursuit [8] can theoretically be exact under certain conditions, it is approximate in practice. Their second step is based on the idea of describing an action component as a sparse representation over an overcomplete dictionary formed by the action components of all categories. Our intuition is to retain both facets in a joint manner. For one thing, we implicitly get rid of the neutral face. For another, we use equally spaced sampled frames, since all frames collaboratively yet redundantly represent the expression and the neutral face. The per-frame sparse coefficient vectors then form a jointly sparse coefficient matrix. That drives us to induce both the joint Sparse representation [9, 10] and the implicit Low-Rank approximation [8] in one model (SLR) [11], which also induces consistent classifications across frames.
Furthermore, ideally all nonzero coefficients drop to the ground-truth category. Therefore, the class-level sparsity is 1 and the coefficient matrix exhibits group sparsity. However, coefficient vectors share class-wise yet not necessarily atom-wise sparsity [12]. Thus, we prefer enforcing both the group sparsity and the atom-wise sparsity. We name this extended model the Collaborative-Hierarchical Sparse and Low-Rank (C-HiSLR) model [11], following the naming of C-HiLasso [12].
In Sec. 2, we review the classic idea of learning a sparse representation for classification and its application to expression recognition. In the remainder, we first elaborate our model in Sec. 3, then discuss solving the model via joint optimization in Sec. 4, and finally evaluate our model quantitatively in Sec. 5, before concluding in Sec. 6.
2 Related Work
Among nonlinear models, one line of work is kernel-based methods [13] while another is deep learning [14, 15, 16, 1]. Similar ideas of disentangling factors have been presented in [3, 2, 1]. Introducing extra cues, one line of work uses 3D models [17] while another uses multimodal models [18]. In the linear world, however, observing a random signal y for recognition, we just hope to send the classifier a discriminative compact representation x over a dictionary A such that y = Ax. Normally x is computed by pursuing the best reconstruction. For example, when A is undercomplete (skinny), a closed-form approximate solution can be obtained by Least-Squares:

x* = (A^T A)^{-1} A^T y.

When A is overcomplete (fat), we add a Tikhonov regularizer:

x* = argmin_x ||y - Ax||_2^2 + λ||x||_2^2 = (A^T A + λI)^{-1} A^T y,

where λ > 0 and the augmented dictionary [A; sqrt(λ)·I] is undercomplete. Notably, x* is generally dense.
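As a concrete toy illustration of the two closed forms above (data and variable names are our own, not from the paper), the following sketch computes the Least-Squares solution for a skinny A and the Tikhonov (ridge) solution for a fat A, and verifies that the latter equals Least-Squares over the augmented dictionary [A; sqrt(λ)·I]:

```python
import numpy as np

rng = np.random.default_rng(0)

# Undercomplete ("skinny") dictionary: more rows than columns.
A_skinny = rng.standard_normal((10, 4))
y = rng.standard_normal(10)

# Least-Squares closed form: x = (A^T A)^{-1} A^T y.
x_ls = np.linalg.solve(A_skinny.T @ A_skinny, A_skinny.T @ y)

# Overcomplete ("fat") dictionary: Tikhonov (ridge) regularizer,
# x = (A^T A + lam*I)^{-1} A^T y.
A_fat = rng.standard_normal((4, 10))
y2 = rng.standard_normal(4)
lam = 0.1
n = A_fat.shape[1]
x_ridge = np.linalg.solve(A_fat.T @ A_fat + lam * np.eye(n), A_fat.T @ y2)

# Equivalent view: Least-Squares over the augmented, always
# undercomplete dictionary [A; sqrt(lam)*I] with zero-padded target.
A_aug = np.vstack([A_fat, np.sqrt(lam) * np.eye(n)])
y_aug = np.concatenate([y2, np.zeros(n)])
x_aug, *_ = np.linalg.lstsq(A_aug, y_aug, rcond=None)
```

Note that x_ridge comes out generically dense, matching the remark above.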
Alternatively, we can seek a sparse usage of the dictionary A.
Sparse Representation based Classification (SRC) [9] expresses a test sample y as a weighted linear combination y = Ax of training samples simply stacked column-wise in the dictionary A. Presumably, the nonzero coefficients drop to the ground-truth class, which induces a sparse coefficient vector x, the so-called sparse representation. In practice, nonzero coefficients also drop to other classes due to noise and correlations among classes. Once we add an error term e, we can form an augmented dictionary [A I], which is always overcomplete:

min_{x,e} ||x||_1 + ||e||_1   s.t.   y = [A I] [x; e].

SRC evaluates which class leads to the minimum reconstruction error min_c ||y - A δ_c(x)||_2, where δ_c(·) keeps only the coefficients associated with class c; this can be seen as a max-margin classifier.
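A minimal SRC sketch follows (a toy construction of our own, using a few ISTA iterations as a stand-in for the exact ℓ1 pursuit, and omitting the error term e for brevity):

```python
import numpy as np

def ista_lasso(A, y, lam=0.05, n_iter=500):
    """Minimize 0.5*||y - A x||_2^2 + lam*||x||_1 by ISTA
    (a simple proxy for the l1 pursuit used in SRC)."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2  # 1 / Lipschitz constant
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        g = x - step * A.T @ (A @ x - y)                       # gradient step
        x = np.sign(g) * np.maximum(np.abs(g) - step * lam, 0)  # soft threshold
    return x

def src_classify(A, labels, y, lam=0.05):
    """Assign y to the class whose coefficients reconstruct it best."""
    x = ista_lasso(A, y, lam)
    residuals = {}
    for c in np.unique(labels):
        delta = np.where(labels == c, x, 0.0)  # keep only class-c coefficients
        residuals[c] = np.linalg.norm(y - A @ delta)
    return min(residuals, key=residuals.get)

# Toy dictionary: two classes of four atoms each, unit-norm columns.
rng = np.random.default_rng(1)
A = rng.standard_normal((20, 8))
A /= np.linalg.norm(A, axis=0)
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y = 0.6 * A[:, 1] + 0.4 * A[:, 2]  # a noiseless class-0 sample
pred = src_classify(A, labels, y)
```

Since y lies exactly in the span of two class-0 atoms, the class-0 residual is near zero while the class-1 residual stays large.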
Particularly for facial actions, we treat videos as multi-channel signals [10, 19], different from image-based methods [5, 6]. [5] explicitly separates the neutral face and the action component, and then exploits the class-wise sparsity separately for recognizing identity from neutral faces and expression from action components. Differently, with the focus on facial actions, we exploit the low-rank property for disentangling identity, as well as structured sparsity from inter-channel observation. Furthermore, there is a trade-off between simplicity and performance. As videos are sequential signals, the above appearance-based methods, including ours, cannot model the dynamics captured by a temporal model [20] or spatio-temporal models [21, 22, 23]. Other linear models include ordinal regression [24, 25, 26] and boosting [27].
3 Linear Representation Model
In this section, we explain how to model a test video Y using a dictionary D built from the training data, which contain K types of actions. We would like to classify Y as one of the K classes.
3.1 SLR: joint Sparse representation and Low-Rankness
First of all, we need an explicit representation of an expressive face. The matrix Y can be an arrangement of d-dimensional feature vectors y_i (i = 1, ..., τ) of the τ frames: Y = [y_1 | y_2 | ... | y_τ]. We emphasize our model's power by simply using the raw pixel intensities.
Now, we seek an implicit latent representation of an input test face's emotion E as a sparse linear combination of prepared, fixed training emotions forming a dictionary D:

E = DX.

Since an expressive face is a superposition of an emotion E and a neutral face L, we have

Y = E + L = DX + L,

where L is ideally a τ-times repetition of the column vector l of a neutral face: L = [l | l | ... | l]. Presumably rank(L) = 1. As shown in Fig. 3, Y subjects to

Y = [D  I] [X; L],

where the dictionary matrix D is an arrangement of all K class-wise sub-matrices: D = [D_1 | D_2 | ... | D_K].
Only for training, we have training emotions with neutral faces subtracted.
The above constraint on Y characterizes an affine transformation from the latent representation X to the observation Y. If we write X and Y in homogeneous form, using L = l·1^T, we have

Y = [D  l] [X; 1^T].
In the ideal case with rank(L) = 1, if the neutral face l is pre-obtained [6, 5], it is trivial to solve for X. Normally, l is unknown and L does not have rank 1 due to noise. As X is supposed to be sparse and rank(L) is expected to be as small as possible (ideally 1), intuitively our objective is to

min_{X,L} ||X||_0 + λ·rank(L)   s.t.   Y = DX + L,

where rank(L) can be seen as the sparsity of the vector formed by the singular values of L. Here λ is a non-negative weighting parameter we need to tune. When λ = 0, the optimization problem reduces to that in SRC. With both terms relaxed to be convex norms, we alternatively solve the problem with ||X||_1, the entry-wise ℓ1 matrix norm, and ||L||_*, the Schatten-1 matrix norm (nuclear norm, trace norm), which can be seen as applying the ℓ1 norm to the vector of singular values of L. Now, the proposed joint SLR model is expressed as

min_{X,L} ||X||_1 + λ||L||_*   s.t.   Y = DX + L.   (1)
3.2 C-HiSLR: a Collaborative-Hierarchical SLR model
If there were no low-rank term L, (1) would become a problem of multi-channel Lasso (Least Absolute Shrinkage and Selection Operator). For a single-channel signal, Group Lasso explores the group structure for Lasso yet does not enforce sparsity within a group, while Sparse Group Lasso yields atom-wise sparsity as well as group sparsity. [12] extends Sparse Group Lasso to multiple channels, resulting in the Collaborative-Hierarchical Lasso (C-HiLasso) model. For our problem, we do need the low-rank term L, which induces a Collaborative-Hierarchical Sparse and Low-Rank (C-HiSLR) model:
min_{X,L} ||X||_1 + λ_g Σ_G ||X[G]||_F + λ||L||_*   s.t.   Y = DX + L,   (2)

where X[G] is the sub-matrix formed by all the rows of X indexed by the elements in group G. As shown in Fig. 4, given a group G of indices, the sub-dictionary of the columns of D indexed by G is denoted as D[G]. The groups form a non-overlapping partition of the row indices of X. Here ||·||_F denotes the Frobenius norm, which is the entry-wise ℓ2 norm as well as the Schatten-2 matrix norm, and can be seen as a group's magnitude. λ_g is a non-negative weighting parameter for the group regularizer, which generalizes an ℓ1 regularizer (consider singleton groups) [12]. When λ_g = 0, C-HiSLR degenerates into SLR. When λ = 0, we get back to collaborative Sparse Group Lasso.
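To make the regularizers in (2) concrete, the following sketch (toy data and a toy group partition of our own) evaluates the three terms of the C-HiSLR objective for a given candidate pair (X, L):

```python
import numpy as np

def chislr_objective(X, L, groups, lam=1.0, lam_g=0.5):
    """Objective of (2): entry-wise l1 + group Frobenius + nuclear norm.
    `groups` is a non-overlapping partition of the row indices of X."""
    l1 = np.abs(X).sum()                                        # ||X||_1
    group = sum(np.linalg.norm(X[g, :], "fro") for g in groups)  # sum ||X[G]||_F
    nuclear = np.linalg.norm(L, "nuc")                           # ||L||_*
    return l1 + lam_g * group + lam * nuclear

rng = np.random.default_rng(2)
X = np.zeros((6, 3))
X[0:3, :] = rng.standard_normal((3, 3))            # only group 0 is active
L = np.outer(rng.standard_normal(4), np.ones(3))   # rank-1 "neutral" part
groups = [np.arange(0, 3), np.arange(3, 6)]
obj = chislr_objective(X, L, groups)
```

With singleton groups, the group term reduces to the ℓ1 norm, matching the remark that it generalizes an ℓ1 regularizer.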
3.3 Classification
Following SRC, for each class c, let D_c denote the sub-matrix of D consisting of all the columns of D that correspond to emotion class c, and similarly X_c for X. We classify Y by assigning it to the class with the minimal residual: c* = argmin_c ||Y - D_c X_c||_F.
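This per-class residual rule can be sketched as follows (toy data of our own; in practice the coefficients come from solving (1) or (2)):

```python
import numpy as np

def classify_video(Y, D, class_of_atom, X):
    """Pick the class whose sub-dictionary and coefficients reconstruct Y
    best, measured by the Frobenius-norm residual (following SRC)."""
    best, best_res = None, np.inf
    for c in np.unique(class_of_atom):
        idx = class_of_atom == c
        res = np.linalg.norm(Y - D[:, idx] @ X[idx, :], "fro")
        if res < best_res:
            best, best_res = c, res
    return best

# Toy setup: Y is exactly reconstructed by the class-1 atoms.
rng = np.random.default_rng(3)
D = rng.standard_normal((12, 6))
class_of_atom = np.array([0, 0, 0, 1, 1, 1])
X = np.zeros((6, 4))
X[3:, :] = rng.standard_normal((3, 4))  # coefficients only on class 1
Y = D @ X                               # emotion part only (neutral part omitted)
pred = classify_video(Y, D, class_of_atom, X)
```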
4 Optimization
Both the SLR and C-HiSLR models can be seen as solving

min_{X,L} λ||L||_* + f(X)   s.t.   Y = DX + L,   (3)

where f(X) = ||X||_1 for SLR and f(X) = ||X||_1 + λ_g Σ_G ||X[G]||_F for C-HiSLR. To follow a standard iterative ADMM procedure, we write down the augmented Lagrangian function for (3) as

ℒ(X, L, Λ) = λ||L||_* + f(X) + ⟨Λ, Y - DX - L⟩ + (β/2)||Y - DX - L||_F^2,   (4)

where Λ is the matrix of multipliers, ⟨·,·⟩ is the inner product, and β is a positive weighting parameter for the penalty (augmentation). A single update at the k-th iteration includes

L^{k+1} = argmin_L λ||L||_* + (β/2)||Y - DX^k - L + (1/β)Λ^k||_F^2,   (5)

X^{k+1} = argmin_X f(X) + (β/2)||Y - DX - L^{k+1} + (1/β)Λ^k||_F^2,   (6)

Λ^{k+1} = Λ^k + β(Y - DX^{k+1} - L^{k+1}).   (7)

The sub-step of solving (5) has a closed-form solution via singular value thresholding:

L^{k+1} = U S_{λ/β}(Σ) V^T,   (8)

where UΣV^T is the SVD of Y - DX^k + (1/β)Λ^k and S_α(·) is the shrinkage-thresholding operator with threshold α. In SLR, where f(X) = ||X||_1, (6) is a Lasso problem, which we solve using the Illinois fast solver. When f(X) follows (2) of C-HiSLR, computing X^{k+1} needs an approximation based on the Taylor expansion at X^k [28, 12]. We refer the reader to [12] for the convergence analysis and recovery guarantee.
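The updates (5)-(7) and the closed form (8) can be sketched as follows. This is our own minimal implementation of the SLR case, where the Lasso sub-problem (6) is approximated by a few ISTA passes rather than the fast solver used in the paper:

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: the closed form of (5) / Eq. (8)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def soft(M, tau):
    """Entry-wise shrinkage-thresholding operator."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def slr_admm(Y, D, lam=1.0, beta=1.0, n_iter=200, inner=50):
    """ADMM sketch for min ||X||_1 + lam*||L||_*  s.t.  Y = D X + L."""
    X = np.zeros((D.shape[1], Y.shape[1]))
    L = np.zeros_like(Y)
    Lam = np.zeros_like(Y)
    step = 1.0 / np.linalg.norm(D, 2) ** 2
    for _ in range(n_iter):
        # (5) L-update by singular value thresholding with threshold lam/beta.
        L = svt(Y - D @ X + Lam / beta, lam / beta)
        # (6) X-update: a Lasso in X, solved approximately by ISTA passes.
        R = Y - L + Lam / beta
        for _ in range(inner):
            G = X - step * D.T @ (D @ X - R)
            X = soft(G, step / beta)
        # (7) dual (multiplier) update.
        Lam = Lam + beta * (Y - D @ X - L)
    return X, L

# Toy data: a sparse emotion part plus a rank-1 "neutral" part.
rng = np.random.default_rng(4)
D = rng.standard_normal((15, 8))
X_true = np.zeros((8, 5))
X_true[2, :] = 1.0
l = rng.standard_normal(15)
Y = D @ X_true + np.outer(l, np.ones(5))
X_hat, L_hat = slr_admm(Y, D)
```

The dual update (7) progressively enforces the constraint Y = DX + L, so the feasibility residual shrinks over the iterations.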
5 Experimental Results
We evaluate our model on holistic expressions (CK+) and action units (MPI-VDB). Images are cropped using the Viola-Jones face detector. Per-category accuracies are averaged over 20 runs.
5.1 Holistic facial expression recognition
Experiments are conducted on the CK+ dataset [29], consisting of 321 videos with labels. (Contempt is discarded in [5, 6] due to its confusion with anger and disgust, but we choose to keep it for the completeness of the experiment on CK+. See https://github.com/eglxiang/icassp15_emotion for the cropped face data and the programs of C-HiSLR, SLR, SRC and Eigenface.) For SLR and C-HiSLR, we assume no prior knowledge of the neutral face. A testing unit contains the last τ frames together with the first frame, which is not explicitly known a priori to be a neutral face. For forming the dictionary, however, we subtract the first frame from the last τ frames of each video. The parameters are set to fixed values. We randomly choose 10 videos per class for training and 5 for testing. Fig. 5 visualizes the recovery results given by C-HiSLR. Tables 1 and 2 present the confusion matrices of C-HiSLR and SLR, respectively; columns are predictions and rows are ground truths. Table 4 summarizes the true positive rate (i.e., sensitivity). We had anticipated that SLR (0.70) would perform worse than SRC (0.80), since SRC is equipped with neutral faces. However, C-HiSLR's result (0.80) is comparable with SRC's, and C-HiSLR performs even better in terms of sensitivity, which verifies that the group sparsity indeed boosts the performance. As a comparison, we replicate the image-based SRC used in [5, 6, 7] and assume the neutral face is provided. We represent an action component by subtracting the neutral face, i.e., the first frame, from the last frame of each video. We choose half of CK+ for training and the other half for testing per class. With the sparsity level set to 35%, SRC achieves a recognition rate of 0.80, as shown in Table 3. Accuracies for fear and sadness are low, as the two are confused with each other.
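The dictionary construction described above, subtracting the first (neutral) frame from the last τ frames of each training video, can be sketched as follows (toy data of our own; τ = 4 is an arbitrary choice here):

```python
import numpy as np

def action_components(video, tau=4):
    """Stack the last `tau` frames minus the first (neutral) frame as
    columns. `video` has shape (n_frames, d); tau is a hypothetical
    window length, not a value from the paper."""
    neutral = video[0]
    return (video[-tau:] - neutral).T  # shape (d, tau)

# Toy "video": 12 frames of a 5-pixel face drifting away from frame 0.
video = np.cumsum(np.ones((12, 5)), axis=0)
E = action_components(video, tau=4)
```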
5.2 Facial action unit recognition
To be pose-independent, the following experiments are conducted on a profile view of MPI-VDB, containing 27 long videos, all with over 100 frames (1 video per category, see Fig. 2). (See http://vdb.kyb.tuebingen.mpg.de for the raw data and https://github.com/eglxiang/FacialAU for the cropped face data.) From each video we sample 10 disjoint sub-videos, each of which contains 10 equally spaced frames. Different from Sec. 5.1, all frames are used directly, without subtracting the first frame, as the sub-videos do not start with neutral states. However, underlying neutral states implicitly exist, and presumably the proposed model is still valid. We then randomly sample 5 sub-videos out of the 10 for training (i.e., forming the dictionary) and use the other 5 for testing. In this way, the dataset is divided into a training set and a disjoint testing set, both with 5 sub-videos per category. SLR's performance is shown in Fig. 6, with an average recognition rate of 0.80. C-HiSLR's performance is shown in Fig. 7, with an average recognition rate of 0.84. Both perform poorly on AU10R (right upper lip raiser), which is confused with AU12R (right lip corner puller), AU13 (cheek puffer), AU14R (right dimpler) and AU15 (lip corner depressor), because they all involve the lips.
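One plausible way to realize the sampling described above, 10 disjoint sub-videos of 10 equally spaced frames each, is interleaved sampling; this scheme is our assumption, not taken from the paper:

```python
import numpy as np

def disjoint_subvideos(n_frames, n_sub=10, sub_len=10):
    """Split frame indices into `n_sub` disjoint sub-videos, each holding
    `sub_len` equally spaced frames (interleaved sampling, our assumed
    realization of 'disjoint + equally spaced')."""
    stride = n_frames // (n_sub * sub_len)
    assert stride >= 1, "video too short"
    subs = []
    for s in range(n_sub):
        idx = s * stride + np.arange(sub_len) * n_sub * stride
        subs.append(idx)
    return subs

subs = disjoint_subvideos(120)  # a video with 120 frames (over 100, as in MPI-VDB)
```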
Table 1: Confusion matrix of C-HiSLR on CK+ (rows: ground truth; columns: prediction).

      An    Co    Di    Fe    Ha    Sa    Su
An    0.77  0.01  0.09  0.02  0     0.07  0.04
Co    0.08  0.84  0     0     0.03  0.04  0
Di    0.05  0     0.93  0.01  0.01  0.01  0
Fe    0.09  0.01  0.03  0.53  0.12  0.07  0.15
Ha    0.01  0.02  0.01  0.02  0.93  0     0.03
Sa    0.19  0.02  0.02  0.05  0     0.65  0.07
Su    0     0.02  0     0.02  0     0.02  0.95
Table 2: Confusion matrix of SLR on CK+ (rows: ground truth; columns: prediction).

      An    Co    Di    Fe    Ha    Sa    Su
An    0.51  0     0.10  0.02  0     0.31  0.06
Co    0.03  0.63  0.03  0     0.04  0.26  0.01
Di    0.04  0     0.74  0.02  0.01  0.15  0.04
Fe    0.08  0     0.01  0.51  0.03  0.19  0.18
Ha    0     0.01  0     0.03  0.85  0.08  0.03
Sa    0.09  0     0.04  0.04  0     0.70  0.13
Su    0     0.01  0     0.02  0.01  0.02  0.94
Table 3: Confusion matrix of SRC (given neutral faces) on CK+ (rows: ground truth; columns: prediction).

      An    Co    Di    Fe    Ha    Sa    Su
An    0.71  0.01  0.07  0.02  0.01  0.03  0.16
Co    0.07  0.60  0.02  0     0.16  0.03  0.12
Di    0.04  0     0.93  0.02  0.01  0     0
Fe    0.16  0     0.09  0.25  0.25  0     0.26
Ha    0.01  0     0     0.01  0.96  0     0.02
Sa    0.22  0     0.13  0.01  0.04  0.24  0.35
Su    0     0.01  0     0     0.01  0     0.98
Table 4: True positive rates (sensitivity) per class on CK+.

Model     An    Co    Di    Fe    Ha    Sa    Su
SRC       0.71  0.60  0.93  0.25  0.96  0.24  0.98
SLR       0.51  0.63  0.74  0.51  0.85  0.70  0.94
C-HiSLR   0.77  0.84  0.93  0.53  0.93  0.65  0.95
6 Conclusion
In this paper, we propose an identity-decoupled linear model to learn a facial action representation, unlike [6], which requires neutral faces as inputs, and [5], which generates labels of the identity and the facial action as mutual by-products yet with extra effort. Our contribution is twofold. First, we do not recover the action component explicitly. Instead, the video-based sparse representation is jointly modelled with the low-rank property across frames, so that the neutral face underneath is automatically subtracted. Second, we preserve label consistency by enforcing atom-wise as well as group sparsity. On the CK+ dataset, C-HiSLR's performance on raw faces is comparable with that of SRC given neutral faces, which verifies that action components are automatically separable from raw faces as well as sparsely representable by training data. We also apply the model to recognizing action units with limited training data, a regime that may embarrass deep learning techniques.
References
 [1] Salah Rifai, Yoshua Bengio, Aaron Courville, Pascal Vincent, and Mehdi Mirza, “Disentangling factors of variation for facial expression recognition,” in ECCV. Springer, 2012, pp. 808–822.
 [2] Scott Reed, Kihyuk Sohn, Yuting Zhang, and Honglak Lee, “Learning to disentangle factors of variation with manifold interaction,” in ICML, 2014, pp. 1431–1439.

 [3] Ping Liu, Joey Tianyi Zhou, Ivor Wai-Hung Tsang, Zibo Meng, Shizhong Han, and Yan Tong, “Feature disentangling machine - a novel approach of feature selection and disentangling in facial expression analysis,” in ECCV. Springer, 2014, pp. 151–166.
 [4] Xiang Xiang and Trac D. Tran, “Pose-selective max pooling for measuring similarity,” in ICPR Workshops, 2016.
 [5] Sima Taheri, Vishal M. Patel, and Rama Chellappa, “Component-based recognition of faces and facial expressions,” IEEE Trans. on Affective Computing, vol. 4, no. 4, pp. 360–371, 2013.
 [6] Stefanos Zafeiriou and Maria Petrou, “Sparse representations for facial expressions recognition via l1 optimization,” in IEEE CVPR Workshop, 2010.
 [7] Raymond Ptucha, Grigorios Tsagkatakis, and Andreas Savakis, “Manifold based sparse representation for robust expression recognition without neutral subtraction,” in IEEE ICCV Workshops, 2011.

[8]
Emmanuel J. Candes, Xiaodong Li, Yi Ma, and John Wright,
“Robust principal component analysis?,”
Journal of the ACM, vol. 58, no. 3, pp. 1–37, 2011.  [9] John Wright, Allen Y. Yang, Arvind Ganesh, Shankar S Sastry, and Yi Ma, “Robust face recognition via sparse representation,” IEEE TPAMI, vol. 31, no. 2, pp. 210–227, 2009.
 [10] Yonina C. Eldar and Holger Rauhut, “Average case analysis of multichannel sparse recovery using convex relaxation,” IEEE Trans. Inf. Theory, vol. 56, no. 1, pp. 505–519, 2010.
 [11] Xiang Xiang, Minh Dao, Gregory D. Hager, and Trac D. Tran, “Hierarchical sparse and collaborative low-rank representation for emotion recognition,” in ICASSP. IEEE, 2015, pp. 3811–3815.
 [12] Pablo Sprechmann, Ignacio Ramírez, Guillermo Sapiro, and Yonina C. Eldar, “C-HiLasso: A collaborative hierarchical sparse modeling framework,” IEEE Trans. Sig. Proc., vol. 59, no. 9, pp. 4183–4198, 2011.
 [13] C. Fabian Benitez-Quiroz, Ramprakash Srinivasan, and Aleix M. Martinez, “EmotioNet: An accurate, real-time algorithm for the automatic annotation of a million facial expressions in the wild,” in CVPR, June 2016.
 [14] Xiangyun Zhao, Xiaodan Liang, Luoqi Liu, Teng Li, Yugang Han, Nuno Vasconcelos, and Shuicheng Yan, “Peak-piloted deep network for facial expression recognition,” in ECCV, 2016, pp. 425–442.

 [15] Heechul Jung, Sihaeng Lee, Junho Yim, Sunjeong Park, and Junmo Kim, “Joint fine-tuning in deep neural networks for facial expression recognition,” in ICCV, December 2015.
 [16] Ping Liu, Shizhong Han, Zibo Meng, and Yan Tong, “Facial expression recognition via a boosted deep belief network,” in CVPR, 2014.
 [17] Hui Chen, Jiangdong Li, Fengjun Zhang, Yang Li, and Hongan Wang, “3D model-based continuous emotion recognition,” in CVPR, 2015.
 [18] Zheng Zhang, Jeff M. Girard, Yue Wu, Xing Zhang, Peng Liu, Umur Ciftci, Shaun Canavan, Michael Reale, Andy Horowitz, Huiyuan Yang, Jeffrey F. Cohn, Qiang Ji, and Lijun Yin, “Multimodal spontaneous emotion corpus for human behavior analysis,” in CVPR, June 2016.
 [19] Kaili Zhao, WenSheng Chu, Fernando De la Torre, Jeffrey F. Cohn, and Honggang Zhang, “Joint patch and multilabel learning for facial action unit detection,” in CVPR, June 2015.

 [20] Arnaud Dapogny, Kevin Bailly, and Séverine Dubuisson, “Pairwise conditional random forests for facial expression recognition,” in ICCV, 2015.
 [21] Mengyi Liu, Shiguang Shan, Ruiping Wang, and Xilin Chen, “Learning expressionlets on spatio-temporal manifold for dynamic facial expression recognition,” in CVPR, 2014, pp. 1749–1756.
 [22] Ziheng Wang, Shangfei Wang, and Qiang Ji, “Capturing complex spatiotemporal relations among facial muscles for facial expression recognition,” in CVPR, 2013, pp. 3422–3429.
 [23] Yimo Guo, Guoying Zhao, and Matti Pietikäinen, “Dynamic facial expression recognition using longitudinal facial expression atlases,” in ECCV, pp. 631–644. Springer, 2012.

 [24] Rui Zhao, Quan Gan, Shangfei Wang, and Qiang Ji, “Facial expression intensity estimation using ordinal information,” in CVPR, June 2016.
 [25] Ognjen Rudovic, Vladimir Pavlovic, and Maja Pantic, “Multi-output laplacian dynamic ordinal regression for facial expression recognition and intensity estimation,” in CVPR. IEEE, 2012, pp. 2634–2641.
 [26] Minyoung Kim and Vladimir Pavlovic, “Structured output ordinal regression for dynamic facial emotion intensity prediction,” in ECCV. Springer, 2010, pp. 649–662.
 [27] Peng Yang, Qingshan Liu, and Dimitris N. Metaxas, “Exploring facial expressions with compositional features,” in CVPR. IEEE, 2010.
 [28] Minh Dao, Nam H. Nguyen, Nasser M. Nasrabadi, and Trac D. Tran, “Collaborative multi-sensor classification via sparsity-based representation,” IEEE Trans. on Sig. Proc., vol. 64, no. 9.
 [29] Patrick Lucey, Jeffrey F. Cohn, Takeo Kanade, Jason Saragih, and Zara Ambadar, “The Extended Cohn-Kanade Dataset (CK+): A complete dataset for action unit and emotion-specified expression,” in IEEE CVPR Workshops, 2010.