icassp15_emotion
This repository contains the program associated with the paper with ID SPTM-P8.5 at IEEE ICASSP 2015, Sparsity and Optimization session. A pre-print is available at http://arxiv.org/abs/1410.1606.
In this paper, we design a Collaborative-Hierarchical Sparse and Low-Rank (C-HiSLR) model that is natural for recognizing human emotion in visual data. Previous attempts require explicit expression components, which are often unavailable and difficult to recover. Instead, our model exploits the low-rank property over expressive facial frames and rescues inexact sparse representations by incorporating group sparsity. For the CK+ dataset, C-HiSLR on raw expressive faces performs as competitively as the Sparse Representation based Classification (SRC) applied to manually prepared emotions. C-HiSLR performs even better than SRC in terms of true positive rate.
In this paper, the problem of interest is to recognize the emotion given a video of a human face and an emotion category [1]. As shown in Fig. 1, an expressive face can be separated into a dominant neutral face and a sparse expression component, which we term the emotion; it is usually encoded in a sparse noise term e. We investigate whether we can sparsely represent the emotion over a dictionary of emotions [2] rather than of expressive faces [3], which may confuse a similar expression with a similar identity [2]. Firstly, how do we get rid of the neutral face? Surely we can prepare an expression with a neutral face explicitly provided, as suggested in [2]. Instead, we treat an emotion as an action and assume the neutral face stays the same over time. If we stack the vectors of the neutral faces as a matrix, it should be low-rank (ideally of rank 1). Similarly, over time the sparse vectors of emotions form a sparse matrix. Secondly, how do we recover the low-rank and sparse components? In [4], the (low-rank) Principal Component Pursuit (PCP) [5] is performed explicitly. While the recovery is theoretically exact under certain conditions [5], it is approximate in practice. Finally, since we only care about the sparse component, can we avoid this approximate, explicit PCP step? This drives us to exploit the Sparse representation and the Low-Rank property jointly in one model, named SLR (Sec. 3.1).

Different from image-based methods [2, 4], we treat an emotion video as a multichannel signal. If we use only a single channel, such as one frame, to represent an emotion, much information is lost, since all frames collaboratively represent the emotion. Therefore, we prefer using all or most of them. Should we treat them separately or simultaneously? The former only requires recovering a sparse coefficient vector per frame. The latter gives a spatial-temporal representation, but requires recovering a sparse coefficient matrix, which often exhibits a specific structure. Should we enforce a class-wise sparsity separately, or a group sparsity collaboratively? [4] models the class-wise sparsity separately for recognizing a neutral face's identity and an expression image's emotion once they have been separated. Alternatively, we can exploit the low-rankness as well as structured sparsity through inter-channel observation. Since class decisions may be inconsistent across channels, we prefer a collaborative model [6] with group sparsity enforced [7]. This motivates us to introduce group sparsity as a root-level sparsity into the SLR model, embedded with a leaf-level atom-wise sparsity. The reason for keeping both levels is that signals over frames share class-wise, yet not necessarily atom-wise, sparsity patterns [8]. Therefore, we term this model the Collaborative-Hierarchical Sparse and Low-Rank (C-HiSLR) model.
When observing a random signal y for recognition, we hope to send the classifier a discriminative, compact representation x, which satisfies y = A x and yet is computed by pursuing the best reconstruction. When the dictionary A is under-complete, a closed-form approximate solution can be obtained by Least-Squares: x = (A^T A)^{-1} A^T y. In practice, we care more about how to recover x [16]. Enforcing sparsity is feasible, since a sparse x can be exactly recovered from y under certain conditions on A [17]. However, finding the sparsest solution, i.e., minimizing the ℓ0 norm of x, is NP-hard and difficult to solve exactly [18]. It is now well known that the ℓ1 norm is a good convex relaxation of sparsity: minimizing the ℓ1 norm induces the sparsest solution under mild conditions [19], and exact recovery is also guaranteed by ℓ1-minimization under suitable conditions [20]. A typical iterative greedy algorithm is Orthogonal Matching Pursuit (OMP) [16].
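As a concrete illustration of greedy sparse recovery, here is a minimal OMP sketch in Python with numpy; the dictionary A, signal y, and sparsity level k are generic placeholders rather than the paper's notation:

```python
import numpy as np

def omp(A, y, k):
    """Orthogonal Matching Pursuit: greedily pick up to k atoms of A,
    re-fitting their coefficients by least squares at each step."""
    residual = y.astype(float).copy()
    support = []
    x = np.zeros(A.shape[1])
    for _ in range(k):
        # select the atom most correlated with the current residual
        j = int(np.argmax(np.abs(A.T @ residual)))
        if j not in support:
            support.append(j)
        # least-squares re-fit on the selected support
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x[support] = coef
    return x
```

With a well-conditioned dictionary and a truly k-sparse signal, the support is typically recovered exactly.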
For a multichannel signal with dependent coefficients across channels [21], the coefficient matrix is low-rank. In an unsupervised manner, Sparse Subspace Clustering [22] represents the data over the data themselves with a sparse coefficient matrix, while Principal Component Analysis seeks a low-dimensional projection matrix.

In this section, we explain how to model a test video using the training data, which contain K types of emotions. We would like to classify the test video as one of the K classes.
First of all, we need an explicit representation of an expressive face. The matrix Y can be an arrangement of d-dimensional feature vectors y_i (i = 1, ..., τ), such as Gabor features [23] or concatenated raw image intensities [10], of the τ frames: Y = [y_1, ..., y_τ]. We emphasize our model's power by simply using the raw pixel intensities.
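This arrangement can be sketched directly with numpy, assuming grayscale frames of equal size (the helper name is ours):

```python
import numpy as np

def frames_to_matrix(frames):
    """Arrange a sequence of equal-sized grayscale frames as the columns
    of a matrix: each frame is flattened into a raw-intensity vector."""
    return np.stack([np.asarray(f, dtype=float).ravel() for f in frames],
                    axis=1)
```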
Now, we seek an implicit latent representation x of an input test face's emotion e as a sparse linear combination of prepared, fixed training emotions:

e = D x.

Since an expressive face y is a superposition of an emotion e and a neutral face y_n, we have

y = e + y_n.

Stacking the τ frames column-wise gives the matrix form, where L is ideally the τ-times repetition of the column vector y_n of a neutral face. Presumably rank(L) = 1. As shown in Fig. 2, Y is subject to

Y = D X + L,

where the dictionary matrix D is an arrangement of all K class-wise sub-matrices D_1, ..., D_K.
Only for training do we have the training emotions, obtained with neutral faces subtracted.
The above constraint on Y characterizes an affine transformation from the latent representation X to the observation Y. If we write X and Y in homogeneous forms [24], then we have

[Y; 1^T] = [D, y_n; 0^T, 1] [X; 1^T],

where 1 denotes the all-ones vector of length τ.
In the ideal case with rank(L) = 1, if the neutral face y_n is pre-obtained [2, 4], it is trivial to solve for X. Normally, y_n is unknown, and L is not exactly of rank 1 due to noise.
As X is supposed to be sparse and rank(L) is expected to be as small as possible (maybe even 1), intuitively our objective is to

min_{X,L} rank(L) + λ ||X||_0  s.t.  Y = D X + L,

where rank(L) can be seen as the sparsity of the vector formed by the singular values of L. Here λ is a non-negative weighting parameter we need to tune [25]. When L = 0, the optimization problem reduces to that in SRC. With both terms relaxed to their convex surrogates, the nuclear norm ||L||_* and the ℓ1 norm ||X||_1, we instead solve

min_{X,L} ||L||_* + λ ||X||_1  s.t.  Y = D X + L.   (1)
If there is no low-rank term L, (1) becomes a problem of multichannel Lasso (Least Absolute Shrinkage and Selection Operator). For a single-channel signal, Group Lasso [27] has explored the group structure for Lasso yet does not enforce sparsity within a group, while Sparse Group Lasso [28] yields an atom-wise sparsity as well as a group sparsity. [8] extends Sparse Group Lasso to the multichannel setting, resulting in the Collaborative-Hierarchical Lasso (C-HiLasso) model. For our problem, we do need the low-rank term L, which induces the Collaborative-Hierarchical Sparse and Low-Rank (C-HiSLR) model:
min_{X,L} ||L||_* + λ1 Σ_G ||X[G]||_F + λ2 ||X||_1  s.t.  Y = D X + L,   (2)

where X[G] is the sub-matrix formed by all the rows indexed by the elements in group G. As shown in Fig. 3, given a group G of indices, the sub-dictionary of columns indexed by G is denoted as D[G]; the groups form a non-overlapping partition of the row indices of X. Here ||·||_F denotes the Frobenius norm, which is the entry-wise ℓ2 norm as well as the Schatten-2 matrix norm and can be seen as a group's magnitude. λ1 is a non-negative weighting parameter for the group regularizer, which generalizes an ℓ1 regularizer (consider singleton groups) [8]. When λ1 = 0, C-HiSLR degenerates into SLR. When the low-rank term is dropped, we get back to collaborative Sparse Group Lasso.
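The two-level regularizer can be evaluated directly. A minimal sketch, assuming the groups index rows of the coefficient matrix (the function name is ours):

```python
import numpy as np

def hier_group_reg(X, groups, lam_group, lam_l1):
    """Hierarchical sparsity penalty: a group-level sum of Frobenius norms
    (group sparsity) plus an entry-wise l1 norm (atom-wise sparsity)."""
    group_term = sum(np.linalg.norm(X[g, :]) for g in groups)  # Frobenius per group
    l1_term = np.abs(X).sum()
    return lam_group * group_term + lam_l1 * l1_term
```

With singleton groups and unit weights, the group term reduces to a row-wise ℓ2 penalty, illustrating why the group regularizer generalizes ℓ1.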
Following SRC, for each class c, let D_c denote the sub-matrix of D which consists of all the columns of D that correspond to emotion class c, and similarly let X_c denote the corresponding rows of X. We classify Y by assigning it to the class with the minimal residual: c* = argmin_c ||Y − D_c X_c||_F.
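This decision rule can be sketched as follows (names are ours; class_index_sets lists, for each class, the dictionary columns and matching coefficient rows belonging to that class):

```python
import numpy as np

def classify_by_residual(Y, D, X, class_index_sets):
    """SRC-style decision: reconstruct Y from each class's sub-dictionary
    and sub-coefficients, then pick the class with minimal Frobenius residual."""
    residuals = [np.linalg.norm(Y - D[:, idx] @ X[idx, :])
                 for idx in class_index_sets]
    return int(np.argmin(residuals))
```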
Both the SLR and C-HiSLR models can be seen as solving

min_{X,L} h(X) + ||L||_*  s.t.  Y = D X + L,   (3)

where h(X) denotes the sparsity-inducing regularizer: λ ||X||_1 for SLR, and the hierarchical regularizer of (2) for C-HiSLR. To follow a standard iterative ADMM procedure [26], we write down the augmented Lagrangian function for (3) as

ℒ(X, L, Λ) = h(X) + ||L||_* + ⟨Λ, Y − D X − L⟩ + (β/2) ||Y − D X − L||_F^2,   (4)

where Λ is the matrix of multipliers, ⟨·, ·⟩ is the inner product, and β is a positive weighting parameter for the penalty (augmentation). A single update at the t-th iteration includes

L^(t+1) = argmin_L ℒ(X^(t), L, Λ^(t)),   (5)
X^(t+1) = argmin_X ℒ(X, L^(t+1), Λ^(t)),   (6)
Λ^(t+1) = Λ^(t) + β (Y − D X^(t+1) − L^(t+1)).   (7)

The sub-step of solving (5) has a closed-form solution:

L^(t+1) = U S_{1/β}[Σ] V^T, with U Σ V^T the SVD of Y − D X^(t) + Λ^(t)/β,   (8)

where S_α[·] is the shrinkage-thresholding operator, applied here to the singular values. In SLR, where the regularizer is λ ||X||_1, (6) is a Lasso problem, which we solve using an existing fast solver [29]. When the regularizer follows (2) of C-HiSLR, computing X^(t+1) needs an approximation based on the Taylor expansion at X^(t) [30, 8]. We refer the reader to [8] for the convergence analysis and recovery guarantee.
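The closed-form low-rank update is singular value thresholding, i.e., soft-thresholding applied to the singular values. A minimal numpy sketch:

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: the proximal operator of the nuclear
    norm. Shrinks each singular value of M by tau and truncates at zero."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt
```

For instance, svt(np.diag([5., 1.]), 2.) zeros out the smaller singular value and shrinks the larger one to 3, driving the matrix toward low rank.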
All experiments are conducted on the CK+ dataset [31], which consists of 321 emotion sequences with labels (angry, contempt, disgust, fear, happiness, sadness, surprise; contempt is discarded in [2, 4] due to its confusion with other classes) and is randomly divided into a training set (10 sequences per category) and a testing set (5 sequences per category). The cropped face data and the programs of C-HiSLR, SLR, SRC and Eigenface are available at http://www.cs.jhu.edu/~xxiang/slr/. For SRC, we assume that the information of the neutral face is provided: we subtract the first frame (a neutral face) from the last frame per sequence, for both training and testing, so that each emotion is represented as an image. However, for SLR and C-HiSLR, we assume no prior knowledge of the neutral face. We form a dictionary by subtracting the first frame from the last frames per sequence, and form a testing unit using the last frames together with the first frame, which is not explicitly known to be a neutral face. Thus, each emotion is represented as a video. Fig. 4 visualizes the recovery results given by C-HiSLR. Facial images are cropped using the Viola-Jones detector [32] and resized to a fixed size. As shown in Fig. 5, imperfect alignment may affect the performance.
Firstly, SRC achieves a total recognition rate of 0.80, against 0.80 for Eigenface with a nearest-subspace classifier and 0.72 for Eigenface with a nearest-neighbor classifier. This verifies that emotion is sparsely representable by the training data and that SRC can be an alternative to subspace-based methods. Secondly, Tables 1-3 present the confusion matrices, and Table 4 summarizes the true positive rate (i.e., sensitivity). We anticipated that SLR (0.70) would perform worse than SRC (0.80), since SRC is equipped with neutral faces. However, C-HiSLR's result (0.80) is comparable with SRC's, and C-HiSLR performs even better in terms of sensitivity, which verifies that the group sparsity indeed boosts the performance.

We have designed the C-HiSLR representation model for emotion recognition, unlike [2], which requires neutral faces as inputs, and [4], which generates labels of identity and emotion as mutual by-products with extra effort.
Our contribution is two-fold.
First, we do not recover emotion explicitly.
Instead, we treat frames simultaneously and implicitly subtract the low-rank neutral face.
Second, we preserve the label consistency by enforcing atom-wise as well as group sparsity.
For the CK+ dataset, C-HiSLR’s performance on raw data
is comparable with SRC given neutral faces,
which verifies that emotion is automatically separable from expressive faces as well as sparsely representable.
Future work includes handling misalignment [33] and incorporating dictionary learning [12].
ACKNOWLEDGMENTS
This work is supported by US National Science Foundation
under Grants CCF-1117545 and CCF-1422995, Army Research Office
under Grant 60219-MA, and Office of Naval Research under Grant
N00014-12-1-0765.
The first author is grateful for the fellowship from China Scholarship Council.
Table 1: Confusion matrix for C-HiSLR, with a standard deviation of 0.05.

| | An | Co | Di | Fe | Ha | Sa | Su |
|---|---|---|---|---|---|---|---|
| An | 0.77 | 0.01 | 0.09 | 0.02 | 0 | 0.07 | 0.04 |
| Co | 0.08 | 0.84 | 0 | 0 | 0.03 | 0.04 | 0 |
| Di | 0.05 | 0 | 0.93 | 0.01 | 0.01 | 0.01 | 0 |
| Fe | 0.09 | 0.01 | 0.03 | 0.53 | 0.12 | 0.07 | 0.15 |
| Ha | 0.01 | 0.02 | 0.01 | 0.02 | 0.93 | 0 | 0.03 |
| Sa | 0.19 | 0.02 | 0.02 | 0.05 | 0 | 0.65 | 0.07 |
| Su | 0 | 0.02 | 0 | 0.02 | 0 | 0.02 | 0.95 |
Table 2: Confusion matrix for SLR.

| | An | Co | Di | Fe | Ha | Sa | Su |
|---|---|---|---|---|---|---|---|
| An | 0.51 | 0 | 0.10 | 0.02 | 0 | 0.31 | 0.06 |
| Co | 0.03 | 0.63 | 0.03 | 0 | 0.04 | 0.26 | 0.01 |
| Di | 0.04 | 0 | 0.74 | 0.02 | 0.01 | 0.15 | 0.04 |
| Fe | 0.08 | 0 | 0.01 | 0.51 | 0.03 | 0.19 | 0.18 |
| Ha | 0 | 0.01 | 0 | 0.03 | 0.85 | 0.08 | 0.03 |
| Sa | 0.09 | 0 | 0.04 | 0.04 | 0 | 0.70 | 0.13 |
| Su | 0 | 0.01 | 0 | 0.02 | 0.01 | 0.02 | 0.94 |
Table 3: Confusion matrix for SRC.

| | An | Co | Di | Fe | Ha | Sa | Su |
|---|---|---|---|---|---|---|---|
| An | 0.71 | 0.01 | 0.07 | 0.02 | 0.01 | 0.03 | 0.16 |
| Co | 0.07 | 0.60 | 0.02 | 0 | 0.16 | 0.03 | 0.12 |
| Di | 0.04 | 0 | 0.93 | 0.02 | 0.01 | 0 | 0 |
| Fe | 0.16 | 0 | 0.09 | 0.25 | 0.25 | 0 | 0.26 |
| Ha | 0.01 | 0 | 0 | 0.01 | 0.96 | 0 | 0.02 |
| Sa | 0.22 | 0 | 0.13 | 0.01 | 0.04 | 0.24 | 0.35 |
| Su | 0 | 0.01 | 0 | 0 | 0.01 | 0 | 0.98 |
Table 4: True positive rate (sensitivity) per class.

| Model | An | Co | Di | Fe | Ha | Sa | Su |
|---|---|---|---|---|---|---|---|
| SRC | 0.71 | 0.60 | 0.93 | 0.25 | 0.96 | 0.24 | 0.98 |
| SLR | 0.51 | 0.63 | 0.74 | 0.51 | 0.85 | 0.70 | 0.94 |
| C-HiSLR | 0.77 | 0.84 | 0.93 | 0.53 | 0.93 | 0.65 | 0.95 |
“Robust face recognition via sparse representation,” IEEE T-PAMI, vol. 31, no. 2, pp. 210–227, 2009.
“Decoding by linear programming,” IEEE Trans. Inf. Theory, vol. 51, no. 12, pp. 4203–4215, 2005.
“Model selection and estimation in regression with grouped variables,” J. Royal Statistical Society, vol. 68, no. 1, pp. 49–67, 2006.
“Matlab Computer Vision System Toolbox,” http://www.mathworks.com/products/computer-vision/.
“Efficient region tracking with parametric models of geometry and illumination,” IEEE T-PAMI, vol. 20, no. 10, pp. 1025–1039, 1998.