Generating faces for affect analysis

11/12/2018 · by Dimitrios Kollias, et al. · Middlesex University London, Imperial College London

This paper presents a novel approach for synthesizing facial affect, either categorical, in terms of the six basic expressions (i.e., anger, disgust, fear, happiness, sadness and surprise), or dimensional, in terms of valence (i.e., how positive or negative an emotion is) and arousal (i.e., the power of the emotion activation). In the valence-arousal case, the system is built on the VA annotation of 600,000 frames from the 4DFAB database; in the categorical case, it is built on the selection of apex frames of posed expression sequences from the 4DFAB. The proposed system accepts as input: i) either the basic facial expression or the pair of valence-arousal emotional state descriptors to be synthesized and ii) a neutral 2D image of the person on which the corresponding affect will be synthesized. The approach consists of the following steps: first, based on the desired emotional state, a set of 3D facial meshes is selected from the 4DFAB database and used to build a blendshape model that generates the new facial affect. To synthesize this affect on the 2D neutral image, 3D Morphable Model fitting is performed and the reconstructed face is then deformed to generate the target facial expression. Finally, the new face is rendered into the original image. Qualitative experiments illustrate the generation of realistic images when the neutral image is sampled from a variety of well-known lab-controlled or in-the-wild databases, including Aff-Wild, RECOLA, AffectNet, AFEW, Multi-PIE, AFEW-VA, BU-3DFE, Bosphorus and RAF-DB. Quantitative experiments are also conducted, in which deep neural networks, trained in a data-augmentation framework using the images generated from each of the above databases, perform affect recognition; better performance is achieved through the presented approach when compared with the current state-of-the-art.


1 Introduction

Rendering photorealistic facial expressions from single static faces, while preserving the identity information, is an open research topic which has significant impact on the area of affective computing. Generating faces of a specific person with different facial expressions can be used in various applications, including face recognition cao2018vggface2 parkhi2015deep, face verification sun2014deep taigman2014deepface, emotion prediction, expression database generation, facial expression augmentation and entertainment.

This paper describes a novel approach that uses an arbitrary face image with a neutral facial expression and synthesizes a new face image of the same person, but with a different expression, generated according to a categorical or dimensional emotion representation model. This problem cannot be tackled using small databases with labeled facial expressions, as it would be very difficult to disentangle facial expression from identity information using them. Our approach is based on the analysis of a large 4D facial database, the 4DFAB cheng4dfab, which we appropriately annotated and used for facial expression synthesis on a given subject’s face.

At first, a dimensional emotion model, in terms of the continuous variables valence (i.e., how positive or negative an emotion is) and arousal (i.e., the power of the emotion activation) whissell1989dictionary russell1978evidence, has been used to annotate a large set of 600,000 facial images. This model can represent not only primary, extreme expressions, but also the subtle expressions that are met in everyday human-to-human or human-to-machine interactions.

Additionally, a categorical emotion model, in terms of the six basic facial expressions, has been used, according to which the apex frames of posed expression sequences were selected from the 4DFAB. This selection was performed by identifying the micro-expressions (spontaneous and subtle facial movements that happen involuntarily, thus revealing one’s genuine, underlying emotion ekman1969nonverbal) of subjects in 4DFAB. Each micro-expression was broken down into three phases: onset, apex and offset, describing the beginning, peak and end of the emotion, respectively. Through this procedure we selected, in total, 12,000 expressions at peak time, including 2,000 cases for each of the six basic expressions.

The proposed approach accepts a pair of valence-arousal values, or a value indicating the basic facial expression to be synthesized on a given neutral 2D image of a person. It uses blendshape modeling to generate the new facial affect, 3DMM fitting to synthesize the affect on the 2D neutral image and face deformation to generate the targeted facial expression.

Section 2 reviews published work on facial expression synthesis. Section 3 presents the proposed approach for generating facial affect; we describe the annotation and use of the 4DFAB database and provide the pipeline of our approach in detail. Section 4 describes the thirteen databases, annotated with either a categorical or a dimensional emotion model, that have been used to produce synthesized images. In Section 5, we first provide a qualitative evaluation of our approach. Then, by using augmented data of faces from the databases described in the previous Section, we train deep neural networks to predict the valence-arousal values and basic emotion categories in these databases. Experimental results show that the proposed approach manages to synthesize photorealistic facial affect, whilst improving the performance of state-of-the-art affect prediction methods. Conclusions and future work are presented in Section 6.

2 Related Work

In the past several years, facial expression synthesis has been an active research topic. Existing facial expression synthesis methods can be roughly split into two categories. The first uses computer graphics techniques to either directly warp input faces to target expressions zhang2006geometry yang2012facial yeh2016semantic, or re-use sample patches of existing images mohammed2009visio. The second synthesizes images with predefined attributes through the creation of generative models ding2017exprgan susskind2008generating.

A lot of research in the first category has been devoted to the creation of correspondences between target images and existing facial textures. The approaches followed generated new expressions by: composing face patches from an existing expression database mohammed2009visio jonze1999being; warping face images via optical flow yang2012facial yang2011expression and feature correspondence theobald2009mapping; or creating fully textured 3D facial models pighin2006synthesizing blanz2003reanimating. In yeh2016semantic, the authors proposed to learn the optical flow using a variational autoencoder. Although such methods usually produce realistic images with high resolution, their elaborate, complex processes cause highly expensive computations. They have produced ways to synthesize facial expressions on virtual agents zhang2010facial, or to transfer facial expressions between different subjects, i.e., facial reenactment thies2015real. However, accurately synthesizing a wide variety of facial expressions on arbitrary real faces is still considered an open problem.

The second category of methods initially focused on using deconvolutional neural networks (DeCNNs) (see, e.g., https://zo7.github.io/blog/2016/09/25/generating-faces.html) or deep belief nets (DBNs) susskind2008generating, generating faces through interpolation of the facial images in the training sets. This, however, makes them inherently unsuited for facial expression generation in the case of unseen subjects.

With the recent development of Generative Adversarial Networks (GANs) goodfellow2014generative, image editing has migrated from pixel-level manipulations to semantic-level ones. GANs have been successfully applied to face image editing, for the modification of facial attributes yan2016attribute2image ghodrati2015towards, age modeling zhang2017age and pose adjustment huang2017beyond. These methods generally use the encoder of the GAN to find a low-dimensional representation of the face image in a latent space, manipulate the latent vector and finally decode it to generate the new image.

Popular approaches shift the latent vector along a direction corresponding to semantic attributes larsen2015autoencoding yeh2016semantic, or concatenate attribute labels with it zhang2017age yan2016attribute2image. Adversarial discriminator networks are used either at the encoder, to regularize the latent space makhzani2015adversarial, or at the decoder, to generate blur-free and realistic images larsen2015autoencoding, or at both the encoder and the decoder, such as in the Conditional Adversarial Autoencoder. All of these approaches require large training databases so that identity information can be properly disambiguated. Otherwise, when presented with an unseen face, the networks tend to generate faces that look like the “closest” subject in the training datasets.

It has been proposed to handle the above problem by warping images, rather than generating them from the latent vector yeh2016semantic. This approach achieves a high interpolation quality, but requires that the input expression is known, and thus fails when generating facial expressions that are “far apart”, e.g., angry faces from smiling ones. Moreover, it is hard to exert fine-grained control over the synthesized images, e.g., to widen the smile or narrow the eyes.

Conditional GANs (CGANs) mirza2014conditional and conditional variational autoencoders (CVAEs) sohn2015learning can generate samples conditioned on attribute information, when available. However, they suffer from mode collapse (i.e., the generator only outputs samples from a single mode, or with extremely low variety) and blurriness. Additionally, those networks must be trained using the whole training set with known attribute labels; it is not clear how to adapt them to new attributes without retraining from scratch.

The proposed approach has several novelties. First of all, to the best of our knowledge, it is the first time that the dimensional model of affect is taken into account when synthesizing face images. As verified in the experimental section of the paper, a large number of different expressions can be generated in the continuous 2D domain, given a pair of valence and arousal values; all other models produce synthesized images according to the six basic expressions, or a few more. We show that the presented approach can also be used to accurately synthesize the six basic expressions. Moreover, this is the first time that a 4D face database is annotated in terms of valence and arousal and then used for affect synthesis. Furthermore, there has been no attempt, until now, to use blendshape models for the synthesis of such data. Finally, the proposed approach works well when presented with a neutral face image obtained either in a controlled environment or in-the-wild, under different head poses of the person appearing in the image.

3 The Proposed Approach

This section begins by describing the 4DFAB database, its annotation in terms of valence-arousal and the selection of expressive categorical sequences from it, since the 4DFAB was used to construct the 3D facial expression gallery employed in our affect synthesis. Then, the framework of our methodology is presented, which includes the generation of new 3D facial affect from the 4DFAB and the facial affect synthesis for arbitrary 2D images.

3.1 The 4DFAB database

The 4DFAB database cheng4dfab is the first large-scale 4D face database designed for biometric applications and facial expression analysis. It consists of 180 subjects (60 female, 120 male) aged from 5 to 75 years. 4DFAB was collected over a period of 5 years, across four different sessions, and comprises over 1,800,000 3D faces. The database was designed to capture articulated facial actions and spontaneous facial behaviors, the latter being elicited by watching emotional video clips. In this paper, we use all 1,580 spontaneous expression sequences for dimensional emotion analysis and synthesis; these sequences cover a wide range of expressions as defined in compound_expr_pnas.

Moreover, to be able to develop categorical emotion synthesis, we selected 2,000 expressive 3D meshes per basic expression (12,000 meshes in total), corresponding to the apex frames of posed expression sequences in the 4DFAB. Examples are shown in Table 1.

Table 1: Examples from the 4DFAB of apex frames with posed expressions for the six basic expressions: Angry (AN), Disgust (DI), Fear (FE), Happy (HA), Sad (SA), Surprise (SU)

Next, in order to develop the novel dimensional expression synthesis method, we annotated all 1,580 dynamic 3D sequences (i.e., over 600,000 frames) in terms of the valence and arousal emotion dimensions, using the tool described in zafeiriou2017aff. Valence and arousal values ranged in [-1, 1]. Examples are shown in Fig. 1. In the rest of the paper, we refer to the 4DFAB database either as: i) the 600,000 frames with their corresponding 3D meshes, which are annotated with 2D valence and arousal (V-A) emotion values, or ii) the 12,000 apex frames of posed expressions with their corresponding 3D meshes, which carry categorical annotation.

Figure 1: The 2D Valence-Arousal Space and some representative frames of 4DFAB

As each 3D face in 4DFAB differed in the number and topology of its vertices, we first needed to bring all these meshes into a universal coordinate frame, namely a 3D face template; this step is usually referred to as establishing dense correspondences. We followed the same UV-based registration approach as in cheng4dfab, so as to bring all 600,000 meshes into full correspondence with the mean face of LSFM booth2018large. As a result, we created a new set of 600,000 3D faces that share an identical mesh topology, while maintaining their original facial expressions. In the following, this set constitutes the 3D facial expression gallery which we use for facial affect synthesis.

3.2 The methodology pipeline

The main novelty and contribution of this paper lies in the development of a fully automatic dimensional facial affect synthesis framework (depicted in Fig. 2). In the first part (Fig. 2(a)), assuming that the user inputs a target V-A pair, we aim at generating semantically correct 3D facial affect, using the generated 4D gallery. There are two key steps in this pipeline. The first includes data selection from the 4D face gallery and its utilization for facial affect generation. To this end, we discretize the 2D Valence-Arousal (V-A) Space into a large number of classes (please see Section 3.3.1 for more details). Each class contains the aligned meshes whose associated V-A pairs lie within the area of that class. Therefore, when a user provides a V-A pair, the corresponding class is computed and the data belonging to this class are retrieved. A blendshape model is then created using these data and the mean face is computed. Eventually, using this blendshape model, we generate an unseen 3D face with affect. The details of this part are described in Fig. 2(a) and Section 3.3.
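As a rough illustration of this first part, the sketch below retrieves the gallery meshes of the class that contains a requested V-A pair and returns their mean shape as the new expressive 3D face. The helpers va_grid.class_of and gallery.meshes_in_class are hypothetical stand-ins for the discretization of Section 3.3.1 and for the registered 4D gallery; the full approach additionally builds the sparse blendshape model of Section 3.3.3 from these meshes.

```python
import numpy as np

def generate_affect_mesh(valence, arousal, gallery, va_grid):
    """Sketch of pipeline part (a): map a (valence, arousal) pair to its class,
    retrieve the aligned meshes of that class and return their mean shape."""
    class_id = va_grid.class_of(valence, arousal)    # hypothetical grid-lookup helper
    meshes = gallery.meshes_in_class(class_id)       # hypothetical: array of shape (K, 3M)
    mean_face = np.asarray(meshes).mean(axis=0)      # mean shape of the class blendshape model
    return mean_face
```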

(a) Generate new facial affect, given a target V-A pair.
(b) Synthesize facial affect on a 2D neutral face.

Figure 2: The two main parts in our facial affect synthesis framework: (a) generating new facial affect from our 4D face gallery, given a target V-A value pair provided by the user; (b) synthesizing the facial affect (from part (a)) on an arbitrary 2D neutral face.

Fig. 2(b) describes the procedure of synthesizing a new facial affect on an arbitrary 2D face. As previously described, given a target V-A pair, we create an unseen expressive face without any identity, gender or age information. In this part, we want to transfer the affect of this expressive face to the face of another person, and consequently render a 2D expressive face without loss of identity. Three processing steps are needed to achieve this goal. The first is to perform 3DMM fitting booth2017itw3dmm so as to estimate the 3D shape of the target face. The second step is to transfer the facial affect from the synthesized 3D face to the reconstructed 3D face. Finally, the last step renders the new 3D face with affect into the original image frame. This procedure is described in detail in Section 3.4.

The methodology pipeline for synthesizing categorical affect is similar to that of Fig. 2. In Fig. 2(a), the only difference is that there is no V-A pair and no discretization of the 2D V-A Space; instead, users input the basic expression that they wish to synthesize. The corresponding 2,000 aligned meshes are retrieved and a blendshape model and the mean face are then computed. In Fig. 2(a), the user is assumed to input the ’happy’ basic expression.

Finally, let us note that, ideally, we could avoid discretizing the 2D V-A Space beforehand and, instead, apply the discretization on the fly. The latter means that, when the user inputs the V-A pair, at that moment we can do what Fig. 2(a) shows: i) find all frames belonging to a class defined with the V-A pair as the center of a square of fixed size in the 2D V-A Space, ii) build the blendshape model, iii) compute the mean face and then apply the procedure of Fig. 2(b). In this way, we could ideally synthesize an infinite number of expressions, dealing with a regression problem (since the V-A Space is continuous and not discrete). However, although we could potentially create an infinite number of expressions, this would make our methodology intractable and would not constitute a realistic application: all 600,000 3D meshes and their annotations would have to be stored, and for every new V-A pair a new blendshape model would have to be built at that time.

3.3 Generation of new 3D facial affect from 4DFAB

Here, we provide details regarding: i) the discretization of the Valence-Arousal space into 100 or 250 classes and ii) the creation of blendshape models and mean faces for both these classes and the basic expression ones.

3.3.1 Discretizing the 2D Valence-Arousal Space

The first step is to select the number of classes into which the 2D V-A Space will be discretized. We aim at including a sufficient number of facial data in each class, because, if each class contains only a few examples, it is more likely to include identity information; however, the synthesized facial affect should only describe the expression associated with the designated V-A value pair, rather than information about the person’s identity, gender or age.

At first, the 2D V-A Space was discretized into 100 classes, with each class covering a square of size 0.2 x 0.2. Fig. 3(b) shows the histogram of the 4DFAB annotations over the discretized Valence-Arousal Space. Fig. 3(a) illustrates the corresponding mean shapes of the blendshape models of various classes of this Space.

Figure 3: Case with 100 classes: (a) mean shapes of the blendshape models and their corresponding areas in the 2D Valence-Arousal Space; (b) 2D histogram of the 4DFAB annotations

Next, the 2D V-A Space was discretized into 250 classes: 200 classes corresponded to the first and second quadrants of the 2D V-A Space (where arousal is positive), each covering a square of size 0.1 x 0.1, while the remaining 50 classes corresponded to the third and fourth quadrants (where arousal is negative), each covering a square of size 0.2 x 0.2. Those 50 classes were the same as in the previous case (where the 2D V-A Space was discretized into 100 classes), because, as can be seen in the histogram of Fig. 3(b), there were not enough samples there. Fig. 4(b) shows the histogram of the 4DFAB annotations over this discretized Valence-Arousal Space, and Fig. 4(a) illustrates the corresponding mean blendshapes of various classes of this Space.

Figure 4: Case with 250 classes: (a) mean shapes of the blendshape models and their corresponding areas in the 2D Valence-Arousal Space; (b) 2D histogram of the 4DFAB annotations

Table 2 shows some mean shapes of blendshape models along with their areas in the Valence-Arousal Space, when this was discretized into 100 classes, versus the corresponding four classes and areas when the Valence-Arousal Space was discretized into 250 classes.

100-class discretization (V x A) and the corresponding four 250-class discretizations (V x A):
[0.8, 1] x [0.8, 1] → [0.8, 0.9] x [0.8, 0.9]; [0.8, 0.9] x [0.9, 1]; [0.9, 1] x [0.8, 0.9]; [0.9, 1] x [0.9, 1]
[-0.2, 0] x [0.6, 0.8] → [-0.1, 0] x [0.6, 0.7]; [-0.1, 0] x [0.7, 0.8]; [-0.2, -0.1] x [0.6, 0.7]; [-0.2, -0.1] x [0.7, 0.8]
[-0.8, -0.6] x [0.6, 0.8] → [-0.7, -0.6] x [0.6, 0.7]; [-0.7, -0.6] x [0.7, 0.8]; [-0.8, -0.7] x [0.6, 0.7]; [-0.8, -0.7] x [0.7, 0.8]
Table 2: Some mean shapes of blendshape models (shown as images in the original table) and their areas in the 2D Valence-Arousal Space when this was discretized into 100 classes, and the corresponding four mean shapes of blendshape models when the discretization was done into 250 classes. V and A stand for valence and arousal, respectively

Given those two different modes of discretization, it can be noted that consecutive, or adjacent, frames generally share similar annotations and display almost the same emotion. Consequently, we also chose not to take all such frames into account when building the blendshape models. It was, however, necessary to define the number of consecutive frames not being used. Two scenarios were tested to handle this problem: i) skipping 5 frames (corresponding to around 0.08 sec) throughout the 1,580 sequences of 4DFAB, resulting in 120,000 frames in total; ii) skipping a variable number of frames in each sequence, so that, in the end, each sequence contained only 60 frames (corresponding to 1 sec), resulting in 94,800 frames in total; in this case the skipping step equals L/60, with L being the length of each sequence.
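The two scenarios can be summarized by the frame-selection sketch below; the exact rounding of the variable step in the second scenario is an assumption on our part.

```python
def kept_frame_indices(num_frames, scenario):
    """Frame-selection sketch for the two scenarios described above.
    Scenario 1 keeps one frame every 5 (consistent with 600,000 -> 120,000 frames);
    scenario 2 keeps ~60 roughly equally spaced frames per sequence,
    i.e. a step of num_frames / 60."""
    if scenario == 1:
        return list(range(0, num_frames, 5))
    step = max(num_frames // 60, 1)
    return list(range(0, num_frames, step))[:60]
```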

Let us now discuss both scenarios for the case in which the Valence-Arousal Space was discretized into 100 classes. For some classes, the mean face was similar to the respective mean face obtained when all frames were kept, as can be seen in Fig. 5. However, there were a few classes whose mean face was not good, probably because the total number of samples belonging to them was not sufficient, as shown in Fig. 6.

Figure 5: Mean shapes of some blendshape models for the two frame-skipping scenarios

Figure 6: Failure cases of mean shapes of some blendshape models for the two frame-skipping scenarios

In the case where the Valence-Arousal Space was discretized into 250 classes, both scenarios failed, as the total number of samples belonging to many classes was not sufficient.

3.3.2 Categorical Emotion Synthesis case

Fig. 7 shows the mean shapes of the six blendshape models corresponding to the basic expressions.

Figure 7: The mean shapes of the blendshape models for each of the six basic expressions

3.3.3 Expression Blendshape Models

Expression blendshape models provide an effective way to parameterize facial behaviors and are frequently used in many computer vision applications. The localized blendshape model of neumann2013sparse was employed so as to describe the selected V-A samples. Each 3D mesh was subtracted from the neutral mesh of the corresponding sequence; the resulting difference vectors were stacked into a matrix X ∈ R^{N×3M}, where M is the number of vertices in the mesh. Afterwards, a variant of sparse Principal Component Analysis (PCA) was applied to the data matrix X, so as to identify sparse deformation components C:

argmin_{W,C} ‖X − WC‖²_F + Ω(C), subject to V(W_k),   (1)

where the constraint V(W_k) can be either max(W_k) = 1, min(W_k) = 0, or max(|W_k|) = 1, with W_k denoting the k-th component of the sparse weight matrix W. The selection between these two constraints depended on the actual usage; the major difference was that the latter allowed for negative weights and therefore enabled deformation towards both directions, which was useful for describing shapes like muscle bulges. The regularization Ω(C) of the sparse components C was performed with the ℓ1/ℓ2 norm Wright2009 Bach2012. To permit more local deformations, additional regularization parameters were added into Ω(C). To compute the optimal C and W, an iterative alternating optimization was employed (please refer to neumann2013sparse for more details).
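To illustrate the idea of sparse deformation components, the sketch below uses scikit-learn's SparsePCA as a simple stand-in for the localized variant of neumann2013sparse; it yields sparse components C and per-sample weights W, but has no spatial-locality regularizer or weight constraints, so it is not the authors' exact formulation.

```python
import numpy as np
from sklearn.decomposition import SparsePCA

def build_blendshape_model(X, n_components=10):
    """X: (N, 3M) matrix of difference vectors (expressive minus neutral mesh,
    flattened).  Returns weights W (N, K) and sparse components C (K, 3M)."""
    model = SparsePCA(n_components=n_components, alpha=1.0, random_state=0)
    W = model.fit_transform(X)     # per-sample weights
    C = model.components_          # sparse deformation components
    return W, C

def deform(neutral_mesh, C, weights):
    """Deform a flattened neutral mesh by a weighted sum of the components."""
    return neutral_mesh + weights @ C
```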

3.4 Facial affect synthesis for arbitrary 2D images

Given a Valence-Arousal pair of values, the aim was to modify the face in an arbitrary 2D image and generate a new facial image with affect. This procedure consisted of three steps: (1) fit a 3D Morphable Model (3DMM) to the image; (2) generate facial affect on the reconstructed 3D face; (3) blend the new face into the original image. Specifically, we started by performing 3DMM fitting booth2017itw3dmm on the 2D facial image, retrieving a reconstructed 3D face with its texture sampled from the original image. Next, we calculated the facial deformation as the difference between the synthesized face and the LSFM template, and imposed this deformation on the reconstructed mesh. Having thus generated a new 3D face with the target affect, the last step was to render it back into the original 2D image, where Poisson image blending Perez2003Siggraph was employed to produce a natural and realistic result.
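A minimal sketch of these three steps is given below. The fit_3dmm and render_mesh arguments are placeholders for a 3DMM fitting routine and a mesh renderer (e.g., pipelines in the spirit of booth2017itw3dmm), which we do not reproduce here; only the Poisson blending step uses a real OpenCV call.

```python
import numpy as np
import cv2

def synthesize_affect_on_image(img, affect_mean_shape, lsfm_template,
                               fit_3dmm, render_mesh):
    """Sketch of Section 3.4, assuming flattened mesh arrays of equal topology."""
    # 1) 3DMM fitting: reconstruct the subject's 3D shape and sampled texture.
    recon_mesh, texture = fit_3dmm(img)
    # 2) Affect transfer: add the deformation of the synthesized mean shape
    #    with respect to the LSFM template to the reconstructed mesh.
    new_mesh = recon_mesh + (affect_mean_shape - lsfm_template)
    # 3) Render the deformed face and blend it back into the original image.
    rendered, mask = render_mesh(new_mesh, texture, img.shape)   # mask: uint8, 0/255
    ys, xs = np.nonzero(mask)
    center = (int(xs.mean()), int(ys.mean()))                    # (x, y) of the face region
    return cv2.seamlessClone(rendered, img, mask, center, cv2.NORMAL_CLONE)
```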

4 Databases

To evaluate our facial affect synthesis method in different scenarios (e.g. controlled laboratory environment, uncontrolled in-the-wild setting), we utilized neutral facial images from as many as 13 databases. In the following, we briefly present the Multi-PIE, Aff-Wild, AFEW, AFEW-VA, BU-3DFE, RECOLA, AffectNet, RAF-DB, KF-ITW, Face place, FEI, 2D Face Sets and Bosphorus databases that we used in our experimental study.

Table 3 presents these databases, showing: i) the model of affect they use, their condition and the total number of frames they contain and ii) the total number of images that we synthesized using our approach (both in the valence-arousal and the six basic expressions cases).

Databases (DBs) | Model of Affect | Condition | DB size | Synthesized images (VA) | Synthesized images (Basic Expr)
Multi-PIE | Neutral, Surprise, Disgust, Smile, Squint, Scream | controlled | 755,370 | 52,254 | 5,520
KF-ITW | Neutral, Happiness, Surprise | in-the-wild | 3,264 | 116,235 | 12,236
FEI | Neutral, Smile | controlled | 2,800 | 11,400 | 1,200
Face place | 6 Basic Expr, Neutral, Confusion | controlled | 6,574 | 59,736 | 6,288
AFEW | 6 Basic Expr, Neutral | in-the-wild | 41,406 | 705,649 | 56,514
RECOLA | VA | controlled | 345,000 | 46,455 | 4,890
BU-3DFE | 6 Basic Expr, Neutral | controlled | 2,500 | 5,700 | 600
Bosphorus | 6 Basic Expr | controlled | 4,666 | 17,018 | 1,792
AffectNet | VA + 6 Basic Expr, Neutral + Contempt | in-the-wild | 450,000 manually annotated | 2,476,235 | 176,425
Aff-Wild | VA | in-the-wild | 1,180,000 | 60,135 | 6,330
AFEW-VA | VA | in-the-wild | 30,050 | 108,864 | 11,460
RAF-DB | 6 Basic, Neutral + 11 Compound Expr | in-the-wild | 15,339 + 3,954 | 121,866 | 12,828
2D Face Sets: Pain | 6 Basic, Neutral + 10 Pain Expr | controlled | 599 | 2,736 | 288
2D Face Sets: Iranian | Neutral, Smile | controlled | 369 | 2,679 | 282
2D Face Sets: Nottingham Scans | Neutral | controlled | 100 | 5,700 | 600
Table 3: Databases used in our approach, their model of affect, condition and size, and the number of images synthesized in the valence-arousal case and in the six basic expressions case

1) Multi-PIE gross2010multi : It contains 755,370 face images (3072x2048) of 337 people taken in a controlled environment across 15 different poses, 20 illuminations, 6 facial expressions (neutral, smile, surprise, squint, disgust and scream) and 4 different sessions.

2) Face place (stimulus images courtesy of Michael J. Tarr, Center for the Neural Basis of Cognition and Department of Psychology, Carnegie Mellon University, http://www.tarrlab.org/): It contains photographs of many different individuals in various types of disguises, such that, for each individual, there are multiple photographs in which hairstyle and/or eyeglasses have been changed or added. It consists of 1,284 images of Asian, 937 images of African-American, 3,362 images of Caucasian, 494 images of Hispanic and 497 images of multiracial people showing posed expressions (the 6 basic, neutral and confusion).

3) 2D Face Sets: We used 3 subsets from the 2D Face Sets database (http://pics.stir.ac.uk):

Iranian women: It consists of 369 color images (1200x900) of 34 women. The subjects display mostly smiling and neutral expressions in each of five poses.

Nottingham scans: It has 100 monochrome images (50 men, 50 women) in neutral and frontal pose. The image resolution varies from 358x463 to 468x536.

Pain expressions: It contains 599 color images (720x576) of 13 women and 10 men. They usually display two of the six basic emotions (anger, disgust, fear, sadness, happiness, surprise) plus 10 pain expressions. Profile neutral and 45-degree images are also available.

4) FEI: The FEI database thomaz2010new is a Brazilian face database that contains 200 individuals, each with 14 images, resulting in 2,800 images of size 640x480. All images are in color and were taken against a white background, with the subjects in an upright frontal position and with profile rotation of up to 180 degrees. The subjects display neutral and smiling expressions and are mostly students and staff at FEI, between 19 and 40 years old, with distinct appearance, hairstyle and adornments. There are 100 male and 100 female subjects.

5) Aff-Wild: Aff-Wild 1804.10938 zafeiriou2017aff is the first large-scale in-the-wild database, consisting of 298 YouTube videos with around 1,180,000 frames in total. The length of each video varies from 10 seconds to 15 minutes. These videos contain spontaneous facial behaviors elicited by a variety of stimuli in arbitrary recording conditions. There are 200 subjects (130 male and 70 female) of different ethnicities. Aff-Wild serves as the benchmark of the first Affect-in-the-wild Challenge (https://ibug.doc.ic.ac.uk/resources/first-affect-wild-challenge) kollias2017recognition. Each video was annotated by eight annotators in terms of valence and arousal, in the range of [-1, 1].

6) AFEW 5.0: This database is a dynamic facial expression corpus (used in the EmotiW Challenge 2017 dhall2017individual) consisting of 1,809 nearly real-world scenes from movies and reality TV shows. There are over 330 subjects, aged from 1 to 77. The database is split into three sets: training (773 videos), validation (383 videos) and test (653 videos). It is a challenging database, because both the training and validation sets come mainly from movies, while 114 out of the 653 test videos are from TV. Annotations for the neutral and the six basic expressions are provided.

7) AFEW-VA: Recently, a part of the AFEW database was annotated in terms of valence and arousal, thus creating the AFEW-VA kossaifi2017afew database. It includes 600 video clips selected from films with real-world conditions, i.e., occlusions, illumination changes and body movements. The length of each video ranges from around 10 frames to over 120 frames. This database provides per-frame annotations of V-A; in total, more than 30,000 frames were annotated for affect prediction of V-A, using discrete values in the range of [-10, 10].

8) BU-3DFE: The BU-3DFE database yin20063D is the first 3D facial expression database; it includes 2,500 expressive meshes from 100 subjects (56 female, 44 male), aged from 18 to 70. The subjects are from various ethnic/racial ancestries. They recorded the six articulated expressions (happiness, disgust, fear, anger, surprise and sadness) at four intensities; additionally, there is a neutral 3D scan per subject.

9) Kinect Fusion ITW: The KF-ITW database booth2017itw3dmm is the first Kinect 3D database which has many “in-the-wild” characteristics, even though it is captured indoors. This database consists of 17 different subjects performing some expressions (neutral, happy, surprise) under various illumination conditions.

10) Bosphorus: The Bosphorus database savran2008bosphorus consists of 105 subjects in various poses, expressions (the six basic expressions, no neutral) and occlusion conditions. 18 men have a beard/moustache and 15 others have short facial hair. There are 60 men and 45 women, mostly between 25 and 35 years old, and the majority of them are Caucasian; 27 professional actors/actresses are included in the database. The total number of face scans is 4,666, and each scan has been manually labeled with 24 facial landmarks.

11) RECOLA: The REmote COLlaborative and Affective database ringeval2013introducing contains natural and spontaneous emotions in the continuous domain (arousal and valence). The corpus includes four modalities: audio, visual, electro-dermal activity and electro-cardiogram. It consists of 46 French-speaking subjects recorded for 9.5 hours in total; the recordings were annotated, for 5 minutes each, by 6 French-speaking annotators.

Figure 8: (a)-(c) Synthesis of facial affect across all databases: the first rows show the neutral images, the second rows the corresponding synthesized images and the third rows the corresponding VA values or basic expressions.

12) AffectNet: The Affect from the InterNet database mollahosseini2017affectnet contains around 1,000,000 facial images downloaded from the Internet by querying three major search engines using 1,250 emotion-related keywords in six different languages. About half of the retrieved images were manually annotated by twelve human experts for the presence of seven discrete facial expressions and the intensity of valence and arousal. The average resolution of the face images and its standard deviation are reported in mollahosseini2017affectnet.

Figure 9: Synthesis of facial affect: on the left side are the neutral 2D images and on the right the synthesized images with different levels of affect

13) RAF-DB: The Real-world Affective Faces database li2017reliable contains about 30,000 facial images from thousands of individuals. Each image has been independently labeled about 40 times for the six basic expressions plus neutral, as well as for eleven compound expressions. The subjects range in age from a few months to 70 years old; 52% are female, 43% male and 5% unsure. The racial distribution is 77% Caucasian, 8% African-American and 15% Asian.

5 Experimental Study

This section describes the experiments performed so as to evaluate the proposed approach. At first, we provide a qualitative evaluation of our approach, by showing many synthesized images from all thirteen databases described in the previous Section and by comparing our synthesized images with real ones. Next, a quantitative evaluation is performed, by using the synthesized images as additional data for training deep neural networks (DNNs), achieving performances that outperform the current state-of-the-art results obtained on these databases.

5.1 Qualitative evaluation of achieved facial affect synthesis

We used all databases mentioned in Section 4 to supply the proposed approach with ’input’ neutral faces, and synthesized the emotional states of specific V-A value pairs for these images. One important requirement during this facial affect synthesis procedure has been to preserve the identity, age and gender of the original face. Instead of finding the closest matching sample (or the K nearest samples) for the given V-A pair, we categorized our 3D data based on the 2D V-A Space and then employed the mean expression of the area that contained the target V-A pair. Finally, we also synthesized images for the six basic expressions.

Fig. 8 shows representative results, organized in three parts. In each part, the first row illustrates some neutral images sampled from each of the aforementioned databases, the second one shows the respective synthesized images and the third the respective basic expressions or VA values that were synthesized. Moreover, Fig. 9 shows the neutral images on the left side (first column), and the synthesized images, with different valence and arousal values, on the right (following columns). It can be observed that the synthesized images are identity preserving, realistic and vivid. Also, Fig. 10 shows on the left side (first column) the neutral images, and on the right (following columns) the synthesized images with some basic expressions.

Figure 10: Synthesis of facial affect: on the left side are the neutral 2D images and on the right the synthesized images with some basic expressions

Figure 11: Comparison between real and synthesized images: for (a), (b) and (c), on the left (first column) are the real images, whereas on the right are the synthesized ones; for (c) we show multiple synthesized images

Finally, Fig. 11 compares real and synthesized images of people’s emotions. In the first column of (a), (b) and (c) are the real images (i.e., images of real reactions of people), whereas on the right are the images synthesized by our approach. In (c), we show the multiple synthesized images that looked most similar to the real image.

These Figures show that the proposed framework works well, when using images from either in-the-wild, or controlled databases. This indicates that we can effectively synthesize facial affect regardless of different image conditions (e.g., occlusions, illumination and head poses).

5.2 Quantitative evaluation of the facial affect synthesis

Here we present the results obtained when utilizing the synthesized data produced by our approach as additional data for training DNNs, both for valence-arousal prediction and for classification into the basic expressions. In particular, we present the networks used in our experiments, the adopted evaluation criteria and the experiments performed on eight databases. All results are state-of-the-art, validating the effectiveness of the proposed approach.

5.2.1 Leveraging synthesized data for training Deep Neural Networks: Valence-Arousal case

We used the synthesized faces to train deep neural networks for valence and arousal prediction on four facial affect databases annotated in terms of valence and arousal: Aff-Wild, RECOLA, AffectNet and AFEW-VA. The first step was to select neutral frames from these four databases; specifically, we selected frames with zero valence and arousal values (human inspection was also conducted to make sure that they represented neutral faces). Then, for each frame, we synthesized facial affect using the mean shape of the blendshape model of each class (as described in Section 3.3.1) and assigned the median valence and arousal values of that class to it.

5.2.1.1 Experiments on the Aff-Wild

Following the proposed approach, we synthesized 60,135 images from the Aff-Wild database and added those images to the training set of the first Affect-in-the-wild Challenge.

The network architecture that we employed here was that of AffWildNet (VGG-FACE-GRU) described in kollias2017recognition . The main evaluation criterion that we chose has been the one used in the Affect-in-the-wild Challenge, i.e., the Concordance Correlation Coefficient (CCC) lawrence1989concordance ; we also report the Mean Squared Error (MSE), since the teams also provided this criterion in the Challenge.

CCC evaluates the agreement between two time series by scaling their correlation coefficient with their mean square difference. CCC takes values in the range [-1, 1], where +1 indicates perfect concordance and -1 perfect discordance; therefore, high values are desired. CCC is defined as follows:

CCC = 2·s_xy / (s_x² + s_y² + (x̄ − ȳ)²),   (2)

where s_x² and s_y² are the variances of the ground truth and predicted values respectively, x̄ and ȳ are the corresponding mean values and s_xy is the respective covariance.

The Mean Squared Error (MSE) provides a simple comparative metric, with small values being desirable. MSE is defined as follows:

MSE = (1/N) · Σ_{i=1}^{N} (x_i − y_i)²,   (3)

where x_i and y_i are the ground truth and predicted values respectively and N is the total number of samples.
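For reference, the two criteria can be computed with a few lines of NumPy (a straightforward transcription of Eqs. (2) and (3), not the evaluation code used in the Challenge):

```python
import numpy as np

def ccc(x, y):
    """Concordance Correlation Coefficient (Eq. 2) between ground truth x and predictions y."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    sxy = np.mean((x - x.mean()) * (y - y.mean()))
    return 2.0 * sxy / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)

def mse(x, y):
    """Mean Squared Error (Eq. 3)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.mean((x - y) ** 2))
```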

Table 4 shows a comparison of the performance of: i) our VGG-FACE-GRU network trained with the augmented data, ii) the best performing network, AffWildNet, reported in kollias2017recognition and iii) the winner of the Aff-Wild Challenge weichi (Method FATAUVA-Net).

Networks | CCC Valence | CCC Arousal | MSE Valence | MSE Arousal
FATAUVA-Net weichi | 0.396 | 0.282 | 0.123 | 0.095
AffWildNet kollias2017recognition | 0.570 | 0.430 | 0.080 | 0.060
VGG-FACE-GRU trained on the augmented dataset | 0.591 | 0.442 | 0.074 | 0.051
Table 4: Aff-Wild: CCC and MSE evaluation of valence & arousal predictions provided by the VGG-FACE-GRU trained on the dataset augmented with images synthesized by our approach vs FATAUVA-Net weichi & AffWildNet kollias2017recognition. Note that valence and arousal values are in [-1, 1].

From Table 4, it can be verified that the network trained on the dataset augmented with synthesized images outperformed the networks trained without them. It should be noted that the boost in performance over the best performing network has not been large, about 0.8%, mainly because the number of synthesized images (around 60,000) was small compared to the size of Aff-Wild’s training set (around 900,000) and the training set size was already sufficient for training the best performing DNN.

5.2.1.2 Experiments on the RECOLA

Following our approach, we synthesized 46,455 images from the RECOLA database; this number corresponds to around 40% of its training data set size. We then added those images to the training set.

The network architecture that we employed here was the ResNet-GRU described in kollias2017recognition . The evaluation criterion was the CCC, which has been widely used in the AVEC Challenges valstar2016avec that have utilized this database.

Table 5 shows a comparison of the performance of: i) our ResNet-GRU network trained with the augmented data, ii) the AffWildNet fine-tuned on the RECOLA, as reported in kollias2017recognition and iii) a ResNet-GRU directly trained on RECOLA, as reported in kollias2017recognition .

Networks | CCC Valence | CCC Arousal
ResNet-GRU kollias2017recognition | 0.462 | 0.209
fine-tuned AffWildNet kollias2017recognition | 0.526 | 0.273
ResNet-GRU trained on the augmented dataset | 0.554 | 0.312
Table 5: RECOLA: CCC evaluation of valence & arousal predictions provided by the ResNet-GRU trained on the dataset augmented with images synthesized by our approach vs ResNet-GRU and fine-tuned AffWildNet kollias2017recognition

From Table 5, it can be verified that the network trained on the dataset augmented with synthesized images outperformed the networks trained without them. This boost in performance was much greater than that of the ResNet-GRU trained with the original training set; our network also outperformed the AffWildNet trained on the large Aff-Wild database and then fine-tuned on the RECOLA training set. These gains in performance can be justified by the fact that the number of synthesized images (around 46,500) was significant compared to the size of RECOLA’s training set (around 120,000) and that the original training set was not sufficient for training the DNNs.

5.2.1.3 Experiments on the AffectNet

It should be mentioned that the AffectNet database contains around 450,000 manually annotated and around 550,000 automatically annotated images for valence-arousal. We only used the manually annotated images, since our approach requires neutral images. As a result, we created 2,476,235 synthesized images from the AffectNet database, a number that is more than five times larger than the training set size; we added those images to the training set.

The network architecture that we employed here was the VGG-FACE network. At first, we trained a VGG-FACE network using the training set of the AffectNet database (let us call this network ’the VGG-FACE baseline’). Next, we trained a VGG-FACE network on the augmented dataset.

The evaluation criteria here were CCC, Pearson-CC (P-CC), Sign Agreement Metric (SAGR) and MSE.

The P-CC takes values in the range [-1, 1] and high values are desired. It is defined as follows:

P-CC = s_xy / (s_x · s_y),   (4)

where s_x² and s_y² are the variances of the ground truth and predicted values respectively and s_xy is the respective covariance.

The SAGR takes values in the range [0, 1], with high values being desirable. It is defined as follows:

SAGR = (1/N) · Σ_{i=1}^{N} δ(sign(x_i), sign(y_i)),   (5)

where N is the total number of samples, x_i and y_i are the ground truth and predicted values respectively, δ is the Kronecker delta function and sign(·) is defined as:

sign(x) = 1 if x > 0; 0 if x = 0; -1 if x < 0.   (6)
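As with CCC and MSE, these two criteria are direct to compute; the snippet below is a plain transcription of Eqs. (4)-(6):

```python
import numpy as np

def pearson_cc(x, y):
    """Pearson Correlation Coefficient (Eq. 4)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    sxy = np.mean((x - x.mean()) * (y - y.mean()))
    return sxy / (x.std() * y.std())

def sagr(x, y):
    """Sign Agreement Metric (Eq. 5): fraction of samples whose predicted
    value has the same sign as the ground truth."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.mean(np.sign(x) == np.sign(y)))
```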

Table 6 shows a comparison of the performance of: i) our VGG-FACE baseline network trained on the original data, ii) our VGG-FACE network trained with the augmented data and iii) the baseline network of the AffectNet database mollahosseini2017affectnet .

Networks | CCC Valence | CCC Arousal | P-CC Valence | P-CC Arousal | SAGR Valence | SAGR Arousal | MSE Valence | MSE Arousal
baseline mollahosseini2017affectnet | 0.60 | 0.34 | 0.66 | 0.54 | 0.74 | 0.65 | 0.14 | 0.17
our VGG-FACE baseline | 0.50 | 0.37 | 0.54 | 0.48 | 0.65 | 0.60 | 0.19 | 0.18
VGG-FACE trained on the augmented dataset | 0.62 | 0.54 | 0.66 | 0.55 | 0.78 | 0.75 | 0.14 | 0.15
Table 6: AffectNet: CCC, P-CC, SAGR and MSE evaluation of valence & arousal predictions provided by the VGG-FACE trained on the dataset augmented with images synthesized by our approach vs AffectNet’s baseline mollahosseini2017affectnet & our VGG-FACE baseline. Note that valence and arousal values are in [-1, 1].

From Table 6, it can be verified that the network trained on the dataset augmented with synthesized images outperformed the networks trained without them. At first, this boost in performance has been very large, across all evaluation criteria, compared to our baseline VGG-FACE network. The explanation lies in the large number of synthesized images that helped the network train and generalize better, since many V-A ranges were poorly represented in the original training set. This is better seen in the histograms of the manually annotated training set for valence and arousal, shown in Fig. 12(a) & (c), respectively.

Our network also outperformed AffectNet’s baseline, which was trained on the union of the manually annotated and the automatically annotated training sets. For arousal estimation, the performance gain was remarkable, mainly in the CCC and SAGR evaluation criteria, whereas for valence estimation the performance gain was also significant. The gain in arousal was much larger than that in valence because, as shown in Fig. 12 (b) & (d), the arousal annotations are heavily imbalanced, with most samples lying in the region [-0.1, 0.2]. Our synthesized data helped in having more samples in these areas, thus increasing the obtained performance.

Figure 12: Histograms of AffectNet’s valence and arousal annotations: (a) valence and (c) arousal for the manually annotated training set; (b) valence and (d) arousal for the union of the manually and automatically annotated training sets
5.2.1.4 Experiments on the AFEW-VA

Following our approach, we synthesized 108,864 images from the AFEW-VA database, a number that is more than 3.5 times bigger than its original size.

For training, we used the VGG-FACE-GRU architecture described in kollias2017recognition . Similarly to kossaifi2017afew , we used a 5-fold person-independent cross-validation strategy and at each fold we augmented the training set with the synthesized images of people appearing only in that set (preserving the person independence). The evaluation criteria were the P-CC and the MSE.

Table 7 shows a comparison of the performance of: i) our VGG-FACE-GRU network trained with the augmented data and ii) the best performing networks, as reported in kossaifi2017afew .

Networks | P-CC Valence | P-CC Arousal | MSE Valence | MSE Arousal
best of kossaifi2017afew | 0.407 | 0.450 | 6.96 | 4.97
VGG-FACE-GRU trained on the augmented dataset | 0.542 | 0.589 | 4.75 | 2.74
Table 7: AFEW-VA: P-CC and MSE evaluation of valence & arousal predictions provided by the VGG-FACE-GRU trained on the dataset augmented with images synthesized by our approach vs the best networks in kossaifi2017afew. Note that valence and arousal values are in [-10, 10].

From Table 7, it can be verified that the network trained on the dataset augmented with synthesized images outperformed the networks trained without them, achieving a large boost in performance. This gain can be justified by the fact that the number of synthesized images (around 109,000) is much greater than the number of images in the dataset (around 30,000), with the latter being rather small for effectively training DNNs.

5.2.2 Leveraging synthesized data for training Deep Neural Networks: Basic Expressions case

In the following experiments we used the synthesized faces to train deep neural networks for face classification into the six basic expressions, over four facial affect databases: RAF-DB, AffectNet, AFEW and BU-3DFE. Our first step was to select neutral frames from these four databases; then, for each frame, we synthesized facial affect using the six mean shapes of the blendshape models shown in Fig. 7.

5.2.2.1 Experiments on the RAF-DB

In this experiment, we considered the six basic expression cases only, since our approach synthesizes images based on these categories; we ignored compound expressions that were included in the data set.

In this framework, we created 12,828 synthesized images from the RAF-DB database, slightly more than its number of training images (12,271). We added these synthesized images to the original set, approximately doubling its size. We employed the VGG-FACE network and trained it on the augmented data set. Our evaluation criterion was the mean diagonal value of the confusion matrix.

Networks | Anger | Disgust | Fear | Happy | Sad | Surprise | Neutral | Average
mSVM-VGG-FACE li2017reliable | 0.685 | 0.275 | 0.351 | 0.853 | 0.649 | 0.663 | 0.599 | 0.582
LDA-VGG-FACE li2017reliable | 0.661 | 0.250 | 0.378 | 0.731 | 0.515 | 0.535 | 0.472 | 0.506
mSVM-DLP-CNN li2017reliable | 0.716 | 0.522 | 0.622 | 0.928 | 0.801 | 0.812 | 0.803 | 0.742
VGG-FACE trained on the augmented dataset | 0.784 | 0.644 | 0.622 | 0.911 | 0.812 | 0.845 | 0.806 | 0.775
Table 8: RAF-DB: The diagonal values of the confusion matrix for the six basic expressions plus neutral, and their average, using the VGG-FACE trained on the data set augmented with images synthesized by our approach, as well as other networks, including the best performing one on RAF-DB li2017reliable

For comparison purposes, we used the networks defined in li2017reliable: i) mSVM-VGG-FACE: the VGG-FACE was first trained on the RAF-DB database and then features from the penultimate fully connected layer were extracted and fed into a Support Vector Machine (SVM) that performed the classification; ii) LDA-VGG-FACE: same as before, but LDA was applied on the features extracted from the penultimate fully connected layer and performed the final classification; iii) mSVM-DLP-CNN: the designed Deep Locality Preserving CNN (we refer the interested reader to li2017reliable for more details) was first trained on the RAF-DB database and then an SVM performed the classification using the features extracted from the penultimate fully connected layer of this architecture.

Table 8 shows a comparison of the performance of the above described networks. It can be verified that the network trained on the dataset augmented with synthesized images outperformed the networks trained without them. When compared to the mSVM-VGG-FACE and LDA-VGG-FACE networks, the boost in performance was significant. This can be explained by the fact that the disgust and fear classes originally did not contain many images, but did after the synthesized data were added; this also resulted in better performance in the other classes. Interestingly, there was also a considerable performance gain in the neutral class, which did not contain any synthesized images. This can be explained by the fact that the network trained with the augmented data could better distinguish between the classes, since it had more samples in the two aforementioned categories.

5.2.2.2 Experiments on the AffectNet

Following our approach, we synthesized 176,425 images from the AffectNet database, a number that is almost 40% of its size, and added them to its training set. It should be mentioned that the AffectNet database contains the six basic expressions plus another one, contempt; our approach synthesizes images only for the basic expressions, so for the contempt class we kept only the original training data.

The network architecture that we employed here was VGG-FACE. At first, we trained a VGG-FACE network using the training set of the AffectNet database (let us call this network ’VGG-FACE baseline’). Next, we trained a VGG-FACE network on the augmented data set.

The evaluation criteria here were the total accuracy and the F1 score. The total accuracy is defined as the total number of correct predictions divided by the total number of samples. The F1 score is a weighted average of the recall (i.e., the ability of the classifier to find all the positive samples) and the precision (i.e., the ability of the classifier not to label as positive a sample that is negative); it reaches its best value at 1 and its worst at 0. In our multi-class problem, the reported F1 score is the unweighted mean of the F1 scores of the eight classes. The F1 score of each class is defined as:

F1 = 2 · (precision · recall) / (precision + recall).   (7)
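Both criteria are readily available in scikit-learn; a minimal example with dummy labels follows.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Dummy labels only, to illustrate the two criteria: total accuracy and the
# unweighted (macro) mean of the per-class F1 scores of Eq. (7).
y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 2, 2, 2, 1, 0])
print(accuracy_score(y_true, y_pred))             # total accuracy
print(f1_score(y_true, y_pred, average="macro"))  # macro-averaged F1 score
```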

Table 9 shows a comparison of the performance of: i) our VGG-FACE baseline network trained on the original data, ii) our VGG-FACE network trained with the augmented data and iii) the baseline network of the AffectNet database mollahosseini2017affectnet .

Networks | Total Accuracy | F1 score
baseline mollahosseini2017affectnet | 0.58 | 0.58
our VGG-FACE baseline | 0.52 | 0.51
VGG-FACE trained on the augmented dataset | 0.60 | 0.59
Table 9: AffectNet: Total accuracy and F1 score of the VGG-FACE trained on the dataset augmented with images synthesized by our approach vs AffectNet’s baseline mollahosseini2017affectnet & our VGG-FACE baseline

From Table 9, it can be verified that the network trained on the dataset augmented with synthesized images outperformed the networks trained without them. In more detail, when compared to our VGG-FACE baseline network, the boost in performance was significant; this can be explained by the large number of added synthesized images. When compared to AffectNet’s baseline, a slightly improved performance was also obtained; this could be higher if we had synthesized images for the contempt category as well.

5.2.2.3 Experiments on the AFEW

Following our approach, we synthesized 56,514 images from the AFEW database; this number is almost 1.4 times the size of its training set (41,406 images). We added those images to its training set.

The network architecture that we employed here was VGG-FACE, which was trained on the augmented dataset. The evaluation criterion here was total accuracy.

For comparison purposes, we used the following networks developed by the three winning methods of the EmotiW 2017 Grand Challenge: i) VGG-FACE: a VGG-FACE trained on the AFEW database, as described in knyazev2017convolutional ; ii) VGG-FACE-FER: a VGG-FACE first fine-tuned on the FER2013 database goodfellow2013challenges and then trained on the AFEW, as described in knyazev2017convolutional ; iii) VGG-FACE-external: a VGG-FACE trained on the union of the AFEW database and some external data, as described in vielzeuf2017temporal ; and iv) VGG-FACE-LSTM-external-augmentation: a VGG-FACE-LSTM trained on the union of the AFEW database and some external data, with additional data augmentation, as described in vielzeuf2017temporal .

Networks                                                   Total Accuracy
VGG-FACE knyazev2017convolutional                               0.379
VGG-FACE-external vielzeuf2017temporal                          0.414
VGG-FACE-FER knyazev2017convolutional                           0.483
VGG-FACE-LSTM-external-augmentation vielzeuf2017temporal        0.486
VGG-FACE trained on the augmented dataset                       0.484

Table 10: AFEW: Total accuracy of the VGG-FACE trained on the dataset augmented with images synthesized by our approach vs networks developed by the three winning methods of the EmotiW 2017 Grand Challenge knyazev2017convolutional vielzeuf2017temporal .

Table 10 shows a comparison of the performance of the above described networks. From Table 10, one can see that the VGG-FACE trained on the augmented dataset performed much better than the same network trained either only on the AFEW database, or on the union of the AFEW database with external data whose size, in terms of videos, was the same as that of AFEW. The boost in performance can be explained by the fact that the fear, disgust and surprise classes contained few samples in AFEW and that our approach augmented the size of those classes; overall, the large number of synthesized images helped improve the performance of the network.

Additionally, the performance of our network is slightly better than that of the same VGG-FACE network first fine-tuned on the FER2013 database and then trained on the AFEW. FER2013 is a database of around 35,000 still images of different identities, annotated with the six basic expressions. The network first fine-tuned on FER2013 had therefore seen more faces on a similar task; nevertheless, our network still provided slightly better performance.

On the other hand, our network performed slightly worse than the VGG-FACE-LSTM network that was trained with the same external data mentioned above and with additional data augmentation. In this case it was the LSTM which, due to its recurrent nature, could better exploit the fact that AFEW consists of video sequences.

5.2.2.4 Experiments on the BU-3DFE

Following our approach, we synthesized 600 images from the BU-3DFE database, a number that is almost one fourth of its size (2,500 images). BU-3DFE is a small database and, on its own, is not well suited for training DNNs.

The network architecture that we employed here was VGG-FACE, with a modification in the number of hidden units of the first two fully connected layers. Since we did not have a lot of data for training the network, we i) used 256 and 128 units in these two fully connected layers and ii) kept the convolutional weights fixed, training only the fully connected ones.
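A minimal PyTorch sketch of this modified network is given below; torchvision’s VGG-16 (which shares the VGG-FACE topology) is used as a stand-in, and loading the actual VGG-FACE weights, as well as the 7-class output assumed for BU-3DFE, are illustrative choices not specified above:

    import torch.nn as nn
    from torchvision import models

    # VGG-16 stands in for VGG-FACE here; the VGG-FACE weights would be
    # loaded separately before replacing the classifier.
    model = models.vgg16()

    # Keep the convolutional weights fixed; only the fully connected layers are trained.
    for p in model.features.parameters():
        p.requires_grad = False

    # First two fully connected layers reduced to 256 and 128 hidden units,
    # followed by the output layer (7 classes assumed: 6 basic expressions + neutral).
    model.classifier = nn.Sequential(
        nn.Linear(512 * 7 * 7, 256),
        nn.ReLU(inplace=True),
        nn.Dropout(0.5),
        nn.Linear(256, 128),
        nn.ReLU(inplace=True),
        nn.Dropout(0.5),
        nn.Linear(128, 7),
    )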

For training the network on this database, we used a 10-fold person-independent cross-validation strategy; in each fold, we augmented the training set with the synthesized images of people appearing only in that set (preserving person independence).
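The person-independent splitting and fold-wise augmentation could be organised as in the following sketch, which uses scikit-learn’s GroupKFold and randomly generated placeholder arrays (all names, sizes and label ranges are purely illustrative):

    import numpy as np
    from sklearn.model_selection import GroupKFold

    rng = np.random.default_rng(0)
    # Placeholder original data: features, expression labels and subject identities.
    feats = rng.normal(size=(100, 8))
    labels = rng.integers(0, 7, 100)
    subjects = rng.integers(0, 20, 100)
    # Placeholder synthesized data with the identity each image was generated from.
    synth_feats = rng.normal(size=(30, 8))
    synth_labels = rng.integers(0, 6, 30)
    synth_subjects = rng.integers(0, 20, 30)

    gkf = GroupKFold(n_splits=10)
    for train_idx, test_idx in gkf.split(feats, labels, groups=subjects):
        train_people = set(subjects[train_idx])
        # Add only synthesized images of people appearing in the training fold,
        # so person independence with respect to the test fold is preserved.
        keep = np.isin(synth_subjects, list(train_people))
        X_train = np.vstack([feats[train_idx], synth_feats[keep]])
        y_train = np.concatenate([labels[train_idx], synth_labels[keep]])
        X_test, y_test = feats[test_idx], labels[test_idx]
        # ... train the modified VGG-FACE on (X_train, y_train), evaluate on (X_test, y_test)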

The evaluation criterion was total accuracy; the reported value is the average of the total accuracies over the 10 folds.

At first, we trained the above described VGG-FACE network (let us call this network ’VGG-FACE baseline’). Next, we trained the above described VGG-FACE network, but also applied on-the-fly data augmentation techniques, such as: small rotations, left and right flipping, first resize and then random crop to original dimensions, random brightness and saturation (let us call this network ’VGG-FACE-augmentation’). Finally, we trained the above described VGG-FACE network with the synthesized data as well.
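The on-the-fly augmentation for ’VGG-FACE-augmentation’ could be expressed, for instance, with torchvision transforms as below; the concrete parameter values are assumptions, since only the types of transformations are listed above:

    from torchvision import transforms

    train_transform = transforms.Compose([
        transforms.Resize(256),                         # first resize ...
        transforms.RandomCrop(224),                     # ... then random crop back to the input size
        transforms.RandomHorizontalFlip(p=0.5),         # left/right flipping
        transforms.RandomRotation(degrees=10),          # small rotations
        transforms.ColorJitter(brightness=0.2,
                               saturation=0.2),         # random brightness and saturation
        transforms.ToTensor(),
    ])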

Networks                                        Total Accuracy
VGG-FACE baseline                                    0.528
VGG-FACE-augmentation                                0.588
VGG-FACE trained on the augmented dataset            0.768

Table 11: BU-3DFE: Total accuracy of the VGG-FACE trained on the dataset augmented with images synthesized by our approach vs the VGG-FACE baseline and the VGG-FACE trained with on-the-fly data augmentation.

Table 11 shows a comparison of the performance of these networks. From Table 11, it can be verified that the network trained on the dataset augmented with synthesized images greatly outperformed the networks trained without them. This indicates that the proposed approach for synthesizing images can be used for data augmentation when only a small amount of DNN training data is available, significantly improving the obtained performance.

6 Conclusions and Future Work

A novel approach to generating facial affect in faces has been presented in this paper. It leverages a dimensional emotion model in terms of valence and arousal, or the six basic expressions, together with a large scale 4D face database, the 4DFAB. An efficient method has been developed for matching different blendshape models on large amounts of images extracted from the database and for using these to render the appropriate facial affect on a selected face. A variety of faces and facial expressions from thirteen databases, annotated according to dimensional as well as categorical emotion models, has been examined in the experimental study. The proposed approach has been successfully applied to faces from all databases, rendering photorealistic facial expressions on them.

In our future work we will extend this approach to synthesize not only dimensional affect in faces, but also Facial Action Units. In this way, a global-local synthesis of facial affect will become possible, through a unified modeling of global dimensional emotion and local action-unit-based facial expression synthesis.

References

  • (1) Bach, F., Jenatton, R., Mairal, J., Obozinski, G.: Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning 4(1), 1–106 (2012). DOI 10.1561/2200000015. URL http://dx.doi.org/10.1561/2200000015
  • (2) Blanz, V., Basso, C., Poggio, T., Vetter, T.: Reanimating faces in images and video. In: Computer graphics forum, vol. 22, pp. 641–650. Wiley Online Library (2003)
  • (3) Booth, J., Antonakos, E., Ploumpis, S., Trigeorgis, G., Panagakis, Y., Zafeiriou, S.: 3d face morphable models ”in-the-wild”. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017). URL https://arxiv.org/abs/1701.05360
  • (4) Booth, J., Roussos, A., Ponniah, A., Dunaway, D., Zafeiriou, S.: Large scale 3d morphable models. International Journal of Computer Vision 126(2-4), 233–254 (2018)
  • (5) Cao, Q., Shen, L., Xie, W., Parkhi, O.M., Zisserman, A.: Vggface2: A dataset for recognising faces across pose and age. In: Automatic Face & Gesture Recognition (FG 2018), 2018 13th IEEE International Conference on, pp. 67–74. IEEE (2018)
  • (6) Chang, W.Y., Hsu, S.H., Chien, J.H.: Fatauva-net: An integrated deep learning framework for facial attribute recognition, action unit (au) detection, and valence-arousal estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (2017)
  • (7) Cheng, S., Kotsia, I., Pantic, M., Zafeiriou, S.: 4dfab: A large scale 4d database for facial expression analysis and biometric applications. In: 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018). Salt Lake City, Utah, US (2018)
  • (8) Dhall, A., Goecke, R., Ghosh, S., Joshi, J., Hoey, J., Gedeon, T.: From individual to group-level emotion recognition: Emotiw 5.0. In: Proceedings of the 19th ACM International Conference on Multimodal Interaction, pp. 524–528. ACM (2017)
  • (9) Ding, H., Sricharan, K., Chellappa, R.: Exprgan: Facial expression editing with controllable expression intensity. arXiv preprint arXiv:1709.03842 (2017)
  • (10) Du, S., Tao, Y., Martinez, A.: Compound facial expressions of emotion. Proceedings of the National Academy of Sciences of the United States of America 111(15), 1454–1462 (2014). DOI 10.1073/pnas.1322355111
  • (11) Ekman, P., Friesen, W.V.: Nonverbal leakage and clues to deception. Psychiatry 32(1), 88–106 (1969)
  • (12) Ghodrati, A., Jia, X., Pedersoli, M., Tuytelaars, T.: Towards automatic image editing: Learning to see another you. arXiv preprint arXiv:1511.08446 (2015)
  • (13) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in neural information processing systems, pp. 2672–2680 (2014)
  • (14) Goodfellow, I.J., Erhan, D., Carrier, P.L., Courville, A., Mirza, M., Hamner, B., Cukierski, W., Tang, Y., Thaler, D., Lee, D.H., et al.: Challenges in representation learning: A report on three machine learning contests. In: International Conference on Neural Information Processing, pp. 117–124. Springer (2013)
  • (15) Gross, R., Matthews, I., Cohn, J., Kanade, T., Baker, S.: Multi-pie. Image and Vision Computing 28(5), 807–813 (2010)
  • (16) Huang, R., Zhang, S., Li, T., He, R., et al.: Beyond face rotation: Global and local perception gan for photorealistic and identity preserving frontal view synthesis. arXiv preprint arXiv:1704.04086 (2017)
  • (17) Jonze, S., Cusack, J., Diaz, C., Keener, C., Kaufman, C.: Being John Malkovich. Universal Studios (1999)
  • (18) Knyazev, B., Shvetsov, R., Efremova, N., Kuharenko, A.: Convolutional neural networks pretrained on large face recognition datasets for emotion classification from video. arXiv preprint arXiv:1711.04598 (2017)
  • (19) Kollias, D., Nicolaou, M.A., Kotsia, I., Zhao, G., Zafeiriou, S.: Recognition of affect in the wild using deep neural networks. In: Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on, pp. 1972–1979. IEEE (2017)
  • (20) Kollias, D., Tzirakis, P., Nicolaou, M.A., Papaioannou, A., Zhao, G., Schuller, B., Kotsia, I., Zafeiriou, S.: Deep affect prediction in-the-wild: Aff-wild database and challenge, deep architectures, and beyond (2018)
  • (21) Kossaifi, J., Tzimiropoulos, G., Todorovic, S., Pantic, M.: Afew-va database for valence and arousal estimation in-the-wild. Image and Vision Computing (2017)
  • (22) Larsen, A.B.L., Sønderby, S.K., Larochelle, H., Winther, O.: Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300 (2015)
  • (23) Lin, L.I.K.: A concordance correlation coefficient to evaluate reproducibility. Biometrics 45(1), 255–268 (1989)
  • (24) Li, S., Deng, W., Du, J.: Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild. In: Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pp. 2584–2593. IEEE (2017)
  • (25) Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., Frey, B.: Adversarial autoencoders. arXiv preprint arXiv:1511.05644 (2015)
  • (26) Mirza, M., Osindero, S.: Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784 (2014)
  • (27) Mohammed, U., Prince, S.J., Kautz, J.: Visio-lization: generating novel facial images. ACM Transactions on Graphics (TOG) 28(3), 57 (2009)
  • (28) Mollahosseini, A., Hasani, B., Mahoor, M.H.: Affectnet: A database for facial expression, valence, and arousal computing in the wild. arXiv preprint arXiv:1708.03985 (2017)
  • (29) Neumann, T., Varanasi, K., Wenger, S., Wacker, M., Magnor, M., Theobalt, C.: Sparse localized deformation components. ACM Transactions on Graphics (TOG) 32(6), 179 (2013)
  • (30) Parkhi, O.M., Vedaldi, A., Zisserman, A.: Deep face recognition. In: BMVC, vol. 1, p. 6 (2015)
  • (31) Pérez, P., Gangnet, M., Blake, A.: Poisson image editing. In: ACM SIGGRAPH 2003 Papers, SIGGRAPH ’03, pp. 313–318. ACM, New York, NY, USA (2003). DOI 10.1145/1201775.882269. URL http://doi.acm.org/10.1145/1201775.882269
  • (32) Pighin, F., Hecker, J., Lischinski, D., Szeliski, R., Salesin, D.H.: Synthesizing realistic facial expressions from photographs. In: ACM SIGGRAPH 2006 Courses, p. 19. ACM (2006)
  • (33) Ringeval, F., Sonderegger, A., Sauer, J., Lalanne, D.: Introducing the recola multimodal corpus of remote collaborative and affective interactions. In: Automatic Face and Gesture Recognition (FG), 2013 10th IEEE International Conference and Workshops on, pp. 1–8. IEEE (2013)
  • (34) Russell, J.A.: Evidence of convergent validity on the dimensions of affect. Journal of personality and social psychology 36(10), 1152 (1978)
  • (35) Savran, A., Alyüz, N., Dibeklioğlu, H., Çeliktutan, O., Gökberk, B., Sankur, B., Akarun, L.: Bosphorus database for 3d face analysis. In: European Workshop on Biometrics and Identity Management, pp. 47–56. Springer (2008)
  • (36) Sohn, K., Lee, H., Yan, X.: Learning structured output representation using deep conditional generative models. In: Advances in Neural Information Processing Systems, pp. 3483–3491 (2015)
  • (37) Sun, Y., Chen, Y., Wang, X., Tang, X.: Deep learning face representation by joint identification-verification. In: Advances in neural information processing systems, pp. 1988–1996 (2014)
  • (38) Susskind, J.M., Hinton, G.E., Movellan, J.R., Anderson, A.K.: Generating facial expressions with deep belief nets. In: Affective Computing. InTech (2008)
  • (39) Taigman, Y., Yang, M., Ranzato, M., Wolf, L.: Deepface: Closing the gap to human-level performance in face verification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1701–1708 (2014)
  • (40) Theobald, B.J., Matthews, I., Mangini, M., Spies, J.R., Brick, T.R., Cohn, J.F., Boker, S.M.: Mapping and manipulating facial expression. Language and speech 52(2-3), 369–386 (2009)
  • (41) Thies, J., Zollhöfer, M., Nießner, M., Valgaerts, L., Stamminger, M., Theobalt, C.: Real-time expression transfer for facial reenactment. ACM Transactions on Graphics (TOG) 34(6), 183 (2015)
  • (42) Thomaz, C.E., Giraldi, G.A.: A new ranking method for principal components analysis and its application to face image analysis. Image and Vision Computing 28(6), 902–913 (2010)
  • (43) Valstar, M., Gratch, J., Schuller, B., Ringeval, F., Lalanne, D., Torres Torres, M., Scherer, S., Stratou, G., Cowie, R., Pantic, M.: Avec 2016: Depression, mood, and emotion recognition workshop and challenge. In: Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge, pp. 3–10. ACM (2016)
  • (44) Vielzeuf, V., Pateux, S., Jurie, F.: Temporal multimodal fusion for video emotion classification in the wild. arXiv preprint arXiv:1709.07200 (2017)
  • (45) Whissell, C.M.: The dictionary of affect in language. In: The measurement of emotions, pp. 113–131. Elsevier (1989)
  • (46) Wright, S.J., Nowak, R.D., Figueiredo, M.A.T.: Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing 57(7), 2479–2493 (2009). DOI 10.1109/TSP.2009.2016892
  • (47) Yan, X., Yang, J., Sohn, K., Lee, H.: Attribute2image: Conditional image generation from visual attributes. In: European Conference on Computer Vision, pp. 776–791. Springer (2016)
  • (48) Yang, F., Bourdev, L., Shechtman, E., Wang, J., Metaxas, D.: Facial expression editing in video using a temporally-smooth factorization. In: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 861–868. IEEE (2012)
  • (49) Yang, F., Wang, J., Shechtman, E., Bourdev, L., Metaxas, D.: Expression flow for 3d-aware face component transfer. ACM Transactions on Graphics (TOG) 30(4), 60 (2011)
  • (50) Yeh, R., Liu, Z., Goldman, D.B., Agarwala, A.: Semantic facial expression editing using autoencoded flow. arXiv preprint arXiv:1611.09961 (2016)
  • (51) Yin, L., Wei, X., Sun, Y., Wang, J., Rosato, M.J.: A 3d facial expression database for facial behavior research. In: Automatic face and gesture recognition, 2006. FGR 2006. 7th international conference on, pp. 211–216. IEEE (2006)
  • (52) Zafeiriou, S., Kollias, D., Nicolaou, M.A., Papaioannou, A., Zhao, G., Kotsia, I.: Aff-wild: Valence and arousal ‘in-the-wild’ challenge. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 1980–1987. IEEE (2017)
  • (53) Zhang, Q., Liu, Z., Guo, B., Terzopoulos, D., Shum, H.Y.: Geometry-driven photorealistic facial expression synthesis. IEEE Transactions on Visualization and Computer Graphics 12(1), 48–60 (2006)
  • (54) Zhang, S., Wu, Z., Meng, H.M., Cai, L.: Facial expression synthesis based on emotion dimensions for affective talking avatar. In: Modeling machine emotions for realizing intelligence, pp. 109–132. Springer (2010)
  • (55) Zhang, Z., Song, Y., Qi, H.: Age progression/regression by conditional adversarial autoencoder. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2 (2017)