Registration-free Face-SSD: Single shot analysis of smiles, facial attributes, and affect in the wild

02/11/2019 ∙ by Youngkyoon Jang, et al. ∙ University of Bristol

In this paper, we present Face-SSD, a novel single shot method for detecting faces and performing various face-related (classification/regression) tasks, including smile recognition, face attribute prediction and valence-arousal estimation, in the wild. Face-SSD uses a Fully Convolutional Neural Network (FCNN) to detect multiple faces of different sizes and to recognise or regress one or more face-related classes. Face-SSD has two parallel branches that share the same low-level filters, one branch dealing with face detection and the other with the face analysis tasks. The outputs of both branches are spatially aligned heatmaps that are produced in parallel; therefore, Face-SSD does not require face detection, facial region extraction, size normalisation, and facial region processing to be performed as subsequent steps. Our contributions are threefold: 1) Face-SSD is the first network to perform face analysis without relying on pre-processing such as face detection and registration in advance: Face-SSD is a single, simple FCNN architecture that simultaneously performs face detection and face-related task analysis, which are conventionally treated as separate consecutive tasks; 2) Face-SSD is a generalised architecture that is applicable to various face analysis tasks without modifying the network structure, in contrast to designing task-specific architectures; and 3) Face-SSD achieves real-time performance (21 FPS) even when detecting multiple faces and recognising multiple classes in a given image. Experimental results show that Face-SSD achieves state-of-the-art performance in various face analysis tasks, reaching a recognition accuracy of 95.76% for smile recognition, competitive accuracy for attribute prediction, and Root Mean Square (RMS) errors of 0.44 and 0.39 for valence and arousal estimation, respectively.




1 Introduction

(a) Smile Recognition
(b) Facial Attribute Prediction
(c) Valence-Arousal Estimation
Figure 1: Our system, which we refer to here as Face-SSD, detects faces and smiles, recognises facial attributes, and predicts affect along the valence and arousal dimensions, in the wild. (a) When detected faces are determined to be smiling, the colour of the black bounding box changes to red. The probability at the top of the box is the face confidence score, and the one in the middle of the box is the smile confidence score. The intensity of red corresponds to the level of confidence. (b) Attribute labels, sorted in descending order of prediction score, are displayed below the corresponding face bounding box. Attributes classified as “True” are displayed in yellow text, and in another colour otherwise. For sorting, we use a modified value of the predicted score for the attributes classified as “False”. (c) Horizontal and vertical bars indicate the degree of valence (VA) and arousal (AR), respectively. A bar starts in the middle of a bounding-box edge (representing a value of zero) and extends towards the corner along the positive (right / top) and negative (left / bottom) directions. The intensity of the colour (red for positive, blue for negative) corresponds to the magnitude of the predicted score. According to the content of the dataset, Face-SSD has been trained for a limited range of face sizes (see Sec. 4.3.3 for details). (Best viewed in colour)

Face analysis is one of the most studied areas in various research communities including Computer Vision (CV) and Affective Computing (AC). Cutting edge results are constantly obtained for various face-related analysis and recognition tasks including face detection (Zhang et al., 2017a, b, 2018), face recognition (Wu et al., 2017a), expression recognition (Li et al., 2017), valence-arousal estimation (Kossaifi et al., 2017), action unit detection (Zeng et al., 2016; Li et al., 2017), face attribute recognition (Liu et al., 2015; Hand and Chellappa, 2017), age estimation (Chen et al., 2017b; Hsu et al., 2017; Agustsson et al., 2017), landmark detection (Lv et al., 2017; Shen et al., 2015) and face alignment (Jourabloo and Liu, 2017). However, in order to get the best performance, recent studies design specific architectures for each individual face analysis task. Although some works propose unified frameworks for handling multiple face-related tasks (Wu et al., 2017b; Chang et al., 2017; Ranjan et al., 2017), several open issues remain yet to be explored:

  • Unconstrained conditions: Most of the existing approaches require a detected and normalised face input.

  • Scalability: Most methods design separate networks for different tasks. However, networks that are specifically designed to maximise the performance for certain tasks cannot be easily adapted to do other types of face analysis tasks.

  • Real-time performance: Existing methods do not achieve real-time performance because they require time-consuming preprocessing steps such as face detection and registration before performing face analysis.

In order to address the above-mentioned challenges, we propose Face-SSD, a network that simultaneously performs face detection and one or more face analysis tasks (see Fig. 1) in a single architecture. Face-SSD aims not only to detect faces in a given colour image (upper part in Fig. 2 (a)), but also to perform several other face analysis tasks (lower part in Fig. 2 (a)) associated with the detected faces. Similar to the SSD used for object detection (Liu et al., 2016), the proposed Face-SSD uses a pre-trained VGG16 network (Simonyan and Zisserman, 2015) to extract low-level features, as shown in Fig. 2 (a) [G1:G5]. Then, multi-scale convolution layers are added after the convolutional layers of the VGG16 to perform both classification (face classification and face analysis tasks) and regression (bounding box localisation) (see Fig. 2 (a) [G6:G10]). To the best of our knowledge, Face-SSD is the first single face network that can handle several face analysis tasks without a pre-normalisation step.

Figure 2: The architecture of Face-SSD. (a) The entire architecture of Face-SSD, consisting solely of convolution and pooling layers. (b) Example of the concatenated output convolution layer for the second scale, which produces a heatmap volume. (b-1) through (b-3) show how to modify Face-SSD for binary classification (smile recognition), multi-class recognition (facial attribute prediction), and multi-task regression (valence-arousal estimation). (c) Legend for the layers and parts of Face-SSD.

The proposed architecture is trained and evaluated using well-known benchmark datasets for face detection (AFLW (Koestinger et al., 2011)), smile recognition (GENKI-4K (2009), CelebA (Liu et al., 2015)), facial attribute prediction (CelebA (Liu et al., 2015)) and valence-arousal estimation (AffectNet (Mollahosseini et al., 2017)). As discussed in Sec. 3.1.1, we first obtain a set of matched default boxes as proposed by Liu et al. (Liu et al., 2016). Then, we train Face-SSD by optimising multiple losses associated with face classification, bounding box regression, and a face-related task such as smile recognition, facial attribute prediction or valence-arousal estimation. We adopt data augmentation and Hard Negative Mining (HNM) strategies, and achieve state-of-the-art or very competitive performance in various face analysis tasks without modifying the structure, while maintaining real-time performance.

The main contributions of our work are three-fold:

  1. Unconstrained processing: Face-SSD does not rely on a pre-normalisation step: it requires neither face detection nor registration in advance. In contrast, most of the existing approaches to face analysis require a cropped or normalised face as input.

  2. Universal architecture: Face-SSD can be applied to most face analysis tasks with a simple modification (the number of final prediction channels), and achieves state-of-the-art or very competitive results. Most of the existing approaches use separate, task-specific networks.

  3. Real-time processing: Face-SSD can be trivially extended to perform several face analysis tasks with negligible additional processing time.

The remainder of the paper is organised as follows: In Section 2, related work in face analysis that rely on registration, require task-specific model design, or handle multiple tasks is reviewed. In Section 3, we present the proposed Face-SSD framework and explain how to apply Face-SSD to several face applications. Experimental results of the applications using the proposed Face-SSD are provided in Section 4. Finally, conclusions are drawn and discussed in Section 5.

2 Related Work

General pipeline for facial analysis. Sariyanidi et al. (Sariyanidi et al., 2015) discuss the state-of-the-art methods for face registration, representation, dimensionality reduction and recognition, which are the common components of a generic pipeline for automatic facial affect analysis. Depending on the target application, the generic pipeline might have to be adapted to some degree. Nonetheless, the first two steps, face localisation and 2D/3D registration, have been necessary for most face analysis tasks such as smile recognition, facial attribute prediction, valence-arousal estimation, gender recognition, age prediction, and head pose estimation. See (Sariyanidi et al., 2015; Tam et al., 2013; Wang et al., 2014; Zhang and Zhang, 2010) for details.

Registration-based face analysis.

Despite significant advances in deep learning, automatic face analysis tasks, such as smile detection (a comparative review is provided in Table 3), attribute prediction (Hand and Chellappa, 2017; Han et al., 2017; Sethi et al., 2018) and valence-arousal estimation (Mollahosseini et al., 2017), still face major challenges caused by occlusions and variations in head pose, scale, and illumination. These challenges are the main reason why every state-of-the-art approach to face analysis requires a pre-normalisation step involving face detection and registration (rotation, scaling, and 2D/3D transformation).

Approaches without pre-normalisation. There exist some works that process the original input image without pre-normalisation steps. Liu et al. (Liu et al., 2015) combine LNet, which localises a face, with ANet, which predicts facial attributes. However, they use EdgeBox (Zitnick and Dollar, 2014), which proposes a number of candidate windows, to determine the final facial region among the multiple predicted positions scattered by LNet. Before the output of LNet is fed to ANet, this narrowing of the potential face region is performed several times through several LNet stages. This method takes an actual image as input, but the processes inside the architecture operate as a sequential pipeline.

Ranjan et al. (Ranjan et al., 2017) proposed a deep neural network consisting of multiple branches to handle various face-related tasks. The network uses Selective Search (van de Sande et al., 2011) to generate multi-region proposals. Although it deals with face classification and face-related tasks in different branches, the face classification and the multiple face analysis tasks are performed as separate, consecutive pipelines. In summary, most of the previous works that do not require a pre-normalisation step follow mechanisms similar to (Liu et al., 2015; Ranjan et al., 2017), which require region proposal steps in the middle of the process. These region proposal steps typically increase the overall processing time.

Task-specific model design. Several recent approaches address the problem of facial attribute prediction (Han et al., 2017; Sethi et al., 2018). Some propose to use successful face-specific feature representations (Zhong et al., 2016), modelling of class distributions (Ehrlich et al., 2016) and balancing of attributes (Rudd et al., 2016), indirectly guiding the categorisation of similar features (Wang et al., 2016; Sethi et al., 2018), or directly grouping the relevant attributes (Hand and Chellappa, 2017; Han et al., 2017). The best performances are obtained by specifically designing a model structure that utilises the relations between relevant attributes (Hand and Chellappa, 2017; Han et al., 2017). For this purpose, MCNN-AUX (Hand and Chellappa, 2017) uses implicit and explicit attribute relationships, while DMTL (Han et al., 2017) relies on attribute correlation and heterogeneity. R-Codean (Sethi et al., 2018) proposes a new loss function that incorporates both the magnitude and the direction of image vectors in feature learning, within a framework that incorporates a patch-based weighting mechanism. By assigning higher weights to the patches relevant to each attribute, the method has advantages similar to grouping relevant attributes.

Compared to the previous studies proposing task-specific models, we propose a generalised architecture that can be used for attribute prediction as well as other face analysis tasks. Unlike the state-of-the-art methods in face attribute prediction, the proposed Face-SSD uses similar face-size categories associated with each output layer, and incorporates the Hide-and-Seek (Singh and Lee, 2017) data augmentation method, which forces the network to explore the entire face area more extensively during training. This size-based categorisation and simple data augmentation strategy enable us to achieve accuracy close to the state of the art in attribute prediction without customising the network.

Multi-task facial analysis. While most works on facial analysis use a specially designed architecture to tackle a single application, some works attempt to use an integrated single architecture with multiple branches to address multiple tasks. Ranjan et al. (Ranjan et al., 2017) categorised face-related tasks into two groups: subject-independent tasks (e.g., keypoint detection, pose and smile) and subject-dependent tasks (e.g., gender and facial identity). The all-in-one network (Ranjan et al., 2017) learns multiple tasks in a single architecture, but first uses subject-independent class results to register faces. Then the network performs subject-dependent classification tasks sequentially.

Chang et al. (Chang et al., 2017) proposed FATAUVA-Net to learn multiple tasks related to affect behaviours in a single deep neural network. Similarly to the all-in-one network (Ranjan et al., 2017), FATAUVA-Net categorised similar tasks that share a feature layer. For example, the network branches eye-related tasks (attributes: eyeglasses, narrow eyes / Action Unit (AU): AU6, AU7, AU45) from the same previous layer. The network branches mouth-related tasks (attributes: mouth slightly open, smile / AU: AU23, AU24, AU25) from other layers extracting mouth-related features. The network branches the valence and arousal prediction layer from the associated AU layers. Although these architectures predict multiple face analysis tasks, it is still difficult to generalise or use them in other tasks that take advantage of large patterns for face detection and use small patterns for other face analysis tasks.

Object (face) detection in the wild. The proposed Face-SSD is inspired by SSD (Liu et al., 2016), considering the face as a specific type of object. SSD (Liu et al., 2016) has been applied and extended in many research domains, including text detection (He et al., 2017), face detection (Zhang et al., 2017b), object pose estimation (Poirson et al., 2016; Kehl et al., 2017) and temporal action detection (Lin et al., 2017). Similar to the latest methods, Face-SSD uses the concept of a default box (Liu et al., 2016) or anchor box (Ren et al., 2016). Using a baseline architecture that has been successfully applied to various detection tasks, we propose the first SSD-inherited architecture that tackles both large-pattern-leveraged tasks (e.g., face detection) and small-pattern-leveraged tasks (e.g., face analysis) in parallel.

3 The Proposed Framework: Face-SSD

The proposed Face-SSD framework is shown in Fig. 2. Face-SSD is a fully convolutional neural network consisting solely of convolutional and pooling layers. The input to Face-SSD is a colour image. There are six output layers, each corresponding to a certain scale, that is, to a certain size of face. At each scale, the output consists of a heatmap containing, at each spatial position, the confidence score that a face is present at that location; a heatmap with the parameters of the bounding box of the face associated with that position; and a heatmap with the face analysis task confidence score(s) at every spatial location, as shown in Fig. 2(b). At test time, a threshold applied to the face detection confidence heatmap selects candidate faces at several spatial locations. Subsequently, Non-Maximum Suppression (NMS) (Neubeck and Van Gool, 2006) is used to derive the bounding boxes, within each of which the scores for the face analysis tasks and for face detection are calculated.

The following sections describe how to configure Face-SSD (Sec. 3.1), how to train face detection and face analysis task in a single architecture (Sec. 3.2), and how to combine the face detection and analysis results during testing (Sec. 3.3).

3.1 Model Construction

Face-SSD consists of layers performing, at various stages, feature extraction (VGG16 Conv. Layers), face detection, and face analysis, as shown in Fig. 2(a). Each group G[1:10] represents convolution and pooling layers with the same input resolution. For example, G2 consists of two convolution layers and one pooling layer, whereas G6 consists of two convolution layers. Similarly to SSD (Liu et al., 2016), Face-SSD outputs six-scale heatmap volumes generated by multiple output convolution layers [(f1, t1):(f6, t6)]. f[1:6] is produced by the face detection part, while t[1:6] is produced by the face analysis part. The output convolution layers of the two parts are aligned and concatenated at the end.

Each concatenated output convolution layer outputs a pixel-wise heatmap volume. For example, the concatenated output convolution layer for the second scale outputs a three-dimensional volume consisting of heatmap planes with the same resolution as the second scale, as shown in Fig. 2(b). The first plane indicates the existence of a face. The next four heatmap planes contain, at each spatial position, the centre of the face bounding box and its width and height; the former are relative to the location (i.e., they are offsets) and the latter are relative to the current heatmap scale. The remaining heatmap planes are the confidences for the face analysis tasks; note that these are also heatmaps, that is, they have spatial dimensions as well.
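The channel layout described above (one face-confidence plane, four bounding-box planes, then one plane per analysis task) can be sketched as follows. This is our own illustration, not the authors' code; the helper name and the 38×38 resolution are illustrative choices only.

```python
import numpy as np

def decode_heatmap_volume(volume, n_tasks):
    """Split a (1 + 4 + n_tasks, H, W) heatmap volume into its parts.

    Plane 0      : face confidence scores.
    Planes 1-4   : bounding-box offsets (cx, cy) and sizes (w, h).
    Planes 5 ... : one confidence heatmap per face analysis task.
    """
    face_conf = volume[0]                # (H, W)
    bbox = volume[1:5]                   # (4, H, W)
    task_conf = volume[5:5 + n_tasks]    # (n_tasks, H, W)
    return face_conf, bbox, task_conf

# Example: a second-scale volume for smile recognition (one task plane).
H = W = 38  # illustrative resolution only
volume = np.random.rand(1 + 4 + 1, H, W)
face_conf, bbox, task_conf = decode_heatmap_volume(volume, n_tasks=1)
```

Because the planes are spatially aligned, the bounding box and task score at any face location can be read off at the same pixel coordinates as the face confidence.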

All convolution layers are followed by a ReLU activation function, except for the output convolution layers. For the output convolution layers, for binary classification tasks such as face classification, smile recognition and attribute prediction, we use the sigmoid function (see Fig. 2(b), (b-1) and (b-2), respectively). For regression tasks such as bounding box offsets and valence-arousal estimation, we use linear activations, similarly to SSD (Liu et al., 2016) (see Fig. 2(b) and (b-3)). The parameters for the layers in Face-SSD are summarised in Table 1. The parameters of each convolution layer are denoted in the order: number of kernels, kernel size, stride and padding, while the parameters of each pool layer follow the order: kernel size, stride and padding.

During training, the prediction values that appear in the heatmaps responsible for the bounding box and the face analysis tasks are examined only when the corresponding face label exists at that pixel (see details in Sec. 3.2.1). During testing, the values of the bounding box and the task-related outputs are examined only when the corresponding face confidence score exceeds a threshold. The face detection threshold is determined by selecting the value that gives the best performance on the face detection task.

Group ID Conv. ID: Parameters Pool
G1 [1:2]: (64, 3, 1, 1) (2, 2, 0)
G2 [1:2]: (128, 3, 1, 1) (2, 2, 0)
G3 [1:3]: (256, 3, 1, 1) (2, 2, 0)
G4 [1:3]: (512, 3, 1, 1) (2, 2, 0)
G5 [1:3]: (512, 3, 1, 1) (3, 1, 1)
G6 1: (1024, 3, 1, 1)
2: (1024, 1, 1, 0)
G7 1: (256, 1, 1, 0)
2: (512, 3, 2, 1)
G8 1: (128, 1, 1, 0)
2: (256, 3, 2, 1)
G9 1: (128, 1, 1, 0)
2: (256, 3, 1, 0)
G10 1: (128, 1, 1, 0)
2: (256, 3, 1, 0)
Out. Conv. face conf.: (1, 3, 1, 1)
bbox: (4, 3, 1, 1)
task: (n, 3, 1, 1)
Table 1: The detailed parameters of Face-SSD layers (see text)
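As a quick consistency check on the stride and padding values in Table 1, the standard output-size formula for a convolution or pooling layer can be used to trace how each group changes the spatial resolution. The helper below is our own illustration; the example map sizes are typical of an SSD-style pyramid and are not stated in the excerpt.

```python
def out_size(in_size, kernel, stride, padding):
    """Spatial output size of a conv/pool layer (floor convention)."""
    return (in_size + 2 * padding - kernel) // stride + 1

# G7's second convolution in Table 1 (512 kernels, kernel 3, stride 2,
# padding 1) halves the resolution, e.g. a 19x19 map becomes 10x10:
print(out_size(19, 3, 2, 1))  # prints 10
```

With the Table 1 parameters, the stride-1, padding-0 3×3 convolutions in G9 and G10 keep shrinking the map (e.g. 5 → 3 → 1), which is consistent with Face-SSD producing six output scales.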

3.1.1 Implementation details

Single aspect ratio: We utilise only one aspect ratio (square) when configuring a default box to assign a ground truth label to a pixel position in a heatmap, as shown in Fig. 3. This is because face deformations, caused by expression and pose, result in similar aspect ratios. This is in accordance with related work in the literature: e.g., Hao et al. (Hao et al., 2017) proposed a Single-Scale RPN utilising one anchor box, and Zhang et al. (Zhang et al., 2017b) proposed SFD utilising one default box.
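The single-aspect-ratio matching can be sketched as below. This is our own illustration, not the authors' code; the 0.5 overlap threshold is the conventional default-box matching value from SSD and is an assumption here.

```python
def iou(box_a, box_b):
    """Jaccard overlap of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def match_square_default_boxes(gt_box, centres, side, threshold=0.5):
    """Flag the square default boxes (single aspect ratio) whose
    overlap with the ground-truth face box exceeds the threshold."""
    flags = []
    for cx, cy in centres:
        default = (cx - side / 2, cy - side / 2, cx + side / 2, cy + side / 2)
        flags.append(1 if iou(default, gt_box) > threshold else 0)
    return flags

# A 100x100 ground-truth face matched against two default-box centres:
gt = (50, 50, 150, 150)
flags = match_square_default_boxes(gt, [(100, 100), (300, 300)], side=100)
```

Only the heatmap pixel whose square default box overlaps the face receives a positive label, which is exactly the per-pixel labelling used for the face confidence heatmaps.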

Usage of pre-trained models: Several works, including Liu et al. (Liu et al., 2015), demonstrate that models pre-trained on object recognition (e.g., on ImageNet (Deng et al., 2009)) are useful for face localisation. Similarly, networks pre-trained on face recognition (e.g., on CelebFaces (Sun et al., 2014)) are useful for capturing face attributes at a more detailed level. For this reason, we selectively use pre-trained parameters (trained with an object dataset (Russakovsky et al., 2015; Simonyan and Zisserman, 2015) and a face dataset (Koestinger et al., 2011)) to initialise the convolution filters for the face detection and analysis tasks (see details in Sec. 3.2). This usage of pre-trained models helps improve Face-SSD performance for both face detection (utilising large patterns) and face analysis (utilising relatively smaller patterns).

3.2 Training

Training of Face-SSD proceeds in the following four steps:

  1. Copying the parameters of the convolution layers of the VGG16 network (Simonyan and Zisserman, 2015) to the VGG16 (feature extraction) part of Face-SSD, and subsampling¹ the parameters of the fully connected layers of the VGG16 network into the corresponding convolution layers of Face-SSD, as described in SSD (Liu et al., 2016).

    ¹For example, the first fully connected layer of the VGG16 network connects all the positions of its input feature map, where each spatial location holds a feature (kernel) vector, to an output vector, so its weights can be organised as a tensor over the output dimension and the spatial and channel dimensions of the input. Face-SSD instead uses a convolution layer whose weight tensor is smaller along the output and spatial modes. In order to initialise it, we uniformly subsample the fully connected weight tensor along each of its modes, that is, along the dimension of the output feature vector and along each spatial dimension of the input tensor of the VGG16 network, and copy the corresponding weights.
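The weight-subsampling surgery in step 1 can be sketched as follows. This is our own illustration of SSD-style subsampling, not the authors' code; the tensor shape is the standard VGG16 fc6 layout and the factors 4 and 3 are illustrative.

```python
import numpy as np

def subsample_fc_to_conv(fc_weights, out_factor, spatial_factor):
    """Uniformly subsample a fully connected weight tensor, organised as
    (out_dim, in_h, in_w, in_ch), along its output and spatial modes to
    initialise a smaller convolution kernel (SSD-style surgery)."""
    return fc_weights[::out_factor, ::spatial_factor, ::spatial_factor, :]

# Illustrative numbers matching VGG16's first FC layer reshaped over a
# 7x7x512 input: subsampling by 4 (output) and 3 (spatial) yields a
# 1024-kernel, 3x3 convolution weight tensor.
fc = np.zeros((4096, 7, 7, 512), dtype=np.float32)
conv = subsample_fc_to_conv(fc, out_factor=4, spatial_factor=3)
```

Because the subsampling is uniform along each mode, the copied weights keep the spatial arrangement of the original fully connected layer, which is what makes the converted layer a reasonable initialisation.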

  2. Freezing the face analysis part and finetuning the face detection part by using the AFLW (face) dataset (Koestinger et al., 2011).

  3. Copying the parameters of the layers constituting the face detection part to the corresponding layers of the face analysis part.

  4. Freezing the face detection part and finetuning the layers constituting the face analysis part by using task-related datasets (e.g., CelebA (Liu et al., 2015) or GENKI-4K (2009) for smile recognition, CelebA (Liu et al., 2015) for facial attribute prediction, and AffectNet (Mollahosseini et al., 2017) for valence-arousal estimation).

The first and second steps are similar to the initialisation and end-to-end learning process of SSD network (Liu et al., 2016). We use the same cost function as the SSD to finetune the face detection part of Face-SSD.

Figure 3: Example of a matched default box for the face confidence heatmaps produced by the output convolution layers (see Fig. 2). (a) Dotted boxes (grey) represent multiple candidate default boxes with different aspect ratios. (b) Face-SSD uses only one aspect ratio in the matching process of the default box. The example image is a sample from the AFLW dataset (Koestinger et al., 2011).

3.2.1 Face Detection

As described above, finetuning of the face detection part is based on an objective loss function $L_{det}$, a weighted sum of the face classification loss $L_{cls}$ and the bounding box regression loss $L_{reg}$, defined as:

$$L_{det} = \frac{1}{N}\left(L_{cls} + \alpha L_{reg}\right), \qquad (1)$$

where $N$ is the total number of matched default boxes. For the regression loss $L_{reg}$, the smooth L1 loss (Girshick, 2015) is used for calculating the distance between the predicted and the ground truth bounding boxes (Liu et al., 2016), as shown in Eqs. 2 and 3. Specifically,

$$L_{reg} = \sum_{i} \sum_{m \in \{c_x, c_y, w, h\}} x_i \, \mathrm{smooth}_{L1}\!\left(\hat{b}_i^{\,m} - b_i^{\,m}\right), \qquad (2)$$

$$\mathrm{smooth}_{L1}(z) = \begin{cases} 0.5\,z^2 & \text{if } |z| < 1 \\ |z| - 0.5 & \text{otherwise,} \end{cases} \qquad (3)$$

where $\hat{b}_i$ and $b_i$ denote the predicted and ground truth bounding box parameters at location $i$. The face classification loss $L_{cls}$ is the binary cross entropy over the face confidence scores $\hat{c}_i$, as shown in Eq. 4:

$$L_{cls} = -\sum_{i}\left[\, x_i \log \hat{c}_i + (1 - x_i) \log\left(1 - \hat{c}_i\right) \right]. \qquad (4)$$

The flag $x_i$, used in the equations above, is set to 1 when the overlap between the ground truth and the default bounding box exceeds a threshold. Note that the regression loss is only used when $x_i = 1$, and is disabled otherwise.
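A minimal NumPy sketch of this detection objective (binary cross entropy for face classification plus smooth L1 for box regression, normalised by the number of matched boxes) follows. It is our own illustration, not the authors' implementation; variable names and the example values are ours.

```python
import numpy as np

def smooth_l1(z):
    """Smooth L1 loss of Girshick (2015), applied elementwise."""
    z = np.abs(z)
    return np.where(z < 1.0, 0.5 * z ** 2, z - 0.5)

def detection_loss(face_conf, bbox_pred, bbox_gt, match_flags, alpha=1.0):
    """Weighted sum of face classification (BCE) and box regression
    (smooth L1) losses, normalised by the number of matched boxes."""
    n = max(match_flags.sum(), 1)
    eps = 1e-7
    cls = -(match_flags * np.log(face_conf + eps)
            + (1 - match_flags) * np.log(1 - face_conf + eps)).sum()
    # The regression term only counts locations whose default box matched.
    reg = (match_flags[:, None] * smooth_l1(bbox_pred - bbox_gt)).sum()
    return (cls + alpha * reg) / n

scores = np.array([0.9, 0.1])            # predicted face confidences
flags = np.array([1.0, 0.0])             # matched-default-box flags
pred = np.array([[0.5, 0.5, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0]])
gt = np.array([[0.5, 0.5, 1.0, 1.0], [9.0, 9.0, 9.0, 9.0]])
loss = detection_loss(scores, pred, gt, flags)
```

Note how the unmatched location (flag 0) contributes only a background classification term: its wildly wrong box regression is masked out, mirroring the flag's role in the loss.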

At the later stages of the training, similar to (Liu et al., 2016), we use Hard Negative Mining (HNM): we sort the losses calculated in the background region (where $x_i = 0$) in descending order, and select and backpropagate only the highest ones. Following (Liu et al., 2016), we set the loss-balancing weight $\alpha$ in Eq. 1 to 1.
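The HNM selection can be sketched as follows. This is our own illustration; the 3:1 negative-to-positive ratio is SSD's common setting and is an assumption here, since the excerpt does not restate the exact ratio.

```python
import numpy as np

def hard_negative_mining(background_losses, neg_pos_ratio, n_positives):
    """Keep only the hardest negatives: sort the per-location losses
    computed in the background region in descending order and select
    at most neg_pos_ratio * n_positives of them."""
    k = min(len(background_losses), neg_pos_ratio * n_positives)
    order = np.argsort(background_losses)[::-1]   # descending
    return order[:k]                              # indices to backpropagate

bg = np.array([0.1, 2.0, 0.5, 1.5])
kept = hard_negative_mining(bg, neg_pos_ratio=3, n_positives=1)
```

Only the selected background locations contribute gradient, which keeps the easy negatives (the vast majority of heatmap pixels) from swamping the positive face signal.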

3.2.2 Face Analysis

This section describes how to apply Face-SSD to various face analysis tasks. We address three problems: smile recognition as binary classification, facial attribute prediction as multi-class recognition, and valence-arousal estimation as multi-task regression. In all three problems, the architecture of the network differs only in the number of facial task heatmaps. For datasets that have multiple annotations for the same image, Face-SSD supports multi-task learning by defining a multi-task loss function as in Eq. 5:

$$L_{task} = \left\| \left( \lambda_1 L_{t_1}, \ldots, \lambda_T L_{t_T} \right) \right\|. \qquad (5)$$

That is, the multi-task loss is defined as the norm of the vector of weighted individual face analysis task losses $L_{t}$. Each $L_{t}$ calculates the error between the ground truth and the prediction for a given task $t$, and $T$ denotes the total number of face analysis tasks. In what follows we define the loss functions used for the different problems we address.

Smile Recognition. The smile classification loss $L_{smile}$ is the binary cross entropy over the smile confidence scores $\hat{s}_i$ and the ground truth $s_i$, as defined in Eq. 6:

$$L_{smile} = -\sum_{i}\left[\, s_i \log \hat{s}_i + (1 - s_i) \log\left(1 - \hat{s}_i\right) \right]. \qquad (6)$$

The ground truth at each location is set using the default box matching strategy (Liu et al., 2016). The loss is defined at each spatial location of the output heatmap, and in this case we do not use Hard Negative Mining (HNM), which was required to select negative samples for face detection (see Sec. 3.2.1).

Finetuning the network for a face analysis task (e.g., smile recognition) does not impair the face detection performance, because the parameters of the face detection part of Face-SSD are frozen.

Facial Attribute Prediction. Facial attribute prediction is treated as multiple binary classification problems, where a number of attributes may exist simultaneously. For example, a face attribute (such as smiling) can appear independently of other attributes (such as gender or hair colour). Therefore, we define the facial attribute prediction loss $L_{attr}$ as the average of independent attribute losses, that is

$$L_{attr} = \frac{1}{A} \sum_{a=1}^{A} L_{attr_a}, \qquad (7)$$

where $A$ denotes the total number of attributes, and $y_a$ and $\hat{y}_a$ denote the ground truth (1 or 0) label and the predicted confidence score of the $a$-th attribute, respectively. For calculating the single attribute prediction loss $L_{attr_a}$ associated with an individual attribute $a$, we use the binary cross entropy over the attribute confidence scores $\hat{y}_a$.
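The averaged per-attribute binary cross entropy described above can be sketched in a few lines (our own illustration; the three-attribute example is arbitrary):

```python
import numpy as np

def attribute_loss(y_true, y_pred, eps=1e-7):
    """Average of independent per-attribute binary cross entropy losses."""
    bce = -(y_true * np.log(y_pred + eps)
            + (1 - y_true) * np.log(1 - y_pred + eps))
    return bce.mean()   # mean over the attributes

y = np.array([1.0, 0.0, 1.0])   # ground-truth labels for 3 attributes
p = np.array([0.9, 0.2, 0.8])   # predicted confidence scores
loss = attribute_loss(y, p)
```

Averaging rather than summing keeps the loss magnitude independent of the number of attributes, so the same training setup works whether one attribute is predicted or forty.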

Valence and Arousal Estimation. Similar to several previous works (e.g., (Koelstra and Patras, 2013), (Mollahosseini et al., 2017)), we treat arousal and valence prediction as a regression problem. Valence is related to the degree of positiveness of the affective state, whereas arousal is related to the degree of excitement (Russell, 2003; Russell and Barrett, 1999). We use the Euclidean (L2) distance between the predicted value $\hat{v}$ (resp. $\hat{a}$) and the ground truth value $v$ (resp. $a$) of valence (resp. arousal), as shown in Eq. 8. The loss is the sum of the valence and arousal losses, that is

$$L_{VA} = \frac{1}{N} \sum_{i=1}^{N} \left( v_i - \hat{v}_i \right)^2 + \frac{1}{N} \sum_{i=1}^{N} \left( a_i - \hat{a}_i \right)^2, \qquad (8)$$

where $N$ is the number of image samples in a mini-batch.
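The valence-arousal regression loss (sum of the mean squared valence and arousal errors over a mini-batch) can be sketched as follows; this is our own illustration with arbitrary example values.

```python
import numpy as np

def va_loss(v_true, v_pred, a_true, a_pred):
    """Sum of the mean squared valence and arousal errors over a batch."""
    n = len(v_true)
    lv = ((v_true - v_pred) ** 2).sum() / n
    la = ((a_true - a_pred) ** 2).sum() / n
    return lv + la

# A mini-batch of two samples (valence/arousal values in [-1, 1]):
v_t = np.array([0.5, -0.2]); v_p = np.array([0.4, -0.2])
a_t = np.array([0.1, 0.3]);  a_p = np.array([0.1, 0.1])
loss = va_loss(v_t, v_p, a_t, a_p)
```

Treating the two dimensions as separate squared-error terms lets the same output-layer change (two linear regression heatmaps) cover both predictions at once.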

3.2.3 Data Augmentation in Training

Face-SSD takes a fixed-resolution, three-channel colour input image. Prior to data augmentation, all pixel values of the R, G, and B channels of a sample image are normalised using the mean and standard deviation of the entire dataset. Each sample image is first flipped horizontally with a probability of 0.5. During training, we randomly select one of four data augmentation mechanisms (shrinking, cropping, gamma correction and Hide-and-Seek (H-a-S) (Singh and Lee, 2017)) to create noisy data samples for each epoch.

Both shrinking and cropping maintain the aspect ratio. Gamma correction is applied separately to the individual R, G and B channels. In Hide-and-Seek (H-a-S) (Singh and Lee, 2017), we hide image subareas and force the network to seek more context in areas that are not as discriminative as key distinctive areas such as lip corners. We first randomly select a division number d; the image region is then divided into d × d sub-image patches. Each sub-image patch is hidden (filled with the mean R, G, B values of all data samples in the dataset) with a fixed probability.
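The Hide-and-Seek patch hiding described above can be sketched as follows. This is our own illustration; the grid size, hiding probability and fill value are parameters whose exact values the excerpt does not specify.

```python
import numpy as np

def hide_and_seek(image, divisions, hide_prob, fill_value, rng):
    """Hide-and-Seek style augmentation: split the image into a grid of
    divisions x divisions patches and fill each patch with the dataset
    mean colour with probability hide_prob."""
    out = image.copy()
    h, w = image.shape[:2]
    ph, pw = h // divisions, w // divisions
    for i in range(divisions):
        for j in range(divisions):
            if rng.random() < hide_prob:
                out[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw] = fill_value
    return out

rng = np.random.default_rng(0)
img = np.ones((8, 8, 3), dtype=np.float32)
aug = hide_and_seek(img, divisions=4, hide_prob=0.5, fill_value=0.0, rng=rng)
```

Filling hidden patches with the dataset mean (rather than zeros) keeps the input statistics close to those the normalised network expects, which is why the mean colour is used.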

3.3 Testing

The registration-free Face-SSD for a specific face analysis task (e.g., smile recognition) is based on both the face and the task (e.g., smile) confidence scores. First, the locations in the face confidence heatmap for which the score exceeds a threshold are selected. Then the Non-Maximum Suppression (NMS) method (with the same Jaccard overlap threshold as in SFD (Zhang et al., 2017b)) is used to extract the final bounding boxes. Subsequently, a task-specific threshold is applied to the task-related score of each final bounding box (Fig. 4). In the case of regression (e.g., valence-arousal estimation), the output value of the final bounding box is used directly.
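The greedy NMS used at test time can be sketched as below. This is our own illustration, not the authors' code; the 0.3 overlap threshold is illustrative (the paper follows SFD's setting).

```python
import numpy as np

def nms(boxes, scores, overlap_threshold):
    """Greedy Non-Maximum Suppression on (x1, y1, x2, y2) boxes."""
    order = np.argsort(scores)[::-1]   # highest score first
    keep = []
    while len(order) > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # Jaccard overlap of the kept box with the remaining candidates.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= overlap_threshold]
    return keep

# Two heavily overlapping candidates and one distinct face:
boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
keep = nms(boxes, scores, overlap_threshold=0.3)
```

The lower-scored duplicate of the first face is suppressed, while the spatially distinct detection survives; the task-specific threshold is then applied only to the surviving boxes.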

As mentioned in Sec. 3.1, each output layer of Face-SSD generates several heatmaps: one for face detection, four for the offset coordinates of the face bounding box, and one per face analysis task, as shown in Fig. 2(b). Specifically, Fig. 4(a) and (b) visualise the heatmaps generated by Face-SSD's second and third-scale output layers, which handle the second and third smallest sizes of face appearing in the image, respectively. Activations in a heatmap are therefore high when a face of the corresponding size is detected. For the given example of smile recognition, as shown in Fig. 4(a), the forefront heatmap shows two clusters of pixels, indicating the existence of two faces. The rearmost heatmap highlights the corresponding pixels only when the task is detected; in this example it has high values when the detected face is a smiling face.

(a) Smile Recognition
(b) Valence-Arousal Estimation
Figure 4: Examples of face detection and face analysis tasks. As representative examples of classification and regression, we visualise the output heatmaps for smile recognition and valence-arousal estimation. (a) The heatmaps represent the face classification, bounding box regression and smile recognition results. (b) For the valence-arousal example, we only visualise the output heatmaps for face classification, valence and arousal estimation in the bottom row. We rescaled the range of output values of the valence-arousal estimation heatmaps to [0, 255] for the visualisation; the median (127) in this example represents a neutral valence or arousal value (0).

4 Experiments and Results

4.1 Datasets

In this paper, we show the performance of the proposed Face-SSD on three representative face analysis applications: smile recognition (binary classification), facial attribute prediction (multiple class recognition), and valence-arousal estimation (multiple task regression). We stress that the structure of the network, including the number of filters and the filter sizes, remains the same – the only change is the number of output layer heatmaps. We used the GENKI-4K (MPLab, 2009), CelebA (Liu et al., 2015), and AffectNet (Mollahosseini et al., 2017) datasets to test the three representative applications using Face-SSD.

Beginning with (Whitehill et al., 2009), which performed the first extensive smile detection study, most of the subsequent studies used the GENKI-4K dataset for performance evaluation. The GENKI-4K (MPLab, 2009) dataset is a subset of the GENKI dataset used in (Whitehill et al., 2009); it consists of 4,000 face images, each labelled with smile and head pose (yaw, pitch, roll), and is the only publicly available part of GENKI. In this paper, the smiling face detection experiments were performed not only on the GENKI-4K dataset but also on the CelebA dataset (Liu et al., 2015), which also contains smile labels. For the facial attribute prediction experiments, we used the CelebA dataset, which is the most representative dataset for this task. Finally, for the valence-arousal estimation experiment we used the AffectNet (Mollahosseini et al., 2017) dataset, which consists of continuous (valence-arousal) labels and face images captured in the wild.

The AFLW dataset (Koestinger et al., 2011), used for face detection, and the datasets used for the face analysis tasks (i.e., GENKI-4K (MPLab, 2009), CelebA (Liu et al., 2015), AffectNet (Mollahosseini et al., 2017)) have different bounding box positions and shapes. To resolve this inconsistency, we empirically adjusted the bounding box positions of these datasets to create a square box that surrounds the entire face area centred on the nose (similar to the bounding boxes of the AFLW dataset). To do this, we first used the trained Face-SSD to detect a face bounding box. Then, we double-checked whether the detected bounding box was correct, and modified it manually if it was not.
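A nose-centred square box as described above can be built with a few lines. This is a minimal sketch under our own assumptions (the paper does not give the exact procedure): we take a nose position and a desired face size as inputs and clip the box to the image boundaries.

```python
def square_box(nose_x, nose_y, face_size, img_w, img_h):
    """Build a square box of side `face_size` centred on the nose,
    clipped to the image boundaries. Returns (x1, y1, x2, y2)."""
    half = face_size / 2.0
    x1 = max(0.0, nose_x - half)
    y1 = max(0.0, nose_y - half)
    x2 = min(float(img_w), nose_x + half)
    y2 = min(float(img_h), nose_y + half)
    return (x1, y1, x2, y2)
```

Note that clipping near an image border yields a rectangle rather than a perfect square; the manual double-check described above would catch such cases.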

In particular, when using the CelebA (Liu et al., 2015) dataset, we only examined smile recognition and facial attribute prediction performance for annotated faces. Each image sample in the CelebA dataset has only one bounding box with its corresponding attribute labels, even if the image contains multiple faces. Therefore, when multiple bounding boxes were detected (black boxes in Fig. 5) at test time, we only calculated the accuracy for the detected bounding box that matched the ground truth position (red box in Fig. 5). If no bounding box was detected at the ground truth location, it was counted as a false negative when calculating the accuracy.
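The matching rule above can be sketched as follows. This is our paraphrase of the evaluation protocol, not the authors' code, and the IoU matching threshold (0.5) is a hypothetical choice: among possibly many detections, only the one best overlapping the single annotated box is scored, and no sufficient overlap counts as a false negative.

```python
def iou(a, b):
    """Jaccard overlap of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def match_to_ground_truth(detections, gt_box, iou_thr=0.5):
    """Return the detection best overlapping the annotated box,
    or None (a false negative) if no detection overlaps enough."""
    best, best_iou = None, iou_thr
    for det in detections:
        overlap = iou(det, gt_box)
        if overlap >= best_iou:
            best, best_iou = det, overlap
    return best
```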

Figure 5: If there were multiple faces detected (black boxes), only the annotated faces with the ground truth label (red box) were evaluated for attribute prediction. The face detected in the background was not used for accuracy measurement.

4.2 Face Detection

First, we evaluate the face detection performance. Although Face-SSD performs face detection in parallel with one or more tasks, the face analysis task results appearing in the output heatmap are only examined at the corresponding pixel positions that indicate successful face detection (as discussed in Sec. 3.3).

Here, we evaluate the face detection performance of Face-SSD on the face analysis task datasets: GENKI-4K (MPLab, 2009), CelebA (Liu et al., 2015), and AffectNet (Mollahosseini et al., 2017). According to the experimental results of (Du and Martinez, 2011), the visual recognition ability of a human degrades when the image resolution falls below a minimum number of pixels. For this reason, the face detection component of Face-SSD aims to support face analysis tasks rather than to detect tiny faces, which is beyond the scope of this work. To this end, we evaluate the face detection performance on the face analysis task (e.g., smile, attribute, valence-arousal) datasets, which do not include severe occlusion or very small faces. Instead, these datasets consist of images that typically contain faces of comparatively high resolution and are captured in the wild (with naturalistic variations in pose, occlusion, and/or scale).

The face detection results are shown in Table 2 in terms of Equal Error Rate (EER) and Average Precision (AP) (Everingham et al., 2010). First, we investigated face detection performance using the same strategy as SSD (Liu et al., 2016), called Face-SSD Baseline (Face-SSD-B) (Jang et al., 2017). The AFLW dataset (Koestinger et al., 2011) was used for training the face detection part of Face-SSD. For data augmentation, Face-SSD-B used shrinking, cropping, and gamma correction (see details in Sec. 3.2.3). With this data augmentation, Face-SSD-B trained on the non-challenging AFLW face dataset did not achieve a competitive performance (EER=5.42% and AP=99.50%) in comparison to approaches trained on other face detection datasets. However, unlike general face detection evaluations, we used the simplest face analysis task dataset (GENKI-4K (MPLab, 2009)) to provide a performance comparison between different strategy combinations.

IoU for GTs HNM H-a-S for All H-a-S for Half GENKI-4K Test Results
0.50 0.35 Fine Coarse Fine Coarse EER () AP
Face-SSD-Baseline (Jang et al., 2017) 05.42 99.50
Face-SSD-B with More GTs 03.68 99.91
Face-SSD-B with HNM 01.72 99.88
Face-SSD-B with H-a-S 34.83 93.54
08.26 97.79
01.95 99.89
01.16 99.91
Face-SSD 00.66 99.88
Table 2: Effects of using Hard Negative Mining (HNM) and Hide-and-Seek (H-a-S) methods when training face detection in Face-SSD. (See text for more details about abbreviations and description)

To improve the face detection performance we first lowered the IoU threshold from 0.50 to 0.35 when assigning ground truths, similarly to SFD (Zhang et al., 2017b). Lowering the IoU threshold when matching default boxes increases the number of positive examples. By doing so the accuracy was improved from EER=5.42% and AP=99.50% to EER=3.68% and AP=99.91% (Table 2).

In order to improve the performance further, we applied a Hard Negative Mining (HNM) strategy to the training samples in a minibatch. Specifically, we extracted the data samples that produced the largest loss in the current minibatch, and then re-used those samples in the next minibatch. By doing so, we further reduced the detection error from EER=3.68% and AP=99.91% to EER=1.72% and AP=99.88%.
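The hard-sample carry-over described above can be sketched as a small selection routine. This is a hedged illustration, not the authors' implementation; in particular, the carry-over fraction (0.25 here) is a hypothetical value, as the exact ratio is not restated in this excerpt.

```python
def mine_hard_examples(batch, losses, keep_frac=0.25):
    """Return the hardest samples of a minibatch, i.e. those with the
    largest training loss, to be re-used in the next minibatch."""
    n_keep = max(1, int(len(batch) * keep_frac))
    # rank sample indices by descending loss
    ranked = sorted(zip(losses, range(len(batch))), reverse=True)
    return [batch[i] for _, i in ranked[:n_keep]]
```

In a training loop, the returned samples would simply be concatenated with the next randomly drawn minibatch before the forward pass.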

Finally, we applied H-a-S (Singh and Lee, 2017) as one of our data augmentation strategies. However, unlike what is reported in the original H-a-S paper (Singh and Lee, 2017), when the H-a-S method was applied to all training samples, the detection performance dropped significantly (to EER=34.83% and AP=93.54% in the worst case). Applying the H-a-S method randomly to approximately half of the training samples reduced the error considerably. In addition, as shown in Table 2, our results indicate that for face detection it is better to hide coarsely divided patches than to hide finely divided ones, because face detection relies on relatively large continuous patterns. In Table 2, for H-a-S, the coarse patch division process randomly selects the grid size from 3, 4, 5 and 6 (see Sec. 3.2.3), whereas the fine patch division process randomly selects the patch splitting size from 16, 32, 44 and 56 as originally proposed in (Singh and Lee, 2017).
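The coarse Hide-and-Seek variant discussed above can be sketched as follows. This is our illustration under stated assumptions: the grid size is drawn from {3, 4, 5, 6} as in the text, but the per-patch hide probability (0.5) and the 50% chance of applying H-a-S at all (the "approximately half of the training samples" rule) are placeholder values.

```python
import random

def hide_and_seek(image, hide_prob=0.5, apply_prob=0.5):
    """image: H x W list of lists (single channel for simplicity).
    Randomly zeroes out grid cells of the image in place."""
    if random.random() > apply_prob:          # apply to roughly half the samples
        return image
    h, w = len(image), len(image[0])
    s = random.choice([3, 4, 5, 6])           # coarse grid division
    ph, pw = max(1, h // s), max(1, w // s)
    for gy in range(s):
        for gx in range(s):
            if random.random() < hide_prob:   # hide this patch
                for y in range(gy * ph, min(h, (gy + 1) * ph)):
                    for x in range(gx * pw, min(w, (gx + 1) * pw)):
                        image[y][x] = 0
    return image
```

The fine variant would only differ in drawing the grid size from the larger set {16, 32, 44, 56}.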

Figure 6: Experimental curves for face detection performance on the GENKI-4K (MPLab, 2009), CelebA (Liu et al., 2015) and AffectNet (Mollahosseini et al., 2017) datasets: Precision-Recall curves and Receiver Operating Characteristic (ROC) curves.

By applying the training strategies of a low IoU threshold, HNM and H-a-S, we achieved EER=0.66% and AP=99.88% on the GENKI-4K dataset. Face-SSD achieved similarly strong face detection results on the CelebA and AffectNet datasets, as shown in Fig. 6. These results indicate that Face-SSD can robustly detect faces in unconstrained environments, and that Face-SSD can be used for further face analysis tasks such as facial attribute and affect prediction along the dimensions of valence and arousal. The optimal threshold for the best face detection accuracy was determined separately for each dataset.

4.3 Face Analysis

Face-SSD is inspired by SSD (Liu et al., 2016), which promises real-time detection performance. Thus, the parameter values used when finetuning the face detection and the face analysis parts of Face-SSD are initialised with the values used for training the base network of SSD (Liu et al., 2016). We trained with SGD with momentum and weight decay. We used a small learning rate for an initial warm-up period, then continued training with a larger learning rate, and subsequently reduced the learning rate in steps at regular intervals until it reached its final value. Increasing the learning rate after the warm-up period speeds up the optimisation process; however, we started the training with a small learning rate because the optimisation tends to diverge if a large learning rate is used at the beginning.
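The warm-up-then-decay schedule described above can be sketched as a piecewise function. All numeric values below (rates, iteration counts, decay factor) are hypothetical placeholders, since the paper's exact hyperparameters are not reproduced in this excerpt.

```python
def learning_rate(iteration,
                  warmup_iters=1_000, high_iters=40_000,
                  base_lr=1e-4, high_lr=1e-3,
                  decay_every=10_000, decay_factor=0.1, min_lr=1e-6):
    """Start small to avoid divergence, raise the rate to speed up
    optimisation, then decay step-wise down to a floor."""
    if iteration < warmup_iters:
        return base_lr                       # cautious warm-up phase
    if iteration < warmup_iters + high_iters:
        return high_lr                       # fast-optimisation phase
    steps = (iteration - warmup_iters - high_iters) // decay_every + 1
    return max(min_lr, high_lr * (decay_factor ** steps))
```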

The following sections detail the experiments we have conducted to evaluate the two main performance factors of Face-SSD, namely prediction accuracy and processing time, for two tasks: smile recognition and facial attribute prediction.

Method Feature Classifier Detection Registration Input () Accuracy
(Shan, 2012) Pixel comparison AdaBoost ? Eyes (manual)
(Liu et al., 2012) HOG SVM VJ* Eyes
(Jain and Crowley, 2013) Multi-Gaussian SVM VJ* ?
(Kahou et al., 2014) LBP SVM VJ*+Sun* / ori. Pts
(An et al., 2015) HOG ELM VJ* Flow-based*
(Zhang et al., 2015) CNN Softmax ? Face Pts
(Li et al., 2016) Gabor-HOG SVM VJ* / manual ?
(Chen et al., 2017a)-I CNN SVM Liu* ?
(Chen et al., 2017a)-II CNN SVM Liu*
(Chen et al., 2017a)-III CNN SVM
Face-SSD CNN Sigmoid

* VJ: (Viola and Jones, 2004), Liu: (Liu et al., 2015), Sun: (Sun et al., 2013), Flow-based: (An et al., 2015)

Table 3: A detailed comparison with the state-of-the-art methods on the GENKI-4K dataset (MPLab, 2009). We summarise the features, classifiers, detection/registration methods and input image resolution (width, height, and channels) used in previous studies, in order of publication. All previous studies require a normalised (cropped and aligned) input image, which necessarily requires face detection and registration steps in advance (except (Chen et al., 2017a)-II and III). Some works (Shan, 2012; Jain and Crowley, 2013; Zhang et al., 2015; Li et al., 2016; Chen et al., 2017a) do not specify how the face is detected and aligned (denoted ‘?’), while (Kahou et al., 2014) mentions that the original image is used if face detection fails.

4.3.1 Smile Recognition

Accuracy for this task refers to the smile recognition performance including the face detection results. If face detection fails, the result of smile recognition is considered to be a non-smile.

Testing on the GENKI-4K dataset: Experiments on this dataset are conventionally based on a four-fold validation procedure. However, as the GENKI-4K dataset contains a relatively small number of data samples (4,000), for training we initially utilised the CelebA dataset, which contains a rich set of images. When Face-SSD was trained on the CelebA dataset, we used the entire GENKI-4K dataset for testing; the resulting smile recognition accuracy is shown in Fig. 7. Despite being trained on a completely different dataset with different characteristics, Face-SSD already surpassed all the latest methods that used the GENKI-4K dataset for testing, as shown in Table 3.

Figure 7: Receiver Operating Characteristic (ROC) curve for smiling face detection accuracy using the GENKI-4K (MPLab, 2009) dataset. Tr and Te represent training and testing, respectively.

To provide a fair comparison with other methods that use the four-fold validation strategy, we used the GENKI-4K dataset, together with the bounding box annotations obtained with our method (as explained in Sec. 4.1), to finetune the Face-SSD that was trained on the CelebA dataset. In this case, the smile recognition accuracy improved further. This is because the training samples in the GENKI-4K dataset are more similar to the testing samples than those of the CelebA dataset. Although the training and testing samples do not overlap, using the same dataset (GENKI-4K) for training helps Face-SSD learn the test sample characteristics of that dataset. Our four-fold validation results are shown in Fig. 7. Compared to the accuracies reported by the existing works listed in Table 3, our method obtains the best results with a mean accuracy of 95.76%.

Although Face-SSD does not require separate steps for face detection and registration, Face-SSD’s smile recognition results rely on the face detection performed in parallel within the same architecture (as explained in Sec. 4.2). Among the existing works listed in Table 3, Chen’s work ((Chen et al., 2017a)-II) reports testing accuracy when no registration process is used. We therefore compare Face-SSD’s smile recognition performance most closely to this method. Our experimental results show that Face-SSD outperforms the most recently reported registration-free smile recognition result of Chen ((Chen et al., 2017a)-II), which is based on a deep learning architecture.

Figure 8: Performance comparison in terms of accuracy (%) on the CelebA (Liu et al., 2015) dataset for facial attribute prediction. Face-SSD delivers prediction performance that is very close to the state-of-the-art models without modifying the Face-SSD architecture. The state-of-the-art models are PANDA (Zhang et al., 2014), LNets+ANet (Liu et al., 2015), CTS-CNN (Zhong et al., 2016), MT-RBM PCA (Ehrlich et al., 2016), Walk and Learn (Wang et al., 2016), MCNN-AUX (Hand and Chellappa, 2017), DMTL (Han et al., 2017), and R-Codean (Sethi et al., 2018). (See Table 5 for more detailed accuracy comparisons.)

Testing on the CelebA dataset: In the second experiment, we used the CelebA dataset to train and test Face-SSD, randomly splitting the dataset into training and testing partitions. We performed several experiments using different combinations of randomly selected training and test samples. Our experimental results show that Face-SSD detects smiling faces accurately (mean=92.81%), similarly to the state-of-the-art methods ((Liu et al., 2015) and (Ranjan et al., 2017)), as shown in Table 4. However, Face-SSD is much faster (47.28 ms) than those methods, which require region proposal steps for smile recognition (see Table 4).

Method RP Acc. Time (ms.)
Liu et al. (Liu et al., 2015) EB (Zitnick and Dollar, 2014)
Ranjan et al. (Ranjan et al., 2017) SS (van de Sande et al., 2011)
Face-SSD 92.81 47.28
Table 4: Comparison to the state-of-the-art methods on the CelebA dataset in terms of accuracy and time (ms.). RP, EB and SS refer to Region Proposal, EdgeBox (Zitnick and Dollar, 2014) and Selective Search (van de Sande et al., 2011), respectively.

Attributes (columns 1–20): 5 o’Clock Shadow, Arched Eyebrows, Attractive, Bags Under Eyes, Bald, Bangs, Big Lips, Big Nose, Black Hair, Blond Hair, Blurry, Brown Hair, Bushy Eyebrows, Chubby, Double Chin, Eyeglasses, Goatee, Gray Hair, Heavy Makeup, High Cheekbones

PANDA (CVPR14) 88.0 78.0 81.0 79.0 96.0 92.0 67.0 75.0 85.0 93.0 86.0 77.0 86.0 86.0 88.0 98.0 93.0 94.0 90.0 86.0
LNets+ANet (ICCV15) 91.0 79.0 81.0 79.0 98.0 95.0 68.0 78.0 88.0 95.0 84.0 80.0 90.0 91.0 92.0 99.0 95.0 97.0 90.0 87.0
CTS-CNN (ICB16) 89.0 83.0 82.0 79.0 96.0 94.0 70.0 79.0 87.0 93.0 87.0 79.0 87.0 88.0 89.0 99.0 94.0 95.0 91.0 87.0
MT-RBM PCA (CVPRW16) 90.0 77.0 76.0 81.0 98.0 88.0 69.0 81.0 76.0 91.0 95.0 83.0 88.0 95.0 96.0 96.0 96.0 97.0 85.0 83.0
Walk and Learn (CVPR16) 84.0 87.0 84.0 87.0 92.0 96.0 78.0 91.0 84.0 92.0 91.0 81.0 93.0 89.0 93.0 97.0 92.0 95.0 96.0 95.0
MCNN-AUX (AAAI17) 94.5 83.4 83.1 84.9 98.9 96.0 71.5 84.5 89.8 96.0 96.2 89.2 92.8 95.7 96.3 99.6 97.2 98.2 91.5 87.6
DMTL (TPAMI17) 95.0 86.0 85.0 99.0 99.0 96.0 88.0 92.0 85.0 91.0 96.0 96.0 85.0 97.0 99.0 99.0 98.0 96.0 92.0 88.0
R-Codean (PRL18) 92.9 81.6 79.7 83.2 99.5 94.5 79.9 83.7 84.8 95.0 96.6 83.0 91.4 95.5 96.5 98.2 96.8 97.9 89.7 86.7
Face-SSD 92.9 82.0 81.3 82.5 98.6 95.2 77.8 82.3 87.9 93.6 95.0 83.5 89.6 95.1 96.0 99.2 96.3 97.6 90.7 86.8


Attributes (columns 21–40): Male, Mouth Slightly Open, Mustache, Narrow Eyes, No Beard, Oval Face, Pale Skin, Pointy Nose, Receding Hairline, Rosy Cheeks, Sideburns, Smiling, Straight Hair, Wavy Hair, Wearing Earrings, Wearing Hat, Wearing Lipstick, Wearing Necklace, Wearing Necktie, Young


PANDA (CVPR14) 97.0 93.0 93.0 84.0 93.0 65.0 91.0 71.0 85.0 87.0 93.0 92.0 69.0 77.0 78.0 96.0 93.0 67.0 91.0 84.0 85.42
LNets+ANet (ICCV15) 98.0 92.0 95.0 81.0 95.0 66.0 91.0 72.0 89.0 90.0 96.0 92.0 73.0 80.0 82.0 99.0 93.0 71.0 93.0 87.0 87.30
CTS-CNN (ICB16) 99.0 92.0 93.0 78.0 94.0 67.0 85.0 73.0 87.0 88.0 95.0 92.0 73.0 79.0 82.0 96.0 93.0 73.0 91.0 86.0 86.60
MT-RBM PCA (CVPRW16) 90.0 82.0 97.0 86.0 90.0 73.0 96.0 73.0 92.0 94.0 96.0 88.0 80.0 72.0 81.0 97.0 89.0 87.0 94.0 81.0 86.97
Walk and Learn (CVPR16) 96.0 97.0 90.0 79.0 90.0 79.0 85.0 77.0 84.0 96.0 92.0 98.0 75.0 85.0 91.0 96.0 92.0 77.0 84.0 86.0 88.65
MCNN-AUX (AAAI17) 98.2 93.7 96.9 87.2 96.0 75.8 97.0 77.5 93.8 95.2 97.8 92.7 83.6 83.9 90.4 99.0 94.1 86.6 96.5 88.5 91.29
DMTL (TPAMI17) 98.0 94.0 97.0 90.0 97.0 78.0 97.0 78.0 94.0 96.0 98.0 94.0 85.0 87.0 91.0 99.0 93.0 89.0 97.0 90.0 92.60
R-Codean (PRL18) 95.9 89.8 96.3 90.6 94.6 76.5 96.9 77.0 93.6 95.3 97.6 92.8 81.2 75.4 82.7 97.9 92.0 89.8 95.9 86.6 90.14
Face-SSD 97.3 91.9 96.0 89.0 94.9 74.8 95.7 74.9 93.1 94.3 96.6 91.8 83.4 85.1 86.9 98.5 92.6 87.8 95.6 87.6 90.29
Table 5: Comparison to the state-of-the-art methods for facial attribute prediction on the CelebA dataset in terms of prediction accuracy. The average accuracies of PANDA (Zhang et al., 2014), LNets+ANet (Liu et al., 2015), CTS-CNN (Zhong et al., 2016), MT-RBM PCA (Ehrlich et al., 2016), Walk and Learn (Wang et al., 2016), MCNN-AUX (Hand and Chellappa, 2017), DMTL (Han et al., 2017), R-Codean (Sethi et al., 2018), and the proposed Face-SSD are 85.42%, 87.30%, 86.60%, 86.97%, 88.65%, 91.29%, 92.60%, 90.14%, and 90.29%, respectively.

4.3.2 Facial Attribute Prediction

In this section, we evaluate the performance of Face-SSD for the prediction of facial attributes such as gender, age, etc. Our framework treats this problem as 40 binary classification problems, using 40 heatmaps at the output layers. The only difference with respect to the smile recognition case is the number of filter kernels used at the final layer – everything else remains the same, including the learning hyperparameters. The effects of modifying various settings during training are presented in Table 6.
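The multi-label readout described above can be sketched as follows: one raw score per attribute heatmap is read at the detected face's pixel position and thresholded independently. This is our illustration, not the authors' code; the sigmoid activation with a 0.5 threshold is a common default (an assumption here), and the attribute names are the standard CelebA labels, shortened for brevity.

```python
import math

ATTRIBUTES = ["Smiling", "Blond Hair", "Eyeglasses"]   # shortened CelebA list

def predict_attributes(logits_at_face, threshold=0.5):
    """logits_at_face: one raw score per attribute heatmap, taken at the
    detected face's pixel position. Returns a name -> bool decision map."""
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits_at_face]
    return {name: p >= threshold for name, p in zip(ATTRIBUTES, probs)}
```

Because each attribute is decided independently, any subset of the 40 labels can be active for a single face.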
Our experiment focuses specifically on the effects of using the Gamma Correction (GC) and Hide-and-Seek (H-a-S) strategies used in the data augmentation process. Depending on the attribute label, there are two possible data augmentation strategies that might affect the accuracy of facial attribute prediction. Gamma correction (colour value adjustment) affects the accuracy of predicting colour-related attributes, such as hair colour (e.g., Black, Blond, Brown and Gray Hair), skin colour (e.g., Pale Skin and Rosy Cheeks) and presence of cosmetics (e.g., Heavy Makeup and Wearing Lipstick). Hide-and-Seek, which forces the Face-SSD to seek more of the overall face area, seems to affect the accuracy of predicting the overall face area-related attributes including “Attractive, Blurry, Chubby, Heavy Makeup, Oval Face, Pale Skin and Young”.

Figure 9: Removing Gamma Correction (GC) when training Face-SSD (Case C in Table 6) improves the accuracy of predicting colour-related attributes compared to using GC (Case A in Table 6). Using Hide-and-Seek (H-a-S) (Case B in Table 6) does not improve the overall face area-related attributes as expected.

As shown in Table 6, we tested Face-SSD with all possible combinations of Gamma Correction and Hide-and-Seek during training, with all other settings the same as for the face detection part of Face-SSD (see Table 2). As expected, using Gamma Correction (Cases A and B in Table 6), which modifies the original colours of the training images, degrades the attribute recognition performance compared to training without Gamma Correction (Cases C and D in Table 6). Although training without Gamma Correction primarily improves the accuracy of the colour-related attributes (e.g., Black Hair, Blond Hair, Brown Hair and Heavy Makeup), it also helps improve the accuracy of other attributes, as shown in Fig. 9. By removing only Gamma Correction (Case C), Face-SSD achieves an accuracy of 90.15%, which is competitive with MCNN-AUX (Hand and Chellappa, 2017), DMTL (Han et al., 2017) and R-Codean (Sethi et al., 2018) (see Fig. 8).

Interestingly, the use of Hide-and-Seek improves accuracy, but not primarily for the attributes related to large facial areas, such as “Attractive, Blurry, Chubby, Heavy Makeup, Oval Face, Pale Skin and Young”, as originally expected. On the contrary, it helps to identify more details in certain face areas (e.g., Bushy Eyebrows, Mouth Slightly Open, Straight Hair, Wavy Hair, Wearing Earrings, Wearing Necktie), as shown in Fig. 9. When training without Gamma Correction, Face-SSD benefits less from the use of Hide-and-Seek, as shown in Table 6 (Case D), because training without Gamma Correction already has more impact on improving the accuracy of the same attributes, as shown in Fig. 9. The results of Face-SSD shown in Table 5 were obtained by training with Hide-and-Seek but without Gamma Correction (Case D in Table 6). Although we use the generalised Face-SSD architecture, as opposed to an architecture specially designed for facial attribute prediction, we achieved state-of-the-art accuracy (in the top three among the performances of related works).

Using GC Using H-a-S Accuracy (%)
Face-SSD A Yes No 89.57
Face-SSD B Yes Yes 90.06
Face-SSD C No No 90.15
Face-SSD D No Yes 90.29
Table 6: The effect of using Gamma Correction (GC) and Hide-and-Seek (H-a-S) in the data augmentation process when training Face-SSD for attribute prediction on the CelebA dataset.

4.3.3 Valence and Arousal Estimation

In this section, we investigate the performance of valence-arousal estimation using Face-SSD. Unlike the previous sections, which address binary classification (smile recognition) and multi-class recognition (facial attribute prediction) problems, Face-SSD for valence-arousal solves a regression problem. To this end we used a state-of-the-art dataset called AffectNet (Mollahosseini et al., 2017). AffectNet consists of face images captured in the wild and corresponding annotations of valence-arousal and emotion. To confirm the regression ability of Face-SSD, we investigated only the valence-arousal estimation performance.

Note that, as AffectNet consists only of cropped face images, we trained Face-SSD using a data augmentation strategy that allows only minor variations in face size. Therefore, during testing, Face-SSD typically handles large faces for valence-arousal estimation. Despite this limitation during training, Face-SSD is able to handle not only large faces but also faces of medium size during testing, as shown in Fig. 1(c).

The performance of the valence-arousal estimation is shown in Table 7. For valence estimation, the AffectNet baseline yields slightly better results than Face-SSD; for arousal, on the other hand, Face-SSD provides better results. Overall, Face-SSD provides close to state-of-the-art performance without any modification to the original Face-SSD architecture. See (Mollahosseini et al., 2017) for a detailed description of the metrics in Table 7.

Valence Arousal
AffectNet Face-SSD AffectNet Face-SSD
RMSE 0.37 0.4406 0.41 0.3937
CORR 0.66 0.5750 0.54 0.4953
SAGR 0.74 0.7284 0.65 0.7129
CCC 0.60 0.5701 0.34 0.4665
Table 7: Experimental results of valence and arousal estimation using the AffectNet (Mollahosseini et al., 2017) dataset. Results are reported using Root Mean Square Error (RMSE), Pearson’s Correlation Coefficient (CORR), Sign Agreement Metric (SAGR) and Concordance Correlation Coefficient (CCC) (see (Mollahosseini et al., 2017) for a detailed description of the metrics).
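For reference, the four metrics in Table 7 can be computed as below. These are the standard textbook definitions (see Mollahosseini et al., 2017 for the formulations used in AffectNet), written as a plain-Python sketch; `y` holds ground-truth valence or arousal values in [-1, 1] and `p` the corresponding predictions.

```python
import math

def _mean(v):
    return sum(v) / len(v)

def rmse(y, p):
    """Root Mean Square Error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, p)) / len(y))

def corr(y, p):
    """Pearson's correlation coefficient."""
    my, mp = _mean(y), _mean(p)
    num = sum((a - my) * (b - mp) for a, b in zip(y, p))
    den = math.sqrt(sum((a - my) ** 2 for a in y) * sum((b - mp) ** 2 for b in p))
    return num / den

def sagr(y, p):
    """Sign agreement metric: fraction of predictions with the correct sign."""
    return sum(1 for a, b in zip(y, p)
               if math.copysign(1, a) == math.copysign(1, b)) / len(y)

def ccc(y, p):
    """Concordance correlation coefficient."""
    my, mp = _mean(y), _mean(p)
    vy = _mean([(a - my) ** 2 for a in y])
    vp = _mean([(b - mp) ** 2 for b in p])
    cov = _mean([(a - my) * (b - mp) for a, b in zip(y, p)])
    return 2 * cov / (vy + vp + (my - mp) ** 2)
```

Unlike CORR, CCC also penalises a systematic offset or scale difference between predictions and ground truth, which is why the two columns can diverge in Table 7.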

4.4 Computational Speed and Complexity

For all of the Face-SSD applications presented in this paper, we obtained an average processing time of 47.39 ms (21.10 FPS) during testing, with an experimental environment consisting of an Intel Core i7-6700HQ CPU and an NVIDIA GeForce GTX 960M GPU, with 23.5 GB of DRAM. We used Theano for the Face-SSD implementation. As shown in Table 8, most Face-SSD applications achieve near real-time processing speed. Smile recognition (binary classification), facial attribute prediction (40-class recognition) and valence-arousal estimation (multiple task regression) take 47.28 ms (21.15 FPS), 47.55 ms (21.03 FPS) and 47.37 ms (21.11 FPS), respectively. Using the proposed generic Face-SSD for face analysis, the number of model parameters (indicating complexity) does not increase linearly even as the number of face analysis tasks and classes increases. Although facial attribute prediction performs 40 times more tasks than smile recognition, its processing time increases only by 0.27 ms and it requires only a small number of additional parameters.

As shown in Table 4, the proposed Face-SSD is significantly faster than traditional methods that analyse faces via separate region proposal and task prediction steps. For example, the work of Liu et al. (Liu et al., 2015) requires one stage to generate the face confidence heatmap and another to classify the attributes; in addition, it requires a further stage to find candidate bounding boxes (EdgeBox (Zitnick and Dollar, 2014)) for localising the final bounding box, which results in a considerably longer total processing time. The work of Ranjan et al. (Ranjan et al., 2017) also takes substantially longer to process an image; Ranjan et al. explain that the main bottleneck for speed is the region proposal process (Selective Search (van de Sande et al., 2011)) and the repeated CNN processing of every individual proposal.

Face Analysis Task Parameter Number ms (FPS)
Face Detection Part (only) 25.57 (39.11)
Smile Recognition 47.28 (21.15)
Facial Attribute Prediction 47.55 (21.03)
Valence-Arousal Estimation 47.37 (21.11)
Average of All Applications 47.39 (21.10)
Table 8: The total number of parameters and processing time for various face analysis applications using Face-SSD.

To ensure a fair comparison of processing times, the time should be measured in the same experimental environment. However, Liu et al. (Liu et al., 2015) do not provide detailed information about their experimental environment, except that they use GPUs. Ranjan et al. (Ranjan et al., 2017) implemented their all-in-one network using multiple CPU cores and GTX TITAN-X GPUs. The processing speed of the proposed Face-SSD is several times faster than that of the all-in-one network, even in a less powerful experimental environment.

Although Face-SSD is faster than other face analysis methods, the processing speed is lower than the base object detection (SSD) model (Liu et al., 2016) as the complexity of Face-SSD is nearly twice that of SSD, as shown in Table 8. Placing more layers to perform face analysis tasks increased the number of parameters in Face-SSD. However, the structure of the all-in-one network (Ranjan et al., 2017) shows that sharing more convolutional features does not degrade the performance of various tasks. Capitalising on this idea, we expect to further reduce the complexity of Face-SSD by sharing more layers and assigning a relatively small number of layers to other face analysis tasks.

5 Conclusions

In this paper, we tackled the problem of multiple face analysis tasks, namely smile recognition, facial attribute prediction and valence-arousal estimation in the wild, without the traditional pre-normalisation steps of face detection and registration. To this end, we proposed Face-SSD, which performs face detection and face analysis simultaneously in a single framework. For fast and scale-invariant detection, Face-SSD builds upon the state-of-the-art object detection network SSD. In addition, we used pre-trained parameters of two different networks, trained for object classification and for face detection, to learn the face and task-relevant patterns. Consequently, we built a single framework that enables real-time scale-invariant face analysis in the wild. By exploring various data augmentation strategies for face analysis while maintaining the same Face-SSD architecture, we achieved state-of-the-art performance for various face analysis tasks without increasing model complexity. Our experimental results show that Face-SSD achieves state-of-the-art performance (accuracy of 95.76% for smile recognition and 90.29% for attribute prediction, and RMSE of 0.44 and 0.39 for valence and arousal estimation) while maintaining real-time speed (21.15 FPS for smile recognition, 21.03 FPS for attribute prediction, and 21.11 FPS for valence-arousal estimation). For future work, we plan to investigate ways of using facial attributes to improve the face detection performance. The challenge in doing this involves using heterogeneous annotations contained in separate datasets.


This work has been supported by the Technology Strategy Board, UK / Innovate UK project Sensing Feeling (project no. 102547). This work was undertaken while Youngkyoon Jang was a research associate affiliated with Queen Mary University of London and University of Cambridge.


  • Agustsson et al. (2017) Agustsson, E., Timofte, R., Escalera, S., Baró, X., Guyon, I., Rothe, R., 2017. Apparent and real age estimation in still images with deep residual regressors on appa-real database, in: FG, IEEE Computer Society. pp. 87–94.
  • An et al. (2015) An, L., Yang, S., Bhanu, B., 2015. Efficient smile detection by extreme learning machine. Neurocomput. 149, 354–363.
  • Chang et al. (2017) Chang, W., Hsu, S., Chien, J., 2017. Fatauva-net: An integrated deep learning framework for facial attribute recognition, action unit detection, and valence-arousal estimation, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 1963–1971.

  • Chen et al. (2017a) Chen, J., Ou, Q., Chi, Z., Fu, H., 2017a. Smile detection in the wild with deep convolutional neural networks. Machine Vision Applications 28, 173–183.
  • Chen et al. (2017b) Chen, S., Zhang, C., Dong, M., Le, J., Rao, M., 2017b. Using ranking-cnn for age estimation, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Deng et al. (2009) Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L., 2009. ImageNet: A Large-Scale Hierarchical Image Database, in: CVPR.
  • Du and Martinez (2011) Du, S., Martinez, A.M., 2011. The resolution of facial expressions of emotion. Journal of Vision 11, 24.
  • Ehrlich et al. (2016) Ehrlich, M., Shields, T.J., Almaev, T., Amer, M.R., 2016. Facial attributes classification using multi-task representation learning, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 752–760.
  • Everingham et al. (2010) Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A., 2010. The pascal visual object classes (voc) challenge. International Journal of Computer Vision 88, 303–338.
  • Girshick (2015) Girshick, R., 2015. Fast R-CNN, in: Proceedings of the International Conference on Computer Vision (ICCV).
  • Han et al. (2017) Han, H., Jain, A.K., Wang, F., Shan, S., Chen, X., 2017. Heterogeneous face attribute estimation: A deep multi-task learning approach. IEEE Trans. on PAMI.
  • Hand and Chellappa (2017) Hand, E.M., Chellappa, R., 2017. Attributes for improved attributes: A multi-task network utilizing implicit and explicit relationships for facial attribute classification, in: AAAI, AAAI Press. pp. 4068–4074.
  • Hao et al. (2017) Hao, Z., Liu, Y., Qin, H., Yan, J., Li, X., Hu, X., 2017. Scale-aware face detection, in: CVPR.
  • He et al. (2017) He, P., Huang, W., He, T., Zhu, Q., Qiao, Y., Li, X., 2017. Single shot text detector with regional attention, in: IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pp. 3066–3074.
  • Hsu et al. (2017) Hsu, G.S.J., Cheng, Y.T., Ng, C.C., Yap, M.H., 2017. Component biologically inspired features with moving segmentation for age estimation, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 540–547.
  • MPLab (2009) MPLab, 2009. The MPLab GENKI Database, GENKI-4K Subset.
  • Jain and Crowley (2013) Jain, V., Crowley, J.L., 2013. Smile Detection Using Multi-scale Gaussian Derivatives, in: 12th WSEAS International Conference on Signal Processing, Robotics and Automation, Cambridge, United Kingdom.
  • Jang et al. (2017) Jang, Y., Gunes, H., Patras, I., 2017. SmileNet: Registration-Free Smiling Face Detection in the Wild, in: The IEEE International Conference on Computer Vision (ICCV) Workshops.
  • Jourabloo and Liu (2017) Jourabloo, A., Liu, X., 2017. Pose-invariant face alignment via cnn-based dense 3d model fitting. Int. J. Comput. Vision 124, 187–203.
  • Kahou et al. (2014) Kahou, S.E., Froumenty, P., Pal, C.J., 2014. Facial expression analysis based on high dimensional binary features, in: Computer Vision - ECCV 2014 Workshops - Zurich, Switzerland, September 6-7 and 12, 2014, Proceedings, Part II, pp. 135–147. doi:10.1007/978-3-319-16181-5_10.
  • Kehl et al. (2017) Kehl, W., Manhardt, F., Tombari, F., Ilic, S., Navab, N., 2017. SSD-6D: making rgb-based 3d detection and 6d pose estimation great again, in: IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pp. 1530–1538.
  • Koelstra and Patras (2013) Koelstra, S., Patras, I., 2013. Fusion of facial expressions and eeg for implicit affective tagging. Image Vision Comput. 31, 164–174.
  • Koestinger et al. (2011) Koestinger, M., Wohlhart, P., Roth, P.M., Bischof, H., 2011. Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization, in: First IEEE International Workshop on Benchmarking Facial Image Analysis Technologies.
  • Kossaifi et al. (2017) Kossaifi, J., Tzimiropoulos, G., Todorovic, S., Pantic, M., 2017. Afew-va database for valence and arousal estimation in-the-wild. Image Vision Comput. 65, 23–36.
  • Li et al. (2016) Li, J., Chen, J., Chi, Z., 2016. Smile detection in the wild with hierarchical visual feature, in: 2016 IEEE International Conference on Image Processing (ICIP), pp. 639–643. doi:10.1109/ICIP.2016.7532435.
  • Li et al. (2017) Li, S., Deng, W., Du, J., 2017. Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Lin et al. (2017) Lin, T., Zhao, X., Shou, Z., 2017. Single shot temporal action detection, in: Proceedings of the 2017 ACM on Multimedia Conference, ACM, New York, NY, USA. pp. 988–996.
  • Liu et al. (2012) Liu, M., Li, S., Shan, S., Chen, X., 2012. Enhancing expression recognition in the wild with unlabeled reference data, in: Computer Vision - ACCV 2012, 11th Asian Conference on Computer Vision, Daejeon, Korea, November 5-9, 2012, Revised Selected Papers, Part II, pp. 577–588.
  • Liu et al. (2016) Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C., 2016. SSD: Single shot multibox detector, in: Proceedings of the European Conference on Computer Vision.
  • Liu et al. (2015) Liu, Z., Luo, P., Wang, X., Tang, X., 2015. Deep learning face attributes in the wild, in: Proceedings of International Conference on Computer Vision (ICCV).
  • Lv et al. (2017) Lv, J., Shao, X., Xing, J., Cheng, C., Zhou, X., 2017. A deep regression architecture with two-stage re-initialization for high performance facial landmark detection, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Mollahosseini et al. (2017) Mollahosseini, A., Hasani, B., Mahoor, M.H., 2017. Affectnet: A Database for Facial Expression, Valence, and Arousal Computing in the Wild. IEEE Transactions on Affective Computing .
  • Neubeck and Van Gool (2006) Neubeck, A., Van Gool, L., 2006. Efficient non-maximum suppression, in: Proceedings of the 18th International Conference on Pattern Recognition - Volume 03, IEEE Computer Society, Washington, DC, USA. pp. 850–855.
  • Poirson et al. (2016) Poirson, P., Ammirato, P., Fu, C., Liu, W., Kosecka, J., Berg, A.C., 2016. Fast single shot detection and pose estimation, in: 3DV, IEEE Computer Society. pp. 676–684.
  • Ranjan et al. (2017) Ranjan, R., Sankaranarayanan, S., Castillo, C.D., Chellappa, R., 2017. An all-in-one convolutional neural network for face analysis, in: 12th IEEE International Conference on Automatic Face and Gesture Recognition FG 2017, Washington, DC, USA, May 30-June 3.
  • Ren et al. (2016) Ren, S., He, K., Girshick, R., Sun, J., 2016. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) .
  • Rudd et al. (2016) Rudd, E.M., Günther, M., Boult, T.E., 2016. Moon: A mixed objective optimization network for the recognition of facial attributes., in: ECCV, Springer. pp. 19–35.
  • Russakovsky et al. (2015) Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L., 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115, 211–252. doi:10.1007/s11263-015-0816-y.
  • Russell (2003) Russell, J.A., 2003. Core affect and the psychological construction of emotion. Psychological review 110, 145–72.
  • Russell and Barrett (1999) Russell, J.A., Barrett, L.F., 1999. Core affect, prototypical emotional episodes, and other things called emotion: dissecting the elephant. Journal of Personality and Social Psychology 76, 805–819.
  • van de Sande et al. (2011) van de Sande, K., Uijlings, J., Gevers, T., Smeulders, A., 2011. Segmentation as Selective Search for Object Recognition, in: IEEE International Conference on Computer Vision.
  • Sariyanidi et al. (2015) Sariyanidi, E., Gunes, H., Cavallaro, A., 2015. Automatic analysis of facial affect: A survey of registration, representation, and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 37, 1113–1133.
  • Sethi et al. (2018) Sethi, A., Singh, M., Singh, R., Vatsa, M., 2018. Residual codean autoencoder for facial attribute analysis. Pattern Recognition Letters.
  • Shan (2012) Shan, C., 2012. Smile detection by boosting pixel differences. Trans. Img. Proc. 21, 431–436.
  • Shen et al. (2015) Shen, J., Zafeiriou, S., Chrysos, G., Kossaifi, J., Tzimiropoulos, G., Pantic, M., 2015. The first facial landmark tracking in-the-wild challenge: Benchmark and results, in: 2015 IEEE International Conference on Computer Vision Workshop (ICCVW), pp. 50–58.
  • Simonyan and Zisserman (2015) Simonyan, K., Zisserman, A., 2015. Very deep convolutional networks for large-scale image recognition, in: International Conference on Learning Representations (ICLR).
  • Singh and Lee (2017) Singh, K.K., Lee, Y.J., 2017. Hide-and-seek: Forcing a network to be meticulous for weakly-supervised object and action localization, in: International Conference on Computer Vision (ICCV).
  • Sun et al. (2013) Sun, Y., Wang, X., Tang, X., 2013. Deep convolutional network cascade for facial point detection, in: Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, IEEE Computer Society, Washington, DC, USA. pp. 3476–3483.
  • Sun et al. (2014) Sun, Y., Wang, X., Tang, X., 2014. Deep learning face representation from predicting 10,000 classes, in: Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, IEEE Computer Society, Washington, DC, USA. pp. 1891–1898.
  • Tam et al. (2013) Tam, G.K.L., Cheng, Z.Q., Lai, Y.K., Langbein, F.C., Liu, Y., Marshall, D., Martin, R.R., Sun, X.F., Rosin, P.L., 2013. Registration of 3d point clouds and meshes: A survey from rigid to nonrigid. IEEE Transactions on Visualization and Computer Graphics 19, 1199–1217.
  • Viola and Jones (2004) Viola, P., Jones, M.J., 2004. Robust real-time face detection. Int. J. Comput. Vision 57, 137–154.
  • Wang et al. (2016) Wang, J., Cheng, Y., Feris, R.S., 2016. Walk and learn: Facial attribute representation learning from egocentric video and contextual data, in: CVPR, IEEE Computer Society. pp. 2295–2304.
  • Wang et al. (2014) Wang, N., Gao, X., Tao, D., Li, X., 2014. Facial feature point detection: A comprehensive survey. arXiv .
  • Whitehill et al. (2009) Whitehill, J., Littlewort, G., Fasel, I., Bartlett, M., Movellan, J., 2009. Toward practical smile detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 31, 2106–2111.
  • Wu et al. (2017a) Wu, W., Kan, M., Liu, X., Yang, Y., Shan, S., Chen, X., 2017a. Recursive spatial transformer (rest) for alignment-free face recognition, in: The IEEE International Conference on Computer Vision (ICCV).
  • Wu et al. (2017b) Wu, Y., Gou, C., Ji, Q., 2017b. Simultaneous facial landmark detection, pose and deformation estimation under facial occlusion, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Zeng et al. (2016) Zeng, J., Chu, W.S., la Torre Frade, F.D., Cohn, J., Xiong, Z., 2016. Confidence preserving machine for facial action unit detection. IEEE Transactions on Image Processing .
  • Zhang and Zhang (2010) Zhang, C., Zhang, Z., 2010. A Survey of Recent Advances in Face Detection. Microsoft Research, Technical Report .
  • Zhang et al. (2015) Zhang, K., Huang, Y., Wu, H., Wang, L., 2015. Facial smile detection based on deep learning features, in: 3rd IAPR Asian Conference on Pattern Recognition, ACPR 2015, Kuala Lumpur, Malaysia, November 3-6, 2015, pp. 534–538. doi:10.1109/ACPR.2015.7486560.
  • Zhang et al. (2017a) Zhang, K., Zhang, Z., Wang, H., Li, Z., Qiao, Y., Liu, W., 2017a. Detecting faces using inside cascaded contextual cnn, in: The IEEE International Conference on Computer Vision (ICCV).
  • Zhang et al. (2014) Zhang, N., Paluri, M., Ranzato, M., Darrell, T., Bourdev, L.D., 2014. Panda: Pose aligned networks for deep attribute modeling, in: CVPR, IEEE Computer Society. pp. 1637–1644.
  • Zhang et al. (2017b) Zhang, S., Zhu, X., Lei, Z., Shi, H., Wang, X., Li, S.Z., 2017b. S³FD: Single Shot Scale-invariant Face Detector, in: The IEEE International Conference on Computer Vision (ICCV).
  • Zhang et al. (2018) Zhang, S., Zhu, X., Lei, Z., Wang, X., Shi, H., Li, S.Z., 2018. Detecting face with densely connected face proposal network. Neurocomputing 284, 119–127.
  • Zhong et al. (2016) Zhong, Y., Sullivan, J., Li, H., 2016. Face attribute prediction using off-the-shelf cnn features, in: IEEE International Conference on Biometrics (ICB), pp. 1–7.
  • Zitnick and Dollar (2014) Zitnick, L., Dollar, P., 2014. Edge boxes: Locating object proposals from edges, in: European Conference on Computer Vision (ECCV).