Facial age estimation by deep residual decision making

08/28/2019 ∙ by Shichao Li, et al. ∙ 10

Residual representation learning simplifies the optimization problem of learning complex functions and has been widely used by traditional convolutional neural networks. However, it has not been applied to deep neural decision forests (NDF). In this paper we incorporate residual learning into NDF, and the resulting model achieves state-of-the-art level accuracy on three public age estimation benchmarks while requiring less memory and computation. We further employ a gradient-based technique to visualize the decision-making process of NDF and understand how it is influenced by facial image inputs. The code and pre-trained models will be available at https://github.com/Nicholasli1995/VisualizingNDF.




1 Introduction

Facial age estimation aims to infer chronological age from digital facial images. The task is challenging because of large appearance variation and growth patterns that differ across life periods [survey]. Divide-and-conquer is a strategy that solves a complex problem by dividing the problem space and tackling the sub-problems instead. This strategy is inherent in traditional decision-tree-based models [tree_detector, tree_regressor, randomforest] and has been applied to facial age estimation [randomforest]. Past works using decision trees [tree_detector, tree_regressor, randomforest] usually adopt hand-crafted features in hard splitting functions and rely on heuristic training schemes, which limits their representation learning power. Recent works extend traditional decision trees into deep neural decision forests (NDF) [NDF, Depth, DRFs] by using soft splitting functions, which gives decision trees deep representation learning ability.

In a parallel line of research, the architecture of deep convolutional neural networks (CNN) keeps evolving [VGG, resnet, densenet] to boost model performance on large-scale image classification. Among these developments, residual learning [resnet] was an influential breakthrough: it hypothesized that learning residuals could ease parameter optimization for very deep neural networks, and it is now widely adopted in CNN architectures [resdenoise, H_module]. Although a rigorous mathematical analysis is still out of reach, a seminal work by Li et al. [loss_ls] visualized the loss landscape with and without residual learning, which intuitively demonstrated that residual learning makes the optimization problem easier. We believe it is a natural step to use residual learning in other deep models. However, despite its popularity in traditional CNN, residual learning has not yet been attempted for NDF to the best of our knowledge.

Understanding the learned representation and inference process of deep computer vision models is attracting growing interest because of its close connection with model trustworthiness [car]. Gradient-based visualization [saliency, invert] and explainable representation learning [interpret, inter_tree] have already been tried, yet most works focus on image classification and traditional CNN, leaving regression problems and other deep models less explored.

In this work we use residual learning when optimizing the soft decision functions of NDF and apply the resulting model to age estimation. We also try to better understand the inference process of NDF by computing saliency maps based on the routing probabilities. Finally, considering the remarkable performance of NDF yet its lower popularity compared to traditional CNN, we provide an easy-to-use implementation based on Python and PyTorch to encourage the community to consider NDF for other vision tasks. In summary, our main contributions are twofold:

  • We employ residual learning to learn the complex soft decision functions of NDF for the first time. The trained model achieves state-of-the-art accuracy on the facial age estimation task while consuming less memory and computation.

  • We are the first to apply a network visualization technique to NDF, obtaining insightful observations about its inference process.

Figure 1: Illustration of the deep residual neural decision forest (RNDF). Input images are routed (red arrows) by the splitting nodes and arrive at the prediction given by the leaf nodes. A deep representation is extracted from the input and sent (blue arrows) to each splitting node for decision making. Residual learning is incorporated in the feature extraction process to help the optimization of complex soft decision functions. Here only one tree of depth two is drawn for simplicity.

2 Related Work

Divide-and-conquer for age estimation. To reduce the difficulty of facial age regression, past works tried to split the data space and learn multiple regressors [HumanPerform] or use traditional random forests [randomforest]. Recent works combine end-to-end deep representation learning with soft decision trees [NDF, Depth, DRFs] and have achieved state-of-the-art accuracy [DRFs]. However, [DRFs] employed a conventional feature extractor [VGG], which resulted in inefficient parameter usage, and residual learning was not attempted.

Residual learning. After being proposed in [resnet], residual learning has been widely adopted in CNN to solve various problems [resdenoise, VQA, H_module]. Nevertheless, it has not yet been used for NDF, which is another type of deep model different from traditional CNN.

Network visualization and explainable AI. Network visualization techniques [saliency, invert] have been used to understand the learned representation in CNN, but have not been applied to NDF, where a different type of inference is conducted via decision making. These works also focused on image classification rather than a regression problem like age estimation. Recent efforts tried to build explainable CNN models [interpret] and organize them by decision trees [inter_tree]. Our work differs from them since we do not add an extra loss function in the training phase. Our previous extended abstract [li2019visualizing] targeted image classification, and here we extend it to facial age regression.

3 Methodology

3.1 Residual Neural Decision Forest

A deep neural decision forest (NDF) is an ensemble of deep neural decision trees. Each tree consists of splitting nodes and leaf nodes. In general a tree can have unconstrained topology, but here we specify each tree as a full binary tree for simplicity. We index the nodes sequentially with integers as shown in Figure 1. A splitting node $S_i$ is associated with a recommendation (splitting) function $R_i$ that extracts deep features from the input $\mathbf{x}$ and gives the recommendation score (routing probability) $s_i = R_i(\mathbf{x})$ that the input is recommended (routed) to its left sub-tree. We call the unique path from the root node to a leaf node a computation path $\mathcal{P}$. Each leaf node $L_\ell$ stores one real-valued prediction vector $\mathbf{M}_\ell$ that represents the "answer" given by it. To get the final prediction $\hat{\mathbf{y}}$, each leaf node contributes its prediction vector weighted by the probability $w_\ell$ of taking its computation path as

$$\hat{\mathbf{y}} = \sum_{\ell \in \mathcal{N}_L} w_\ell \, \mathbf{M}_\ell, \tag{1}$$

where $\mathcal{N}_L$ is the set of all leaf nodes. The weight can be obtained by multiplying all the recommendation scores given by the splitting nodes along the path. Assume the path consists of a sequence of $q$ splitting nodes and one leaf node as $\mathcal{P} = \{S_{i_1}^{j_1}, S_{i_2}^{j_2}, \dots, S_{i_q}^{j_q}, L_\ell\}$, where the superscript of a splitting node denotes to which child node the input is routed. Here $j_m = 0$ means the input is routed to the left child and $j_m = 1$ otherwise. Then the weight can be expressed as

$$w_\ell = \prod_{m=1}^{q} \left(s_{i_m}\right)^{\mathbb{1}(j_m = 0)} \left(1 - s_{i_m}\right)^{\mathbb{1}(j_m = 1)}. \tag{2}$$

Note that the weights of all leaf nodes sum to 1 and the final prediction is hence a convex combination of all the prediction vectors of the leaf nodes. In addition, we assume the recommendation functions mentioned above are differentiable and parametrized by $\boldsymbol{\theta}_i$ at node $i$. Then the final prediction is a differentiable function with respect to all the parameters $\{\boldsymbol{\theta}_i\}$, which we omit above for clarity. A loss function defined upon the final prediction can hence be minimized with gradient descent.
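The routing computation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the released implementation: the heap-style node indexing (root at 0, children of node i at 2i+1 and 2i+2), the function names `leaf_weights` and `tree_prediction`, and the use of scalar leaf predictions are all assumptions made for the example.

```python
def leaf_weights(scores, depth):
    """Return the path weight of every leaf of one full binary tree.

    scores[i] is the probability that splitting node i routes the input
    to its left child. Heap indexing is an assumption of this sketch:
    the root is node 0 and node i has children 2*i + 1 and 2*i + 2.
    """
    assert len(scores) == 2 ** depth - 1
    weights = []
    for leaf in range(2 ** depth):
        node, w = 0, 1.0
        # Read the leaf index bit by bit: 0 means go left, 1 means go right.
        for level in reversed(range(depth)):
            go_right = (leaf >> level) & 1
            w *= (1.0 - scores[node]) if go_right else scores[node]
            node = 2 * node + 1 + go_right
        weights.append(w)
    return weights


def tree_prediction(scores, leaf_values, depth):
    """Convex combination of the leaf predictions (scalars in this sketch)."""
    w = leaf_weights(scores, depth)
    return sum(wi * vi for wi, vi in zip(w, leaf_values))


# Depth-two tree as in Figure 1: three splitting nodes, four leaves.
scores = [0.9, 0.2, 0.5]
ages = [20.0, 30.0, 40.0, 50.0]
w = leaf_weights(scores, depth=2)
print(w, sum(w), tree_prediction(scores, ages, depth=2))
```

Because every splitting node divides its incoming probability mass between its two children, the leaf weights sum to 1 regardless of the scores, which is what makes the prediction a convex combination.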

Similar to previous works [NDF, DRFs], we use a deep CNN to extract features from the input and assign each splitting node to one neuron of the last fully-connected layer, where the sigmoid function is used to compute the final recommendation scores. Specifically,

$$R_i(\mathbf{x}) = \sigma\big(g_i(f_n(f_{n-1}(\cdots f_1(\mathbf{x}))))\big), \tag{3}$$

where $f_k$ is the $k$th feature mapping function represented by one or multiple layers in deep neural networks, $g_i$ is the linear mapping function associated with the assigned neuron in the last fully-connected layer and $\sigma$ is the sigmoid function that enforces a valid range for the routing probability. Inspired by the success of residual learning in CNN [resnet], we hypothesize that learning residuals can help the learning of recommendation functions, and we specify the feature mapping functions as

$$f_k(\mathbf{z}) = \mathbf{z} + \mathcal{F}_k(\mathbf{z}), \tag{4}$$

where $\mathcal{F}_k$ denotes the residual mapping learned by the layers of the $k$th block. We call the NDF incorporating residual learning a residual neural decision forest (RNDF).
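As a rough illustration of a recommendation function built from residual feature mappings, the sketch below composes identity-plus-branch blocks and then applies a linear map and a sigmoid. The names `residual_block`, `recommendation_score` and `toy_branch` are hypothetical, and the toy branch merely stands in for a stack of convolutional layers.

```python
import math


def residual_block(z, branch):
    """Residual feature mapping: identity shortcut plus a learned branch."""
    return [zi + bi for zi, bi in zip(z, branch(z))]


def recommendation_score(x, branches, weight, bias):
    """sigmoid(linear(f_n(... f_1(x)))) for one splitting node."""
    z = x
    for branch in branches:
        z = residual_block(z, branch)
    logit = sum(w * zi for w, zi in zip(weight, z)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def toy_branch(z):
    """Hypothetical residual branch: a scaled ReLU instead of conv layers."""
    return [0.1 * max(zi, 0.0) for zi in z]


s = recommendation_score([1.0, -2.0], [toy_branch, toy_branch], [0.5, 0.25], 0.0)
print(s)  # a valid routing probability strictly between 0 and 1
```

When a branch outputs zeros, the block reduces to the identity, which is the property residual learning relies on to ease optimization.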

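For a regression problem where the dataset contains $N$ labeled instances $\{(\mathbf{x}_n, \mathbf{y}_n)\}_{n=1}^{N}$, we directly use the squared loss function,

$$\mathcal{L} = \frac{1}{2} \sum_{n=1}^{N} \left\| \hat{\mathbf{y}}_n - \mathbf{y}_n \right\|^2, \tag{5}$$

where $\hat{\mathbf{y}}_n$ is the final prediction for $\mathbf{x}_n$. The network parameters are optimized by gradient descent keeping the leaf node prediction vectors fixed. For a splitting node $i$, we denote the nodes in its left and right sub-trees as node sets $\mathcal{N}_i^{l}$ and $\mathcal{N}_i^{r}$, respectively. We denote the probability of recommending the input $\mathbf{x}_n$ to a leaf node $\ell$ as $P(\ell \mid \mathbf{x}_n)$, and the prediction vector stored at that leaf as $\mathbf{M}_\ell$. The gradient of the loss function with respect to the recommendation score $s_i(\mathbf{x}_n)$ of node $i$ is computed as

$$\frac{\partial \mathcal{L}}{\partial s_i(\mathbf{x}_n)} = \left(\hat{\mathbf{y}}_n - \mathbf{y}_n\right)^{\top} \left( \frac{\mathbf{A}_i^{l}}{s_i(\mathbf{x}_n)} - \frac{\mathbf{A}_i^{r}}{1 - s_i(\mathbf{x}_n)} \right), \tag{6}$$

where $\mathbf{A}_i^{l} = \sum_{\ell \in \mathcal{N}_i^{l}} P(\ell \mid \mathbf{x}_n)\, \mathbf{M}_\ell$ and $\mathbf{A}_i^{r} = \sum_{\ell \in \mathcal{N}_i^{r}} P(\ell \mid \mathbf{x}_n)\, \mathbf{M}_\ell$, with both sums running over the leaf nodes in the respective sub-tree. This gradient is back-propagated to the former layers to optimize the layer parameters with gradient descent.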
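The gradient with respect to a recommendation score can be checked numerically on a small tree. The sketch below assumes a loss of half the squared error and a closed form equal to the prediction error times (left-subtree contribution divided by s, minus right-subtree contribution divided by 1 - s); both are assumptions stated for this example, and the depth-two tree, the leaf values and all function names are hypothetical.

```python
def weights(s):
    """Leaf weights of a depth-two tree with routing scores s = [s0, s1, s2]."""
    s0, s1, s2 = s
    return [s0 * s1, s0 * (1 - s1), (1 - s0) * s2, (1 - s0) * (1 - s2)]


def loss(s, leaf_values, target):
    pred = sum(w * m for w, m in zip(weights(s), leaf_values))
    return 0.5 * (pred - target) ** 2


# Leaves found under the left / right child of each splitting node.
LEFT = {0: [0, 1], 1: [0], 2: [2]}
RIGHT = {0: [2, 3], 1: [1], 2: [3]}


def analytic_grad(s, leaf_values, target, node):
    w = weights(s)
    pred = sum(wi * m for wi, m in zip(w, leaf_values))
    a_left = sum(w[l] * leaf_values[l] for l in LEFT[node])
    a_right = sum(w[l] * leaf_values[l] for l in RIGHT[node])
    return (pred - target) * (a_left / s[node] - a_right / (1 - s[node]))


def numeric_grad(s, leaf_values, target, node, eps=1e-6):
    """Central finite difference of the loss with respect to s[node]."""
    hi = list(s)
    lo = list(s)
    hi[node] += eps
    lo[node] -= eps
    return (loss(hi, leaf_values, target) - loss(lo, leaf_values, target)) / (2 * eps)


s = [0.9, 0.2, 0.5]
leaf_values = [20.0, 30.0, 40.0, 50.0]
for node in range(3):
    print(node, analytic_grad(s, leaf_values, 35.0, node),
          numeric_grad(s, leaf_values, 35.0, node))
```

The two columns should agree to several decimal places, which is a quick sanity check before relying on a hand-derived gradient instead of automatic differentiation.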

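The leaf node prediction vectors are optimized by a leaf node update rule keeping the network parameters fixed. Several leaf node update rules are available [Depth, DRFs], and we adopt the one from [DRFs], which has a theoretical guarantee of loss reduction. Here a covariance matrix $\boldsymbol{\Sigma}_\ell$ is used at each leaf node $\ell$ to specify prediction uncertainty, and a Gaussian distribution is assumed with the prediction vector $\mathbf{M}_\ell$ used as the mean. The prediction vector and covariance matrix can be updated jointly as

$$\mathbf{M}_\ell^{t+1} = \frac{\sum_{n=1}^{N} \zeta_\ell^{t}(\mathbf{x}_n, \mathbf{y}_n)\, \mathbf{y}_n}{\sum_{n=1}^{N} \zeta_\ell^{t}(\mathbf{x}_n, \mathbf{y}_n)}, \qquad \boldsymbol{\Sigma}_\ell^{t+1} = \frac{\sum_{n=1}^{N} \zeta_\ell^{t}(\mathbf{x}_n, \mathbf{y}_n)\, (\mathbf{y}_n - \mathbf{M}_\ell^{t+1})(\mathbf{y}_n - \mathbf{M}_\ell^{t+1})^{\top}}{\sum_{n=1}^{N} \zeta_\ell^{t}(\mathbf{x}_n, \mathbf{y}_n)}, \tag{7}$$

where the weighting factor is computed as

$$\zeta_\ell^{t}(\mathbf{x}_n, \mathbf{y}_n) = \frac{p_\ell^{t}(\mathbf{y}_n)\, P(\ell \mid \mathbf{x}_n)}{\sum_{\ell' \in \mathcal{N}_L} p_{\ell'}^{t}(\mathbf{y}_n)\, P(\ell' \mid \mathbf{x}_n)}, \tag{8}$$

and $p_\ell^{t}(\mathbf{y})$ is the probability density given by the assumed leaf node Gaussian distribution,

$$p_\ell^{t}(\mathbf{y}) = \frac{1}{\sqrt{(2\pi)^{k} \det \boldsymbol{\Sigma}_\ell^{t}}} \exp\!\left(-\frac{1}{2} (\mathbf{y} - \mathbf{M}_\ell^{t})^{\top} (\boldsymbol{\Sigma}_\ell^{t})^{-1} (\mathbf{y} - \mathbf{M}_\ell^{t})\right). \tag{9}$$

Here $k$ is the dimension of $\mathbf{y}$, $t$ is the update iteration, $\mathcal{N}_L$ is the set of all leaf nodes and $P(\ell \mid \mathbf{x}_n)$ is the probability that $\mathbf{x}_n$ is routed to leaf $\ell$. The gradient descent and leaf node update rule are carried out alternately, and the training process is depicted in Algorithm 1.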
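For scalar ages, the leaf update can be sketched as a responsibility-weighted mean and variance computed with the routing probabilities held fixed. This assumes the rule adopted from [DRFs] takes that form; the function names, the `min_var` floor and the toy data are hypothetical and not taken from the released code.

```python
import math


def gaussian_pdf(y, mean, var):
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def update_leaves(targets, route_probs, means, variances, n_iter=20, min_var=1e-6):
    """Iterate the leaf update with the network (routing) held fixed.

    route_probs[n][l] is the probability that sample n reaches leaf l.
    min_var is a safeguard added for this sketch only.
    """
    means, variances = list(means), list(variances)
    for _ in range(n_iter):
        # Weighting factor: how much each leaf explains each sample.
        zeta = []
        for y, probs in zip(targets, route_probs):
            joint = [p * gaussian_pdf(y, m, v)
                     for p, m, v in zip(probs, means, variances)]
            total = sum(joint)
            zeta.append([j / total for j in joint])
        for l in range(len(means)):
            z_sum = sum(z[l] for z in zeta)
            means[l] = sum(z[l] * y for z, y in zip(zeta, targets)) / z_sum
            var = sum(z[l] * (y - means[l]) ** 2
                      for z, y in zip(zeta, targets)) / z_sum
            variances[l] = max(var, min_var)
    return means, variances


# Two leaves, four samples: young faces mostly reach leaf 0, old ones leaf 1.
targets = [20.0, 22.0, 60.0, 62.0]
route_probs = [[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]
means, variances = update_leaves(targets, route_probs, [30.0, 50.0], [100.0, 100.0])
print(means, variances)
```

On this toy data each leaf settles on the mean age of the samples routed to it, so the two leaves end up specializing in the young and old groups respectively.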

Input: training set $\mathcal{D}$, trainable network parameters $\Theta$, leaf node parameters $\Pi$, SGD batch number $n_B$
Initialize $\Theta$ and $\Pi$ randomly.
while not converged do
    $b \leftarrow 0$, fix $\Pi$
    while $b < n_B$ do
        Select a random batch from $\mathcal{D}$
        Update $\Theta$ by SGD (Eqn. 6)
        $b \leftarrow b + 1$
    end while
    Select a random batch and update $\Pi$ (Eqn. 7 and Eqn. 8)
end while
Algorithm 1 Training algorithm for RNDF

3.2 Decision Saliency Map

To understand how the input influences the decision-making of this model, we take the gradient of the routing probability $s_i$ at splitting node $i$ with respect to the input $\mathbf{x}$ and name it the decision saliency map (DSM),

$$\mathrm{DSM}_i(\mathbf{x}) = \frac{\partial s_i(\mathbf{x})}{\partial \mathbf{x}}. \tag{10}$$

The definition of DSM is inspired by the gradient-based network visualization technique of [saliency], but differs in that past work focused on traditional CNN and image classification, while here the saliency map is computed for NDF and age regression. NDF conducts inference by decision-making and saliency maps are more meaningful in this scenario. In our experiments we trace one computation path for each input and compute DSMs for each splitting node on the path. Multiple paths are available for the same input since there are multiple leaf nodes and trees. We take the path that contributes most, i.e., the path whose weight is the largest.
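To make the definition concrete, the sketch below computes the saliency of a single splitting node whose routing probability is a sigmoid of a linear function of the input, and picks the dominant path from a list of leaf weights. The linear feature extractor, the three-pixel "image" and the function names are assumptions of this example; with a deep network the same gradient would normally come from automatic differentiation.

```python
import math


def routing_prob(x, weight, bias):
    logit = sum(w * xi for w, xi in zip(weight, x)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def decision_saliency(x, weight, bias):
    """Gradient of the routing probability with respect to every input pixel."""
    s = routing_prob(x, weight, bias)
    return [s * (1.0 - s) * w for w in weight]


def numeric_saliency(x, weight, bias, eps=1e-6):
    """Finite-difference check of decision_saliency."""
    grads = []
    for k in range(len(x)):
        hi = list(x)
        lo = list(x)
        hi[k] += eps
        lo[k] -= eps
        grads.append((routing_prob(hi, weight, bias)
                      - routing_prob(lo, weight, bias)) / (2 * eps))
    return grads


def dominant_path(path_weights):
    """Index of the leaf whose computation path has the largest weight."""
    return max(range(len(path_weights)), key=lambda l: path_weights[l])


x = [0.2, 0.8, 0.5]
weight, bias = [1.5, -2.0, 0.0], 0.3
print(decision_saliency(x, weight, bias))
print(dominant_path([0.18, 0.72, 0.05, 0.05]))
```

A pixel with zero weight receives zero saliency, matching the intuition that regions the node ignores should not light up in the map.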

| Method | Year | Morph (MAE/CS) | FG-NET (MAE/CS) | CACD (MAE/CS) |
| --- | --- | --- | --- | --- |
| AGES [Method:Ages] | 2007 | 8.83/46.8% | 6.77/64.1% | -/- |
| LARR [LARR] | 2008 | -/- | 5.07/68.9% | -/- |
| IIS-LDL [LDL] | 2010 | -/- | 5.77/- | -/- |
| Rank [Method:rank] | 2010 | 6.49/49.1% | 5.79/66.5% | -/- |
| MTWGP [MTWGP] | 2010 | 6.28/52.1% | 4.83/72.3% | -/- |
| CAM [CAM] | 2011 | -/- | 4.12/73.5% | -/- |
| OHRank [OHRank] | 2011 | 6.07/56.3% | 4.48/74.4% | -/- |
| CPNN [LDL] | 2013 | -/- | 4.76/- | -/- |
| CA-SVR [CASVR] | 2013 | 5.88/57.9% | 4.67/74.5% | -/- |
| DIF [HumanPerform] | 2015 | -/- | 4.80/74.3% | -/- |
| Human Workers [HumanPerform] | 2015 | 6.30/51% | 4.70/69.5% | -/- |
| DLA [DLA] | 2015 | 4.77/63.4% | 4.26/- | -/- |
| Rothe et al. [Rothe_2016_CVPR] | 2016 | 3.45/- | 5.01/- | -/- |
| DEX [DEX] | 2016 | 3.25/- | 4.63/- | 4.785/- |
| dLDLF [dLDLF] | 2017 | 3.02/81.3% | -/- | 4.734/- |
| ARN [ARN] | 2017 | 3.00/- | -/- | -/- |
| DRFs [DRFs] | 2018 | 2.91/82.9% | 3.85/80.6% | 4.637/- |
| RNDF (Ours) | 2019 | 2.97/83.2% | 3.87/76.1% | 4.595/- |

Table 1: Mean absolute error (MAE) and cumulative score (CS) of different methods, reported as MAE/CS. Some previous works do not report both metrics.

4 Experiments

4.1 Datasets

We employ three public datasets for experimental evaluation:

  • FG-NET [FGNET] contains 1002 images taken from 82 subjects at different ages. These images have large variation in terms of pose, illumination and facial expression. We conduct leave-one-out cross validation as in previous works [DEX, DRFs]. In each experiment, images from 81 subjects are used as the training set and the model is validated on the images from the remaining subject. The experiments are repeated 82 times to validate on every subject and the final results are averaged.

  • MORPH [MORPH] contains 55134 annotated images from more than 13000 individuals of different races. We follow the previous work [DRFs] by selecting 5475 images (the list of selected images was released by the previous work at https://github.com/shenwei1231/caffe-DeepRegressionForests/blob/master/morph_setting1.list), randomly choosing 80% of them for training and validating on the remaining images. The experiments are repeated 5 times and the averaged results are reported.

  • CACD [CACD] is a large, challenging dataset containing 166417 images of 2000 celebrities collected from the Internet. The celebrities are grouped into three subsets: the training set containing 145275 images from 1800 celebrities, the testing set containing 10517 images from 120 celebrities and the validation set containing the remaining images from 80 celebrities. We train our model on the training subset and report its performance on the test subset.
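The leave-one-subject-out protocol used for FG-NET can be sketched as follows. The representation of the dataset as `(subject_id, image_name)` pairs, the function name and the toy file names are assumptions of this example and do not reflect the actual FG-NET file layout.

```python
from collections import defaultdict


def leave_one_subject_out(samples):
    """Yield (held_out_subject, train_images, val_images) for every subject.

    samples is a list of (subject_id, image_name) pairs.
    """
    by_subject = defaultdict(list)
    for subject, image in samples:
        by_subject[subject].append(image)
    for held_out in sorted(by_subject):
        train = [img for s, imgs in by_subject.items()
                 if s != held_out for img in imgs]
        yield held_out, train, by_subject[held_out]


toy = [("s01", "a.jpg"), ("s01", "b.jpg"), ("s02", "c.jpg"), ("s03", "d.jpg")]
for held_out, train, val in leave_one_subject_out(toy):
    print(held_out, train, val)
```

Splitting by subject rather than by image keeps all photos of the held-out person out of training, so the validation error is not flattered by identity leakage.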

| Method | MAE (CACD) | Model size | FLOPs |
| --- | --- | --- | --- |
| DRFs [DRFs] | 4.637 | 539.4MB | 16G |
| RNDF (Ours) | 4.595 | 112.4MB | 4G |

Table 2: Comparison of model size and FLOPs.

4.2 Implementation details


Pre-processing and data augmentation. We use facial landmarks to locate the face region and eliminate in-plane face rotation. Only the Morph dataset does not officially provide facial landmarks, so we use OpenCV and Dlib for face detection and alignment. When the detector fails we manually crop the face region. Finally we resize all images to 256 by 256 pixels and normalize them based on the computed mean and standard deviation of the three color channels before feeding them to the model. The training data is augmented by random horizontal flipping with probability 0.5 and random cropping so that the final spatial size of the inputs is 224 by 224. For testing images we only conduct central cropping.
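The crop and flip geometry described above can be written down without any image library. The sketch below only computes where to crop and whether to flip; the function names and the returned tuple layout are hypothetical, and actual pixel handling is left out.

```python
import random


def crop_box(size, crop, train, rng=random):
    """Return (left, top, flip) for a size x size image.

    Training: random crop position and horizontal flip with probability 0.5.
    Testing: central crop and no flip.
    """
    if train:
        left = rng.randint(0, size - crop)
        top = rng.randint(0, size - crop)
        flip = rng.random() < 0.5
    else:
        left = top = (size - crop) // 2
        flip = False
    return left, top, flip


def normalize(pixel, mean, std):
    """Channel-wise normalization of one RGB pixel."""
    return [(p - m) / s for p, m, s in zip(pixel, mean, std)]


print(crop_box(256, 224, train=False))  # central crop starts 16 pixels in
print(crop_box(256, 224, train=True, rng=random.Random(0)))
```

With 256-pixel inputs and 224-pixel crops, the random crop offset ranges over 0 to 32 pixels in each direction, while the central crop always starts at offset 16.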

Figure 2: Detailed architecture of RNDF. The input goes through four types of residual bottleneck blocks and two fully-connected layers, whose outputs (after sigmoid activation) are sent to the forest for decision making.

Model Architecture. The detailed model architecture is shown in Fig. 2. A ResNet-50-like [resnet] architecture is adopted, followed by two fully-connected layers whose final outputs are activated by the sigmoid function and sent to the forest to give the final prediction.

Hyper-parameter setting. For fair comparison with the previous work [DRFs], we use a forest of only 5 trees, each of depth 6. A batch size of 50 is used for back-propagation to train the network parameters, and we did not notice a significant performance discrepancy for batch sizes ranging from 15 to 100. We update the leaf node prediction vectors after every 50 batches of network parameter updates (the SGD batch number in Algorithm 1), where 500 samples are randomly drawn each time. Each leaf node update runs 20 iterations of the update rule.

Training settings. An SGD optimizer is used with momentum 0.9 and an initial learning rate of 0.5. We train for 40 epochs in every experiment on FG-NET, 100 epochs on Morph and 8 epochs on CACD. The learning rate is halved when training reaches a plateau, using the scheduler provided by PyTorch. Detailed settings can be found in our released code.
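The halving-on-plateau behaviour can be illustrated with a small stand-alone scheduler. This is a pure-Python sketch, not the PyTorch scheduler itself: the class name `HalveOnPlateau`, the `patience` value and the use of a validation loss as the monitored quantity are assumptions, since the text does not state them.

```python
class HalveOnPlateau:
    """Halve the learning rate when the monitored loss stops improving."""

    def __init__(self, lr=0.5, patience=2):
        self.lr = lr
        self.patience = patience  # hypothetical value, not given in the text
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr *= 0.5
                self.bad_epochs = 0
        return self.lr


scheduler = HalveOnPlateau(lr=0.5, patience=2)
history = [scheduler.step(loss) for loss in [5.0, 4.0, 4.0, 4.0, 4.0]]
print(history)
```

In this run the loss stops improving after the second epoch, and the rate drops from 0.5 to 0.25 once the number of stalled epochs exceeds the patience.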

4.3 Results

We use two widely used metrics to evaluate the accuracy of our model. Mean absolute error (MAE) is the average absolute error over the testing set, and cumulative score (CS) is the portion of testing images whose absolute error is smaller than a threshold. Following previous works [OHRank, DRFs] we use a threshold of 5 years.
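Both metrics are simple to compute from predicted and true ages. In the sketch below the function names and the toy ages are hypothetical, and counting an error exactly equal to the threshold as a hit is an assumption (the text says "smaller than", while CS is often defined with an inclusive bound).

```python
def mean_absolute_error(pred, truth):
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)


def cumulative_score(pred, truth, threshold=5.0):
    """Fraction of samples whose absolute error is within the threshold.

    Assumption: the boundary is inclusive (error <= threshold).
    """
    hits = sum(1 for p, t in zip(pred, truth) if abs(p - t) <= threshold)
    return hits / len(truth)


pred = [23.0, 41.0, 30.0, 58.0]
truth = [25.0, 35.0, 31.0, 50.0]
print(mean_absolute_error(pred, truth))  # 4.25 (errors are 2, 6, 1, 8)
print(cumulative_score(pred, truth))     # 0.5 (two of four within 5 years)
```

MAE summarizes the typical size of the error, while CS reports how often the prediction is acceptably close, so the two can rank methods differently, as Table 1 shows.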

The model accuracy on the three benchmarks is shown in Table 1 and compared with previous works. Our model achieves state-of-the-art level accuracy on all of the benchmarks and the lowest MAE on the largest dataset, CACD. The memory and computation efficiency of the model is compared with the previous most accurate model [DRFs] in Table 2, where our model is 4.8 times smaller and requires 4 times fewer FLOPs.

Figure 3: Decision saliency maps for CACD. Each row gives the decision-making process of one image, where the left-most image is the input and the others are DSMs along the computation path of the input. Each DSM is computed by taking derivative of the routing probability with respect to the input image. Model prediction and ground truth are given above the input image as (Pred, GT). (Na, Pb) means the input arrives at splitting node a with probability b during the decision-making process.
Figure 4: Decision saliency maps for Morph using the same annotation style as Figure 3.

The DSMs for the CACD and Morph datasets are shown in Fig. 3 and Fig. 4, respectively. It can be seen that for different inputs, the model learns to make decisions based on different regions. Note that the decision making is usually based on the skin region and the model learns to ignore irrelevant texture (the hair region and background) in most cases. Another observation is that the splitting nodes along the path usually "look at" similar facial regions; the reason may be that all the splitting nodes are associated with the same fully-connected layer in the model. Finally, we intentionally use different face sizes in the input (the face region occupies a larger portion of the input for CACD than for Morph) to show that the model is not sensitive to pre-processing.

5 Conclusion

In this work we employ residual learning, a successful technique validated by traditional convolutional neural networks, in deep neural decision forests (NDF) to learn the soft decision functions. Our model achieves state-of-the-art accuracy on a large facial age estimation dataset, requires less memory and is more computationally efficient. We also apply a network visualization technique to NDF to obtain a deeper understanding of the decision-making process of this model.

There are several directions for future research. Firstly, all the trees are limited to full binary trees and all the splitting nodes are restricted to one fully-connected layer in this study. It is an interesting question whether one can learn different tree topologies or associate splitting functions with different network layers. In that case the model can be more flexible and can combine information from different layers. Secondly, we only compute saliency maps for this model, and one can also employ other network visualization techniques for NDF. Finally, the network architecture in this work is a first trial and we expect future works to further improve the accuracy or compress the model size.