FP-Age: Leveraging Face Parsing Attention for Facial Age Estimation in the Wild

by Yiming Lin et al.
Imperial College London

Image-based age estimation aims to predict a person's age from facial images. It is used in a variety of real-world applications. Although end-to-end deep models have achieved impressive results for age estimation on benchmark datasets, their performance in the wild still leaves much room for improvement due to the challenges caused by large variations in head pose, facial expressions, and occlusions. To address this issue, we propose a simple yet effective method to explicitly incorporate facial semantics into age estimation, so that the model learns to focus on the most informative facial components in unaligned facial images regardless of head pose and non-rigid deformation. To this end, we design a face parsing-based network to learn semantic information at different scales and a novel face parsing attention module to leverage these semantic features for age estimation. To evaluate our method on in-the-wild data, we also introduce a new challenging large-scale benchmark called IMDB-Clean. This dataset is created by semi-automatically cleaning the noisy IMDB-WIKI dataset using a constrained clustering method. Through comprehensive experiments on IMDB-Clean and other benchmark datasets, under both intra-dataset and cross-dataset evaluation protocols, we show that our method consistently outperforms all existing age estimation methods and achieves a new state of the art. To the best of our knowledge, our work presents the first attempt at leveraging face parsing attention to achieve semantic-aware age estimation, which may be inspiring to other high-level facial analysis tasks.




I Introduction

Age estimation from facial images has been an active research topic in computer vision and can be utilised in a variety of real-world applications, such as forensics, security, health and well-being, and social media. There are several branches in this topic. In this work, we focus on the estimation of real/biological age, which is arguably the most difficult task compared with others such as apparent age estimation [1] or age group classification [2]. Predicting a person's age from facial images in the wild can be very challenging as it involves a variety of intrinsic and subtle factors such as pose, expression, gender, illumination and occlusions.

Recently, deep learning approaches have been widely employed to construct end-to-end age estimation models. Deep embeddings learnt from large-scale datasets are a very effective facial representation that has greatly improved the state of the art in automatic estimation of facial age. However, most deep models are not explicitly trained to learn facial semantic information such as eyes and noses, and therefore the extracted embedding may not appropriately attend to the more informative facial regions.

It has been shown that the most informative features for age estimation are located in local regions such as the eyes and mouth corners [3]. On the other hand, face parsing is designed to classify each pixel into different facial regions and to give the regional boundaries. Therefore, a Convolutional Neural Network (CNN) trained for face parsing could also pick up features around the facial regions that are useful for determining the age. Moreover, due to the hierarchical structure of CNNs, the intermediate features can encode both local and global information that can be fused for age estimation.

To this end, we propose FP-Age for leveraging features in a face parsing network for facial age estimation. In particular, we adopt both coarse and fine-grained features from a pre-trained face parsing network [4] to represent facial semantic information at different levels and build a small network on top of it to predict the age. To avoid the loss of details in the high-level features, we design a Face Parsing Attention (FPA) module to explicitly drive the network's attention to the more informative facial parts. The attended high-level features are then concatenated with the low-level features and fed into a small add-on network for age prediction. Since the semantic features are extracted using a pre-trained face parsing model, no additional face parsing annotations are required, and thus our FP-Age network can be trained in an end-to-end fashion, similar to other age estimation networks.

We have also developed a semi-automatic approach to clean the noisy data in IMDB-WIKI, leading to a new large-scale age estimation benchmark called IMDB-Clean. Our FP-Age network achieves state-of-the-art results on IMDB-Clean, as well as on several other age estimation datasets, under both intra-dataset and cross-dataset evaluation protocols. To the best of our knowledge, this is the first reported effort to adopt semantic facial information for age estimation through an attention mechanism over different facial regions. The idea of Face Parsing Attention may be inspiring to other facial analysis tasks too, and the proposed FP-Age network can be easily adapted to perform those tasks as well, such as facial gesture recognition and emotion recognition.

Our main contributions are as follows:

  • The IMDB-Clean dataset: a large-scale, clean image dataset for age estimation in the wild;

  • FP-Age: a simple yet effective framework that leverages facial semantic features for semantic-aware age estimation;

  • We also demonstrate that for age estimation, different facial parts have variable importance with “nose” being the least important region;

  • Our FP-Age achieves new state-of-the-art results on IMDB-Clean, Morph [5] and CACD [6];

  • When trained on IMDB-Clean, our FP-Age also achieves state-of-the-art results on KANFace [7], FG-Net [8], Morph [5] and CACD [6] under cross-dataset evaluation.

Fig. 1: FP-Age. A pre-trained face parsing framework [4] (top) is used to extract features of the target face in the input image. A lightweight network (bottom) aggregates low-level features, high-level features and face masks to predict the age.

II Related Work

II-A Image-based Biological Age Estimation

Early works on age estimation are mainly based on handcrafted features, and we refer interested readers to [9] for a detailed survey. Recently, deep learning techniques have achieved significantly improved performance in this field. In this section, we briefly explain several deep learning approaches to age estimation. They are roughly organised into four categories depending on how they model the problem: regression based, classification based, ranking based and label distribution based.

Regression approaches treat facial ageing as a regression problem and directly predict true age values from facial images. The Euclidean loss is therefore a popular choice among these methods. Yi et al. [10] adopted a mean squared loss to train a multi-scale CNN for age regression. Similarly, Wang et al. [11] applied the same loss on the representation obtained by fusing feature maps from different layers of a CNN.

In contrast to regression methods, classification-based works [2, 12] formulate age estimation as a multi-class classification problem and treat different ages as independent classes. Although such formulations make it easier to train CNNs, they ignore the correlations between different classes.

Ranking approaches exploit the ordinal property embedded in the ageing process. OR-CNN [13] formulated age estimation as an ordinal regression problem and built multiple binary classification neurons on top of a CNN. Ranking-CNN [14] ensembled a series of CNN-based binary classifiers and aggregated their predictions to obtain the estimated age. In SVRT [15], a triplet learning strategy was introduced into the ranking loss. CORAL [16] improved OR-CNN [13] by proposing the Consistent Rank Logits framework to address the problem of classifier inconsistency.

Label Distribution Learning (LDL) [17], in contrast, models the age prediction as a probability distribution over all potential age values. LDL-based methods have achieved the current state-of-the-art performance on various age estimation benchmarks. Dex [18, 19] proposed to take the expectation of the output distribution as the predicted age. MV-Loss [20] introduced the mean–variance loss to regularise the shape of the output distribution, complementing the cross-entropy loss. DLDL [21] and DLDL-v2 [22] represented the age label as a Gaussian distribution and applied the Kullback–Leibler divergence to measure the discrepancy between the output age distribution and the target label distribution. Shen et al. [23, 24] used an ensemble of decision trees in the LDL formulation. Akbari et al. proposed the distribution cognisant loss to regularise the predicted age distribution, improving the robustness against outliers. In this work, we follow the problem formulation of LDL-based methods, considering that they have consistently achieved most state-of-the-art results.

Noticeably, several approaches [26, 15, 22] apply pre-trained face recognition models as the initialisation of age estimation models, whereas we freeze the weights of the face parsing network to avoid unnecessary computational cost. Additionally, some works [2, 27, 28, 26] tackled age estimation simultaneously with other tasks, such as gender classification, through a multi-task framework, sharing representations across different tasks. Although our network also shares features, it differs from multi-task frameworks in that it requires no semantic labels; moreover, Face Parsing Attention is leveraged to transfer semantic-level knowledge.

II-B Face Parsing

Face parsing aims to classify each pixel in a facial image into different categories such as background, hair, eyes and nose. Earlier works [29, 30] used holistic priors and hand-crafted features. Deep learning has largely improved the performance of face parsing models. Liu et al. [31] combined CNNs with conditional random fields and proposed a multi-objective learning method to model pixel-wise likelihoods and label dependencies. Luo et al. [32] applied multiple Deep Belief Networks to detect facial parts and built a hierarchical face parsing framework. Jackson et al. employed facial landmarks as a shape constraint to guide Fully Convolutional Networks (FCNs) for face parsing. Multiple deep methods including CRFs, Recurrent Neural Networks (RNNs) and Generative Adversarial Networks (GANs) were integrated by the authors of [34] to formulate an end-to-end trainable face parsing model, in which facial landmarks also served as shape constraints for the segmentation predictions. The idea of leveraging shape priors to regularise segmentation masks can also be found in the Shape Constrained Network (SCN) [35] for eye segmentation. In [36], a spatial Recurrent Neural Network was used to model spatial relations within face segmentation masks. A spatial consensus learning technique was explored in [37] to model the relations between output pixels, while graph models were adopted in [38] to learn implicit relationships between facial components. To better utilise the temporal information of sequential data, the authors of [39] integrated ConvLSTM [40] with the FCN model [41] to simultaneously learn spatio-temporal information in face videos and to obtain temporally-smoothed face masks. In [42], a Reinforcement-Learning-based key scheduler was introduced to select online key frames for video face segmentation such that the overall efficiency can be globally optimised.

Most of these methods assume the target face has already been cropped out and well aligned. Moreover, they often ignore the hair class due to the unpredictable margins for cropping the hair region. To solve this, Lin et al. [43] proposed to warp the entire image using the Tanh function. However, the warping requires not only the facial bounding box but also the facial landmarks. Recently, the RoI Tanh-polar transform [4] has been proposed to solve face parsing in the wild. It warps the entire image to the Tanh-polar space and only requires the target bounding box. With the Tanh-polar representation, a simple FCN architecture already achieves state-of-the-art results [4]. The proposed FP-Age builds on top of this method.

III Methodology

The overall architecture of FP-Age is shown in Fig. 1. The network at the top is an off-the-shelf, pre-trained face parsing model [4] whose parameters are not updated during training. At the bottom is the proposed age estimation network, which contains the proposed face parsing attention module and some standard layers to predict the age. In this section, we formulate age estimation as a distribution learning problem and explain in detail the components of the proposed FP-Age.

III-A Problem Formulation

Let {(x_i, b_i, y_i)} denote a set of training example triplets, where x_i, b_i and y_i are the i-th input image, its target face bounding box, and its corresponding age label, respectively. The bounding box b_i is a four-dimensional tuple defined by the top-left and the bottom-right corners of the target face location. The age label y_i is an integer from a set of age labels Y. We denote the total number of age classes as K = |Y|.

Our goal is to learn a mapping function from the target face in x_i, specified by b_i, to the label y_i. When learning such a function using DNNs, one option is to set the last layer as a single output neuron and employ a Euclidean loss function. However, it has been shown [18, 21] that training such DNNs is relatively unstable: outliers can cause large errors. Another option is to formulate age estimation as a K-class classification problem and use one-hot encoding to represent age labels. But this formulation ignores the fact that faces with close ages share similar features, causing visual label ambiguity.
Considering the above, we formulate age estimation as a label distribution learning problem [17]. Specifically, we encode each scalar age label y_i as a probability distribution p_i over the label set Y. Each element p_i[k] represents the probability of the target face in x_i having the k-th label. A Gaussian distribution centred at y_i with a standard deviation σ is used to map y_i to p_i. We follow Gao et al. [22] in the choice of σ for all experiments.

Using this formulation, we use a Fully-Connected (FC) layer followed by a Softmax layer to map the DNN's output logits to the predicted distribution p̂_i. The learning problem becomes

    min_θ Σ_i L(f(x_i, b_i; θ), p_i),

where f is the DNN and θ denotes its parameters. L denotes a loss function. The predicted age ŷ_i is obtained by taking the expectation over p̂_i as ŷ_i = Σ_k k · p̂_i[k].
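To make the formulation concrete, here is a minimal numpy sketch of the label encoding and the expectation-based decoding described above. The 101-class label set (ages 0–100) and σ = 2 are illustrative assumptions, not necessarily the paper's exact hyper-parameters.

```python
import numpy as np

def encode_age(y, num_classes=101, sigma=2.0):
    """Encode a scalar age label as a discretised Gaussian distribution p."""
    ks = np.arange(num_classes)
    p = np.exp(-0.5 * ((ks - y) / sigma) ** 2)
    return p / p.sum()  # normalise so the probabilities sum to 1

def decode_age(p_hat):
    """Predicted age = expectation of the age index under the distribution."""
    return float(np.sum(np.arange(len(p_hat)) * p_hat))

p = encode_age(30)
print(round(decode_age(p), 2))  # -> 30.0; the expectation recovers an interior age
```

For ages far from the boundaries of the label range, the truncated Gaussian is effectively symmetric, so the expectation recovers the original label almost exactly.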

III-B Face Parsing Network

We use RTNet [4] for extracting face parsing features. RTNet has a simple FCN-like encoder–decoder architecture and achieves state-of-the-art results for in-the-wild face parsing tasks. The encoder contains residual convolutional layers for feature extraction, similar to the original ResNet-50 [44]. Two convolutional layers are used in the decoder to perform per-pixel classification to obtain the face masks. In the encoder, the first three convolutional layers gradually reduce the spatial resolution, and the last two layers use dilated convolutions [45] to aggregate multi-scale contextual information without reducing the resolution.

In contrast to traditional methods that require facial landmarks to align the faces, RTNet uses the RoI Tanh-polar transform to warp the entire image given the target bounding box. Some examples of the warping effect can be seen in Fig. 2. The warped representation not only retains all the information in the original image, but also amplifies the target face.

Fig. 2: The RoI Tanh-polar Transform [4] warps the whole image to a fixed-size representation in the Tanh-polar space, given the bounding box.

III-C Face Parsing Attention

As shown in Fig. 1, there are five feature maps produced by the encoder and one feature map given by the decoder. We take the third feature map in the encoder and denote it as the low-level feature F_l. We consider the only feature map in the decoder as the high-level feature and denote it as F_h. Lastly, we denote the output 11-channel face masks as M.

We first divide F_h into 11 groups along the channel dimension. The i-th group representation, after a convolution, is denoted as G_i for i = 1, …, 11. Next, we multiply each group with the corresponding mask channel:

    G̃_i = G_i ⊙ M_i.

The representations G̃_i are then concatenated along the channel dimension to get G̃. After that, we apply a channel attention block [46] to capture the dependencies between face regions. This block is formed by a sequence of layers: AvgPool, FC, ReLU, FC and Sigmoid, and it outputs the attention weights a. The final output of this module is F_a, where each feature group is obtained by

    F_a,i = a_i · G̃_i.
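The mask-weighted grouping and the channel attention block can be sketched as follows in numpy. The tensor shapes and the random matrices standing in for the learned convolution and FC layers are illustrative assumptions; only the sequence of operations (group, mask, AvgPool, FC, ReLU, FC, Sigmoid, re-weight) follows the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W, R = 44, 8, 8, 11              # channels, spatial size, face regions (illustrative)
F_h = rng.standard_normal((C, H, W))   # high-level decoder features
M = rng.random((R, H, W))              # soft face masks, one channel per region

# 1) Split F_h into R groups and weight each group by its region mask.
groups = np.split(F_h, R, axis=0)      # R groups of C // R channels each
G = np.concatenate([g * M[i] for i, g in enumerate(groups)], axis=0)

# 2) Channel attention: AvgPool -> FC -> ReLU -> FC -> Sigmoid
#    (random matrices stand in for the two learned FC layers).
z = G.mean(axis=(1, 2))                # global average pooling, shape (C,)
W1 = rng.standard_normal((C // 4, C))
W2 = rng.standard_normal((C, C // 4))
a = 1.0 / (1.0 + np.exp(-(W2 @ np.maximum(W1 @ z, 0.0))))

# 3) Re-weight each channel by its attention weight in (0, 1).
F_a = G * a[:, None, None]
print(F_a.shape)  # -> (44, 8, 8)
```

The bottleneck ratio of 4 in the two FC layers mirrors the common squeeze-and-excitation design [46]; the paper's actual ratio is an assumption here.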


III-D Age Estimation Network

After the face parsing attention module is applied, we concatenate F_a and F_l along the channel dimension, and apply a convolutional layer to reduce the number of channels. Next, residual blocks [44] are employed. Finally, we use an FC layer followed by a Softmax layer to map the output logits to the predicted distribution p̂_i. The predicted age is obtained by taking the expectation over p̂_i as ŷ_i = Σ_k k · p̂_i[k].


III-E Loss Function

We use the weighted sum of the Kullback–Leibler divergence and an L1 loss as our loss function for the i-th example:

    L_i = KL(p_i ‖ p̂_i) + λ |y_i − ŷ_i|,

where |·| denotes the absolute value and λ is a weight balancing the two terms. We empirically set λ for all examples following [21].
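As a sketch, the per-example loss can be computed as below in numpy: a KL term between the target and predicted distributions, plus a weighted L1 term between the label and the expected age. λ = 1 is an illustrative value, not the paper's.

```python
import numpy as np

def fp_age_loss(p, p_hat, y, lam=1.0, eps=1e-12):
    """KL(p || p_hat) plus lam * |y - y_hat|, where y_hat is the expected age."""
    kl = float(np.sum(p * (np.log(p + eps) - np.log(p_hat + eps))))
    y_hat = float(np.sum(np.arange(len(p_hat)) * p_hat))
    return kl + lam * abs(y - y_hat)

# A prediction that matches the target distribution incurs zero loss.
p = np.array([0.1, 0.8, 0.1])
print(round(fp_age_loss(p, p, y=1.0), 6))  # -> 0.0
```

Note that when p̂_i equals p_i, both the KL term and the L1 term vanish, since the expectation of the target distribution equals the encoded label.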

Fig. 3: Some examples from IMDB-WIKI [18] and our proposed IMDB-Clean. Each column shows the faces cropped from the same image using the groundtruth bounding boxes. The face detector used by IMDB-WIKI is biased towards middle-aged faces when encountering multiple faces, and fails for low-quality images. Our proposed semi-automatic cleaning method has corrected these errors (see Section IV-B for details).

IV Experimental Setup

IV-A Existing Datasets

IV-A1 IMDB-WIKI

The IMDB-WIKI dataset [18] is a large-scale dataset containing 523,051 images with age labels ranging from 0 to 100 years, crawled from IMDB and Wikipedia. These images, especially those in the IMDB subset, were mostly captured in the wild and are thus potentially useful for evaluating age estimation in real-world environments. However, the annotations of IMDB-WIKI are very noisy: the provided face box is often centred on the wrong person when multiple people are present in the same image. Because of this, IMDB-WIKI has only been used for pre-training by existing age estimation methods [19, 47, 20].

IV-A2 CACD

Cross-Age Celebrity Dataset (CACD) [6] is an in-the-wild dataset of facial images of celebrities. The images are divided into a training set, a validation set and a test set. We adopt the common practice originally used in [24] and report results on the test set obtained using models trained on the training set and the validation set.

IV-A3 KANFace

KANFace [7] is an in-the-wild dataset of images with ages ranging from 0 to 100 years. The images are extremely challenging due to large variations in pose, expression and lighting conditions. Since the authors do not provide splits, we use this dataset only as a test set and report evaluation results obtained by models trained on other datasets.

IV-A4 Morph

Morph [5] consists of mugshot images with ages ranging from 16 to 77 years (see Table I). Even though it is not an in-the-wild dataset, we report our results on it given its popularity. For intra-dataset evaluations, we follow the setting used in [48, 22, 49]: we randomly divide the dataset into two non-overlapping sets, a training set and a testing set. For cross-dataset evaluations, we use all images for testing.

IV-B Creating the IMDB-Clean Dataset

Although there have been efforts such as those reported in [50, 51] to manually clean the IMDB-WIKI dataset, many images still have incorrect annotations. This is mainly because the previous efforts either relied on simple heuristics to remove low-quality images [50], or asked human raters to annotate apparent ages for the images based on their visual perception [51]. The latter is a very difficult task, resulting in incorrect guesses due to low-quality images and very high-quality make-up.

To identify the source of noise, we revisited the annotation process for the images in the IMDB subset [18]. We concluded that a relatively weak face detector was used to provide bounding box labels and that, when multiple faces are encountered, the one with the highest detection score is selected.

The main problem with such an annotation process is that, when there are multiple faces, the adopted face detector [52] is biased towards large, frontal, middle-aged faces and gives them high scores. Another problem is that the utilised face detector fails to detect faces when the image has large variations in imaging quality, lighting and background, because it has not been trained on in-the-wild images. Some errors are shown in Fig. 3.

Based on the above analysis, we cleaned the dataset following the process below:

  1. For each subject, we use an advanced face detector, SFD [53], to detect all faces in all images of the target subject crawled from IMDB.

  2. We use FAN-Face [54] to map these face images into the face recognition embedding space.

  3. We then use a constrained version of the DBSCAN [55] clustering algorithm to cluster these faces. Here, cannot-link constraints are applied to faces occurring in the same images.

  4. Because the method can yield different results when the order of the input faces is changed, we repeat the clustering process multiple times using random ordering.

  5. After that, for each subject, we take the largest cluster obtained from all runs, and consider this to be the correct cluster containing the face images of the target subject.

  6. For each subject, if the second-largest cluster is larger than a threshold fraction of the largest cluster, we consider this an ambiguous case. These ambiguous cases are manually checked and filtered.

  7. Finally, we manually examine the dataset again to remove obvious mistakes caused by incorrect timestamps.

Fig. 3 shows some noisy examples and the cleaned results. Note that the above cleaning process is not applied to the WIKI subset because most identities in this subset have only one image crawled from their Wikipedia page.
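The cluster-selection rule in steps 5 and 6 above can be sketched as follows. DBSCAN is assumed to label noise points with -1 (as in common implementations), and `ambiguity_ratio` is a hypothetical stand-in for the paper's threshold, whose exact value is not given in this copy of the text.

```python
from collections import Counter

def pick_cluster(labels, ambiguity_ratio=0.5):
    """Return (largest_cluster_id, is_ambiguous) for one subject's face clusters.

    `labels` are cluster assignments (e.g. from DBSCAN, -1 = noise);
    `ambiguity_ratio` is a hypothetical threshold, not the paper's value.
    """
    sizes = Counter(l for l in labels if l != -1)  # ignore noise points
    ranked = sizes.most_common(2)
    top_id, top_n = ranked[0]
    second_n = ranked[1][1] if len(ranked) > 1 else 0
    return top_id, second_n > ambiguity_ratio * top_n

print(pick_cluster([0, 0, 0, 0, 1, 1, -1]))  # -> (0, False)
```

In the actual pipeline, `labels` would be accumulated over multiple clustering runs with shuffled input order, and ambiguous subjects would be routed to manual inspection.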

We refer to the cleaned dataset as IMDB-Clean, which contains images with age labels ranging from 0 to 97 years (see Table I). We split IMDB-Clean into three subject-independent sets: training, validation and testing. The distributions of these sets are shown in Fig. 4 and a comparison to other publicly available age datasets is given in Table I.

Dataset # Images # ID Age In-the-wild?
FG-Net [8] 0-69 Yes
Morph [5] 16-77 No
CACD [6] 14-62 Yes
KANFace [7] 0-100 Yes
IMDB-Clean (ours) 0-97 Yes

TABLE I: Comparison of age estimation datasets used.
Fig. 4: Age distributions of the proposed IMDB-Clean.

IV-C Evaluation Metrics

The performance of the models is measured by the Mean Absolute Error (MAE) and the Cumulative Score (CS). MAE is the average of the absolute errors between age predictions and groundtruth labels on the testing set; CS is calculated as CS(l) = N_l / N × 100%, where N is the total number of testing examples and N_l is the number of examples whose absolute error between the estimated age and the groundtruth age is not greater than l years. We report MAE and CS for all models.
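The two metrics can be computed as in the following minimal numpy sketch; l = 5 years is an illustrative CS threshold, as the value used in the tables is not stated in this copy.

```python
import numpy as np

def mae(pred, gt):
    """Mean Absolute Error over the testing set."""
    return float(np.mean(np.abs(np.asarray(pred, float) - np.asarray(gt, float))))

def cumulative_score(pred, gt, l=5):
    """Percentage of examples whose absolute error is not greater than l years."""
    err = np.abs(np.asarray(pred, float) - np.asarray(gt, float))
    return float(np.mean(err <= l) * 100.0)

pred, gt = [23, 30, 41, 60], [25, 30, 50, 58]
print(mae(pred, gt), cumulative_score(pred, gt, l=5))  # -> 3.25 75.0
```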

IV-D Implementation Details

We use the RoI Tanh-polar transform [4] to warp each input image to a fixed-resolution Tanh-polar representation. In the training stage, we apply image augmentation techniques including horizontal flipping, scaling, rotation and translation, as well as bounding box augmentations [4]. For all experiments, we employed the mini-batch SGD optimiser; the batch size, weight decay and momentum were kept fixed across experiments. The initial learning rate is 0.0001 and is gradually increased over the first epochs. The learning rate then decreases exponentially at each epoch, and training is stopped either when the MAE on the validation set stops decreasing for several epochs or when we reach 90 training epochs. During testing, the test image and its flipped copy are fed into the model and their predictions are averaged.
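The warm-up-then-decay schedule described above can be sketched as below. The peak learning rate, warm-up length and decay factor are hypothetical placeholders, since the exact numbers are not preserved in this copy of the text (only the 0.0001 initial rate and the 90-epoch cap are).

```python
def lr_at_epoch(epoch, base_lr=1e-4, peak_lr=1e-3, warmup_epochs=5, gamma=0.9):
    """Linear warm-up from base_lr to peak_lr, then exponential decay per epoch.

    peak_lr, warmup_epochs and gamma are hypothetical placeholders,
    not the paper's actual hyper-parameters.
    """
    if epoch < warmup_epochs:
        return base_lr + (peak_lr - base_lr) * epoch / warmup_epochs
    return peak_lr * gamma ** (epoch - warmup_epochs)

print(lr_at_epoch(0))   # -> 0.0001, the initial learning rate
print(lr_at_epoch(5))   # -> 0.001, the peak at the end of warm-up
```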

For fair comparison, we re-implemented the following models from scratch, using ResNet-18 [44] as the backbone: Dex [18, 19], OR-CNN [13], DLDL [21], DLDL-V2 [22] and MV-Loss [20]. The pre-processing, training and testing steps follow the above procedure. For the models with open-sourced training code, namely C3AE [47], SVRT [15], SSRNet [56] and Coral [16], we used their default training setups and hyper-parameters. RetinaFace [57] was applied to detect facial landmarks (left and right eye centres, nose tip, left and right mouth corners). The input images were aligned using these landmarks with the method proposed in SSRNet (https://github.com/shamangary/SSR-Net/blob/master/data/TYY_MORPH_create_db.py) and then resized.

V Experiments

V-A Can Face Parsing Masks Help?

As a motivational example, we first test whether existing age estimation methods can benefit from facial part segmentation. This is done by simply stacking the face parsing masks onto the input image and using the resulting 14-channel tensor as the input to the models. In this experiment, we re-train three state-of-the-art methods, Dex, DLDL-V2 and MV-Loss, with the modified 14-channel input and test the models on IMDB-Clean. From Table II we observe that, by taking the stacked representation as input, all three models achieve better performance in terms of both MAE and CS.

Method MAE CS(%)
Dex [18] 5.34 58.31
Dex with stacked input 5.29 58.61
DLDL-V2 [22] 5.19 54.28
DLDL-V2 with stacked input 5.12 55.14
MV-Loss [20] 5.27 53.97
MV-Loss with stacked input 5.13 59.74
TABLE II: Stacking images and face masks helps (evaluated on IMDB-Clean).

V-B Which Face Parsing Features to Use?

We study which face parsing features are more informative for age estimation. We remove the face parsing attention module in FP-Age and take the face parsing features directly as input. We use four kinds of features as input: 1) low-level; 2) high-level; 3) stacked low- and high-level; and 4) stacked low-level, high-level and masks.

From Table III, we observe that using high-level features gives worse performance than using low-level features. This is consistent with earlier research [3], which argues that local features are more informative as they capture ageing patterns around the facial regions, such as the drooping skin around the eyes and the wrinkles around the mouth. On the other hand, due to the dilated convolutions in RTNet, the high-level features have a larger receptive field and thus details can be lost. Stacking low-level and high-level features gives better performance, which shows that these two types of features are complementary and combining them helps the age estimation network.

We also observe that adding the masks further improves the model. This can be attributed to the fact that the face masks contain semantics about the different regions, and adding them as an explicit attention mechanism helps the model to effortlessly locate these regions and extract ageing patterns. Furthermore, our face parsing attention module yields better results than simple stacking, which we investigate in the following section.

Features from RTNet MAE CS(%)
Low-level 5.01 60.97
High-level 5.24 58.30
Stacking Low and High 4.96 61.01
Stacking Low, High and Masks 4.90 61.84
Full Model (with Face Parsing Attention) 4.68 63.78
TABLE III: Using Different Face Parsing Features for Age Estimation on IMDB-Clean

V-C Attention? Attention!

To provide a clearer picture of the function of the proposed face parsing attention module, we study the 11-class activation output of the Sigmoid layer. Specifically, we show the mean and standard deviation of the activations for images in the IMDB-Clean dataset in Fig. 5.

We observe that the network consistently gives higher attention weights to most inner facial regions, especially the eyes (“l-eye” and “r-eye”) and the mouth (“upper-lip”, “i-mouth” and “lower-lip”). This is in line with the observations reported in [3]. Interestingly, it can also be seen that the “background” class contributes more than the “skin” class. This could be attributed to the fact that the face parsing network classifies objects like beards, glasses and accessories as “background”, and such context information can give hints about the person's age.

We have also performed the same test on separate age groups and observed that the importance of the different facial regions follows the same trend as shown in Fig. 5. This means that the face parsing attention allows the model to focus on informative regions that are universally important for judging different ages. Although there are some works, such as [10, 58, 59, 60], that used attention, we are the first to present evidence that the network attends to specific facial parts and that such attention modelling improves age estimation.

Fig. 5: Attention weights for facial regions induced by the face parsing attention module on IMDB-Clean.

V-D Effectiveness of IMDB-Clean

We conduct experiments on the effectiveness of the proposed IMDB-Clean. Specifically, we train six models on three datasets: IMDB-Clean, IMDB-WIKI and CACD. We then test them directly on KANFace without any fine-tuning. For IMDB-WIKI, we randomly sampled images for training; for the other two datasets, we used their provided training splits. Table IV shows the cross-dataset evaluation results on KANFace. We observe that 1) all models improve when they are trained on our IMDB-Clean; and 2) our model outperforms the other methods when trained on IMDB-Clean and IMDB-WIKI, and is comparable to DLDL-V2 when trained on CACD.

Trained on:       IMDB-Clean        IMDB-WIKI         CACD
Method            MAE    CS(%)      MAE    CS(%)      MAE    CS(%)
DLDL [21]         9.84   37.37      12.19  27.20      11.66  29.20
DLDL-V2 [22]      8.05   41.74      11.46  28.83      10.88  30.66
Dex [19]          7.91   42.30      11.70  20.91      11.90  28.62
M-V Loss [20]     7.71   43.31      11.95  28.30      11.30  29.07
OR-CNN [13]       7.71   47.51      11.10  33.07      11.18  32.90
FP-Age (ours)     6.81   48.49      10.83  29.63      10.91  30.27

TABLE IV: Effectiveness of IMDB-Clean (Testing dataset: KANFace [7]).

V-E Comparison to the State of the Art

V-E1 Intra-Dataset Evaluation

In this section, the performance of the proposed FP-Age is compared with state-of-the-art age estimation methods under the intra-dataset evaluation protocol. Three benchmarks are used: IMDB-Clean, Morph and CACD. On IMDB-Clean, we train all the models from scratch on the same training set and test them on the testing set. For Morph and CACD, we only train our own models and compare their performance with the values reported for the other methods on the testing set.

The benchmarking results are shown in Table V. It can be seen that our model achieves state-of-the-art results on the IMDB-Clean dataset. When all models are trained under the same settings, our model achieves 4.68 in terms of MAE and 63.78% in terms of CS. Additionally, the results show that IMDB-Clean is quite challenging compared to other datasets such as Morph, where state-of-the-art MAEs are below 2. We provide a significance testing analysis in the Appendix, which shows that our results are significantly better than those of the other methods.

From Table VI, it can be seen that our model achieves state-of-the-art results on the Morph dataset. When trained directly on Morph, our model achieves 2.04 in terms of MAE and 92.8% in terms of CS. When pre-trained on IMDB-Clean and fine-tuned on Morph, FP-Age achieves an MAE of 1.90 and a CS of 93.7%, which is the new state-of-the-art result.

Table VII shows the results on the CACD dataset. Following the training protocol of CACD [23], we train our models on the training set and on the validation set separately, and report the MAE values on the testing set. Our model achieves 4.50 when trained on CACD-train and 5.62 when trained on CACD-val. As in the above experiments, when pre-trained on IMDB-Clean, our model improves to 4.33 and 4.95, respectively.

V-E2 Cross-Dataset Evaluation

To test the generalisation ability of different models, we conduct experiments under a cross-dataset evaluation protocol. Our results are compared with advanced models: SSRNet, C3AE, SVRT, DLDL, DLDL-V2, Coral, Dex, M-V Loss, and OR-CNN. We train all models on IMDB-Clean and test them on different testing datasets without fine-tuning. The results are summarised in Table VIII. It can be seen that when all models are trained on IMDB-Clean, the proposed FP-Age achieves the best results on most of the evaluation datasets.

Method MAE CS(%) Year
OR-CNN [13] 5.85 49.72 2016
DLDL [21] 6.04 56.94 2017
SSRNet [56] 7.08 27.87 2018
Dex [19] 5.34 58.61 2018
M-V Loss[20] 5.27 59.74 2018
DLDL-V2 [22] 5.19 54.28 2018
SVRT [15] 5.85 49.72 2019
C3AE [47] 6.75 47.98 2019
FP-Age (ours) 4.68 63.78 -
  • Bold indicates the best and italic the second best

  • Our results are statistically significant according to paired t-tests with Bonferroni correction (see the Appendix)

TABLE V: Intra-Dataset Evaluation on IMDB-Clean.
Method MAE CS(%) Year
Human workers [13] 6.30 51.0 2015
OR-CNN [13] 3.34 81.5 2016
DLDL [21] 2.42 - 2017
ARN [61] 3.00 - 2017
Ranking-CNN [14] 2.96 85.2 2017
M-V Loss[20] 2.41 91.2 2018
DLDL-V2 [22] 1.97 - 2018
BridgeNet [62] 2.38 - 2019
C3AE [47] 2.75 - 2019
AVDL [63] 1.94 - 2020
PML [49] 2.15 - 2021
DRF [24] 2.14 91.3 2021
FP-Age (ours) 2.04 92.8 -
FP-Age  (ours) 1.90 93.7 -
  • Bold indicates the best and italic the second best

  • pre-trained on IMDB-WIKI

  • pre-trained on MS-Celeb-1M

  • pre-trained on the proposed IMDB-Clean

TABLE VI: Intra-Dataset Evaluation on Morph [5].
Method Trained on Year
CACD-train CACD-val
Dex[19] 4.78 6.52 2018
DLDLF [23] 4.67 6.16 2018
DRF [24] 4.61 5.63 2021
FP-Age (ours) 4.50 5.62 -
FP-Age  (ours) 4.33 4.95 -
  • pre-trained on the proposed IMDB-Clean

TABLE VII: Intra-Dataset Evaluation (MAEs) on CACD [6].
FG-Net [8] Morph [5] KANFace [7] CACD-test [6]
Method MAE CS(%) MAE CS(%) MAE CS(%) MAE CS(%)
SSRNet  [56] 12.04 19.86 7.12 40.77 11.36 30.11 11.76 22.01
C3AE  [47] 11.23 27.34 7.03 41.81 10.41 31.71 12.71 16.14
SVRT  [15] 9.77 23.75 5.87 43.71 10.89 27.55 11.73 14.37
DLDL  [21] 11.40 24.05 6.07 33.06 9.84 37.37 6.53 55.12
Coral  [16] 6.12 45.61 6.13 42.33 7.88 39.01 12.58 11.38
Dex  [19] 6.52 41.52 5.63 53.03 7.91 42.30 6.08 55.94
DLDL-V2  [22] 6.65 42.41 5.10 55.64 8.05 41.74 5.92 57.39
M-V Loss  [20] 6.49 42.12 4.99 56.94 7.71 43.31 5.88 57.22
OR-CNN  [13] 6.44 40.72 5.04 60.87 7.71 47.51 5.83 62.47
Ours 5.60 48.80 4.67 60.54 6.81 48.49 5.60 60.91

  • inputs are pre-processed with 5-point face alignment

  • inputs are pre-processed with RoI Tanh-polar Transform [4]

TABLE VIII: Cross-Dataset Evaluation (Training set: IMDB-Clean).

VI Conclusion

In this paper, we have proposed a simple yet effective approach to exploiting face parsing semantics for age estimation. We have designed a framework to aggregate features from different levels of the face parsing network, and a novel face parsing attention module to explicitly introduce facial semantics into the age estimation network. To train the model, we have proposed a semi-automatic clustering method for cleaning the existing IMDB-WIKI dataset, and introduced the resulting IMDB-Clean dataset as a new in-the-wild benchmark. Thanks to the attention mechanism and the large-scale dataset, we have observed that the network focuses on certain facial parts when predicting ages; the nose region appears least informative for age estimation. Moreover, extensive experiments have shown that our model outperforms the current state-of-the-art methods on various datasets in both intra-dataset and cross-dataset evaluations. To the best of our knowledge, this is the first attempt at leveraging face parsing attention for age estimation. We hope our design could inspire readers to consider similar attention models for other deep face analysis tasks.

[Statistical Significance Analysis] We conduct paired t-tests on the Absolute Error (AE) over the testing set of IMDB-Clean between FP-Age and the other eight methods: OR-CNN [13], DLDL [21], SSRNet [56], Dex [19], M-V Loss [20], DLDL-V2 [22], SVRT [15] and C3AE [47]. Concretely, suppose there are $N$ images in the testing set; let $a_i$ denote the AE between the age predicted by FP-Age and the ground-truth age on the $i$-th testing image, and let $b_i$ denote the corresponding AE for another method. The difference for the $i$-th pair is defined as $d_i = a_i - b_i$. The t statistic is calculated as

$$t = \frac{\bar{d}}{s_d / \sqrt{N}},$$

where $\bar{d}$ and $s_d$ are the average and standard deviation of the $d_i$. We correct the p-values using Bonferroni correction. The alpha value is set to . From Table V and Table IX, we observe that our results are significantly better than those of the other methods; we can thus reject the null hypotheses.
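This procedure can be reproduced with standard library routines. The sketch below uses synthetic per-image errors (the real per-image AEs are not published here, so the data is hypothetical) and applies `scipy.stats.ttest_rel` followed by a Bonferroni correction for the eight comparisons.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical per-image absolute errors for FP-Age and one competing method;
# the competitor's errors are constructed to be systematically larger.
ae_fp_age = rng.gamma(shape=2.0, scale=2.0, size=1000)
ae_other = ae_fp_age + rng.gamma(shape=1.5, scale=1.0, size=1000)

# Paired t-test on the per-image differences d_i = a_i - b_i.
t_stat, p_value = stats.ttest_rel(ae_fp_age, ae_other)

# Bonferroni correction for 8 comparisons: multiply p by the number of
# tests (capped at 1); equivalently, compare raw p against alpha / 8.
n_tests = 8
p_corrected = min(p_value * n_tests, 1.0)
print(t_stat < 0 and p_corrected < 0.05)  # negative t: FP-Age has lower errors
```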

Method t-statistic p-value Corrected p-value
SSRNet [56] -137.73 0.00 0.00
C3AE [47] -84.66 0.00 0.00
DLDL [21] -66.44 0.00 0.00
Dex [19] -39.08 0.00 0.00
OR-CNN [13] -33.83 † †
DLDL-V2 [22] -31.83 † †
M-V Loss [20] -28.03 † †
SVRT [15] -22.89 † †
  • † indicates numerical underflow

TABLE IX: Paired t-Tests between FP-Age and Other Methods on IMDB-Clean.


Data cleaning and all experiments have been conducted at Imperial College London.


  • [1] S. Escalera, M. Torres Torres, B. Martinez, X. Baro, H. Jair Escalante, I. Guyon, G. Tzimiropoulos, C. Corneou, M. Oliu, M. Ali Bagheri, and M. Valstar, “Chalearn looking at people and faces of the world: Face analysis workshop and challenge 2016,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2016.
  • [2] G. Levi and T. Hassner, “Age and gender classification using convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2015.
  • [3] H. Han, C. Otto, X. Liu, and A. K. Jain, “Demographic Estimation from Face Images: Human vs. Machine Performance,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 6, pp. 1148–1161, Jun. 2015.
  • [4] Y. Lin, J. Shen, Y. Wang, and M. Pantic, “RoI Tanh-polar Transformer Network for Face Parsing in the Wild,” arXiv:2102.02717 [cs], Feb. 2021.
  • [5] K. Ricanek and T. Tesafaye, “MORPH: A longitudinal image database of normal adult age-progression,” in 7th International Conference on Automatic Face and Gesture Recognition (FGR06), Apr. 2006, pp. 341–345.
  • [6] B. Chen, C. Chen, and W. H. Hsu, “Face Recognition and Retrieval Using Cross-Age Reference Coding With Cross-Age Celebrity Dataset,” IEEE Transactions on Multimedia, vol. 17, no. 6, pp. 804–815, Jun. 2015.
  • [7] M. Georgopoulos, Y. Panagakis, and M. Pantic, “Investigating bias in deep face analysis: The KANFace dataset and empirical study,” Image and Vision Computing, vol. 102, p. 103954, Oct. 2020.
  • [8] A. Lanitis, C. J. Taylor, and T. F. Cootes, “Toward automatic simulation of aging effects on face images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 4, pp. 442–455, 2002.
  • [9] G. Panis, A. Lanitis, N. Tsapatsoulis, and T. F. Cootes, “Overview of research on facial ageing using the FG-NET ageing database,” IET Biometrics, vol. 5, no. 2, pp. 37–46, Jun. 2016.
  • [10] D. Yi, Z. Lei, and S. Z. Li, “Age estimation by multi-scale convolutional network,” in Computer Vision – ACCV 2014, 2014, pp. 144–158.
  • [11] X. Wang, R. Guo, and C. Kambhamettu, “Deeply-Learned Feature for Age Estimation,” in 2015 IEEE Winter Conference on Applications of Computer Vision, Jan. 2015, pp. 534–541.
  • [12] E. Eidinger, R. Enbar, and T. Hassner, “Age and gender estimation of unfiltered faces,” IEEE Transactions on Information Forensics and Security, vol. 9, no. 12, pp. 2170–2179, 2014.
  • [13] Z. Niu, M. Zhou, L. Wang, X. Gao, and G. Hua, “Ordinal Regression with Multiple Output CNN for Age Estimation,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2016, pp. 4920–4928.
  • [14] S. Chen, C. Zhang, M. Dong, J. Le, and M. Rao, “Using Ranking-CNN for Age Estimation,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul. 2017, pp. 742–751.
  • [15] W. Im, S. Hong, S.-E. Yoon, and H. S. Yang, “Scale-Varying Triplet Ranking with Classification Loss for Facial Age Estimation,” in Computer Vision – ACCV 2018, ser. Lecture Notes in Computer Science, C. Jawahar, H. Li, G. Mori, and K. Schindler, Eds., Cham, 2019, pp. 247–259.
  • [16] W. Cao, V. Mirjalili, and S. Raschka, “Rank consistent ordinal regression for neural networks with application to age estimation,” Pattern Recognition Letters, vol. 140, pp. 325–331, 2020.
  • [17] X. Geng, “Label Distribution Learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 7, pp. 1734–1748, Jul. 2016.
  • [18] R. Rothe, R. Timofte, and L. V. Gool, “DEX: Deep EXpectation of Apparent Age from a Single Image,” in 2015 IEEE International Conference on Computer Vision Workshop (ICCVW), Dec. 2015, pp. 252–257.
  • [19] R. Rothe, R. Timofte, and L. Van Gool, “Deep Expectation of Real and Apparent Age from a Single Image Without Facial Landmarks,” International Journal of Computer Vision, vol. 126, no. 2, pp. 144–157, Apr. 2018.
  • [20] H. Pan, H. Han, S. Shan, and X. Chen, “Mean-Variance Loss for Deep Age Estimation from a Face,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun. 2018, pp. 5285–5294.
  • [21] B. Gao, C. Xing, C. Xie, J. Wu, and X. Geng, “Deep Label Distribution Learning With Label Ambiguity,” IEEE Transactions on Image Processing, vol. 26, no. 6, pp. 2825–2838, Jun. 2017.
  • [22] B.-B. Gao, H.-Y. Zhou, J. Wu, and X. Geng, “Age Estimation Using Expectation of Label Distribution Learning,” in International Joint Conference on Artificial Intelligence, Stockholm, Sweden, Jul. 2018, pp. 712–718.
  • [23] W. Shen, Y. Guo, Y. Wang, K. Zhao, B. Wang, and A. Yuille, “Deep Regression Forests for Age Estimation,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun. 2018, pp. 2304–2313.
  • [24] ——, “Deep Differentiable Random Forests for Age Estimation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 2, pp. 404–419, Feb. 2021.
  • [25] A. Akbari, M. Awais, Z. Feng, A. Farooq, and J. Kittler, “Distribution Cognisant Loss for Cross-Database Facial Age Estimation with Sensitivity Analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2020.
  • [26] R. Ranjan, S. Sankaranarayanan, C. D. Castillo, and R. Chellappa, “An All-In-One Convolutional Neural Network for Face Analysis,” in 2017 12th IEEE International Conference on Automatic Face Gesture Recognition (FG 2017), May 2017, pp. 17–24.
  • [27] H. Han, A. K. Jain, F. Wang, S. Shan, and X. Chen, “Heterogeneous Face Attribute Estimation: A Deep Multi-Task Learning Approach,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 11, pp. 2597–2609, Nov. 2018.
  • [28] F. Wang, H. Han, S. Shan, and X. Chen, “Deep Multi-Task Learning for Joint Prediction of Heterogeneous Face Attributes,” in 2017 12th IEEE International Conference on Automatic Face Gesture Recognition (FG 2017), May 2017, pp. 173–179.
  • [29] J. Warrell and S. J. D. Prince, “Labelfaces: Parsing facial features by multiclass labeling with an epitome prior,” in 2009 IEEE International Conference on Image Processing (ICIP), 2009, pp. 2481–2484.
  • [30] B. M. Smith, L. Zhang, J. Brandt, Z. Lin, and J. Yang, “Exemplar-based face parsing,” in 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2013.
  • [31] Sifei Liu, J. Yang, Chang Huang, and M. Yang, “Multi-objective convolutional learning for face labeling,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3451–3459.
  • [32] P. Luo, X. Wang, and X. Tang, “Hierarchical face parsing via deep learning,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 2480–2487.
  • [33] A. S. Jackson, M. Valstar, and G. Tzimiropoulos, “A cnn cascade for landmark guided semantic part segmentation,” in Computer Vision – ECCV 2016, Springer.   Cham: Springer International Publishing, 2016, pp. 143–155.
  • [34] U. Güçlü, Y. Güçlütürk, M. Madadi, S. Escalera, X. Baró, J. González, R. van Lier, and M. A. van Gerven, “End-to-end semantic face segmentation with conditional random fields as convolutional, recurrent and adversarial networks,” arXiv preprint arXiv:1703.03305, 2017.
  • [35] B. Luo, J. Shen, S. Cheng, Y. Wang, and M. Pantic, “Shape constrained network for eye segmentation in the wild,” in 2020 IEEE Winter Conference on Applications of Computer Vision, 2020, pp. 1952–1960.
  • [36] L. J. Sifei Liu, Jianping Shi and M.-H. Yang, “Face parsing via recurrent propagation,” in Proceedings of the British Machine Vision Conference (BMVC), September 2017.
  • [37] I. Masi, J. Mathai, and W. AbdAlmageed, “Towards Learning Structure via Consensus for Face Segmentation and Parsing,” in 2020 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • [38] G. Te, Y. Liu, W. Hu, H. Shi, and T. Mei, “Edge-aware graph representation learning and reasoning for face parsing,” in Computer Vision – ECCV 2020, A. Vedaldi, H. Bischof, T. Brox, and J.-M. Frahm, Eds.   Cham: Springer International Publishing, 2020, pp. 258–274.
  • [39] Y. Wang, B. Luo, J. Shen, and M. Pantic, “Face mask extraction in video sequence,” International Journal of Computer Vision, vol. 127, no. 6-7, pp. 625–641, 2019.
  • [40] S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo, “Convolutional LSTM network: A machine learning approach for precipitation nowcasting,” in Advances in Neural Information Processing Systems, 2015, pp. 802–810.
  • [41] E. Shelhamer, J. Long, and T. Darrell, “Fully convolutional networks for semantic segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 640–651, Apr. 2017.
  • [42] Y. Wang, M. Dong, J. Shen, Y. Wu, S. Cheng, and M. Pantic, “Dynamic face video segmentation via reinforcement learning,” in 2020 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
  • [43] J. Lin, H. Yang, D. Chen, M. Zeng, F. Wen, and L. Yuan, “Face parsing with roi tanh-warping,” in 2019 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [44] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
  • [45] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” in ICLR, 2016.
  • [46] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [47] C. Zhang, S. Liu, X. Xu, and C. Zhu, “C3AE: Exploring the Limits of Compact Model for Age Estimation,” in 2019 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [48] H. Liu, J. Lu, J. Feng, and J. Zhou, “Ordinal Deep Feature Learning for Facial Age Estimation,” in 2017 12th IEEE International Conference on Automatic Face Gesture Recognition (FG 2017), May 2017, pp. 157–164.
  • [49] Z. Deng, H. Liu, Y. Wang, C. Wang, Z. Yu, and X. Sun, “PML: Progressive Margin Loss for Long-tailed Age Classification,” in 2021 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  • [50] G. Antipov, M. Baccouche, S. Berrani, and J. Dugelay, “Apparent Age Estimation from Face Images Combining General and Children-Specialized Deep Learning Models,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Jun. 2016, pp. 801–809.
  • [51] K. Zhang, C. Gao, L. Guo, M. Sun, X. Yuan, T. X. Han, Z. Zhao, and B. Li, “Age Group and Gender Estimation in the Wild With Deep RoR Architecture,” IEEE Access, vol. 5, pp. 22 492–22 503, 2017.
  • [52] M. Mathias, R. Benenson, M. Pedersoli, and L. Van Gool, “Face detection without bells and whistles,” in Computer Vision – ECCV 2014, 2014, pp. 720–735.
  • [53] S. Zhang, X. Zhu, Z. Lei, H. Shi, X. Wang, and S. Z. Li, “SFD: Single shot scale-invariant face detector,” in 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 192–201.
  • [54] J. Yang, A. Bulat, and G. Tzimiropoulos, “FAN-Face: A simple orthogonal improvement to deep face recognition,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, pp. 12 621–12 628, Apr. 2020.
  • [55] E. Schubert, J. Sander, M. Ester, H. P. Kriegel, and X. Xu, “DBSCAN revisited, revisited: Why and how you should (still) use DBSCAN,” ACM Trans. Database Syst., vol. 42, no. 3, Jul. 2017.
  • [56] T.-Y. Yang, Y.-H. Huang, Y.-Y. Lin, P.-C. Hsiu, and Y.-Y. Chuang, “SSR-Net: A compact soft stagewise regression network for age estimation,” in Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18.   International Joint Conferences on Artificial Intelligence Organization, 7 2018, pp. 1078–1084. [Online]. Available: https://doi.org/10.24963/ijcai.2018/150
  • [57] J. Deng, J. Guo, E. Ververas, I. Kotsia, and S. Zafeiriou, “Retinaface: Single-shot multi-level face localisation in the wild,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
  • [58] M. Angeloni, R. de Freitas Pereira, and H. Pedrini, “Age estimation from facial parts using compact multi-stream convolutional neural networks,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, Oct 2019.
  • [59] Z. Liao, S. Petridis, and M. Pantic, “Local deep neural networks for age and gender classification,” arXiv preprint arXiv:1703.08497, 2017.
  • [60] W. Pei, H. Dibeklioğlu, T. Baltrušaitis, and D. M. J. Tax, “Attended End-to-End Architecture for Age Estimation From Facial Expression Videos,” IEEE Transactions on Image Processing, vol. 29, pp. 1972–1984, 2020.
  • [61] E. Agustsson, R. Timofte, and L. V. Gool, “Anchored Regression Networks Applied to Age Estimation and Super Resolution,” in 2017 IEEE International Conference on Computer Vision (ICCV), Oct. 2017, pp. 1652–1661.
  • [62] W. Li, J. Lu, J. Feng, C. Xu, J. Zhou, and Q. Tian, “BridgeNet: A Continuity-Aware Probabilistic Network for Age Estimation,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2019, pp. 1145–1154.
  • [63] X. Wen, B. Li, H. Guo, Z. Liu, G. Hu, M. Tang, and J. Wang, “Adaptive Variance Based Label Distribution Learning for Facial Age Estimation,” in Computer Vision – ECCV 2020, A. Vedaldi, H. Bischof, T. Brox, and J.-M. Frahm, Eds., vol. 12368, Cham, 2020, pp. 379–395.