Age estimation from facial images has been an active research topic in computer vision, with a variety of real-world applications such as forensics, security, health and well-being, and social media. There are several branches in this topic. In this work, we focus on the estimation of real/biological age, which is arguably the most difficult task among others such as apparent age estimation or age group classification. Predicting a person's age from facial images in the wild can be very challenging as it involves a variety of intrinsic and subtle factors such as pose, expression, gender, illumination and occlusion.
Recently, deep learning approaches have been widely employed to construct end-to-end age estimation models. Deep embeddings learnt from large-scale datasets are a very effective facial representation that has greatly improved the state-of-the-art in automatic estimation of facial age. However, most deep models are not explicitly trained to learn facial semantic information such as eyes and noses, and therefore the extracted embeddings may not appropriately attend to the more informative facial regions.
It has been shown that the most informative features for age estimation are located in local regions such as the eyes and mouth corners. On the other hand, face parsing is designed to classify each pixel into different facial regions and to give the regional boundaries. Therefore, a Convolutional Neural Network (CNN) trained for face parsing could also pick up the features around the facial regions that are useful for determining the age. Moreover, due to the hierarchical structure of CNNs, the intermediate features can encode both local and global information that can be fused for age estimation.
To this end, we propose FP-Age for leveraging features in a face parsing network for facial age estimation. In particular, we adopt both coarse and fine-grained features from a pre-trained face parsing network to represent facial semantic information at different levels and build a small network on top of it to predict the age. To avoid the loss of details in the high-level features, we design a Face Parsing Attention (FPA) module to explicitly drive the network's attention to the more informative facial parts. The attended high-level features are then concatenated with the low-level features and fed into a small add-on network for age prediction. Since the semantic features are extracted using a pre-trained face parsing model, no additional face parsing annotations are required, and thus our FP-Age network can be trained in an end-to-end fashion, similar to other age estimation networks.
We have also developed a semi-automatic approach to clean the noisy data in IMDB-WIKI, leading to a new large-scale age estimation benchmark titled IMDB-Clean. Our FP-Age network achieves state-of-the-art results on IMDB-Clean, as well as on several other age estimation datasets, under both intra-dataset and cross-dataset evaluation protocols. To the best of our knowledge, this is the first reported effort to adopt semantic facial information for age estimation based on an attention mechanism over different facial regions. The idea of Face Parsing Attention can be inspiring to other facial analysis tasks too, and the proposed FP-Age network can be easily adapted to related tasks such as facial gesture recognition and emotion recognition.
Our main contributions are as follows:
The IMDB-Clean dataset: a large-scale, clean image dataset for age estimation in the wild;
FP-Age: a simple yet effective framework that leverages facial semantic features for semantic-aware age estimation;
We also demonstrate that for age estimation, different facial parts have variable importance, with “nose” being the least important region;
II Related Work
II-A Image-based Biological Age Estimation
Early works on age estimation are mainly based on handcrafted features, and we refer interested readers to  for a detailed survey. Recently, deep learning techniques have achieved significantly improved performance in this field. In this section, we briefly review several deep learning approaches to age estimation. They are roughly organised into four categories depending on how they model the problem: regression based, classification based, ranking based and label distribution based.
Regression approaches treat facial ageing as a regression problem and directly predict true age values from facial images. The Euclidean loss is therefore a popular choice among those methods. Yi et al. adopted a mean squared loss to train a multi-scale CNN for age regression. Similarly, Wang et al. applied the same loss to the representation obtained by fusing feature maps from different layers of a CNN.
In contrast to regression methods, classification-based works [2, 12] formulate age estimation as a multi-class classification problem and treat different ages as independent classes. Although such formulations make it easier to train CNNs, they ignore the correlations between different classes.
Ranking approaches exploit the ordinal property embedded in the ageing process. OR-CNN proposed to formulate age estimation as an ordinal regression problem and built multiple binary classification neurons on top of a CNN. Ranking-CNN ensembled a series of CNN-based binary classifiers and aggregated their predictions to obtain the estimated age. In SVRT, a triplet learning strategy was introduced into the ranking loss. CORAL improved OR-CNN by proposing the Consistent Rank Logits framework to address the problem of classifier inconsistency.
Label Distribution Learning (LDL), however, models the age prediction as a probability distribution over all potential age values. LDL-based methods have achieved the current state-of-the-art performance on various age estimation benchmarks. DEX [18, 19] proposed to take the expectation of the output distribution as the predicted age. MV-Loss introduced the mean–variance loss to regularise the shape of the output distribution, complementing the cross-entropy loss. DLDL and DLDL-v2 [23, 24] learn the label distribution directly by minimising the divergence between the predicted and the groundtruth distributions, while an ensemble of decision trees was combined with the LDL formulation in . Akbari proposed the distribution cognisant loss to regularise the predicted age distribution, improving the robustness against outliers. In this work, we follow the problem formulation of LDL-based methods, considering that they have consistently achieved most state-of-the-art results.
Some approaches involve applying pre-trained face recognition models as the initialisation of age estimation models; in contrast, we freeze the weights of the face parsing network to avoid unnecessary computational cost. Additionally, some works [2, 27, 28, 26] tackled age estimation simultaneously with other tasks like gender classification through a multi-task framework, sharing representations across different tasks. Although our network also shares features, it differs from multi-task frameworks as it requires no semantic labels, and Face Parsing Attention is leveraged to transfer semantic-level knowledge.
II-B Face Parsing
Face parsing aims to classify each pixel in a facial image into different categories such as background, hair, eyes and nose. Earlier works [29, 30] used holistic priors and hand-crafted features. Deep learning has largely improved the performance of face parsing models. Liu et al. combined CNNs with conditional random fields and proposed a multi-objective learning method to model pixel-wise likelihoods and label dependencies. Luo et al. applied multiple Deep Belief Networks to detect facial parts and built a hierarchical face parsing framework. Jackson et al. employed facial landmarks as a shape constraint to guide Fully Convolutional Networks (FCNs) for face parsing. Multiple deep methods including CRFs, Recurrent Neural Networks (RNNs) and Generative Adversarial Networks (GANs) were integrated in  to formulate an end-to-end trainable face parsing model, while facial landmarks also served as shape constraints for the segmentation predictions. The idea of leveraging shape priors to regularise segmentation masks can also be found in the Shape Constrained Network (SCN) for eye segmentation. In , a spatial Recurrent Neural Network was used to model spatial relations within face segmentation masks. A spatial consensus learning technique was explored in  to model the relations between output pixels, while graph models were adopted in  to learn implicit relationships between facial components. To better utilise the temporal information of sequential data, the authors of  integrated ConvLSTM with the FCN model to simultaneously learn the spatio-temporal information in face videos and to obtain temporally-smoothed face masks. In , a Reinforcement-Learning-based key scheduler was introduced to select online key frames for video face segmentation such that the overall efficiency can be globally optimised.
Most of those methods assume the target face has already been cropped out and is well aligned. Moreover, they often ignore the hair class due to the unpredictable margins for cropping the hair region. To solve this, Lin et al. proposed to warp the entire image using the Tanh function. However, the warping still requires not only the facial bounding box but also the facial landmarks. Recently, the RoI Tanh-polar transform has been proposed to solve face parsing in the wild. The RoI Tanh-polar transform warps the entire image to the Tanh-polar space, and the only requirement is the target bounding box. With the Tanh-polar representation, a simple FCN architecture has already achieved state-of-the-art results. The proposed FP-Age builds on top of this method.
The overall architecture of FP-Age is shown in Fig. 1. The network at the top is an off-the-shelf, pre-trained face parsing model whose parameters are not updated during training. At the bottom is the proposed age estimation network that contains the proposed face parsing attention module and some standard operational layers to predict the age. In this section, we formulate age estimation as a label distribution learning problem and explain the components of the proposed FP-Age in detail.
III-A Problem Formulation
Let $\mathcal{D} = \{(x_i, b_i, y_i)\}_{i=1}^{N}$ denote a set of training example triplets, where $x_i$, $b_i$ and $y_i$ are the $i$-th input image, its target face bounding box, and its corresponding age label, respectively. The bounding box $b_i$ is a four-dimensional tuple defined by the top-left and the bottom-right corners of the target face location. The age label $y_i$ is an integer from a set of age labels $\mathcal{Y}$. We denote the total number of age classes as $K = |\mathcal{Y}|$.
Our goal is to learn a mapping function from the target face in $x_i$, specified by $b_i$, to the label $y_i$. When learning such a function using DNNs, one way is to set the last layer as one output neuron and employ a Euclidean loss function. However, it has been shown [18, 21] that training such DNNs is relatively unstable: outliers can cause large errors. Another way is to formulate age estimation as a $K$-class classification problem and use one-hot encoding to represent the age labels. However, this formulation ignores the fact that faces with close ages share similar features, causing visual label ambiguity.
Considering the above, we formulate age estimation as a label distribution learning problem. Specifically, we encode each scalar age label $y_i$ as a probability distribution $\mathbf{p}_i$ over the label set $\mathcal{Y}$. Each element $p_{i,k}$ of $\mathbf{p}_i$ represents the probability of the target face in $x_i$ having the $k$-th label. A Gaussian distribution centred at $y_i$ with a standard deviation $\sigma$ is used to map $y_i$ to $\mathbf{p}_i$. We follow Gao et al. and fix $\sigma$ in all experiments.
Using this formulation, we use a Fully-Connected (FC) layer followed by a Softmax layer to map the DNN's output logits to the predicted distribution $\hat{\mathbf{p}}_i$. The learning problem becomes
$$\min_{\theta} \sum_{i=1}^{N} \mathcal{L}\big(\mathbf{p}_i, f(x_i, b_i; \theta)\big),$$
where $f$ is the DNN, $\theta$ denotes its corresponding parameters, and $\mathcal{L}$ denotes a loss function. The predicted age is obtained by taking the expectation over $\hat{\mathbf{p}}_i$ as $\hat{y}_i = \sum_{k=0}^{K-1} k\,\hat{p}_{i,k}$.
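As a minimal sketch of this formulation, the snippet below encodes an age label as a Gaussian label distribution and recovers a predicted age as the expectation of a softmax distribution. The number of classes (101) and $\sigma = 1$ are illustrative placeholders, not necessarily the paper's exact settings:

```python
import numpy as np

def encode_label(y, num_classes=101, sigma=1.0):
    """Encode a scalar age label as a Gaussian label distribution over all classes."""
    ages = np.arange(num_classes)
    p = np.exp(-0.5 * ((ages - y) / sigma) ** 2)
    return p / p.sum()  # normalise so the distribution sums to 1

def predict_age(logits):
    """Map logits to a distribution with softmax, then take the expectation as the age."""
    z = logits - np.max(logits)          # subtract the max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return float(np.dot(np.arange(len(probs)), probs))
```

For example, `encode_label(30)` peaks at age 30, and a uniform distribution over ages 0 to 100 yields a predicted age of 50.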
III-B Face Parsing Network
We use RTNet for extracting face parsing features. RTNet has a simple FCN-like encoder-decoder architecture and achieves state-of-the-art results on in-the-wild face parsing tasks. The encoder contains residual convolutional layers for feature extraction, similar to the original ResNet-50. Two convolutional layers are used in the decoder to perform per-pixel classification and obtain the face masks. In the encoder, the first three convolutional layers gradually reduce the spatial resolution, and the last two layers use dilated convolutions to aggregate multi-scale contextual information without further reducing the resolution.
In contrast to traditional methods that require facial landmarks to align the faces, RTNet uses the RoI Tanh-polar transform to warp the entire image given the target bounding box. Some examples of the warping effect are shown in Fig. 2. The warped representation not only retains all the information in the original image, but also amplifies the target face.
III-C Face Parsing Attention
As shown in Fig. 1, there are five feature maps produced by the encoder and one feature map given by the decoder. We take the third feature map in the encoder and denote it as the low-level feature $F_{low}$. We consider the only feature map in the decoder as the high-level feature and denote it as $F_{high}$. Lastly, we denote the output multi-channel face masks as $M$, with one channel per face parsing class.
We first divide $F_{high}$ into $C$ groups along the channel dimension, where $C$ is the number of face parsing classes. The $c$-th group representation, after convolution, is denoted as $G_c$ for $c = 1, \dots, C$. Next, we multiply each group with the corresponding mask channel $M_c$:
$$\tilde{G}_c = G_c \odot M_c.$$
The representations $\tilde{G}_c$ are then concatenated along the channel dimension to get $\tilde{F}$. After that, we apply a channel attention block to capture the dependencies between face regions. This block is formed by a sequence of layers: AvgPool, FC, ReLU, FC and Sigmoid, and the output attention weights are denoted as $\mathbf{a}$. The final output of this module is $F_{att}$, where each feature group is obtained by
$$F_{att,c} = a_c \cdot \tilde{G}_c.$$
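The face parsing attention described above can be sketched with NumPy as follows. This is a simplified sketch: the per-group convolutions are omitted for brevity, and the FC weight shapes (`w1`, `w2` and the hidden size) are illustrative assumptions rather than the paper's exact configuration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def face_parsing_attention(high, masks, w1, b1, w2, b2):
    """
    high:  (C_total, H, W) high-level features; C_total must be divisible
           by the number of face parsing classes K.
    masks: (K, H, W) per-class face parsing masks.
    w1/b1, w2/b2: weights of the two FC layers in the attention block,
           producing one attention weight per feature group.
    """
    C_total, H, W = high.shape
    K = masks.shape[0]
    g = C_total // K                               # channels per group
    grouped = high.reshape(K, g, H, W)
    # multiply each feature group with its corresponding mask channel
    attended = grouped * masks[:, None, :, :]
    # channel attention: AvgPool -> FC -> ReLU -> FC -> Sigmoid
    pooled = attended.reshape(C_total, H, W).mean(axis=(1, 2))  # (C_total,)
    hidden = np.maximum(0.0, w1 @ pooled + b1)                  # ReLU
    weights = sigmoid(w2 @ hidden + b2)                         # (K,) one weight per group
    out = attended * weights[:, None, None, None]
    return out.reshape(C_total, H, W)
```

Regions whose masks are zero contribute nothing to the output, which is how the module steers the network towards the segmented facial parts.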
III-D Age Estimation Network
After the face parsing attention module is applied, we concatenate $F_{low}$ and $F_{att}$ along the channel dimension, and apply a convolutional layer to reduce the channel number. Next, residual blocks are employed. Finally, we use an FC layer followed by a Softmax layer to map the output logits to the predicted distribution $\hat{\mathbf{p}}$. The predicted age $\hat{y}$ is obtained by taking the expectation over $\hat{\mathbf{p}}$.
III-E Loss Function
We use the weighted sum of the Kullback–Leibler divergence and the L1 loss as our loss function for the $i$-th example:
$$\mathcal{L}_i = D_{\mathrm{KL}}\big(\mathbf{p}_i \,\|\, \hat{\mathbf{p}}_i\big) + \lambda\,\big|y_i - \hat{y}_i\big|,$$
where $|\cdot|$ denotes taking the absolute value and $\lambda$ is a weight balancing the two terms. We empirically set $\lambda$ to a fixed value for all examples.
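A minimal sketch of this loss, assuming the distributions are given as NumPy arrays; the default `lam=1.0` is a placeholder for the empirically chosen $\lambda$:

```python
import numpy as np

def fp_age_loss(p_true, p_pred, y_true, lam=1.0, eps=1e-12):
    """KL divergence between the target and predicted distributions,
    plus lam times the L1 error between the label and the expected age."""
    kl = float(np.sum(p_true * (np.log(p_true + eps) - np.log(p_pred + eps))))
    y_hat = float(np.dot(np.arange(len(p_pred)), p_pred))  # expected age
    return kl + lam * abs(y_true - y_hat)
```

When the predicted distribution matches the target and the expected age matches the label, the loss is zero; both terms grow as the prediction drifts away.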
IV Experimental Setup
IV-A Existing Datasets
The IMDB-WIKI dataset is a large-scale dataset containing 523,051 images with age labels ranging from 0 to 100 years old. The images were crawled from IMDB and Wikipedia and are split into an IMDB subset and a Wikipedia subset. These images, especially those in the IMDB subset, were mostly captured in the wild and are thus potentially useful for evaluating age estimation in real-world environments. However, the annotations of IMDB-WIKI are very noisy: the provided face box is often centred around the wrong person when multiple people are present in the same image. Because of this, IMDB-WIKI has only been used for pre-training by existing age estimation methods [19, 47, 20].
Cross-Age Celebrity Dataset (CACD) is an in-the-wild dataset of facial images of celebrities. The images are divided into a training set, a validation set and a test set with disjoint identities. We adopt the common practice originally used in  and report results on the test set obtained using the models trained on the training set and the validation set.
KANFace is an in-the-wild dataset of images with ages ranging from 0 to 100 years. The images are extremely challenging due to large variations in pose, expression and lighting conditions. Since the authors do not provide splits, we use this dataset only as a test set and report the evaluation results obtained by models trained on other datasets.
Morph consists of mugshot images covering a wide age range. Even though it is not an in-the-wild dataset, we report our results on it given its popularity. For intra-dataset evaluations, we follow the setting used in [48, 22, 49]: we randomly divide the dataset into two non-overlapping sets, a training set and a testing set. For cross-dataset evaluations, we use all images for testing.
IV-B Creating the IMDB-Clean Dataset
Although there have been previous efforts to manually clean the IMDB-WIKI dataset, many images still have incorrect annotations. This is mainly because the previous efforts either relied on simple heuristics to remove low-quality images, or asked human raters to annotate apparent ages for the images based on their visual perception. The latter is a very difficult task, resulting in incorrect guesses due to low image quality and heavy make-up.
To identify the source of noise, we revisited the annotation process for the images in the IMDB subset . We concluded that a relatively weak face detector was used to provide bounding box labels and that, when multiple faces are encountered, the one with the highest detection score is selected.
The main problem with such an annotation process is that, when there are multiple faces, the adopted face detector is biased towards large, frontal, middle-aged faces and gives high scores to them. Another problem is that the utilised face detector fails to detect faces when the image has large variations in imaging quality, lighting and background, because it has not been trained on in-the-wild images. Some errors are shown in Fig. 3.
Based on the above analysis, we cleaned the dataset following the process below:
For each subject, we use an advanced face detector SFD  to detect all faces in all images of the target subject crawled from IMDB.
We use FAN-Face  to map these face images into the face recognition embedding space.
We then use a constrained version of the DBSCAN  clustering algorithm to cluster these faces. Here, cannot-link constraints are applied to faces occurring in the same images.
Because the method can yield different results when the order of the input faces is changed, we repeat the clustering process multiple times using random ordering.
After that, for each subject, we take the largest cluster obtained from all runs, and consider this to be the correct cluster containing the face images of the target subject.
For one subject, if the second largest cluster exceeds a certain proportion of the largest cluster, we consider this an ambiguous case. These ambiguous cases are manually checked and filtered.
Finally, we manually examine the dataset again to remove obvious mistakes caused by incorrect timestamps.
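The largest-cluster selection and the ambiguity check from the steps above might be sketched as follows. The function name and the `ambiguity_ratio` threshold are hypothetical, since the exact ratio used in the pipeline is not given here:

```python
from collections import Counter

def select_identity_cluster(runs, ambiguity_ratio=0.6):
    """
    runs: list of clustering results, one per random-order DBSCAN run;
          each result is a list of (face_id, cluster_label), label -1 = noise.
    Returns (face_ids_of_largest_cluster, is_ambiguous).
    """
    best = []
    for labels in runs:
        counts = Counter(l for _, l in labels if l != -1)
        if not counts:
            continue
        top_label, _ = counts.most_common(1)[0]
        cluster = [f for f, l in labels if l == top_label]
        if len(cluster) > len(best):
            best = cluster  # keep the largest cluster over all runs
    # flag for manual inspection if some run has a second cluster
    # close in size to the chosen one
    ambiguous = False
    for labels in runs:
        counts = Counter(l for _, l in labels if l != -1)
        if len(counts) >= 2:
            sizes = sorted(counts.values(), reverse=True)
            if sizes[1] > ambiguity_ratio * len(best):
                ambiguous = True
    return best, ambiguous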
Fig. 3 shows some noisy examples and the cleaned results. Note that the above cleaning process is not applied to the WIKI subset because most identities in this subset have only one image crawled from their Wikipedia page.
We refer to the cleaned dataset as IMDB-Clean, which contains images of subjects with age labels ranging from to . We split IMDB-Clean into three subject-independent sets: training, validation and testing. The distributions of these sets are shown in Fig. 4 and a comparison to other publicly available age datasets is given in Table I.
IV-C Evaluation Metrics
The performance of the models is measured by the Mean Absolute Error (MAE) and the Cumulative Score (CS). MAE is the average of the absolute errors between the age predictions and the groundtruth labels on the testing set; CS is calculated as $\mathrm{CS}(l) = \frac{n_l}{n} \times 100\%$, where $n$ is the total number of testing examples and $n_l$ is the number of examples whose absolute error between the estimated age and the groundtruth age is not greater than $l$ years. We report MAE and CS for all models.
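Both metrics can be computed directly from the predictions; the threshold `l = 5` below is a common choice in the literature and serves only as a placeholder for the exact value used here:

```python
import numpy as np

def mae(pred, true):
    """Mean Absolute Error between predicted and groundtruth ages."""
    return float(np.mean(np.abs(np.asarray(pred) - np.asarray(true))))

def cumulative_score(pred, true, l=5):
    """Percentage of examples with absolute error not greater than l years."""
    err = np.abs(np.asarray(pred) - np.asarray(true))
    return float(100.0 * np.sum(err <= l) / len(err))
```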
IV-D Implementation Details
We use RoI Tanh-polar transform  to warp each input image to a Tanh-polar representation of resolution . In the training stage, we apply image augmentation techniques including horizontal flipping, scaling, rotation and translation, as well as bounding box augmentations . For all experiments, we employed mini-batch SGD optimiser. The batch size, the weight decay and the momentum were set to , and , respectively. The initial learning rate is 0.0001 and gradually increases to in epochs. Then the learning rate decreases exponentially at each epoch and the training is stopped either when the MAE on the validation set stops decreasing for epochs or we reach 90 training epochs. During testing, the test image and its flipped copy are fed into the model and their predictions are averaged.
The same backbone was used for the re-implemented baselines, and the pre-processing, training and testing steps follow the above procedure. For the models with open-sourced training code, C3AE, SVRT, SSRNet and Coral, we used their default training setups and hyper-parameters. RetinaFace was applied to detect facial landmarks (left and right eye centres, nose tip, left and right mouth corners). The input images were aligned using these landmarks with the method proposed in SSRNet (https://github.com/shamangary/SSR-Net/blob/master/data/TYY_MORPH_create_db.py) and then resized.
V-A Can Face Parsing Masks Help?
As a motivational example, we first test whether existing age estimation methods can benefit from facial part segmentation. This is done by simply stacking the face parsing masks onto the input image and using the resulting 14-channel tensor as the input to the models. In this experiment, we re-train three state-of-the-art methods, Dex, DLDL-V2 and MV-Loss, with the modified 14-channel input and test the models on IMDB-Clean. From Table II we observe that by taking the stacked representation as input, all three models achieve better performance in terms of both MAE and CS.
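The stacking described here amounts to channel-wise concatenation of the RGB image with the 11 face parsing masks, giving a 14-channel input (the spatial size below is illustrative):

```python
import numpy as np

# image: a (3, H, W) RGB input; masks: the (11, H, W) face parsing masks
image = np.random.rand(3, 64, 64)
masks = np.random.rand(11, 64, 64)

# stack along the channel dimension to form the 14-channel input tensor
stacked = np.concatenate([image, masks], axis=0)
```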
V-B Which Face Parsing Features to Use?
We study which face parsing features are more informative for age estimation. We remove the face parsing attention module from FP-Age and take the face parsing features directly as input. We use four kinds of features as input: 1) low-level; 2) high-level; 3) stacking low and high; and 4) stacking low, high and masks.
From Table III, we observe that using high-level features gives worse performance than using low-level features. This is consistent with earlier research which argues that local features are more informative as they capture ageing patterns around the facial regions, such as the drooping skin around the eyes and the wrinkles around the mouth. On the other hand, due to the dilated convolutions in RTNet, the high-level features have a larger receptive field and thus details can be lost. Stacking low-level and high-level features gives better performance, which shows that these two types of features are complementary and combining them helps the age estimation network.
We also observe that adding the masks further improves the model. This can be attributed to the fact that the face masks contain semantics about the different regions, and adding them as an explicit attention signal helps the model to locate these regions and extract ageing patterns. Furthermore, our face parsing attention module yields better results than simple stacking, which we investigate in the following section.
| Features from RTNet | MAE | CS (%) |
| --- | --- | --- |
| Stacking Low and High | 4.96 | 61.01 |
| Stacking Low, High and Masks | 4.90 | 61.84 |
| Full Model (with Face Parsing Attention) | 4.68 | 63.78 |
V-C Attention? Attention!
To provide a clearer picture of the function of the proposed face parsing attention module, we study the 11-class activation output of the Sigmoid layer. Specifically, we show the mean and standard deviation of the activations for images in the IMDB-Clean dataset in Fig. 5.
We observe that the network consistently gives higher attention weights to most inner facial regions, especially the eyes (“l-eye” and “r-eye”) and mouth (“upper-lip”, “i-mouth”, and “lower-lip”). This is in line with the observations reported in . Interestingly, it can also be seen that the “background” class contributes more than the “skin” class. This could be attributed to the fact that the face parsing network classifies objects like “beard”, “glasses” and “accessories” as “background”, and such context information could give hints about the person's age.
We have also performed the same test on separate age groups and observed that the importance of different facial regions follows the same trend as shown in Fig. 5. This means that the face parsing attention allows the model to focus on informative regions that are universally important for judging different ages. Although there are some works such as [10, 58, 59, 60] that used attention, we are the first to present evidence that the network attends to specific facial parts and that such attention modelling improves age estimation.
V-D Effectiveness of IMDB-Clean
We conduct experiments on the effectiveness of the proposed IMDB-Clean. Specifically, we train six models on three datasets: IMDB-Clean, IMDB-WIKI and CACD. We then directly test them on KANFace without any fine-tuning. For IMDB-WIKI, we randomly sampled images for training; for the other two datasets, we used their provided training splits. Table IV shows the cross-dataset evaluation results on KANFace. We observe that 1) all models improve when they are trained on our IMDB-Clean; 2) our model outperforms the other methods when trained on IMDB-Clean or IMDB-WIKI, and is comparable to DLDL-V2 when trained on CACD.
V-E Comparison to the State-of-the-art
V-E1 Intra-Dataset Evaluation
In this section, the performance of the proposed FP-Age is compared with the state-of-the-art age estimation methods under the intra-dataset evaluation protocol. Three benchmarks are used: IMDB-Clean, Morph and CACD. On IMDB-Clean, we train all the models from scratch on the same training set and test them on the testing set. For Morph and CACD, we only train our own models and compare the performance with the reported values for the other methods on the testing set.
The benchmarking results are shown in Table V. It can be seen that our model achieves state-of-the-art results on the IMDB-Clean dataset when all models are trained under the same settings. Additionally, the results show that IMDB-Clean is quite challenging compared to other datasets such as Morph, on which state-of-the-art MAEs are considerably lower. We provide a significance testing analysis in Appendix VI which shows our results are significantly better than those of the other methods.
From Table VI, it can be seen that our model achieves state-of-the-art results on the Morph dataset when directly trained on Morph. When pre-trained on IMDB-Clean and fine-tuned on Morph, FP-Age sets a new state-of-the-art result.
Table VII shows the results on the CACD dataset. Following the training protocols of CACD, we train our models with both the training set and the validation set, and report the MAE values on the test set. Our model achieves competitive MAEs when trained on CACD-train and on CACD-val, and, similar to the above experiments, pre-training on IMDB-Clean further improves the results.
V-E2 Cross-Dataset Evaluation
To test the generalisation ability of different models, we conduct experiments under a cross-dataset evaluation protocol. Our results are compared with advanced models: SSRNet, C3AE, SVRT, DLDL, DLDL-V2, Coral, Dex, MV-Loss, and OR-CNN. We train all models on IMDB-Clean and test them on different testing datasets without fine-tuning. The results are summarised in Table VIII. It can be seen that when all models are trained on IMDB-Clean, the proposed FP-Age achieves the best results on most of the evaluation datasets.
[Tables V–VIII report the benchmarking results on IMDB-Clean, Morph, CACD, FG-Net and KANFace; bold indicates the best result and italic the second best. Footnotes distinguish models pre-trained on IMDB-WIKI, MS-Celeb-1M or the proposed IMDB-Clean, and whether inputs are pre-processed with 5-point face alignment or the RoI Tanh-polar transform.]
In this paper, we have proposed a simple yet effective approach for exploiting face parsing semantics for age estimation. We have designed a framework to aggregate features from different levels of a face parsing network, and proposed a novel face parsing attention module to explicitly introduce facial semantics into the age estimation network. To train the model, we proposed a semi-automatic clustering method for cleaning an existing dataset and introduced the resulting IMDB-Clean dataset as a new in-the-wild benchmark. Thanks to the attention mechanism and the large-scale dataset, we have observed that the network focuses on certain facial parts when predicting ages; the nose region appears least informative for age estimation. Moreover, extensive experiments have shown that our model outperforms the current state-of-the-art methods on various datasets in both intra-dataset and cross-dataset evaluations. To the best of our knowledge, this is the first attempt at leveraging face parsing attention for age estimation. We hope our design can inspire similar attention models for other deep face analysis tasks.
[Statistical Significance Analysis] We conduct paired t-tests on the Absolute Errors (AE) on the testing set of IMDB-Clean between FP-Age and the other eight methods: OR-CNN, DLDL, SSRNet, Dex, MV-Loss, DLDL-V2, SVRT and C3AE. Concretely, suppose there are $n$ images in the testing set; then $a_i$ is the AE between the predicted age of FP-Age and the groundtruth age on the $i$-th testing image, and $b_i$ is the corresponding AE for another method. The difference for the $i$-th pair is defined as $d_i = a_i - b_i$. The t statistic is calculated as
$$t = \frac{\bar{d}}{s_d / \sqrt{n}},$$
where $\bar{d}$ and $s_d$ are the average and the standard deviation of the differences $d_i$. We correct the p-values using Bonferroni correction with a fixed alpha level. From Table V and Table IX, we observe our results are significantly better than those of the other methods. We can, thus, reject the null hypotheses.
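The t statistic above can be computed from the per-image absolute errors of the two methods as follows (a sketch using the sample standard deviation):

```python
import math

def paired_t_statistic(errs_a, errs_b):
    """Paired t-test statistic on per-image absolute errors of two methods."""
    d = [a - b for a, b in zip(errs_a, errs_b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance of d
    return mean / math.sqrt(var / n)
```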
Data cleaning and all experiments have been conducted at Imperial College London.
-  S. Escalera, M. Torres Torres, B. Martinez, X. Baro, H. Jair Escalante, I. Guyon, G. Tzimiropoulos, C. Corneou, M. Oliu, M. Ali Bagheri, and M. Valstar, “ChaLearn Looking at People and Faces of the World: Face analysis workshop and challenge 2016,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2016.
-  G. Levi and T. Hassner, “Age and gender classification using convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2015.
-  H. Han, C. Otto, X. Liu, and A. K. Jain, “Demographic Estimation from Face Images: Human vs. Machine Performance,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 6, pp. 1148–1161, Jun. 2015.
-  Y. Lin, J. Shen, Y. Wang, and M. Pantic, “RoI Tanh-polar Transformer Network for Face Parsing in the Wild,” arXiv:2102.02717 [cs], Feb. 2021.
-  K. Ricanek and T. Tesafaye, “MORPH: A longitudinal image database of normal adult age-progression,” in 7th International Conference on Automatic Face and Gesture Recognition (FGR06), Apr. 2006, pp. 341–345.
-  B. Chen, C. Chen, and W. H. Hsu, “Face Recognition and Retrieval Using Cross-Age Reference Coding With Cross-Age Celebrity Dataset,” IEEE Transactions on Multimedia, vol. 17, no. 6, pp. 804–815, Jun. 2015.
-  M. Georgopoulos, Y. Panagakis, and M. Pantic, “Investigating bias in deep face analysis: The KANFace dataset and empirical study,” Image and Vision Computing, vol. 102, p. 103954, Oct. 2020.
-  A. Lanitis, C. J. Taylor, and T. F. Cootes, “Toward automatic simulation of aging effects on face images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 4, pp. 442–455, 2002.
-  G. Panis, A. Lanitis, N. Tsapatsoulis, and T. F. Cootes, “Overview of research on facial ageing using the FG-NET ageing database,” IET Biometrics, vol. 5, no. 2, pp. 37–46, Jun. 2016.
-  D. Yi, Z. Lei, and S. Z. Li, “Age estimation by multi-scale convolutional network,” in Computer Vision – ACCV 2014, 2014, pp. 144–158.
-  X. Wang, R. Guo, and C. Kambhamettu, “Deeply-Learned Feature for Age Estimation,” in 2015 IEEE Winter Conference on Applications of Computer Vision, Jan. 2015, pp. 534–541.
-  E. Eidinger, R. Enbar, and T. Hassner, “Age and gender estimation of unfiltered faces,” IEEE Transactions on Information Forensics and Security, vol. 9, no. 12, pp. 2170–2179, 2014.
-  Z. Niu, M. Zhou, L. Wang, X. Gao, and G. Hua, “Ordinal Regression with Multiple Output CNN for Age Estimation,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2016, pp. 4920–4928.
-  S. Chen, C. Zhang, M. Dong, J. Le, and M. Rao, “Using Ranking-CNN for Age Estimation,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul. 2017, pp. 742–751.
-  W. Im, S. Hong, S.-E. Yoon, and H. S. Yang, “Scale-Varying Triplet Ranking with Classification Loss for Facial Age Estimation,” in Computer Vision – ACCV 2018, ser. Lecture Notes in Computer Science, C. Jawahar, H. Li, G. Mori, and K. Schindler, Eds., Cham, 2019, pp. 247–259.
-  W. Cao, V. Mirjalili, and S. Raschka, “Rank consistent ordinal regression for neural networks with application to age estimation,” Pattern Recognition Letters, vol. 140, pp. 325–331, 2020.
-  X. Geng, “Label Distribution Learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 7, pp. 1734–1748, Jul. 2016.
-  R. Rothe, R. Timofte, and L. V. Gool, “DEX: Deep EXpectation of Apparent Age from a Single Image,” in 2015 IEEE International Conference on Computer Vision Workshop (ICCVW), Dec. 2015, pp. 252–257.
-  R. Rothe, R. Timofte, and L. Van Gool, “Deep Expectation of Real and Apparent Age from a Single Image Without Facial Landmarks,” International Journal of Computer Vision, vol. 126, no. 2, pp. 144–157, Apr. 2018.
-  H. Pan, H. Han, S. Shan, and X. Chen, “Mean-Variance Loss for Deep Age Estimation from a Face,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun. 2018, pp. 5285–5294.
-  B. Gao, C. Xing, C. Xie, J. Wu, and X. Geng, “Deep Label Distribution Learning With Label Ambiguity,” IEEE Transactions on Image Processing, vol. 26, no. 6, pp. 2825–2838, Jun. 2017.
-  B.-B. Gao, H.-Y. Zhou, J. Wu, and X. Geng, “Age Estimation Using Expectation of Label Distribution Learning,” in International Joint Conference on Artificial Intelligence, Stockholm, Sweden, Jul. 2018, pp. 712–718.
-  W. Shen, Y. Guo, Y. Wang, K. Zhao, B. Wang, and A. Yuille, “Deep Regression Forests for Age Estimation,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun. 2018, pp. 2304–2313.
-  W. Shen, Y. Guo, Y. Wang, K. Zhao, B. Wang, and A. Yuille, “Deep Differentiable Random Forests for Age Estimation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 2, pp. 404–419, Feb. 2021.
-  A. Akbari, M. Awais, Z. Feng, A. Farooq, and J. Kittler, “Distribution Cognisant Loss for Cross-Database Facial Age Estimation with Sensitivity Analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2020.
-  R. Ranjan, S. Sankaranarayanan, C. D. Castillo, and R. Chellappa, “An All-In-One Convolutional Neural Network for Face Analysis,” in 2017 12th IEEE International Conference on Automatic Face Gesture Recognition (FG 2017), May 2017, pp. 17–24.
-  H. Han, A. K. Jain, F. Wang, S. Shan, and X. Chen, “Heterogeneous Face Attribute Estimation: A Deep Multi-Task Learning Approach,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 11, pp. 2597–2609, Nov. 2018.
-  F. Wang, H. Han, S. Shan, and X. Chen, “Deep Multi-Task Learning for Joint Prediction of Heterogeneous Face Attributes,” in 2017 12th IEEE International Conference on Automatic Face Gesture Recognition (FG 2017), May 2017, pp. 173–179.
-  J. Warrell and S. J. D. Prince, “Labelfaces: Parsing facial features by multiclass labeling with an epitome prior,” in 2009 IEEE International Conference on Image Processing (ICIP), 2009, pp. 2481–2484.
-  B. M. Smith, L. Zhang, J. Brandt, Z. Lin, and J. Yang, “Exemplar-based face parsing,” in 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2013.
-  S. Liu, J. Yang, C. Huang, and M.-H. Yang, “Multi-objective convolutional learning for face labeling,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3451–3459.
-  P. Luo, X. Wang, and X. Tang, “Hierarchical face parsing via deep learning,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 2480–2487.
-  A. S. Jackson, M. Valstar, and G. Tzimiropoulos, “A cnn cascade for landmark guided semantic part segmentation,” in Computer Vision – ECCV 2016, Springer. Cham: Springer International Publishing, 2016, pp. 143–155.
-  U. Güçlü, Y. Güçlütürk, M. Madadi, S. Escalera, X. Baró, J. González, R. van Lier, and M. A. van Gerven, “End-to-end semantic face segmentation with conditional random fields as convolutional, recurrent and adversarial networks,” arXiv preprint arXiv:1703.03305, 2017.
-  B. Luo, J. Shen, S. Cheng, Y. Wang, and M. Pantic, “Shape constrained network for eye segmentation in the wild,” in 2020 IEEE Winter Conference on Applications of Computer Vision, 2020, pp. 1952–1960.
-  S. Liu, J. Shi, J. Liang, and M.-H. Yang, “Face parsing via recurrent propagation,” in Proceedings of the British Machine Vision Conference (BMVC), September 2017.
-  I. Masi, J. Mathai, and W. AbdAlmageed, “Towards Learning Structure via Consensus for Face Segmentation and Parsing,” in 2020 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
-  G. Te, Y. Liu, W. Hu, H. Shi, and T. Mei, “Edge-aware graph representation learning and reasoning for face parsing,” in Computer Vision – ECCV 2020, A. Vedaldi, H. Bischof, T. Brox, and J.-M. Frahm, Eds. Cham: Springer International Publishing, 2020, pp. 258–274.
-  Y. Wang, B. Luo, J. Shen, and M. Pantic, “Face mask extraction in video sequence,” International Journal of Computer Vision, vol. 127, no. 6-7, pp. 625–641, 2019.
-  X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo, “Convolutional LSTM network: A machine learning approach for precipitation nowcasting,” in Advances in Neural Information Processing Systems, 2015, pp. 802–810.
-  E. Shelhamer, J. Long, and T. Darrell, “Fully convolutional networks for semantic segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 640–651, Apr. 2017.
-  Y. Wang, M. Dong, J. Shen, Y. Wu, S. Cheng, and M. Pantic, “Dynamic face video segmentation via reinforcement learning,” in 2020 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
-  J. Lin, H. Yang, D. Chen, M. Zeng, F. Wen, and L. Yuan, “Face parsing with roi tanh-warping,” in 2019 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
-  F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” in ICLR, 2016.
-  J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
-  C. Zhang, S. Liu, X. Xu, and C. Zhu, “C3AE: Exploring the Limits of Compact Model for Age Estimation,” in 2019 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
-  H. Liu, J. Lu, J. Feng, and J. Zhou, “Ordinal Deep Feature Learning for Facial Age Estimation,” in 2017 12th IEEE International Conference on Automatic Face Gesture Recognition (FG 2017), May 2017, pp. 157–164.
-  Z. Deng, H. Liu, Y. Wang, C. Wang, Z. Yu, and X. Sun, “PML: Progressive Margin Loss for Long-tailed Age Classification,” in 2021 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
-  G. Antipov, M. Baccouche, S. Berrani, and J. Dugelay, “Apparent Age Estimation from Face Images Combining General and Children-Specialized Deep Learning Models,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Jun. 2016, pp. 801–809.
-  K. Zhang, C. Gao, L. Guo, M. Sun, X. Yuan, T. X. Han, Z. Zhao, and B. Li, “Age Group and Gender Estimation in the Wild With Deep RoR Architecture,” IEEE Access, vol. 5, pp. 22492–22503, 2017.
-  M. Mathias, R. Benenson, M. Pedersoli, and L. Van Gool, “Face detection without bells and whistles,” in Computer Vision – ECCV 2014, 2014, pp. 720–735.
-  S. Zhang, X. Zhu, Z. Lei, H. Shi, X. Wang, and S. Z. Li, “SFD: Single shot scale-invariant face detector,” in 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 192–201.
-  J. Yang, A. Bulat, and G. Tzimiropoulos, “FAN-Face: A simple orthogonal improvement to deep face recognition,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 7, pp. 12621–12628, Apr. 2020.
-  E. Schubert, J. Sander, M. Ester, H. P. Kriegel, and X. Xu, “DBSCAN revisited, revisited: Why and how you should (still) use DBSCAN,” ACM Trans. Database Syst., vol. 42, no. 3, Jul. 2017.
-  T.-Y. Yang, Y.-H. Huang, Y.-Y. Lin, P.-C. Hsiu, and Y.-Y. Chuang, “SSR-Net: A compact soft stagewise regression network for age estimation,” in Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18. International Joint Conferences on Artificial Intelligence Organization, 7 2018, pp. 1078–1084. [Online]. Available: https://doi.org/10.24963/ijcai.2018/150
-  J. Deng, J. Guo, E. Ververas, I. Kotsia, and S. Zafeiriou, “Retinaface: Single-shot multi-level face localisation in the wild,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
-  M. Angeloni, R. de Freitas Pereira, and H. Pedrini, “Age estimation from facial parts using compact multi-stream convolutional neural networks,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, Oct 2019.
-  Z. Liao, S. Petridis, and M. Pantic, “Local deep neural networks for age and gender classification,” arXiv preprint arXiv:1703.08497, 2017.
-  W. Pei, H. Dibeklioğlu, T. Baltrušaitis, and D. M. J. Tax, “Attended End-to-End Architecture for Age Estimation From Facial Expression Videos,” IEEE Transactions on Image Processing, vol. 29, pp. 1972–1984, 2020.
-  E. Agustsson, R. Timofte, and L. Van Gool, “Anchored Regression Networks Applied to Age Estimation and Super Resolution,” in 2017 IEEE International Conference on Computer Vision (ICCV), Oct. 2017, pp. 1652–1661.
-  W. Li, J. Lu, J. Feng, C. Xu, J. Zhou, and Q. Tian, “BridgeNet: A Continuity-Aware Probabilistic Network for Age Estimation,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2019, pp. 1145–1154.
-  X. Wen, B. Li, H. Guo, Z. Liu, G. Hu, M. Tang, and J. Wang, “Adaptive Variance Based Label Distribution Learning for Facial Age Estimation,” in Computer Vision – ECCV 2020, A. Vedaldi, H. Bischof, T. Brox, and J.-M. Frahm, Eds., vol. 12368, Cham, 2020, pp. 379–395.