Age and gender, two of the key facial attributes, play foundational roles in social interactions, making age and gender estimation from a single face image an important task in intelligent applications such as access control, human-computer interaction, law enforcement, marketing intelligence and visual surveillance.
Over the last decade, most methods used manually-designed features and statistical models [2, 3] to estimate age and gender [4, 5, 6, 7, 8, 9, 10], and they achieved respectable results on benchmarks of constrained images, such as FG-NET and MORPH. However, methods based on manually-designed features perform unsatisfactorily on recent benchmarks of unconstrained images, namely “in-the-wild” benchmarks, including Public Figures, Gallagher group photos, Adience and the apparent age data set LAP, because these features cope poorly with large variations in appearance, noise, pose and lighting.
Deep learning, especially deep Convolutional Neural Networks (CNN) [17, 18, 19, 20, 21, 22, 23, 24, 25, 26], has proven itself to be a strong competitor to the more sophisticated and highly tuned methods. Although unconstrained photographic conditions pose various challenges to age and gender prediction in the wild, CNNs still bring great improvements [28, 29, 30, 35, 1]. The optimization ability of neural networks is critical to the performance of age and gender estimation, yet existing CNNs designed for this task have only a few layers, which severely limits further progress. Therefore, we construct a very deep CNN, Residual networks of Residual networks (RoR), for age group and gender estimation in the wild. To begin with, we construct RoR with different residual block types, and analyze the effects of drop-path, dropout, maximum epoch number, residual block type and depth in order to promote the learning capability of the CNN. In addition, analysis of the characteristics of age estimation suggests two modest mechanisms, pre-training the CNN by gender and a weighted loss layer, to further increase the accuracy of age estimation, as shown in Fig. 1(a). Moreover, to further improve performance and alleviate the over-fitting problem on small-scale data sets, we first train the RoR model on ImageNet, then fine-tune it on the IMDB-WIKI-101 data set, and finally fine-tune it again on the Adience data set. Fig. 1(b) shows the pipeline of our framework. Finally, through extensive experiments, our RoR model achieves new state-of-the-art results on the Adience data set.
The remainder of the paper is organized as follows. Section II briefly reviews related work for age and gender estimation methods and deep convolutional neural networks. The proposed RoR age and gender estimation method and the two mechanisms are described in Section III. Experimental results and analysis are presented in Section IV, leading to conclusions in Section V.
II Related Work
II-A Age and gender estimation
In the past twenty years, human age and gender estimation from face images has benefited tremendously from the evolutionary development in facial analysis. Early methods for age estimation were based on geometric features, calculating ratios between different measurements of facial features. Geometric features can easily separate babies from adults but are unable to distinguish between adults and elderly people. Therefore, Active Appearance Model (AAM) based methods incorporated geometric and texture features to achieve the desired result. However, these pixel-based methods are not suitable for in-the-wild images, which have large variations in pose, illumination, expression, aging, cosmetics and occlusion. After 2007, most methods in this field used manually-designed features, such as Gabor, LBP, SFP, and BIF. Based on these manually-designed features, regression and classification methods are used to predict the age or gender of face images. SVM based methods [6, 15] are used for age group and gender classification. For regression, linear regression, SVR, PLS, and CCA are the most popular methods for accurate age prediction. However, all of these methods were only proven effective on constrained benchmarks, and could not achieve respectable results on the benchmarks in the wild [46, 15].
Recent research showed that a CNN model can learn a compact and discriminative feature representation when the training data is sufficiently large, so an increasing number of researchers have started to use CNNs for age and gender estimation. Yi et al. first proposed a CNN based age and gender estimation method, Multi-Scale CNN. Wang et al. extracted CNN features, and employed different regression and classification methods for age estimation on FG-NET and MORPH. Levi et al. used a CNN for age and gender classification on the unconstrained Adience benchmark. Ekmekji proposed a chained gender-age classification model by training age classifiers on each gender separately. With the development of deeper CNNs, Liu et al. addressed the apparent age estimation problem on the LAP data set by fusing two kinds of models, real-value based regression models and Gaussian label distribution based GoogLeNet. Antipov et al. improved the previous year’s results by fusing a general model and a children model on LAP. Huo et al. proposed a novel method called Deep Age Distribution Learning (DADL), which uses a deep CNN model to predict the age distribution. Hou et al. proposed a VGG-16-like model with Smooth Adaptive Activation Functions (SAAF) to predict age group on the Adience benchmark. They then used the exact squared Earth Mover’s Distance (EMD2) in the loss function for CNN training and obtained better age estimation results. VGG-16 architecture and SVR were used for age estimation on top of the CNN features. The Deep EXpectation (DEX) formulation was proposed for age estimation based on the VGG-16 architecture and a classification followed by an expected value formulation, and it obtained good results on the FG-NET, MORPH, Adience and LAP data sets. Iqbal et al. proposed a local face descriptor, Directional Age-Primitive Pattern (DAPP), which inherits discernible aging cue information and achieved higher accuracy on the Adience data set. Recently, Hou et al. used the R-SAAFc2+IMDB-WIKI method, and achieved state-of-the-art results on the Adience benchmark.
II-B Deep convolutional neural networks
It is widely acknowledged that the performance of CNN based age and gender estimation relies heavily on the optimization ability of the CNN architecture, and deeper and deeper CNNs have been constructed. From the 5-conv+3-fc AlexNet to the 16-conv+3-fc VGG networks and the 21-conv+1-fc GoogLeNet, then to thousand-layer ResNets, both the accuracy and the depth of CNNs have increased rapidly. With a dramatic rise in depth, residual networks (ResNets) achieved state-of-the-art performance on the ILSVRC 2015 classification, localization and detection tasks and the COCO detection and segmentation tasks. Then, to alleviate the vanishing gradient problem and further improve the performance of ResNets, Identity Mapping ResNets (Pre-ResNets) simplified the training of residual networks by using the BN-ReLU-conv order. Huang, Sun et al. proposed Stochastic Depth residual networks (SD), which randomly drop a subset of layers and bypass them with shortcut connections for every mini-batch to alleviate over-fitting and reduce the vanishing gradient problem. To exploit the optimization ability of the residual networks family, Zhang et al. proposed the Residual Networks of Residual Networks architecture (RoR), which adds shortcuts level by level on top of residual networks, and achieved state-of-the-art results at that time on low-resolution image data sets such as CIFAR-10, CIFAR-100 and SVHN. Instead of sharply increasing the feature map dimension, PyramidNet gradually increases the feature map dimension at all units and obtains superior generalization ability. DenseNet uses densely connected paths to concatenate the input features with the output features, enabling each micro-block to receive raw information from all previous micro-blocks. To enjoy the benefits of both the ResNets and DenseNet path topologies, Dual Path Network shares common features while maintaining the flexibility to explore new features through dual path architectures.
In this section, we describe the proposed RoR architecture with two modest mechanisms for age group and gender classification. Our methodology is essentially composed of four steps: constructing the RoR architecture to improve the optimization ability of the model; pre-training with gender and training with a weighted loss layer to promote the performance of age group classification; pre-training on ImageNet; and further fine-tuning on the IMDB-WIKI-101 data set to alleviate the over-fitting problem and improve the performance of age group and gender classification. In the following, we describe these components in detail.
III-A Network architecture
RoR is based on a hypothesis: the residual mapping of residual mapping is easier to optimize than the original residual mapping. To enhance the optimization ability of residual networks, RoR optimizes the residual mapping of residual mapping by adding shortcuts level by level on top of residual networks. Through experiments, Zhang et al. argued that the optimization ability of Pre-RoR is better than that of RoR with the same number of layers, so we use Pre-RoR in this paper, except when pre-training on ImageNet or IMDB-WIKI.
To train on the high-resolution Adience data set, we first construct RoR based on the basic Pre-ResNets, and denote this kind of RoR as Pre-RoR. Pre-ResNets include two types of residual block designs: the basic residual block and the bottleneck residual block. Fig. 2 shows the Pre-RoR with basic blocks, constructed from the original Pre-ResNets with basic blocks. The shortcuts in these original residual blocks are denoted as the final-level shortcuts. To start with, we add a shortcut above all basic blocks, and this shortcut is called the root shortcut or first-level shortcut. We use 64, 128, 256 and 512 filters sequentially in the convolutional layers, and each kind of filter has a different number of basic blocks, which form four basic block groups. Furthermore, we add a shortcut above each basic block group, and these four shortcuts are called second-level shortcuts. We can then continue adding shortcuts as inner-level shortcuts. Lastly, the shortcuts in the basic residual blocks are regarded as the final-level shortcuts. Let $m$ denote the shortcut level number. In this paper, we choose level number $m$=3 according to the analysis of Zhang et al., so the RoR has root-level, middle-level and final-level shortcuts, as shown in Fig. 2.
The junctions which are located at the end of each residual block group can be expressed by the following formulations.
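One way to write these junctions, assuming three shortcut levels over four block groups, is sketched below. All notation here is assumed for illustration: $x_l$ is the input of block $l$, $F$ the residual mapping with weights $W_l$, $g$ the first- and second-level shortcuts, $h$ the final-level shortcut, and $s_j$ and $e_j$ the first and last block of group $j$.

```latex
\begin{align}
x_{e_j+1} &= g(x_{s_j}) + h(x_{e_j}) + F(x_{e_j}, W_{e_j}), \quad j = 1, 2, 3, \\
x_{e_4+1} &= g(x_{1}) + g(x_{s_4}) + h(x_{e_4}) + F(x_{e_4}, W_{e_4}).
\end{align}
```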
where $x_l$ and $x_{l+1}$ are the input and output of the $l$-th block, $F$ is a residual mapping function, and $h(x_l)=x_l$ and $g(x_l)=x_l$ are both identity mapping functions. $g(x_l)$ expresses the identity mapping of the first-level and second-level shortcuts, and $h(x_l)$ denotes the identity mapping of the final-level shortcuts. The $g(x_l)$ function is a type B projection shortcut.
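The three shortcut levels can be illustrated with a toy forward pass. This is a minimal sketch rather than the authors' implementation: vectors are plain Python lists, every shortcut is an identity mapping (a real network needs projection or zero-padding shortcuts wherever feature-map sizes change), and the names `ror_forward`, `add` and `zero_residual` are hypothetical.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Residual = Callable[[Vector], Vector]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two equally sized vectors."""
    return [u + v for u, v in zip(a, b)]


def ror_forward(x: Vector, groups: Sequence[Sequence[Residual]]) -> Vector:
    """Run x through block groups wired with three levels of identity shortcuts."""
    root_input = x  # source of the root (first-level) shortcut
    for g_idx, blocks in enumerate(groups):
        group_input = x  # source of this group's second-level shortcut
        for b_idx, residual in enumerate(blocks):
            out = add(x, residual(x))  # final-level shortcut around one block
            if b_idx == len(blocks) - 1:
                out = add(out, group_input)  # junction at the end of the group
                if g_idx == len(groups) - 1:
                    out = add(out, root_input)  # root shortcut joins the last junction
            x = out
    return x


def zero_residual(v: Vector) -> Vector:
    """A residual branch that contributes nothing, which isolates the shortcuts."""
    return [0.0] * len(v)


# Four groups with 3, 4, 6 and 3 blocks, the 34-layer basic-block layout of Table II.
groups = [[zero_residual] * n for n in (3, 4, 6, 3)]
print(ror_forward([1.0, -2.0], groups))
```

With all residual branches set to zero, only the shortcut paths contribute, so the output shows how often the input is re-injected by the second-level and root shortcuts.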
For the bottleneck block, He et al. used a stack of three layers instead of two, which first reduce the dimensions and then increase them again. Basic and bottleneck blocks have similar time complexity, so deeper networks can be obtained easily with bottleneck blocks. In this paper, we also construct a Pre-RoR based on bottleneck Pre-ResNets. The architecture details of Pre-RoR with bottleneck blocks are shown in Fig. 3. We use $k$ to control the output dimensions of the blocks. He et al. chose $k$=4, with the result that the input and output planes of these shortcuts are very different. Since the zero-padding (Type A) shortcut brings more deviation and the projection (Type B) shortcut aggravates over-fitting, our RoR adopts $k$=4, $k$=2 and $k$=1 in this paper.
III-B Pre-training with gender
Like face recognition, age estimation can be easily affected by many intrinsic and extrinsic factors. Some of the most important factors include identity, gender and ethnicity, together with other factors like Pose, Illumination and Expression (PIE). We can alleviate the effects of these factors by using large data sets in the wild, but the existing data sets for age estimation are generally relatively small. To some extent, gender affects age judgments. On the one hand, the aging process of men differs slightly from that of women due to different longevity, hormones, skin thickness, etc. On the other hand, women are more likely to hide their real age by using makeup. So real-world age estimation is not exactly the same for men and women. Guo et al. and Ekmekji first manually separated the data set according to the gender labels, then trained an age estimator on each subset separately. Inspired by this, we first train the CNN on gender, then replace the gender prediction layer with an age prediction layer, and finally fine-tune the whole CNN.
III-C Training with weighted loss layer
General image classification and age estimation differ in several respects. Firstly, the classes in general image classification are uncorrelated, whereas age groups have a sequential relationship between labels. These interrelated age groups are more difficult to distinguish. Secondly, human aging processes vary across age ranges. For example, the aging processes of mid-life adults and children are not equivalent. In this paper, we analyze the law of human aging, and perform age estimation under its guidance. For humans, it is easier to tell which of two people is older than to determine their actual ages. Based on this characteristic and the age-ordered groups, we define $y_k$, $k$=1,2,…,$K$, where $K$ is the number of age group labels. Then for a given age group $y_k$, we separate the data set into two subsets $D_k^+$ and $D_k^-$ as follows:
$$D_k^+ = \{(x_i, +1) \mid y_i > y_k\}, \qquad D_k^- = \{(x_i, -1) \mid y_i \le y_k\},$$
where $x_i$ is the $i$-th face image and $y_i$ is its age group label.
Next, we use the two subsets to learn a binary classifier that can be considered as a query: “Is the face older than age group $y_k$?” There are eight classes (0-2, 4-6, 8-13, 15-20, 25-32, 38-43, 48-53, 60-) in the Adience data set, so we can choose $k$=1,2,…,7. By doing so, we get seven binary-class data sets, and the results of these binary classifiers form a human aging curve which represents the human aging process. We run experiments on folder0 of the Adience data set with the 4c2f CNN described in  (just using two classes instead of eight classes), and the aging curve is shown in Fig. 4. We find that the 4th, 5th and 6th results are smaller than the others. We conclude that the aging process of the younger and older age groups is faster than that of the intermediate age groups, so the intermediate age groups are harder to distinguish than the younger and older ones.
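The construction of the seven binary tasks can be sketched as follows. This is an illustrative sketch only: labels are assumed to be 1-based age-group indices, and the function names `split_by_threshold` and `binary_targets` are hypothetical.

```python
from typing import List, Tuple

# Adience age groups in ascending order, as listed in the text.
AGE_GROUPS = ["0-2", "4-6", "8-13", "15-20", "25-32", "38-43", "48-53", "60-"]


def split_by_threshold(labels: List[int], k: int) -> Tuple[List[int], List[int]]:
    """Return sample indices that are older than group k, and those that are not.

    labels holds 1-based age-group indices; k must lie in 1..K-1.
    """
    if not 1 <= k < len(AGE_GROUPS):
        raise ValueError("k must be between 1 and K-1")
    older = [i for i, y in enumerate(labels) if y > k]
    not_older = [i for i, y in enumerate(labels) if y <= k]
    return older, not_older


def binary_targets(labels: List[int], k: int) -> List[int]:
    """Targets for the query 'is the face older than group k?' (1 = yes, 0 = no)."""
    return [1 if y > k else 0 for y in labels]


labels = [1, 3, 5, 8, 4]
# One binary task per threshold k = 1..7.
tasks = {k: binary_targets(labels, k) for k in range(1, len(AGE_GROUPS))}
print(tasks[4])
```

Training one binary classifier per threshold and plotting its accuracy against k gives a curve of the kind described above.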
|Name||Loss Weight Distribution|
From the above analysis, we see that the 4th, 5th, 6th and 7th groups are more difficult to estimate, so we apply higher loss weights to these age groups. We therefore define four different loss weight distributions in search of the best result, as shown in Table I.
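A weighted loss of this kind can be sketched as a cross-entropy term scaled by a per-class weight. The weights below are hypothetical placeholders chosen only to emphasise groups 4 to 7; they are not the distributions of Table I, and the function names are assumptions.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one vector of class scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_cross_entropy(logits: Sequence[float], target: int,
                           class_weights: Sequence[float]) -> float:
    """Cross-entropy of one sample, scaled by the weight of its true class.

    target is a 0-based class index.
    """
    probs = softmax(logits)
    return -class_weights[target] * math.log(probs[target])


# Hypothetical weights (NOT from Table I): groups 4..7 count twice as much.
EXAMPLE_WEIGHTS = [1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 1.0]

uniform_logits = [0.0] * 8
print(weighted_cross_entropy(uniform_logits, 0, EXAMPLE_WEIGHTS))
print(weighted_cross_entropy(uniform_logits, 4, EXAMPLE_WEIGHTS))
```

For the same prediction quality, a mistake on a heavily weighted group contributes more to the loss, which pushes training to focus on the groups that are harder to separate.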
III-D Pre-training on ImageNet
Because the data sets for age and gender estimation are small, over-fitting occurs easily during training, so we first train the RoR network on the ImageNet data set to obtain a basic feature representation model. We then fine-tune the pre-trained RoR model on the Adience data set, so as to alleviate the over-fitting problem caused by training directly on Adience.
Previous experiments with RoR all used small-scale image data sets; in this paper we conduct the first experiments on a large-scale, high-resolution image data set, ImageNet. We evaluate our RoR method on the ImageNet 2012 classification data set, which contains 1.28 million high-resolution training images and 50,000 validation images with 1000 object categories. During training, we notice that RoR is slower than ResNets. So instead of training RoR from scratch, we use the ResNets models from  for pre-training. The weights from the pre-trained ResNets models remain unchanged, but the newly added weights are initialized as in . In addition, SD is not used here because SD makes it difficult for RoR to converge on ImageNet. We then replace the 1000-class prediction layer with an age or gender prediction layer, and fine-tune the whole RoR structure on Adience.
III-E Fine-tuning on IMDB-WIKI-101
To make the RoR model further learn the feature representation of facial images and also reduce the over-fitting problem, we use the large-scale face image data set IMDB-WIKI-101 to fine-tune the model after pre-training on ImageNet.
IMDB-WIKI is the largest publicly available data set for age estimation of people in the wild, containing more than half a million images with age labels ranging from 0 to 100. The images were crawled from IMDb and Wikipedia: the IMDb part contains 460,723 images of 20,284 celebrities, and the Wikipedia part contains 62,328 images. As the images are obtained directly from the websites, the IMDB-WIKI data set contains many low-quality images, such as comic images, sketches, faces with severe occlusion, full-body images, multi-person images, blank images, and so on. Example images are shown in Fig. 5. These bad images seriously affect network learning. Therefore, four people spent a week manually removing the low-quality images. In this cleaning process we mainly consider: a) bad images, which are not standard face images, and b) images with wrong age labels, especially images labeled as 0 to 10 years old. The cleaned data set retains 440,607 images. It is divided into 101 classes, one for each year of age, and we name it the IMDB-WIKI-101 data set.
Firstly, we replace the 1000-class ImageNet prediction layer with a 101-class prediction layer for age prediction, and fine-tune the RoR structure on IMDB-WIKI-101. For this fine-tuning, the IMDB-WIKI-101 data set is randomly divided into 90% for training and 10% for testing. Then we replace the 101-class prediction layer with an age or gender prediction layer, and fine-tune the whole RoR structure on Adience.
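The chain of prediction-layer replacements can be sketched as below. This is a toy stand-in, not the actual Torch 7 code: the `Classifier` container, the `new_head` and `replace_head` helpers, the tiny dimensions and the Gaussian initialisation are all assumptions for illustration, and the fine-tuning step between replacements is omitted.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Classifier:
    backbone: List[float]      # stand-in for all feature-extraction weights
    feature_dim: int           # size of the feature vector fed to the head
    head: List[List[float]]    # prediction layer, shape num_classes x feature_dim


def new_head(num_classes: int, feature_dim: int, seed: int = 0) -> List[List[float]]:
    """Create a freshly initialised prediction layer (small Gaussian weights)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.01) for _ in range(feature_dim)]
            for _ in range(num_classes)]


def replace_head(model: Classifier, num_classes: int) -> Classifier:
    """Keep the learned backbone, swap in a new prediction layer."""
    return Classifier(model.backbone, model.feature_dim,
                      new_head(num_classes, model.feature_dim))


# Stage 1: ImageNet (1000 classes); Stage 2: IMDB-WIKI-101 (101 age classes);
# Stage 3: Adience (8 age groups or 2 genders). Fine-tuning would follow each swap.
imagenet_model = Classifier(backbone=[0.0] * 10, feature_dim=4, head=new_head(1000, 4))
imdb_wiki_model = replace_head(imagenet_model, 101)
adience_age_model = replace_head(imdb_wiki_model, 8)
adience_gender_model = replace_head(imdb_wiki_model, 2)
print(len(imdb_wiki_model.head), len(adience_age_model.head), len(adience_gender_model.head))
```

Only the prediction layer is re-created at each stage, so the features learned on the larger data sets carry over to the smaller one.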
In this section, extensive experiments are conducted to demonstrate the effectiveness of the proposed RoR architecture, the two mechanisms, pre-training on ImageNet and further fine-tuning on the IMDB-WIKI-101 data set. The experiments are conducted on the unconstrained age group and gender data set Adience. Firstly, we introduce our experimental implementation. Secondly, we empirically demonstrate the effectiveness of the two mechanisms for age group classification. Thirdly, we analyze different Pre-RoR models for age group and gender classification. Fourthly, we improve the performance of age and gender estimation by pre-training RoR models on ImageNet. Furthermore, the RoR model is fine-tuned on the IMDB-WIKI-101 data set to learn the feature representation of face images. Finally, the results of our best models are compared with several state-of-the-art approaches.
For the Adience data set, we run experiments using 4c2f-CNN, VGG, Pre-ResNets and our Pre-RoR architectures, respectively.
4c2f-CNN: The CNN structure described in  is used as the baseline for the experiments with the two mechanisms. Compared to the original 4c2f-CNN, our baseline adds data preprocessing by subtracting the mean and dividing by the standard deviation.
VGG: We use VGG-16 to construct age group and gender classifiers.
Pre-ResNets: We use Pre-ResNets-34, Pre-ResNets-50 and Pre-ResNets-101 as the basic architectures.
Pre-RoR: We use the basic block and bottleneck block Pre-ResNets to construct the RoR architecture. The original Pre-ResNets contain four groups of residual blocks (64, 128, 256 and 512 filters), with feature map sizes of 56, 28, 14 and 7, respectively. Pre-RoR with basic blocks includes Pre-RoR-34 (34 layers), Pre-RoR-58 (58 layers) and Pre-RoR-82 (82 layers). Pre-RoR with bottleneck blocks includes Pre-RoR-50 (50 layers) and Pre-RoR-101 (101 layers). Each residual block group in the different Pre-RoR models has a different number of residual blocks, as shown in Table II. Pre-RoR contains four middle-level residual blocks (every middle-level residual block contains several final-level residual blocks) and one root-level residual block (which contains the four middle-level residual blocks). We adopt the BN-ReLU-conv order, as shown in Fig. 2 and Fig. 3.
|Block Type||Number of Layers||Number of blocks in each Group|
|Basic Block||34||3, 4, 6, 3|
|Basic Block||58||5, 6, 12, 5|
|Basic Block||82||7, 8, 14, 7|
|Bottleneck Block||50||3, 4, 6, 3|
|Bottleneck Block||101||3, 4, 23, 3|
Our implementations are based on Torch 7 with one Nvidia Geforce Titan X. We initialize the weights as in . We use SGD with a mini-batch size of 64 for these architectures, except for Pre-RoR with bottleneck blocks, where we use a mini-batch size of 32. The total number of epochs is 164. The learning rate starts from 0.1, and is divided by a factor of 10 after epochs 80 and 122. We use a weight decay of 1e-4, momentum of 0.9, and Nesterov momentum with 0 dampening. For the stochastic depth drop-path method, we set $p_l$ with the linear decay rule of $p_0$ = 1 and $p_L$ = 0.5.
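The step learning-rate schedule and the linear decay rule of stochastic depth can be written out as two small functions. This is a sketch under stated assumptions: epochs are counted from 1 and a drop takes effect in the epoch after each milestone, the decay rule is the usual linear form $1 - \frac{l}{L}(1 - p_L)$ from the stochastic depth literature, and both function names are hypothetical.

```python
from typing import Sequence


def learning_rate(epoch: int, base_lr: float = 0.1,
                  milestones: Sequence[int] = (80, 122),
                  factor: float = 10.0) -> float:
    """Step schedule: divide the rate by `factor` after each milestone epoch."""
    drops = sum(1 for m in milestones if epoch > m)
    return base_lr / (factor ** drops)


def survival_probability(l: int, num_blocks: int, p_last: float = 0.5) -> float:
    """Linear decay rule: probability 1 at the input (l = 0), p_last at the last block."""
    return 1.0 - (l / num_blocks) * (1.0 - p_last)


for epoch in (1, 80, 81, 122, 123, 164):
    print(epoch, learning_rate(epoch))

# Survival probabilities for a network with 16 residual blocks (3+4+6+3).
print([round(survival_probability(l, 16), 3) for l in (1, 8, 16)])
```

Under this rule, early blocks are almost always kept while the deepest block is dropped in about half of the mini-batches.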
The entire Adience collection includes 26,580 256×256 color facial images of 2,284 subjects, with eight classes of age groups and two classes of gender. Testing for both age and gender classification is performed using a standard five-fold, subject-exclusive cross-validation protocol, defined in . We use the in-plane aligned version of the faces, originally used in . For data augmentation, VGG, Pre-ResNets and Pre-RoR use scale and aspect ratio augmentation instead of the scale augmentation used in 4c2f-CNN.
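The idea of a subject-exclusive split can be sketched as follows. Adience ships with its own predefined folds, so this is only an illustration of the constraint that no subject appears in more than one fold; the greedy balancing strategy and the function name `subject_exclusive_folds` are assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def subject_exclusive_folds(subject_ids: Sequence[str], n_folds: int = 5,
                            seed: int = 0) -> List[List[int]]:
    """Assign image indices to folds so that each subject lives in exactly one fold."""
    by_subject: Dict[str, List[int]] = defaultdict(list)
    for idx, sid in enumerate(subject_ids):
        by_subject[sid].append(idx)

    subjects = sorted(by_subject)
    random.Random(seed).shuffle(subjects)

    folds: List[List[int]] = [[] for _ in range(n_folds)]
    for sid in subjects:
        # Greedy balancing: the next subject goes to the currently smallest fold.
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(by_subject[sid])
    return folds


# Toy data: 40 images belonging to 7 subjects.
ids = [f"s{i % 7}" for i in range(40)]
folds = subject_exclusive_folds(ids)
print([len(f) for f in folds])
```

In cross-validation, each fold serves once as the test set while the remaining four are used for training, and the reported figure is the mean over the five runs.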
IV-B Effectiveness of two mechanisms
In this section, we run age group classification experiments on folder0 of the Adience data set with the two mechanisms based on the 4c2f-CNN architecture, and the results are shown in Fig. 6. Here, we report the exact accuracy (correct age group predicted) and 1-off accuracy (correct or adjacent age group predicted) as .
First, we use 4c2f-CNN with each mechanism individually. In Fig. 6, 4c2f-CNN pre-trained by gender (4c2f-CNN-pt) achieves a clear improvement over 4c2f-CNN without pre-training. Fig. 6 also shows that 4c2f-CNN with loss weight distribution LW3 (4c2f-CNN-LW3) achieves the best performance among all the loss weight distributions on folder0 of the Adience data set, so we choose LW3 as the loss weight distribution in the following experiments. Finally, we combine the two mechanisms to predict age group, and Fig. 6 shows that 4c2f-CNN combining pre-training by gender with loss weight distribution LW3 (4c2f-CNN-pt-LW3) achieves better performance than the other models. These experiments demonstrate the effectiveness of pre-training by gender and the weighted loss layer for promoting the performance of age group classification.
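The two metrics can be computed as in the following sketch, which assumes age groups are encoded as consecutive integer indices so that adjacent groups differ by one; the function name `exact_and_one_off` is hypothetical.

```python
from typing import Sequence, Tuple


def exact_and_one_off(predicted: Sequence[int], actual: Sequence[int]) -> Tuple[float, float]:
    """Return (exact accuracy, 1-off accuracy) for integer age-group indices."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(actual)
    exact = sum(1 for p, a in zip(predicted, actual) if p == a)
    # 1-off counts a prediction as correct if it hits the true group or a neighbour.
    one_off = sum(1 for p, a in zip(predicted, actual) if abs(p - a) <= 1)
    return exact / n, one_off / n


predicted = [0, 1, 2, 7, 4]
actual = [0, 2, 4, 7, 5]
print(exact_and_one_off(predicted, actual))
```

By construction the 1-off accuracy is never lower than the exact accuracy.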
IV-C Age group and gender classification by Pre-RoR
To find the best Pre-RoR model on the Adience data set, we run a series of comparative experiments with folder0 validation, evaluating the effect of SD, dropout, shortcut type, block type, maximum epoch number and depth on the age estimation results.
|Method||Age Exact Accuracy(%)||Age 1-off(%)||Gender Accuracy(%)|
|Pre-ResNets-34 (Type B)||58.81||88.31||90.23|
|Pre-ResNets-34+SD (Type B)||59.56||90.43||89.91|
|Pre-RoR-34+SD (Type B)||60.21||91.14||90.72|
|Pre-RoR-34+SD+dropout (Type B)||59.87||88.68||90.32|
|Pre-RoR-34+SD (Type A+B)||61.56||91.59||90.78|
|Pre-RoR-34+SD (Type A+B) 300 epochs||61.52||91.56||90.84|
|Pre-RoR-58+SD (Type A+B)||62.48||92.31||90.85|
|Pre-RoR-82+SD (Type A+B)||61.78||92.15||90.87|
Firstly, basic blocks are used in the experiments, and the results of different architectures are shown in Table III. We run experiments with Pre-ResNets-34 (34 convolutional layers) with and without SD. Because the Adience data set has only 26,580 high-resolution images, over-fitting is a critical problem. In Table III, the performance of Pre-ResNets-34 with SD is better than that without SD, which means SD alleviates the effect of over-fitting. We then use Pre-RoR-34+SD to estimate age and gender. Pre-RoR-34+SD outperforms Pre-ResNets-34+SD, because RoR can promote the learning capability of residual networks. To further reduce over-fitting, we try dropout between convolutional layers in residual blocks, but the result of Pre-RoR-34+SD+dropout shows that dropout in RoR does not make a big difference. This is consistent with WRN. Zhang et al. noted that extra parameters would escalate over-fitting and that zero-padding (Type A) would bring more deviation, so shortcut Type A should be used in the final level and Type B should be used in the other levels (called Type A+B). Table III shows that Pre-RoR-34+SD with Type A+B performs better than Pre-RoR-34+SD with Type B in all levels. Fig. 7 shows the test errors of Pre-ResNets-34, Pre-ResNets-34+SD and Pre-RoR-34+SD (Type A+B) at different training epochs with folder0 validation. Zhang et al. showed that a maximum epoch number of 500 is necessary to optimize RoR on CIFAR-10 and CIFAR-100, but the results of Pre-RoR-34+SD with 300 epochs show that a maximum epoch number of 164 is enough for the Adience data set. Generally, ResNets and RoR can improve performance by increasing depth. We estimate age and gender with Pre-RoR-58+SD and Pre-RoR-82+SD. The age estimation result of Pre-RoR-58+SD is better than that of Pre-RoR-34+SD, but Pre-RoR-82+SD is worse than Pre-RoR-58+SD, which is caused by degradation. Gender estimation improves as more layers are added, since degradation is less critical for binary classification.
Secondly, we use bottleneck blocks instead of basic blocks, and the results of different architectures are shown in Table IV and Table V. We run experiments with Pre-ResNets-50+SD (Type B, $k$=4) and Pre-RoR-50+SD (Type A+B, $k$=4), where $k$ is the factor that controls the output dimensions of the blocks. As can be observed, the exact age accuracy of Pre-RoR-50+SD (Type A+B, $k$=4) is worse than that of Pre-ResNets-50+SD (Type B, $k$=4). When we use Type A in the final level, the input and output planes of these shortcuts are very different, so the zero-padding (Type A) brings more deviation. We therefore reduce the output dimensions by using $k$=2 and $k$=1. The results of Pre-RoR-50+SD (Type A+B, $k$=2) and Pre-RoR-50+SD (Type A+B, $k$=1) show that the deviation problem is largely alleviated by reducing dimensions. The performance of Pre-RoR-50+SD (Type A+B, $k$=2) is better than that of Pre-RoR-50+SD (Type A+B, $k$=1), because reducing dimensions also reduces the number of parameters and the optimizing ability of the network. Pre-RoR-50+SD (Type A+B, $k$=2) balances the deviation and over-fitting problems, but it cannot catch up with Pre-RoR with basic blocks because of these two problems.
|Method||Age Exact Acc(%)||Age 1-off(%)||Gender Acc(%)|
|Pre-ResNets-50+SD (Type B) $k$=4||60.05||88.98||89.82|
|Pre-RoR-50+SD (Type A+B) $k$=4||58.62||90.10||88.71|
|Pre-RoR-50+SD (Type A+B) $k$=2||61.68||91.63||88.92|
|Pre-RoR-50+SD (Type A+B) $k$=1||61.12||91.14||90.03|
We repeat the same experiments with the depth increased to 101 convolutional layers. The results in Table V are similar to those of the networks with 50 convolutional layers in Table IV. Pre-RoR-101+SD (Type A+B, $k$=2) achieves the best exact age accuracy, and also outperforms Pre-RoR-50+SD (Type A+B, $k$=2).
|Method||Age Exact Acc(%)||Age 1-off(%)||Gender Acc(%)|
|Pre-ResNets-101+SD (Type B) $k$=4||59.16||89.61||89.12|
|Pre-RoR-101+SD (Type A+B) $k$=4||60.46||90.95||88.37|
|Pre-RoR-101+SD (Type A+B) $k$=2||62.26||91.54||89.15|
|Pre-RoR-101+SD (Type A+B) $k$=1||60.49||91.14||89.41|
In the above experiments, we use only one folder to analyze different network architectures. We now demonstrate the generality of our method using the standard five-fold, subject-exclusive cross-validation protocol. In the following experiments, we only use Type A+B for Pre-RoR+SD. The age cross-validation results of Pre-RoR+SD (Type A+B) with different block types and depths are shown in Table VI, and they are similar to the results with folder0 validation. The performance of Pre-RoR+SD with basic blocks is better than that of Pre-RoR+SD with bottleneck blocks. We attribute this to the deviation caused by zero-padding. Our Pre-RoR-58+SD achieves the best performance, outperforming 4c2f-CNN by 18.8% and 5.7% on exact and 1-off accuracy on the Adience data set.
IV-D Age group and gender classification by Pre-training on ImageNet
Because we cannot find well-trained Pre-ResNets on the web, we construct RoR based on the well-trained ResNets from  for ImageNet. The well-trained ResNets from  use Type B in the residual blocks, so we use Type B in all levels of RoR. We use SGD with a mini-batch size of 128 (18 layers and 34 layers), 64 (101 layers) or 48 (152 layers) for 10 epochs to fine-tune RoR. The learning rate starts from 0.001 and is divided by a factor of 10 after epoch 5. For data augmentation, we use scale and aspect ratio augmentation. Both Top-1 and Top-5 error rates with 10-crop testing are evaluated. As shown in Table VII, our implementation of residual networks achieves the best performance compared to ResNets methods for single-model evaluation on the validation data set. These experiments verify the effectiveness of RoR on ImageNet.
|Method||Top-1 Error||Top-5 Error|
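Top-1 and Top-5 error with multi-crop testing are commonly obtained by averaging the class scores over the crops of each image and then checking whether the true label is among the k highest-scoring classes. The sketch below assumes that convention; the function names and the toy scores are hypothetical.

```python
from typing import List, Sequence


def average_scores(crop_scores: Sequence[Sequence[float]]) -> List[float]:
    """Average the class scores over all crops of one image."""
    n = len(crop_scores)
    return [sum(column) / n for column in zip(*crop_scores)]


def top_k_error(scores_per_image: Sequence[Sequence[float]],
                labels: Sequence[int], k: int) -> float:
    """Fraction of images whose true label is not among the k best-scoring classes."""
    errors = 0
    for scores, label in zip(scores_per_image, labels):
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if label not in ranked[:k]:
            errors += 1
    return errors / len(labels)


# Two toy images with three classes; the first image has two crops.
image_scores = [
    average_scores([[0.7, 0.2, 0.1], [0.3, 0.5, 0.2]]),
    average_scores([[0.1, 0.2, 0.7]]),
]
labels = [1, 2]
print(top_k_error(image_scores, labels, k=1), top_k_error(image_scores, labels, k=2))
```

In 10-crop testing, each image would contribute ten score vectors to `average_scores` instead of the one or two used here.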
When we fine-tune the pre-trained RoR model on Adience, we replace the 1000-class prediction layer with an age or gender prediction layer. We use SGD with a mini-batch size of 64 for 120 epochs to fine-tune on Adience. The learning rate starts from 0.01 and is divided by a factor of 10 after epoch 80. Based on the analysis in the previous section, a deep Pre-RoR may outperform a very deep Pre-RoR, so we use RoR-34 instead of a deeper RoR as the basic pre-trained model. The results of different methods are shown in Table VIII. We run experiments with ResNets-34 and RoR-34. The results of ResNets-34 and RoR-34 with pre-training on ImageNet are better than those without it, because pre-training on ImageNet reduces the over-fitting problem. Adding the SD method in these experiments improves the performance further. In particular, RoR-34+SD with pre-training on ImageNet achieves very competitive performance and outperforms Pre-RoR-34+SD. These experiments verify the effectiveness of pre-training on ImageNet for age group and gender classification.
|Method||Age Exact Acc(%)||Age 1-off(%)||Gender Acc(%)|
|ResNets-34 by Pre-training on ImageNet||61.15 ± 4.53||92.90 ± 1.98||91.18 ± 1.53|
|ResNets-34+SD by Pre-training on ImageNet||61.47 ± 5.17||93.39 ± 1.95||91.98 ± 1.49|
|RoR-34 by Pre-training on ImageNet||61.73 ± 4.31||92.97 ± 1.55||91.96 ± 1.53|
|RoR-34+SD by Pre-training on ImageNet||62.34 ± 4.53||93.64 ± 1.47||92.43 ± 1.51|
Iv-E Age group and gender classification by fine-tuning on IMDB-WIKI-101
As the amount of training data strongly affects the accuracy of the trained models, there is a greater need for large datasets. Thus, we use IMDB-WIKI-101 to further fine-tune the RoR model. After pre-training on the ImageNet, we further fine-tune the RoR model on the IMDB-WIKI-101. The epoch is set to 120. The learning rate starts from 0.01 and is divided by a factor of 10 after epoch 60 and 90. When we use fine-tuned RoR model to fine-tune on Adience, we replace the 101 classes prediction layer with age or gender prediction layer. The epoch is set to 60. The learning rate is set to 0.0001.
As shown in Table IX, with the IMDB-WIKI-101 data set fine-tuning, both the performances of ResNets-34 and RoR-34 model have been significantly improved. This shows that having a large data set with face age images results in better performance. The performance of RoR-34 fine-tuning on the IMDB-WIKI-101 data set reaches the age exact accuracy of 66.74%(1-off 97.38%) compared to 60.29% (1-off 92.44%) when training directly on the Adience data set. That is competitive performance on Adience data set for age group and gender classification in the wild.
When we use only the ImageNet data set to pre-train the RoR-34 model, the age estimation results on Adience are better with the stochastic depth algorithm than without it. However, when we first pre-train RoR-34 on ImageNet and then fine-tune it on the IMDB-WIKI-101 data set, the age estimation results on Adience are worse with stochastic depth than without it. We attribute this to the nature of the two data sets. ImageNet is an object image data set, so the network learns feature representations of general objects, and adding stochastic depth to the original network still helps. IMDB-WIKI-101, in contrast, is a large-scale face image data set from which RoR-34 can fully learn the characteristics of face images, which already reduces over-fitting. Adding stochastic depth then changes the original structure of the network, so the network has to relearn the parameters that capture facial characteristics. This is why the results with SD are not better than those without SD.
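To make concrete how stochastic depth alters the network during training, the following sketch applies the general technique to a toy scalar "feature". It is not the configuration used in these experiments: the linear decay rule, the final survival probability of 0.5, and all function names are assumptions chosen for illustration.

```python
import random


def survival_probabilities(num_blocks, final_survival=0.5):
    """Linear decay: block l of L survives with probability 1 - (l / L) * (1 - p_L)."""
    return [1.0 - (l / num_blocks) * (1.0 - final_survival)
            for l in range(1, num_blocks + 1)]


def residual_block(x, branch, survival, training, rng):
    """Apply one residual block with stochastic depth to a scalar feature x."""
    if training:
        # Keep the residual branch with probability `survival`; otherwise
        # the block reduces to its identity shortcut.
        return x + branch(x) if rng.random() < survival else x
    # At test time every block is active, with its branch scaled by the
    # survival probability to match the training-time expectation.
    return x + survival * branch(x)


# Toy usage with four blocks; the branch function is illustrative only.
probs = survival_probabilities(4)
rng = random.Random(0)
branch = lambda v: 0.1 * v

x_train = 1.0
for p in probs:
    x_train = residual_block(x_train, branch, p, training=True, rng=rng)

x_test = 1.0
for p in probs:
    x_test = residual_block(x_test, branch, p, training=False, rng=rng)
```

In training mode the effective depth changes from one mini-batch to the next, which is the structural change referred to above.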
|Method|Age Exact Acc (%)|Age 1-off (%)|Gender Acc (%)|
|---|---|---|---|
|RoR-34+IMDB-WIKI+SD|66.42 ± 2.64|97.35 ± 0.65|92.90 ± 1.76|
IV-F Comparisons with state-of-the-art results of age group and gender classification on Adience
To begin with, we use the 4c2f-CNN, VGG-16, Pre-ResNets, our RoR+SD pre-trained on ImageNet, and Pre-RoR+SD architectures to estimate gender. In addition, we use the IMDB-WIKI-101 data set to fine-tune ResNets-34 and RoR-34 for gender estimation. The gender cross-validation results of the different methods are shown in Table X. RoR-34+SD achieves a competitive accuracy of 92.43% with pre-training on ImageNet only, and RoR-34+IMDB-WIKI achieves the best accuracy of 93.24%, which outperforms 4c2f-CNN by 6.44 percentage points.
|RoR-34+SD by Pre-training on ImageNet|92.43 ± 1.51|
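The "mean ± deviation" entries in these tables summarize cross-validation results. A minimal sketch of how such an entry can be produced from per-fold accuracies is given below. It assumes five folds and the sample standard deviation; whether the reported numbers use the sample or the population form is not stated here, and the fold values are made up for illustration.

```python
import statistics


def summarize_folds(fold_accuracies):
    """Format per-fold accuracies (in percent) as 'mean ± std'."""
    mean = statistics.mean(fold_accuracies)
    # Sample standard deviation (n - 1 in the denominator); an assumption.
    std = statistics.stdev(fold_accuracies)
    return f"{mean:.2f} ± {std:.2f}"


# Hypothetical per-fold gender accuracies, NOT taken from the paper.
folds = [90.0, 92.0, 94.0, 91.0, 93.0]
print(summarize_folds(folds))  # 92.00 ± 1.58
```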
Then, we use the 4c2f-CNN, VGG-16, Pre-ResNets, our RoR-34+SD pre-trained on ImageNet, and Pre-RoR-58+SD (Type A+B) architectures with the two mechanisms to estimate age. Furthermore, we fine-tune ResNets-34 and RoR-34 on the IMDB-WIKI-101 data set and then apply the two mechanisms for age estimation on Adience. Table XI compares the state-of-the-art methods for age group classification on the Adience data set. Accuracy increases when the network is fine-tuned on the large-scale face image data set, and the two mechanisms further improve each architecture, which demonstrates their versatility across models. Fig. 8 shows the test errors of Pre-RoR-58+SD and Pre-RoR-58+SD with the two mechanisms at different training epochs, validated on fold 0. In addition, we notice that RoR-34+IMDB-WIKI with the two mechanisms is only slightly better than RoR-34+IMDB-WIKI without them. We argue that the margin is small because the model is already well trained on IMDB-WIKI.
|Chained gender-age CNN|54.5|84.1|
|DEX w/o IMDB-WIKI pretrain|55.6 ± 6.1|89.7 ± 1.8|
|DEX w/ IMDB-WIKI pretrain|64.0 ± 4.2|96.60 ± 0.90|
|4c2f-CNN with two mechanisms|53.96 ± 3.80|90.04 ± 1.54|
|VGG-16 with two mechanisms|56.11 ± 5.05|90.66 ± 2.14|
|Pre-ResNets-34 with two mechanisms|61.89 ± 4.16|93.50 ± 1.33|
|Pre-RoR-58+SD with two mechanisms|64.17 ± 3.81|95.77 ± 1.24|
|RoR-34+SD by Pre-training on ImageNet|62.34 ± 4.53|93.64 ± 1.47|
|RoR-34+SD by Pre-training on ImageNet with two mechanisms|63.76 ± 4.18|94.92 ± 1.42|
|RoR-34+IMDB-WIKI with two mechanisms|66.91 ± 2.51|97.49 ± 0.76|
|RoR-152+IMDB-WIKI with two mechanisms|67.34 ± 3.56|97.51 ± 0.67|
As shown in Table XI, without using the ImageNet or IMDB-WIKI-101 data sets, Pre-RoR-58+SD with the two mechanisms reaches an accuracy higher than the 64.0 ± 4.2% of DEX, which was pre-trained on ImageNet and IMDB-WIKI (523,051 face images). Although DEX achieves competitive results, it needs the very large IMDB-WIKI data set for pre-training, whereas our method can learn age and gender representations from scratch without IMDB-WIKI and still achieve better performance. Our VGG-16 with the two mechanisms also outperforms the DEX variant (also based on VGG-16) that was pre-trained on ImageNet but not on IMDB-WIKI. These results demonstrate that our method improves the optimization ability of the networks and alleviates over-fitting on the Adience data set. Moreover, RoR-34+SD pre-trained on ImageNet with the two mechanisms achieves an accuracy of 63.76 ± 4.18%, which is very close to the accuracy in , so we have reason to believe that better performance can be achieved by pre-training on additional data sets. In particular, our RoR-34+IMDB-WIKI with the two mechanisms obtains a single-model accuracy of 66.91 ± 2.51% and a 1-off accuracy of 97.49 ± 0.76% on Adience. This single-model accuracy is slightly lower than the accuracy in , because RoR-34 is small compared with the VGG used in . We therefore repeat the experiments with RoR-152+IMDB-WIKI and obtain, to the best of our knowledge, new state-of-the-art performance (a single-model accuracy of 67.34 ± 3.56%).
This paper proposes a new Residual networks of Residual networks (RoR) architecture for age group and gender classification of high-resolution facial images in the wild. Two modest mechanisms, pre-training by gender and training with a weighted loss layer, are used to improve the performance of age estimation. Pre-training on ImageNet alleviates over-fitting, and further fine-tuning on IMDB-WIKI-101 lets the network learn the features of face images. With RoR or Pre-RoR and the two mechanisms, we obtain new state-of-the-art performance on the Adience data set for age group and gender classification in the wild. Through empirical studies, this work not only significantly advances age group and gender classification performance, but also explores how RoR can be applied to large-scale, high-resolution image classification in the future.
The authors would like to thank the editor and the anonymous reviewers for their careful reading and valuable remarks.
-  R. Rothe, R. Timofte, and L. Gool, “Deep expectation of real and apparent age from a single image without facial landmarks,” International Journal of Computer Vision, 2016.
-  Z. Ma, and A. Leijon, “Bayesian estimation of beta mixture models with variational inference,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 33, no. 11, pp. 2160–2173, Nov. 2011.
-  Z. Ma, A. Teschendorff, A. Leijon, Y. Qiao, H. Zhang, and J. Guo, “Variational bayesian matrix factorization for bounded support data,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 37, no. 4, pp. 876–889, Apr. 2015.
-  F. Gao, and H. Ai, “Age classification on consumer images with gabor feature and fuzzy lda method,” in Proc. International Conference on Biometrics, 2009, pp. 132–141.
-  S. Yan, M. Liu and T. Huang, “Extracting age information from local spatially flexible patches,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, 2008, pp. 737–740.
-  G. Guo, G. Mu, Y. Fu and T. Huang, “Human age estimation using bio-inspired features,” in Proc. CVPR, 2009, pp. 112–119.
-  Y. Fu, and T. Huang, “Human age estimation with regression on discriminative aging manifold,” IEEE Transactions on Multimedia, vol. 10, no. 4, pp. 578–584, Apr. 2008.
-  G. Guo, Y. Fu, C. Dyer and T. Huang, “Image-based human age estimation by manifold learning and locally adjusted robust regression,” IEEE Transactions on Image Processing, vol. 17, no. 7, pp. 1178–1188, Jul. 2008.
-  G. Guo and G. Mu, “Simultaneous dimensionality reduction and human age estimation via kernel partial least squares regression,” in Proc. CVPR, 2011, pp. 657–664.
-  G. Guo and G. Mu, “Joint estimation of age, gender and ethnicity: CCA vs. PLS,” in Proc. IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, 2013, pp. 1–6.
-  A. Lanitis, C. Draganova, and C. Christodoulou, “Comparing different classifiers for automatic age estimation,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 34, no. 1, pp. 621–628, Jan. 2004.
-  K. Ricanek, and T. Tesafaye, “Morph: a longitudinal image database of normal adult age-progression,” in Proc. International Conference on Automatic Face and Gesture Recognition, 2006, pp. 341–345.
-  N. Kumar, A. Berg, P. Belhumeur, and S. Nayar, “Attribute and simile classifiers for face verification,” in Proc. ICCV, 2009, pp. 365–372.
-  A. Gallagher, and T. Chen, “Understanding images of groups of people,” in Proc. CVPR, 2009, pp. 256–263.
-  E. Eidinger, R. Enbar, and T. Hassner, “Age and gender estimation of unfiltered faces,” IEEE Transactions on Information Forensics and Security, vol. 9, no. 12, pp. 2170–2179, Dec. 2014.
-  S. Escalera, J. Gonzalez, X. Baro, and P. Pardo, “ChaLearn looking at people 2015 new competitions: Age estimation and cultural event recognition,” International Joint Conference on Neural Networks. IEEE, 2015, pp. 1–8.
-  A. Krizhevsky, I. Sutskever, and G. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
-  W. Y. Zou, X. Y. Wang, M. Sun, and Y. Lin, “Generic object detection with dense neural patterns and regional,” arXiv preprint arXiv:1404.4316, 2014.
-  M. Lin, Q. Chen, and S. Yan, “Network in network,” arXiv preprint arXiv:1312.4400, 2013.
-  P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “Overfeat: Integrated recognition, localization and detection using convolutional networks,” arXiv preprint arXiv:1312.6229, 2013.
-  K. Simonyan, and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
-  A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, “Fitnets: hints for thin deep nets,” arXiv preprint arXiv:1412.6550, 2014.
-  C. -Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu, “Deeply-supervised nets,” in Proc. AISTATS, 2015, pp. 562–570.
-  J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, “Striving for simplicity: The all convolutional net,” arXiv preprint arXiv:1412.6806, 2014.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2015, pp. 1–9.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” arXiv preprint arXiv:1512.03385, 2015.
-  A. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “CNN features off-the-shelf: an astounding baseline for recognition,” arXiv preprint arXiv:1403.6382, 2014.
-  D. Yi, Z. Lei, and S. Li, “Age estimation by multi-scale convolutional network,” in Proc. ACCV, 2014, pp. 144–158.
-  X. Wang, R. Guo, and C. Kambhamettu, “Deeply-Learned feature for age estimation,” in Proc. IEEE Winter Conference on Applications of Computer Vision, 2015, pp. 534–541.
-  G. Levi, and T. Hassner, “Age and gender classification using convolutional neural networks,” in Proc. CVPR Workshop, 2015, pp. 34–42.
-  A. Ekmekji, “Convolutional neural networks for age and gender classification,” Research report, 2016.
-  X. Liu, S. Li, M. Kan, et al, “AgeNet: deeply learned regressor and classifier for robust apparent age estimation,” in Proc. ICCV Workshop, 2015, pp. 16–24.
-  G. Antipov, M. Baccouche, S. Berrani, et al, “Apparent age estimation from face images combining general and children-specialized deep learning models,” in Proc. CVPR Workshop, 2016, pp. 96–104.
-  Z. Huo, X. Zhang, C. Xing, et al, “Deep age distribution learning for apparent age estimation,” in Proc. CVPR Workshop, 2016, pp. 17–24.
-  L. Hou, D. Samaras, T. Kurc, Y. Gao and J. Saltz, “Neural networks with smooth adaptive activation functions for regression,” arXiv preprint arXiv:1608.06557, 2016.
-  L. Hou, C.P. Yu, D. Samaras, “Squared Earth Mover’s Distance-based Loss for training deep neural networks,” arXiv preprint arXiv:1611.05916, 2016.
-  R. Rothe, R. Timofte, and L. Gool, “Some like it hot-visual guidance for preference prediction,” arXiv preprint arXiv:1510.07867, 2015.
-  M. Iqbal, M. Shoyaib, B. Ryu, et al, “Directional age-primitive pattern (DAPP) for human age group recognition and age estimation,” IEEE Transactions on Information Forensics and Security, accepted. 2017.
-  L. Hou, D. Samaras, T. Kurc, Y. Gao, J. Saltz, “ConvNets with Smooth Adaptive Activation Functions for Regression,” in Proc. International Conference on Artificial Intelligence and Statistics, 2017, pp. 430–439.
-  D. Han, J. Kim, J. Kim, “Deep pyramidal residual networks,” in Proc. CVPR, 2017.
-  G. Huang, Z. Liu, K. Weinberger, and L. Maaten, “Densely connected convolutional networks,” in Proc. CVPR, 2017.
-  Y. Chen, J. Li, H. Xiao, et al, “Dual Path Networks,” in Proc. CVPR, 2017.
-  K. Zhang, M. Sun, T. Han, X. Yuan, L. Guo, and T. Liu, “Residual networks of residual networks: multilevel residual networks,” IEEE Transactions on Circuits and Systems for Video Technology, accepted. 2017.
-  Y. Kwon, and N. Lobo, “Age classification from facial images,” in Proc. CVPR, 1994, pp. 762–767.
-  A. Gunay, and V. Nabiyev, “Automatic age classification with LBP,” in Proc. International Symposium on Computer and Information Sciences, 2008, pp. 1–4.
-  C. Shan, “Learning local features for age estimation on real-life faces,” in Proc. ACM international workshop on Multimodal Pervasive Video Analysis, 2010, pp. 23–28.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Identity mapping in deep residual networks,” arXiv preprint arXiv:1603.05027, 2016.
-  G. Huang, Y. Sun, Z. Liu, and K. Weinberger, “Deep networks with stochastic depth,” arXiv preprint arXiv:1605.09382, 2016.
-  A. Krizhevsky, and G. Hinton, “Learning multiple layers of features from tiny images,” M.Sc. thesis, Dept. of Comput. Sci., Univ. of Toronto, Toronto, ON, Canada, 2009.
-  Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, “Reading digits in natural images with unsupervised feature learning,” in Proc. NIPS Workshop Deep Learning and Unsupervised Feature Learning, 2011, pp. 1–9.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “Imagenet large scale visual recognition challenge,” arXiv preprint arXiv:1409.0575, 2014.
-  S. Gross, and M. Wilber, “Training and investigating residual nets,” Facebook AI Research, CA. [Online]. Available: http://torch.ch/blog/2016/02/04/resnets.html, 2016.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” arXiv preprint arXiv:1502.01852, 2015.
-  T. Hassner, S. Harel, E. Paz, and R. Enbar, “Effective face frontalization in unconstrained images,” in Proc. CVPR, 2015, pp. 4295–4304.
-  S. Zagoruyko, and N. Komodakis, “Wide residual networks,” arXiv preprint arXiv:1605.07146, 2016.