I Introduction
Convolutional Neural Networks (ConvNets) have achieved state-of-the-art performance on various visual recognition tasks such as image classification [1], object detection [2] and semantic segmentation [3]. The availability of a huge set of training images is one of the most important factors for their success. However, it is difficult to collect sufficient training images with unambiguous labels in domains such as age estimation [4], head pose estimation [5], multi-label classification and semantic segmentation. Therefore, exploiting deep learning methods with limited samples and ambiguous labels has become an attractive yet challenging topic.
Why is it difficult to collect a large and accurately labeled training set? Firstly, it is difficult (even for domain experts) to provide exact labels for some tasks. For example, in semantic segmentation the pixels close to object boundaries are very difficult for annotators to label. In addition, pixel labeling is a time-consuming task that may limit the number of training samples. Another example is that people’s apparent age and head pose are difficult to describe with an accurate number. Secondly, it is very hard to gather complete and sufficient data. For example, it is difficult to build an age dataset covering people from 1 to 85 years old and to ensure that every age in this range has enough associated images. Similar difficulties arise in head pose estimation, where head poses are usually collected at a small set of angles with a 10° or 15° increment. Thus, the publicly available age, head pose and semantic segmentation datasets are small compared to those in image classification tasks.
These small datasets have a common characteristic, i.e., label ambiguity, which refers to the uncertainty among the ground-truth labels. On one hand, label ambiguity is unavoidable in some applications. We usually predict another person’s age in a way like “around 25”, which means using not only 25 but also neighboring ages to describe the face. Moreover, different people may have different guesses for the same face. Similar situations also hold for other types of tasks. In semantic segmentation, the labels of pixels at object boundaries are difficult to annotate because of the inherent ambiguity of these pixels. On the other hand, label ambiguity can also arise if we are not confident in the labels we provide for an image. In the multi-label classification task, some objects are clearly visible but difficult to recognize. Objects of this type are annotated as Difficult in the PASCAL Visual Object Classes (VOC) classification challenge [6], e.g., the chair in the third image of the first row in Fig. 1.




There are two main types of labeling methods: single-label recognition (SLR) and multi-label recognition (MLR). SLR assumes that one image or pixel has one label, and MLR assumes that one image or pixel may be assigned multiple labels. Both SLR and MLR aim to answer the question of which labels can be used to describe an image or pixel, but they cannot describe the label ambiguity associated with it. Label ambiguity will help improve recognition performance if it can be reasonably exploited. In order to utilize label correlation (which may be considered a consequence of label ambiguity in some applications), Geng et al. proposed a label distribution learning (LDL) approach for age estimation [4] and head pose estimation [7]. Recently, some improvements of LDL have been proposed. Xing et al. proposed two algorithms named LDLogitBoost and AOSO-LDLogitBoost to learn general models that relax the maximum entropy model in traditional LDL methods [8]. Furthermore, He et al. generated age label distributions through a weighted linear combination of the input image’s label and its context-neighboring samples [9]. However, these methods are suboptimal because they only utilize the correlation of neighboring labels in classifier learning, but not in learning the visual representations.
Deep ConvNets have natural advantages in feature learning. Existing ConvNet frameworks can be viewed as classification and regression models based on different optimization objective functions. In many cases, the softmax loss and the $\ell_2$ loss are used in deep ConvNet models for classification [10] and regression problems [11], respectively. The softmax loss maximizes the estimated probability of the ground-truth class without considering other classes, and the $\ell_2$ loss minimizes the squared difference between the estimated values of the network and the ground-truth. These methods have achieved satisfactory performance in some domains such as image classification, human pose estimation and object detection. However, existing deep learning methods cannot utilize the label ambiguity information. Moreover, a well-known fact is that learning a good ConvNet requires a lot of images.

In order to solve the issues mentioned above, we convert both traditional SLR and MLR problems to label distribution learning problems. Every instance is assigned a discrete label distribution according to its ground-truth. The label distribution can naturally describe the ambiguous information among all possible labels. Through deep label distribution learning, the number of training instances associated with each class label is significantly increased without actually increasing the total number of training examples. Fig. 1 intuitively shows four examples of label distributions for different recognition tasks. Then, we utilize a deep ConvNet to learn the label distribution in both feature learning and classifier learning. Since we learn label distributions with deep ConvNets, we call our method DLDL: Deep Label Distribution Learning. The benefits of DLDL are summarized as follows:

DLDL is an end-to-end learning framework which utilizes the label ambiguity in both feature learning and classifier learning;

DLDL not only achieves more robust performance than existing classification and regression methods, but also effectively relaxes the requirement for a large number of training images, e.g., a training face image with ground-truth label 25 is also useful for predicting faces at age 24 or 26;

DLDL (a single model without ensembling) achieves better performance than the state-of-the-art methods on age and head pose estimation tasks. DLDL also improves the performance for multi-label classification and semantic segmentation.
The rest of this paper is organized as follows. We first review the related work in Section II. Then, Section III proposes the DLDL framework, including the DLDL problem definition, DLDL theory, label distribution construction and training details. After that, the experiments are reported in Section IV. Finally, Section V presents discussions and the conclusion is given in Section VI.
II Related Work
In the past two decades, many efforts have been devoted to visual recognition, including at least image classification, object detection, semantic segmentation, and facial attribute (apparent age and head pose) estimation. These works can be divided into two streams. Earlier research was mainly based on handcrafted features, while more recent ones are usually deep learning methods. In this section, we briefly review these related approaches.
Methods based on handcrafted features usually include two stages. The first stage is feature extraction. The second stage learns models for recognition, detection or estimation using these features. SVM, random forest [12] and neural networks have commonly been used during the learning stage. In addition, Geng et al. proposed the label distribution learning approach to utilize the correlation among adjacent labels, which further improved performance on age estimation [4] and head pose estimation [7].

Although important progress has been made with these features, their handcrafted nature renders them suboptimal for particular tasks such as age or head pose estimation. More recently, learning feature representations has shown great advantages. For example, Lu et al. [13] tried to learn cost-sensitive local binary features for age estimation.
Deep learning has substantially improved upon the state-of-the-art in image classification [10], object detection [2], semantic segmentation [3] and many other vision tasks. In many cases, the softmax loss is used in deep models for classification [10]. Besides classification, deep ConvNets have also been trained for regression tasks such as head pose estimation [14] and facial landmark detection [15]. In regression problems, the training procedure usually optimizes a squared loss function. Satisfactory performance has also been obtained by using Tukey’s biweight function in human pose estimation [11]. In terms of model architecture, deep ConvNet models which use deeper architectures and smaller convolution filters (e.g., VGGNets [16] and VGGFace [17]) are very powerful. Nevertheless, these deep learning methods do not make use of the presence of label ambiguity in the training set, and usually require a large amount of training data.
A recent approach, used in Inception-v3 [18], is based on label smoothing (LS). Instead of using only the ground-truth label, it uses a mixture of the ground-truth label and a uniform distribution to regularize the classifier. However, LS is limited to a uniform distribution over labels and does not mine the labels’ ambiguous information. We believe that label ambiguity is too important to ignore. If we make good use of the ambiguity, we expect that the required number of training images for some tasks could be effectively reduced.
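To make the contrast concrete, the following is a minimal sketch of a label smoothing target, i.e., a mixture of a one-hot ground-truth vector and a uniform distribution. The function name `label_smoothing` and the mixing weight `eps=0.1` are illustrative assumptions, not values taken from this paper.

```python
def label_smoothing(num_classes, true_index, eps=0.1):
    """Mix a one-hot target with a uniform distribution.

    eps is a hypothetical mixing weight chosen for illustration only.
    """
    uniform = 1.0 / num_classes
    return [
        (1.0 - eps) * (1.0 if k == true_index else 0.0) + eps * uniform
        for k in range(num_classes)
    ]


# Every non-ground-truth class receives the same mass, regardless of how
# close it is to the ground-truth label.
target = label_smoothing(num_classes=5, true_index=2, eps=0.1)
```

Note that all wrong classes are treated identically, which is exactly the limitation discussed above: the target carries no information about which labels are close to the ground-truth.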
In this paper, we focus on how to exploit the label ambiguity in deep ConvNets. Age and head pose estimation from still face images are suitable applications of the proposed research. In addition, we also extend our work to multi-label classification and semantic segmentation.
III The Proposed DLDL Approach
In this section, we first give the definition of the DLDL problem. Then, we present the DLDL theory. Next, we propose methods for constructing label distributions for different recognition tasks. Finally, we briefly introduce the DLDL architecture and training details.
III-A The deep label distribution learning problem
Given an input image, we are interested in estimating a category output $y$ (e.g., age or head pose angles). For two input images $\mathbf{X}_1$ and $\mathbf{X}_2$ with ground-truth labels $y_1$ and $y_2$, $\mathbf{X}_1$ and $\mathbf{X}_2$ are supposed to be similar to each other if the correlation of $y_1$ and $y_2$ is strong, and vice versa. For example, the correlation between faces aged 32 and 33 should be stronger than that between faces aged 32 and 64, in terms of facial details that reflect the age (e.g., skin smoothness). In other words, we expect high correlation among input images with similar outputs. The label distribution learning approach [4, 7] exploited such correlations in the machine learning phase, but used features that were extracted without regard to these correlations. The proposed DLDL approach, however, is an end-to-end deep learning method which utilizes such correlation information in both feature learning and classifier learning. We will also extend DLDL to handle other types of label ambiguity beyond correlation.
To fulfill this goal, instead of outputting a single value $y$ for an input $\mathbf{X}$, DLDL quantizes the range of possible $y$ values into several labels. For example, in age estimation, it is reasonable to assume that $y$ lies in a bounded range, and it is a common practice to estimate integer values for ages. Thus, we can define the set $\mathcal{L}$ of integer ages in that range as the ordered label set for age estimation. The task of DLDL is then to predict a label distribution $\mathbf{y} \in \mathbb{R}^{|\mathcal{L}|}$, where $y_i$ is the estimated probability that $\mathbf{X}$ should be predicted to be $i$ years old. By estimating an entire label distribution, the deep learning machine is forced to take care of the ambiguity among labels.
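Specifically, the input space of our framework is $\mathcal{X} = \mathbb{R}^{h \times w \times d}$, where $h$, $w$ and $d$ are the height, width, and number of channels of the input image, respectively. DLDL predicts a label distribution vector $\mathbf{y} \in \mathbb{R}^{|\mathcal{L}|}$, where $\mathcal{L}$ is the label set defined for a specific task (e.g., the $\mathcal{L}$ above). We assume $\mathcal{L}$ is complete, i.e., any possible $y$ value has a corresponding member in $\mathcal{L}$. A training data set with $N$ instances is then denoted as $D = \{(\mathbf{X}^1, \mathbf{y}^1), \ldots, (\mathbf{X}^N, \mathbf{y}^N)\}$. We use boldface lowercase letters like $\mathbf{y}$ to denote vectors, and the $i$-th element of $\mathbf{y}$ is denoted as $y_i$. The goal of DLDL is to directly learn a conditional probability mass function $\hat{\mathbf{y}} = p(\mathbf{y} \mid \mathbf{X}; \theta)$ from $D$, where $\theta$ denotes the parameters of the framework.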
III-B Deep label distribution learning
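Given an instance $\mathbf{X}$ with label distribution $\mathbf{y}$, we assume that $\mathbf{x} = \phi(\mathbf{X}; \theta)$ is the activation of the last fully connected layer in a deep ConvNet. We use a softmax function to turn these activations into a probability distribution, that is,
$$\hat{y}_j = \frac{\exp(x_j)}{\sum_t \exp(x_t)}. \tag{1}$$
Given a training data set $D$, the goal of DLDL is to find $\theta$ to generate a distribution $\hat{\mathbf{y}}$ that is similar to $\mathbf{y}$.
There are different criteria to measure the similarity or distance between two distributions. For example, if the Kullback-Leibler (KL) divergence is used as the measurement of the similarity between the ground-truth and predicted label distribution, then the best parameter $\theta^*$ is determined by
$$\theta^* = \arg\min_\theta \sum_k y_k \ln\frac{y_k}{\hat{y}_k} = \arg\min_\theta \, -\sum_k y_k \ln \hat{y}_k. \tag{2}$$
Thus, we can define the loss function as:
$$T = -\sum_k y_k \ln \hat{y}_k. \tag{3}$$
Stochastic gradient descent is used to minimize the objective function Eq. 3. For any $k$ and $j$,
$$\frac{\partial T}{\partial \hat{y}_k} = -\frac{y_k}{\hat{y}_k}, \tag{4}$$
and the derivative of softmax (Eq. 1) is well known, as
$$\frac{\partial \hat{y}_k}{\partial x_j} = \hat{y}_k \left(\delta_{(k=j)} - \hat{y}_j\right), \tag{5}$$
where $\delta_{(k=j)}$ is 1 if $k = j$, and 0 otherwise. According to the chain rule, for any fixed $j$, we have
$$\frac{\partial T}{\partial x_j} = \sum_k \frac{\partial T}{\partial \hat{y}_k} \frac{\partial \hat{y}_k}{\partial x_j} = -y_j + \hat{y}_j \sum_k y_k = \hat{y}_j - y_j. \tag{6}$$
Thus, the derivative of $T$ with respect to $\theta$ is
$$\frac{\partial T}{\partial \theta} = (\hat{\mathbf{y}} - \mathbf{y}) \frac{\partial \mathbf{x}}{\partial \theta}. \tag{7}$$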
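The derivation can be checked numerically. The sketch below is a minimal pure-Python illustration (not the MatConvNet implementation used in the paper): it computes the softmax of Eq. 1 and the KL-based loss of Eq. 3, then compares the closed-form gradient with respect to the activations, predicted distribution minus ground-truth distribution, against central finite differences. The example vectors `y` and `x` are made up for illustration.

```python
import math


def softmax(x):
    """Softmax (Eq. 1), shifted by max(x) for numerical stability."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def kl_loss(y, x):
    """Loss of Eq. 3: minus the sum of y_k * ln(y_hat_k)."""
    y_hat = softmax(x)
    return -sum(yk * math.log(yh) for yk, yh in zip(y, y_hat) if yk > 0)


def kl_grad(y, x):
    """Closed-form gradient w.r.t. the activations: y_hat - y (Eq. 6)."""
    return [yh - yk for yh, yk in zip(softmax(x), y)]


def numeric_grad(y, x, h=1e-6):
    """Central finite differences, used only to check kl_grad."""
    grads = []
    for j in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[j] += h
        x_minus[j] -= h
        grads.append((kl_loss(y, x_plus) - kl_loss(y, x_minus)) / (2 * h))
    return grads


y = [0.1, 0.6, 0.3]    # illustrative ground-truth label distribution
x = [0.5, -1.0, 2.0]   # illustrative last-layer activations
analytic = kl_grad(y, x)
numeric = numeric_grad(y, x)
max_gap = max(abs(a - n) for a, n in zip(analytic, numeric))
```

The simplification in Eq. 6 relies on the ground-truth distribution summing to one, which is why the label distribution must be normalized before training.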
Once $\theta$ is learned, the label distribution $\hat{\mathbf{y}}$ of any new instance $\mathbf{X}$ can be generated by a forward run of the network. If the expected class label is a single one, DLDL outputs $l_{j^*}$, where
$$j^* = \arg\max_j \hat{y}_j. \tag{8}$$
Prediction with multiple labels is also allowed, which could be a set $\{l_j \mid \hat{y}_j > \xi\}$, where $\xi$ is a predefined threshold. If the expected output is a real number, DLDL predicts the expectation of $\hat{\mathbf{y}}$, as
$$\sum_j \hat{y}_j l_j, \tag{9}$$
where $l_j \in \mathcal{L}$. This indicates that DLDL is suitable for both classification and regression tasks.
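The three prediction rules can be sketched in a few lines. The label values, the predicted distribution and the threshold `xi=0.15` below are illustrative assumptions, not values from the experiments.

```python
def predict_max(labels, y_hat):
    """Single-label prediction: the label with the highest probability (Eq. 8)."""
    j_star = max(range(len(y_hat)), key=lambda j: y_hat[j])
    return labels[j_star]


def predict_set(labels, y_hat, xi):
    """Multi-label prediction: all labels whose probability exceeds xi."""
    return [l for l, p in zip(labels, y_hat) if p > xi]


def predict_expectation(labels, y_hat):
    """Real-valued prediction: the expectation under y_hat (Eq. 9)."""
    return sum(l * p for l, p in zip(labels, y_hat))


labels = [23, 24, 25, 26, 27]              # hypothetical ordered label set
y_hat = [0.05, 0.20, 0.40, 0.25, 0.10]     # hypothetical network output
single = predict_max(labels, y_hat)
multi = predict_set(labels, y_hat, xi=0.15)
real = predict_expectation(labels, y_hat)
```

For an asymmetric predicted distribution such as this one, the expectation is pulled toward the heavier side and need not coincide with the most probable label.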
III-C Label distribution construction
The ground-truth label distribution is not available in most existing datasets, so it must be generated under proper assumptions. A desirable label distribution $\mathbf{y}$ must satisfy some basic principles: (1) $\mathbf{y}$ should be a probability distribution. Thus, we have $y_i \in [0, 1]$ and $\sum_i y_i = 1$. (2) The probability values should differ among all possible labels associated with an image. In other words, a less ambiguous category must be assigned a high probability and more ambiguous labels must have low probabilities. In this section, we propose ways to construct label distributions for age estimation, head pose estimation, multi-label classification and semantic segmentation.
For age estimation, we assume that the probabilities should concentrate around the ground-truth age $y$. Thus, we quantize $y$ to get $\mathbf{y}$ using a normal distribution. For example, the apparent age of a face is labeled by hundreds of users. The ground-truth (including a mean $\mu$ and a standard deviation $\sigma$) is calculated from all the votes. For this problem, we find the range of the target $y$ and quantize it into a complete and ordered label set $\mathcal{L} = \{l_1, l_2, \ldots, l_C\}$, where $C$ is the label set size and $l_i$ are all possible predictions for $y$. A label distribution $\mathbf{y}$ is then $(y_1, y_2, \ldots, y_C)$, where $y_i$ is the probability that $y = l_i$ (i.e., $\Pr(y = l_i)$ for $i = 1, 2, \ldots, C$). Since we use an equal step size in quantizing $y$, the normal p.d.f. (probability density function) is a natural choice to generate the ground-truth $\mathbf{y}$ from $\mu$ and $\sigma$:
$$y_j = \frac{p(l_j \mid \mu, \sigma)}{\sum_k p(l_k \mid \mu, \sigma)}, \tag{10}$$
where $p(l_j \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(l_j - \mu)^2}{2\sigma^2}\right)$. Fig. 1a shows a face and its corresponding label distribution. For problems where $\sigma$ is unknown, we will show that a reasonably chosen $\sigma$ also works well in DLDL.
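As an illustration of this construction, the sketch below evaluates a normal p.d.f. at every label and normalizes the result so that it sums to one. The label set of integer ages 1 to 85 and the values `mu=25.0`, `sigma=2.0` are assumptions chosen for the example, not settings reported here.

```python
import math


def normal_pdf(l, mu, sigma):
    """Normal probability density evaluated at label l."""
    return math.exp(-((l - mu) ** 2) / (2 * sigma ** 2)) / (
        math.sqrt(2 * math.pi) * sigma
    )


def age_label_distribution(labels, mu, sigma):
    """Discrete label distribution: normalized normal densities (Eq. 10)."""
    densities = [normal_pdf(l, mu, sigma) for l in labels]
    total = sum(densities)
    return [d / total for d in densities]


labels = list(range(1, 86))   # hypothetical label set: integer ages 1..85
y = age_label_distribution(labels, mu=25.0, sigma=2.0)
peak_label = labels[max(range(len(y)), key=lambda j: y[j])]
```

With these values, ages 24 and 26 receive equal, non-trivial probability, so a face labeled 25 also contributes training signal for its neighboring ages.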
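For head pose estimation, we need to jointly estimate pitch and yaw angles. Thus, learning a joint distribution is also necessary in DLDL. Suppose the label set is $\mathcal{L} = \{l_{jk} \mid j = 1, \ldots, n_1,\; k = 1, \ldots, n_2\}$, where $l_{jk}$ is a pair of values. That is, we want to learn the joint distribution of two variables. Then, the label distribution can be represented by an $n_1 \times n_2$ matrix, whose $(j,k)$-th element is $y_{jk}$. For example, when we use two angles (pitch and yaw) to describe a head pose, $l_{jk}$ is a pair of pitch and yaw angles. Given an instance with ground-truth mean $\boldsymbol{\mu}$ and covariance matrix $\Sigma$, we calculate its label distribution as
$$y_{jk} = \frac{p(l_{jk} \mid \boldsymbol{\mu}, \Sigma)}{\sum_j \sum_k p(l_{jk} \mid \boldsymbol{\mu}, \Sigma)}, \tag{11}$$
where $p(l \mid \boldsymbol{\mu}, \Sigma) = \frac{1}{2\pi\sqrt{|\Sigma|}} \exp\left(-\frac{1}{2}(l - \boldsymbol{\mu})^T \Sigma^{-1} (l - \boldsymbol{\mu})\right)$. In the above, we assume $\Sigma = \mathrm{diag}(\sigma^2, \sigma^2)$, that is, the covariance matrix is diagonal. Fig. 1b shows an example of a joint label distribution for a head pose.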
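A joint label distribution of this kind can be sketched as follows, under the assumption of a diagonal covariance with the same standard deviation for both angles. The angle grids, the mean `(0.0, 30.0)` and `sigma=15.0` are hypothetical values for illustration. The constant factor of the bivariate normal density is omitted because it cancels in the normalization.

```python
import math


def joint_label_distribution(pitches, yaws, mu, sigma):
    """Normalized bivariate normal over a (pitch, yaw) grid.

    Assumes a diagonal covariance with equal variance sigma**2 on both axes.
    """
    mu_pitch, mu_yaw = mu
    dens = [
        [
            math.exp(-((p - mu_pitch) ** 2 + (w - mu_yaw) ** 2) / (2 * sigma ** 2))
            for w in yaws
        ]
        for p in pitches
    ]
    total = sum(sum(row) for row in dens)
    return [[d / total for d in row] for row in dens]


pitches = list(range(-30, 31, 15))   # hypothetical pitch grid (degrees)
yaws = list(range(-90, 91, 15))      # hypothetical yaw grid (degrees)
y = joint_label_distribution(pitches, yaws, mu=(0.0, 30.0), sigma=15.0)

# Grid position of the most probable (pitch, yaw) pair.
peak = max(
    ((j, k) for j in range(len(pitches)) for k in range(len(yaws))),
    key=lambda jk: y[jk[0]][jk[1]],
)
```

The resulting matrix has one row per pitch angle and one column per yaw angle, matching the matrix-shaped network output used for head pose estimation.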
For multi-label classification, a multi-label image always contains at least one object of the classes of interest. There are usually multiple labels for an image. In the PASCAL VOC dataset [6], these labels are grouped into three different levels: Positive, Negative and Difficult. A Positive label means that the image contains objects from that category, and a Negative label means that it does not. Difficult indicates that an object is clearly visible but difficult to recognize. Existing multi-label methods often view Difficult as Negative, which leads to the loss of useful information. It is not reasonable either to simply treat Difficult as Positive. Therefore, a natural choice is to use label ambiguity. We define different probabilities for different types of labels, as
$$p_P > p_D > p_N \tag{12}$$
for Positive, Difficult and Negative labels, respectively. Furthermore, an $\ell_1$ normalization is applied to ensure $\sum_c y_c = 1$:
$$y_c = \frac{p_c}{\sum_c p_c}, \tag{13}$$
where $p_c$ equals $p_P$, $p_D$ or $p_N$ if the label $c$ is Positive, Difficult or Negative, respectively. The label distribution is shown for a multi-label image in Fig. 1c.
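The normalization step can be illustrated with a short sketch. The per-level probabilities `p_pos=1.0`, `p_diff=0.5` and `p_neg=0.0` are placeholder values chosen only so that Positive exceeds Difficult, which exceeds Negative; they are not the values used in the experiments.

```python
def multilabel_distribution(annotations, p_pos=1.0, p_diff=0.5, p_neg=0.0):
    """Turn per-class annotation levels into a label distribution.

    The default probabilities are hypothetical placeholders.
    """
    level_prob = {"Positive": p_pos, "Difficult": p_diff, "Negative": p_neg}
    raw = [level_prob[a] for a in annotations]
    total = sum(raw)  # > 0 because every image has at least one Positive label
    return [r / total for r in raw]


annotations = ["Positive", "Negative", "Difficult", "Negative", "Positive"]
y = multilabel_distribution(annotations)
```

Here a Difficult class keeps a non-zero share of the probability mass instead of being collapsed into either Negative or Positive.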
For semantic segmentation, we need to label a pixel as belonging to one class if it is a pixel inside an object of that class, or as the background otherwise. Let $y_i$ denote the annotation of the $i$-th pixel, where $y_i \in \{0, 1, \ldots, C\}$ (assuming there are $C$ categories and 0 for background). Fully Convolutional Networks (FCN) have been an effective solution to this task. In FCN [3], a ground-truth label $y_i = c$ means that the probability of class $c$ at pixel $i$ is 1 and the probability of every other class is 0. However, it is very difficult to specify ground-truth labels for pixels close to object boundaries, because labels of these pixels are inherently ambiguous. We propose a mechanism to describe the label ambiguity at the boundaries. Considering a Gaussian kernel matrix, we replace the original label distribution with a smoothed one, as
(14) 
where , , is the kernel size, and
are padding and stride sizes. In our experiment, we set
, and , and the generated label distribution is(15) 
Fig. 1d gives the semantic label distribution for a bird image, which shows that the ambiguity is encoded in the label distributions.
III-D The DLDL architecture and training details
We use a deep ConvNet and a training set to learn a $\hat{\mathbf{y}}$ as the estimation of $\mathbf{y}$. The structure of our network is based on popular deep models such as ZFNet [19] and VGGNets [16]. The ZFNet consists of five convolution layers, followed by three fully connected layers. The VGGNets architecture includes 16 or 19 layers. We modify the last fully connected layer’s output based on the task and replace the original softmax loss function with the KL loss function. In addition, we use the parametric ReLU [20] for ZFNet. In our network, the input is an order-three tensor and the output may be a vector (age estimation and multi-label classification), a matrix (head pose estimation) or a tensor (semantic segmentation).

In this paper, we train the deep models in two ways:
Training from scratch.
For ZFNet, the weights are initialized randomly from a Gaussian distribution with zero mean and 0.01 standard deviation, and biases are initialized to zero. The coefficient of the parametric ReLU is initialized to 0.25. Dropout is applied to the last two fully connected layers with rate 0.5. The coefficient of weight decay is set to
. Optimization is done by Stochastic Gradient Descent (SGD) using minibatches of 128 and the momentum coefficient is 0.9. The initial learning rate is set to 0.01. The total number of epochs is about 20.
Fine-tuning. Three pre-trained models, including VGGNets (16 layers and 19 layers) and VGGFace (16 layers), are fine-tuned for different tasks. We remove the classification layer and loss layer of these pre-trained models, and put in our label distribution layer, which is initialized from the Gaussian distribution, and the KL loss layer. The learning rates of the convolutional layers, the first two fully connected layers and the label distribution layer are initialized as 0.001, 0.001 and 0.01, respectively. We fine-tune all layers by back-propagation through the whole net using mini-batches of 32. The total number of epochs is about 10 for age estimation and 20 for multi-label classification.
IV Experiments
We evaluate DLDL on four tasks, i.e., age estimation, head pose estimation, multi-label classification and semantic segmentation. Our implementation is based on MatConvNet [21] (http://www.vlfeat.org/matconvnet/). All our experiments are carried out on an NVIDIA K40 GPU with 12GB of onboard memory.
IV-A Age estimation
Datasets. Two age estimation datasets are used in our experiments. The first is Morph [22], which is one of the largest publicly available age datasets. There are 55,134 face images from more than 13,000 subjects. Ages range from 16 to 77. Since no TRAIN/TEST split is provided, 10-fold cross-validation is used for Morph.
The second dataset is from the apparent age estimation competition, the first competition track of the ICCV ChaLearn LAP 2015 workshop [23]. Compared with Morph, this dataset (ChaLearn) consists of images collected in the wild, without any position, illumination or quality restriction. The only condition is that each image contains only one face. The dataset has 4,699 images, and is split into 2,476 training (TRAIN), 1,136 validation (VAL) and 1,087 testing (TEST) images. The apparent age (i.e., how old the person looks) of each image is labeled by multiple individuals. The ages of the face images range from 3 to 85. For each image, its mean age and the corresponding standard deviation are given. Since the ground-truth for the TEST images is not published, we train on the TRAIN split and evaluate on the VAL split of ChaLearn images.
Baselines. To demonstrate the effectiveness of DLDL, we first consider two related methods as baselines: ConvNet+LS (KL) and ConvNet+LD (div). The former uses label smoothing (LS) [18] as ground-truth and KL divergence as loss function. The latter uses label distribution (LD) as ground-truth and divergence [24] as loss function, which is
(16) 
In addition, we also compare DLDL with the following baseline methods:

BFGSLDL Geng et al. proposed the label distribution learning approach (IISLLD) for age and head pose estimation. They used traditional image features. To further improve IISLLD, Geng et al. [25] proposed the BFGSLDL algorithm, which uses the effective quasi-Newton optimization method BFGS.

CConvNet Classification ConvNets have obtained very competitive performance in various computer vision tasks. ZFNet [19] and VGGNet are popular models which use the softmax loss. We replace the ImageNet-specific 1000-way classification in these models with the label set $\mathcal{L}$.
RConvNet ConvNets are also successfully trained for regression tasks. In RConvNet, the ground-truth label $y$ (age and pose angle) is projected into the range $[-1, 1]$ by the mapping $y' = 2\,\frac{y - y_{\min}}{y_{\max} - y_{\min}} - 1$, where $y_{\max}$ and $y_{\min}$ are the maximum and minimum values in the training label set. During prediction, the RConvNet regression result is reverse mapped to get $\hat{y}$. To speed up convergence, the last fully connected layer is followed by a hyperbolic tangent activation function $\tanh(\cdot)$, which maps to $[-1, 1]$ [14]. The squared $\ell_2$, $\ell_1$ and $\epsilon$-ins loss functions are used in RConvNet.
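The target normalization used for the regression baseline can be sketched as a linear min-max mapping and its inverse. The target range of -1 to 1 is an assumption made here to match the output range of the hyperbolic tangent; the function names are illustrative, and the example uses the Morph age range of 16 to 77.

```python
def to_unit_range(y, y_min, y_max):
    """Linearly map a label from [y_min, y_max] to [-1, 1] (assumed range)."""
    return 2.0 * (y - y_min) / (y_max - y_min) - 1.0


def from_unit_range(t, y_min, y_max):
    """Inverse mapping, applied to the network output at prediction time."""
    return (t + 1.0) * (y_max - y_min) / 2.0 + y_min


y_min, y_max = 16, 77                     # age range of Morph
target = to_unit_range(30, y_min, y_max)  # regression target for age 30
recovered = from_unit_range(target, y_min, y_max)
```

Because the mapping is affine, an error measured in the normalized space scales back to an error in years by the constant factor `(y_max - y_min) / 2`.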
Implementation details. We use the same preprocessing pipeline for all compared methods, including face detection, facial key point detection and face alignment, as shown in Fig. 2. We employ the DPM model [26] to detect the main facial region. Then, the detected face is fed into cascaded convolution networks [15] to get the five facial key points, including the left/right eye centers, nose tip and left/right mouth corners. Finally, based on these facial points, we align the face to the upright pose. Data augmentation is only applied to the training images for ChaLearn. For one color input training image, we generate its grayscale version, and left-right flip both the color and grayscale versions. Thus, every training image turns into 4 images.

The same label set $\mathcal{L}$ is used for both datasets. The label distribution of each image is generated using Eq. 10. The mean $\mu$ is provided in both Morph and ChaLearn. The standard deviation $\sigma$, however, is provided in ChaLearn but not in Morph. We simply set $\sigma$ to a fixed value in Morph. Experiments for different methods are conducted under the same data splits.
| Description | Morph MAE | ChaLearn MAE | ChaLearn $\epsilon$-error |
|---|---|---|---|
| IISLDL [4] | 5.67±0.15 | | |
| CPNN [4] | 4.87±0.31 | | |
| ST+CSHOR [27]^{1} | 3.82 | | |
| MS ConvNets [28] | 3.63 | | |
| ConvNets [29]^{1} | 3.31 | | |
| VGG (softmax, Exp) [30]^{3} | | 6.08 | 0.51 |
| VGG (softmax, Exp) [30]^{2,3} | | 3.22 | 0.28 |
| VGG (softmax, Exp) [31]^{2,3} | 2.68 | 3.25 | 0.28 |
| BFGSLDL (KL, Max) | 3.94±0.05 | 7.81 | 0.57 |
| BFGSLDL (KL, Exp) | 3.85±0.05 | 6.79 | 0.53 |
| CConvNet (softmax, Max) | 3.02±0.05 | 9.48 | 0.63 |
| CConvNet (softmax, Exp) | 2.86±0.05 | 7.95 | 0.58 |
| RConvNet ($\ell_2$) | 3.17±0.04 | 5.94 | 0.50 |
| RConvNet ($\ell_1$) | 2.88±0.03 | 5.62 | 0.47 |
| RConvNet ($\epsilon$-ins) | 2.89±0.04 | 5.71 | 0.48 |
| ConvNet+LS (KL, Max) | 2.96±0.13 | 8.64 | 0.59 |
| ConvNet+LS (KL, Exp) | 5.02±0.13 | 11.58 | 0.77 |
| ConvNet+LD (div, Max) | 2.57±0.04 | | 0.47 |
| ConvNet+LD (div, Exp) | 2.57±0.04 | | 0.46 |
| DLDL (KL, Max) | 2.51±0.03 | | 0.44 |
| DLDL (KL, Exp) | 2.52±0.03 | | 0.44 |
| DLDL+VGGFace (KL, Max)^{3} | 2.42±0.01 | 3.62 | 0.32 |
| DLDL+VGGFace (KL, Exp)^{3} | 2.43±0.01 | 3.51 | 0.31 |

^{1} Used 80% of Morph images for training and 20% for evaluation;
^{2} Used additional external face images (i.e., IMDBWIKI);
^{3} Used pre-trained model (i.e., VGGNets or VGGFace).
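Evaluation criteria. Mean Absolute Error (MAE) and Cumulative Score (CS) are used to evaluate the performance of age estimation. MAE is the average absolute difference between the predicted and the real age:
$$\mathrm{MAE} = \frac{1}{K} \sum_{k=1}^{K} |\hat{y}_k - y_k|, \tag{17}$$
where $\hat{y}_k$ and $y_k$ are the estimated and ground-truth age of the $k$-th testing image, respectively, and $K$ is the number of testing images. CS is defined as the accuracy rate of correct estimation:
$$\mathrm{CS}(t) = \frac{K_t}{K} \times 100\%, \tag{18}$$
where $K_t$ is the number of correct estimations, i.e., testing images that satisfy $|\hat{y}_k - y_k| \le t$, for an error level $t$ that is fixed in our experiment. In addition, a special measurement (named $\epsilon$-error) is defined by the ChaLearn competition, computed as
$$\epsilon = 1 - \exp\left(-\frac{(\hat{y} - \mu)^2}{2\sigma^2}\right), \tag{19}$$
where $\mu$ and $\sigma$ are the mean and standard deviation of the apparent ages annotated for the image.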
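The three criteria can be sketched in a few lines of Python. The sketch assumes that CS is reported as a percentage and that the ChaLearn error is computed per image and then averaged over the test set; the sample predictions are made up for illustration.

```python
import math


def mae(pred, truth):
    """Mean absolute error between predicted and ground-truth ages."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)


def cumulative_score(pred, truth, level):
    """Percentage of test images whose absolute error is at most `level`."""
    correct = sum(1 for p, t in zip(pred, truth) if abs(p - t) <= level)
    return 100.0 * correct / len(truth)


def epsilon_error(pred, mu, sigma):
    """Average of 1 - exp(-(pred - mu)^2 / (2 sigma^2)) over all images."""
    errors = [
        1.0 - math.exp(-((p - m) ** 2) / (2 * s ** 2))
        for p, m, s in zip(pred, mu, sigma)
    ]
    return sum(errors) / len(errors)


pred = [25.0, 31.0, 40.0]    # hypothetical predictions
truth = [24.0, 35.0, 40.0]   # hypothetical ground-truth ages
mae_value = mae(pred, truth)
cs_value = cumulative_score(pred, truth, level=2)
eps_value = epsilon_error([32.0], mu=[30.0], sigma=[2.0])
```

Unlike MAE, this last measure weights each error by the annotators' own disagreement: the same 2-year miss costs less on a face whose votes have a large standard deviation.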
Results. Table I lists results on both datasets. The upper part shows results from the literature. The middle part shows the baseline results. The lower part shows the results of the proposed approach. The first term in the parentheses after each method is the loss function used by that method. Max or Exp denotes prediction according to Eq. 8 or Eq. 9, respectively. Since cross-validation is used on Morph, we also report its standard deviations.

From Table I, we can see that DLDL consistently outperforms baselines and other published methods. The difference between DLDL (KL, Max) and its competitor CConvNet (softmax, Max) is 0.51 on Morph. This gap is more than 6 times the sum of their standard deviations (0.03+0.05), showing statistically significant differences. The advantage of DLDL over RConvNet, CConvNet and ConvNet+LS suggests that learning a label distribution is advantageous in deep end-to-end models. DLDL has much better results than BFGSLDL, which shows that the learned deep features are more powerful than manually designed ones. Compared to ConvNet+LD (div), DLDL (KL) achieves lower MAE on both datasets. This indicates that the KL divergence is better than the divergence of Eq. 16 for measuring the similarity of two distributions in this context.

We find that CConvNet and RConvNet are not stable. The RConvNet ($\ell_1$) method, although being the second best method for ChaLearn, is inferior to CConvNet (softmax, Exp) for Morph. In addition, we also find that Eq. 9 is better than Eq. 8 in many cases, which suggests that Eq. 9 is more suitable than Eq. 8 for age estimation.
Fine-tuning DLDL. Instead of training DLDL from scratch, we also fine-tune the network of VGGFace [17]. On the small-scale ChaLearn dataset, the MAE of DLDL is reduced from 5.34 to 3.51, yielding a significant improvement. The $\epsilon$-error of DLDL is reduced from 0.44 to 0.31, which is close to the best competition result 0.28 [30] on the validation set. In [31], external training images (260,282 additional external training images with real age annotation) were used. DLDL only uses the ChaLearn dataset’s 2,476 training images and is the best among ChaLearn teams that do not use external data [23]. In the competition, the best external-data-free $\epsilon$-error is 0.48, which is worse than DLDL’s. However, the idea in [31] of using external data is useful for further reducing DLDL’s estimation error.
Fig. (a)a and Fig. (b)b show the CS curves on ChaLearn and Morph datasets. At every error level, our DLDL finetuned VGGFace always achieves the best accuracy among all methods. It is noteworthy that the CS curves of DLDL (KL, Max) and ConvNet (div, Max) are very close to that of the DLDL+VGGFace (KL, Max) on Morph even without lots of external data and very deep model. This observation supports the idea that using DLDL can achieve competitive performance even with limited training samples.
In Fig. (g)g, we show some examples of face images from the ChaLearn validation set and predicted label distributions by DLDL (KL, Exp). In many cases, our solution is able to accurately predict the apparent age of faces. Failures may come from two causes. The first is the failure to detect or align the face. The second is some extreme conditions of face images such as occlusion, low resolution, heavy makeup and old photos.
IV-B Head pose estimation
Datasets. We use three datasets in head pose estimation: Pointing’04 [32], BJUT3D [33] and Annotated Facial Landmarks in the Wild (AFLW) [34]. In them, head pose is determined by two angles: pitch and yaw. Pointing’04 discretizes the pitch into 9 angles $\{0^\circ, \pm15^\circ, \pm30^\circ, \pm60^\circ, \pm90^\circ\}$ and the yaw into 13 angles $\{0^\circ, \pm15^\circ, \pm30^\circ, \ldots, \pm90^\circ\}$. When the pitch angle is $+90^\circ$ or $-90^\circ$, the yaw angle is always set to $0^\circ$. Thus, there are 93 poses in total ($7 \times 13 + 2$). The head images are taken from 15 different human subjects in two different time periods, resulting in $15 \times 2 \times 93 = 2{,}790$ images.
BJUT3D contains 500 3D faces (250 male and 250 female people), acquired by a CyberWare Laser Scanner in an engineered environment. 9 pitch angles and 13 yaw angles are used. There are in total 93 poses in this dataset, similar to that in Pointing’04. Therefore, $500 \times 93 = 46{,}500$ face images are obtained.
| Methods | Description | MAE Pitch | MAE Yaw | MAE Pitch+Yaw | Acc Pitch | Acc Yaw | Acc Pitch+Yaw |
|---|---|---|---|---|---|---|---|
| | LDLwJ [7] | 2.69±0.15 | 4.24±0.17 | 6.45±0.29 | 86.24±0.97 | 73.30±1.36 | 64.27±1.82 |
| Baselines | BFGSLDL (KL) | 1.99±0.19 | 4.00±0.20 | 5.68±0.13 | 88.78±0.11 | 74.37±0.13 | 66.42±0.11 |
| | CConvNet (softmax) | 5.28±0.65 | 6.02±0.44 | 10.56±0.74 | 73.15±2.74 | 62.90±1.81 | 42.97±1.67 |
| | RConvNet () | 6.11±0.33 | 6.61±0.17 | 10.13±0.26 | | | |
| | RConvNet () | 5.94±0.71 | 5.90±0.39 | 9.43±0.79 | | | |
| | RConvNet (ins) | 5.77±0.45 | 6.66±0.19 | 9.04±0.40 | | | |
| | ConvNet+LS (KL) | 5.23±0.39 | 5.87±0.53 | 10.42±0.66 | 72.62±1.01 | 62.90±2.76 | 41.83±2.20 |
| | ConvNet+LD (div) | 1.94±0.20 | 3.68±0.16 | 5.34±0.17 | 90.00±0.77 | 76.27±0.82 | 69.00±0.89 |
| Ours | DLDL (KL) | 1.69±0.32 | 3.16±0.07 | 4.64±0.24 | 91.65±1.13 | 79.57±0.57 | 73.15±0.72 |

MAE: lower is better. Acc: higher is better.
| Methods | Description | MAE Pitch | MAE Yaw | MAE Pitch+Yaw | Acc Pitch | Acc Yaw | Acc Pitch+Yaw |
|---|---|---|---|---|---|---|---|
| Baselines | BFGSLDL (KL) | 0.19±0.02 | 0.33±0.04 | 0.51±0.05 | 98.15±0.19 | 96.69±0.38 | 94.95±0.54 |
| | CConvNet (Softmax) | 0.06±0.01 | 0.09±0.02 | 0.14±0.03 | 99.45±0.09 | 99.16±0.16 | 98.64±0.23 |
| | RConvNet () | 1.83±0.01 | 2.17±0.03 | 3.15±0.03 | | | |
| | RConvNet () | 1.25±0.06 | 1.37±0.09 | 2.11±0.09 | | | |
| | RConvNet (ins) | 1.21±0.07 | 1.42±0.07 | 2.09±0.10 | | | |
| | ConvNet+LS (KL) | 0.05±0.01 | 0.08±0.01 | 0.12±0.01 | 99.55±0.06 | 99.28±0.08 | 98.86±0.10 |
| | ConvNet+LD (div) | 0.07±0.01 | 0.12±0.02 | 0.19±0.02 | 99.31±0.04 | 98.82±0.20 | 98.15±0.21 |
| Ours | DLDL (KL) | 0.02±0.01 | 0.07±0.01 | 0.09±0.01 | 99.81±0.04 | 99.27±0.08 | 99.09±0.09 |

MAE: lower is better. Acc: higher is better.
Unlike Pointing’04 and BJUT3D, the AFLW is a realworld face database. Head pose is coarsely obtained by fitting a mean 3D face with the POSIT algorithm [35]. The dataset contains about 24k faces in realworld images. We select 23,409 faces to ensure pitch and yaw angles within .
Implementation details. The head region is provided by bounding box annotations in Pointing’04 and AFLW. The BJUT3D images do not contain background regions. Therefore, we do not perform any preprocessing.
In DLDL, we set $\sigma$ separately for Pointing’04 and BJUT3D when constructing label distributions. For AFLW, the groundtruth head pose angles are given as real numbers. Groundtruth (pitch and yaw) angles are divided from $-90^\circ$ to $+90^\circ$ in steps of $3^\circ$, so we get $61 \times 61 = 3{,}721$ (pitch, yaw) pair category labels. A separate $\sigma$ is set for AFLW. Since the discrete Jeffrey’s divergence is used in LDL [7], we implement BFGSLDL with the Kullback-Leibler divergence. All experiments are performed under the same setting, including data splits, input size and network architecture.
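As a rough illustration of how a label distribution over (pitch, yaw) classes might be constructed, the sketch below places a discretized, isotropic Gaussian on the groundtruth pair and normalizes it to sum to one. The Gaussian shape, the function name `pose_label_distribution`, and the example grid and spread value are assumptions made for illustration only; they are not the settings used in the experiments.

```python
import math


def pose_label_distribution(pitch, yaw, pitch_bins, yaw_bins, sigma):
    """Discretized isotropic Gaussian over (pitch, yaw) classes, normalized to sum to 1.

    Assumption: a Gaussian shape with one standard deviation `sigma` (in degrees)
    shared by both angles; this is an illustrative choice.
    """
    weights = {}
    for p in pitch_bins:
        for y in yaw_bins:
            sq_dist = (p - pitch) ** 2 + (y - yaw) ** 2
            weights[(p, y)] = math.exp(-sq_dist / (2.0 * sigma ** 2))
    total = sum(weights.values())
    return {pair: w / total for pair, w in weights.items()}


# Hypothetical grid (every 15 degrees) and sigma, chosen only for illustration.
bins = list(range(-90, 91, 15))
dist = pose_label_distribution(pitch=0, yaw=30, pitch_bins=bins, yaw_bins=bins, sigma=15.0)
peak = max(dist, key=dist.get)  # the class with the largest probability
```

In this sketch the groundtruth class keeps the largest probability, while neighboring poses receive smaller, distance-dependent mass.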
To validate the effectiveness of DLDL for head pose estimation, we use the same baselines as in age estimation. Our experiments show that Eq. 9 yields lower accuracy than Eq. 8. Hence, we use Eq. 8 in this section.
Evaluation criteria. Three types of prediction values are evaluated: pitch, yaw, and pitch+yaw, where pitch+yaw jointly estimates the pitch and yaw angles. Two measurements are used: MAE (Eq. 17) and classification accuracy (Acc). When we treat different poses as different classes, Acc measures the pose class classification accuracy. In particular, the MAE of pitch+yaw is calculated as the Euclidean distance between the predicted (pitch, yaw) pair and the groundtruth pair; the Acc of pitch+yaw is calculated by regarding each (pitch, yaw) pair as a class. For RConvNet, we report only its MAE and not its Acc, because its predicted values are continuous real numbers. All methods are tested with 5-fold cross-validation for Pointing’04 and BJUT3D following [7]. For AFLW, 15,561 face images are randomly chosen for training, and the remaining 7,848 for evaluation. The setup is similar to the recent literature [36] (14,000 images for training and the remaining 7,041 images for testing).
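The evaluation criteria above can be sketched as follows. This is a minimal illustration rather than the evaluation code behind the tables; the function name `pose_metrics` and the toy predictions are hypothetical.

```python
import math


def pose_metrics(predictions, groundtruths):
    """Compute MAE and Acc for pitch, yaw and pitch+yaw.

    Each element of `predictions` and `groundtruths` is a (pitch, yaw) pair in degrees.
    MAE of pitch+yaw is the mean Euclidean distance between pairs; Acc of pitch+yaw
    counts a hit only when both angles match the groundtruth class.
    Acc values are returned as percentages.
    """
    n = len(groundtruths)
    pairs = list(zip(predictions, groundtruths))
    mae_pitch = sum(abs(p[0] - g[0]) for p, g in pairs) / n
    mae_yaw = sum(abs(p[1] - g[1]) for p, g in pairs) / n
    mae_joint = sum(math.hypot(p[0] - g[0], p[1] - g[1]) for p, g in pairs) / n
    acc_pitch = 100.0 * sum(p[0] == g[0] for p, g in pairs) / n
    acc_yaw = 100.0 * sum(p[1] == g[1] for p, g in pairs) / n
    acc_joint = 100.0 * sum(p == g for p, g in pairs) / n
    return {
        "mae": (mae_pitch, mae_yaw, mae_joint),
        "acc": (acc_pitch, acc_yaw, acc_joint),
    }


# Toy example with four test images (values are made up).
preds = [(0, 15), (15, 30), (-15, 0), (30, -30)]
gts = [(0, 15), (0, 30), (-15, 15), (15, -45)]
metrics = pose_metrics(preds, gts)
```

Note that the joint accuracy can never exceed the pitch or yaw accuracy, since a joint hit requires both angles to be correct.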
Table IV: Head pose estimation results on AFLW (MAE: lower is better; Acc: higher is better; "-" means not reported).
Description  MAE Pitch  MAE Yaw  MAE Pitch+Yaw  Acc Pitch  Acc Yaw  Acc Pitch+Yaw
AVM [36]  -  16.75  -  -  60.75  -
BFGSLDL (KL)  7.21  8.72  12.69  90.62  86.81  79.80
CConvNet (softmax)  7.87  9.34  13.65  87.75  83.79  75.04
RConvNet ()  6.57  8.44  11.88  92.84  84.76  79.56
RConvNet ()  6.01  7.07  10.34  94.60  89.62  85.45
RConvNet (ins)  5.96  7.13  10.35  94.94  90.00  86.21
ConvNet+LS (KL)  7.69  9.10  13.33  88.34  85.00  76.47
ConvNet+LD (div)  6.55  7.02  10.77  92.80  91.88  86.14
DLDL (KL)  5.75  6.60  9.78  95.41  92.89  89.27

Results. Tables II, III and IV show results on Pointing’04, BJUT3D and AFLW, respectively. Pointing’04 is small scale with only 2,790 images. We observe that BFGSLDL (with handcrafted features) has much lower MAE and much higher accuracy than the deep learning methods CConvNet, RConvNet and ConvNet+LS. One reasonable conjecture is that CConvNet, RConvNet and ConvNet+LS are not well learned with only a small number of training images. DLDL, however, successfully learns the head pose. For example, its accuracy for pitch+yaw is 73.15, whereas that of CConvNet is only 42.97. That is, DLDL is able to perform deep learning with few training images, while CConvNet, RConvNet and ConvNet+LS fail at this task.
On BJUT3D and AFLW, which have enough training data, we observe that many deep learning methods outperform BFGSLDL. DLDL achieves the best overall performance, with the lowest MAE and the highest accuracy in nearly all settings. Another observation is also worth mentioning. Although RConvNet is better than CConvNet when labels are dense, as in age estimation and head pose estimation on AFLW, it is clearly worse than CConvNet on BJUT3D and Pointing’04, which have sparse head pose labels. In other words, the performance of CConvNet and RConvNet is not very robust, while the proposed method consistently achieves excellent performance.
Fig. (c) shows the pitch+yaw CS curves on the AFLW dataset. There is an obvious gap between DLDL and the baseline methods at every error level. Fig. (g) shows the predicted label distributions for different head poses on the AFLW testing set using the DLDL model. Our approach can estimate head pose with low errors but may fail under some extreme conditions. It is noteworthy that DLDL may produce more incorrect estimations when both yaw and pitch are large. The reason might be that there are far fewer training examples for large angles than for other angles.
IV-C Multilabel classification
Datasets. We evaluate our approach for multilabel classification on the PASCAL VOC dataset [6]: PASCAL VOC2007 and VOC2012. There are 9,963 and 22,531 images in them, respectively. Each image is annotated with one or several labels, corresponding to 20 object categories. These images are divided into three subsets: the TRAIN, VAL and TEST sets. We train on the TRAINVAL set and evaluate on the TEST set. The evaluation metrics are average precision (AP) and mean average precision (mAP), complying with the PASCAL challenge protocols.
We denote our methods as ImagesFinetuningDLDL (IFDLDL) and ProposalsFinetuningDLDL (PFDLDL) when ConvNets are finetuned by images and proposals of images, respectively. Details of these two variants are explained later in this section. We compare the proposed approaches with the following methods:

VGG+SVM [16]. This method densely extracted 4,096-dimensional ConvNet features at the penultimate layer of VGGNets pretrained on ImageNet. These features, computed at different scales (set by the smallest image side), were aggregated by average pooling. Then, these averaged features from two networks (“NetD” containing 16 layers and “NetE” containing 19 layers) were further fused by stacking. Finally, [16] normalized the resulting image features and used them to train a linear SVM classifier for multilabel classification.

HCP [37]. HCP proposed to solve the multilabel object recognition task by extracting object proposals from the images. The method used image labels and a square loss to fine-tune a pretrained ConvNet. Then, BING [38] or EdgeBoxes [39] was used to extract object proposals, which were used to fine-tune the ConvNet again. Finally, scores of these proposals were max-pooled to obtain the prediction.

Fev+Lv [40]. This approach transformed the multilabel object recognition problem into a multiclass multiinstance learning problem. Two views (label view and feature view) were extracted for each proposal of images. Then, these two views were encoded by a Fisher vector for each image.

IFVGG and IFVGGKL. We fine-tune the VGGNets with the square loss (IFVGG) and the multilabel cross-entropy loss [41] (IFVGGKL), and use them as baselines for our IFDLDL. They are trained using the same settings.
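Table V: Comparison of single-model results (mAP in %) on VOC2007 ("-" means not reported).
Methods  Description  NetD Max  NetD Avg  NetE Max  NetE Avg
-  Fev+Lv20VD* [40]  90.6  -  -  -
-  HCPVGG [42]  90.9  -  -  -
Baselines  VGG+SVM [16]  89.3  -  89.3  -
Baselines  IFVGG  89.8  89.5  89.7  89.8
Baselines  IFVGGKL  90.0  90.3  90.3  90.2
Ours  IFDLDL  90.1  90.5  90.6  90.7
Ours  PFDLDL  92.3  92.1  92.5  92.2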
Table VI: Per-class AP and mAP (in %) on VOC2007.
Methods  Description  aero  bike  bird  boat  bottle  bus  car  cat  chair  cow  table  dog  horse  mbike  person  plant  sheep  sofa  train  tv  mAP
-  AGS* [43]  82.2  83.0  58.4  76.1  56.4  77.5  88.8  69.1  62.2  61.8  64.2  51.3  85.4  80.2  91.1  48.1  61.7  67.7  86.3  70.9  71.1
-  AMM* [44]  84.5  81.5  65.0  71.4  52.2  76.2  87.2  68.5  63.8  55.8  65.8  55.6  84.8  77.0  91.1  55.2  60.0  69.7  83.6  77.0  71.3
-  HCP2000C [37]  96.0  92.1  93.7  93.4  58.7  84.0  93.4  92.0  62.8  89.1  76.3  91.4  95.0  87.8  93.1  69.9  90.3  68.0  96.8  80.6  85.2
-  Fev+Lv20VD* [40]  97.9  97.0  96.6  94.6  73.6  93.9  96.5  95.5  73.7  90.3  82.8  95.4  97.7  95.9  98.6  77.6  88.7  78.0  98.3  89.0  90.6
-  HCPVGG [42]  98.6  97.1  98.0  95.6  75.3  94.7  95.8  97.3  73.1  90.2  80.0  97.3  96.1  94.9  96.3  78.3  94.7  76.2  97.9  91.5  90.9
Baselines  VGG+SVM [16]  98.9  95.0  96.8  95.4  69.7  90.4  93.5  96.0  74.2  86.6  87.8  96.0  96.3  93.1  97.2  70.0  92.1  80.3  98.1  87.0  89.7
Baselines  IFVGG  98.9  95.7  97.3  95.5  65.0  92.8  93.7  97.1  74.2  90.8  87.0  97.1  97.1  93.8  97.0  70.8  94.3  77.8  98.0  86.4  90.0
Baselines  IFVGGKL  99.1  95.5  97.4  94.9  68.1  92.7  94.3  97.0  75.7  90.3  89.0  97.0  97.6  94.6  97.2  76.3  93.8  80.1  98.2  87.9  90.8
Ours  IFDLDL  99.1  95.8  97.4  95.3  69.2  93.3  94.5  96.6  76.1  90.4  89.0  97.1  97.7  94.5  97.7  76.1  93.6  81.9  98.2  89.1  91.1
Ours  PFDLDL  99.3  97.6  98.3  97.0  79.0  95.7  97.0  97.9  81.8  93.3  88.2  98.1  96.9  96.5  98.4  84.8  94.9  82.7  98.5  92.8  93.4
Table VII: Per-class AP and mAP (in %) on VOC2012.
Methods  Description  aero  bike  bird  boat  bottle  bus  car  cat  chair  cow  table  dog  horse  mbike  person  plant  sheep  sofa  train  tv  mAP
-  NUSPSL* [43]  97.3  84.2  80.8  85.3  60.8  89.9  86.8  89.3  75.4  77.8  75.1  83.0  87.5  90.1  95.0  57.8  79.2  73.4  94.5  80.7  82.2
-  PRE1512* [45]  94.6  82.9  88.2  84.1  60.3  89.0  84.4  90.7  72.1  86.8  69.0  92.1  93.4  88.6  96.1  64.3  86.6  62.3  91.1  79.8  82.8
-  HCP2000C [37]  97.5  84.3  93.0  89.4  62.5  90.2  84.6  94.8  69.7  90.2  74.1  93.4  93.7  88.8  93.3  59.7  90.3  61.8  94.4  78.0  84.2
-  Fev+Lv20VD* [40]  98.4  92.8  93.4  90.7  74.9  93.2  90.2  96.1  78.2  89.8  80.6  95.7  96.1  95.3  97.5  73.1  91.2  75.4  97.0  88.2  89.4
-  HCPVGG [42]  99.1  92.8  97.4  94.4  79.9  93.6  89.8  98.2  78.2  94.9  79.8  97.8  97.0  93.8  96.4  74.3  94.7  71.9  96.7  88.6  90.5
Baselines  VGG+SVM [16]  99.0  89.1  96.0  94.1  74.1  92.2  85.3  97.9  79.9  92.0  83.7  97.5  96.5  94.7  97.1  63.7  93.6  75.2  97.4  87.8  89.3
Baselines  IFVGG  98.9  88.4  96.7  93.4  70.7  92.3  85.8  97.7  77.3  94.2  81.2  97.4  96.8  93.7  96.7  62.2  94.1  70.7  96.9  85.8  88.6
Baselines  IFVGGKL  99.0  89.9  96.6  93.7  74.0  93.2  87.3  97.5  78.5  94.7  83.1  97.1  96.9  94.0  96.6  66.9  94.5  75.9  97.4  87.7  89.7
Ours  IFDLDL  99.0  89.7  96.6  94.1  74.8  93.1  87.8  97.6  79.3  94.3  83.4  97.2  96.9  94.0  97.3  67.8  94.2  76.5  97.4  87.8  89.9
Ours  PFDLDL  99.5  94.1  97.9  95.9  81.0  94.8  93.1  98.2  82.4  96.1  84.0  98.0  97.8  95.7  97.7  78.9  95.5  78.0  97.8  92.2  92.4
Implementation details. According to the groundtruth labels, we set different probabilities for all possible labels on the PASCAL VOC dataset. Finally, similar to label smoothing, a uniform distribution is added to the resulting label distribution.
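One plausible reading of this construction is sketched below: each class receives a probability that depends on its annotation type, the vector is normalized, and a uniform distribution is then mixed in as in label smoothing. The annotation names, the probability values in `probs`, the mixing weight `epsilon`, and the exact mixing formula are all hypothetical placeholders, not the values or the procedure used in the experiments.

```python
def multilabel_distribution(annotations, probs, epsilon):
    """Build a label distribution from per-class annotation types.

    annotations: one annotation type per class (e.g. "positive", "negative").
    probs: maps an annotation type to an unnormalized probability (assumed values).
    epsilon: weight of the uniform distribution mixed in (assumed value).
    """
    raw = [probs[a] for a in annotations]
    total = sum(raw)  # assumes at least one class has nonzero probability
    base = [r / total for r in raw]
    uniform = 1.0 / len(raw)
    # Convex combination keeps the result a valid distribution (sums to 1).
    return [(1.0 - epsilon) * b + epsilon * uniform for b in base]


# Toy example with four classes; all numbers are illustrative only.
example_probs = {"positive": 1.0, "difficult": 0.5, "negative": 0.0}
example = multilabel_distribution(
    ["positive", "negative", "difficult", "negative"], example_probs, epsilon=0.1
)
```

With this choice, classes annotated as negative still receive a small nonzero probability, so the logarithm in a KL-type loss stays well defined for every class.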
IFDLDL. Following [16], each training image is individually rescaled to a size randomly sampled in the range [256, 512]. We randomly crop patches from these resized images. We also adjust the pooling kernel in the pool5 layer. Max-pooling and avg-pooling are used at pool5 to train two ConvNets. We obtain four ConvNet models through fine-tuning “NetD” and “NetE”. At the prediction stage, the smaller side of each image is scaled to a fixed length, and this is repeated for several scales. Each scaled image is fed to the fine-tuned ConvNets to obtain the 20-dimensional probability outputs. These probability outputs from different scales and different models are averaged to form the final prediction.
PFDLDL. Following [42], we further fine-tune IFDLDL models with proposals of images to boost performance. For each training image, we employ EdgeBoxes [39] to produce a set of proposal bounding boxes, which are grouped into clusters by the normalized cut algorithm [46]. For each cluster, the top proposals with the highest predictive scores generated by EdgeBoxes are resized into square shapes. As a result, we obtain a set of proposals for each image. Finally, these resized proposals are fed into a fine-tuned IFDLDL model to obtain prediction scores, and these scores are fused by max-pooling to form the prediction distribution of the image. This process can be learned end-to-end. Different settings are used at the training and the prediction stages. Similar to IFDLDL, we also average the prediction scores of different models to generate the final prediction.
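The two fusion steps described here, max-pooling over the proposals of one image and averaging over models, can be sketched as follows. The function names and the three-class toy scores are hypothetical, and the sketch ignores any normalization that may be applied after pooling.

```python
def max_pool_proposals(proposal_scores):
    """Per-class maximum over the score vectors of all proposals of one image."""
    return [max(class_scores) for class_scores in zip(*proposal_scores)]


def average_models(per_model_scores):
    """Per-class mean over the image-level score vectors of several models."""
    n_models = len(per_model_scores)
    return [sum(class_scores) / n_models for class_scores in zip(*per_model_scores)]


# Toy example: two models, each scoring a few proposals over three classes.
model_a = [[0.1, 0.7, 0.2], [0.6, 0.2, 0.2], [0.3, 0.3, 0.4]]
model_b = [[0.2, 0.5, 0.3], [0.8, 0.1, 0.1]]

image_scores = [max_pool_proposals(model_a), max_pool_proposals(model_b)]
final_scores = average_models(image_scores)
```

Max-pooling lets a single confident proposal decide the image-level score for a class, which suits images where an object occupies only a small region.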
Results. In Table V, we compare single-model results (average AP of all classes) on VOC2007. Our PFDLDL outperforms all the other methods. Compared with Fev+Lv [40], PFDLDL achieves a 1.7% improvement even without using the bounding box annotations. Compared with HCPVGG [42], our PFDLDL achieves 92.3% mAP, which is significantly higher than their 90.9%. This further indicates that it is very important to learn a label distribution.
Tables VI and VII report details of all experimental results on VOC2007 and VOC2012, respectively. It can be seen that IFDLDL outperforms IFVGG by 1.1% for VOC2007 and 1.3% for VOC2012, which indicates that the KL loss function is more suitable than the square loss for measuring the similarity of two label distributions. Furthermore, IFDLDL improves on IFVGGKL by about 0.2–0.3 points in mAP, which suggests that learning a label distribution is beneficial. More importantly, PFDLDL achieves 93.4% mAP for VOC2007 and 92.4% for VOC2012 when we average the output scores of four PFDLDL models.
Our framework shows good performance especially for scene categories such as “chair”, “table” and “sofa”. Although PFDLDL significantly outperforms IFDLDL in mAP, PFDLDL has a higher computational cost than IFDLDL in both the training and testing stages. Since IFDLDL does not need region proposals or bounding box information, it may be effectively and efficiently implemented for practical multilabel applications such as multilabel image retrieval [47]. It is also possible that by adopting new techniques (such as the region proposal method using gated units in [48], which has higher accuracy than ours on VOC tasks), the accuracy of our DLDL methods can be further improved.





IV-D Semantic segmentation
Datasets. We employ the PASCAL VOC2011 segmentation dataset and the Semantic Boundaries Dataset (SBD) for training the proposed DLDL. There are 2,224 images (1,112 for training and 1,112 for testing) with pixel labels for 20 semantic categories in VOC2011. SBD contains 11,355 annotated images (8,984 for training and 2,371 for testing) from Hariharan et al. [49]. Following FCN [3], we train DLDL using the union set (8,825 images) of SBD and VOC2011 training images. We evaluate the proposed approach on VOC2011 (1,112) and VOC2012 (1,456) test images.
Evaluation criteria. The performance is measured in terms of mean IU (intersection over union), which is the most widely used metric in semantic segmentation.
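For reference, mean IU can be computed from predicted and groundtruth pixel labels as in the sketch below. The function name `mean_iu` is hypothetical, and averaging only over classes that appear in either the prediction or the groundtruth is an assumed convention; benchmark toolkits may treat absent classes and ignored pixels differently.

```python
def mean_iu(pred, gt, num_classes):
    """Mean intersection over union across classes.

    pred, gt: flat sequences of integer class ids for the same pixels.
    Classes absent from both pred and gt are skipped (assumed convention).
    """
    intersection = [0] * num_classes
    pred_count = [0] * num_classes
    gt_count = [0] * num_classes
    for p, g in zip(pred, gt):
        pred_count[p] += 1
        gt_count[g] += 1
        if p == g:
            intersection[p] += 1
    ious = []
    for c in range(num_classes):
        union = pred_count[c] + gt_count[c] - intersection[c]
        if union > 0:
            ious.append(intersection[c] / union)
    return sum(ious) / len(ious)


# Toy example with six pixels and three classes.
toy_gt = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]
toy_score = mean_iu(toy_pred, toy_gt, num_classes=3)
```

In the toy example the per-class IUs are 1/3, 2/3 and 1/2, so the mean IU is 0.5.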
We keep the same settings as FCN, including training images and model structure. The main change is that we employ the KL divergence as the loss function based on label distributions (Eq. 15). Note that although we transform the groundtruth into label distributions in the training process, our evaluation relies only on the groundtruth labels.
Recently, Conditional Random Fields (CRFs) have been broadly used in many state-of-the-art semantic segmentation systems. We optionally employ a fully connected CRF [50] to refine the predicted category score maps using the default parameters of [51].
Results. Table VIII gives the performance of DLDL8s and DLDL8sCRF on the test images of VOC2011 and VOC2012 and compares it to the well-known FCN8s. DLDL8s improves the mean IU of FCN8s from 62.7% to 64.9% on VOC2011. On VOC2012, DLDL8s leads to an improvement of 2.3 points in mean IU. DLDL achieves better results than FCN, which suggests that exploiting label ambiguity helps improve segmentation performance. In addition, the CRF further improves the performance of DLDL8s, offering an absolute increase in mean IU of 2.7 points on VOC2011 and 2.6 points on VOC2012.
Table VIII: Mean IU (in %) on the VOC2011 and VOC2012 test sets.
Methods  mean IU (VOC2011 test)  mean IU (VOC2012 test)
FCN8s [3]  62.7  62.2
DLDL8s  64.9  64.5
DLDL8s+CRF  67.6  67.1
Fig. 7 shows four semantic segmentation examples from the VOC2011 validation images using FCN8s, DLDL8s and DLDL8sCRF. We can see that DLDL8s successfully segments some small objects (e.g., car and bicycle) and particularly improves the segmentation of object boundaries (e.g., the horse’s leg and the plant’s leaves), whereas FCN8s does not. DLDL8s may still fail; e.g., it labels a flowerpot as a potted plant in the fourth row of Fig. 7. Furthermore, compared to DLDL8s, DLDL8sCRF is able to refine coarse pixel-level label predictions to produce sharp boundaries and fine-grained segmentations (e.g., the plant’s leaves).





V Discussions
In this section, we try to understand the generalization performance of DLDL through feature visualization, and to analyze why DLDL can achieve high accuracy with limited training data. In addition, a study of the hyperparameter is also provided.
Feature visualization. We visualize the model features in a low-dimensional space. In a deep ConvNet, early layers learn low-level features (e.g., edges and corners) and later layers learn high-level features (e.g., shapes and objects) [19]. Hence, we extract the penultimate layer features (4,096-dimensional) on the Morph, ChaLearn, Pointing’04 and AFLW validation sets. To obtain 2-dimensional embeddings of the extracted high-dimensional features, we employ a popular dimension reduction algorithm, t-SNE [52]. The low-dimensional embeddings of validation images from the above four datasets are shown in Fig. 6. The first row shows the 2-dimensional embeddings of handcrafted features (BIF for Morph and ChaLearn, HOG for Pointing’04 and AFLW) and the second row shows those of the DLDL features. The points are colored by their semantic category. It can be observed that clear semantic clusters (old or young for age datasets; left or right, up or down for head pose datasets) appear in the deep features but not in the handcrafted features.
Reduce overfitting. DLDL can effectively reduce overfitting when the training set is small. This effect can be explained by label ambiguity. Consider an input sample $X$ with a single label $l$. In a traditional deep ConvNet, $y_l = 1$ and $y_k = 0$ for all $k \neq l$. In DLDL, the label distribution $y$ contains many nonzero elements. The diversity of labels helps reduce overfitting. Moreover, the objective function (Eq. 3) of DLDL can be rewritten as
$$-y_l \ln \hat{y}_l - \sum_{k \neq l} y_k \ln \hat{y}_k. \tag{20}$$
In Eq. 20, the first term is the traditional ConvNet loss function. The second term maximizes the log-likelihood of the ambiguous labels. Unlike existing data augmentation techniques such as random cropping of the images, DLDL augments data on the label side.
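The two-term reading of the objective can be checked numerically. The sketch below assumes the objective takes the cross-entropy form, i.e. minus the sum over classes of target probability times log predicted probability (the part of the KL divergence that depends on the prediction); the names `y`, `y_hat`, `full_loss` and `split_loss` are chosen here for illustration.

```python
import math


def full_loss(y, y_hat):
    """Cross-entropy part of KL(y || y_hat): -sum_k y_k * ln(y_hat_k)."""
    return -sum(yk * math.log(pk) for yk, pk in zip(y, y_hat) if yk > 0)


def split_loss(y, y_hat, l):
    """Split the loss into the groundtruth-label term and the ambiguous-label term."""
    first = -y[l] * math.log(y_hat[l])
    second = -sum(
        y[k] * math.log(y_hat[k]) for k in range(len(y)) if k != l and y[k] > 0
    )
    return first, second


# Toy label distribution peaked at class 2, and a toy prediction.
y = [0.1, 0.2, 0.4, 0.2, 0.1]
y_hat = [0.05, 0.25, 0.5, 0.15, 0.05]
first_term, second_term = split_loss(y, y_hat, l=2)

# With a one-hot target, the second term vanishes and only the usual loss remains.
one_hot = [0.0, 0.0, 1.0, 0.0, 0.0]
one_hot_first, one_hot_second = split_loss(one_hot, y_hat, l=2)
```

The second term is nonzero only when neighboring labels carry probability mass, which is the sense in which the label distribution adds supervision on the label side.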
In Fig. 8, MAE is shown as a function of the number of epochs on two age datasets (ChaLearn and Morph) and two head pose datasets (BJUT3D and AFLW). On ChaLearn and AFLW, CConvNet (softmax) achieves the lowest training MAE, but produces the highest validation MAE. In particular, the validation MAE increases after the 8th epoch on ChaLearn. A similar phenomenon is observed on AFLW. This shows that overfitting happens in CConvNet when the number of training images is small. Although there are 15,561 training images in AFLW, each category contains only about 4 training images on average (15,561 / 3,721 ≈ 4.2), since there are 3,721 categories.
Accelerate convergence. We further analyze the convergence performance of DLDL, CConvNet and RConvNet. As shown in Fig. 8, the training MAE of CConvNet and RConvNet decreases very slowly at the beginning of training in many cases. In contrast, the MAE of DLDL decreases quickly.
Robust performance. One notable observation is that CConvNet and RConvNet are unstable. Fig. 8(c) shows the MAE for pitch+yaw, a complicated estimation of the joint distribution. This is a very sparse label set because of the large interval between adjacent classes (pitch or yaw). RConvNet has difficulty in estimating this output, yielding errors that are roughly 20 times higher than those of DLDL and CConvNet. On the other hand, CConvNet easily falls into overfitting when there are not enough training data (e.g., Fig. 8(a) and Fig. 8(d)). The proposed DLDL is more amenable to small datasets or sparse labels than CConvNet and RConvNet.
Analyze the hyperparameter. DLDL’s performance may be affected by the label distribution. Here, we take age estimation (Morph) and head pose estimation (Pointing’04) as examples. $\sigma$ is a common hyperparameter in these tasks if it is not provided in the groundtruth. We set $\sigma$ empirically for Morph and for Pointing’04 in our experiments. In order to study the impact of $\sigma$, we test DLDL with different $\sigma$ values, changing from 0 to 3 in steps of 0.5. Fig. 9 shows the MAE performance on Morph and Pointing’04 with different $\sigma$. We can see that a proper $\sigma$ is important for low MAE. Generally speaking, a value of $\sigma$ that is close to the interval between neighboring labels is a good choice. Because all curves are V-shaped, it is also very convenient to find an optimal $\sigma$ value using cross-validation.
VI Conclusion
We observe that current deep ConvNets cannot successfully learn good models when there are not enough training data and/or the labels are ambiguous. We propose DLDL, a deep label distribution learning framework, to solve this issue by exploiting label ambiguity. In DLDL, each image is labeled by a label distribution, which can utilize label ambiguity in both feature learning and classifier learning. DLDL consistently improves the network training process in our experiments by preventing overfitting when the training set is small. We empirically showed that DLDL achieves more robust and competitive performance than traditional classification or regression deep models on several popular visual recognition tasks.
However, constructing a reasonable label distribution is still challenging due to the diversity of label space for different recognition tasks. It is an interesting direction to extend DLDL to more recognition problems by constructing different label distributions.
References
 [1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.

 [2] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.
 [3] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
 [4] X. Geng, C. Yin, and Z.H. Zhou, “Facial age estimation by learning from label distributions,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 10, pp. 2401–2412, 2013.
 [5] S. G. Kong and R. O. Mbouna, “Head pose estimation from a 2D face image using 3D face morphing with depth parameters,” IEEE Transactions on Image Processing, vol. 24, no. 6, pp. 1801–1808, 2015.
 [6] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The PASCAL visual object classes (VOC) challenge,” International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.
 [7] X. Geng and Y. Xia, “Head pose estimation based on multivariate label distribution,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1837–1842.
 [8] C. Xing, X. Geng, and H. Xue, “Logistic boosting regression for label distribution learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4489–4497.
 [9] Z. He, X. Li, Z. Zhang, F. Wu, X. Geng, Y. Zhang, M.H. Yang, and Y. Zhuang, “Datadependent label distribution learning for age estimation,” IEEE Transactions on Image Processing, 2017, to be published, doi: 10.1109/TIP.2017.2655445.
 [10] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
 [11] V. Belagiannis, C. Rupprecht, G. Carneiro, and N. Navab, “Robust optimization for deep regression,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2830–2838.
 [12] G. Fanelli, J. Gall, and L. Van Gool, “Real time head pose estimation with random regression forests,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 617–624.
 [13] J. Lu, V. E. Liong, and J. Zhou, “Costsensitive local binary feature learning for facial age estimation,” IEEE Transactions on Image Processing, vol. 24, no. 12, pp. 5356–5368, 2015.
 [14] B. Ahn, J. Park, and I. S. Kweon, “Realtime head orientation from a monocular camera using deep neural network,” in Asian Conference on Computer Vision, 2015, pp. 82–96.
 [15] Y. Sun, X. Wang, and X. Tang, “Deep convolutional network cascade for facial point detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3476–3483.
 [16] K. Simonyan and A. Zisserman, “Very deep convolutional networks for largescale image recognition,” in Proceedings of International Conference on Learning Representations, 2015, pp. 1–14.
 [17] O. M. Parkhi, A. Vedaldi, and A. Zisserman, “Deep face recognition,” in Proceedings of the British Machine Vision Conference, 2015, p. 6.
 [18] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
 [19] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in European Conference on Computer Vision, 2014, pp. 818–833.
 [20] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing humanlevel performance on imagenet classification,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.
 [21] A. Vedaldi and K. Lenc, “MatConvNet: Convolutional neural networks for MATLAB,” in Proceedings of the 23rd ACM International Conference on Multimedia, 2015, pp. 689–692.
 [22] K. Ricanek Jr and T. Tesafaye, “Morph: A longitudinal image database of normal adult ageprogression,” in International Conference on Automatic Face and Gesture Recognition, 2006, pp. 341–345.
 [23] S. Escalera, J. Fabian, P. Pardo, X. Baró, J. Gonzalez, H. J. Escalante, D. Misevic, U. Steiner, and I. Guyon, “Chalearn looking at people 2015: Apparent age and cultural event recognition datasets and results,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2015, pp. 1–9.
 [24] T. Minka, “Divergence measures and message passing,” Microsoft Research, Tech. Rep. MSRTR2005173, 2005.
 [25] X. Geng, “Label distribution learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 7, pp. 1734–1748, 2016.
 [26] M. Mathias, R. Benenson, M. Pedersoli, and L. Van Gool, “Face detection without bells and whistles,” in European Conference on Computer Vision, 2014, pp. 720–735.
 [27] K.Y. Chang and C.S. Chen, “A learning framework for age rank estimation based on face images with scattering transform,” IEEE Transactions on Image Processing, vol. 24, no. 3, pp. 785–798, 2015.
 [28] D. Yi, Z. Lei, and S. Z. Li, “Age estimation by multiscale convolutional network,” in Asian Conference on Computer Vision, 2015, pp. 144–158.
 [29] I. Huerta, C. Fernández, C. Segura, J. Hernando, and A. Prati, “A deep analysis on age estimation,” Pattern Recognition Letters, vol. 68, pp. 239–249, 2015.
 [30] R. Rothe, R. Timofte, and L. Van Gool, “DEX: Deep EXpectation of apparent age from a single image,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2015, pp. 252–257.
 [31] R. Rothe, R. Timofte, and L. Van Gool, “Deep expectation of real and apparent age from a single image without facial landmarks,” International Journal of Computer Vision, pp. 1–14, 2016, doi: 10.1007/s11263-016-0940-3.
 [32] N. Gourier, D. Hall, and J. L. Crowley, “Estimating face orientation from robust detection of salient facial structures,” in FG Net Workshop on Visual Observation of Deictic Gestures, 2004, pp. 1–9.
 [33] B. Yin, Y. Sun, C. Wang, and Y. Ge, “BJUT3D large scale 3D face database and information processing,” Journal of Computer Research and Development, vol. 46, no. 6, pp. 1009–1018, 2009.
 [34] M. Koestinger, P. Wohlhart, P. M. Roth, and H. Bischof, “Annotated facial landmarks in the wild: A largescale, realworld database for facial landmark localization,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2011, pp. 2144–2151.
 [35] D. F. Dementhon and L. S. Davis, “Modelbased object pose in 25 lines of code,” International Journal of Computer Vision, vol. 15, no. 1–2, pp. 123–141, 1995.
 [36] K. Sundararajan and D. Woodard, “Head pose estimation in the wild using approximate view manifolds,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2015, pp. 50–58.
 [37] Y. Wei, W. Xia, J. Huang, B. Ni, J. Dong, Y. Zhao, and S. Yan, “CNN: singlelabel to multilabel,” CoRR, abs:1406.5726, 2014.

 [38] M.M. Cheng, Z. Zhang, W.Y. Lin, and P. Torr, “BING: binarized normed gradients for objectness estimation at 300fps,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3286–3293.
 [39] C. L. Zitnick and P. Dollár, “Edge boxes: Locating object proposals from edges,” in European Conference on Computer Vision, 2014, pp. 391–405.
 [40] H. Yang, J. T. Zhou, Y. Zhang, B.B. Gao, J. Wu, and J. Cai, “Exploit bounding box annotations for multilabel object recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 280–288.
 [41] Y. Gong, Y. Jia, T. Leung, A. Toshev, and S. Ioffe, “Deep convolutional ranking for multilabel image annotation,” CoRR, abs:1312.4894, 2013.
 [42] Y. Wei, W. Xia, M. Lin, J. Huang, B. Ni, J. Dong, Y. Zhao, and S. Yan, “HCP: A flexible CNN framework for multilabel image classification,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 9, pp. 1901–1907, 2015.
 [43] J. Dong, W. Xia, Q. Chen, J. Feng, Z. Huang, and S. Yan, “Subcategoryaware object classification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 827–834.
 [44] Z. Song, Q. Chen, Z. Huang, Y. Hua, and S. Yan, “Contextualizing object detection and classification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 1585–1592.
 [45] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, “Learning and transferring midlevel image representations using convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1717–1724.
 [46] J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888–905, 2000.
 [47] H. Lai, P. Yan, X. Shu, Y. Wei, and S. Yan, “InstanceAware hashing for multilabel image retrieval,” IEEE Transactions on Image Processing, vol. 25, no. 6, pp. 2469–2479, 2016.

 [48] R.W. Zhao, J. Li, Y. Chen, J.M. Liu, Y.G. Jiang, and X. Xue, “Regional gating neural networks for multilabel image classification,” in Proceedings of the British Machine Vision Conference, vol. 6, 2016.
 [49] B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik, “Semantic contours from inverse detectors,” in Proceedings of the IEEE International Conference on Computer Vision, 2011, pp. 991–998.
 [50] P. Krähenbühl and V. Koltun, “Efficient inference in fully connected CRFs with gaussian edge potentials,” in Advances in Neural Information Processing Systems, 2011, pp. 109–117.
 [51] L.C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Semantic image segmentation with deep convolutional nets and fully connected CRFs,” in Proceedings of International Conference on Learning Representations, 2015.
 [52] L. Van der Maaten and G. Hinton, “Visualizing data using tSNE,” Journal of Machine Learning Research, vol. 9, no. Nov, pp. 2579–2605, 2008.