Attributes are mid-level representations used for the recognition of activities, objects, and people. Attributes provide an abstraction between low-level features and high-level labels. They have seen the most success in face recognition and verification. In the face recognition domain, attributes include gender, race, age, hair color, facial hair, etc. These semantic features are intuitive, and they allow for much more understandable descriptions of objects, people, and activities. Reliable estimation of facial attributes is useful for many different tasks. HCI applications may require information about gender in order to properly greet a user (e.g., Mr. or Ms.) and other attributes such as expression in order to determine the mood of the user. Facial attributes can be used for identity verification in low-quality imagery, where other verification methods may fail. Suspects are often described in terms of attributes, so attributes can be used to automatically search for suspects in surveillance video. Attributes can also be used to search a database of images very quickly, and they have been used very successfully in image search and retrieval in the past few years.
Improving the accuracy of attribute classifiers is a challenging problem in itself and has been of recent interest due to the release of several large-scale attribute datasets. Deep convolutional neural networks (CNNs) have proved effective in attribute classification. However, with few exceptions, attributes have been treated as independent of each other. A simple example, a woman wearing lipstick and earrings, shows that this is not the case. If subjects are wearing lipstick and earrings, the probability that they are women is much higher than if they did not exhibit those attributes, and the reverse is also true. Treating each attribute as independent fails to use the valuable information provided by the other attributes. Attributes fit nicely into a multi-task learning framework, where multiple problems are solved jointly using shared information.
We propose a multi-task deep CNN (MCNN) with an auxiliary network (MCNN-AUX) on top in order to utilize information provided by all attributes in three ways: by sharing the lower layers of the MCNN for all attributes, by sharing the higher layers for similar attributes, and by utilizing all attribute scores from the MCNN in an auxiliary network in order to improve the recognition of individual attributes. We are able to achieve state-of-the-art performance on most attributes from two large-scale publicly available datasets: CelebA and LFWA.
The contributions of our work are as follows:
We develop a multi-task deep CNN for attribute classification.
We develop an auxiliary network for MCNN which allows for explicit use of attribute scores to improve classification of other attributes.
We demonstrate the effectiveness of our approach by evaluating on two challenging publicly available datasets - LFWA and CelebA.
We achieve state-of-the-art performance for many attributes, with some showing a clear improvement over other methods.
We significantly decrease the number of parameters - over 4 times - and the amount of training time - over 16 times - required for the attribute classifier.
Our method requires no expensive pre-training, alignment, or part extraction steps.
The remainder of the paper is organized as follows: Section 2 describes the related work on CNNs, multi-task learning, and attributes. Section 3 discusses the MCNN architecture, and Section 4 describes the auxiliary network. In Section 5 we outline the experiments we performed to test our methods, along with our results. Finally, in Section 6 we discuss the impact of our work.
2 Related Work
There are large bodies of work on CNNs, Multi-Task Learning, and Attributes. We draw from all three areas to design the proposed method, MCNN-AUX. The relevant literature is reviewed in the following sections.
2.1 CNNs
Deep CNNs have been widely used for feature extraction and have shown great improvement over hand-crafted features for many problems, including object recognition, automatic caption generation, face detection, face recognition and verification, and activity recognition. CNNs have quickly gained popularity since the introduction of open-source software tools that allow for straightforward construction, training, and testing of deep CNNs. Caffe, Torch, and TensorFlow are among the most popular packages for implementing CNNs. The first big success for deep CNNs in a large-scale problem was in the 2012 ImageNet Challenge, with a network that outperformed the then-existing methods. Since then, a wide variety of CNN architectures have been proposed for many computer vision problems.
CNNs have also dominated the field of face recognition and verification. One of the most notable works in this domain is DeepFace, which utilized a large dataset and applied both a siamese deep CNN and a classification CNN in order to maximize the distance between impostors and minimize the distance between true matches. Motivated by the success on the LFW dataset, researchers focused more on CNNs for face recognition, and the networks became deeper and more complex.
In this work, we take advantage of the discriminative power of the CNN to learn semantic attribute classifiers as a mid-level representation for subsequent use in recognition and verification systems.
2.2 Multi-Task Learning
Multi-task learning (MTL) is a way of solving several problems at the same time utilizing shared information. MTL has found success in the domains of facial landmark localization, pose estimation, action recognition, face detection, and many more.
Attributes and object classes have been learned jointly to improve overall object classification performance. One approach uses Multiple Instance Learning to detect and recognize objects in images by learning attribute-object pairs. Another uses an undirected graph to model the correlation amongst attributes in order to improve object recognition. In a third, attributes and objects share a low-dimensional representation, allowing for regularization of the object classifier. In our work, all attributes share the lower layers in the CNN, so that information common to all the attributes can be learned. Applying MTL to attribute prediction is very natural given the strong relationships among the facial attributes.
2.3 Attributes
Kumar et al. introduced the concept of attributes as image descriptors for face verification. They used a collection of 65 binary attributes to describe each face image. They later extended this work with the addition of 8 attributes and applied their method to the problem of image search in addition to face verification. Berg et al. created classifiers for each pair of people in a dataset and then used these classifiers to create features for a face verification classifier. Here, rather than manually identifying attributes, each person was described by their likeness to other people. This is a way of automatically creating a set of attributes without having to exhaustively hand-label attributes on a large dataset. Prior to this, there were decades of research on gender and age recognition from face images.
CNNs have been used for attribute classification recently, demonstrating impressive results. Pose Aligned Networks for Deep Attributes (PANDA) achieved state-of-the-art performance by combining part-based models with deep learning to train pose-normalized CNNs for attribute classification. Focusing on age and gender, deep CNNs have also been applied to the Adience dataset. Liu et al. used two deep CNNs - one for face localization and the other for attribute recognition - and achieved impressive results on the new CelebA and LFWA datasets, outperforming PANDA on many attributes. Unlike these methods, our MCNN-AUX requires no pre-training, alignment, or part extraction.
Past work has generally considered attributes to be independent, training a separate classifier for each attribute. One exception uses the correlation amongst attributes to improve image ranking and retrieval, training attribute classifiers independently and then learning pairwise correlations based on the outputs of these classifiers. Our method goes further by training a single attribute network which classifies 40 attributes, sharing information throughout the network, and by learning the relationship among all 40 attributes, not just attribute pairs. A multi-task network has also been used to learn attributes for animals and clothing, rather than faces. That work borrows attribute groupings from earlier work but imposes constraints at the feature level according to the groups. We incorporate groupings directly into the network by sharing layers amongst attributes in a single grouping.
3 Multi-Task CNN
The proposed MCNN takes an image as input and outputs 40 separate attribute scores, which are then thresholded to obtain binary outputs. We describe the details of the architecture below.
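As a minimal sketch of the final step mentioned above, the snippet below turns raw attribute scores into binary outputs. The threshold of 0 is an assumption (for a sigmoid output it corresponds to a probability of 0.5); the text does not state which threshold is used, and `binarize` is a hypothetical helper name.

```python
def binarize(scores, threshold=0.0):
    """Map each raw attribute score to 1 (present) or 0 (absent).

    The default threshold of 0.0 is an assumption, not a value from the paper.
    """
    return [1 if s > threshold else 0 for s in scores]


# Four made-up scores, for illustration only.
example_scores = [2.3, -0.7, 0.1, -4.0]
print(binarize(example_scores))
```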
Figure 1 shows the MCNN architecture. Conv1 consists of 75 7x7 convolution filters, and it is followed by a ReLU, 3x3 Max Pooling, and 5x5 Normalization. Conv2 has 200 5x5 filters, and it is also followed by a ReLU, 3x3 Max Pooling, and 5x5 Normalization. Conv1 and Conv2 are shared for all attributes. After Conv2, groupings are used to separate the layers. There are nine groups in all: Gender, Nose, Mouth, Eyes, Face, AroundHead, FacialHair, Cheeks, and Fat. There are 6 Conv3s: one each for Gender, Nose, Mouth, Eyes, and Face, and one for the remaining groups, Conv3Rest. Each Conv3 has 300 3x3 filters and is followed by a ReLU, 5x5 Max Pooling, and 5x5 Normalization. The Conv3s are followed by fully connected layers, FC1. There are 9 FC1s, one for each group. Each FC1 is fully connected to the corresponding previous layer, with Conv3Rest connected to the FC1s for AroundHead, FacialHair, Cheeks, and Fat. Every FC1 has 512 units and is followed by a ReLU and 50% dropout to avoid overfitting. Each FC1 is fully connected to a corresponding FC2, also with 512 units. The FC2s are followed by a ReLU and 50% dropout. Each FC2 is then fully connected to an output node for each attribute in that group. The attributes for each group are listed below:
Gender: Male
Nose: Big Nose, Pointy Nose
Mouth: Big Lips, Smiling, Lipstick, Mouth Slightly Open
Eyes: Arched Eyebrows, Bags Under Eyes, Bushy Eyebrows, Narrow Eyes, Eyeglasses
Face: Attractive, Blurry, Oval Face, Pale Skin, Young, Heavy Makeup
AroundHead: Black Hair, Blond Hair, Brown Hair, Gray Hair, Earrings, Necklace, Necktie, Balding, Receding Hairline, Bangs, Hat, Straight Hair, Wavy Hair
FacialHair: 5 o’clock Shadow, Mustache, No Beard, Sideburns, Goatee
Cheeks: High Cheekbones, Rosy Cheeks
Fat: Chubby, Double Chin
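The grouping described above can be written down as a small lookup structure. The sketch below is only an illustration of the branching, not the authors' Caffe definition: the Gender group is taken to contain the single attribute Male (as the discussion of Male implies), and layer labels such as `Conv3Gender` or `FC1Nose` are hypothetical names; only Conv1, Conv2, Conv3Rest, FC1, and FC2 are named in the text.

```python
# Attribute groups as listed in the text; Gender is assumed to hold only Male.
GROUPS = {
    "Gender": ["Male"],
    "Nose": ["Big Nose", "Pointy Nose"],
    "Mouth": ["Big Lips", "Smiling", "Lipstick", "Mouth Slightly Open"],
    "Eyes": ["Arched Eyebrows", "Bags Under Eyes", "Bushy Eyebrows",
             "Narrow Eyes", "Eyeglasses"],
    "Face": ["Attractive", "Blurry", "Oval Face", "Pale Skin", "Young",
             "Heavy Makeup"],
    "AroundHead": ["Black Hair", "Blond Hair", "Brown Hair", "Gray Hair",
                   "Earrings", "Necklace", "Necktie", "Balding",
                   "Receding Hairline", "Bangs", "Hat", "Straight Hair",
                   "Wavy Hair"],
    "FacialHair": ["5 o'clock Shadow", "Mustache", "No Beard", "Sideburns",
                   "Goatee"],
    "Cheeks": ["High Cheekbones", "Rosy Cheeks"],
    "Fat": ["Chubby", "Double Chin"],
}

# Groups with their own Conv3; all remaining groups share Conv3Rest.
OWN_CONV3 = {"Gender", "Nose", "Mouth", "Eyes", "Face"}


def conv3_branch(group):
    """Name of the Conv3 layer feeding a group's FC1 (labels are hypothetical)."""
    return "Conv3" + group if group in OWN_CONV3 else "Conv3Rest"


def forward_path(attribute):
    """List the layers an attribute's score passes through, input to output."""
    for group, attrs in GROUPS.items():
        if attribute in attrs:
            return ["Conv1", "Conv2", conv3_branch(group),
                    "FC1" + group, "FC2" + group, "Out:" + attribute]
    raise KeyError(attribute)


print(forward_path("Goatee"))
print(sum(len(attrs) for attrs in GROUPS.values()), "attributes in",
      len(GROUPS), "groups")
```

Counting the entries gives 40 attributes in 9 groups fed by 6 distinct Conv3 branches, which matches the numbers stated in the architecture description.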
The 9 groups were manually chosen according to attribute location. Some groupings were separated from others and some were absorbed into others through experimentation giving the above groupings. Male was kept separate from all other attributes as we found, through experimentation on the CelebA dataset, that gender was improved by sharing layers with other attributes, but it ultimately decreased performance of those attributes. We found the best compromise was to include male in the shared Conv1 and Conv2 layers and then to have separate Conv3, FC1, and FC2 layers.
We use the Caffe software for our implementation, training, and testing of MCNN and MCNN-AUX. We use a sigmoid cross-entropy loss applied to all attribute scores to facilitate training. As preprocessing steps, the training mean is subtracted from the images, and they are randomly cropped to 227x227. Random cropping helps the network to be robust to shifts in the input.
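The loss and cropping steps can be illustrated in plain Python. This is a sketch under stated assumptions rather than the Caffe layers themselves: the loss is summed over the attributes of a single image (any normalization Caffe applies over a batch is not reproduced), and `sigmoid_cross_entropy` and `random_crop_origin` are hypothetical helper names.

```python
import math
import random


def sigmoid_cross_entropy(scores, labels):
    """Sum of per-attribute binary cross-entropy on raw scores (logits).

    Uses the numerically stable form max(x, 0) - x*y + log(1 + exp(-|x|)),
    which equals -[y*log(sigmoid(x)) + (1 - y)*log(1 - sigmoid(x))].
    """
    total = 0.0
    for x, y in zip(scores, labels):
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total


def random_crop_origin(height, width, crop=227, rng=random):
    """Pick the top-left corner of a random crop-by-crop window in the image."""
    if height < crop or width < crop:
        raise ValueError("image is smaller than the crop size")
    top = rng.randint(0, height - crop)    # randint is inclusive at both ends
    left = rng.randint(0, width - crop)
    return top, left


# Confident, correct scores give a small loss; the same scores with the
# labels flipped give a large one.
print(sigmoid_cross_entropy([5.0, -5.0], [1, 0]))
print(sigmoid_cross_entropy([5.0, -5.0], [0, 1]))
```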
If we were to use an independent CNN for each attribute following the architecture of one path in the MCNN - 3 convolutional layers and 3 fully connected layers - each CNN would have over 1.6 million parameters. So, for all 40 attributes, there would be over 64 million parameters. Using MCNN, we cut this down to less than 15 million parameters, over four times fewer.
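The arithmetic behind this comparison is short enough to check directly. The figures below are the rounded bounds quoted in the text ("over 1.6 million" per independent CNN and "less than 15 million" for MCNN), so the resulting ratio is a lower bound rather than an exact value.

```python
NUM_ATTRIBUTES = 40
PARAMS_PER_INDEPENDENT_CNN = 1.6e6   # "over 1.6 million" in the text
MCNN_PARAMS_UPPER_BOUND = 15e6       # "less than 15 million" in the text

# One independent CNN per attribute.
independent_total = NUM_ATTRIBUTES * PARAMS_PER_INDEPENDENT_CNN

# Lower bound on the reduction: smallest numerator over largest denominator.
reduction = independent_total / MCNN_PARAMS_UPPER_BOUND

print(f"independent CNNs: over {independent_total / 1e6:.0f} million parameters")
print(f"reduction factor: at least {reduction:.2f}x")
```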
4 MCNN-AUX
After training the MCNN, we add a fully connected layer, AUX, after the output of the trained MCNN. Starting with the weights from the trained MCNN, we learn the weights for the AUX portion of the network, keeping the weights from the MCNN constant. The AUX layer allows for interactions amongst attributes at the score level. The MCNN-AUX network learns the relationship amongst attribute scores in order to improve overall classification accuracy for each attribute. Figure 2 shows the connection between MCNN and AUX. The AUX layer adds only 1600 parameters (a 40x40 weight matrix over the 40 attribute scores) to those of the MCNN.
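A fully connected layer over 40 scores can be sketched in a few lines. The code below assumes the layer has no bias terms, which is what a count of 1600 parameters suggests for 40 inputs and 40 outputs; the optional `biases` argument, the function name `aux_layer`, and the identity initialization are illustrative choices, not details from the paper.

```python
def aux_layer(scores, weights, biases=None):
    """Fully connected layer on attribute scores.

    out[i] = sum_j weights[i][j] * scores[j] (+ biases[i] if given).
    Row i holds the weights that produce the refined score of attribute i.
    """
    if biases is None:
        biases = [0.0] * len(weights)
    return [sum(w * s for w, s in zip(row, scores)) + b
            for row, b in zip(weights, biases)]


NUM_ATTRIBUTES = 40

# Identity weights leave every MCNN score unchanged; training would then
# adjust the off-diagonal entries to capture attribute relationships.
identity = [[1.0 if i == j else 0.0 for j in range(NUM_ATTRIBUTES)]
            for i in range(NUM_ATTRIBUTES)]

num_weights = sum(len(row) for row in identity)  # 40 * 40 = 1600
mcnn_scores = [float(i) for i in range(NUM_ATTRIBUTES)]  # placeholder scores
print(num_weights, aux_layer(mcnn_scores, identity) == mcnn_scores)
```

Because the MCNN weights are held constant while AUX is trained, only this small matrix is updated in the second stage.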
5 Experiments
5.1 Datasets
In our experiments, we used two challenging, publicly available datasets: CelebA and LFWA. Both datasets were originally constructed for identification and verification, and recently were given binary labels for 40 different attributes. Both datasets are extremely challenging, with large variations in subject pose, illumination, and image quality. The CelebA dataset consists of 200,000 images: 160,000 for training and 20,000 each for validation and testing. The LFWA dataset contains 13,143 images, with 6,263 for training and 6,880 for testing. Since the CelebA dataset is so large, we did not need to augment it in any way. Without augmentation of the LFWA dataset, the network would severely overfit to the training data due to the large number of parameters. We augmented the LFWA dataset by jittering the original images by increments of 10 pixels. After jittering, we had over 75,000 images for training. Figure 3 shows some example images from CelebA and LFWA.
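The split sizes and the scale of the augmentation can be checked with a little arithmetic. The jitter scheme below is an assumption made for illustration: the text says only that images were jittered in increments of 10 pixels, so the symmetric grid of shifts, the `max_shift` parameter, and the helper name `jitter_offsets` are hypothetical.

```python
import math

CELEBA_TRAIN, CELEBA_VAL, CELEBA_TEST = 160_000, 20_000, 20_000
LFWA_TRAIN, LFWA_TEST = 6_263, 6_880


def jitter_offsets(max_shift, step=10):
    """All (dx, dy) shifts in multiples of `step` pixels, up to max_shift.

    The symmetric grid is an assumed scheme, not the one used in the paper.
    """
    if max_shift % step != 0:
        raise ValueError("max_shift must be a multiple of step")
    shifts = range(-max_shift, max_shift + 1, step)
    return [(dx, dy) for dx in shifts for dy in shifts]


# Smallest number of versions per image that exceeds 75,000 training images.
copies_needed = math.ceil(75_000 / LFWA_TRAIN)

print(CELEBA_TRAIN + CELEBA_VAL + CELEBA_TEST, LFWA_TRAIN + LFWA_TEST)
print(copies_needed, copies_needed * LFWA_TRAIN)
print(len(jitter_offsets(10)))
```

Under these assumptions, roughly a dozen jittered versions of each LFWA training image are enough to pass the 75,000 mark.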
5.2 Independent CNNs
We train independent CNNs for all 40 attributes for both datasets in order to compare these results with those from MCNN and MCNN-AUX. We use one portion of our MCNN network for this. Each independent CNN has 3 convolutional layers and 3 fully connected layers with the parameters specified in Section 3. We train these networks for 22 epochs for both datasets and use a batch size of 100. The independent CNNs each take about an hour to train for the CelebA dataset and about 30 minutes for the LFWA dataset. For all 40 attributes, training independent CNNs takes over 40 hours for CelebA and over 20 hours for LFWA.
5.3 MCNN
To train MCNN, we use batches of size 100, and we train for 22 epochs for both datasets. Training takes about 2.5 hours for the CelebA dataset and about 1 hour for the LFWA dataset. We see a significant reduction in time from 40 hours to 2.5 hours for CelebA and 20 hours to 1 hour for LFWA using MCNN over independent CNNs.
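The speedups follow directly from the per-network training times quoted in the text. The sketch below assumes the 40 independent CNNs are trained one after another, which is what the 40-hour and 20-hour totals imply.

```python
NUM_ATTRIBUTES = 40

# Approximate training times in hours, as quoted in the text.
hours = {
    "CelebA": {"independent_each": 1.0, "mcnn": 2.5},
    "LFWA": {"independent_each": 0.5, "mcnn": 1.0},
}

speedup = {}
for name, t in hours.items():
    independent_total = NUM_ATTRIBUTES * t["independent_each"]
    speedup[name] = independent_total / t["mcnn"]
    print(f"{name}: {independent_total:.0f} h independent vs "
          f"{t['mcnn']} h MCNN, about {speedup[name]:.0f}x faster")
```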
5.4 MCNN-AUX
Taking the trained MCNN, we fix the weights for that portion of the MCNN-AUX network and only train the last layer, AUX. This takes about 20 minutes to train for CelebA and about 10 minutes for LFWA.
5.5 Results
We present results for our independent CNNs, MCNN, and MCNN-AUX. For comparison, we also provide the state-of-the-art by Liu et al., and a baseline of always choosing the most common label for each attribute.
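The majority-label baseline is simple to state in code. The sketch below computes it on a single list of binary labels; the text does not say whether the most common label is taken from the training or the test split, so that choice is left to the caller, and `majority_baseline_accuracy` is a hypothetical helper name.

```python
def majority_baseline_accuracy(labels):
    """Accuracy (%) of always predicting the most common binary label."""
    if not labels:
        raise ValueError("labels must be non-empty")
    positives = sum(labels)
    negatives = len(labels) - positives
    return 100.0 * max(positives, negatives) / len(labels)


# Made-up labels for illustration: 3 of 4 images lack the attribute.
print(majority_baseline_accuracy([0, 0, 0, 1]))
```

An attribute that is present in about half of the images has a baseline near 50%, while a rare attribute has a high baseline that a classifier must beat to be useful.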
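Table 1: Attribute classification accuracy (%) on CelebA.
| Attribute | Baseline | Liu et al. | Independent | MCNN | MCNN-AUX |
|---|---|---|---|---|---|
| 5 o’clock Shadow | 90.01 | 91 | 93.94 | 94.41 | 94.51 |
| Bags Under Eyes | 79.73 | 79 | 84.83 | 84.89 | 84.92 |
| Mouth Slightly Open | 50.49 | 92 | 93.99 | 93.74 | 93.74 |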
We can see from Table 1 that our independent CNNs outperform Liu et al. on most attributes for CelebA. The independent CNNs improve on Liu et al. for necklace, blurry, straight hair, and big nose, among others. MCNN makes even further improvements, and finally MCNN-AUX gives the highest accuracy for most attributes. We see that the largest jump in performance is from the method of Liu et al. to the independent CNNs, with smaller improvements being made with MCNN and MCNN-AUX. From this, we see that the value in MCNN and MCNN-AUX lies in the increased training speed and the reduced number of parameters, which lowers the chance of overfitting. We do not expect to see an increase in performance with MCNN-AUX for every attribute, as many attributes do not have strong relationships with the others. Determining which relationships to use can be done in the validation portion. We did not remove any relationships in our testing. Unlike Liu et al., all three of our methods outperform the baseline for every attribute in CelebA.
Figure 4 shows a heatmap of the weights for the AUX layer of MCNN-AUX on the CelebA dataset. From Figure 4 we can see that each attribute contributes the most to its final classifier score. Some interesting relationships can be seen in the heatmap. We see that bald is strongly related to receding hairline and has an inverse relationship with straight hair and wavy hair and that no beard has an inverse relationship with 5 o’clock shadow, mustache, and sideburns. The strongest relationships are summarized in Table 2. Most of the relationships listed in Table 2 make intuitive sense. Someone with heavy makeup is likely to be wearing lipstick; if someone is chubby, they likely have a double chin; and if someone has gray hair, it is unlikely that they are young.
Table 2: Strongest relationships among attributes in the AUX layer weights on CelebA.
| Attribute | Positive Influences | Negative Influences |
|---|---|---|
| Bald | Receding Hairline | Straight Hair, Wavy Hair |
| Black Hair | Straight Hair, Wavy Hair | Blond Hair, Brown Hair |
| Blond Hair | Attractive | Black Hair, Brown Hair, Bushy Eyebrows |
| Chubby | Double Chin | Pointy Nose |
| Double Chin | Chubby, Big Nose | Young |
| Eyeglasses | N/A | Bags Under Eyes |
| Male | 5 o’clock Shadow, Necktie | Earrings, Heavy Makeup, High Cheekbones, Lipstick |
| Goatee | Mustache | 5 o’clock Shadow, No Beard |
| Gray Hair | Receding Hairline | Black Hair, Brown Hair, Young |
| Hat | Black Hair, Blond Hair | Bald, Receding Hairline |
| Heavy Makeup | Attractive, Lipstick | Bags Under Eyes |
| No Beard | N/A | 5 o’clock Shadow, Goatee, Male, Mustache, Sideburns |
| Receding Hairline | Bald | Bangs, Hat |
| Sideburns | 5 o’clock Shadow, Goatee | No Beard |
| Smiling | High Cheekbones | Big Lips |
| Straight Hair | N/A | Wavy Hair |
| Wavy Hair | N/A | Straight Hair |
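Relationships like those in Table 2 can be read off a weight matrix by ranking the off-diagonal entries of each row. The sketch below shows one way to do that; the 4x4 matrix holds made-up toy values chosen only to mimic the Bald row, not weights from the trained network, and the row-per-output convention and the name `strongest_influences` are assumptions.

```python
def strongest_influences(weights, names, target, top_k=2):
    """Top positive and negative off-diagonal weights for one attribute.

    weights[i][j] is assumed to be the weight from the score of attribute j
    to the refined score of attribute i.
    """
    i = names.index(target)
    others = [(w, names[j]) for j, w in enumerate(weights[i]) if j != i]
    positive = [n for w, n in sorted(others, reverse=True)[:top_k] if w > 0]
    negative = [n for w, n in sorted(others)[:top_k] if w < 0]
    return positive, negative


# Toy values for illustration only (NOT taken from the paper's heatmap).
names = ["Bald", "Receding Hairline", "Straight Hair", "Wavy Hair"]
toy_weights = [
    [1.0, 0.4, -0.3, -0.2],
    [0.4, 1.0, 0.0, 0.0],
    [-0.3, 0.0, 1.0, -0.5],
    [-0.2, 0.0, -0.5, 1.0],
]

print(strongest_influences(toy_weights, names, "Bald"))
```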
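Table 3: Attribute classification accuracy (%) on LFWA.
| Attribute | Baseline | Liu et al. | Independent | MCNN | MCNN-AUX |
|---|---|---|---|---|---|
| 5 o’clock Shadow | 58.64 | 84 | 77.39 | 77.70 | 77.06 |
| Bags Under Eyes | 58.29 | 83 | 83.24 | 83.51 | 83.48 |
| Mouth Slightly Open | 58.70 | 82 | 82.41 | 83.47 | 83.51 |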
Table 3 shows the results for the LFWA dataset. We can see that the accuracies are lower for this dataset than for the CelebA dataset. This is likely due to overfitting, because LFWA is much smaller than CelebA. The independent CNNs outperform Liu et al. on most attributes, with improvements for blurry, rosy cheeks, pale skin, straight hair, and wavy hair. MCNN improved the classification accuracy of many attributes, but blurry and eyeglasses did not improve with MCNN. This makes sense, as both attributes are relatively unrelated to the other attributes and therefore gain nothing from shared information. We note that though we do not improve the results for some attributes, we perform no pre-training of the networks using a larger dataset, unlike Liu et al., who used a much larger dataset to initialize the weights of their networks. Pre-training on external data would likely improve the results; however, that is not the focus of this work.
Figure 5 shows a heatmap of the weights for the AUX layer on LFWA. There is much more white in this heatmap than in that of Figure 4. This makes sense, as the results for MCNN on LFWA were not as strong as on CelebA. Again, we believe that this is due to the small size of the dataset. Though jittering LFWA helps, it does not compare to having a large amount of data as in CelebA. As with CelebA, we see that each attribute contributes most to its overall classification accuracy, though not quite as strongly. We again see promising relationships, which we summarize in Table 4.
Table 4: Strongest relationships among attributes in the AUX layer weights on LFWA.
| Attribute | Positive Influences | Negative Influences |
|---|---|---|
| 5 o’clock Shadow | Goatee | No Beard |
| Chubby | Double Chin | Oval Face |
| Goatee | 5 o’clock Shadow, Mustache | No Beard |
| No Beard | N/A | 5 o’clock Shadow, Goatee, Mustache |
| Receding Hairline | Bald, Gray Hair | Hat |
6 Conclusion
In this paper, we have shown that though facial attributes have been treated as independent problems in the past, there is a lot to be gained from shared information amongst attributes. Framing the attribute prediction problem as a multi-task learning problem is very natural and allows for a large reduction in training time and in the number of parameters required for the classifier. In this work we showed that MCNN-AUX reduced the number of parameters from over 64 million to fewer than 16 million, and reduced the training time by a factor of 16. We demonstrated our independent CNN, MCNN, and MCNN-AUX classifiers on the challenging CelebA and LFWA datasets, achieving state-of-the-art performance for most attributes. The relationship amongst attributes can be exploited in many ways, and we presented three in this paper: sharing lower layers of MCNN, grouping similar attributes in higher layers of MCNN, and introducing an auxiliary layer (AUX), which explicitly learns attribute relations at the score level. Even without pre-training, we were able to outperform the method of Liu et al. for many attributes. Pre-training on external data would likely improve the results; however, that is not the focus of this work. We sought to show that a multi-task framework for attribute prediction outperforms independent classifiers, and we have done that through our experimentation. Taking advantage of relationships among attributes allowed for improved attribute prediction, which will lead to improved facial recognition. In future work we plan to explore how these relationships can be used to improve identification and to learn how attributes are related to identity.
References
-  Duan, K., Parikh, D., Crandall, D., Grauman, K.: Discovering localized attributes for fine-grained recognition. CVPR (2012)
-  Zheng, J., Jiang, Z., Chellappa, R., Phillips, P.J.: Submodular attribute selection for action recognition in video. NIPS (2014)
-  Zhang, N., Paluri, M., Ranzato, M.A., Darrell, T., Bourdev, L.: Panda: Pose aligned networks for deep attribute modeling. CVPR (2014)
-  Kumar, N., Berg, A.C., Belhumeur, P.N., Nayar, S.K.: Attribute and simile classifiers for face verification. ICCV (2009)
-  Kumar, N., Berg, A.C., Belhumeur, P.N., Nayar, S.K.: Describable visual attributes for face verification and image search. PAMI (2011)
-  Siddiquie, B., Feris, R.S., Davis, L.S.: Image ranking and retrieval based on multi-attribute queries. CVPR (2011)
-  Liu, Z., Luo, P., Wang, X., Tang, X.: Deep learning face attributes in the wild. ICCV (2015)
-  Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. NIPS (2012)
-  Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: A neural image caption generator. CVPR (2015)
-  Abdulnabi, A.H., Wang, G., Lu, J., Jia, K.: Multi-task cnn model for attribute prediction. arXiv preprint (2015)
-  Levi, G., Hassner, T.: Age and gender classification using convolutional neural networks. CVPR (2015)
-  Argyriou, A., Evgeniou, T., Pontil, M.: Multi-task feature learning. NIPS (2007)
-  Parameswaran, S., Weinberger, K.: Large margin multi-task metric learning. NIPS (2010)
-  Caruana, R.: Multitask learning. Machine Learning (1997)
-  Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR (2014)
-  Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093 (2014)
-  Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S.: Tensorflow: Large-scale machine learning on heterogeneous systems. (2015)
-  Taigman, Y., Yang, M., Ranzato, M., Wolf, L.: Deepface: Closing the gap to human-level performance in face verification. CVPR (2014)
-  Sun, Y., Chen, Y., Wang, X., Tang, X.: Deep learning face representation by joint identification-verification. NIPS (2014)
-  Sun, Y., Wang, X., Tang, X.: Deep learning face representation from predicting 10,000 classes. CVPR (2014)
-  Sun, Y., Wang, X., Tang, X.: Deeply learned face representations are sparse, selective, and robust. CoRR (2014)
-  Sun, Y., Ding, L., Wang, X., Tang, X.: Face recognition with very deep neural networks. CoRR (2015)
-  Zhang, Z., Luo, P., Loy, C., Tang, X.: Facial landmark detection by deep multi-task learning. ECCV (2014)
-  Zhou, Q., Wang, G., Jia, K., Zhao, Q.: Learning to share latent tasks for action recognition. ICCV (2013)
-  Yim, J., Jung, H., Yoo, B., Choi, C., Park, D., Kim, J.: Rotating your face using multi-task deep neural network. CVPR (2015)
-  Zhang, C., Zhang, Z.: Improving multiview face detection with multi-task deep convolutional neural networks. WACV (2014)
-  Devries, T., Biswaranjan, K., Taylor, G.W.: Multi-task learning of facial landmarks and expression. CRV (2014)
-  Wang, G., Forsyth, D.: Joint learning of visual attributes, object classes and visual saliency. CVPR (2009)
-  Wang, Y., Mori, G.: A discriminative latent model of object classes and attributes. ECCV (2010)
-  Hwang, S.J., Sha, F., Grauman, K.: Sharing features between objects and their attributes. CVPR (2011)
-  Berg, T., Belhumeur, P.N.: Tom-vs-pete classifiers and identity-preserving alignment for face verification. BMVC (2012)
-  Fu, Y., Guo, G., Huang, T.S.: Age synthesis and estimation via faces: A survey. PAMI (2010)
-  Ng, C.B., Tay, Y.H., Goi, B.M.: Vision-based human gender recognition: A survey. arXiv preprint (2012)
-  Jayaraman, D., Sha, F., Grauman, K.: Decorrelating semantic visual attributes by resisting the urge to share. CVPR (2014)
-  Zhang, X., Zhang, L., Wang, X.J., Shum, H.Y.: Finding celebrities in billions of web images. IEEE Transactions on Multimedia (2012)
-  Huang, G.B., Ramesh, M., Berg, T., Learned-Miller, E.: Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report, University of Massachusetts, Amherst (2007)