Directing DNNs Attention for Facial Attribution Classification using Gradient-weighted Class Activation Mapping

by Xi Yang, et al.

Deep neural networks (DNNs) achieve high accuracy on image classification tasks. However, DNNs trained on datasets with co-occurrence bias may rely on the wrong features when making classification decisions, which greatly limits the transferability of pre-trained DNNs. In this paper, we propose an interactive method that directs classifiers to pay attention to regions manually specified by the user, in order to mitigate the influence of co-occurrence bias. We test the method on the CelebA dataset: a pre-trained AlexNet is fine-tuned to focus on specific facial attribute regions based on Grad-CAM results.





1 Introduction

Many datasets frequently feature biases such as co-occurrence bias, which arises from a lack of negative examples [5]. Strong correlations among featured elements mean that the feature one wishes to extract is often accompanied by other features. A network trained on datasets with such obvious biases cannot reliably make decisions based only on the desired features, as demonstrated by the lipstick problem of [7] shown in Figure 1. Although an attention approach may be used to improve DNN performance [1], biased representations may still be in play.

In the Large-scale CelebFaces Attributes (CelebA) dataset [3], for example, the attributes ‘Wearing Lipstick’ and ‘Heavy Makeup’ co-occur with high probability. Most people in the images apply not only lipstick but also makeup on other facial parts, yet many images are labelled only with the attribute ‘Wearing Lipstick’. The network therefore usually recognizes ‘Wearing Lipstick’ by relying on the makeup of several parts of the face, such as the eyes, eyebrows, and mouth. As a result, a pre-trained network lacks strong transferability from one dataset to another because of this bias in its representation. To improve DNN generalization, it is important that classification rely on the correct extracted features; accuracy is not everything.
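The degree of co-occurrence can be estimated directly from the attribute annotations. A minimal sketch, using made-up 0/1 labels (CelebA itself stores attributes as ±1 flags), that computes how often ‘Heavy Makeup’ accompanies ‘Wearing Lipstick’:

```python
# Estimate P(b = 1 | a = 1) from per-image binary attribute annotations.
def cooccurrence(labels, a, b):
    """labels: list of {attribute_name: 0/1} dicts, one per image."""
    with_a = [img for img in labels if img[a] == 1]
    if not with_a:
        return 0.0
    return sum(img[b] for img in with_a) / len(with_a)

# Toy annotations: most lipstick images also carry heavy makeup.
annotations = [
    {"Wearing_Lipstick": 1, "Heavy_Makeup": 1},
    {"Wearing_Lipstick": 1, "Heavy_Makeup": 1},
    {"Wearing_Lipstick": 1, "Heavy_Makeup": 0},
    {"Wearing_Lipstick": 0, "Heavy_Makeup": 0},
]
p = cooccurrence(annotations, "Wearing_Lipstick", "Heavy_Makeup")
```

A conditional probability near 1 on the real training split is exactly the situation in which the network can score well on ‘Wearing Lipstick’ while attending to makeup elsewhere on the face.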

Figure 1: Examples from the CelebA dataset. Grad-CAM shows that a pre-trained DNN classifying the ‘lipstick’ attribute focuses not only on the mouth region but also on the eyes, eyebrows, and other regions (first row). Our fine-tuned DNN focuses on the mouth only (second row). The predictive importance of the facial regions (high to low) is colorized blue to red.

Biased representations cannot simply be eliminated from a pre-trained network, because it is difficult to disentangle the extracted features. Apart from directly modifying the training dataset to balance the bias, Li et al. [2] suggested that network attention can be guided even with biased data. However, self-guidance via a soft mask is ineffective when the region of interest (ROI) overlaps with the attention map.

In this paper, we incorporate user interaction to resolve the co-occurrence issue in the case where the attention map includes the ROI. The user directly specifies the region on which the classifier should focus, and the system then re-trains the classifier to attend to that region. We test this method on facial images and find that it effectively addresses co-occurrence bias.

2 Method

An overview of our method is shown in Figure 2. Given a pre-trained classification DNN, we visualize the activations in the model to localize the regions the network focuses on for some example images. If the network makes a classification based on biased features, the user manually specifies the correct region on a template. The pre-trained network is then fine-tuned to focus on the user-defined region, directing the network's attention accordingly.

Figure 2: Overview of how we direct the network's attention for ‘Wearing Lipstick’ images.

Visualization of the feature maps. Our method employs Gradient-weighted Class Activation Mapping (Grad-CAM) [4], which uses class-specific gradient information to localize the regions most important for classification.
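At its core, Grad-CAM reduces to a few array operations: each channel weight is the spatially averaged gradient, and the map is the ReLU of the weighted activation sum. A minimal NumPy sketch, with random stand-ins for the activations and gradients a real network would supply:

```python
import numpy as np

def grad_cam(activations, gradients):
    """activations, gradients: (K, H, W) arrays from the last conv layer.
    Returns an (H, W) importance map normalized to [0, 1]."""
    # alpha_k: global-average-pooled gradient per channel.
    weights = gradients.mean(axis=(1, 2))             # (K,)
    cam = np.tensordot(weights, activations, axes=1)  # weighted sum -> (H, W)
    cam = np.maximum(cam, 0.0)                        # ReLU: keep positive evidence
    if cam.max() > 0:
        cam /= cam.max()
    return cam

rng = np.random.default_rng(0)
acts = rng.standard_normal((256, 13, 13))   # e.g. a conv5-sized feature map
grads = rng.standard_normal((256, 13, 13))
cam = grad_cam(acts, grads)
```

In the actual pipeline the gradients come from backpropagating the score of the selected class to the last convolutional layer; the 256×13×13 shape here is only illustrative.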

Specifying the region of interest (ROI). We assume that each facial attribute corresponds to certain facial regions. To conveniently specify the location that requires attention, we use Dlib to detect landmarks and segment the facial regions of each image. The user interface is shown in Figure 3. The user first identifies the most important region of an input image, chosen from ten pre-defined rectangular facial regions. The entire specified rectangular region is then used to calculate the Grad-CAM loss. Note that the user only needs to select a rectangle on a template face illustration.
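Mapping the user's rectangle onto the Grad-CAM grid is a simple coordinate rescaling. A sketch assuming a 224×224 input and a 13×13 Grad-CAM layer (these sizes are illustrative assumptions, not fixed by the paper):

```python
import numpy as np

def roi_to_grid_mask(box, img_size=224, grid_size=13):
    """box = (x0, y0, x1, y1) in image pixels; returns a binary (grid, grid) mask."""
    scale = grid_size / img_size
    x0, y0, x1, y1 = box
    # Round outward so the mask always covers the full rectangle.
    gx0, gy0 = int(np.floor(x0 * scale)), int(np.floor(y0 * scale))
    gx1, gy1 = int(np.ceil(x1 * scale)), int(np.ceil(y1 * scale))
    mask = np.zeros((grid_size, grid_size), dtype=bool)
    mask[gy0:gy1, gx0:gx1] = True
    return mask

# A mouth-like rectangle in the lower-middle of the face.
mask = roi_to_grid_mask((80, 150, 144, 190))
```

Rounding outward keeps the whole user-selected rectangle inside the ground-truth region, at the cost of slightly enlarging it on the coarse grid.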

Loss function. The pre-trained network is fine-tuned using a loss function that is a weighted combination of an attribute loss L_attr and a Grad-CAM loss L_gc:

L = α L_attr + β L_gc,  (1)

where the positive parameters α and β are balancing weights for L_attr and L_gc. The attribute loss is the combined binary cross entropy (BCE) between the predicted scores and the labels. The Grad-CAM loss is computed by comparing the Grad-CAM result with the user-specified region: neurons whose values on the Grad-CAM visualization exceed a threshold constitute the Grad-CAM set G, and the landmarks of the specified region are mapped onto a grid of the same size as the Grad-CAM layer, giving the ground-truth region R. We use the Intersection over Union (IoU) loss concept [6] to evaluate their overlap; we calculate the ratio of the overlap areas yielded by the prediction G and the ground truth R (Figure 4):

L_gc = 1 − |G ∩ R| / |G ∪ R|.  (2)

Figure 3: The user interface for the lipstick problem. After loading an input image and selecting a single class, the original image and the visualization are shown side by side. The user can select the desired attention region(s) to fine-tune the pre-trained network.
Figure 4: Grad-CAM loss in an image exemplifying the lipstick problem. Red boxes show the Grad-CAM regions and blue boxes indicate the specified regions.

3 Experiments

We tested our method on the CelebA dataset [3], a large-scale facial attributes dataset containing more than 200,000 celebrity images, each with 40 attribute annotations. We divided the images into training, validation, and testing sets. To enhance diversity, two additional image sets were included: one showing the mouth region only, and one with the eyes concealed by sunglasses (Figure 5). Let A+ and B+ denote the image sets positively annotated with two attributes a and b, and let A− and B− denote the corresponding negative sets. The image sets A+ ∩ B+ and A+ ∩ B− of the testing set are then extracted to evaluate network fine-tuning. We used AlexNet, which contains five convolutional layers and three fully connected layers, for the facial attribute classification task. The last convolutional layer delivers the Grad-CAM results.
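Building the two evaluation subsets amounts to filtering the test annotations on the attribute pair: images where both attributes are positive, and images where only the first is. A toy sketch (attribute names and 0/1 labels are illustrative):

```python
def split_by_pair(annotations, a, b):
    """Return (both positive, a positive but b negative) image-id lists."""
    both = [i for i, lab in annotations.items() if lab[a] == 1 and lab[b] == 1]
    a_only = [i for i, lab in annotations.items() if lab[a] == 1 and lab[b] == 0]
    return both, a_only

test_annotations = {
    "img1": {"Wearing_Lipstick": 1, "Heavy_Makeup": 1},
    "img2": {"Wearing_Lipstick": 1, "Heavy_Makeup": 0},
    "img3": {"Wearing_Lipstick": 0, "Heavy_Makeup": 0},
}
both, a_only = split_by_pair(test_annotations, "Wearing_Lipstick", "Heavy_Makeup")
```

The second subset is the interesting one: a classifier relying on co-occurring makeup cues rather than the lipstick itself will fare poorly there.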

Figure 5: Examples of masked images and images with sunglasses.

We used three attribute pairs in our experiments: ‘Heavy Makeup’ & ‘Wearing Lipstick’, ‘High Cheekbones’ & ‘Smiling’, and ‘Chubby’ & ‘Double Chin’. For the attribute ‘Wearing Lipstick’, we first measured the classification accuracy of the pre-trained network on the test set. We then specified that the network should focus only on the mouth and fine-tuned the pre-trained AlexNet accordingly with the IoU loss; accuracy improved (Table 1). The classification accuracy on the sunglasses set also improved. For comparison, we additionally used two straightforward dataset-modification methods: a network trained using edited images showing the mouth region only, and a network trained on these edited images mixed with the original training set so that mouth regions are weighted more highly.

The Grad-CAM results for 10 selected images of the test sets are shown for all four networks in Figure 6. It is clear that the pre-trained network classified ‘Wearing Lipstick’ based not only on the mouth region but also by reference to the eyebrows and eyes. With our method, fine-tuning significantly reduced the dependence on the eyebrows and eyes, emphasizing the mouth. The network trained on mouth-only images focused exclusively on the mouth, but its accuracy on the test set was low. The network trained on the mixed set was very accurate, but remained affected by co-occurrence bias.

Figure 6: A comparison of results afforded by the four networks.

We compared the accuracies of the pre-trained and fine-tuned networks on the two evaluation image sets, as shown in Table 1. The Grad-CAM results for five images from each set are shown in Figure 7. For images featuring both ‘Wearing Lipstick’ and ‘Heavy Makeup’, the DNN's measured attention clearly differs before and after fine-tuning. For images with ‘Wearing Lipstick’ but without ‘Heavy Makeup’, the fine-tuned (but not the pre-trained) network detected the lipstick; the fine-tuned network thus exhibits better transferability. The experimental results for the attribute pairs ‘High Cheekbones’ & ‘Smiling’ and ‘Double Chin’ & ‘Chubby’ are also shown in Figure 7.

‘Wearing Lipstick’ & ‘Heavy Makeup’
Network     | test set | both attributes | first attribute only
Pre-trained | 92.90%   | 98.06%          | 82.17%
Center loss | 93.26%   | 98.29%          | 83.23%
IoU loss    | 93.25%   | 98.31%          | 83.31%

‘High Cheekbones’ & ‘Smiling’
Network     | test set | both attributes | first attribute only
Pre-trained | 63.53%   | 70.22%          | 47.41%
IoU loss    | 65.56%   | 71.42%          | 61.33%

‘Double Chin’ & ‘Chubby’
Network     | test set | both attributes | first attribute only
Pre-trained | 84.18%   | 90.92%          | 81.77%
IoU loss    | 84.79%   | 94.32%          | 86.45%
Table 1: The accuracies afforded by each network (three examples).
Figure 7: Comparison of the Grad-CAM results derived using the two evaluation image sets. The attributes ‘Wearing Lipstick’, ‘High Cheekbones’, and ‘Double Chin’ are shown in the first to third blocks, respectively. In each block, the results from the pre-trained network are shown in the first row, and those from the fine-tuned network in the second row.

3.1 Training and test details

We performed all experiments using PyTorch on a PC with a GTX 1080 GPU. We trained AlexNet with early stopping based on the training and validation sets. Optimization used stochastic gradient descent (SGD) with momentum. The learning rate and the momentum were 0.01 and 0.9, respectively, during both training and fine-tuning, and the batch size was 256. All fine-tuning results were derived from single-epoch AlexNet runs. The training and fine-tuning epoch times were about 6 and 10 min, respectively. The mouth region weight was triple that of the other facial parts. The balancing weights were set separately for the ‘Wearing Lipstick’, ‘High Cheekbones’, and ‘Double Chin’ experiments.
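The optimizer update used here (SGD with momentum, learning rate 0.01, momentum 0.9) reduces to two lines per parameter. A pure-Python sketch of a single scalar update, in the accumulation form PyTorch's SGD uses:

```python
def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9):
    """One SGD-with-momentum update: v <- mu*v + g; w <- w - lr*v."""
    velocity = momentum * velocity + grad   # accumulate a running gradient
    w = w - lr * velocity                   # step against the accumulated direction
    return w, velocity

w, v = 1.0, 0.0
w, v = sgd_momentum_step(w, grad=0.5, velocity=v)   # first step: v = 0.5, w ≈ 0.995
```

With a zero initial velocity the first step equals plain SGD; momentum only changes subsequent steps, where repeated gradients in the same direction accelerate.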

4 Conclusion

We developed a method whereby a user can manually direct a pre-trained DNN to focus on only the relevant features, mitigating co-occurrence bias that may be present in a dataset. We used the CelebA dataset to address the lipstick problem; our method reduced the effects of co-occurring features in the pre-trained network, and classification accuracy improved greatly. However, facial landmark recognition accuracy affected our results; in the future, we will eschew landmarks. We will also test our method on other DNN models (VGG and ResNet).

5 Acknowledgement

This work was supported by JST CREST Grant Number JPMJCR17A1, Japan.


  • [1] H. Fukui, T. Hirakawa, T. Yamashita, and H. Fujiyoshi. Attention branch network: Learning of attention mechanism for visual explanation. arXiv, 2018.
  • [2] K. Li, Z. Wu, K. Peng, J. Ernst, and Y. Fu. Tell me where to look: Guided attention inference network. In CVPR, 2018.
  • [3] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In ICCV, 2015.
  • [4] R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, et al. Grad-cam: Visual explanations from deep networks via gradient-based localization. In ICCV, 2017.
  • [5] A. Torralba and A. Efros. Unbiased look at dataset bias. In CVPR, 2011.
  • [6] J. Yu, Y. Jiang, Z. Wang, Z. Cao, and T. Huang. Unitbox: An advanced object detection network. In ACMMM, 2016.
  • [7] Q. Zhang, W. Wang, and S. Zhu. Examining cnn representations with respect to dataset bias. CoRR, 2018.