Attention Branch Network: Learning of Attention Mechanism for Visual Explanation

12/25/2018 ∙ by Hiroshi Fukui, et al.

Visual explanation enables humans to understand the decision making of a deep convolutional neural network (CNN), but by itself it does not contribute to improving performance. In this paper, we focus on the attention map for visual explanation, which represents regions with high response values as the important regions in image recognition. Such an attention map can significantly improve the performance of a CNN when it is used in an attention mechanism that focuses on a specific region in an image. In this work, we propose the Attention Branch Network (ABN), which extends the top-down visual explanation model by introducing a branch structure with an attention mechanism. ABN is applicable to several image recognition tasks by introducing a branch for the attention mechanism, and it is trainable for visual explanation and image recognition in an end-to-end manner. We evaluate ABN on several image recognition tasks such as image classification, fine-grained recognition, and multiple facial attributes recognition. Experimental results show that ABN outperforms the baseline models on these image recognition tasks while generating an attention map for visual explanation. Our code is available




1 Introduction

Deep Convolutional Neural Network (CNN) [1, 17] approaches have achieved strong performance on various image recognition tasks in computer vision [25, 9, 7, 34, 8, 12, 18]. However, although these CNN approaches achieve impressive performance on such tasks, the CNN models are difficult to interpret. To understand the decision making of CNNs, methods of interpreting CNNs have been proposed [39, 41, 26, 4, 24, 3, 22].

Figure 1: Network structures of Class Activation Mapping and our Attention Branch Network.

“Visual explanation” has been used to interpret CNNs by highlighting the attention region during the inference process. Visual explanation methods can be categorized as bottom-up or top-down. Bottom-up methods typically use gradients with auxiliary data, such as noise [4] and a class index [24, 3]. These methods interpret a CNN without re-training or modifying the architecture; however, they require backpropagation to obtain gradients. In contrast, top-down methods can interpret a CNN during the inference process. Class Activation Mapping (CAM) [41], a representative top-down method, visualizes the attention map of each category using the response of the convolution layer. It replaces the fully-connected layer with a convolution layer and global average pooling (GAP) [20], and obtains class-specific feature maps in which high response values mark the positions representing the class, as shown in Fig. 1(a). However, because CAM requires replacing the fully-connected layer with a convolution layer and passing the result through the GAP, it decreases the performance of the CNN.

To avoid this problem, bottom-up methods are often used for interpreting the CNN. The highlight location in visual explanation is considered an important location in image recognition. To use top-down methods that can visualize an attention map during a forward pass, we extended a top-down visual explanation model to an attention mechanism. By employing the attention map for visual explanation as an attention mechanism, our network is trained while paying attention to the important location in image recognition. The attention mechanism with a top-down visual explanation model can simultaneously interpret CNN and improve their performance.

Inspired by top-down visual explanation methods and attention mechanisms, we propose the Attention Branch Network (ABN), which extends a top-down visual explanation model by introducing a branch structure with an attention mechanism, as shown in Fig. 1(b). ABN consists of three components: a feature extractor, an attention branch, and a perception branch. The feature extractor contains multiple convolution layers for extracting feature maps. The attention branch is designed to apply an attention mechanism by introducing a top-down visual explanation model. This component is important in ABN because it generates the attention map used for both the attention mechanism and visual explanation. The perception branch outputs the class probabilities by feeding both the feature maps and the attention map to convolution layers. ABN has a simple structure and is trainable in an end-to-end manner using the training losses at both branches. Moreover, by introducing the attention branch to various baseline models such as ResNet [9], ResNeXt [34], and a multi-task network [27], ABN can be applied to several networks and image recognition tasks. Our contributions are as follows:

  • ABN is designed to extend a top-down visual explanation model by introducing a branch structure with an attention mechanism. ABN is the first attempt to improve the performance of the CNN by including a top-down method.

  • ABN is applicable to various baseline network models such as VGGNet [14], ResNet [9], and multi-task learning [27] by dividing a baseline model and including an attention branch for generating an attention map.

  • ABN improves the performance of the CNN and provides visual explanation simultaneously, because the attention map is generated during a forward pass.

2 Related works

2.1 Interpreting CNN

Several visual explanation methods that highlight the important region in image recognition as an attention map have been proposed [30, 39, 41, 26, 13, 4, 24, 3, 22]. These methods fall into two types: bottom-up methods, which are gradient-based, and top-down methods, which use the response of a forward pass. Among bottom-up methods, SmoothGrad [4] obtains sensitivity maps by repeatedly adding noise to the input image and averaging the resulting sensitivity maps. Guided backpropagation [13] and Gradient-weighted Class Activation Mapping (Grad-CAM) [24, 3] have also been proposed. They visualize the attention map by propagating only positive gradients for a specific class in the backward pass. Grad-CAM and guided backpropagation have been widely used because they can interpret various pre-trained models using the attention map of a specific class.

Top-down methods visualize an attention map by using the response values of a forward pass through a convolution layer or deconvolution layer. Although top-down methods require re-training and modifying the network model, they can directly visualize an attention map during a forward pass. CAM [41] can visualize an attention map for each class using the response of a convolution layer and the weights of the last fully-connected layer. CAM performs well on weakly supervised object localization but performs worse in image classification because it replaces fully-connected layers with convolution layers and passes the result through the GAP.

We build ABN based on CAM, which can visualize a CNN using the attention map during a forward pass. CAM is easily compatible with an attention mechanism that directly weights the feature map. In contrast, bottom-up visual explanation methods are not easily compatible with an attention mechanism because they require backpropagation to calculate gradients. Therefore, we use CAM for the attention mechanism of the proposed method.

Figure 2: Detailed structure of the Attention Branch Network.

2.2 Attention mechanism

Attention mechanisms have been used in various fields such as computer vision and natural language processing [19, 15, 32, 12]. They have been widely used in sequential models [15, 36, 37, 2, 31] built on recurrent neural networks and Long Short Term Memory (LSTM) [10]. A typical attention model on sequential data was proposed by Xu et al. [15]. Their model is based on two types of attention mechanisms: soft attention and hard attention. The soft attention mechanism of the model of Xu et al. is used as the gate of an LSTM, and it has been applied to image captioning and visual question answering [36, 37]. Additionally, the Non-local Neural Network [33], which uses a self-attention approach, and the Recurrent Attention Model [21], which controls the attention location by reinforcement learning, have been proposed.

Recent work on attention mechanisms has advanced to single-image recognition tasks [32, 12, 6]. Typical attention models on a single image are the Residual Attention Network [32] and the Squeeze-and-Excitation Network (SENet) [12]. The Residual Attention Network includes two attention components: a stacked network structure that consists of multiple attention components, and attention residual learning, which applies residual learning [9] to an attention mechanism. SENet includes a squeeze-and-excitation block, which contains a channel-wise attention mechanism and is introduced for each residual block.

ABN is designed to focus on the attention map for visual explanation, which represents the important region in image recognition. Previous attention models extract an attention map for the attention mechanism using only the response values of convolution layers during a forward pass. In contrast, ABN easily extracts an effective attention map for image recognition by generating the attention map for visual explanation on the basis of a top-down method.

3 Attention Branch Network

ABN consists of three modules as follows: feature extractor, attention branch, and perception branch, as shown in Fig. 1. The feature extractor contains multiple convolution layers and extracts feature maps from an input image. The attention branch converts the attention location based on CAM to an attention map by using an attention mechanism. The perception branch outputs the probability of each class by receiving the feature map from the feature extractor and attention map.

ABN is based on a baseline model such as VGGNet [14] or ResNet [9]. The feature extractor and perception branch are constructed by dividing a baseline model at a specific layer. The attention branch is constructed after the feature extractor on the basis of CAM. ABN can be applied to several image recognition tasks by introducing the attention branch. We provide ABN for the image recognition tasks of image classification, fine-grained recognition, and multi-task learning.

3.1 Attention branch

CAM has a $K \times 3 \times 3$ convolution layer, a GAP, and a fully-connected layer, as shown in Fig. 1(a). Here, $K$ is the number of categories, and “$K \times 3 \times 3$ convolution layer” means a $3 \times 3$ kernel with $K$ channels at the convolution layer. The $K \times 3 \times 3$ convolution layer outputs a $K \times h \times w$ feature map, which represents the attention location for each class. The $K \times h \times w$ feature map is down-sampled to a $1 \times 1 \times K$ feature map by the GAP, and the probability of each class is output by passing it through the fully-connected layer with the softmax function. When CAM visualizes the attention map of each class, the attention map is generated as the weighted sum of the $K \times h \times w$ feature map, using the weights of the last fully-connected layer.
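The CAM computation described above can be sketched in a few lines of NumPy. This is only an illustration under assumptions, not the authors' code: the function names, the array shapes, and the random inputs are hypothetical, and the softmax is omitted so that only the class scores are shown.

```python
import numpy as np

def class_activation_map(features, fc_weights, class_idx):
    """Attention map of one class as a weighted sum of feature maps.

    features:   (C, H, W) response of the last convolution layer (assumed shape)
    fc_weights: (num_classes, C) weights of the last fully-connected layer
    """
    # Contract the channel axis: sum_c w[class_idx, c] * features[c, :, :]
    return np.tensordot(fc_weights[class_idx], features, axes=([0], [0]))

def cam_logits(features, fc_weights):
    """Class scores: GAP over the spatial axes, then the fully-connected layer."""
    pooled = features.mean(axis=(1, 2))   # (C,)
    return fc_weights @ pooled            # (num_classes,)

# Hypothetical toy inputs: 8 channels, 7x7 resolution, 5 classes.
rng = np.random.default_rng(0)
features = rng.random((8, 7, 7))
fc_weights = rng.standard_normal((5, 8))
cam = class_activation_map(features, fc_weights, class_idx=2)
```

Because both operations are linear, the spatial mean of the attention map of a class equals the score of that class before the softmax, which is why the map can be read as a spatial breakdown of the class score.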

Instead of a fully-connected layer, CAM stacks convolution layers. This restriction is also introduced into the attention branch. A fully-connected layer, which connects each unit with all units at the next layer, negates the ability to localize the attention area in the convolution layer. Therefore, if the baseline model contains a fully-connected layer, as VGGNet does, the attention branch replaces that fully-connected layer with a convolution layer, as in CAM and as shown in the top of Fig. 2(b). For the ResNet model, the attention branch of ABN is constructed from the residual block, as shown in the bottom of Fig. 2(b). Here, we set the stride of the first convolution layer of the residual block to 1 to maintain the resolution of the feature map.

To generate an attention map, we build a top layer on the attention branch so that the branch both outputs the probability of each class and generates an attention map for the attention mechanism. However, CAM cannot generate an attention map during training because its attention map is computed from the feature map and the weights of a fully-connected layer after training. To address this issue, we replace the fully-connected layer with a $K \times 1 \times 1$ convolution layer, as with CAM. This $K \times 1 \times 1$ convolution layer imitates the last fully-connected layer of CAM in a forward pass. After the $K \times 1 \times 1$ convolution layer, the attention branch outputs the class probability by using the response of the GAP and the softmax function. Finally, the attention branch generates an attention map from the $K \times h \times w$ feature map. To aggregate the $K$ feature maps, these feature maps are convolved by a $1 \times 1 \times 1$ convolution layer. By convolving with the $1 \times 1 \times 1$ kernel, a $1 \times h \times w$ feature map is generated. We employ the $1 \times h \times w$ feature map, normalized by the sigmoid function, as the attention map for the attention mechanism.
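As a rough sketch of the two outputs of the attention branch, the snippet below takes per-class feature maps and produces class probabilities (GAP followed by softmax) and a single attention map (a 1 × 1 convolution with one output channel followed by a sigmoid). It is a simplification under assumptions: the function name, shapes, and random weights are hypothetical, and any normalization layers of the actual network are omitted.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_branch_head(class_maps, w, b=0.0):
    """class_maps: (K, H, W) per-class feature maps (assumed shape).
    w: (K,) weights of a 1x1 convolution with a single output channel."""
    # Class probabilities: GAP over the spatial axes, then softmax.
    probs = softmax(class_maps.mean(axis=(1, 2)))
    # A 1x1 convolution with one output channel is a per-pixel
    # weighted sum over the K input channels.
    aggregated = np.tensordot(w, class_maps, axes=([0], [0])) + b   # (H, W)
    # The sigmoid squashes the aggregated map into (0, 1).
    attention = sigmoid(aggregated)
    return probs, attention

# Hypothetical toy inputs: 10 classes, 14x14 resolution.
rng = np.random.default_rng(0)
class_maps = rng.standard_normal((10, 14, 14))
probs, attention = attention_branch_head(class_maps, w=rng.standard_normal(10))
```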

3.2 Perception branch

The perception branch outputs the final probability of each class by receiving the feature map from the feature extractor and the attention map. The structure of the perception branch is the same as the conventional top layers of image classification models such as VGGNet or ResNet, as shown in Fig. 2(c). First, the attention map is applied to the feature map by the attention mechanism. We use two types of attention mechanisms, given in Eq. 1 and Eq. 2:

$$g'_c(x_i) = M(x_i) \cdot g_c(x_i) \tag{1}$$

$$g'_c(x_i) = \left(1 + M(x_i)\right) \cdot g_c(x_i) \tag{2}$$

Here, $g_c(x_i)$ is the feature map at the feature extractor, $M(x_i)$ is the attention map, and $g'_c(x_i)$ is the output of the attention mechanism, as shown in Fig. 2(a). Note that $c \in \{1, \dots, C\}$ is the index of the channel.

Equation 1 is simply a dot-product between the attention map and the feature map at a specific channel $c$. In contrast, Eq. 2 can highlight the feature map at the peak of the attention map while preventing the lower-value regions of the attention map from degrading the feature map to zero.
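The difference between the two mechanisms is easy to see on a toy example. The sketch below assumes that Eq. 1 multiplies each channel of the feature map by the attention map and that Eq. 2 multiplies it by one plus the attention map; the function name and the `residual` flag are hypothetical.

```python
import numpy as np

def apply_attention(features, attention, residual=True):
    """features: (C, H, W) feature map; attention: (H, W) map in [0, 1]."""
    # Broadcasting applies the same spatial map to every channel c.
    if residual:
        return (1.0 + attention) * features   # assumed form of Eq. 2
    return attention * features               # assumed form of Eq. 1

features = np.ones((3, 2, 2))
attention = np.array([[0.0, 0.5],
                      [1.0, 0.25]])
out1 = apply_attention(features, attention, residual=False)
out2 = apply_attention(features, attention, residual=True)
```

Where the attention value is zero, the first form erases the feature while the second passes it through unchanged, so the second form only amplifies attended locations and never suppresses a feature completely.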

3.3 Training

ABN is trainable in an end-to-end manner using the losses at both branches. Our training loss function $L(x_i)$ is a simple sum of the losses at both branches, as in Eq. 3:

$$L(x_i) = L_{att}(x_i) + L_{per}(x_i) \tag{3}$$

Here, $L_{att}(x_i)$ denotes the training loss at the attention branch for an input sample $x_i$, and $L_{per}(x_i)$ denotes the training loss at the perception branch. The training loss for each branch is calculated by combining the softmax function and cross-entropy in image classification tasks. The feature extractor is optimized by the gradients passed back from both the attention and perception branches during backpropagation. If ABN is applied to other image recognition tasks, the training loss can be changed according to the baseline model.
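For a single sample in the classification setting, the summed loss can be sketched as follows. The function names and the example logits are hypothetical; the only assumption taken from the text is that each branch uses softmax cross-entropy and that the two losses are added without weighting.

```python
import numpy as np

def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample, computed via log-sum-exp."""
    shifted = logits - logits.max()                     # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[label]

def abn_loss(att_logits, per_logits, label):
    """Unweighted sum of the attention-branch and perception-branch losses."""
    return cross_entropy(att_logits, label) + cross_entropy(per_logits, label)

# Hypothetical logits of the two branches for a 4-class problem.
att_logits = np.array([2.0, 0.5, -1.0, 0.0])
per_logits = np.array([3.0, 0.1, -0.5, 0.2])
total = abn_loss(att_logits, per_logits, label=0)
```

Since both terms depend on the shared feature extractor, its parameters receive the sum of the gradients of the two branches.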

3.4 ABN for multi-task learning

ABN for classification is designed with separate branches: one that generates the attention map and one that outputs the probability of each class. This network design is also applicable to other image recognition tasks, such as multi-task learning. In this section, we explain ABN for multi-task learning.

Figure 3: Attention Branch Network for multi-task learning.

Conventional multi-task learning has units that output the recognition scores corresponding to each task [27]. In training, the loss function is defined over multiple tasks using a single network. However, applying ABN to multi-task learning raises a problem. In image classification, the relation between the number of inputs and the number of recognition tasks is one-to-one. In contrast, in multi-task learning this relation is one-to-many. With a one-to-one relation, a single attention map can focus on the specific target location, but with a one-to-many relation a single attention map cannot focus on multiple target locations. To address this issue, we generate one attention map per task by introducing multi-task learning to the attention and perception branches. Note that we use ResNet with multi-task learning as the baseline model.

To output multiple attention maps corresponding to specific tasks, we design the attention branch with multi-task learning, as shown in Fig. 3. First, the feature map at residual block 4 is convolved by a 1 × 1 convolution layer, and a 14 × 14 feature map is output for each task. The probability score for a specific task is output by applying the 14 × 14 feature map of that task to the GAP and the sigmoid function. In training, we optimize by combining the sigmoid function and the binary cross-entropy loss function. We use the 14 × 14 feature maps as attention maps.
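The per-task scoring above can be sketched as follows. The names and shapes are hypothetical, and squashing the per-task maps with a sigmoid before using them as attention maps is an assumption made here for illustration; the text only states that the 14 × 14 maps serve as attention maps.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def multitask_attention(task_maps):
    """task_maps: (T, H, W), one feature map per task (assumed shape)."""
    # Per-task score: GAP over the spatial axes, then sigmoid.
    scores = sigmoid(task_maps.mean(axis=(1, 2)))   # (T,)
    # Assumption: the maps are squashed to (0, 1) to act as attention maps.
    attention_maps = sigmoid(task_maps)             # (T, H, W)
    return scores, attention_maps

def binary_cross_entropy(scores, labels, eps=1e-12):
    """Mean binary cross-entropy over the T tasks."""
    scores = np.clip(scores, eps, 1.0 - eps)        # avoid log(0)
    return -np.mean(labels * np.log(scores)
                    + (1.0 - labels) * np.log(1.0 - scores))

# Hypothetical example: 40 binary tasks on a 14x14 grid.
task_maps = np.zeros((40, 14, 14))
scores, attention_maps = multitask_attention(task_maps)
loss = binary_cross_entropy(scores, np.ones(40))
```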

We also introduce the perception branch to multi-task learning. A converted feature map $g'^{t}_c(x)$ is first generated using the attention map $M^t(x)$ of a specific task $t$ and the feature map $g(x)$ at the feature extractor, as shown in Eq. 4 and as discussed in Sec. 3.2. After generating the feature map $g'^{t}_c(x)$, the probability score of the specific task $t$ is calculated by the perception branch, which outputs the probability for each task given the feature map $g'^{t}_c(x)$.

$$g'^{t}_c(x) = M^t(x) \cdot g_c(x) \tag{4}$$

The probability matrix over the tasks on the perception branch consists of $T \times 2$ components, that is, a two-category classification for each of the $T$ tasks. The probability of a specific task $t$ is taken from the output obtained when the perception branch receives the feature map $g'^{t}_c(x)$ to which the attention map of task $t$ has been applied, as shown in Fig. 3. This process is repeated for every task.
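Applying one attention map per task to a shared feature map can be written with a single broadcasted product. This sketch assumes that Eq. 4 takes the plain product form of Sec. 3.2; the function name and shapes are hypothetical.

```python
import numpy as np

def per_task_features(features, attention_maps):
    """features: (C, H, W) shared feature map from the feature extractor.
    attention_maps: (T, H, W), one attention map per task.
    Returns a (T, C, H, W) array: the feature map modulated for each task."""
    # Insert singleton axes so every task map multiplies every channel.
    return attention_maps[:, None, :, :] * features[None, :, :, :]

# Hypothetical example: 3 tasks, 4 channels, 14x14 resolution.
features = np.ones((4, 14, 14))
attention_maps = np.full((3, 14, 14), 0.5)
attention_maps[1] = 1.0
task_features = per_task_features(features, attention_maps)
```

Each slice `task_features[t]` would then be fed to the perception branch, and only the output for task `t` would be kept from that pass.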

4 Experiments

4.1 Experiments detail on image classification

First, we evaluate ABN on the image classification task using the CIFAR10, CIFAR100, Street View House Numbers (SVHN) [23], and ImageNet [5] datasets. The input image size is 32 × 32 pixels for CIFAR10, CIFAR100, and SVHN, and 224 × 224 pixels for ImageNet. The number of categories for each dataset is as follows: CIFAR10 and SVHN consist of 10 classes, CIFAR100 consists of 100 classes, and ImageNet consists of 1,000 classes. During training, we apply standard data augmentation. For CIFAR10, CIFAR100, and SVHN, the images are first zero-padded with 4 pixels on each side, then randomly cropped to again produce 32 × 32 pixel images, and then horizontally mirrored at random. For ImageNet, the images are resized to 256 × 256 pixels, then randomly cropped to produce 224 × 224 pixel images, and then horizontally mirrored at random. The numbers of training, validation, and testing images are as follows: CIFAR10 and CIFAR100 each consist of 50,000 training images and 10,000 testing images, SVHN consists of 604,388 training images (train: 73,257, extra: 531,131) and 26,032 testing images, and ImageNet consists of 1,281,167 training images and 50,000 validation images.
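The CIFAR-style augmentation described above (zero-pad by 4 pixels, random crop back to the original size, random horizontal mirror) can be sketched in NumPy. The function name, the channel-last layout, and the mirroring probability of 0.5 are assumptions for illustration.

```python
import numpy as np

def augment(image, rng, pad=4):
    """image: (H, W, C) array. Zero-pad, randomly crop back to HxW,
    and mirror horizontally with probability 0.5 (assumed)."""
    h, w, _ = image.shape
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    # Top-left corner of the crop, uniform over all valid offsets.
    top = rng.integers(0, 2 * pad + 1)
    left = rng.integers(0, 2 * pad + 1)
    crop = padded[top:top + h, left:left + w]
    if rng.random() < 0.5:
        crop = crop[:, ::-1]   # flip along the width axis
    return crop

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
augmented = augment(image, rng)
unpadded = augment(image, rng, pad=0)
```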

We optimize the networks by Stochastic Gradient Descent (SGD) with momentum. On CIFAR10 and CIFAR100, the batch size is 256. The total number of training epochs is 300 for CIFAR10 and CIFAR100, 40 for SVHN, and 90 for ImageNet. The initial learning rate is set to 0.1 and is divided by 10 at 50% and 75% of the total number of training epochs.
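The step schedule can be expressed as a small function of the epoch index. The function name is hypothetical, and two details are assumptions: epochs are counted from zero, and the decay takes effect at the boundary epoch itself.

```python
def learning_rate(epoch, total_epochs, base_lr=0.1):
    """Step schedule: divide the rate by 10 at 50% and again at 75% of training."""
    lr = base_lr
    if epoch >= 0.5 * total_epochs:
        lr /= 10
    if epoch >= 0.75 * total_epochs:
        lr /= 10
    return lr

# With 300 epochs, the rate changes at epochs 150 and 225.
schedule = [learning_rate(e, 300) for e in range(300)]
```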

4.2 Image classification

Model | ResNet (no attention) | ABN with Eq. 1 | ABN with Eq. 2
ResNet20 | 31.47 | 30.61 | 30.46
ResNet32 | 30.13 | 28.34 | 27.91
ResNet44 | 25.90 | 24.83 | 25.59
ResNet56 | 25.61 | 24.22 | 24.07
ResNet110 | 24.14 | 23.28 | 22.82
Table 1: Comparison of the top-1 error [%] on CIFAR100 for each attention mechanism.
Dataset CIFAR10 CIFAR100 SVHN [23] ImageNet [5]
VGGNet [14] 31.2
ResNet [9] 6.43
VGGNet+CAM [41] 33.4
WideResNet [38] 4.00 19.25 21.9
DenseNet [11] 4.51 22.27 22.2
ResNeXt [34] 22.4
Attention [32] 3.90 20.45 21.76
AttentionNeXt [32] 21.20
SENet [12] 21.57
indicates results of re-implementation accuracy
Table 2: Comparison of top-1 error on CIFAR10, CIFAR100, SVHN, and ImageNet dataset.

Analysis of the attention mechanism   We compare the accuracies of the attention mechanisms of Eq. 1 and Eq. 2. We use the ResNet {20, 32, 44, 56, 110} models on CIFAR100.

Table 1 shows the top-1 errors of the attention mechanisms of Eq. 1 and Eq. 2, together with those of the conventional ResNet without an attention mechanism. First, we compare ABN with the attention mechanism of Eq. 1 against the conventional ResNet. The attention mechanism of Eq. 1 yields a lower top-1 error than the conventional ResNet. We also compare the accuracy of the two attention mechanisms. The attention mechanism of Eq. 2 is slightly more accurate than that of Eq. 1 at most depths. In the Residual Attention Network, which includes the same attention mechanisms, accuracy decreased with the attention mechanism of Eq. 1 [32]. This result suggests that our attention map responds to the region that is effective for image classification. We employ the attention mechanism of Eq. 2 by default.

Accuracy on CIFAR and SVHN   Table 2 shows the top-1 errors on CIFAR10/100, SVHN, and ImageNet. We evaluate the top-1 error of various baseline models, CAM, and ABN on image classification. The reported values are either the original top-1 errors from the referenced papers or the top-1 errors of our models, and marked entries indicate re-implementation results. The numbers in brackets denote the difference in top-1 error from the re-implemented baseline model. On CIFAR and SVHN, we evaluate the top-1 error using the following models: ResNet (depth=110), DenseNet (depth=100, growth rate=12), Wide ResNet (depth=28, widen factor=4, drop ratio=0.3), and ResNeXt (depth=28, cardinality=8, widen factor=4). Note that ABN is constructed by dividing a ResNet model at residual block 3.

ResNet, Wide ResNet, DenseNet, and ResNeXt all improve in accuracy when ABN is introduced. On CIFAR10, ResNet and DenseNet decrease the top-1 error from 6.43% to 4.91% and from 4.51% to 4.17%, respectively. Additionally, all ResNet models decrease the top-1 error by more than 0.6% on CIFAR100.

Accuracy on ImageNet   We evaluate the image classification accuracy on ImageNet, also reported in Table 2, in the same manner as for CIFAR10/100 and SVHN. On ImageNet, we evaluate the top-1 error using VGGNet (depth=16), ResNet (depth=152), and SENet (ResNet152 model). First, we compare the top-1 errors of CAM. The performance of CAM decreases slightly with specific baseline models because of the removal of the fully-connected layers and the addition of a GAP [41]. Similarly, the performance of VGGNet+BatchNormalization (BN) [29] with CAM decreases even in our re-implementation. In contrast, the performance of ResNet with CAM is almost the same as that of the baseline ResNet. The structure of ResNet, which has a GAP and a fully-connected layer as its last layers, resembles that of CAM. ResNet with CAM can be constructed easily by stacking a convolution layer on the last residual block and setting the stride of its first convolution layer to 1. Therefore, ResNet with CAM is unlikely to lose performance from the removal of the fully-connected layer and the addition of a GAP. On the other hand, ABN outperforms the conventional VGGNet and CAM models. Similarly, ABN performs better than the conventional ResNet and CAM models.

We also compare ABN with a conventional attention model. SENet reduces the top-1 error from 22.19% to 21.90% with ResNet152. ABN reduces the top-1 error from 22.19% to 21.37%, which indicates that ABN is more accurate than SENet. Moreover, ABN can be combined with SENet in parallel. SENet with ABN reduces the top-1 error from 22.19% for the conventional ResNet152 to 20.77%. The Residual Attention Network reports top-1 errors of 21.76% with its ResNet model and 21.20% with its ResNeXt model, which indicates that ResNet152+SENet with ABN performs better than these Residual Attention Networks.

Figure 4: Visualizing the high attention area with CAM, Grad-CAM, and our ABN. CAM and Grad-CAM visualize the attention maps of the top-1 class.

Visualizing attention map   We compare the attention maps visualized using Grad-CAM, CAM, and ABN. Grad-CAM extracts an attention map using ResNet152 as the baseline model. CAM and ABN are also constructed using ResNet152 as the baseline model. Figure 4 shows the attention maps of each model on the ImageNet dataset.

As shown in Fig. 4, Grad-CAM, CAM, and ABN highlight similar regions. For example, in the first column of Fig. 4, these models classify the image as “Violin” and highlight the location of the violin in the original image. Similarly, in the second column, they classify the image as “Cliff” and highlight the cliff region. The original image in the third column is a typical example containing multiple objects, such as “Seat belt” and “Australian terrier”. In this case, Grad-CAM (conventional ResNet152) and CAM fail, but ABN performs well. In the third column, the attention map of ABN highlights each object. Therefore, this attention map can focus on a specific region when multiple objects are in an image.

Network | Model accuracy [%] | Maker accuracy [%]
VGG16 | 85.9 | 90.4
ResNet101 | 90.2 | 90.1
VGG16+ABN | 90.7 | 92.9
ResNet101+ABN | 97.1 | 98.1
Table 3: Comparison of accuracy on the CompCars dataset
Figure 5: Visualizing attention map on fine-grained recognition.
Figure 6: Comparison of the distribution maps at residual block 4 by t-SNE. Left : Distribution of conventional ResNet101. Center and Right : Distribution of the attention branch network. Center has not applied the attention map.
Figure 7: Visualizing attention map on multiple facial attributes recognition.

4.3 Fine-grained recognition

We evaluate ABN for fine-grained recognition on the Comprehensive Cars (CompCars) dataset [35], which has 36,451 training images and 15,626 testing images covering 432 car models and 75 makers. We use VGG16 and ResNet101 as baseline models and optimize them by SGD with momentum. The total number of training epochs is 50, and the mini-batch size is 32. The learning rate starts from 0.01 and is multiplied by 0.1 at epochs 25 and 35. The input image is resized to 323 × 224 pixels. This image size is calculated by taking the average of the bounding-box aspect ratios in the training data. This resizing suppresses the collapse of the car shape.

Table 3 shows the car model and maker recognition accuracy on the CompCars dataset. The car model recognition accuracy of ABN improves by 4.9% and 6.2% with VGG16 and ResNet101, respectively. Moreover, maker recognition improves by 2.0% and 7.5%, respectively. These results indicate that ABN is also effective for fine-grained recognition. We visualize the attention maps for car model and maker recognition in Fig. 5. In these visualizations, the training and testing images are the same for car model and maker recognition; however, our attention maps differ depending on the recognition task.

Method | Average accuracy [%] | Odds
FaceTracer [16] | 81.13 | 40/40
PANDA-l [40] | 85.43 | 39/40
LNet+ANet [42] | 87.30 | 37/40
MOON [28] | 90.93 | 29/40
ResNet101 | 90.69 | 27/40
ABN | 91.07 |
Table 4: Comparison of accuracy on the CelebA dataset

We compare the feature representations of the conventional ResNet101 and ABN with ResNet101. In this experiment, we visualize the feature distributions by t-distributed Stochastic Neighbor Embedding (t-SNE) [30] and analyze them. We compare the feature maps at the final layer of residual block 4. Figure 6 shows the t-SNE distribution maps. We use 5,000 testing images from the CompCars dataset. The feature maps of the conventional ResNet101 and of the feature extractor in the attention branch network are clustered by car pose. In contrast, the feature maps to which the attention map has been applied are separated both by car pose and by detailed car form.

4.4 Multi-task Learning

In multi-task learning, we evaluate ABN on multiple facial attributes recognition using the CelebA dataset [42], which consists of 40 facial attribute labels and 202,599 images (182,637 training images and 19,962 testing images). The total number of training epochs is 10, and the learning rate is set to 0.01.

Table 4 shows the average recognition rate and the number of facial attribute tasks on which ABN outperforms each previous method. Note that the third column of Table 4 gives the number of winning tasks when each conventional model is compared with ABN for each facial attribute. The accuracy for each specific facial attribute task is described in the appendix. Compared with ResNet101, ABN is 0.38 percentage points more accurate on average, and the accuracy of 27 facial attribute tasks is improved. ABN also performs better than conventional facial attribute recognition methods, such as FaceTracer [16], PANDA-l [40], LNet+ANet [42], and the Mixed Objective Optimization Network (MOON) [28]. ABN outperforms conventional facial attribute recognition methods on difficult tasks such as “arched eyebrows”, “pointy nose”, “wearing earring”, and “wearing necklace”. Figure 7 shows the attention maps of ABN on the CelebA dataset. These attention maps highlight specific locations such as the mouth, eyes, beard, and hair. The highlighted locations correspond to the specific facial attribute task, as shown in Fig. 7. It is conceivable that these highlighted locations contribute to the performance improvement of ABN.

5 Conclusion

We proposed the Attention Branch Network, which extends the top-down visual explanation model by introducing a branch structure with an attention mechanism. ABN is simultaneously trainable for visual explanation and image recognition with an attention mechanism in an end-to-end manner. It is also applicable to several CNN models and image recognition tasks. In our experiments, we evaluated the accuracy of ABN on image classification, fine-grained recognition, and multi-task learning, and ABN improved performance on all of these tasks. We plan to apply ABN to reinforcement learning, which does not include labels in the training process.


  • [1] K. Alex, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Neural Information Processing Systems, pages 1097–1105. 2012.
  • [2] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations, 2016.
  • [3] A. Chattopadhyay, A. Sarkar, P. Howlader, and V. N. Balasubramanian. Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks. arXiv preprint arXiv:1710.11063, 2017.
  • [4] S. Daniel, T. Nikhil, K. Been, B. V. Fernanda, and W. Martin. Smoothgrad: removing noise by adding noise, 2017.
  • [5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In

    Computer Vision and Pattern Recognition

    , 2009.
  • [6] L. Drew, S. Dan, E. Sven, and S. Thomas. Global-and-local attention networks for visual recognition. arXiv, abs/1805.08819, 2018.
  • [7] H. Emily M. and C. Rama. Attributes for improved attributes: A multi-task network utilizing implicit and explicit relationships for facial attribute classification. In

    Association for the Advancement of Artificial Intelligence

    , 2017.
  • [8] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In International Conference on Computer Vision, 2017.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • [10] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Comput, 9(8):1735–1780, 1997.
  • [11] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [12] H. Jie, S. Li, and S. Gang. Squeeze-and-excitation networks. Computer Vision and Pattern Recognition, 2017.
  • [13] S. Jost, Tobias, D. Alexey, B. Thomas, and R. Martin. Striving for simplicity: The all convolutional net. In International Conference on Learning Representations. 2015.
  • [14] S. Karen and Z. Andrew. Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations, 2015.
  • [15] X. Kelvin, B. Jimmy, K. Ryan, C. Kyunghyun, C. Aaron, S. Ruslan, Z. Rich, and B. Yoshua. Show, attend and tell: Neural image caption generation with visual attention. In

    International Conference on Machine Learning

    , pages 2048–2057, 2015.
  • [16] N. Kumar, P. N. Belhumeur, and S. K. Nayar. Facetracer: A search engine for large collections of images with faces, October 2008.
  • [17] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.
  • [18] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In European Conference on Computer Vision, 2018.
  • [19] T. Luong, H. Pham, and C. D. Manning. Effective approaches to attention-based neural machine translation. In Empirical Methods in Natural Language Processing, pages 1412–1421, 2015.
  • [20] M. Lin, Q. Chen, and S. Yan. Network in network. International Conference on Learning Representations, 2014.
  • [21] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu. Recurrent models of visual attention. In Neural Information Processing Systems, pages 2204–2212, 2014.
  • [22] G. Montavon, W. Samek, and K.-R. Müller. Methods for interpreting and understanding deep neural networks. Digital Signal Processing, 73:1–15, 2018.
  • [23] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In Neural Information Processing Systems, 2011.
  • [24] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In International Conference on Computer Vision, pages 618–626, 2017.
  • [25] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Neural Information Processing Systems, pages 91–99, 2015.
  • [26] M. T. Ribeiro, S. Singh, and C. Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144, 2016.
  • [27] R. Caruana. Multitask learning: A knowledge-based source of inductive bias. In International Conference on Machine Learning, pages 41–48, 1993.
  • [28] E. Rudd, M. Gunther, and T. Boult. Moon: A mixed objective optimization network for the recognition of facial attributes. In European Conference on Computer Vision, 2016.
  • [29] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
  • [30] L. van der Maaten and G. Hinton. Visualizing data using t-sne. Journal of Machine Learning Research, 9:2579–2605, 2008.
  • [31] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Neural Information Processing Systems, pages 5998–6008, 2017.
  • [32] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang. Residual attention network for image classification. In Computer Vision and Pattern Recognition, 2017.
  • [33] X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. Computer Vision and Pattern Recognition, 2018.
  • [34] S. Xie, R. B. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition, pages 5987–5995, 2017.
  • [35] L. Yang, P. Luo, C. C. Loy, and X. Tang. A large-scale car dataset for fine-grained categorization and verification. In Computer Vision and Pattern Recognition, pages 3973–3981, 2015.
  • [36] Z. Yang, X. He, J. Gao, L. Deng, and A. J. Smola. Stacked attention networks for image question answering. In Computer Vision and Pattern Recognition, pages 21–29, 2016.
  • [37] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo. Image captioning with semantic attention. Computer Vision and Pattern Recognition, pages 4651–4659, 2016.
  • [38] S. Zagoruyko and N. Komodakis. Wide residual networks. In British Machine Vision Conference, 2016.
  • [39] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833, 2014.
  • [40] N. Zhang, M. Paluri, M. Ranzato, T. Darrell, and L. Bourdev. Panda: Pose aligned networks for deep attribute modeling. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1637–1644, 2014.
  • [41] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. Computer Vision and Pattern Recognition, 2016.
  • [42] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In International Conference on Computer Vision, 2015.