Despite the success of convolutional neural networks (CNNs) across many computer vision tasks, there is resistance to deploying them in critical applications such as healthcare and criminal justice because their predictions are difficult to interpret (Rudin, 2019). CNNs compute complex nonlinear functions of their inputs, which makes it unclear what aspects of the input contributed to the prediction. Although many researchers have attempted to design methods to interpret predictions of off-the-shelf CNNs (Zhou et al., 2016; Baehrens et al., 2010; Simonyan and Zisserman, 2014; Zeiler and Fergus, 2014; Springenberg et al., 2014; Bach et al., 2015; Yosinski et al., 2015; Nguyen et al., 2016; Montavon et al., 2017; Zintgraf et al., 2017), it is unclear whether these explanations faithfully describe the underlying model (Kindermans et al., 2017; Adebayo et al., 2018; Hooker et al., 2018; Rudin, 2019). Additionally, adversarial machine learning research (Szegedy et al., 2013; Goodfellow et al., 2014, 2017) has demonstrated that imperceptible modifications to inputs can change classifier predictions, underscoring the unintuitive nature of CNN-based image classifiers.
One interesting class of models that offer more interpretable decisions are “hard” visual attention models. These models rely on a controller that selects relevant parts of the input to contribute to the decision, which provides interpretability by design. These models are inspired by human vision, where the fovea and visual system process only a limited portion of the visual scene at high resolution (Wandell, 1995), and top-down pathways control eye movements to sequentially sample salient parts of visual scenes (Schütz et al., 2011). Although models with hard attention perform well on simple datasets (Larochelle and Hinton, 2010; Mnih et al., 2014; Ba et al., 2014; Gregor et al., 2015), it has been challenging to scale these models from small tasks to real-world images (Sermanet et al., 2015). Here, we propose a novel hard visual attention model that we term Saccader, as well as an effective procedure to train this model. Our model and training procedure overcome the problems of high dimensionality and sparse rewards that make hard attention models difficult to optimize. We introduce a self-supervised pretraining procedure to initialize the model such that reward is not sparse for our policy gradient optimization. The Saccader model learns features for different patches in the image that reflect the degree of relevance to the downstream task, then sequentially selects patches for classification using a novel cell. Our results show that the Saccader model is highly accurate while remaining interpretable compared to other visual attention models (Figure 1). Saccader achieves high top-1 and top-5 accuracy on ImageNet while attending to less than one-third of the image area on average. We further confirm that the glimpse locations proposed by the Saccader model are highly relevant to the classification task.
2 Related Work
Models that employ hard attention make decisions based on only a subset of pixels in the input image, typically in the form of a series of square glimpses. Butko and Movellan (2008) formulated the problem of selecting glimpses as a partially observable Markov decision process, and proposed to use policy gradient to learn a convolutional logistic policy that optimizes long-term information gain. Later work extended the policy gradient framework to policies parameterized by neural networks for image classification (Mnih et al., 2014; Ba et al., 2014) and image captioning (Xu et al., 2015). Other approaches that do not rely on policy gradients include direct estimation of the probability of correct classification for each glimpse location (Larochelle and Hinton, 2010; Zheng et al., 2015) and differentiable attention based on adaptive downsampling (Gregor et al., 2015; Jaderberg et al., 2015; Eslami et al., 2016).
Work employing hard attention for vision tasks has generally examined performance only on relatively simple datasets such as MNIST and SVHN (Mnih et al., 2014; Ba et al., 2014). Sermanet et al. (2015) adapted the model proposed by Ba et al. (2014) to classify the more challenging Stanford Dogs dataset, but the model achieved only a modest accuracy improvement over a non-attentive baseline, and did not appear to derive significant benefit from glimpses beyond the first.
Models with hard attention are difficult to train with gradient-based optimization. To make training more tractable, other models have resorted to soft attention (Bahdanau et al., 2014; Xu et al., 2015). Typical soft attention mechanisms rescale features at one or more stages of the network. The soft masks used for rescaling often appear to provide some insight into the model’s decision-making process (Xu et al., 2015; Das et al., 2017), but the model’s final decision may nonetheless rely on information provided by features with small weights (Jain and Wallace, 2019).
Soft attention is popular in models used for natural language tasks (Bahdanau et al., 2014; Luong et al., 2015; Vaswani et al., 2017), image captioning (Xu et al., 2015; You et al., 2016; Rennie et al., 2017; Lu et al., 2017; Chen et al., 2017), and visual question answering (Andreas et al., 2016), but less common in image classification. Although several spatial soft attention mechanisms for image classification have been proposed (Wang et al., 2017; Jetley et al., 2018; Woo et al., 2018; Linsley et al., 2019; Fukui et al., 2019), current state-of-the-art models do not use these mechanisms (Zoph et al., 2018; Liu et al., 2018a; Real et al., 2018; Huang et al., 2018). The squeeze-and-excitation block, which was a critical component of the model that won the 2017 ImageNet Challenge (Hu et al., 2018), can be viewed as a form of soft attention, but operates over feature channels rather than spatial dimensions.
Although our aim in this work is to perform classification with only image-level class labels, our approach bears some resemblance to two-stage object detection models. These models operate by generating many region proposals and then applying a classification model to each proposal (Uijlings et al., 2013; Girshick et al., 2014; Girshick, 2015; Ren et al., 2015). Unlike our work, these approaches use ground-truth bounding boxes to train the classification model, and modern architectures also use bounding boxes to supervise the proposal generator (Ren et al., 2015).
To understand the intuition behind the Saccader model architecture (Figure 2), imagine applying a trained ImageNet model at different locations of an image to obtain a logits vector at each location. To find the correct label, one could compute a global average of these vectors; to find a salient location on the image, one reasonable choice is the location of the patch that elicits the largest response. With that intuition in mind, we design the Saccader model to compute sets of 2D features for different patches in the image, and to select one of these sets of features at each time step (Figure Supp. 6). The attention mechanism learns to select the most salient location at the current time. For each predicted location, one may extract the logits vector and average across time steps to predict the class.
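This intuition can be illustrated with a toy sketch (not the paper's code; the array shapes and values below are hypothetical):

```python
import numpy as np

def average_logits_prediction(per_location_logits):
    """Predict a label from logits computed at every location of an image
    (shape [num_locations, num_classes]), and rank locations by salience."""
    avg = per_location_logits.mean(axis=0)       # global average of logits vectors
    predicted_class = int(np.argmax(avg))        # label from the averaged logits
    salience = per_location_logits.max(axis=1)   # a patch's strongest response
    most_salient_location = int(np.argmax(salience))
    return predicted_class, most_salient_location

# Toy example: 4 locations, 3 classes; location 2 responds strongly to class 1.
logits = np.array([[0.1, 0.2, 0.0],
                   [0.0, 0.3, 0.1],
                   [0.2, 5.0, 0.1],
                   [0.1, 0.2, 0.3]])
pred, loc = average_logits_prediction(logits)
```

Here the averaged logits give the class, while the location with the largest single response marks the most salient patch.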
In particular, our architecture consists of the following three components:
Representation network: This is a CNN that processes glimpses from different locations of an image (Figure 2). To restrict the size of the receptive field (RF), one could divide the image into patches and process each separately with an ordinary CNN, but this is computationally expensive. Here, we used the "BagNet" architecture from Brendel and Bethge (2019), which enables us to compute representations with restricted RFs efficiently in a single pass, without scanning. Similar to Brendel and Bethge (2019), we use a ResNet architecture in which most 3×3 convolutions are replaced with 1×1 convolutions to limit the RF of the model, and strides are adjusted to obtain a higher-resolution output (Figure Supp. 4). Our model has a 77×77 pixel RF and computes a feature vector at each of a grid of locations separated by only 8 pixels, which for 224×224 pixel images yields a grid of possible attention locations. Before computing the logits, we apply a 1×1 convolution with ReLU activation to encode the representations into a 512-dimensional feature space ("what" features; name motivated by the visual system's ventral pathway (Goodale and Milner, 1992)), and then apply another 1×1 convolution to produce the 1000-dimensional logits tensor for classification. We find that introducing this bottleneck provides a small performance improvement over the original BagNet model; we term the modified network BagNet-lowD.
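To make the geometry concrete, the number of candidate glimpse positions can be counted as below (a sketch assuming a 77-pixel RF applied with stride 8 and no padding on 224×224 inputs; the exact count in the actual network depends on its padding):

```python
def attention_grid(image_size=224, rf=77, stride=8):
    """Count valid positions of an rf-sized window moved with the given
    stride along one dimension of a square image (no padding assumed)."""
    per_dim = (image_size - rf) // stride + 1
    return per_dim, per_dim * per_dim

per_dim, total = attention_grid()
```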
Attention network: This is a CNN that operates on the feature vectors from the representation network (Figure 2), and includes four convolutional layers that alternate between standard convolutions and dilated convolutions with rate 2, each followed by batch normalization and ReLU activation (see Figure Supp. 5). These layers reduce the feature dimensionality to produce location features, while the dilated convolutions widen the RF ("where" features; name motivated by the visual system's dorsal pathway (Goodale and Milner, 1992)). The "what" and "where" features are then concatenated and mixed using a linear convolution to produce a compact tensor with 512 features.
Saccader cell: This cell takes the mixed "what" and "where" features and produces a sequence of location predictions, where each element in the sequence corresponds to a target location (Figure Supp. 6). The cell includes a 2D state that keeps memory of the locations visited up to time t by placing a 1 at the corresponding position in the cell state. We use this state to prevent the network from returning to previously seen locations. The cell first selects relevant spatial locations from the h × w × d tensor of mixed features (where h and w are the height and width of the output features from the representation network and d is the dimensionality of the mixed features) and then selects feature channels based on the relevant locations; previously visited locations are masked out by adding a large negative number multiplied by the state. Next, the cell computes a weighted sum of the feature channels using a trainable vector and performs a spatial softmax to compute the policy, which reflects the model's distribution over glimpse locations. At test time, the model extracts the logits at each time step from the representation network at the location with the highest policy probability. The final prediction is obtained by averaging the extracted logits across all time steps.
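The cell's bookkeeping can be sketched as follows. This is our simplified reconstruction of the mechanism described above, not the published implementation: it keeps only the channel-weighted spatial softmax with masking of visited locations, and the feature shapes and trainable vector `v` are hypothetical.

```python
import numpy as np

def spatial_softmax(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()

def saccader_cell_step(mixed_features, state, v):
    """One glimpse: weight channels by v, mask visited cells with a large
    negative number, softmax over space, and record the chosen location.
    mixed_features: [h, w, d]; state: [h, w] with 1 at visited locations."""
    scores = mixed_features @ v          # weighted sum over feature channels
    scores = scores - 1e9 * state        # forbid returning to visited locations
    policy = spatial_softmax(scores)
    loc = np.unravel_index(np.argmax(policy), policy.shape)  # greedy at test time
    new_state = state.copy()
    new_state[loc] = 1.0                 # remember the visited location
    return policy, loc, new_state

rng = np.random.default_rng(0)
F = rng.normal(size=(5, 5, 8))           # hypothetical 5x5 grid of 8-d features
v = rng.normal(size=8)
state = np.zeros((5, 5))
_, loc1, state = saccader_cell_step(F, state, v)
_, loc2, state = saccader_cell_step(F, state, v)
```

Because the first chosen location is masked, the second step necessarily lands elsewhere.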
In terms of complexity, the Saccader model has 35,583,913 parameters, which is fewer than the 45,610,219 parameters in the DRAM model (Table Supp.1).
3.2 Training Procedure
In all our training, we divide the standard ImageNet ILSVRC 2012 training set into training and development subsets. We trained our model on the training subset and chose our hyperparameters based on the development subset. We follow common practice and report results on the separate ILSVRC 2012 validation set, which we do not use for training or hyperparameter selection. The goal of our training procedure is to learn a policy that predicts a sequence of visual attention locations useful for the downstream task (here, image classification) in the absence of location labels.
We performed a three-step training procedure using only the training class labels as supervision. First, we pretrained the representation network by optimizing the cross-entropy loss computed from the average logits across all possible locations, plus ℓ2 regularization on the model weights. That is, we minimize the cross-entropy between the target class and the softmax of the location-averaged logits, plus a weighted penalty on the squared norm of the representation network parameters, where the regularization weight is a hyperparameter.
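A minimal sketch of this objective, assuming standard softmax cross-entropy over the location-averaged logits (variable and parameter names are ours):

```python
import numpy as np

def pretrain_loss(per_location_logits, target, params, lam=1e-4):
    """Cross-entropy on the average of per-location logits plus an l2
    penalty on weights. per_location_logits: [num_locations, num_classes]."""
    avg = per_location_logits.mean(axis=0)
    # numerically stable log-softmax of the averaged logits
    log_probs = avg - avg.max() - np.log(np.exp(avg - avg.max()).sum())
    ce = -log_probs[target]
    l2 = lam * sum(np.sum(w ** 2) for w in params)
    return ce + l2

# With uniform logits over 3 classes and no weights, the loss reduces to log(3).
loss = pretrain_loss(np.zeros((5, 3)), target=0, params=[])
```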
Second, we use self-supervision to pretrain the location network (i.e., the attention network, mixing convolution, and Saccader cell) to emit glimpse locations ordered by descending value of the logits. Just as SGD biases neural network training toward solutions that generalize well, the purpose of this pretraining is to alter the training trajectory in a way that produces a better-performing model. Namely, we optimized a cross-entropy objective over glimpse locations in which the target at time t is the location with the t-th largest maximum logit, so the first target is the location with the largest maximum logit and the last target is the location with the smallest maximum logit. The model's output is the probability it assigns to attending to each location at time t given the input image and cell state. The optimized parameters are the weights of the attention network and Saccader cell; for this step, we fixed the representation network weights.
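The self-supervised targets can be sketched as a simple sort (our illustration; shapes are hypothetical):

```python
import numpy as np

def location_pretraining_targets(per_location_logits):
    """The target for glimpse t is the location with the t-th largest
    maximum logit, i.e., locations sorted by descending max logit."""
    max_logit = per_location_logits.max(axis=1)  # strongest class response per location
    return np.argsort(-max_logit)                # descending order of strength

logits = np.array([[0.1, 0.5],
                   [2.0, 0.3],
                   [0.9, 1.1]])
targets = location_pretraining_targets(logits)
```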
Finally, we trained the whole model to maximize the expected reward, where the reward represents whether the model's final prediction after 6 glimpses is correct. In particular, we combined the REINFORCE loss (Williams, 1992) for discrete policies with a cross-entropy loss and ℓ2 regularization. The parameter update is given by the gradient of this objective, where at each time step we sampled trajectories from a categorical distribution with location probabilities given by the policy, and subtracted a baseline equal to the average accuracy of the model computed on each minibatch. The role of the baseline and of averaging over Monte Carlo samples is to reduce variance in our gradient estimates (Sutton et al., 2000; Mnih et al., 2014).
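The effect of the baseline can be seen in a one-step toy version of the estimator (a sketch, not the training code; the policy and rewards below are made up):

```python
import numpy as np

def reinforce_estimate(policy, rewards, baseline, num_samples=10000, seed=0):
    """Score-function estimate of d E[r] / d logits for a one-step
    categorical policy, with a baseline subtracted from the reward."""
    rng = np.random.default_rng(seed)
    n = len(policy)
    grad = np.zeros(n)
    for _ in range(num_samples):
        a = rng.choice(n, p=policy)
        score = -policy.copy()           # d log pi(a) / d logits = e_a - pi
        score[a] += 1.0
        grad += (rewards[a] - baseline) * score
    return grad / num_samples

policy = np.array([0.5, 0.3, 0.2])
rewards = np.array([1.0, 0.0, 0.0])      # only location 0 gives a correct prediction
g = reinforce_estimate(policy, rewards, baseline=0.5)
```

The estimate pushes probability toward the rewarded location; subtracting the baseline leaves the expectation unchanged while shrinking the variance of each sample.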
In this section, we use the Saccader model to classify the ImageNet (ILSVRC 2012) dataset. ImageNet is an extremely large and diverse dataset that contains both coarse- and fine-grained class distinctions. To achieve high accuracy, a model must not only distinguish among superclasses, but also, e.g., among the 100 fine-grained classes of dogs. We show that the Saccader model learns a policy that yields high accuracy on the ImageNet classification task compared to other learned and engineered policies. Moreover, we show that our self-supervised pretraining for the location network helps achieve that high accuracy. Finally, we demonstrate that the attention locations proposed by our model are highly relevant to the classification task.
4.1 Saccader Makes Accurate Predictions on ImageNet
We trained the Saccader model on the ImageNet dataset. Our results show that, with only a few glimpses covering a fraction of the image, our model achieves accuracy close to CNN models that make predictions using the whole image (see Figure 3a,b and 4a).
We compared the policy learned by the Saccader model to alternative policies/models: a random policy, where visual attention locations are picked uniformly from the image; an ordered logits policy that uses the BagNet model to pick the top locations based on the largest class logits; policies based on simple edge detection algorithms (Sobel mean, Sobel variance), which pick the top locations based on strength of edge features computed using the per-patch mean or variance of the Sobel operator (Kanopoulos et al., 1988) applied to the input image; and the deep recurrent attention model (DRAM) from Sermanet et al. (2015); Ba et al. (2014).
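As an illustration, a Sobel-based engineered policy can be sketched as below (a simplified version; the actual patch size, stride, and image preprocessing are assumptions):

```python
import numpy as np

def sobel_magnitude(img):
    """Gradient magnitude from 3x3 Sobel filters (valid region only)."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    h, w = img.shape
    mag = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            patch = img[i:i + 3, j:j + 3]
            mag[i, j] = np.hypot((patch * kx).sum(), (patch * ky).sum())
    return mag

def sobel_mean_policy(img, patch, top_k):
    """Rank non-overlapping patches by mean Sobel magnitude; glimpse the top k."""
    mag = sobel_magnitude(img)
    scores = {}
    for i in range(0, mag.shape[0] - patch + 1, patch):
        for j in range(0, mag.shape[1] - patch + 1, patch):
            scores[(i, j)] = mag[i:i + patch, j:j + patch].mean()
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

img = np.zeros((18, 18))
img[4:14, 4:14] = 1.0                    # a bright square whose border has edges
locs = sobel_mean_policy(img, patch=4, top_k=3)
```

Patches on the square's border score highly, while the flat interior patch is never selected.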
With small numbers of glimpses, the random policy achieves relatively poor accuracy on ImageNet (Figures 3a and 4d). With more glimpses, accuracy slowly increases as the policy samples more locations and covers larger parts of the image. The random policy collects features from different parts of the image (Figure 4d), but many of these features are not very relevant. Edge detector-based policies (Figure 3a) also perform poorly.
The ordered logits policy starts off with accuracy much higher than a random policy, suggesting that the patches it initially picks are meaningful to classification. However, accuracy is still lower than the learned Saccader model (Figure 3a and 4b), and performance improves only slowly with additional glimpses. The ordered logits policy is able to capture some of the features relevant to classification, but it is a greedy policy that produces glimpses that cluster around a few top features (i.e., with low image coverage; Figure 4c). The learned Saccader policy on the other hand captures more diverse features (Figure 4a), leading to high accuracy with only a few glimpses.
The DRAM model also performs worse than the learned Saccader policy (Figure 3a). One major difference between the Saccader and DRAM policies is that the Saccader policy generalizes across time steps, whereas the accuracy of the DRAM model does not improve when it is allowed more glimpses than it was trained on. Sermanet et al. (2015) reported the same issue when using this model for fine-grained classification. In fact, increasing the number of glimpses beyond the number used for DRAM policy training leads to a drop in performance (Figure 3a), unlike the Saccader model, which generalizes to greater numbers of glimpses.
4.2 Saccader Attends to Locations Relevant to Classification
Glimpse locations identified by the Saccader model contain features that are highly relevant to classification. Figure 5a shows the accuracy of the model as a function of the fraction of image area covered by the glimpses. For the same image coverage, the Saccader model achieves the highest accuracy of all compared models. This demonstrates that the advantage of the Saccader model is not due to simple phenomena such as larger image coverage. Our results also suggest that the self-supervised pretraining procedure is necessary to achieve this performance (see Figure Supp. 3 for a comparison of Saccader models with and without location pretraining). Furthermore, we show that the attention network and Saccader cell are crucial components of our system. Removing the Saccader cell (i.e., using the BagNet-77-lowD ordered logits policy) yields poor results compared to the Saccader model (Figure 5a), and ablating the attention network greatly degrades performance (Figure Supp. 3). The wide receptive field (RF) of the attention network allows the Saccader model to better select locations to attend to. Note that this wide RF does not affect the interpretability of the model, since the classification path RF is still limited to 77×77 pixels.
We further investigated the importance of the attended regions for classification using an analysis similar to that proposed by Zeiler and Fergus (2014). We occluded the patches the model attends to (i.e., set the pixels to 0) and classified the resulting image using a pretrained ResNet-v2-50 model (Figures 5b and Supp.2b). Our results show that occluding patches selected by the Saccader model produces a larger drop in accuracy than occluding areas selected by other policies.
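The occlusion test has a simple form (a sketch; in the experiment the occluded image is passed to a pretrained ResNet-v2-50 classifier rather than inspected directly):

```python
import numpy as np

def occlude_glimpses(image, glimpse_locations, patch_size):
    """Zero out attended patches so a separate classifier can measure how
    much accuracy drops without them."""
    occluded = image.copy()
    for (r, c) in glimpse_locations:
        occluded[r:r + patch_size, c:c + patch_size] = 0.0
    return occluded

img = np.ones((8, 8))
out = occlude_glimpses(img, [(0, 0), (4, 4)], patch_size=4)
```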
4.3 Higher Classification Network Capacity and Better Data Quality Improve Accuracy Further
In previous sections, we used a single network to learn useful representations for both visual attention and classification. This approach is efficient and gives reasonably good classification accuracy. Here, we investigate whether further improvements in accuracy can be attained by expanding the capacity of the classification network and using higher-resolution images. In this section, we add a powerful NASNet classification network (Zoph et al., 2018) to the base Saccader model. The use of separate models for classification and localization is reminiscent of approaches used in object detection (Girshick et al., 2014; Uijlings et al., 2013; Girshick, 2015), yet here we do not have access to location labels.
We first applied the Saccader model to 224×224 pixel images to determine relevant glimpse locations. Then, we extracted the corresponding patches and used the NASNet, fine-tuned to operate on these patches, to make class predictions. Our results show that the Saccader-NASNet model increases accuracy further while still retaining the interpretability of the predictions (Figure 6).
We investigated whether accuracy can be improved even further by training on higher-resolution images. We applied the Saccader-NASNet model to patches extracted from 331×331 pixel high-resolution images from ImageNet. We resized these images to 224×224 and fed them to the Saccader model to identify visual attention locations. Then, we extracted the corresponding patches from the high-resolution images and fed them to the NASNet model for classification (the NASNet model was fine-tuned on these patches). The resulting accuracy was even higher than that obtained with the Saccader-NASNet model on 224×224 ImageNet images: with 6 glimpses, the model achieved higher top-1 and top-5 accuracy while processing only a small fraction of the image with the NASNet.
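Mapping glimpses chosen on the downsampled image back to the high-resolution original amounts to scaling coordinates and patch size (a sketch; the 77-pixel patch and the 224/331 sizes are assumptions based on the BagNet-77 RF and the standard NASNet input size):

```python
def map_glimpse_to_highres(row, col, patch, low=224, high=331):
    """Scale a glimpse's top-left corner and patch size from a low-resolution
    image to the corresponding region of the high-resolution original."""
    scale = high / low
    return round(row * scale), round(col * scale), round(patch * scale)

r, c, p = map_glimpse_to_highres(80, 40, 77)
```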
In this work, we propose the Saccader model, a novel approach to image classification with hard visual attention. We design an optimization procedure that uses pretraining on an auxiliary task with only class labels and no visual attention guidance. The Saccader model achieves good accuracy on ImageNet while covering only a fraction of the image. The locations to which the Saccader model attends are highly relevant to the downstream classification task, and occluding them substantially reduces classification performance. Since ImageNet is a representative benchmark for natural image classification (e.g., Kornblith et al. (2019) showed that accuracy on ImageNet predicts accuracy on other natural image classification datasets), we expect the Saccader model to perform well in practical applications involving natural images. Future work is necessary to determine whether the Saccader model is applicable to non-natural image domains (e.g., in the medical field).
Although Saccader outperforms other hard attention models, it still lags behind state-of-the-art feedforward models in terms of accuracy. Our results suggest that it may be possible to improve upon the performance achieved here by exploring larger classification model capacity and/or training on higher-quality images. Additionally, although it was previously suggested that foveation mechanisms might provide natural robustness against adversarial examples (Luo et al., 2015), the hard attention-based models that we explored here are not substantially more robust than traditional CNNs (see Appendix D). We ensure that the Saccader classification network is interpretable by limiting its input, but the attention network has access to the entire image, and thus the patch selection process remains difficult to interpret.
Here, we consider only the classification task; future work can potentially extend Saccader to many other vision tasks. The high accuracy obtained by our hard attention model and the quality of the learned visual attention policy open the door to the practical use of this interesting class of models, particularly in applications that require understanding of classification predictions.
We are grateful to Pieter-Jan Kindermans, Jonathon Shlens, and Jascha Sohl-Dickstein for useful discussions and helpful feedback on the manuscript. We thank Jaehoon Lee for help with computational resources.
- Sanity checks for saliency maps. In Advances in Neural Information Processing Systems, pp. 9505–9515.
- Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 39–48.
- Synthesizing robust adversarial examples. In Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 80, pp. 284–293.
- Multiple object recognition with visual attention. arXiv preprint arXiv:1412.7755.
- On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE 10 (7), e0130140.
- How to explain individual classification decisions. Journal of Machine Learning Research 11 (Jun), pp. 1803–1831.
- Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
- Approximating CNNs with bag-of-local-features models works surprisingly well on ImageNet. In International Conference on Learning Representations.
- I-POMDP: an infomax model of eye movement. In 2008 7th IEEE International Conference on Development and Learning, pp. 139–144.
- SCA-CNN: spatial and channel-wise attention in convolutional networks for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5659–5667.
- Human attention in visual question answering: do humans and deep networks look at the same regions? Computer Vision and Image Understanding 163, pp. 90–100.
- Attend, infer, repeat: fast scene understanding with generative models. In Advances in Neural Information Processing Systems, pp. 3225–3233.
- Attention branch network: learning of attention mechanism for visual explanation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587.
- Fast R-CNN. In The IEEE International Conference on Computer Vision (ICCV).
- Separate visual pathways for perception and action. Trends in Neurosciences 15 (1), pp. 20–25.
- Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.
- Attacking machine learning with adversarial examples. OpenAI blog. https://blog.openai.com/adversarial-example-research.
- DRAW: a recurrent neural network for image generation. arXiv preprint arXiv:1502.04623.
- Evaluating feature importance estimates. CoRR abs/1806.10758.
- Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141.
- GPipe: efficient training of giant neural networks using pipeline parallelism. arXiv preprint arXiv:1811.06965.
- Spatial transformer networks. In Advances in Neural Information Processing Systems, pp. 2017–2025.
- Attention is not explanation. In NAACL.
- Learn to pay attention. In International Conference on Learning Representations.
- Design of an image edge detection filter using the Sobel operator. IEEE Journal of Solid-State Circuits 23 (2), pp. 358–367.
- The (un)reliability of saliency methods. arXiv preprint arXiv:1711.00867.
- Do better ImageNet models transfer better? In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Learning to combine foveal glimpses with a third-order Boltzmann machine. In Advances in Neural Information Processing Systems, pp. 1243–1251.
- Learning what and where to attend with humans in the loop. In International Conference on Learning Representations.
- Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 19–34.
- An intriguing failing of convolutional neural networks and the CoordConv solution. In Advances in Neural Information Processing Systems, pp. 9605–9616.
- Knowing when to look: adaptive attention via a visual sentinel for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 375–383.
- Foveation-based mechanisms alleviate adversarial examples. arXiv preprint arXiv:1511.06292.
- Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1412–1421.
- Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations.
- Recurrent models of visual attention. In Advances in Neural Information Processing Systems, pp. 2204–2212.
- Explaining nonlinear classification decisions with deep Taylor decomposition. Pattern Recognition 65, pp. 211–222.
- Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. In Advances in Neural Information Processing Systems, pp. 3387–3395.
- Technical report on the CleverHans v2.1.0 adversarial examples library. arXiv preprint arXiv:1610.00768.
- Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548.
- Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99.
- Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7008–7024.
- Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence 1 (5), pp. 206.
- Eye movements and perception: a selective review. Journal of Vision 11 (5), pp. 9–9.
- Attention for fine-grained categorization. In International Conference on Learning Representations (ICLR 2015) workshop.
- Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control 37 (3), pp. 332–341.
- Striving for simplicity: the all convolutional net. arXiv preprint arXiv:1412.6806.
- Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057–1063.
- Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.
- Adversarial risk and the dangers of evaluating against weak attacks. In Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 80, pp. 5025–5034.
- Selective search for object recognition. International Journal of Computer Vision 104 (2), pp. 154–171.
- Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
- Foundations of vision. Sinauer Associates.
- Residual attention network for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156–3164.
- Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8 (3–4), pp. 229–256.
- CBAM: convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19.
- Feature denoising for improving adversarial robustness. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 501–509.
- Show, attend and tell: neural image caption generation with visual attention. In International Conference on Machine Learning, pp. 2048–2057.
- Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579.
- Image captioning with semantic attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4651–4659.
- Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pp. 818–833.
- A neural autoregressive approach to attention-based recognition. International Journal of Computer Vision 113 (1), pp. 67–79.
- Learning deep features for discriminative localization. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Visualizing deep neural network decisions: prediction difference analysis. arXiv preprint arXiv:1702.04595.
- Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8697–8710.
Appendix A Supplementary Figures
Appendix B Supplementary Tables
| Model | learning rate | batch size | epochs |
| --- | --- | --- | --- |

* Fine-tuning starting from a trained NASNet using crops of size identified by Saccader.
| Model | learning rate | batch size | epochs |
| --- | --- | --- | --- |

* Fine-tuning starting from a trained NASNet using crops of size identified by Saccader. Note the Saccader model operates on patches from ImageNet 224; the corresponding patches from ImageNet 331 are of size .
| Model | learning rate | batch size | epochs | |
| --- | --- | --- | --- | --- |
| DRAM (pretraining)* | 0.8 | 1024 | 120 | N/A |
| DRAM (pretraining)** | 0.001 | 1024 | 120 | N/A |

* First stage of classification-weight pretraining with a wide receptive field.
** Second stage of classification-weight pretraining with a narrow receptive field.
*** Full training starts from the model learned in the pretraining stage.
| Model | learning rate | batch size | epochs |
| --- | --- | --- | --- |
| Saccader (no pretraining)*** | 0.01 | 1024 | 120 |
| Saccader (no atten. network)**** | 0.01 | 1024 | 120 |

* Training for attention network weights; other weights are initialized and fixed from BagNet-77-lowD.
** Weights initialized from Saccader (pretraining).
*** Training starts from BagNet-77-lowD directly, without location pretraining.
**** Modified model with the attention network ablated; representation network initialized from trained BagNet-77-lowD.
Appendix C Optimization and hyperparameters
Models were trained on examples from ImageNet. We used examples to tune the optimization hyperparameters. Results were then reported on a separate test subset of examples. We optimized all models using Stochastic Gradient Descent with Nesterov momentum of . We preprocessed images by subtracting the mean and dividing by the standard deviation of the training examples. During optimization, we augmented the training data by taking a random crop within the image and then bicubically resizing it to the model's resolution. For all models except NASNet, we used a cosine learning-rate schedule with linear warmup for epochs. For NASNet, we trained with a batch size of , using linear warmup to a learning rate of over the first epochs, decaying the learning rate exponentially by a factor of /epoch thereafter, and taking an exponential moving average of the weights with a decay of . We used Tensor Processing Unit (TPU) accelerators for all training.
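The cosine schedule with linear warmup can be sketched as follows (an illustrative sketch only; the function name and arguments are ours, and the constants below are placeholders, not the values used for training):

```python
import math

def lr_schedule(step, total_steps, base_lr, warmup_steps):
    """Cosine learning-rate schedule with linear warmup (sketch)."""
    if step < warmup_steps:
        # Linear warmup from 0 to base_lr.
        return base_lr * step / warmup_steps
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

With `warmup_steps = 0`, this reduces to plain cosine decay.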
Convolutional neural networks
We show the architecture of the BagNet-77-lowD model used in the Saccader experiments in Figure Supp.4. For BagNet classification models, we optimized the typical cross-entropy objective:

$$\mathcal{L}(\theta) = -\log p(y \mid X; \theta),$$

where $X$ is the input image, $y$ are the class labels, and $\theta$ are the model parameters (see Table Supp.3 for hyperparameters). For the NASNet model, we additionally used label smoothing of 0.1, scheduled drop path over 250 epochs, and dropout of 0.7 on the penultimate layer.
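The label-smoothed cross-entropy used for NASNet can be illustrated with a small NumPy stand-in (our sketch, not the actual training code): smoothing $s$ spreads $s/K$ of the target mass uniformly over the $K$ classes, with the remaining $1 - s$ on the true class.

```python
import numpy as np

def smoothed_cross_entropy(logits, label, num_classes, smoothing=0.1):
    """Cross-entropy against a label-smoothed target distribution (sketch)."""
    # Numerically stable softmax log-probabilities.
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    # Smoothed target: s/K everywhere, plus (1 - s) extra on the true class.
    target = np.full(num_classes, smoothing / num_classes)
    target[label] += 1.0 - smoothing
    return -np.dot(target, log_probs)
```

With `smoothing=0.0` this is the standard cross-entropy; for a confidently correct prediction, smoothing strictly increases the loss.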
We used the DRAM model from Ba et al. and adapted changes similar to those proposed by Sermanet et al. In particular, the model consists of a powerful CNN (here, ResNet-v2-50) that processes multi-resolution crops concatenated along the channel dimension. The high-resolution crop has the smallest receptive field, the lowest-resolution crop has a receptive field of the full image size, and the middle-resolution crop has a receptive field halfway between the two. Location information is specified by adding 2 channels with coordinate information, similar to Liu et al. [2018b]. The features computed by the CNN are then fed to an LSTM classification layer of size 1024. The output of the classification LSTM is fed to another LSTM layer of size 1024 for location prediction. The output of the location LSTM is passed to a fully connected layer of 1024 units with ReLU activation, then to a 2D fully connected layer with activation, which represents the normalized coordinates of the glimpse at the next time step.
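The two coordinate channels can be constructed as below (a sketch assuming NHWC layout and coordinates normalized to [−1, 1]; the exact convention in Liu et al. [2018b] may differ):

```python
import numpy as np

def add_coord_channels(x):
    """Append two coordinate channels (CoordConv-style) to a batch of images.

    x has shape (batch, height, width, channels); the output gains one
    channel of normalized row coordinates and one of column coordinates.
    """
    b, h, w, _ = x.shape
    # Normalized row/column coordinates in [-1, 1].
    ys = np.linspace(-1.0, 1.0, h)
    xs = np.linspace(-1.0, 1.0, w)
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    coords = np.stack([yy, xx], axis=-1)               # (h, w, 2)
    coords = np.broadcast_to(coords, (b, h, w, 2))
    return np.concatenate([x, coords], axis=-1)
```

This gives every spatial position an explicit, translation-dependent signature, which the CNN can use to encode glimpse location.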
The best DRAM model from Sermanet et al. and Ba et al. uses multi-resolution glimpses with different receptive field sizes to enhance classification performance. This approach compromises the interpretability of the model, as the lowest-resolution glimpse covers nearly the entire image. To make the DRAM model comparably interpretable to the Saccader model, we limited the input to the classification component to the high-resolution glimpses and blocked the wide-receptive-field middle and low resolutions by feeding the spatial average per channel instead. On the other hand, we allowed the location prediction component of the DRAM model to have a wide receptive field (the size of the whole image) by providing all three (high, middle, and low) resolutions to the location LSTM layer.
To enhance the accuracy of the DRAM model, we extended the pretraining of the classification weights to two stages (120 epochs with a wide receptive field of size , followed by 120 epochs with a small receptive field of size ). Each stage combines all the different glimpses (similar to Figure 4 of Sermanet et al.), except that we averaged the logits to compute a single cross-entropy loss instead of using a separate cross-entropy loss for each combination of views. During pretraining, we unrolled the model for 2 time steps using a uniform random policy. We then trained all the weights with the hybrid loss specified in Ba et al.
We used a REINFORCE loss weight of 0.1 and sampled locations from a Gaussian distribution with mean given by the network output and standard deviation of 0.1 (we also tried a REINFORCE loss weight of 1.0 and a standard deviation of 0.05). We used accuracy as a baseline to center the reward, and 2 Monte Carlo samples to reduce the variance of the gradient estimates. We tuned the regularization weight and found that no regularization of the location weights gave the best performance. Table Supp.4 summarizes the hyperparameters used in optimization.
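The REINFORCE surrogate with a centered reward and Monte Carlo sampling can be sketched as below (function and argument names are ours; in the actual model, the gradient flows through the log-density of the sampled location with respect to the predicted mean, via automatic differentiation):

```python
import numpy as np

rng = np.random.default_rng(0)

def reinforce_surrogate(mean_loc, reward_fn, baseline, sigma=0.1, num_samples=2):
    """Monte Carlo REINFORCE surrogate for a Gaussian location policy (sketch).

    `reward_fn` maps a sampled 2-D location to a scalar reward
    (e.g. classification correctness); `baseline` centers the reward.
    """
    losses = []
    for _ in range(num_samples):
        # Sample a glimpse location around the network's predicted mean.
        loc = mean_loc + sigma * rng.standard_normal(2)
        # Gaussian log-density of the sample (dropping additive constants).
        log_prob = -np.sum((loc - mean_loc) ** 2) / (2 * sigma ** 2)
        # Centered reward (advantage) reduces gradient variance.
        advantage = reward_fn(loc) - baseline
        losses.append(-advantage * log_prob)
    return np.mean(losses)
```

When the baseline exactly matches a constant reward, the surrogate (and hence the policy gradient) vanishes, which is the variance-reduction effect the baseline provides.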
Appendix D Robustness to adversarial examples
Luo et al. previously suggested that foveation-based vision mechanisms enjoy natural robustness to adversarial perturbations. This hypothesis is attractive because it provides a natural explanation for the apparent robustness of human vision to adversarial examples. However, no attention-based model we tested is meaningfully robust in a white-box setting.
We generated adversarial examples using both the widely used projected gradient descent (PGD) method [Madry et al., 2018] and the non-gradient-based simultaneous perturbation stochastic approximation (SPSA) method [Spall, 1992] suggested by Uesato et al. We used the implementations provided as part of the CleverHans library [Papernot et al., 2018]. For both attacks, we fixed the size of the perturbation to with respect to the norm, which yields images that are perceptually indistinguishable from the originals. For PGD, we used a step size of and performed up to 300 iterations. Note that, although the attention mechanisms of the DRAM and Saccader models are non-differentiable, the classification network provides a gradient with respect to the input, which is used when performing PGD. For SPSA, we used the hyperparameters specified in Appendix B of Uesato et al. Bicubic resampling can result in pixel values outside the valid range, so we clipped pixels to this range before performing attacks.
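The PGD attack amounts to iterated signed-gradient ascent with projection back onto an L-infinity ball around the original image (a NumPy sketch with hypothetical names, not the CleverHans API; `grad_fn` is assumed to return the loss gradient with respect to the input):

```python
import numpy as np

def pgd_attack(x, grad_fn, epsilon, step_size, num_steps):
    """Projected gradient descent under an L-infinity constraint (sketch)."""
    x_adv = x.copy()
    for _ in range(num_steps):
        # Ascend the loss along the gradient sign.
        x_adv = x_adv + step_size * np.sign(grad_fn(x_adv))
        # Project back into the epsilon ball around the original image.
        x_adv = np.clip(x_adv, x - epsilon, x + epsilon)
        # Keep pixel values in the valid range.
        x_adv = np.clip(x_adv, 0.0, 1.0)
    return x_adv
```

The two clipping steps are what make the attack "projected": the perturbation can never exceed epsilon per pixel, and the result stays a valid image.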
The SPSA attack is computationally expensive. With the selected hyperparameters, the attack fails for a given example only after computing predictions for inputs. Thus, we restricted our analysis to 3906 randomly chosen examples from the ImageNet validation set. We report clean accuracy after clipping and perturbed accuracy on this subset in Table Supp.6.
| Model | Clean Acc. | SPSA Acc. | PGD Acc. |
| --- | --- | --- | --- |
For all models we investigated, one or both attacks reduced accuracy to <1%. Thus, at least when used in isolation, hard attention does not confer meaningful adversarial robustness. For comparison with approaches that pursue robustness through training rather than through model class or architecture, the state-of-the-art defense proposed by Xie et al. achieves 42.6% accuracy at a perturbation of , and their adversarial training baseline achieves 39.2%.
It is possible that additional robustness could be achieved by using a stochastic rather than deterministic attention mechanism, as implied by Luo et al. However, Luo et al. only tested the transferability of an adversarial example crafted for one crop to other crops, rather than constructing adversarial examples that fool models across many crops. Athalye et al. show that it is possible to construct adversarial examples for ordinary feed-forward CNNs that are robust to a wide variety of image transformations, including rescaling, rotation, and translation. Moreover, the non-differentiability of the attention mechanism evidently does not in itself improve robustness, and may hinder adversarial training.