Information Bottleneck Approach to Spatial Attention Learning

08/07/2021
by   Qiuxia Lai, et al.
The Chinese University of Hong Kong

The selective visual attention mechanism in the human visual system (HVS) restricts the amount of information to reach visual awareness for perceiving natural scenes, allowing near real-time information processing with limited computational capacity [Koch and Ullman, 1987]. This kind of selectivity acts as an 'Information Bottleneck (IB)', which seeks a trade-off between information compression and predictive accuracy. However, such information constraints are rarely explored in the attention mechanism for deep neural networks (DNNs). In this paper, we propose an IB-inspired spatial attention module for DNN structures built for visual recognition. The module takes as input an intermediate representation of the input image, and outputs a variational 2D attention map that minimizes the mutual information (MI) between the attention-modulated representation and the input, while maximizing the MI between the attention-modulated representation and the task label. To further restrict the information bypassed by the attention map, we quantize the continuous attention scores to a set of learnable anchor values during training. Extensive experiments show that the proposed IB-inspired spatial attention mechanism can yield attention maps that neatly highlight the regions of interest while suppressing backgrounds, and bootstrap standard DNN structures for visual recognition tasks (e.g., image classification, fine-grained recognition, cross-domain classification). The attention maps are interpretable for the decision making of the DNNs as verified in the experiments. Our code is available at https://github.com/ashleylqx/AIB.git.

1 Introduction

Human beings can process vast amounts of visual information in parallel through the visual system [14], because the attention mechanism selectively attends to the most informative parts of visual stimuli rather than the whole scene [7]. A recent trend is to incorporate attention mechanisms into deep neural networks (DNNs) so that they automatically focus on task-relevant parts of the input. Attention mechanisms have benefited sequence modeling tasks as well as a wide range of computer vision tasks, boosting performance and improving interpretability.

The attention modules in CNNs can be broadly categorized into channel-wise attention and spatial attention, which learn channel-wise [10] and spatially-aware [19] attention scores for modulating the feature maps, respectively. As channel-wise attention inevitably loses the spatial information essential for localizing the important parts, in this paper we focus on the spatial attention mechanism. Spatial attention learning methods can be query-based or module-based. Query-based attention, or ‘self-attention’, generates the attention scores based on the similarity/compatibility between the query and the key content. Though it has facilitated various computer vision tasks, such dense relation measurement leads to heavy computational overhead [9], which significantly limits its applicability. Module-based attention directly outputs an attention map using a learnable network module that takes an image/feature as input. Such an end-to-end inference structure is more efficient than query-based attention, and has been shown to be effective for various computer vision tasks. Existing spatial attention mechanisms trained for a given task typically generate the attention maps by considering the contextual relations of the inputs. Although the resulting attention maps benefit the tasks, the attention learning process fails to consider the inherent information constraints of the attention mechanism in the HVS, i.e., to ensure that the information passed by the attention maps has minimal redundancy while remaining sufficient for the task.

To explicitly incorporate the information constraints of the attention mechanism in the HVS into attention learning in DNNs, in this paper we propose an end-to-end trainable spatial attention mechanism inspired by the ‘Information Bottleneck (IB)’ theory. The whole framework is derived from an information-theoretic argument based on the IB principle. The resulting variational attention maps effectively filter out task-irrelevant information, which reduces the information-processing load while maintaining performance. To further restrict the information passed by the attention map, an adaptive quantization module is incorporated to round each attention score to the nearest anchor value. In this way, the previously continuous attention values are replaced by a finite number of anchor values, which further compresses the information filtered by the attention maps. To quantitatively compare the interpretability of the proposed attention mechanism with others, we start from the definition of interpretability with respect to model decision making, and measure interpretability in a general statistical sense based on the attention consistency between original samples and modified samples whose modifications do not alter the model decisions.

In summary, our contributions are three-fold:

  • We propose an IB-inspired spatial attention mechanism for visual recognition, which yields variational attention maps that minimize the MI between the attention-modulated representation and the input while maximizing the MI between the attention-modulated representation and the task label.

  • To further filter out irrelevant information, we design a quantization module to round the continuous attention scores to several learnable anchor values during training.

  • The proposed attention mechanism is shown to be more interpretable for the decision making of the DNNs compared with other spatial attention models.

Extensive experiments validate the theoretical intuitions behind the proposed IB-inspired spatial attention mechanism, and show improved performances and interpretability for visual recognition tasks.

2 Related Work

2.1 Spatial Attention Mechanism in DNNs

Attention mechanisms enjoy great success in sequence modeling tasks such as machine translation [3], speech recognition [5], and image captioning [27]. Recently, they have also been shown to benefit a wide range of computer vision tasks, boosting the performance and improving the interpretability of general CNNs. The attention modules in CNNs fall into two broad categories, namely channel-wise attention and spatial attention. The former learns channel-wise attention scores to modulate the feature maps by reweighting the channels [10]. The latter learns a spatial probabilistic map over the input to enhance/suppress each 2D location according to its relative importance w.r.t. the target task [19]. In this section, we focus on spatial attention modules.

Query-based/self-attention. Originating from query-based tasks [3, 27], this kind of attention is generated by measuring the similarity/compatibility between the query and the key content. Seo et al. [17] use a one-hot encoding of the label to query the image and generate progressive attention for attribute prediction. Jetley et al. [11] utilize the learned global representation of the input image as a query and compute its compatibility with the local representation at each 2D spatial location. Hu et al. [hu2019local] adaptively determine the aggregation weights by considering the compositional relationship of visual elements in local areas. Query-based attention considers dense relations in space, which can improve the discriminative ability of CNNs. However, the non-negligible computational overhead limits its usage to low-dimensional inputs, and typically requires significant downsampling of the original images.

Module-based attention. The spatial attention map can also be directly learned using a softmax-/sigmoid-based network module that takes an image/feature as input and outputs an attention map. Being effective and efficient, this kind of attention module has been widely used in computer vision tasks such as action recognition [18] and image classification [26]. Our proposed spatial attention module belongs to this line of research, and we focus on improving the performance of visual recognition over baselines without attention mechanisms.

Previous spatial attention learning works focus on the relations among non-local or local contexts to measure the relative importance of each location, and do not consider the information in the feature filtered by the attention maps. We instead take inspiration from the IB theory [22] which maintains a good trade-off between information compression and prediction accuracy, and propose to learn spatial attention that minimizes the MI between the masked feature and the input, while maximizing the MI between the masked feature and the task label. Such information constraints could also help remove the redundant information from the input features compared with conventional relative importance learning mechanisms.

Figure 1: Graphical model of the probabilistic neural network with the IB-inspired spatial attention mechanism (§3.2).
Figure 2: Framework of the IB-inspired spatial attention mechanism for visual recognition. The input is passed through an attention module to produce a continuous variational attention map, which is quantized to a discrete attention map using a set of learnable anchor values. The quantized attention map and the input are then encoded to a latent vector and decoded to a prediction. The loss function is given in Eq. (10). See §3.2 and §3.3.

2.2 IB-Inspired Mask Generation

IB theory has been explored in tasks such as representation learning [2, 1] and domain adaptation [6]. Here, we focus on IB-inspired mask generation methods, which yield additive [16] or multiplicative masks [1, 21, 29] to restrict the information that flows to the successive layers.

Information Dropout [1] multiplies the layer feature with a learnable noise mask to control the flow of information. This is equivalent to optimizing a modified cost function that approximates the IB Lagrangian of [22], allowing the learning of a sufficient, minimal, and invariant representation for classification tasks. InfoMask [21] filters out irrelevant background signals using masks optimized under the IB objective, which improves the accuracy of localizing chest disease. Zhmoginov et al. [29] generate IB-inspired Boolean attention masks for image classification models to completely prevent the masked-out pixels from propagating any information to the model output. Schulz et al. [16] adopt the IB concept for interpreting the decision making of a pre-trained neural network, by adding noise to intermediate activation maps to restrict and quantify the flow of information. The intensity of the noise is optimized to minimize the information flow while maximizing the classification score.

In this paper, we propose an IB-inspired spatial attention mechanism based on a new variational approximation of the IB principle, derived for a neural network incorporated with an attention module (Eq. (17)). Compared with the above works, our optimization objective is not derived from the original IB Lagrangian of [22], and the resulting attention mechanism is therefore parameterized with a different network architecture. Besides, we focus on learning continuous spatial attention instead of noise masks [1, 16], which is also different from the Boolean attention in [29].

3 Methodology

3.1 Overview

We derive the framework of the proposed spatial attention mechanism from the IB principle applied to a neural network incorporated with an attention module. We assume that the input, the output, the spatial attention map, and the latent representation (obtained from the input and the attention map) are random variables following the joint conditional distribution of the graphical model shown in Fig. 6. Based on this joint distribution and the IB principle, we derive the proposed spatial attention framework. The resulting framework is shown in Fig. 2, and consists of an attention module, a set of learnable anchor values for quantization, an encoder, and a decoder.

The input is first passed through the attention module to produce a continuous variational attention map, which is quantized to a discrete attention map using the learnable anchor values. The quantized attention map and the input are then encoded to a latent vector, which is decoded to a prediction. The whole framework is trained end-to-end using the loss function defined in Eq. (10). The derivation of the framework is presented in §3.2, with more details in the supplementary material. The attention score quantization process is described in §3.3.
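To make the data flow concrete, the following PyTorch-style sketch outlines one possible forward pass of the framework. The function names, tensor shapes, and the use of element-wise multiplication for attention modulation are illustrative assumptions rather than the released implementation; see the official repository for the actual code.

```python
import torch

def quantize(a, anchors):
    # Map each attention score to its nearest anchor value (Eq. (9)); the
    # straight-through gradient is handled in the fuller sketch of Section 3.3.
    dist = (a.unsqueeze(-1) - anchors.view(1, 1, 1, 1, -1)).abs()
    return anchors[dist.argmin(dim=-1)]

def aib_forward(x, attention_module, anchors, encoder, decoder):
    # x: intermediate feature extracted from the input image, shape (B, C, H, W).
    a_cont = attention_module(x)                 # continuous variational attention map, (B, 1, H, W)
    a_qt = quantize(a_cont, anchors)             # discrete attention map via nearest anchors
    z_mu, z_std = encoder(x * a_qt)              # encode the attention-modulated feature (assumed Gaussian output)
    z = z_mu + z_std * torch.randn_like(z_std)   # reparametrization trick
    logits = decoder(z)                          # class prediction
    return logits, a_cont, a_qt, (z_mu, z_std)
```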

3.2 IB-inspired Spatial Attention Mechanism

We introduce the IB principle [22] to learn spatial attention that minimizes the MI between the masked representation and the input, while maximizing the MI between the masked representation and the class label. Different from [2], which targets a standard learning framework, we derive new variational bounds on MI within an attentive framework for visual recognition, leading to an IB-inspired spatial attention mechanism.

Let random variables denote the input, the output, the spatial attention map, and the latent representation obtained from the input and the attention map, respectively. The MI between the latent representation and the output is defined as:

(1)

We introduce a variational distribution to approximate the intractable conditional distribution of the output given the latent representation. Since the Kullback-Leibler (KL) divergence is always non-negative, we have:

(2)

which leads to

(3)

where the entropy of the output is independent of the optimization procedure and can thus be ignored. By leveraging the conditional independence structure of the graphical model (see Fig. 6), and ignoring this entropy term, the new variational lower bound is as follows:

(4)

where the three factors correspond to the attention module, the encoder, and the variational decoder. The MI between the latent representation and the output can thus be maximized by maximizing this variational lower bound.

Next, to minimize the MI between the attention-modulated representation and the input, we consider the MI between the latent representation and the pair of input and attention map, and obtain the following upper bound:

(5)

where a prior distribution over the latent representation replaces its intractable marginal. In our experiments, we use a spherical Gaussian prior.

By combining Eq. (4) and (5), we obtain the lower bound of the attentive variational information bottleneck:

(6)

which offers an IB-inspired spatial attention mechanism for visual recognition. Here, the trade-off coefficient controls the balance between information compression and prediction accuracy.
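For reference, writing $x$, $y$, $a$, and $z$ for the input, label, attention map, and latent representation, and $\beta$ for the trade-off coefficient (a notational assumption on our part), the combined objective has the following standard variational-IB form. This sketch follows the structure of the deep variational IB [2] with the attention module inserted, and may differ in detail from the paper's exact Eq. (6):

```latex
\begin{align}
I(z;y) - \beta\, I(z;x,a)
  \;\ge\;& \;\mathbb{E}_{p(x,y)}\, \mathbb{E}_{p_{\theta}(a\mid x)}\,
          \mathbb{E}_{p_{\phi}(z\mid x,a)}\bigl[\log q_{\psi}(y\mid z)\bigr] \\
        &-\beta\, \mathbb{E}_{p(x)}\, \mathbb{E}_{p_{\theta}(a\mid x)}
          \Bigl[\mathrm{KL}\bigl(p_{\phi}(z\mid x,a)\,\big\|\, r(z)\bigr)\Bigr], \nonumber
\end{align}
```

where $p_{\theta}(a\mid x)$ denotes the attention module, $p_{\phi}(z\mid x,a)$ the encoder, $q_{\psi}(y\mid z)$ the variational decoder, and $r(z)$ the spherical Gaussian prior.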

Following [2], we approximate the joint data distribution with the empirical distribution over the training samples, where the attention maps are sampled from the attention module. The approximated lower bound can thus be written as:

(7)

Similar to [2], we let the attention module and the encoder output the parameters of Gaussian distributions through network modules. To enable back-propagation, we use the reparametrization trick [12] and express the sampled attention map and latent vector as deterministic functions of their inputs and auxiliary Gaussian noise variables. The loss function is:

(8)

The first term of the loss function is the negative log-likelihood of the prediction, where the label is predicted from the latent encoding, which is in turn generated from the input and its attention map. Minimizing this term leads to maximal prediction accuracy. The second term is the KL divergence between the distribution of the latent encoding and the prior; its minimization enables the model to learn an IB-inspired spatial attention mechanism for visual recognition. This differs from the regular IB principle [22, 2], which addresses representation learning without an attention module.
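A minimal PyTorch-style sketch of this two-term loss is given below, assuming the encoder outputs a diagonal Gaussian over the latent vector and the prior is a standard spherical Gaussian. The function name, the closed-form KL, and the placeholder value of the trade-off coefficient are our own illustrative choices, not the repository's API.

```python
import torch
import torch.nn.functional as F

def aib_loss(logits, z_mu, z_std, labels, beta=1e-3):
    """Two-term loss in the spirit of Eq. (8): prediction NLL + beta * KL(posterior || N(0, I)).

    beta is the information trade-off coefficient; its value here is a placeholder.
    """
    # Negative log-likelihood of the prediction (first term).
    nll = F.cross_entropy(logits, labels)

    # Closed-form KL divergence between the diagonal Gaussian posterior and the
    # spherical Gaussian prior (second term).
    kl = 0.5 * (z_mu.pow(2) + z_std.pow(2) - 2 * torch.log(z_std) - 1).sum(dim=1).mean()

    return nll + beta * kl
```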

 

Image Class. Fine-grained Recog. Cross-domain Class.
Model CIFAR-10 CIFAR-100 CUB-200-2011 SVHN STL10-train STL10-test

– Existing architectures –
VGG [20] 7.77 30.62 34.64 4.27 54.66 55.09
VGG-GAP [30] 9.87 31.77 29.50 5.84 56.76 57.24
VGG-PAN [17] 6.29 24.35 31.46 8.02 52.50 52.79

VGG-DVIB [2]
  4.64   22.88   23.94   3.28   51.40   51.60

WRN [28]
4.00 19.25 26.50


– Architectures with attention –
VGG-att2 [11] 5.23 23.19 26.80 3.74 51.24 51.71
VGG-att3 [11] 6.34 22.97 26.95 3.52 51.58 51.68
WRN-ABN [8]   3.92 18.12   2.88   50.90   51.24


VGG-aib (ours)
4.28 21.56 23.73 3.24 50.64 51.24

VGG-aib-qt (ours)
4.10 20.87 21.83 3.07 50.44 51.16

WRN-aib (ours)
3.60 17.82 17.26 2.76 50.08 50.84
WRN-aib-qt (ours) 3.43 17.64 15.50 2.69 50.34 50.49

Table 1: Top-1 error (%) for image classification (§4.1), fine-grained recognition (§4.2), and cross-domain classification (§4.3). Some entries are re-implemented or re-trained by us; the other values are taken from the original papers. The best values for each backbone are in bold.

3.3 Attention Score Quantization

We define the continuous attention space and the quantized attention space over the spatial grid of the attention map, whose width and height we denote by W and H. As shown in Fig. 2, the input is passed through the attention module to produce a continuous variational attention map A, which is mapped to a discrete attention map A_q through a nearest-neighbour look-up among a set of K learnable anchor values {e_k}, given by:

A_q(i, j) = e_k,   with   k = argmin_{k'} |A(i, j) − e_{k'}|,        (9)

where (i, j) are spatial indices. In this way, each score in the continuous attention map is mapped to exactly one of the K anchor values. The quantized attention map and the input are then encoded into a latent representation, which is finally mapped to the prediction probabilities over the classes. The complete model parameters include those of the attention module, the encoder, the decoder, and the anchor values {e_k}.

As the operation in Eq. (9) is not differentiable, we resort to the straight-through estimator [4] and approximate the gradient of A using the gradient of A_q. Though simple, this estimator works well in our experiments. Concretely, in the forward pass the quantized attention map is passed to the encoder, while in the backward pass the gradient of A_q is passed to the attention module unaltered. Such a gradient approximation makes sense because A and A_q share the same dimensional space, and the gradient of A_q provides useful information on how the attention module could change its output to minimize the loss function defined in Eq. (19).

The overall loss function is thus defined as in Eq. (10), which extends Eq. (19) with two terms: a quantization objective that uses the stop-gradient operator [23] to pull the anchor values toward the attention map, and a commitment term that forces the attention module to commit to the anchor values. The weights of these two terms, together with the information trade-off coefficient, are set empirically.

(10)

An illustration of the whole framework is shown in Fig. 2. In practice, we first use a feature extractor, e.g., VGGNet [20], to extract an intermediate feature from the input, and then learn the attention map and the variational encoding from this intermediate feature rather than from the raw input.
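The quantization step, its straight-through gradient, and the two auxiliary terms of Eq. (10) can be sketched in PyTorch as below. The class name, anchor initialization range, and loss weights are illustrative assumptions in the style of [23], not the exact released configuration.

```python
import torch
import torch.nn as nn

class AttentionQuantizer(nn.Module):
    """Nearest-anchor quantization of a continuous attention map (Eq. (9)),
    with a straight-through gradient estimator [4] and the two auxiliary
    terms of Eq. (10). Hyper-parameter values below are placeholders."""

    def __init__(self, num_anchors=8, w_quant=0.25, w_commit=0.25):
        super().__init__()
        # Anchors initialized by evenly dividing an assumed attention range [0, 1].
        self.anchors = nn.Parameter(torch.linspace(0.0, 1.0, num_anchors))
        self.w_quant = w_quant
        self.w_commit = w_commit

    def forward(self, a):
        # a: continuous attention map of shape (B, 1, H, W).
        dist = (a.unsqueeze(-1) - self.anchors.view(1, 1, 1, 1, -1)).abs()
        a_qt = self.anchors[dist.argmin(dim=-1)]          # nearest anchor per location

        # Quantization objective: pull anchors toward the (detached) attention map.
        loss_quant = self.w_quant * (a_qt - a.detach()).pow(2).mean()
        # Commitment term: keep the attention module close to the chosen anchors.
        loss_commit = self.w_commit * (a - a_qt.detach()).pow(2).mean()

        # Straight-through estimator: the forward pass uses a_qt, while the
        # backward pass copies the gradient of a_qt to a unchanged.
        a_st = a + (a_qt - a).detach()
        return a_st, loss_quant + loss_commit
```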

 


Model
Spatial CIFAR-10 CIFAR-100 Frequency CIFAR-10 CIFAR-100


VGG-att3
color 91.46 79.61 37.71   7.12 52.92 39.93 25.40 14.56 83.06   3.69
svhn 90.97 75.70 36.74   6.51 82.08 63.16 39.07 21.03 49.61 50.90


VGG-aib
color 99.22 99.69 93.78 20.59 98.88 98.56 95.92 22.70 99.91 99.78
svhn 98.59 97.52 97.35 72.86 98.73 96.70 96.69 93.02 53.26 79.13


VGG-aib-qt
color 99.26 99.79 91.10 18.36 99.18 99.30 94.59 20.54 99.96 99.60
svhn 98.65 98.04 97.85 78.82 99.12 97.64 97.21 94.79 73.52 79.70





WRN-ABN
color 90.76 65.01 33.89 13.24 89.74 61.38 30.56   9.67 38.14 14.04
svhn 90.40 68.41 38.46 16.55 92.77 63.67 30.57   9.47 36.33 25.43

WRN-aib
color 99.95 93.94 45.17   6.69 99.86 95.35 64.27 17.15 78.96 90.44
svhn 99.94 97.34 81.98 47.65 99.90 97.77 89.37 62.64 84.58 94.18

WRN-aib-qt
color 99.84 84.18 28.94   4.64 99.97 97.07 69.51 25.80 72.05 94.13
svhn 99.91 96.05 70.75 28.35 99.95 98.52 89.00 63.10 76.53 93.17

Table 2: Interpretability scores under spatial- and frequency-domain modifications on CIFAR-10 and CIFAR-100. Columns correspond to different window sizes of the modified region (spatial domain) and to the radius used in the frequency domain. See §4.4 for more details. Some entries are re-implemented by us. Best values in bold.

4 Experiments

To demonstrate the effectiveness of the proposed IB-inspired spatial attention, we conduct extensive experiments on various visual recognition tasks, including image classification (§4.1), fine-grained recognition (§4.2), and cross-dataset classification (§4.3), and achieve improved performance over baseline models. In §4.4, we compare the interpretability of our attention maps with those of other attention models both qualitatively and quantitatively. We conduct an ablation study in §4.5. More details are shown in the supplementary.

4.1 Image Classification

Datasets and models. CIFAR-10 [15] contains 32×32 natural images from 10 classes, split into 50,000 training and 10,000 test images. CIFAR-100 [15] is similar to CIFAR-10, except that it has 100 classes. We extend standard architectures, VGG and the wide residual network (WRN), with the proposed IB-inspired spatial attention, and train the whole framework from scratch on CIFAR-10 and CIFAR-100, respectively. We use the original input images with data augmentation (random flipping and cropping with 4-pixel padding).

Results on CIFAR. As shown in Table 1, the proposed attention mechanism achieves noticeable performance improvements over standard architectures and existing attention mechanisms such as GAP, PAN, VGG-att (https://github.com/SaoYan/LearnToPayAttention), and ABN (https://github.com/machine-perception-robotics-group/attention_branch_network). Specifically, VGG-aib reduces the error of the baseline VGG model on both CIFAR-10 and CIFAR-100, and the quantized attention model VGG-aib-qt further decreases the errors over VGG-aib on both datasets. Compared with other VGG-backboned attention mechanisms, ours also achieves superior classification performance. Similarly, WRN-aib and WRN-aib-qt decrease the top-1 errors on CIFAR-10 and CIFAR-100.

4.2 Fine-grained Recognition

Datasets and models. CUB-200-2011 (CUB) [24] contains training and test bird images from 200 fine-grained classes. SVHN consists of training, test, and extra digit images cropped from house numbers in street-view images. For CUB, we perform the same preprocessing as [11]. For SVHN, we apply the same data augmentation as for CIFAR.

Results on CUB. The proposed VGG-aib and VGG-aib-qt achieve smaller classification errors than baselines built on VGG, including GAP, PAN, and VGG-att. In particular, both VGG-aib and VGG-aib-qt outperform VGG-att by clear margins (Table 1).

Results on SVHN. Our proposed method achieves lower errors among VGG-backboned models, and comparable errors among WRN-backboned methods.

4.3 Cross-domain Classification

Datasets and models. STL-10 contains 5,000 training and 8,000 test images of 96×96 resolution organized into 10 classes. Following [11], the models trained on CIFAR-10 are directly tested on the STL-10 training and test sets without fine-tuning to verify their generalization ability.

Results on STL-10. As shown in Table 1, the proposed attention model outperforms other competitors with the VGG backbone, and achieves comparable performance with the WRN backbone.

4.4 Interpretability Analysis

An interpretable attention map is expected to faithfully highlight the regions responsible for the model decision. Thus, an interpretable attention mechanism should yield consistent attention maps for an original input and a modified input whenever the modification does not alter the model decision. Here, we quantify the interpretability of an attention mechanism by the ‘interpretability score’, i.e., the percentage of attention-consistent samples among prediction-consistent samples under modifications in the spatial or frequency domain. Attention consistency is measured by the cosine similarity between the two flattened and normalized attention maps.
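Under this definition, the interpretability score can be computed as sketched below; the similarity threshold used to declare two attention maps consistent is a placeholder assumption, since its exact value is not specified here.

```python
import torch
import torch.nn.functional as F

def interpretability_score(att_orig, att_mod, pred_orig, pred_mod, sim_thresh=0.9):
    """Percentage of attention-consistent samples among prediction-consistent ones.

    att_orig, att_mod: attention maps of shape (N, H, W) for original / modified inputs.
    pred_orig, pred_mod: predicted class indices of shape (N,).
    sim_thresh: cosine-similarity threshold for attention consistency (placeholder).
    """
    # Keep only samples whose prediction is unchanged by the modification.
    keep = pred_orig == pred_mod
    if keep.sum() == 0:
        return 0.0

    # Flatten and L2-normalize the attention maps, then compute cosine similarity.
    a = F.normalize(att_orig[keep].flatten(1), dim=1)
    b = F.normalize(att_mod[keep].flatten(1), dim=1)
    cos = (a * b).sum(dim=1)

    # Fraction of prediction-consistent samples whose attention is also consistent.
    return (cos >= sim_thresh).float().mean().item() * 100.0
```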

Figure 3: Visualization of attention maps for interpretability (§4.4).

The spatial-domain modification randomly occludes the original CIFAR images with color blocks or with patches drawn from a different dataset, with the size of the modified region varied over several window sizes. The 2nd and 3rd rows of Fig. 3 show exemplar images occluded by a random color block and by a patch randomly drawn from SVHN, respectively. As can be observed, our spatial attention models VGG-aib(-qt) and WRN-aib(-qt) yield attention maps consistent with those of the original images (first row). The interpretability scores are listed in Table 2. Our method consistently outperforms other spatial attention mechanisms with the same backbone by a large margin on both datasets and across window sizes. This is because the IB-inspired attention mechanism minimizes the MI between the attention-modulated features and the inputs, thus mitigating, to some extent, the influence of ambiguous information injected into the inputs.

We also conduct frequency-domain modification, which applies a Fourier transform to the original samples and feeds only the high-/low-frequency (HF/LF) components into the model. To preserve enough information and maintain the classification performance, we keep the HF components outside a given radius and the LF components within a given radius in the frequency domain, defined as in [25]. The remaining rows of Fig. 3 show exemplar images reconstructed from HF and LF components, respectively. Our attention maps are more faithful to those of the original images than those of the other competitors. Our method also yields much better quantitative results than other attention models, as shown in Table 2. This is because our IB-inspired attention mechanism can capture the task-specific information in the input by maximizing the MI between the attention-modulated feature and the task output, even when part of the input information is missing.
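The HF/LF filtering step can be reproduced roughly as follows, using a centered 2D FFT and a circular radius mask in the spirit of [25]; the radius value, per-channel treatment, and function name are illustrative assumptions.

```python
import torch

def frequency_filter(img, radius, keep="low"):
    """Keep only low- or high-frequency components of an image batch.

    img: tensor of shape (B, C, H, W); radius: cutoff in the centered frequency plane.
    """
    B, C, H, W = img.shape
    # Centered 2D Fourier transform of each channel.
    spec = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))

    # Circular mask around the spectrum center.
    yy, xx = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    dist = (((yy - H // 2) ** 2 + (xx - W // 2) ** 2).float()).sqrt()
    mask = (dist <= radius) if keep == "low" else (dist > radius)

    # Zero out the discarded components and invert the transform.
    spec = spec * mask.to(img.dtype)
    return torch.fft.ifft2(torch.fft.ifftshift(spec, dim=(-2, -1))).real
```

For example, `frequency_filter(images, radius=8, keep="high")` would retain only the components outside the chosen radius before feeding the images to the classifier.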

4.5 Ablation Study

We conduct ablation studies on CIFAR-10 with the VGG backbone to assess the effectiveness of each component.

Effectiveness of attention module. Table 1 shows that VGG-aib and VGG-aib-qt outperform the IB-based representation learning model VGG-DVIB.

Information trade-off. The trade-off coefficient controls the amount of information that flows through the bottleneck of the network. To measure its influence on performance, we plot the accuracy for varying values of the coefficient in Fig. 4 (a). As can be observed, the accuracy is highest for one particular setting of the coefficient.

Latent vector dimension. We experiment with three settings of the latent vector dimension and report the results in Fig. 4 (b).

Attention score quantization. Fig. 4 (c) shows the classification accuracy when varying the number of anchor values, where an intermediate range of anchor numbers gives better performance. Exemplar attention maps of cases that are wrongly classified by VGG-aib but correctly predicted by VGG-aib-qt are shown in Fig. 5. As can be observed, attention quantization further helps the model focus on more concentrated regions carrying important information, thus improving the prediction accuracy.


Figure 4: Ablation study on CIFAR-10 with VGG backbone (§4.5).
Figure 5: Effect of attention score quantization. See §4.5 for details.

5 Conclusion

We propose a spatial attention mechanism based on the IB theory, which generates probabilistic attention maps that minimize the MI between the masked representation and the input, while maximizing the MI between the masked representation and the task label. To further restrict the information passed by the attention map, we incorporate an adaptive quantization mechanism that regularizes the attention scores by rounding the continuous scores to the nearest anchor values during training. Extensive experiments show that the proposed IB-inspired spatial attention mechanism significantly improves the performance of visual recognition tasks by focusing on the most informative parts of the input. The generated attention maps are interpretable for the decision making of the DNNs, as they consistently highlight the informative regions for original and modified inputs with the same semantics.

Acknowledgments

This work is supported in part by General Research Fund (GRF) of Hong Kong Research Grants Council (RGC) under Grant No. 14205018 and No. 14205420.

References

  • [1] A. Achille and S. Soatto (2018) Information dropout: learning optimal representations through noisy computation. TPAMI 40 (12), pp. 2897–2905. Cited by: §2.2, §2.2, §2.2.
  • [2] A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy (2017) Deep variational information bottleneck. In ICLR, Cited by: §2.2, §3.2, §3.2, Table 1.
  • [3] D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. In ICLR, Cited by: §2.1, §2.1.
  • [4] Y. Bengio, N. Léonard, and A. Courville (2013) Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432. Cited by: §3.3.
  • [5] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio (2015) Attention-based models for speech recognition. In NeurIPS, Cited by: §2.1.
  • [6] Y. Du, J. Xu, H. Xiong, Q. Qiu, X. Zhen, C. G. Snoek, and L. Shao (2020) Learning to learn with variational information bottleneck for domain generalization. In ECCV, Cited by: §2.2.
  • [7] C. W. Eriksen and J. E. Hoffman (1972) Temporal and spatial characteristics of selective encoding from visual displays. Perception & Psychophysics 12 (2), pp. 201–204. Cited by: §1.
  • [8] H. Fukui, T. Hirakawa, T. Yamashita, and H. Fujiyoshi (2019) Attention branch network: learning of attention mechanism for visual explanation. In CVPR, Cited by: Table 1.
  • [9] K. Han, Y. Wang, H. Chen, et al. (2020) A survey on visual transformer. arXiv preprint arXiv:2012.12556. Cited by: §1.
  • [10] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In CVPR, Cited by: §1, §2.1.
  • [11] S. Jetley, N. A. Lord, N. Lee, and P. H. Torr (2018) Learn to pay attention. In ICLR, Cited by: Table 1, §4.2, §4.3.
  • [12] D. P. Kingma and M. Welling (2014) Auto-encoding variational bayes. In ICLR, Cited by: §3.2.
  • [13] C. Koch and S. Ullman (1987) Shifts in selective visual attention: towards the underlying neural circuitry. In Matters of Intelligence, pp. 115–141. Cited by: Information Bottleneck Approach to Spatial Attention Learning.
  • [14] K. Koch, J. McLean, et al. (2006) How much the eye tells the brain. Current Biology 16 (14), pp. 1428–1434. Cited by: §1.
  • [15] A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Cited by: §4.1.
  • [16] K. Schulz, L. Sixt, F. Tombari, and T. Landgraf (2019) Restricting the flow: information bottlenecks for attribution. In ICLR, Cited by: §2.2, §2.2.
  • [17] P. H. Seo, Z. Lin, S. Cohen, X. Shen, and B. Han (2018) Progressive attention networks for visual attribute prediction. In BMVC, Cited by: Table 1.
  • [18] S. Sharma, R. Kiros, and R. Salakhutdinov (2015) Action recognition using visual attention. In ICLR Workshop, Cited by: §2.1.
  • [19] K. Simonyan, A. Vedaldi, and A. Zisserman (2013) Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034. Cited by: §1, §2.1.
  • [20] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In ICLR, Cited by: §3.3, Table 1.
  • [21] S. A. Taghanaki, M. Havaei, T. Berthier, F. Dutil, L. Di Jorio, G. Hamarneh, and Y. Bengio (2019) Infomask: masked variational latent representation to localize chest disease. In MICCAI, Cited by: §2.2, §2.2.
  • [22] N. Tishby, F. C. Pereira, and W. Bialek (1999) The information bottleneck method. JMLR. Cited by: §2.1, §2.2, §2.2, §3.2, §3.2.
  • [23] A. Van Den Oord, O. Vinyals, et al. (2017) Neural discrete representation learning. In NeurIPS, Cited by: §3.3.
  • [24] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie (2011) The Caltech-UCSD Birds-200-2011 Dataset. Technical report Cited by: §4.2.
  • [25] H. Wang, X. Wu, Z. Huang, and E. P. Xing (2020) High-frequency component helps explain the generalization of convolutional neural networks. In CVPR, Cited by: §4.4.
  • [26] S. Woo, J. Park, J. Lee, and I. So Kweon (2018) Cbam: convolutional block attention module. In ECCV, Cited by: §2.1.
  • [27] K. Xu, J. Ba, et al. (2015) Show, attend and tell: neural image caption generation with visual attention. In ICML, Cited by: §2.1, §2.1.
  • [28] S. Zagoruyko and N. Komodakis (2016) Wide residual networks. In BMVC, Cited by: Table 1.
  • [29] A. Zhmoginov, I. Fischer, and M. Sandler (2019) Information-bottleneck approach to salient region discovery. In ICML Workshop, Cited by: §2.2, §2.2.
  • [30] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba (2016) Learning deep features for discriminative localization. In CVPR, Cited by: Table 1.

Appendix A Derivation of the IB-inspired Spatial Attention

In this section, we present the detailed derivation of the variational bounds of the attentive variational information bottleneck.

Figure 6: Graphical model of the probabilistic neural network with IB-inspired spatial attention mechanism.

a.1 Lower bound on the MI between the latent representation and the output

Let random variables denote the input, the output, the spatial attention map, and the latent representation obtained from the input and the attention map, respectively. The MI between the latent representation and the output is defined as:

(11)

We introduce a variational distribution to approximate the intractable conditional distribution of the output given the latent representation. Since the Kullback-Leibler (KL) divergence is always non-negative, we have:

(12)

which leads to

(13)

where the entropy of the output is independent of the optimization procedure and can thus be ignored. By leveraging the conditional independence structure of the graphical model (see Fig. 6), and ignoring this entropy term, the new variational lower bound is as follows:

(14)

where the three factors correspond to the attention module, the encoder, and the variational decoder that produces the output. The MI between the latent representation and the output can thus be maximized by maximizing this variational lower bound.

a.2 Upper bound on the MI between the latent representation and the input

Next, to minimize the MI between the masked representation and the input, we consider the MI between the latent representation and the pair of input and attention map, and obtain the following upper bound according to the chain rule for mutual information:

(15)

where a prior distribution over the latent representation replaces its intractable marginal. This inequality also follows from the non-negativity of the Kullback-Leibler (KL) divergence.

Eq. (15) can be further simplified as follows:

(16)
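Using the notation $x$, $y$, $a$, $z$ as before, with $p_{\theta}(a\mid x)$ the attention module, $p_{\phi}(z\mid x,a)$ the encoder, and $r(z)$ the prior over the latent representation, a standard bound of this kind reads as follows; this is a hedged reconstruction consistent with the description above, not a verbatim copy of Eq. (15)-(16):

```latex
\begin{align}
I(z; x, a)
  &= \mathbb{E}_{p(x)\,p_{\theta}(a\mid x)\,p_{\phi}(z\mid x,a)}
       \left[\log \frac{p_{\phi}(z\mid x,a)}{p(z)}\right] \notag\\
  &\le \mathbb{E}_{p(x)\,p_{\theta}(a\mid x)\,p_{\phi}(z\mid x,a)}
       \left[\log \frac{p_{\phi}(z\mid x,a)}{r(z)}\right]
   = \mathbb{E}_{p(x)\,p_{\theta}(a\mid x)}
       \Bigl[\mathrm{KL}\bigl(p_{\phi}(z\mid x,a)\,\big\|\,r(z)\bigr)\Bigr],
\end{align}
```

where the inequality uses $\mathrm{KL}\bigl(p(z)\,\|\,r(z)\bigr) \ge 0$.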

By combining the two bounds in Eq. (14) and (16), we obtain the attentive variational information bottleneck:

(17)

which offers an IB-inspired spatial attention mechanism for visual recognition.

Following [2], we approximate the joint data distribution with the empirical distribution over the training samples, where the attention maps are sampled from the attention module. The approximated lower bound can thus be written as:

(18)

Similar to [2], we let the attention module and the encoder output the parameters of Gaussian distributions through network modules. To enable back-propagation, we use the reparametrization trick [12] and express the sampled attention map and latent vector as deterministic functions of their inputs and auxiliary Gaussian noise variables. The final loss function is:

(19)

The first term is the negative log-likelihood of the prediction, where the label is predicted from the latent encoding, which is in turn generated from the input and its attention map. Minimizing this term leads to maximal prediction accuracy. The second term is the KL divergence between the distribution of the latent encoding and the prior. The minimization of the KL term in Eq. (19) enables the model to learn an IB-inspired spatial attention mechanism for visual recognition.
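Writing $x_n$ and $y_n$ for the training samples, $a_n$ for the sampled attention map, $q_{\psi}(y\mid z)$ for the decoder, $p_{\phi}(z\mid x,a)$ for the encoder, $r(z)$ for the prior, and $\beta$ for the trade-off coefficient (notation we introduce for this sketch), the empirical objective corresponding to Eq. (19) can be written roughly as:

```latex
\begin{align}
\mathcal{L} \;\approx\; \frac{1}{N}\sum_{n=1}^{N}
  \mathbb{E}_{\epsilon,\,\epsilon'}\Bigl[
     -\log q_{\psi}\bigl(y_n \mid f_z(x_n, a_n, \epsilon')\bigr)
     \;+\; \beta\,\mathrm{KL}\bigl(p_{\phi}(z \mid x_n, a_n)\,\big\|\, r(z)\bigr)
  \Bigr],
  \qquad a_n = f_a(x_n, \epsilon),
\end{align}
```

where $\epsilon, \epsilon' \sim \mathcal{N}(0, I)$, $f_a$ and $f_z$ are the deterministic reparametrizations of the attention module and the encoder, and $N$ is the number of training samples. The exact form in the paper may differ from this sketch.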

Appendix B Introduction of the recognition tasks

Fine-grained recognition aims to differentiate similar subordinate categories of a basic-level category, e.g., bird species [24], flower types [nilsback2008automated], dog breeds [khosla2011novel], automobile models [krause20133d], etc. The annotations are usually provided by domain experts who are able to distinguish the subtle differences among the highly confusable sub-categories. Compared with the generic image classification task, fine-grained recognition benefits more from finding informative regions that highlight the visual differences among the subordinate categories, and from extracting discriminative features therein.

Cross-domain classification tests the generalization ability of a trained classifier on domain-shifted benchmarks, i.e., datasets that have not been used during training. In our experiments, we directly evaluate the models trained on CIFAR-10 on the training and test sets of STL-10, following [11].

Appendix C Implementation details

Network configuration. For CIFAR and SVHN, the feature extractor consists of the first three convolutional blocks of VGG (with the max-pooling layers omitted) or the first two residual blocks of WRN. In VGG-aib, the mean of the attention distribution is obtained through a convolution followed by a Sigmoid, and its standard deviation is computed through another convolution, which is inspired by [bahuleyan2018variational]. In WRN-aib, the attention module contains an extra residual block compared with the VGG backbone, and the attention map is obtained in the same way. The encoder contains the remaining convolutional blocks of VGG or WRN, with an extra FC layer that maps the flattened attention-modulated feature to a vector whose first half parameterizes the mean of the latent Gaussian and whose second half parameterizes its standard deviation (after a softplus activation). The decoder maps the sampled latent representation to the prediction, and is built as FC-ReLU-FC for the VGG backbone and ReLU-FC for the WRN backbone, with the FC sizes determined by the latent dimension and the number of classes. For CUB, the feature extractor is the whole convolutional part of VGG, the encoder contains only a linear layer, and the structures of the attention module and the decoder are the same as above. For VGG-aib-qt and WRN-aib-qt, the numbers of anchor values differ between the two backbones; the anchor values are initialized by evenly dividing the attention range and then trained end-to-end within the framework.
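A rough PyTorch sketch of the Gaussian attention head described above is shown below. The channel count, kernel sizes, the choice of computing the standard deviation from the mean map, and the softplus activation are assumptions consistent with, but not guaranteed to match, the released configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianAttentionHead(nn.Module):
    """Predicts a per-location Gaussian over attention scores and samples a map."""

    def __init__(self, in_channels=256):
        super().__init__()
        self.mu_conv = nn.Conv2d(in_channels, 1, kernel_size=1)   # attention mean
        self.sigma_conv = nn.Conv2d(1, 1, kernel_size=1)          # std computed from the mean map (assumption)

    def forward(self, feat):
        mu = torch.sigmoid(self.mu_conv(feat))                    # Conv -> Sigmoid
        sigma = F.softplus(self.sigma_conv(mu))                   # positive standard deviation
        # Reparametrized sample of the continuous attention map.
        a = mu + sigma * torch.randn_like(sigma)
        return a, mu, sigma
```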

Comparisons of #parameters. In Table 3, we report the number of parameters for different attention mechanisms under various settings.

 

Input Size #Class VGG-att3 VGG-aib VGG-aib-qt WRN-ABN WRN-aib WRN-aib-qt
 , 10 19.86 20.05 20.05 64.35   65.00   65.00
  , 100 19.99 20.07 20.07 64.48   65.11   65.11
, 200 - 24.28 24.28 - 130.73 130.73

Table 3: Comparisons of the number of parameters (M).

Hyper-parameter selection. We first determine the ranges of the hyper-parameters by referring to related works [2, 23], and then select the best values through experiments. Following this scheme, we choose a suitable trade-off coefficient for each backbone by experimenting over a set of candidate values. For a given backbone, the hyper-parameters are the same for tasks whose inputs share the same resolution, and they vary only slightly across backbones. The latent vector dimension defines the size of the bottleneck as in [2], which does not necessarily match the latent dimension of other architectures.

Training. Models are trained from scratch, except on CUB where we follow [11]. We use an SGD optimizer with weight decay and momentum, and train for a fixed number of epochs with the same batch size for all datasets. For VGG-aib(-qt), the initial learning rate is decayed by a fixed factor every several epochs. For WRN-aib(-qt), the initial learning rate is decayed at two pre-defined epochs.
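A schematic of this training setup in PyTorch is given below; every numeric value (learning rate, momentum, weight decay, step size, milestones) is a placeholder, since the exact settings are not reproduced here.

```python
import torch

def build_optimizer_and_scheduler(model, backbone="vgg"):
    """Illustrative SGD + step-decay schedule; all hyper-parameter values are placeholders."""
    opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
    if backbone == "vgg":
        # Decay the learning rate by a fixed factor every few epochs.
        sched = torch.optim.lr_scheduler.StepLR(opt, step_size=30, gamma=0.1)
    else:
        # For WRN, decay at two pre-defined epochs.
        sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[60, 120], gamma=0.2)
    return opt, sched
```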

Inference. We draw Monte Carlo samples of the attention map and of the latent representation to estimate the lower bound of the IB-inspired attention mechanism in Eq. (17) and the loss in Eq. (18).

Running time analysis. We show the FLOPs and inference time of our method in Table 4.

 

Input Size #Class FLOPs (G) Inference time per frame (ms)
VGG-aib VGG-aib-qt WRN-aib WRN-aib-qt VGG-aib VGG-aib-qt WRN-aib WRN-aib-qt
  10   3.79   3.79 39.14 39.14 3.43 3.68   8.53   8.55
   100   3.79   3.79 39.20 39.20 3.51 3.70   8.60   8.84
200 16.75 16.75 50.97 50.97 3.58 3.81 10.47 10.66

Table 4: Running time analysis with various task settings.

Reproducibility. Our model is implemented using PyTorch and trained/tested on one Nvidia TITAN Xp GPU. To ensure reproducibility, our code is released at https://github.com/ashleylqx/AIB.git.
