Modulated Self-attention Convolutional Network for VQA

by   Jean-Benoit Delbrouck, et al.

As new datasets for real-world visual reasoning and compositional question answering emerge, it may become necessary to make visual feature extraction an end-to-end part of training. This small contribution suggests new ideas to improve the visual processing of traditional convolutional networks for visual question answering (VQA). In this paper, we propose to modulate, by a linguistic input, a CNN augmented with self-attention. We show encouraging relative improvements for future research in this direction.




1 Introduction

Problems combining vision and natural language processing, such as visual question answering (VQA), are viewed as extremely challenging tasks. Visual attention-based neural decoder models Xu et al. (2015); Karpathy and Li (2015) have been widely adopted to solve such tasks. All recent works pushing the state of the art in VQA use the so-called bottom-up attention features Anderson et al. (2018), which consist of multiple pre-extracted features corresponding to local regions of interest in an image. Because these features are pre-extracted, it is not possible to modulate the entire visual pipeline to capture relationships or make comparisons between widely separated spatial regions according to the question. For example, questions of the new GQA dataset Hudson and Manning (2019) tackle reasoning skills such as object and attribute recognition, transitive relation tracking, spatial reasoning, logical inference and comparisons. This contribution aims to bring the feature extraction process back end-to-end into the learning and to modulate the convolutional network by a linguistic input through self-attention.

2 Related work

Two main approaches have helped push the state of the art in Visual Question Answering further: the co-attention module and the bottom-up image features. Our work also takes inspiration from the recent attention augmented convolutional networks.

Image features  Most early results have used a VGG Antol et al. (2015) or a ResNet Lu et al. (2016) CNN pretrained on ImageNet Russakovsky et al. (2015) to extract visual features from images. Recently, richer bottom-up attention features Anderson et al. (2018) have been proposed. They consist of features pre-extracted from a Faster R-CNN (that outputs bounding boxes of interest) in conjunction with ResNet-101 (that acts as a feature extractor for each selected bounding box). They are widely used in the most recent works as they have shown great success for VQA and image captioning.

Co-attention  Attention networks in multimodal learning provide an efficient way to utilize given visual information selectively. However, learning attention distributions for every pair of multimodal input channels is prohibitively expensive. To solve this problem, a co-attention learning framework is proposed Chen et al. (2017) that reduces this computational cost. An improvement to the co-attention method is introduced by Yu et al. (2018); it consists of two steps, a self-attention for the question embedding and a question-conditioned attention for the visual embedding. However, these co-attention approaches use separate attention distributions for each modality, neglecting the interaction between the modalities. To tackle this problem, Kim et al. (2018) propose bilinear attention networks that find bilinear attention distributions to effectively utilize vision-language information. Finally, Yu et al. (2019) propose Deep Modular Co-Attention Networks composed of multiple blocks of two basic attention units: a self-attention unit (inspired by machine translation Vaswani et al. (2017)) and a guided-attention (GA) unit that models the dense inter-modal interactions. The authors also concatenate the res5c features from ResNet-152 to the bottom-up features from Anderson et al. (2018).

Attention augmented CNN  Over the years, multiple attention mechanisms have been proposed for visual tasks to address the weaknesses of convolutions. For instance, Squeeze-and-Excitation networks Hu et al. (2018) reweight feature channels using signals aggregated from entire feature maps, while the CBAM module Woo et al. (2018) sequentially infers attention maps along two separate dimensions, channel and spatial. More recent attention augmented networks produce additional feature maps by using Attention Augmented Convolution Bello et al. (2019) or recalibrate the feature maps via addition Zhang et al. (2019).

3 Attention augmented Residual Network

3.1 Residual Network

ResNets He et al. (2016) are built from residual blocks:

$$y = \mathcal{F}(x, \{W_i\}) + x$$

Here, $x$ and $y$ are the input and output vectors of the layers considered. The function $\mathcal{F}(x, \{W_i\})$ is the residual mapping to be learned. For example, if we consider two layers, $\mathcal{F} = W_2 \sigma(W_1 x)$, where $\sigma$ denotes the ReLU function. The operation $\mathcal{F} + x$ is the shortcut connection and consists of an element-wise addition. Therefore, the dimensions of $x$ and $\mathcal{F}$ must be equal. When this is not the case (e.g., when changing the input/output channels), a matrix $W_s$ performs a linear projection on the shortcut connection to match the dimensions. Finally, a second nonlinearity is applied after the addition (i.e., $\sigma(y)$). A group of blocks is stacked to form a stage of computation. The general ResNet architecture starts with a single convolutional layer followed by 4 stages of computation.
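As an illustrative sketch only (not the authors' code), a two-layer residual block with an optional projection shortcut can be written with plain Python lists. The helper names `matvec`, `relu` and `residual_block` are assumptions made for this example, and dense matrices stand in for the convolutions and batch normalization of a real ResNet.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def relu(x):
    return [max(0.0, v) for v in x]


def residual_block(x, W1, W2, Ws=None):
    """Return relu(F(x) + shortcut(x)) with F(x) = W2 relu(W1 x).

    Ws is an optional projection, needed only when the dimensions of
    x and F(x) differ; otherwise the shortcut is the identity.
    """
    fx = matvec(W2, relu(matvec(W1, x)))
    shortcut = x if Ws is None else matvec(Ws, x)
    if len(shortcut) != len(fx):
        raise ValueError("shortcut and residual dimensions must match")
    # Element-wise addition followed by the second nonlinearity.
    return relu([a + b for a, b in zip(fx, shortcut)])


# Identity shortcut: with zero residual weights the block returns relu(x).
zeros = [[0.0, 0.0], [0.0, 0.0]]
print(residual_block([1.0, -2.0], zeros, zeros))  # [1.0, 0.0]
```

When the residual branch changes the dimension, passing `Ws` projects the input so that the addition remains well defined.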

3.2 Self-attention layer

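A self-attention layer can be placed between two layers in a residual block. Consider the image features from the previous layer, $x \in \mathbb{R}^{C \times N}$. Here, $C$ is the number of channels and $N$ is the number of feature locations.

The features $x$ are first transformed into two feature spaces $f, g$, where $f(x) = W_f x$ and $g(x) = W_g x$. We use $f$ and $g$ to compute the square matrix of attention weights $\beta \in \mathbb{R}^{N \times N}$:

$$\beta_{j,i} = \frac{\exp(s_{ij})}{\sum_{i=1}^{N} \exp(s_{ij})}, \quad \text{where } s_{ij} = f(x_i)^T g(x_j) \qquad (1)$$

with $\beta_{j,i}$ indicating the extent to which the model attends to the $i^{th}$ location when synthesizing the $j^{th}$ location. The output of the attention layer is $o = (o_1, o_2, \ldots, o_N) \in \mathbb{R}^{C \times N}$, where

$$o_j = v\left(\sum_{i=1}^{N} \beta_{j,i} h(x_i)\right), \quad h(x_i) = W_h x_i, \quad v(x_i) = W_v x_i \qquad (2)$$

In the above formulation, $W_f$, $W_g$, $W_h$ and $W_v$ are the learned weight matrices, which are implemented as $1 \times 1$ convolutions.

Additionally, the output is multiplied by a factor $\gamma$ and we add back the input feature map:

$$y_i = \gamma o_i + x_i \qquad (3)$$

$\gamma$ is a learnable scalar initialized as 0. Introducing the learnable $\gamma$ allows the visual network to first rely on the cues in the local neighborhood and the language model to converge normally. As $\gamma$ grows, the model gradually learns the long-range interactions required for a task such as VQA.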
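The layer described above can be sketched in a few lines of dependency-free Python. This is a minimal illustration under stated assumptions, not the authors' implementation: features are stored as a list of $N$ location vectors of length $C$, the four weight matrices are plain nested lists rather than convolutions, and the function names are hypothetical.

```python
import math


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(xs, Wf, Wg, Wh, Wv, gamma=0.0):
    """xs holds N feature vectors (one per location); returns N outputs."""
    n = len(xs)
    f = [matvec(Wf, x) for x in xs]
    g = [matvec(Wg, x) for x in xs]
    h = [matvec(Wh, x) for x in xs]
    ys = []
    for j in range(n):
        # beta_j[i]: how much location j attends to location i (softmax over i).
        beta_j = softmax([dot(f[i], g[j]) for i in range(n)])
        mixed = [sum(beta_j[i] * h[i][c] for i in range(n))
                 for c in range(len(h[0]))]
        o_j = matvec(Wv, mixed)
        # Scale the attention output by gamma and add back the input.
        ys.append([gamma * o + x for o, x in zip(o_j, xs[j])])
    return ys


identity = [[1.0, 0.0], [0.0, 1.0]]
features = [[1.0, 0.0], [0.0, 1.0]]
# With gamma = 0 the layer is an identity mapping on its input.
print(self_attention(features, identity, identity, identity, identity))
# [[1.0, 0.0], [0.0, 1.0]]
```

Initializing `gamma` at 0 means the pretrained network's behaviour is unchanged at the start of training, and the attention branch only contributes once `gamma` moves away from zero.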

4 Linguistic modulation

In this section, we describe two methods to modulate a pretrained ResNet through the self-attention modules. We want to enable the linguistic embedding to manipulate the self-attention mechanism. We do so through the parameter $\gamma$ and the attention weights $\beta$.

$\gamma$ modulation  Given the last hidden state $q$ of the RNN encoding the question, we output a modulated $\hat{\gamma}$ given by:

$$\hat{\gamma} = W_\gamma q \qquad (4)$$

where $W_\gamma \in \mathbb{R}^{1 \times d_q}$ and $d_q$ is the size of the hidden state. Recall that in the previous section, $\gamma$ in equation 3 is shared by all training examples. Here, we output a dedicated scalar $\hat{\gamma}$ for every example in the batch. We rewrite equation 3 as:

$$y_i = \hat{\gamma} o_i + x_i \qquad (5)$$
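The following sketch illustrates the idea of a per-example scalar, assuming the modulated value is a plain linear projection of the question's last hidden state. The projection vector `w_gamma`, the function names and the toy numbers are all hypothetical and chosen only for illustration.

```python
def modulated_gamma(w_gamma, hidden_state):
    """Project the question's last hidden state to a single scalar."""
    return sum(w * v for w, v in zip(w_gamma, hidden_state))


def apply_gamma(o, x, gamma):
    """Compute y_i = gamma * o_i + x_i for every location i."""
    return [[gamma * ov + xv for ov, xv in zip(oi, xi)]
            for oi, xi in zip(o, x)]


# Two questions in a batch yield two different scalars.
w_gamma = [0.5, -0.25]
hidden_states = [[2.0, 0.0], [0.0, 4.0]]
gammas = [modulated_gamma(w_gamma, q) for q in hidden_states]
print(gammas)  # [1.0, -1.0]

# The same attention output o is mixed differently for each question.
o = [[1.0, 1.0]]
x = [[2.0, 3.0]]
print([apply_gamma(o, x, g) for g in gammas])
# [[[3.0, 4.0]], [[1.0, 2.0]]]
```

Note that nothing in such a projection bounds the scalar, so its magnitude depends entirely on the learned weights and on the hidden state.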



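$\beta$ modulation  We define new linguistic and visual feature spaces $k$ and $l$, where $k(q) = W_k q$ and $l(x) = W_l x$, with $q$ the hidden state of the question. Both feature spaces are used to compute a new set of attention weights $\hat{\beta}$ in the following manner:

$$\hat{\beta}_{j,i} = \frac{\exp(\hat{s}_{ij})}{\sum_{i=1}^{N} \exp(\hat{s}_{ij})}, \quad \text{where } \hat{s}_{ij} = l(x_i)^T k(q_j) \qquad (6)$$

where $\hat{\beta}_{j,i}$ indicates the extent to which the model attends to the $i^{th}$ spatial location of the feature map when synthesizing the $j^{th}$ dimension of the hidden state. The hidden state is a vector, therefore $j = 1$ and $\hat{\beta} \in \mathbb{R}^{1 \times N}$. We apply this set of betas to the output $o$ defined in equation 2. This modulation can be seen as a linguistic spatial attention on the visual self-attention: each column $o_i$ is reweighted by the scalar $\hat{\beta}_i$.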
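A minimal sketch of this reweighting is given below, assuming the scores are dot products between a projected hidden state and projected location features, normalized by a softmax over locations. The names `beta_modulation`, `Wk` and `Wl` and the toy inputs are assumptions for the example, not the authors' code.

```python
import math


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def beta_modulation(q, xs, outputs, Wk, Wl):
    """Reweight each attention output o_i by a question-dependent scalar.

    q: hidden state of the question, xs: N location features,
    outputs: N self-attention outputs (one vector per location).
    """
    k = matvec(Wk, q)
    beta = softmax([dot(matvec(Wl, x), k) for x in xs])
    reweighted = [[b * v for v in o] for b, o in zip(beta, outputs)]
    return beta, reweighted


identity = [[1.0, 0.0], [0.0, 1.0]]
q = [1.0, 0.0]
outputs = [[2.0, 4.0], [6.0, 8.0]]

# Locations that score equally against the question get equal weights.
beta, reweighted = beta_modulation(q, [[0.0, 0.0], [0.0, 0.0]], outputs,
                                   identity, identity)
print(beta)        # [0.5, 0.5]
print(reweighted)  # [[1.0, 2.0], [3.0, 4.0]]
```

Because the weights sum to one over the $N$ locations, this acts as a spatial attention map driven by the question and applied on top of the visual self-attention output.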

5 Experiments

Settings  We use the VQA v2.0 train and validation sets Goyal et al. (2017), consisting of 443,757 and 214,354 questions respectively over 123,287 images. As the VQA model, we follow the Bottom-Up and Top-Down Attention of Anderson et al. (2018). We replace the Up-Down features with a ResNet as visual feature extractor (a ResNet-34 for preliminary experiments and a ResNet-152 He et al. (2016) for final results). The ResNet is pretrained on ImageNet Russakovsky et al. (2015) and frozen during training. Only the self-attention weights are trainable parameters. For each image, we extract features at the end of the third ResNet stage.

Learning parameters are trained with Adamax optimizer and a learning rate of

. In a block, the self-attention module is always placed between the last convolution and batch normalization layer of the said block.

Block and layer search  Thanks to attention, the model can check that features in distant portions of the image are consistent with each other. Depending on where we place one or several attention modules, the model can compute affinity scores between low-level or high-level features in distant portions of the image. It is also worth noting that an attention module in the early stages of the ResNet is computationally expensive, since the first stage has the largest number of spatial locations $N$.
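To make the cost argument concrete, the sketch below counts the entries of the $N \times N$ attention matrix at the end of each of the first three stages. The 224×224 input resolution is an assumption chosen only for illustration (the image size used in the experiments is not stated here); the strides 4, 8 and 16 are those of the standard ResNet design.

```python
def attention_size(input_size, stride):
    """Return (N, N*N) for a square feature map at the given stride."""
    side = input_size // stride
    n = side * side
    return n, n * n


INPUT_SIZE = 224  # assumed resolution, for illustration only
for stage, stride in [(1, 4), (2, 8), (3, 16)]:
    n, entries = attention_size(INPUT_SIZE, stride)
    print(f"stage {stage}: N = {n}, attention entries = {entries}")
# stage 1: N = 3136, attention entries = 9834496
# stage 2: N = 784, attention entries = 614656
# stage 3: N = 196, attention entries = 38416
```

Each halving of the spatial resolution divides $N$ by 4 and the attention matrix by 16, which is why placing modules in late stages is much cheaper.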

| ResNet-34 | Eval set % | #param |
| --- | --- | --- |
| Baseline (No SA) Anderson et al. (2018) | 55.00 | 0M |
| SA (S: 1,2,3 - B: 1) | 55.11 | 0.107M |
| SA (S: 1,2,3 - B: 2) | 55.17 | 0.107M |
| SA (S: 1,2,3 - B: 3) | 55.27 | 0.107M |

| ResNet-34 | Eval set % | #param |
| --- | --- | --- |
| SA (S: 3 - B: 1) | 55.25 | 0.082M |
| SA (S: 3 - B: 3) | 55.42 | 0.082M |
| SA (S: 3 - B: 4) | 55.33 | 0.082M |
| SA (S: 3 - B: 6) | 55.31 | 0.082M |
| SA (S: 3 - B: 1,3,5) | 55.45 | 0.245M |
| SA (S: 3 - B: 2,4,6) | 55.56 | 0.245M |

Table 1: Experiments run on a ResNet-34. Numbers following S (stages) and B (blocks) indicate where SA (self-attention) modules are placed. Parameter counts concern only the SA modules and are in millions (M).

We empirically found that self-attention was most effective in the 3rd stage. It is also the least computationally expensive (early stages require fewer trainable parameters as $C$ is smaller, but their computation is more expensive). It is also beneficial to focus on the last blocks of a stage rather than the first ones. We notice a small improvement relative to the baseline, which shows that self-attention alone does improve performance on the VQA task.
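The dependence of the parameter count on $C$ can be illustrated with a rough calculation. The accounting below is an assumption, not taken from the source: it supposes that $W_f$ and $W_g$ reduce $C$ channels to $C/8$ (as in Zhang et al. (2019)), that a single $C \times C$ value projection is counted, and that biases are ignored. The widths 64, 128 and 256 are those of the first three ResNet-34 stages.

```python
def sa_params(channels, reduction=8):
    """Rough parameter count of one self-attention module (assumed layout)."""
    reduced = channels // reduction
    query_key = 2 * channels * reduced  # W_f and W_g
    value = channels * channels         # one C x C projection
    return query_key + value


for stage, channels in [(1, 64), (2, 128), (3, 256)]:
    print(f"stage {stage}: C = {channels}, params = {sa_params(channels)}")
# stage 1: C = 64, params = 5120
# stage 2: C = 128, params = 20480
# stage 3: C = 256, params = 81920
```

Under these assumptions a single stage-3 module has roughly 0.082M parameters, in line with the single-module figure in Table 1. The count grows with $C^2$ and does not depend on $N$, whereas the attention matrix grows with $N^2$, which is the trade-off between parameters and computation noted above.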

Linguistic modulation  We noticed that the current architecture was not able to learn the $\gamma$ modulation properly: equation 4 would most of the time result in a large scalar, or in 1 when we added a sigmoid function. The self-attention weight is therefore already too large at the beginning of training, and skipping the local visual connections misleads the overall model training. However, we managed to show improvements with the $\beta$ modulation with a ResNet-152. Though the improvement is slim, it encourages further research into visual modulation.

| ResNet-34 | Eval set % | #param |
| --- | --- | --- |
| SA (S: 3 - B: 1,3,5) | 55.56 | 0.245M |
| + $\gamma$ modulation | 52.26 | 0.248M |
| + $\beta$ modulation | 55.32 | 1.573M |

| ResNet-152 | Eval set % | #param |
| --- | --- | --- |
| Baseline (No SA) Anderson et al. (2018) | 57.10 | 0M |
| SA (S: 3 - B: 2,18,36) | 58.05 | 3.932M |
| + $\beta$ modulation | 58.35 | 15.731M |

Table 2: Experiments run on a ResNet-34 and a ResNet-152 for the $\gamma$ and $\beta$ modulation.

It is worth noting that the ResNet baseline of Anderson et al. (2018) achieved 57.9% accuracy and was obtained with a ResNet-200. Our best model achieves 58.35% with a ResNet-152.

6 Conclusion

We showed that it is possible to improve the feature extraction procedure for the VQA task by adding self-attention modules in different ResNet blocks. The margins could be further improved by:

  • Using a wider CNN: Now that we include the feature extraction process end-to-end in the learning, we could pick a wider CNN such as ResNeXt Xie et al. (2017) or Wide-ResNet Zagoruyko and Komodakis (2016);

  • Better self-attention: More sophisticated self-attention, such as multi-head attention in CNNs Bello et al. (2019); Yu et al. (2019), could improve the overall model;

  • Better modulation: The linguistic hidden state should directly be an input of the self-attention computation. Other works have already considered using all the hidden states from the language model as the "key" input for visual self-attention Yu et al. (2019);

  • Other modulation: The self-attention modulation could be coupled with modulated batch normalization, as previously investigated in De Vries et al. (2017).


References

  • [1] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6077–6086.
  • [2] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh (2015) VQA: Visual Question Answering. In International Conference on Computer Vision (ICCV).
  • [3] I. Bello, B. Zoph, A. Vaswani, J. Shlens, and Q. V. Le (2019) Attention augmented convolutional networks. arXiv preprint arXiv:1904.09925.
  • [4] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2017) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40 (4), pp. 834–848.
  • [5] H. De Vries, F. Strub, J. Mary, H. Larochelle, O. Pietquin, and A. C. Courville (2017) Modulating early visual processing by language. In Advances in Neural Information Processing Systems, pp. 6594–6604.
  • [6] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh (2017) Making the V in VQA matter: elevating the role of image understanding in Visual Question Answering. In Conference on Computer Vision and Pattern Recognition (CVPR).
  • [7] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778.
  • [8] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141.
  • [9] D. A. Hudson and C. D. Manning (2019) GQA: a new dataset for real-world visual reasoning and compositional question answering. In Conference on Computer Vision and Pattern Recognition (CVPR).
  • [10] A. Karpathy and F. Li (2015) Deep visual-semantic alignments for generating image descriptions. In CVPR, pp. 3128–3137. ISBN 978-1-4673-6964-0.
  • [11] J. Kim, J. Jun, and B. Zhang (2018) Bilinear attention networks. In Advances in Neural Information Processing Systems, pp. 1564–1574.
  • [12] J. Lu, J. Yang, D. Batra, and D. Parikh (2016) Hierarchical question-image co-attention for visual question answering. In Advances In Neural Information Processing Systems, pp. 289–297.
  • [13] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) Imagenet large scale visual recognition challenge. International journal of computer vision 115 (3), pp. 211–252.
  • [14] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008.
  • [15] S. Woo, J. Park, J. Lee, and I. So Kweon (2018) Cbam: convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19.
  • [16] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He (2017) Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1492–1500.
  • [17] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on Machine Learning, F. Bach and D. Blei (Eds.), Proceedings of Machine Learning Research, Vol. 37, Lille, France, pp. 2048–2057.
  • [18] Z. Yu, J. Yu, Y. Cui, D. Tao, and Q. Tian (2019) Deep modular co-attention networks for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6281–6290.
  • [19] Z. Yu, J. Yu, C. Xiang, J. Fan, and D. Tao (2018) Beyond bilinear: generalized multimodal factorized high-order pooling for visual question answering. IEEE transactions on neural networks and learning systems 29 (12), pp. 5947–5959.
  • [20] S. Zagoruyko and N. Komodakis (2016) Wide residual networks. arXiv preprint arXiv:1605.07146.
  • [21] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena (2019) Self-attention generative adversarial networks. In International Conference on Machine Learning, pp. 7354–7363.