Granular Multimodal Attention Networks for Visual Dialog

Vision and language tasks have benefited from attention, and a number of different attention models have been proposed. However, the scale at which attention needs to be applied has not been well examined. In this work, we propose a new method, Granular Multi-modal Attention, that addresses the question of the right granularity at which one needs to attend while solving the Visual Dialog task. The proposed method improves both the image and the text attention networks. We further propose a Granular Multi-modal Attention network that jointly attends over the image and text granules and achieves the best performance. With this work, we observe that obtaining granular attention and performing exhaustive multi-modal attention appears to be the best way to attend while solving visual dialog.


1 Introduction

In the ‘Visual Dialog’ [5] problem, an AI agent has access to an image and must answer a question given the image and the context of the previous conversation. We can gain insight into a method by observing the regions of the image the method focuses on most while answering a question. It has been observed in recent work that humans also attend to specific regions of an image while answering questions [4]. We therefore expect a strong correlation between focusing on the “right” regions while answering questions and obtaining the better semantic understanding needed to solve the problem; this correlation exists as far as humans are concerned [4]. In this paper we therefore aim to obtain image-based attention regions that correlate better with human attention. Using attention is known to be a good approach for solving various problems that relate vision and language. However, in an interesting evaluation carried out for the visual question answering task [4], it was observed that attention networks focus on regions different from those used by humans for answering questions. We hypothesize that this could be because they do not use the right granularity and the correct context when obtaining attention for image and textual regions.

Figure 1:

We first obtain image and text attention. Then we use a Multi-modal Attention network to obtain the final attention map. Finally, we classify the answer based on the attended feature. We also show the attention map, which indicates the actual improvement in attention.

In this work, we aim to address this problem. We particularly aim to obtain ‘granular regions’ for images using object proposals, as used earlier by Anderson et al. [1], and word-based attention using the appropriate context. We observe that by using all the text to attend to each image proposal (the granular image attention network) and using the whole image to attend to each word (the granular text attention network), one can obtain the appropriate attention regions in the image and text respectively. Further, when used in conjunction, these result in further improvement through the granular multi-modal attention network. In figure 1 we illustrate the main idea of the network. As can be observed, the multi-modal attention network obtains attention regions from both image and text, and when combined these provide improved attention regions.

As part of this work, we also carry out a thorough evaluation of the main attention methods that have been proposed in the literature for various vision and language tasks. This evaluation provides the ground for properly analyzing the various attention methods. Additionally, we consider the correlation between visual explanations and attention regions. In the literature, there have been visualization efforts such as Grad-CAM [29] that aim to explain the decisions of neural networks by visualizing gradient information. We show that the proposed attention network regions correlate well with the regions obtained using Grad-CAM. To summarize, through this paper we provide the following main contributions:

  • We propose a granular image-attention (GIA) and granular text-attention (GTA) based approach to obtain improved attention regions for solving visual dialog.

  • We evaluate three variants of the proposed granular attention networks: one where we obtain only granular image attention, another where we obtain only granular text attention, and the final proposed method where we combine these attentions using a multi-modal attention method.

  • We obtain an improvement of about 6% in NDCG score over other baseline approaches using the proposed attention model for the visual dialog task.

  • We provide a thorough empirical analysis of the method, provide visualizations of the attention masks of the granular multi-modal attention network (GMA), and measure the rank correlation of the various attention masks with Grad-CAM regions to ensure that the attention regions correlate with the regions used by the network in solving visual dialog.

2 Related Work

A conversation about an image is known as Visual Dialog. This is one of the recent challenges in the field of vision and language. The task of visual dialog is related to image captioning [31, 33, 15, 8, 14, 37], i.e., describing an image; visual question answering [19, 2, 28, 21, 36, 30, 18], i.e., answering a natural language question about an image; visual question generation [20, 25], i.e., generating a natural language question about an image; and generating questions similar to a given question [26]. The multimodal attention mechanism is one of the core mechanisms in such interaction systems. In VQA, [39] proposed an attention network in a stacked fashion, [10] combine both modalities in the frequency space, and [24] use an exemplar-based approach to obtain a multimodal attention map. [12] proposed a cross-modal attention network operating at the word and object level. [23] proposed an algorithm for minimizing uncertainty to obtain a robust attention map in a multimodal system. Visual dialog requires the agents to have a meaningful conversation about the visual content of an image. This task was introduced by [5], who proposed three approaches: late fusion, which concatenates all the history rounds; an attention-based hierarchical LSTM for handling variable-length history; and a memory-based method, which resulted in the highest accuracy.

Das et al. [6] proposed a deep reinforcement learning based, end-to-end trained model for Visual Dialog. Strub et al. [32] proposed end-to-end RL optimization and its application to multimodal tasks. Chattopadhyay et al. [3] designed an interactive AI image guessing game based on visual dialog. Lu et al. [17] and Wu et al. [35] proposed generator and discriminator based architectures. Recent work on visual dialog by Jain et al. [13] is based on discriminative question generation and answering. [22] proposed a probabilistic method to generate diverse answers and also minimize uncertainty in answer generation. Various methods have been proposed to handle variable-length history rounds. Das et al. [5] proposed the ‘Late Fusion’ (LF) method, where question word tokens are concatenated with answer tokens and the result is embedded using an LSTM network. To handle variable-length history, they proposed a hierarchical LSTM. They also proposed a memory network model, which performs best in terms of accuracy. In contrast to these earlier architectures, we address the question of obtaining correct attention regions for solving the task and provide comparisons with the related attention methods.

Figure 2:

Illustration of SAN and its attention mask. We pass an image through a convolutional neural network to get image features and apply tile function on text to obtain text features. Then we obtain softmax mask using attention network. We apply the attention mask on image to get the final attention.

3 Background: Stacked Attention Network (SAN)

Stacked Attention Network (SAN) [38] is a question-guided attention scheme that learns an attention probability vector over the visual information based on the input question vector. We use this as our reference network. The attention is a weighted average of the question features and image features. The output of the weighted-average features is fed into a tanh layer followed by a linear layer to compute the attention probability vector. Softmax is applied over the linear layer to obtain the attention map, which indicates the probability of contribution of each spatial feature. Finally, this attention map is multiplied by the image features, and the result is added to the question feature to predict the answer. SAN uses a stack of such attention layers as iterative steps to narrow down the selected portion of visual information. The mathematical expression for SAN is as follows:

$$h_A = \tanh\big(W_I V_I \oplus (W_Q u^{k-1} + b_A)\big), \qquad p_I = \mathrm{softmax}(W_P h_A + b_P), \qquad u^{k} = \sum_i p_{I,i}\, v_i + u^{k-1}$$

where $u^{k-1}$ is the question (query) vector in iteration $k$ (with $u^0$ the question embedding), $V_I = [v_1, \dots, v_m]$ is the image feature matrix, $W_I$, $W_Q$, $W_P$ are weight matrices, $b_A$, $b_P$ are bias terms, and $\oplus$ adds a vector to every column of a matrix. This process repeats $K$ times to obtain the correct answer.
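For concreteness, a minimal PyTorch sketch of one such question-guided attention step is given below; the tensor shapes and layer sizes are illustrative assumptions rather than the exact configuration of [38].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SANStep(nn.Module):
    """One stacked-attention step: a question-guided softmax over image regions."""
    def __init__(self, img_dim=512, ques_dim=512, att_dim=256):
        super().__init__()
        self.w_i = nn.Linear(img_dim, att_dim, bias=False)  # project region features
        self.w_q = nn.Linear(ques_dim, att_dim)              # project the query vector
        self.w_p = nn.Linear(att_dim, 1)                     # score each region

    def forward(self, v_img, u_ques):
        # v_img: (batch, regions, img_dim), u_ques: (batch, ques_dim)
        h = torch.tanh(self.w_i(v_img) + self.w_q(u_ques).unsqueeze(1))
        p = F.softmax(self.w_p(h).squeeze(-1), dim=-1)       # attention over regions
        v_att = (p.unsqueeze(-1) * v_img).sum(dim=1)         # attended image summary
        return v_att + u_ques, p                             # refined query and attention map

# Stacking K such steps, feeding the refined query back in, iteratively
# narrows the attended portion of the visual information as described above.
```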

3.1 Multimodal Compact Bilinear Pooling (MCB)

Figure 3:

Illustration of Multimodal Compact Bilinear Pooling (MCB) and its attention mask. We obtain image features using a Convolutional Neural Network and question features using an LSTM. We combine the two using the Fast Fourier Transform and obtain the final attention vector by applying the Inverse Fast Fourier Transform. Then we normalize the attention vector and obtain the final answer.

MCB is another approach to combining the image modality with the language modality. Here, we obtain image features using a Convolutional Neural Network and text features using a Recurrent Neural Network. We combine the two in the Fourier domain using the Fast Fourier Transform and bring the result back with the Inverse Fast Fourier Transform to obtain the attention vector, which is then normalized to obtain the final answer. In compact bilinear pooling, we transform the image features and question features into a common space; from this, we obtain a correlation feature between the image and question features. We then obtain the attention mask as a weighted average of the question and image features. The output of the weighted-average features is fed through a tanh layer followed by a linear layer, which makes their sizes equal. Softmax is applied over the linear layer to get the attention map, which indicates the probability of contribution of each image feature. Finally, this attention map is multiplied by the “conv5” features, and the result is added to the question features to predict the answer. The compact bilinear pooling operation is computed as:

$$z = \mathrm{FFT}^{-1}\big(\mathrm{FFT}(\Psi(v_I)) \odot \mathrm{FFT}(\Psi(v_Q))\big) \qquad (1)$$

where $\Psi$ denotes the Count Sketch projection, $v_I$ and $v_Q$ are the image and question features, and $\odot$ denotes element-wise multiplication.
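A minimal sketch of this compact bilinear pooling operation (Count Sketch projection followed by element-wise multiplication in the FFT domain) is shown below; the 16,000-dimensional output and the hash setup are illustrative assumptions, and in a real model the hashes would be fixed once at construction time.

```python
import torch

def count_sketch(x, h, s, d):
    """Count Sketch projection of x (batch, n) into d buckets using hash h and signs s."""
    y = x.new_zeros(x.size(0), d)
    y.index_add_(1, h, x * s)               # scatter signed features into their buckets
    return y

def mcb_pool(v_img, v_txt, d=16000):
    """Compact bilinear pooling: Count Sketch both inputs, multiply in the FFT domain."""
    n_i, n_t = v_img.size(1), v_txt.size(1)
    # In practice the hashes/signs are sampled once and kept fixed; resampled here for brevity.
    h_i = torch.randint(0, d, (n_i,)); s_i = torch.randint(0, 2, (n_i,)).float() * 2 - 1
    h_t = torch.randint(0, d, (n_t,)); s_t = torch.randint(0, 2, (n_t,)).float() * 2 - 1
    f_i = torch.fft.rfft(count_sketch(v_img, h_i, s_i, d))
    f_t = torch.fft.rfft(count_sketch(v_txt, h_t, s_t, d))
    return torch.fft.irfft(f_i * f_t, n=d)  # joint image-text feature of dimension d
```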
Figure 4: Granular Multi-modal Attention Network. First, we use a CNN to obtain image features and an LSTM to obtain text features. We obtain image attention features by attending with the question features over each object present in the image. Similarly, we obtain text attention features by attending with the image features over every word in the question. Afterwards, we use a Multi-modal Attention network to obtain the final attention, from which the final answer is obtained.

4 Method

The main focus of our approach for solving the visual dialog task is to use a multi-modal attention based method that combines granular image and textual attention maps to improve attention and visual explanation. The key difference in our architecture compared to existing visual dialog architectures is the use of a granular attention mask; the other aspects of the dialog model are retained as is. In particular, we adopt a classification based approach for solving visual dialog, where an image embedding is combined with the question and history embeddings to predict the answer. This is illustrated in figure 4.

4.1 Overview

Our method consists of four parts, as illustrated in figure 4 (a schematic sketch follows the list):

  1. We obtain granular features for the image and question embeddings using a standard pretrained VQA (MCB) model [11].

  2. We obtain a granular image attention feature by attending over the query image with the question and history. Similarly, we obtain a granular text attention feature by attending over the query question and previous history with the image.

  3. Then, we combine these two attention mechanisms using a Multi-modal Attention mechanism. We also evaluate Multimodal Compact Bilinear Pooling and simple concatenation for combining the two attentions, but we observe that the Multi-modal Attention performs better. Finally, we obtain a context encoding vector.

  4. This context embedding is compared with all the candidate answer options to obtain a 100-dimensional vector by reasoning over each answer option with the context embedding vector.
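The four steps above can be summarised by the following schematic forward pass; every module name here (encode_image, granular_image_attention and so on) is a placeholder for a component described later in this section, not an exact implementation.

```python
def gma_forward(image, question, history, answer_options, model):
    """Schematic forward pass of the four-step pipeline (placeholder module names)."""
    # 1. Granular features: object proposals for the image, embeddings for the text.
    v_img = model.encode_image(image)                      # (num_proposals, d)
    q_emb, h_emb = model.encode_text(question, history)

    # 2. Granular attention in each modality.
    img_att = model.granular_image_attention(v_img, q_emb, h_emb)   # GIA
    txt_att = model.granular_text_attention(v_img, q_emb, h_emb)    # GTA

    # 3. Multi-modal fusion of the two attended features.
    context = model.multimodal_attention(img_att, txt_att)          # GMA

    # 4. Score the 100 candidate answers against the context embedding.
    return model.score_options(context, answer_options)             # (100,)
```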

4.2 Granular Multimodal Attention Module

Granularity in both the image and the question is required to answer each round of question in visual dialog. The module is built on two main networks: the Granular Image Attention network and the Granular Text Attention network. In the image attention network, we obtain attention features by attending with the text features over each object present in the image. In the text attention network, we obtain attention features by attending with the image features over each question (history) token. The joint attention model (GMA) combines the image and text attention features to obtain the final answer encoding feature.

4.2.1 Granular Image Attention (GIA) module

In the Granular Image Attention network, we obtain the regions in an image that are relevant for answering a particular question in visual dialog. [27] generates importance maps indicating how salient each pixel is for the model's prediction. Following this explanation approach, we mask different image regions in random combinations to compute the importance of each region for producing the ground-truth answer. The saliency map for a given output class is computed as a weighted sum of random masks, where the weights are the probability scores of that class for the corresponding masked input; we generate the saliency map for the ground-truth class. We use the image features from VGG-16 (CONV-5), given by $f_I \in \mathbb{R}^{k \times d}$, where $k$ is the number of spatial regions and $d = 512$. To produce an importance map, we randomly sample a set of binary masks $M_i$ and probe the model by running it on the masked images $I \odot M_i$, where $\odot$ denotes element-wise multiplication. Then, we take the weighted average of the masks, where the weights are the confidence scores of the ground-truth class, and normalise it by the expectation of $M$:

$$S_{gt} = \frac{1}{\mathbb{E}[M]\cdot N}\sum_{i=1}^{N} p_{gt}(I \odot M_i)\, M_i \qquad (2)$$

where $S_{gt}$ denotes the saliency map for the ground-truth class and $p_{gt}(I \odot M_i)$ is the ground-truth class score for the $i$-th masked image. The map denotes the importance of each image region for predicting the ground-truth class, and we therefore use it as a basis for improving the image-attention network of the baseline model. We treat each round of the visual dialog as a visual question answering (VQA) [34] instance with the history as an external-knowledge context input, so we obtain an image saliency map for each round of the visual dialog. We term this the granular image feature, and the attention applied on this image feature is known as granular image attention.
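A compact sketch of this mask-and-weight procedure, in the spirit of [27], is given below; the number of masks, the grid resolution and the model_score callable (returning the ground-truth answer probability for a masked image) are assumptions, and the nearest-neighbour upsampling is a simplification of the smoothed, shifted masks used in [27].

```python
import numpy as np

def rise_saliency(image, model_score, n_masks=1000, grid=7, p_keep=0.5):
    """Saliency map as the score-weighted average of random binary masks (RISE-style)."""
    H, W = image.shape[:2]
    sal = np.zeros((H, W), dtype=np.float32)
    for _ in range(n_masks):
        # Sample a coarse binary grid and upsample it to image resolution.
        coarse = (np.random.rand(grid, grid) < p_keep).astype(np.float32)
        mask = np.kron(coarse, np.ones((H // grid + 1, W // grid + 1)))[:H, :W]
        score = model_score(image * mask[..., None])   # ground-truth answer probability
        sal += score * mask
    return sal / (n_masks * p_keep)                    # normalise by N * E[M]
```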

We then apply an attention mask to obtain the attention probability of each spatial region:

$$h = \tanh\big(W_I v_b \oplus (W_Q v_q + b_Q)\big), \qquad \alpha = \mathrm{softmax}(W_P h + b_P), \qquad \hat{v}_q = \sum_i \alpha_i\, v_{b,i} \qquad (3)$$

where $v_b$ is the vector representing a bounding-box proposal, $v_q$ is the vector representing the question, $W_I$, $W_Q$, $W_P$ are weight matrices and $b_Q$, $b_P$ are bias terms, $\alpha$ is the attention probability mask, and $\hat{v}_q$ is the image feature attended with the question sequence. Similarly, we obtain an image attention feature vector for the history sequence. Finally, we combine the image feature attended with the question and the image feature attended with the history to obtain the final image-attended feature, as shown in Algorithm 1.
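A sketch of this proposal-level attention is given below, attending the object-proposal features once with the question encoding and once with the history encoding; the feature dimensions and the additive combination of the two attended features are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GranularImageAttention(nn.Module):
    """Attend over object-proposal features with a text query (question or history)."""
    def __init__(self, obj_dim=2048, txt_dim=512, att_dim=512):
        super().__init__()
        self.w_o = nn.Linear(obj_dim, att_dim, bias=False)
        self.w_t = nn.Linear(txt_dim, att_dim)
        self.w_p = nn.Linear(att_dim, 1)

    def attend(self, v_obj, t):
        # v_obj: (batch, num_proposals, obj_dim), t: (batch, txt_dim)
        h = torch.tanh(self.w_o(v_obj) + self.w_t(t).unsqueeze(1))
        alpha = F.softmax(self.w_p(h).squeeze(-1), dim=-1)       # mask over proposals
        return (alpha.unsqueeze(-1) * v_obj).sum(dim=1)

    def forward(self, v_obj, q_enc, h_enc):
        # Combine the question-attended and history-attended image features.
        return self.attend(v_obj, q_enc) + self.attend(v_obj, h_enc)
```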

4.2.2 Granular Text Attention (GTA) module

Similar to section 4.2.1, we find the question words that the model needs to attend to more in order to predict the answer correctly. Suppose we have a question $Q$ with words $w_1, \dots, w_T$. We mask two words at a time and probe the model with the masked question $Q'$; there are $\binom{T}{2}$ possible combinations for $Q'$. Let the ground-truth class probability predicted by the model when probed with the masked question $Q'$ be $p_{gt}(Q')$. We find the masked question for which $p_{gt}(Q')$ is maximal and retrieve the attention map generated by the model when probed with it. Then we multiply the attention probability of each word with the corresponding question token as follows:

$$h_t = \tanh\big(W_v v_I + W_w w_t + b_w\big), \qquad \beta = \mathrm{softmax}(W_P h + b_P), \qquad \hat{q} = \sum_{t=1}^{T} \beta_t\, w_t \qquad (4)$$

where $W_v v_I$ is the vector representation that brings the image feature to a common feature space, $W_w w_t$ is the vector representing word $t$ brought to the same common feature space for $t = 1, \dots, T$, with $T$ the length of the sequence, $\beta$ is the attention probability mask, $W_v$, $W_w$, $W_P$ are weight matrices, $b_w$, $b_P$ are bias terms, and $\hat{q}$ is the attended question feature for the question sequence. Similarly, we obtain an attended feature vector for the history sequence. Finally, we combine the attended question feature with the attended history feature to obtain the final text-attended feature, as shown in Algorithm 1.
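The word-pair masking step described above can be sketched as follows; the model_gt_prob callable and the "&lt;mask&gt;" token are assumptions.

```python
from itertools import combinations

def best_masked_question(tokens, model_gt_prob, mask_token="<mask>"):
    """Mask two words at a time and return the masked question that yields the
    highest ground-truth answer probability, together with the masked indices."""
    best_prob, best_masked, best_pair = -1.0, None, None
    for i, j in combinations(range(len(tokens)), 2):    # all T-choose-2 word pairs
        masked = list(tokens)
        masked[i] = masked[j] = mask_token
        prob = model_gt_prob(masked)                    # P(ground-truth answer | masked Q)
        if prob > best_prob:
            best_prob, best_masked, best_pair = prob, masked, (i, j)
    return best_masked, best_pair, best_prob
```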

4.2.3 Granular Multi-modal Attention (GMA)

The challenge in this module is to combine the image attention (GIA) and text attention (GTA) modalities. We adopt an efficient method, Multimodal Compact Bilinear Pooling (MCB) [11], to combine the image modality with the language modality. This method uses the Fast Fourier Transform to convert the image and text features into the Fourier space, where it combines both modalities using compact bilinear pooling, and then brings the result back to the final attention space using the Inverse Fast Fourier Transform. From this, we obtain a correlation feature between the image and question features. We then obtain the attention mask as a weighted average of the question and image features. The output is fed through a tanh layer followed by a linear layer, which makes their sizes equal. Softmax is applied over the linear layer to obtain the attention map, which indicates the probability of contribution of each image feature. Finally, this attention map is multiplied by the image features to predict the final answer as follows:

$$\gamma = \mathrm{softmax}\big(W_P \tanh\big(W_a\, \mathrm{MCB}(g_{IA}, g_{TA}) + b_a\big) + b_P\big), \qquad c = \sum_j \gamma_j\, g_{IA,j} \qquad (5)$$

where $g_{IA}$ and $g_{TA}$ are the attended vectors from the GIA and GTA networks, $W_a$, $W_P$ are weight matrices and $b_a$, $b_P$ are bias terms, $\gamma$ is the attention probability mask, and $c$ is the final attention vector. In the MCB method, we first project the lower-dimensional inputs $g_{IA}$ and $g_{TA}$ to a 16,000-dimensional space, combine them, and then project the result back into a lower-dimensional space to obtain a joint feature representation. Finally, we minimize the cross-entropy loss over all training examples. The cross-entropy loss between the predicted and ground-truth answers is given by:

$$\mathcal{L} = -\sum_{i=1}^{N} \log \hat{p}_i\big(a_i^{gt}\big) \qquad (6)$$

where $\hat{p}_i(a_i^{gt})$ is the predicted probability of the ground-truth answer for the $i$-th training example and $N$ is the number of training examples.
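A sketch of the fusion and loss described by equations (5) and (6) is given below, reusing the mcb_pool sketch from section 3.1; scoring the candidate answers by a dot product with their embeddings and the projection layer proj are assumptions.

```python
import torch
import torch.nn.functional as F

def gma_fuse_and_score(img_att, txt_att, answer_embs, gt_index, proj):
    """Fuse GIA and GTA features (MCB-style), score the answer options, return the CE loss.
    proj: a linear layer mapping the 16,000-d joint feature to the answer-embedding space."""
    joint = mcb_pool(img_att, txt_att, d=16000)   # compact bilinear pooling (section 3.1 sketch)
    context = proj(joint)                         # (batch, ans_dim) context encoding
    scores = context @ answer_embs.t()            # (batch, 100) candidate-answer scores
    loss = F.cross_entropy(scores, gt_index)      # equation (6)
    return scores, loss
```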
Input: image I; 10 rounds of questions Q; 10 rounds of question–answer pairs H
GMA mechanism:
for each dialog round do
      compute the image embedding g_I ← CNN(I)
      compute the question encoding g_Q ← LSTM(Q)
      compute the history encoding g_H ← LSTM(H)
      for k = 1 to K do                                          ▷ GTA
            attend the k-th question granule with the image (ATTENTION)
            attend the k-th history granule with the image (ATTENTION)
            combine the two into the attended text granule
      pool the attended text granules into g_TA (ATTENTION)
      for l = 1 to L do                                          ▷ GIA
            attend the l-th image granule with the question (ATTENTION)
            attend the l-th image granule with the history (ATTENTION)
            combine the two into the attended image granule
      pool the attended image granules into g_IA (ATTENTION)
      g_att ← MCB_ATT(g_IA, g_TA)                                ▷ granular multi-modal attention
      predict the answer from g_att

procedure ATTENTION(V, t)                    ▷ V: set of image (or text) features, t: guiding feature
      h ← tanh(W_V V ⊕ (W_t t + b_t))
      α ← softmax(W_P h + b_P)
      return the attended feature Σ_j α_j V_j
Algorithm 1: Granular Multi-modal Attention
Model R@1 R@5 R@10 MRR Mean NDCG
SAN(Baseline) 40.93 72.61 83.50 0.5616 5.11 0.4595
HCIAE 41.15 72.67 82.98 0.5763 5.32 0.4671
MCB-att 42.62 74.45 84.33 0.6013 5.13 0.4715
GTA 43.89 77.48 87.21 0.6045 4.81 0.4808
GIA 44.03 78.57 88.60 0.6092 4.68 0.4877
GMA_cat 44.82 80.05 89.33 0.6126 4.26 0.5012
GMA_MCB 45.10 80.48 90.15 0.6187 3.91 0.5095
GMA_MCB-att 45.66 81.62 91.26 0.6234 3.68 0.5168
Table 1: Ablation analysis of our model on the VisDial v1.0 test-std split.

5 Experiments

We evaluate our proposed model in the following ways. First, we evaluate our proposed Granular Multi-modal Attention network against the other variants described in section 5.3. Second, we report rank correlation in Table 2 to analyze the correlation between the attention masks and the gradient class activation maps. Third, we compare our method with state-of-the-art (SOTA) methods in Table 4, such as ‘visdial’ [5]. Finally, we show Grad-CAM [29] visualizations for our model and the baseline (late fusion) model. The quantitative evaluation is conducted using standard retrieval metrics: (1) mean rank, (2) recall@k, and (3) mean reciprocal rank (MRR) of the human response in the returned sorted list.
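For reference, the three retrieval metrics can be computed from the rank of the human response among the candidates as sketched below; this is the standard computation, not code from the original implementation.

```python
import numpy as np

def retrieval_metrics(gt_ranks, ks=(1, 5, 10)):
    """gt_ranks: 1-based rank of the human response among the 100 candidates, per question."""
    ranks = np.asarray(gt_ranks, dtype=np.float64)
    metrics = {f"R@{k}": float((ranks <= k).mean()) for k in ks}   # recall@k
    metrics["MRR"] = float((1.0 / ranks).mean())                   # mean reciprocal rank
    metrics["Mean"] = float(ranks.mean())                          # mean rank (lower is better)
    return metrics
```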

Model Rank Correlation P value EMD
SAN 0.3415 1.7913 0.48
MCB 0.3326 1.7973 0.47
GTA 0.3549 1.7840 0.44
GIA 0.3590 1.7814 0.42
GMA 0.3670 1.7701 0.41
Table 2: Rank correlation, P value and EMD between Grad-CAM maps and attention masks.

5.1 Dataset

We evaluate our proposed approach by conducting experiments on the Visual Dialog dataset [5], which contains human-annotated questions based on images from the MS-COCO dataset. This dataset was developed by pairing two subjects on Amazon Mechanical Turk to chat about an image. One person was assigned the job of a ‘questioner’ and the other person acted as an ‘answerer’. The questioner sees only the text description of an image (the caption from the MS-COCO dataset), and the original image remains hidden to the questioner. Their task is to ask questions about this hidden image to “imagine the scene better”. The answerer sees the image and the caption and answers the questions asked by the questioner. The two of them can continue the conversation by asking and answering questions for at most 10 rounds. We perform experiments on VisDial 1.0, which contains 123k dialogues on COCO-train images for the training split, 2k dialogues on VisDial val2018 images for the validation split, and 8k dialogues on VisDial test-2018 images for the test-standard set. The caption is considered to be the first round in the dialog history.

Model R@10 MRR Mean
GTA (K=8) 86.50 0.5616 5.41
GTA (K=32) 86.48 0.5763 5.32
GTA (K=64) 87.21 0.5813 4.81
GTA (K=128) 87.35 0.5820 4.83
GIA (K=8) 87.52 0.5745 5.01
GIA (K=32) 87.92 0.5792 4.93
GIA (K=64) 88.60 0.5926 4.68
GIA (K=128) 88.63 0.5943 4.23
GMA (K=8) 89.02 0.6026 4.26
GMA (K=32) 89.15 0.6187 3.91
GMA (K=64) 90.12 0.6213 3.63
GMA (K=128) 91.26 0.6234 3.68
Table 3: Ablation analysis of our model with respect to various granular features on the VisDial v1.0 validation set.
Figure 5: Visualization of attentions. The first row shows the SAN attention visualization of the visual dialog model, the second row the MCB-att attention, the third row the GIA attention, the fourth row the GTA attention, and the fifth row the GMA attention. The columns show the attention over dialog rounds 1 to 10.

5.2 Ablation Analysis on Granular Feature

We conduct an experiment with various numbers of granular features for the image and question attention maps, considering K = 8, 32, 64 and 128 granules per image. We observe a significant improvement in accuracy at K = 64, as shown in Table 3. Increasing K further to 128 improves the scores, but not significantly. We therefore select 64 granular objects for the image and likewise for the question. The granular feature distribution for various values of K is shown in Figure 8. From the distribution we observe that the features are correlated and lie between 0.2 and 0.6.

5.3 Ablation Analysis on Attention Network

In this section, we compare our proposed model GMA and its variants with the baseline model using various metrics in Table 1. Each row provides results for one of the variations. The first block provides the scores of our implementation of traditional methods: the stacked attention method (our baseline), the History-Conditioned Image Attentive Encoder (HCIAE) [17], and Multimodal Compact Bilinear Pooling (MCB) [11] based attention. The second block provides scores for our proposed Granular Image (GIA) and Granular Text (GTA) attention models, and the third block provides scores for different variants of our proposed multimodal attention model GMA. It is apparent that our best variant (GMA_MCB-att) outperforms all the other variants, achieving an improvement of 3.57% in R@1, 4.22% in R@5, 3.77% in R@10, 0.0899 in MRR, and 2.27 in mean rank over the baseline variant.

Figure 6: Joint probability distribution of the Grad-CAM mask and the predicted attention mask for (a) K=0, (b) K=8, (c) K=32, (d) K=64 and (e) K=128. We observe that as the overlap (EMD) improves with K increasing from 8 to 128, the joint distribution also improves. The Earth Mover's Distance between them is 0.37.

Model R@1 R@5 R@10 MRR Mean NDCG
Baseline 40.56 71.35 82.83 0.53 5.95 0.450
HRE [5] 39.93 70.45 81.50 0.54 6.41 0.454
LF[5] 40.95 72.45 82.83 0.55 5.95 0.453
MN [5] 40.98 72.30 83.30 0.55 5.92 0.475
MN-att [5] 42.43 74.00 84.35 0.56 5.59 0.476
LF-att [5] 42.85 74.83 85.05 0.57 5.41 0.473
NMN [16] 47.50 78.12 88.81 0.61 4.40 0.540
GMA_cat (ours) 44.82 80.05 89.33 0.61 4.26 0.5012
GMA_MCB (ours) 45.10 80.48 90.15 0.61 3.91 0.5095
GMA_MCB-att (ours) 45.66 81.62 91.26 0.62 3.68 0.5168
Table 4: SOTA results for Visual Dialog v1.0 on the test-standard split.

5.3.1 Rank correlation, P value and EMD

In this section, we measure rank correlation (RC), P value and Earth Mover's Distance (EMD). RC is a measure of monotonicity between two datasets and is usually used for comparing images. The P value indicates the rough probability of an uncorrelated system producing datasets having a rank correlation equal to or greater than the one observed. We provide the quantitative ablation results for rank correlation (higher is better) and P value (lower is better) in Table 2. It is apparent that GMA outperforms the other models in generating attention maps, with a rank correlation of 0.3670 and a P value of 1.7701, followed by GIA and GTA with rank correlations of 0.3590 and 0.3549 and P values of 1.7814 and 1.7840, respectively. EMD is a measure of dissimilarity between two distributions; we also provide quantitative results for EMD in Table 2, where GMA improves over the baseline (SAN) model by 0.07 (0.41 vs. 0.48).
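These quantities can be computed per example from the flattened Grad-CAM map and the predicted attention mask as sketched below; treating the flattened, normalised maps as one-dimensional distributions for the EMD is a simplifying assumption.

```python
import numpy as np
from scipy.stats import spearmanr, wasserstein_distance

def attention_agreement(gradcam_map, attention_map):
    """Spearman rank correlation (with p-value) and a 1-D EMD between two attention maps."""
    g = gradcam_map.ravel().astype(np.float64)
    a = attention_map.ravel().astype(np.float64)
    rho, p_value = spearmanr(g, a)                 # rank correlation and its p-value
    g_dist, a_dist = g / g.sum(), a / a.sum()      # assumes non-negative maps with positive mass
    positions = np.arange(g.size)
    emd = wasserstein_distance(positions, positions, u_weights=g_dist, v_weights=a_dist)
    return rho, p_value, emd
```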

Figure 7: The mean ranks of all the models, computed over all scores, are plotted on the x-axis. CD = 4.0277, p = 0.00008961. Our GMA model and the other variants are described in section 5.3. A colored line connecting two models indicates that they are not significantly different from each other.

5.3.2 Statistical Significance Analysis

We analyze the statistical significance [7] of our GMA model against the other models mentioned in section 5.3. The Critical Difference (CD) for the Nemenyi [9] test depends on the given α (confidence level, 0.05 in our case) for the average ranks and on N (the number of tested datasets). A low difference in ranks between two models implies that they are not significantly different; otherwise, they are statistically different. Figure 7 visualizes the post-hoc analysis using the CD diagram. It is clear that GMA works best and is significantly different from the other methods. Models connected by a single colored line are statistically indistinguishable.
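The critical difference follows the standard formula CD = q_α √(k(k+1)/(6N)) from [7]; a minimal computation is sketched below, with the q_α value and the model/dataset counts supplied by the user.

```python
import math

def nemenyi_critical_difference(q_alpha, num_models, num_datasets):
    """CD = q_alpha * sqrt(k(k+1) / (6N)) for the Nemenyi post-hoc test (Demsar, 2006)."""
    return q_alpha * math.sqrt(num_models * (num_models + 1) / (6.0 * num_datasets))

# Two models whose average ranks differ by more than this CD are considered
# significantly different at the chosen confidence level.
```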

Figure 8: Distribution of the granular features for various values of K (8, 32, 64 and 128).

5.4 Comparison with Other Baselines

A comparison of our method with various state-of-the-art methods on the Visual Dialog dataset v1.0 is provided in Table 4. The first block of the table consists of state-of-the-art visual dialog models and the second block consists of variants of our proposed method; the final row (GMA_MCB-att) is our best proposed model. We compare our results with the baseline model ‘Late-fusion-QIH’ [5]. We obtain an improvement of about 3.77% and 1.55% in R@10 and 9% and 5% in MRR over the baseline model and the best model of Das et al. [5], respectively. We build our GMA model on a basic attention model (SAN [38]) and achieve an improvement of 5% in MRR score. The simple and flexible architecture of our GMA model can also be adapted to more recent attention models.

5.5 Qualitative Result

We provide qualitative results comparing the baseline with our GMA model and the other variants (SAN, GIA, GTA), as shown in Figure 5, for a particular example. It is apparent that our proposed GMA model improves the dialog attention probability over all the other methods. For example, in the first image, the question is “Is this in a park?”. The attention mask for the baseline model is distributed over the complete image, whereas our proposed model mainly focuses on the field, the plants and the background, which provide extra information about the query. We can thus observe that granularity in the image and question helps to increase the confidence of the attention map in the answer. The localisation of the attention map increases from SAN to GMA, as shown in Figure 5: SAN is the least localised and GMA the most. Finally, the GMA attention map is much more localised compared to the GTA and GIA models.

6 Conclusion

In this paper we propose a novel Granular Multi-modal Attention network that jointly attends to appropriately sized granular image regions and granular textual regions using the correct context for each cue. Through a thorough empirical analysis, we observe that the proposed method provides improved attention regions, and that this improved attention consistently improves results on the task of visual dialog. Moreover, the proposed attention regions correlate well with the regions obtained by visualizing the gradients using Grad-CAM. We therefore conclude that we obtain consistent attention regions that aid the network in solving the task of visual dialog. In the future, we aim to further explore the proposed method in more vision and language tasks, and to further explore the idea of obtaining correct semantic granular regions for solving various tasks.

References

  • [1] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, Vol. 3, pp. 6. Cited by: §1.
  • [2] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh (2015) VQA: Visual Question Answering. In International Conference on Computer Vision (ICCV), Cited by: §2.
  • [3] P. Chattopadhyay, D. Yadav, V. Prabhu, A. Chandrasekaran, A. Das, S. Lee, D. Batra, and D. Parikh (2017) Evaluating visual conversational agents via cooperative human-ai games. In Proceedings of the Fifth AAAI Conference on Human Computation and Crowdsourcing (HCOMP), Cited by: §2.
  • [4] A. Das, H. Agrawal, C. L. Zitnick, D. Parikh, and D. Batra (2016) Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions?. In Conference on Empirical Methods in Natural Language Processing (EMNLP), Cited by: §1.
  • [5] A. Das, S. Kottur, K. Gupta, A. Singh, D. Yadav, J. M. Moura, D. Parikh, and D. Batra (2017) Visual dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §2, §5.1, §5.4, Table 4, §5.
  • [6] A. Das, S. Kottur, J. M.F. Moura, S. Lee, and D. Batra (2017) Learning cooperative visual dialog agents with deep reinforcement learning. In IEEE International Conference on Computer Vision (ICCV), Cited by: §2.
  • [7] J. Demšar (2006) Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7 (Jan), pp. 1–30. Cited by: §5.3.2.
  • [8] H. Fang, S. Gupta, F. Iandola, R. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. Platt, et al. (2015) From captions to visual concepts and back. In Proceedings of the IEEE conference on computer vision and pattern recognition, Cited by: §2.
  • [9] D. Fišer, T. Erjavec, and N. Ljubešić (2016) JANES v0. 4: korpus slovenskih spletnih uporabniških vsebin. Slovenščina 2 (4), pp. 2. Cited by: §5.3.2.
  • [10] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach (2016) Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847. Cited by: §2.
  • [11] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach (2016) Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847. Cited by: item 1, §4.2.3, §5.3.
  • [12] J. Hajič (2017) Visual question answering. Cited by: §2.
  • [13] U. Jain, S. Lazebnik, and A. G. Schwing (2018) Two can play this game: visual dialog with discriminative question generation and answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [14] J. Johnson, A. Karpathy, and L. Fei-Fei (2016) Densecap: fully convolutional localization networks for dense captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4565–4574. Cited by: §2.
  • [15] A. Karpathy and L. Fei-Fei (2015) Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3128–3137. Cited by: §2.
  • [16] S. Kottur, J. M. Moura, D. Parikh, D. Batra, and M. Rohrbach (2018) Visual coreference resolution in visual dialog using neural module networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 153–169. Cited by: Table 4.
  • [17] J. Lu, A. Kannan, J. Yang, D. Parikh, and D. Batra (2017) Best of both worlds: transferring knowledge from discriminative learning to a generative visual dialog model. In Advances in Neural Information Processing Systems, pp. 314–324. Cited by: §2, §5.3.
  • [18] J. Lu, J. Yang, D. Batra, and D. Parikh (2016) Hierarchical question-image co-attention for visual question answering. In Advances In Neural Information Processing Systems, pp. 289–297. Cited by: §2.
  • [19] M. Malinowski and M. Fritz (2014) A multi-world approach to question answering about real-world scenes based on uncertain input. In Advances in Neural Information Processing Systems (NIPS), Cited by: §2.
  • [20] N. Mostafazadeh, I. Misra, J. Devlin, M. Mitchell, X. He, and L. Vanderwende (2016) Generating natural questions about an image. arXiv preprint arXiv:1603.06059. Cited by: §2.
  • [21] H. Noh, P. Hongsuck Seo, and B. Han (2016) Image question answering using convolutional neural network with dynamic parameter prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 30–38. Cited by: §2.
  • [22] B. N. Patro, Anupriy, and V. P. Namboodiri (2019) Probabilistic framework for solving visual dialog. ArXiv abs/1909.04800. Cited by: §2.
  • [23] B. N. Patro, M. Lunayach, S. Patel, and V. P. Namboodiri (2019) U-cam: visual explanation using uncertainty based class activation maps. In arXiv preprint arXiv:1908.06306, Cited by: §2.
  • [24] B. Patro and V. P. Namboodiri (2018-06) Differential attention for visual question answering. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [25] B. N. Patro, S. Kumar, V. K. Kurmi, and V. Namboodiri (2018) Multimodal differential network for visual question generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4002–4012. External Links: Link Cited by: §2.
  • [26] B. N. Patro, V. K. Kurmi, S. Kumar, and V. Namboodiri (2018) Learning semantic sentence embeddings using sequential pair-wise discriminator. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 2715–2729. Cited by: §2.
  • [27] V. Petsiuk, A. Das, and K. Saenko (2018) RISE: randomized input sampling for explanation of black-box models. arXiv preprint arXiv:1806.07421. Cited by: §4.2.1.
  • [28] M. Ren, R. Kiros, and R. Zemel (2015) Exploring models and data for image question answering. In Advances in Neural Information Processing Systems (NIPS), pp. 2953–2961. Cited by: §2.
  • [29] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017) Grad-cam: visual explanations from deep networks via gradient-based localization.. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Cited by: §1, §5.
  • [30] K. J. Shih, S. Singh, and D. Hoiem (2016) Where to look: focus regions for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4613–4621. Cited by: §2.
  • [31] R. Socher, A. Karpathy, Q. V. Le, C. D. Manning, and A. Y. Ng (2014) Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association of Computational Linguistics 2 (1), pp. 207–218. Cited by: §2.
  • [32] F. Strub, H. De Vries, J. Mary, B. Piot, A. Courville, and O. Pietquin (2017) End-to-end optimization of goal-driven and visually grounded dialogue systems. arXiv preprint arXiv:1703.05423. Cited by: §2.
  • [33] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan (2015) Show and tell: a neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156–3164. Cited by: §2.
  • [34] Q. Wu, C. Shen, P. Wang, A. Dick, and A. van den Hengel (2017) Image captioning and visual question answering based on attributes and external knowledge. IEEE transactions on pattern analysis and machine intelligence 40 (6), pp. 1367–1381. Cited by: §4.2.1.
  • [35] Q. Wu, P. Wang, C. Shen, I. Reid, and A. van den Hengel (2018) Are you talking to me? reasoned visual dialog generation through adversarial learning. Cited by: §2.
  • [36] H. Xu and K. Saenko (2016) Ask, attend and answer: exploring question-guided spatial attention for visual question answering. In European Conference on Computer Vision, pp. 451–466. Cited by: §2.
  • [37] X. Yan, J. Yang, K. Sohn, and H. Lee (2016) Attribute2image: conditional image generation from visual attributes. In European Conference on Computer Vision, pp. 776–791. Cited by: §2.
  • [38] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola (2016) Stacked attention networks for image question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 21–29. Cited by: §3, §5.4.
  • [39] Y. Zhu, O. Groth, M. Bernstein, and L. Fei-Fei (2016) Visual7w: grounded question answering in images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4995–5004. Cited by: §2.