In the ‘Visual Dialog’ problem, an AI agent has access to an image and must answer a question given the image and the context of the previous conversation. We can gain insight into a method by observing the image regions it focuses on most while answering a question. Recent work has observed that humans also attend to specific regions of an image while answering questions. We therefore expect a strong correlation between focusing on the “right” regions while answering questions and obtaining the better semantic understanding needed to solve the problem; this correlation holds as far as humans are concerned. In this paper we therefore aim to obtain image-based attention regions that correlate better with human attention. Attention is known to be a good approach for solving various problems relating vision and language. However, an interesting evaluation carried out for the visual question answering task observed that attention networks focus on regions different from those used by humans when answering questions. We hypothesize that this could be because they do not operate at the right granularity and do not use the correct context when obtaining attention over image and textual regions.
In this work, we aim to address this problem. We particularly aim to obtain ‘granular regions’ for images using object proposals, as done earlier by Anderson et al., and word-based attention using the appropriate context. We observe that by using all the text to attend to each image proposal (the granular image attention network) and using the whole image to attend to each word (the granular word attention network), one can obtain the appropriate attention regions in image and text respectively. Used in conjunction, these yield a further improvement through granular multi-modal attention networks. Figure 1 illustrates the main idea: the multi-modal attention network obtains attention regions from both image and text, and when combined these provide improved attention regions.
As part of this work, we also carry out a thorough evaluation of the main attention methods proposed in the literature for various vision and language tasks, which provides the ground for properly analyzing them. Additionally, we consider the correlation between visual explanations and attention regions. In the literature, visualization efforts such as Grad-CAM aim to explain the decisions of neural networks by visualizing gradient information. We show that the proposed attention network regions correlate well with the regions obtained using Grad-CAM. To summarize, through this paper we provide the following main contributions:
We propose a granular image-attention (GIA) and granular text-attention (GTA) based approach to obtain improved attention regions for solving visual dialog.
We evaluate three variants of the proposed granular attention networks: one where we obtain only image attention, another where we obtain text-based image attention, and the final proposed method where we combine these attentions using a multi-modal attention method.
We obtain an overall improvement of 6% in NDCG score over the other baseline approaches using the proposed attention model for the visual dialog task.
We provide a thorough empirical analysis of the method, provide visualizations of the attention masks for the granular multi-modal attention network (GMA), and measure the rank correlation of the various attention masks with Grad-CAM regions to ensure that the attention regions correlate with the regions used by the network in solving visual dialog.
2 Related Work
A conversation about an image is known as Visual Dialog, one of the recent challenges at the intersection of vision and language. The task builds on image captioning [31, 33, 15, 8, 14, 37], i.e., describing an image; visual question answering [19, 2, 28, 21, 36, 30, 18], i.e., answering a natural language question about an image; and visual question generation [20, 25], i.e., generating natural language questions about an image, or questions similar to a given question. The multimodal attention mechanism is one of the core mechanisms in such interactive systems. In VQA, attention networks have been proposed in a stacked fashion, modalities have been combined in frequency space, and exemplars have been used to obtain a multimodal attention map. A cross-modal attention network operating at the word and object level has also been proposed, as has an algorithm for minimizing uncertainty to obtain a robust attention map in a multimodal system. Visual dialog requires agents to hold a meaningful conversation about the visual content of an image. It was introduced along with three approaches: late fusion, which concatenates all the history rounds; an attention-based hierarchical LSTM for handling variable-length history; and a memory-based method, which achieved the highest accuracy.
Das et al. proposed a deep reinforcement learning based, end-to-end trained model for Visual Dialog. Strub et al. proposed end-to-end RL optimization and its application to multimodal tasks. Chattopadhyay et al. designed an interactive AI image-guessing game based on visual dialog. Lu et al. and Wu et al. proposed generator- and discriminator-based architectures. Recent work on visual dialog by Jain et al. is based on discriminative question generation and answering, and a probabilistic method has been proposed to generate diverse answers while minimizing uncertainty in answer generation. Various methods have been proposed to handle variable-length history rounds. Das et al. also proposed the ‘Late Fusion’ (LF) method, where question word tokens are concatenated with answer tokens and embedded using an LSTM network; a hierarchical LSTM to handle variable-length history; and a memory network model, which performed best in terms of accuracy. In contrast to these earlier architectures, we address the question of obtaining correct attention regions for solving the task and provide comparisons with the related attention methods.
3 Background: Stacked Attention Network (SAN)
Stacked Attention Network (SAN) is a question-guided attention scheme that learns an attention probability vector over the visual information based on the input question vector. We use this as our reference network. Attention is a weighted average of question features and image features: the combined features are fed into a tanh layer followed by a linear layer, and a softmax applied over the linear layer yields the attention map, which indicates the probability of contribution of each spatial feature. Finally, this attention map is multiplied with the image features and the result is added to the question feature to predict the answer. SAN applies a stack of such attention layers iteratively to narrow down the selected portion of the visual information. Mathematically, one attention hop of SAN is:
$h_A^{(k)} = \tanh\big(W_I^{(k)} V_I \oplus (W_Q^{(k)} u^{(k-1)} + b_A^{(k)})\big)$, $p_I^{(k)} = \mathrm{softmax}\big(W_P^{(k)} h_A^{(k)} + b_P^{(k)}\big)$, $\tilde{v}^{(k)} = \sum_i p_{I,i}^{(k)} v_i$, $u^{(k)} = \tilde{v}^{(k)} + u^{(k-1)}$,
where $u^{(k-1)}$ is the question (query) vector at iteration $k$, $V_I = [v_1, \ldots, v_m]$ is the image feature matrix, and $\oplus$ adds a vector to each column of a matrix. This process is repeated $K$ times to narrow down to the correct answer.
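A minimal NumPy sketch of one SAN attention hop may make the iteration concrete. All sizes and weights below are illustrative assumptions, not the paper's trained parameters:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def san_hop(v_I, u, W_I, W_Q, W_P, b_A, b_P):
    """One stacked-attention hop: attend over m image regions with query u."""
    # h_A = tanh(W_I V_I (+) (W_Q u + b_A)): broadcast the query over regions
    h = np.tanh(W_I @ v_I + (W_Q @ u + b_A)[:, None])  # (k, m)
    p = softmax(W_P @ h + b_P)                          # (m,) attention map
    v_att = v_I @ p                                     # (d,) attended image feature
    return v_att + u, p                                 # refined query, attention

rng = np.random.default_rng(0)
d, m, k = 8, 5, 6                                       # toy sizes (assumptions)
v_I = rng.standard_normal((d, m))                       # image feature matrix
u0 = rng.standard_normal(d)                             # initial question vector
W_I, W_Q = rng.standard_normal((k, d)), rng.standard_normal((k, d))
W_P, b_A, b_P = rng.standard_normal(k), rng.standard_normal(k), 0.0
u1, p1 = san_hop(v_I, u0, W_I, W_Q, W_P, b_A, b_P)      # hop 1
u2, p2 = san_hop(v_I, u1, W_I, W_Q, W_P, b_A, b_P)      # hop 2 narrows attention
```

Each hop produces a probability map over the $m$ regions and refines the query, which is what lets the stack progressively sharpen its focus.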
3.1 Multimodal Compact Bilinear Pooling (MCB)
MCB is another approach to combining the image modality with the language modality. Image features are obtained using a Convolutional Neural Network and text features using a Recurrent Neural Network. Compact bilinear pooling transforms the image and question features into a common space, where they are combined using the Fast Fourier Transform; the Inverse Fast Fourier Transform then yields the final attention vector, which is normalised to obtain the final answer. From this we obtain a correlation feature between image and question features. We then obtain the attention mask from a weighted average of question and image features: the combined features are fed through a tanh layer followed by a linear layer that equalises their sizes, and a softmax over the linear layer gives the attention map, which indicates the probability of contribution of each image feature. Finally, this attention map is multiplied with the “conv5” features and the result is added to the question features to predict the answer.
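A minimal sketch of the count-sketch + FFT combination at the heart of compact bilinear pooling. Toy dimensions throughout: the paper's systems use learned CNN/RNN features and a 16,000-dimensional sketch, whereas every size and random projection here is an illustrative assumption:

```python
import numpy as np

def count_sketch(x, h, s, D):
    """Count sketch Psi(x): y[h[i]] += s[i] * x[i], projecting x into D dims."""
    y = np.zeros(D)
    np.add.at(y, h, s * x)
    return y

def mcb(x, y, hx, sx, hy, sy, D):
    """Compact bilinear pooling: convolve the two sketches via FFT,
    approximating a projection of the outer product of x and y."""
    fx = np.fft.rfft(count_sketch(x, hx, sx, D))
    fy = np.fft.rfft(count_sketch(y, hy, sy, D))
    return np.fft.irfft(fx * fy, n=D)  # element-wise product in Fourier space

rng = np.random.default_rng(1)
n, m, D = 32, 24, 128                      # toy sizes; the paper uses D = 16,000
x, y = rng.standard_normal(n), rng.standard_normal(m)       # image / text vectors
hx, sx = rng.integers(0, D, n), rng.choice([-1.0, 1.0], n)  # fixed random hashes
hy, sy = rng.integers(0, D, m), rng.choice([-1.0, 1.0], m)
z = mcb(x, y, hx, sx, hy, sy, D)           # joint multimodal feature
```

The hash indices and signs are sampled once and kept fixed, so the sketch is a linear map and the whole pooling step remains differentiable with respect to `x` and `y`.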
The main focus of our approach to the visual dialog task is to use a multi-modal attention based method that combines granular image and textual attention maps to improve both attention and visual explanation. The key difference of our architecture from existing visual dialog architectures is the use of granular attention masks; the other aspects of the dialog model are retained as is. In particular, we adopt a classification-based approach to visual dialog in which an image embedding is combined with the question and history embeddings to solve for the answer. This is illustrated in figure 4.
Our method consists of four parts as illustrated in figure 4:
We obtain granular features for the image and question embeddings using a standard pretrained VQA (MCB) model.
We obtain granular image attention features for the question and history based on the query image. Similarly, we obtain granular text attention features for the text based on the query question and previous history.
Then we combine these attention mechanisms using a multi-modal attention mechanism. We also evaluated Multimodal Compact Bilinear Pooling and concatenation for combining the two attentions, but observed that multi-modal attention performs better. Finally, we obtain a context encoding vector.
This context embedding is compared with all the candidate answer options, and a 100-dimensional vector is obtained by reasoning between each answer option and the context embedding vector.
4.2 Granular Multimodal Attention Module
Granularity in the image and the question is required to answer each round of questions in visual dialog. The module is built on two main networks: the Granular Image Attention network and the Granular Text Attention network. In the image attention network, we obtain attention features by attending, using the text features, to each object present in the image. In the text attention network, we obtain attention features by attending, using the image features, to each question (and history) token. The joint attention model (GMA) combines the image and text attention features to obtain the final answer encoding feature.
4.2.1 Granular Image Attention (GIA) module
In the Granular Image Attention network, we obtain the regions of an image relevant for answering a particular question in visual dialog. The RISE explanation approach generates an importance map indicating how salient each pixel is for the model’s prediction. Following this approach, we mask different image regions in random combinations to compute the importance of each region for producing the ground-truth answer. The saliency map for a given output class is computed as a weighted sum of random masks, where the weights are the probability scores of the ground-truth class for the corresponding masks; we generate the saliency map for the ground-truth class. We use the image features from the conv5 layer of VGG-16, $f \in \mathbb{R}^{d \times k}$ with $d = 512$. To produce an importance map, we randomly sample a set of binary masks $M_i$ according to a distribution $\mathcal{D}$ and probe the model by running it on the masked images $I \odot M_i$, where $\odot$ denotes element-wise multiplication. Then we take the weighted average of the masks, where the weights are the confidence scores of the ground-truth class, and normalise it by the expectation of $M$.
$S = \frac{1}{\mathbb{E}[M]\cdot N}\sum_{i=1}^{N} f_c(I \odot M_i)\, M_i$, where $S$ denotes the saliency map for the ground-truth class $c$. The generated map denotes the importance of each image region for predicting the ground-truth class, and we therefore use it as the basis for improving the image attention network of the baseline model. We treat each round of the visual dialog as a visual question answering (VQA) instance with the history as external-knowledge context, so we obtain an image saliency map for each round of the visual dialog. We term this the granular image feature, and the attention applied on it is known as granular image attention. We then apply the attention mask to obtain the attention probability of each spatial region. The entire procedure is as follows:
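The masking-and-averaging procedure can be sketched as follows. A toy scalar “model” stands in for the ground-truth-class score, and the mask granularity, mask count, and Bernoulli keep probability are illustrative assumptions:

```python
import numpy as np

def rise_saliency(model_score, image, num_masks=2000, p_keep=0.5, seed=0):
    """RISE-style saliency: average random binary masks weighted by the
    model's ground-truth-class score on each masked input, normalised by E[M]."""
    rng = np.random.default_rng(seed)
    H, W = image.shape
    sal = np.zeros((H, W))
    for _ in range(num_masks):
        m = (rng.random((H, W)) < p_keep).astype(float)  # M_i ~ Bernoulli(p)
        sal += model_score(image * m) * m                # weight mask by score
    return sal / (num_masks * p_keep)                    # normalise by N * E[M]

# toy "model" (assumption): scores higher when the top-left 2x2 patch is visible
def toy_score(img):
    return img[:2, :2].mean()

img = np.ones((8, 8))
S = rise_saliency(toy_score, img)
# pixels inside the decisive patch accumulate higher saliency than the rest
```

Pixels whose visibility correlates with a high class score end up with larger saliency values, which is exactly the signal the granular image attention builds on.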
$h_i = \tanh\big(W_b v_{b,i} + W_q q + b\big)$, $\alpha = \mathrm{softmax}\big(W_p h + b_p\big)$, $\tilde{v}_q = \sum_i \alpha_i v_{b,i}$,
where $v_{b,i}$ is the vector representing bounding box (object proposal) $i$, $q$ is the vector representing the question, $W_b$, $W_q$, $W_p$ and $b$, $b_p$ are weight matrices and bias terms, $\alpha$ is the attention probability mask, and $\tilde{v}_q$ is the image-attended feature for the question sequence. Similarly, we obtain an image attention feature vector for the history sequence. Finally, we combine the image-attended feature for the question with that for the history to obtain the final image-attended feature, as shown in algorithm 1.
4.2.2 Granular Text Attention (GTA) module
Similar to 4.2.1, we find the question words that the model most needs to attend to in order to predict the answer correctly. Suppose we have a question $Q$ with words $\{w_1, \ldots, w_T\}$. We mask two words at a time and probe the model with the masked question $\hat{Q}_{ij}$, in which words $w_i$ and $w_j$ are masked out; there are $\binom{T}{2}$ possible combinations for $\hat{Q}_{ij}$. Let $p_{ij}$ be the ground-truth class probability predicted by the model when probed with the masked question $\hat{Q}_{ij}$. We find the masked question $\hat{Q}^*$ that maximises $p_{ij}$, and retrieve the attention map generated by the model when probed with $\hat{Q}^*$. We then multiply the attention probability of each word with each question token, as follows:
$h_t = \tanh\big(W_v v + W_w w_t + b\big)$, $\beta = \mathrm{softmax}\big(W_p h + b_p\big)$, $\tilde{q} = \sum_t \beta_t w_t$,
where $W_v v$ brings the image feature to a common feature space, $w_t$ is the vector representing word $t$ brought to the same space, $t = 1, \ldots, T$ with $T$ the length of the sequence, $\beta$ is the attention probability mask, and $W_v$, $W_w$, $W_p$, $b$, $b_p$ are weight matrices and bias terms. $\tilde{q}$ is the attended question feature for the question sequence. Similarly, we obtain an attention feature vector for the history sequence. Finally, we combine the attended question feature with the attended history feature to obtain the final text-attended feature, as shown in algorithm 1.
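The two-word masking search described above can be sketched as follows. The mask token and the toy ground-truth-probability scorer are illustrative assumptions standing in for the trained dialog model:

```python
from itertools import combinations

def best_masked_question(words, gt_prob, mask_token="<mask>"):
    """Probe all C(T,2) two-word maskings; keep the masked question that
    maximises the model's ground-truth answer probability."""
    best, best_p = None, -1.0
    for i, j in combinations(range(len(words)), 2):
        masked = [mask_token if k in (i, j) else w for k, w in enumerate(words)]
        p = gt_prob(masked)
        if p > best_p:
            best, best_p = masked, p
    return best, best_p

# toy scorer (assumption): the model answers best once the distractor
# words 'red' and 'big' are masked out
def toy_gt_prob(q):
    return 0.9 if "red" not in q and "big" not in q else 0.3

masked, p = best_masked_question("is the red car big".split(), toy_gt_prob)
```

The attention map produced under the best masking then indicates which of the remaining words the model actually relies on.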
4.2.3 Granular Multi-modal Attention (GMA)
The challenge of this module is to combine the image attention (GIA) and text attention (GTA) modalities. We adapt the efficient Multimodal Compact Bilinear Pooling (MCB) method to combine the image modality with the language modality. This method uses the Fast Fourier Transform to map the image and text features into Fourier space, combines the two modalities there via compact bilinear pooling, and applies the Inverse Fast Fourier Transform to return to the attention space. From this, we obtain a correlation feature between image and question features. We then obtain the attention mask from a weighted average of question and image features: the combined features are fed through a tanh layer followed by a linear layer that equalises their sizes, and a softmax over the linear layer gives the attention map indicating the probability of contribution of each image feature. Finally, this attention map is multiplied with the image features to predict the final answer, as follows:
$c = \mathrm{MCB}(g_I, g_T)$, $\gamma = \mathrm{softmax}\big(W_p \tanh(W_c c + b_c) + b_p\big)$, $z = \sum_i \gamma_i v_i$,
where $g_I$ and $g_T$ are the attended vectors from the GIA and GTA networks, $W_c$, $W_p$ and $b_c$, $b_p$ are weight matrices and bias terms, $\gamma$ is the attention probability mask, and $z$ is the final attention vector. In the MCB method, we first project the lower-dimensional inputs $g_I$ and $g_T$ into a 16,000-dimensional space, combine them, and later project back into a lower-dimensional space to obtain the joint feature representation. Finally, we minimise the cross-entropy loss over all training examples; the loss between the predicted answer distribution $\hat{y}$ and the ground-truth answer $y$ is $L = -\sum_n \sum_i y_{n,i} \log \hat{y}_{n,i}$.
We evaluate our proposed model in the following ways. First, we evaluate the proposed Granular Multi-modal Attention network against the other variants described in section 5.3. Second, we report rank correlation in table 2 to analyze the correlation between the attention masks and the gradient class activation map. Third, we compare our method with state-of-the-art (SOTA) methods, such as ‘VisDial’, in table 4. Finally, we show the Grad-CAM visualizations for our model and the baseline (late fusion) model. The quantitative evaluation uses standard retrieval metrics: (1) mean rank, (2) recall@k, and (3) mean reciprocal rank (MRR) of the human response in the returned sorted list.
Table 2: Rank correlation, P value and EMD for each attention model.
We evaluate our proposed approach on the Visual Dialog dataset, which contains human-annotated questions based on images from the MS-COCO dataset. The dataset was collected by pairing two subjects on Amazon Mechanical Turk to chat about an image. One person was assigned the role of ‘questioner’ and the other acted as ‘answerer’. The questioner sees only a text description of the image (its MS-COCO caption); the original image remains hidden to the questioner. The questioner’s task is to ask questions about this hidden image to “imagine the scene better”. The answerer sees the image and caption and answers the questions asked by the questioner. The two can continue the conversation by asking and answering questions for at most 10 rounds. We perform experiments on VisDial v1.0, which contains 123k dialogues on COCO-train for the training split, 2k dialogues on VisDial val2018 images for the validation split, and 8k dialogues on VisDial test2018 for the test-standard split. The caption is considered the first round of the dialog history.
5.2 Ablation Analysis on Granular Feature
We conduct an experiment with varying numbers of granular features for the image and question attention maps, using K = 8, 32, 64, 128 granular regions per image. At K = 64 we observe a significant improvement in the accuracy score, as shown in table 3; increasing to K = 128 improves it further, but not significantly. We therefore set the number of granular objects to 64 for both the image and the question. The granular feature distribution for various values of K is shown in figure 8; from the distribution we can observe that the features are correlated and lie between 0.2 and 0.6.
5.3 Ablation Analysis on Attention Network
Here we compare our proposed GMA model and its variants with the baseline model on various metrics in table 1. Each row gives the results for one of the variants. The first block gives scores of our implementation of traditional methods: the stacked attention method (our baseline), the History-Conditioned Image Attentive Encoder (HCIAE), and Multimodal Compact Bilinear Pooling (MCB) based attention. The second block gives scores for our proposed Granular Image (GIA) and Granular Text (GTA) attention models, and the third block gives scores for different variants of our proposed multimodal attention model GMA. Our best variant, GMA(MCB-att), outperforms all the others, achieving improvements over the baseline variant of 3.57% in R@1, 4.22% in R@5, 3.77% in R@10, 0.0899% in MRR, and 2.27% in mean rank.
Joint probability distribution of the Grad-CAM mask and the predicted attention mask for (a) K=0, (b) K=8, (c) K=32, (d) K=64 and (e) K=128. We observe that as the overlap (EMD) improves from K=8 to K=128, the joint distribution also improves; the Earth Mover’s Distance between the two masks is 0.37.
5.3.1 Rank correlation, P value and EMD
In this section, we measure rank correlation (RC), P value, and Earth Mover’s Distance (EMD). RC measures the monotonicity between two datasets and is commonly used for comparing attention maps. The P value gives the rough probability that an uncorrelated system would produce datasets with a rank correlation at least as high as that observed. We provide the quantitative ablation results for rank correlation (higher is better) and P value (lower is better) in table 2. MCB outperforms the other models in generating attention maps, with a rank correlation of 0.3570 and P value of 1.7781, followed by GIA and GTA with rank correlations of 0.3509 and 0.3519 and P values of 1.7820 and 1.7834 respectively. EMD measures the dissimilarity between two distributions; quantitative EMD results are also given in table 2, showing a 7% improvement over the baseline (SAN) model.
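For concreteness, here is a simple NumPy sketch of both measures on flattened attention maps. The rank transform ignores tie correction and the EMD is the 1-D transport distance over shared bins, so this is only a proxy for the paper's exact computation; the stand-in maps are random:

```python
import numpy as np

def spearman_rc(a, b):
    """Spearman rank correlation between two flattened attention maps
    (simple illustrative version, no tie correction)."""
    ra = a.ravel().argsort().argsort().astype(float)
    rb = b.ravel().argsort().argsort().astype(float)
    return np.corrcoef(ra, rb)[0, 1]

def emd_1d(p, q):
    """Earth Mover's Distance between two distributions on the same bins:
    total mass times distance moved, via the cumulative-sum formulation."""
    p, q = p.ravel() / p.sum(), q.ravel() / q.sum()
    return np.abs(np.cumsum(p - q)).sum()

rng = np.random.default_rng(0)
gradcam = rng.random((14, 14))                     # stand-in Grad-CAM map
attention = gradcam + 0.1 * rng.random((14, 14))   # attention close to it
rc = spearman_rc(attention, gradcam)
emd = emd_1d(attention, gradcam)
# high rc and low emd indicate well-aligned attention and explanation maps
```

A rank correlation near 1 with a small EMD means the attention mask picks out essentially the same regions the gradient-based explanation does.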
5.3.2 Statistical Significance Analysis
We analyze the statistical significance of our GMA model against the other models mentioned in section 5.3. The Critical Difference (CD) for the Nemenyi test depends on the given confidence level (0.05 in our case), the average ranks, and N, the number of tested datasets. If the difference in average ranks of two models is smaller than the CD, they are not significantly different; otherwise, they are statistically different. Figure 7 visualizes the post hoc analysis using the CD diagram. It is clear that GMA works best and is significantly different from the other methods; models connected by a single colored line are statistically indistinguishable.
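The critical difference is computed as $CD = q_\alpha \sqrt{k(k+1)/(6N)}$. A small sketch using the standard two-tailed Nemenyi critical values at $\alpha = 0.05$; the model and dataset counts in the example are illustrative assumptions:

```python
import math

# two-tailed Nemenyi critical values q_alpha at alpha = 0.05
# (from Demsar's statistical-comparison tables; covers k <= 7 models)
Q_ALPHA_05 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850, 7: 2.949}

def nemenyi_cd(k, N):
    """Critical difference in average ranks for k models over N datasets."""
    return Q_ALPHA_05[k] * math.sqrt(k * (k + 1) / (6.0 * N))

# e.g. 5 attention variants compared over 20 evaluation splits (toy numbers)
cd = nemenyi_cd(5, 20)
# two models differ significantly if their average ranks differ by more than cd
```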
5.4 Comparison with Other Baselines
The comparison of our method with various state-of-the-art methods on the Visual Dialog v1.0 dataset is provided in table 4. The first block of the table contains state-of-the-art visual dialog models and the second block contains variants of our proposed method; the final row, GMA(MCB-att), is our best model. We compare our results with the baseline ‘Late-Fusion-QIH’ model. We obtain improvements of about 3.77% and 1.55% in R@10 and 9% and 5% in MRR over the baseline model and the best model of Das et al., respectively. We build our GMA model on the basic attention model (SAN) and achieve an improvement of 5% in MRR score. The simple and flexible architecture of our GMA model can also be adapted to more recent attention models.
5.5 Qualitative Result
We provide qualitative results distinguishing the baseline from our GMA model and the other variants (SAN, GIA, GTA) in figure 5 for a particular example. It is apparent that our proposed GMA model improves the dialog attention probability over all the other methods. For example, in the first image the question was “Is this in a park?”. The attention mask of the baseline model is distributed over the complete image, whereas our proposed model focuses mainly on the field, plants, and background, which provide extra information about the query. We can thus observe that granularity in the image and question helps to increase the confidence of the attention map in the answer. The localisation of the attention map increases from SAN to GMA, as shown in figure 5: SAN is the least localised, and the GMA attention map is much more localised than those of the GTA and GIA models.
In this paper we propose a novel Granular Multi-modal Attention Network that jointly attends to appropriately sized granular image attention regions and granular textual regions using the correct context for each cue. We observe that the proposed method provides improved attention regions, as evaluated through a thorough empirical analysis, and that this improved attention consistently improves results for the task of visual dialog. Moreover, the proposed attention regions correlate well with the regions obtained by visualizing the gradients using Grad-CAM. We therefore consider that we obtain consistent attention regions that aid the network in solving the task of visual dialog. In future work, we aim to explore the proposed method in further vision and language based tasks, and to further develop the idea of obtaining correct semantic granular regions for solving various tasks.
-  (2018) Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, Vol. 3, pp. 6. Cited by: §1.
VQA: Visual Question Answering. International Conference on Computer Vision (ICCV), Cited by: §2.
Evaluating visual conversational agents via cooperative human-ai games. In Proceedings of the Fifth AAAI Conference on Human Computation and Crowdsourcing (HCOMP), Cited by: §2.
Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions?. Conference on Empirical Methods in Natural Language Processing (EMNLP), Cited by: §1.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §2, §2, §5.1, §5.4, Table 4, §5.
-  (2017) Learning cooperative visual dialog agents with deep reinforcement learning. In IEEE International Conference on Computer Vision (ICCV), Cited by: §2.
Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7 (Jan), pp. 1–30. Cited by: §5.3.2.
-  (2015) From captions to visual concepts and back. In Proceedings of the IEEE conference on computer vision and pattern recognition, Cited by: §2.
-  (2016) JANES v0.4: a corpus of Slovenian user-generated web content. Slovenščina 2 (4), pp. 2. Cited by: §5.3.2.
-  (2016) Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847. Cited by: §2, item 1, §4.2.3, §5.3.
-  (2017) Visual question answering. Cited by: §2.
-  (2018) Two can play this game: visual dialog with discriminative question generation and answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
Densecap: fully convolutional localization networks for dense captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4565–4574. Cited by: §2.
-  (2015) Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3128–3137. Cited by: §2.
-  (2018) Visual coreference resolution in visual dialog using neural module networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 153–169. Cited by: Table 4.
-  (2017) Best of both worlds: transferring knowledge from discriminative learning to a generative visual dialog model. In Advances in Neural Information Processing Systems, pp. 314–324. Cited by: §2, §5.3.
-  (2016) Hierarchical question-image co-attention for visual question answering. In Advances In Neural Information Processing Systems, pp. 289–297. Cited by: §2.
-  (2014) A multi-world approach to question answering about real-world scenes based on uncertain input. In Advances in Neural Information Processing Systems (NIPS), Cited by: §2.
-  (2016) Generating natural questions about an image. arXiv preprint arXiv:1603.06059. Cited by: §2.
-  (2016) Image question answering using convolutional neural network with dynamic parameter prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 30–38. Cited by: §2.
-  (2019) Probabilistic framework for solving visual dialog. ArXiv abs/1909.04800. Cited by: §2.
-  (2019) U-cam: visual explanation using uncertainty based class activation maps. In arXiv preprint arXiv:1908.06306, Cited by: §2.
-  (2018-06) Differential attention for visual question answering. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
-  (2018) Multimodal differential network for visual question generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4002–4012. External Links: Cited by: §2.
-  (2018) Learning semantic sentence embeddings using sequential pair-wise discriminator. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 2715–2729. Cited by: §2.
-  (2018) RISE: randomized input sampling for explanation of black-box models. arXiv preprint arXiv:1806.07421. Cited by: §4.2.1.
-  (2015) Exploring models and data for image question answering. In Advances in Neural Information Processing Systems (NIPS), pp. 2953–2961. Cited by: §2.
-  (2017) Grad-CAM: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Cited by: §1, §5.
-  (2016) Where to look: focus regions for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4613–4621. Cited by: §2.
-  (2014) Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association of Computational Linguistics 2 (1), pp. 207–218. Cited by: §2.
-  (2017) End-to-end optimization of goal-driven and visually grounded dialogue systems. arXiv preprint arXiv:1703.05423. Cited by: §2.
-  (2015) Show and tell: a neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156–3164. Cited by: §2.
-  (2017) Image captioning and visual question answering based on attributes and external knowledge. IEEE transactions on pattern analysis and machine intelligence 40 (6), pp. 1367–1381. Cited by: §4.2.1.
-  (2018) Are you talking to me? reasoned visual dialog generation through adversarial learning. Cited by: §2.
-  (2016) Ask, attend and answer: exploring question-guided spatial attention for visual question answering. In European Conference on Computer Vision, pp. 451–466. Cited by: §2.
-  (2016) Attribute2image: conditional image generation from visual attributes. In European Conference on Computer Vision, pp. 776–791. Cited by: §2.
-  (2016) Stacked attention networks for image question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 21–29. Cited by: §3, §5.4.
-  (2016) Visual7w: grounded question answering in images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4995–5004. Cited by: §2.