Answering questions about images requires an understanding of the image. We can gain insight into a method by observing the image regions it focuses on while answering a question. Recent work has observed that humans also attend to specific regions of an image while answering questions. We therefore expect a strong correlation between focusing on the “right” regions while answering questions and obtaining the better semantic understanding needed to solve the problem. This correlation exists as far as humans are concerned. In this paper we therefore aim to obtain image-based attention regions that correlate better with human attention. We do so by obtaining a differential attention, which relies on an exemplar model of cognition.
In cognition studies, exemplar theory suggests that humans generalise when solving cognitive tasks by relying on an exemplar model. In this model, individuals compare new stimuli with instances already stored in memory and obtain answers based on these exemplars. We would like an exemplar model to provide attention. Specifically, we want to focus on the parts of a nearest exemplar that distinguish it from a far example. We do that by obtaining the differential attention region that distinguishes a supporting exemplar from an opposing exemplar. Our premise is that the difference between a nearest semantic exemplar and a far semantic exemplar can guide attention to a specific image region. We show that this differential attention mechanism yields a significant improvement on the visual question answering task. Further, we show that the obtained attention regions are more correlated with human attention regions, both quantitatively and qualitatively. We evaluate this on the challenging VQA-1, VQA-2 and HAT datasets.
The main flow of the method is outlined in figure 1. Given an image and an associated question, we use an attention network to combine the image and question and obtain a reference attention embedding. This embedding is used to order the examples in the database. We obtain a near example as a supporting exemplar and a far example as an opposing exemplar. These are used to obtain a differential attention vector. We evaluate two variants of our approach. The first, which we term the ‘differential attention network’ (DAN), uses the supporting and opposing exemplars only to provide a better attention on the image. The second, which we term the ‘differential context network’ (DCN), obtains a differential context feature. This feature is obtained from the difference between the supporting and opposing exemplars with respect to the original image, and the additional context is used in answering questions. Both variants improve results over the baseline, with the differential context variant performing better.
Through this paper we provide the following contributions:
- We adopt an exemplar-based approach to improve visual question answering (VQA) methods by providing a differential attention.
- We evaluate two variants for obtaining differential attention: one where we only obtain attention, and the other where we obtain differential context in addition to attention.
- We show that this method correlates better with human attention and results in improved visual question answering that advances the state of the art for image-based attention methods. It is also competitive with other proposed methods for this problem.
2 Related Work
Visual Question Answering (VQA) is a recent problem that was initiated as a new kind of visual Turing test. The aim was to show the progress of systems in solving tasks even more challenging than traditional visual recognition tasks such as object detection and segmentation. An initial work in this area was by Geman et al., who proposed this visual Turing test. Around the same time, Malinowski et al. proposed a multi-world based approach to obtain questions and answer them from images. These works aimed at answering questions of a limited type. In this work we aim at answering free-form open-domain questions, as was attempted by later works.
An initial approach towards solving this problem in the open-domain form was inspired by the work on neural machine translation, which framed translation as a sequence-to-sequence encoder-decoder problem. However, subsequent works approached the problem as a classification problem using encoded embeddings. They used softmax classification over an image embedding (obtained by a CNN) and a question embedding (obtained using an LSTM). Further work by Ma et al. varied the way the embedding is obtained by using CNNs for both image and question embeddings. Another interesting approach used dynamic parameter prediction, where the weights of the CNN model for the image embedding are modified based on the question embedding using hashing. These methods, however, are not attention based. Attention enables us to focus on the specific parts of an image or question that are pertinent for an instance, and also offers valuable insight into the performance of the system.
There has been significant interest in including attention to solve the VQA problem. Attention-based models comprise image-based attention models, question-based attention models and some that use both image and question-based attention. In the image-based attention approach the aim is to use the question to focus attention on specific regions in an image. An interesting recent work has shown that it is possible to repeatedly obtain attention by using stacked attention over an image based on the question. Our work is closely related to this work. There has been further work that considers a region-based attention model over images. Image-based attention has allowed systematic comparison of various methods and has enabled analysis of the correlation with human attention models. In our approach we focus on image-based attention using differential attention and show that it correlates better with human attention. There have been a number of interesting works on question-based attention as well. One interesting work obtains a varied set of modules for answering questions of different types. Recent work also explores joint image and question-based hierarchical co-attention. The idea of differential attention can also be explored through these approaches. However, we restrict ourselves to image-based attention, as our aim is to obtain a method that correlates well with human attention. There has also been an interesting work that advocates multimodal pooling and obtains state-of-the-art results in VQA. Interestingly, we show that combining it with the proposed method further improves our results.
In this paper we adopt a classification framework that uses the image embedding combined with the question embedding to solve for the answer using a softmax function in a multiple-choice setting. A similar setting is adopted in the Stacked Attention Network (SAN), which also aims at obtaining better attention, and in several other state-of-the-art methods. We provide two different variants for obtaining differential attention in the VQA system. We term the first variant a ‘Differential Attention Network’ (DAN) and the other a ‘Differential Context Network’ (DCN). We explain both methods in the following sub-sections. A common requirement for both is to obtain nearest semantic exemplars.
3.1 Finding Exemplars
In our method, we use semantic nearest neighbors. Image-level similarity does not suffice, as the nearest neighbor may be visually similar but may not share the context implied by the question (for instance, ‘Are the children playing?’ retrieves similar results for images with children based on visual similarity, whether the children are playing or not). To obtain semantic features we use a VQA system to provide a joint image-question embedding that relates meaningful exemplars. We compared image-level features against the semantic nearest neighbors and observed that the semantic nearest neighbors were better. We used the semantic nearest neighbors in a k-nearest neighbor approach, using a K-D tree data structure to represent the features. The dataset features are ordered by Euclidean distance. In section 4.1 we provide an evaluation with several values of the number of nearest neighbors used as supporting exemplars. To obtain the opposing exemplar we used a far neighbor that was an order of magnitude farther than the nearest neighbor. We obtained this through a coarse quantization of the training data into bins, and specified the opposing exemplar as one that was around 20 clusters away in a 50 cluster ordering. This parameter is not stringent; it only matters that the opposing exemplar be far from the supporting exemplar. We show that using these supporting and opposing exemplars aids the method and that a random ordering adversely affects it.
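The exemplar selection described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in the paper: it replaces the K-D tree with a brute-force Euclidean ordering, treats the "50 cluster ordering" as 50 equal-sized bins of that ordering, and the names `rank_by_distance`, `pick_exemplars`, `num_bins` and `far_bin` are hypothetical.

```python
import math

def rank_by_distance(query, database):
    """Return database indices ordered by Euclidean distance to the query."""
    return sorted(range(len(database)),
                  key=lambda i: math.dist(query, database[i]))

def pick_exemplars(query, database, k=4, num_bins=50, far_bin=20):
    """Pick k supporting (nearest) and k opposing (far) exemplar indices.

    num_bins and far_bin mirror the "around 20 clusters away in a 50
    cluster ordering" description; equal-sized bins are an assumption.
    """
    order = rank_by_distance(query, database)
    supporting = order[:k]
    bin_size = max(1, len(order) // num_bins)
    # Start of the far bin, clamped so that k indices are still available.
    start = max(0, min(far_bin * bin_size, len(order) - k))
    opposing = order[start:start + k]
    return supporting, opposing

# Toy usage: 100 one-dimensional "embeddings" at positions 0, 1, ..., 99.
database = [(float(i),) for i in range(100)]
supporting, opposing = pick_exemplars((-0.5,), database, k=2)
```

In practice the embeddings would be the joint image-question features from the baseline VQA model, and a spatial index such as a K-D tree would replace the full sort for large databases.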
3.2 Differential Attention Network (DAN)
In the DAN method, we use a multi-task setting. As one of the tasks we use a triplet loss to learn a distance metric. This metric ensures that the distance between the attention weighted regions of near examples is small and the distance between the attention weighted regions of far examples is large. The other task is the main task of VQA. More formally, given an image we obtain an image embedding using a CNN, parameterized by the weights of the CNN. Similarly, the question results in a question embedding after passing through an LSTM, parameterized by the weights of the LSTM. This is illustrated in part 1 of figure 2. The image embedding and question embedding are used in an attention network that combines them with a weighted softmax function and produces an output attention weighted vector $s_i$. The attention mechanism is illustrated in figure 2. The weights of this network are learnt end-to-end using two losses, a triplet loss and a softmax classification loss for the answer (shown in part 3 of figure 2). The aim is to obtain attention weight vectors that bring the image attention close to the supporting exemplar attention and keep it far from the opposing exemplar attention. The joint loss function used for training is given by:
Here the model parameters are shared by the two loss functions, and the loss is computed from the output class label and the input sample. It runs over the total number of classes in VQA (the set of all output classes, including color, count, etc.) and the total number of samples. The first term is the classification loss and the second term is the triplet loss, weighted by a constant that controls the ratio between the classification loss and the triplet loss. The triplet loss function $T$ is decomposed into two terms, one that brings the positive sample closer and one that pushes the negative sample farther. It is given by

$$T(s_i, s_i^+, s_i^-) = \max\left(0,\ \|t(s_i) - t(s_i^+)\|_2^2 + \alpha - \|t(s_i) - t(s_i^-)\|_2^2\right)$$

The constant $\alpha$ controls the separation margin between supporting and opposing exemplars. Both constants, the loss ratio and $\alpha$, are obtained through validation data.
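The joint loss described in words above can be written out as a worked equation. The following is a sketch consistent with that description rather than a verbatim formula: the symbols $\theta$ (model parameters), $y$ (class label), $C$ (number of classes), $N$ (number of samples), $\nu$ (ratio constant) and $L_{\mathrm{cross}}$ are names assumed here for illustration, and a standard cross-entropy is assumed for the classification term. Only $T$, $s_i$, $s_i^+$ and $s_i^-$ are taken from the text.

```latex
\begin{align}
L(s, y, \theta) &= \frac{1}{N} \sum_{i=1}^{N}
  \Big( L_{\mathrm{cross}}(s_i, y_i) + \nu \, T(s_i, s_i^{+}, s_i^{-}) \Big) \\
L_{\mathrm{cross}}(s_i, y_i) &= - \sum_{j=1}^{C} y_{ij} \, \log p(c_j \mid s_i)
\end{align}
```

Setting $\nu = 0$ in this form recovers a plain classification objective, which makes the role of the ratio constant easy to see.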
The method is illustrated in figure 2. We further extend the model to a quintuplet setting where we bring two supporting attention weights closer and two opposing attention weights further in a metric learning setting. We observe in section 4.1 that this further improves the performance of the DAN method.
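To make the triplet term concrete, here is a minimal sketch in plain Python. It assumes the mapping t(·) is the identity, uses an arbitrary margin of 0.5 (the text obtains the margin through validation), and reads the quintuplet extension as an average of triplet terms over (supporting, opposing) pairs. That last reading, along with the names `multi_exemplar_loss`, `joint_loss` and `nu`, is an assumption made for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(s, s_pos, s_neg, alpha=0.5):
    """Hinge on: d(s, s_pos) + alpha - d(s, s_neg), with t(.) = identity."""
    pull = squared_distance(s, s_pos)   # should become small
    push = squared_distance(s, s_neg)   # should become large
    return max(0.0, pull + alpha - push)

def multi_exemplar_loss(s, positives, negatives, alpha=0.5):
    """Assumed quintuplet-style extension: mean triplet loss over pairs."""
    pairs = list(zip(positives, negatives))
    return sum(triplet_loss(s, p, n, alpha) for p, n in pairs) / len(pairs)

def joint_loss(classification_loss, metric_loss, nu=1.0):
    """Classification loss plus a weighted metric-learning loss."""
    return classification_loss + nu * metric_loss

# Toy attention vectors.
anchor = [0.0, 0.0]
near = [1.0, 0.0]
far = [3.0, 0.0]
too_close = [0.5, 0.0]   # a "far" exemplar that violates the margin

satisfied = triplet_loss(anchor, near, far)        # margin already met
violated = triplet_loss(anchor, near, too_close)   # margin violated
quintuplet = multi_exemplar_loss(anchor, [near, near], [far, too_close])
```

The loss is zero once the opposing exemplar is farther than the supporting exemplar by at least the margin, so only violating triplets contribute gradients.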
[Tables 1 and 2: analysis of DAN and DCN network parameters. Columns: VQA1.0 Open-Ended (test-dev), HAT val dataset. Baseline row: LSTM Q+I+Attention (LQIA).]
3.3 Differential Context Network (DCN)
We next consider the other variant that we propose, where the differential context feature is added to the representation instead of being used only for obtaining attention. The first two parts are the same as for the DAN network. In part 1, we use the image, the supporting exemplar and the opposing exemplar and obtain the corresponding image and question embeddings. This is followed by obtaining attention vectors for the image, the supporting exemplar and the opposing exemplar. While in DAN these were trained using a triplet loss function, in DCN we obtain two context features, the supporting context and the opposing context. This is shown in part 3 of figure 3. The supporting context is obtained using the following equation
Here the product between attention vectors is the dot product, which yields correlations between the attention vectors.
The first term of the supporting context is the vector projection of the supporting exemplar attention $s_i^+$ on the image attention $s_i$, and the second term is the vector projection of the opposing exemplar attention $s_i^-$ on $s_i$. Similarly, for the opposing context we compute the vector projections of $s_i^+$ on $s_i$ and of $s_i^-$ on $s_i$. The idea is that the projection measures the similarity between vectors that are related. We subtract the vectors that are not related from the resultant. In doing so, we enhance similarity and remove only the feature component that is not similar to the original semantic embedding. This equation provides the additional feature that is supporting and relevant for answering the current question for the image.
Similarly, the opposing context is obtained by the following equation
We next compute the difference between the supporting and opposing context features, which provides us with the differential context feature. This is then either added to the original attention vector (DCN-Add) or multiplied with the original attention vector (DCN-Mul), providing us with the final differential context attention vector. This final attention weight vector is multiplied with the image embedding to obtain the vector that is then used with the classification loss function. This is shown in part 4 of figure 3. The resultant attention is observed to be better than the earlier differential attention feature obtained through DAN, as the features are also used as context.
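The projection-based construction above can be sketched as follows. This is one reading of the description under explicit assumptions, not the paper's exact formula: the supporting context is taken as the sum of the projections of the two exemplar attention vectors onto the image attention vector, the opposing context as the sum of what remains of each exemplar vector after removing that projection, and the scaling `weight` stands in for the fixed or learned scaling weight. All function names are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def project(v, onto):
    """Vector projection of v onto `onto` (zero vector if `onto` is zero)."""
    norm_sq = dot(onto, onto)
    if norm_sq == 0:
        return [0.0 for _ in onto]
    scale = dot(v, onto) / norm_sq
    return [scale * x for x in onto]

def differential_context(s, s_pos, s_neg):
    """Assumed form: supporting context minus opposing context."""
    proj_pos = project(s_pos, s)
    proj_neg = project(s_neg, s)
    # Supporting context: components of both exemplars aligned with s.
    supporting = [p + n for p, n in zip(proj_pos, proj_neg)]
    # Opposing context: components of both exemplars not aligned with s.
    opposing = [(a - pa) + (b - pb)
                for a, pa, b, pb in zip(s_pos, proj_pos, s_neg, proj_neg)]
    return [u - v for u, v in zip(supporting, opposing)]

def dcn_attention(s, context, mode="add", weight=1.0):
    """Combine attention and differential context (DCN-Add or DCN-Mul)."""
    if mode == "add":
        return [a + weight * c for a, c in zip(s, context)]
    if mode == "mul":
        return [a * weight * c for a, c in zip(s, context)]
    raise ValueError("mode must be 'add' or 'mul'")

# Toy attention vectors for the image, supporting and opposing exemplars.
s = [1.0, 0.0]
s_pos = [2.0, 1.0]
s_neg = [0.0, 3.0]
context = differential_context(s, s_pos, s_neg)
added = dcn_attention(s, context, mode="add")
multiplied = dcn_attention(s, context, mode="mul")
```

In this toy case the context is positive along the direction shared with the image attention and negative along the direction that only the exemplars occupy, which is the intended "enhance similar, suppress dissimilar" behaviour.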
The network is trained end-to-end using the following soft-max classification loss function
| LSTM Q+I | 53.7 | 78.9 | 35.2 | 36.4 |
| LSTM Q+I+Attention (LQIA) | 56.1 | 80.3 | 37.4 | 40.4 |
MCB + att
| DCN Add_v2 (K=1) + LQIA | 53.07 | 70.46 | 34.30 | 44.10 |
| DCN Mul_v1 (K=1) + LQIA | 53.18 | 70.24 | 34.53 | 44.24 |
The experiments have been conducted using the two proposed variants of differential attention, compared against baselines on standard datasets. We first analyze the different parameters for the two variants, DAN and DCN. We then evaluate the two networks against comparable baselines and against state-of-the-art methods. The main evaluation measures the correlation of the obtained attention with human attention, where we obtain state-of-the-art results. Further, we observe that accuracy on the VQA task is substantially improved and is competitive with the current state-of-the-art results on standard benchmark datasets. We also analyse the performance of the network on the recently proposed VQA-2 dataset.
4.1 Analysis of Network Parameters
In the proposed DAN network, the result depends on the number of nearest neighbors (k) considered. We observe in table 1 that using 4 nearest neighbors in the triplet network gives the highest correlation with human attention as well as the highest accuracy on the VQA-1 dataset. We therefore use 4 nearest neighbors in our experiments. Increasing the number of nearest neighbors beyond 4 reduces accuracy. Further, even a single nearest neighbor yields a substantial improvement, which increases marginally as we move to 4 nearest neighbors.
We also evaluate the effect of using the nearest neighbors obtained through a baseline model versus using a random assignment of supporting and opposing exemplars. We observe that using DAN with a random set of nearest neighbors decreases the performance of the network. While comparing the network parameters, the comparable baseline we use is the basic model for VQA using an LSTM and a CNN. This model does not use attention, so we also evaluate it with attention. With the best set of parameters, the correlation with human attention improves by 10.64%. Correspondingly, the VQA performance improves by 4.1% over the comparable baseline. We then combine this model with MCB, which is a state-of-the-art VQA model. This improves the VQA result by a further 4.8% and the correlation with human attention by a further 1.2%.
In the proposed DCN network we have two different configurations: one that uses an add module (DCN-add) to incorporate the differential context feature and one that uses a multiplication module (DCN-mul). The DCN network also depends on the number of nearest neighbors, which we likewise vary. We next evaluate the effect of using a fixed scaling weight (DCN_v1) for adding the differential context feature against learning a linear scaling weight (DCN_v2). All these parameter results are compared in table 2.
As can be observed from table 2, the configuration that obtains the maximum accuracy on the VQA dataset and the maximum correlation with human attention is the version that uses multiplication with learned weights and 4 nearest neighbors. This results in an improvement of 11% in correlation with human attention and a 4.8% improvement in accuracy on the VQA-1 dataset. We also observe that incorporating DCN with MCB improves the results by a further 4.5% on the VQA dataset and by a further 1.47% in correlation with human attention. These configurations are used in the comparison with the baselines.
| Model | Rank correlation |
| LSTM Q+I+Attention (LQIA) | 0.214 ± 0.001 |
| HieCoAtt-P | 0.256 ± 0.004 |
| MCB + Att. | 0.279 ± 0.004 |
| DAN (K=4) + LQIA | 0.321 ± 0.001 |
| DCN Mul_v2 (K=4) + LQIA | 0.324 ± 0.001 |
| DAN (K=4) + MCB | 0.332 ± 0.001 |
| DCN Mul_v2 (K=4) + MCB | 0.338 ± 0.001 |
4.2 Comparison with baseline and state of the art
We first compare with the baselines in terms of rank correlation on the human attention (HAT) dataset, which provides human attention obtained through a region deblurring task while solving for VQA. Between humans the rank correlation is 62.3%. The comparison of various state-of-the-art methods and baselines is provided in table 5. The baseline we use is the method we used for obtaining exemplars. It uses a question embedding from an LSTM and an image embedding from a CNN. We additionally consider a variant of the same model that uses attention. We have also obtained results for the stacked attention network. The results for the Hierarchical Co-Attention work are taken from those reported in Das et al. In terms of rank correlation with human attention, we obtain an improvement of around 10.7% over the comparable baseline using the DAN network with 4 nearest neighbors (0.214 to 0.321), and an improvement of around 11% using the DCN network with 4 neighbors, the multiplication module and learned scaling weights (0.214 to 0.324). We also obtain an improvement of around 6% over the Hierarchical Co-Attention work, which uses co-attention on both image and questions. Further incorporating MCB improves the results for both DAN and DCN, resulting in an improvement of 7.4% over the Hierarchical Co-Attention work and a 5.9% improvement over the MCB method. However, as noted in prior work, a saliency-based method that is trained on eye tracking data to predict where people look in a task-independent manner achieves a higher correlation with human attention (0.49). That method is explicitly trained on human attention and is not task dependent. In our approach, we aim to obtain a method that can simulate human cognitive abilities for solving tasks.
We next evaluate the different baseline and state-of-the-art methods on the VQA dataset in table 3. A number of methods have been proposed for this benchmark dataset for evaluating the VQA task. Among the notable methods, the Hierarchical Co-Attention work obtains 61.8% accuracy on the VQA task, the dynamic parameter prediction method obtains 57.2% and the stacked attention network obtains 58.7%. The differential context network performs well, outperforming all the image-based attention methods with an accuracy of 60.9%. This is a strong result, and the performance improves across different kinds of questions. Further, on combining the method with MCB, we obtain improved results of 65% and 65.4% using DAN and DCN respectively, improving over the results of MCB by 1.2%. This is consistent with the improved correlation with human attention that we observe in table 5.
We next evaluate the proposed method on the recently proposed VQA-2 dataset. The aim of this new dataset is to remove the bias in different questions, which makes it more challenging than the previous VQA-1 dataset. We compare the proposed DAN and DCN methods against the stacked attention network (SAN). As can be observed in table 4, the proposed methods improve over this strong stacked attention baseline. DCN with 4 nearest neighbors, when combined with MCB, obtains an accuracy of 65.90%.
4.3 Attention Visualization
The main aim of the proposed method is to obtain an improved attention that correlates better with human attention. Hence we visualize the attention regions and compare them. In attention visualization we overlay the attention probability distribution matrix on the image, highlighting the most prominent part of the image for the query question. The procedure followed is the same as that of Das et al. We provide the results of the attention visualization in figure 4. We obtain a significant improvement in attention by using DCN as compared to the SAN method. Figure 5 shows how the supporting and opposing attention maps help to improve the reference attention using DAN and DCN. More results for attention map visualization are provided on the project website (https://badripatro.github.io/DVQA/).
In this section we further discuss different aspects of our method that are useful for understanding it in more detail.
We first consider how exemplars improve attention. In the differential attention network, we use the exemplars and train them using a triplet network. It is known from prior work on triplet networks that we can learn a representation that accentuates how the image is closer to the supporting exemplar than to the opposing exemplar. The attention is obtained between the image and language representations. Therefore the improved image representation helps in obtaining an improved attention vector. In DCN the same approach is used, with the change that the differential exemplar feature is also included in the image representation using projections. More analysis of how the methods qualitatively improve attention is included on the project website.
We next consider whether improved attention implies improved performance. In our empirical analysis we observed both improved attention and improved accuracy on the VQA task. While there could be other ways of improving performance on VQA (as suggested by MCB), these can be additionally incorporated with the proposed method, and they do yield improved performance in VQA.
Lastly we consider whether image (I) and question embedding (Q) are both relevant. We had considered this issue and had conducted experiments by considering I only, by considering Q only, and by considering nearest neighbor using the semantic feature of both Q&I. We had observed that the Q&I embedding from the baseline VQA model performed better than other two. Therefore we believe that both contribute to the embedding.
In this paper we propose two different variants for obtaining differential attention for solving the problem of visual question answering. These are differential attention network (DAN) and differential context network (DCN). Both the variants provide significant improvement over the baselines. The method provides an initial view of improving VQA using an exemplar based approach. In future, we would like to further explore this model and extend it to joint image and question based attention models.
We acknowledge the help provided by our Delta Lab members and our family who have supported us in our research activity.
-  J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Learning to compose neural networks for question answering. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), 2016.
-  S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. VQA: Visual Question Answering. In International Conference on Computer Vision (ICCV), 2015.
-  A. Das, H. Agrawal, C. L. Zitnick, D. Parikh, and D. Batra. Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions? In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2016.
-  A. Frome, Y. Singer, F. Sha, and J. Malik. Learning globally-consistent local distance functions for shape-based image retrieval and classification. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pages 1–8. IEEE, 2007.
-  A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. In Conference on Empirical Methods in Natural Language Processing (EMNLP), Austin, Texas, USA, 2016.
-  D. Geman, S. Geman, N. Hallonquist, and L. Younes. Visual turing test for computer vision systems. Proceedings of the National Academy of Sciences of the United States of America, 112(12):3618–3623, 03 2015.
-  Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2017.
-  E. Hoffer and N. Ailon. Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pages 84–92. Springer, 2015.
-  C. Huang, Y. Li, C. Change Loy, and X. Tang. Learning deep representation for imbalanced classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5375–5384, 2016.
-  F. Jäkel, B. Schölkopf, and F. Wichmann. Generalization and similarity in exemplar models of categorization: Insights from machine learning. Psychonomic Bulletin and Review, 15(2):256–271, Apr. 2008.
-  T. Judd, K. Ehinger, F. Durand, and A. Torralba. Learning to predict where humans look. In Computer Vision, 2009 IEEE 12th international conference on, pages 2106–2113. IEEE, 2009.
-  J.-H. Kim, K. W. On, W. Lim, J. Kim, J.-W. Ha, and B.-T. Zhang. Hadamard Product for Low-rank Bilinear Pooling. In The 5th International Conference on Learning Representations, 2017.
-  R. Li and J. Jia. Visual question answering with question representation update (qru). In Advances in Neural Information Processing Systems, pages 4655–4663, 2016.
-  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
-  J. Lu, X. Lin, D. Batra, and D. Parikh. Deeper lstm and normalized cnn visual question answering model. https://github.com/VT-vision-lab/VQA_LSTM_CNN, 2015.
-  J. Lu, J. Yang, D. Batra, and D. Parikh. Hierarchical question-image co-attention for visual question answering. In Advances In Neural Information Processing Systems, pages 289–297, 2016.
-  L. Ma, Z. Lu, and H. Li. Learning to answer questions from image using convolutional neural network. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
-  M. Malinowski and M. Fritz. A multi-world approach to question answering about real-world scenes based on uncertain input. In Advances in Neural Information Processing Systems, 2014.
-  M. Malinowski, M. Rohrbach, and M. Fritz. Ask your neurons: A neural-based approach to answering questions about images. In International Conference on Computer Vision (ICCV), 2015.
-  J. H. McDonald. Handbook of biological statistics, volume 2. 2009.
-  H. Noh, P. Hongsuck Seo, and B. Han. Image question answering using convolutional neural network with dynamic parameter prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 30–38, 2016.
-  M. Ren, R. Kiros, and R. Zemel. Exploring models and data for image question answering. In Advances in Neural Information Processing Systems, pages 2953–2961, 2015.
-  F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.
-  R. Shepard. Toward a universal law of generalization for psychological science. Science, 237(4820):1317–1323, 1987.
-  K. J. Shih, S. Singh, and D. Hoiem. Where to look: Focus regions for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4613–4621, 2016.
-  I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems, NIPS’14, pages 3104–3112, 2014.
-  K. Q. Weinberger and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10(Feb):207–244, 2009.
-  C. Xiong, S. Merity, and R. Socher. Dynamic memory networks for visual and textual question answering. In Proceedings of International Conference on Machine Learning (ICML), 2016.
-  H. Xu and K. Saenko. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In European Conference on Computer Vision, pages 451–466. Springer, 2016.
-  Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked attention networks for image question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 21–29, 2016.
-  Y. Zhu, O. Groth, M. Bernstein, and L. Fei-Fei. Visual7w: Grounded question answering in images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4995–5004, 2016.
Appendix A Experimental Setup
We have conducted our experiments on two types of dataset. The first is the VQA dataset, which contains human-annotated questions and answers based on images from the MS-COCO dataset. The second is the HAT dataset, which is based on attention maps.
A.1.1 VQA dataset
The VQA dataset is one of the largest datasets for the VQA benchmark so far. It is built on complex images from the MS-COCO dataset. The VQA dataset contains a total of 204721 images, of which 82783 are for training, 40504 for validation and 81434 for testing. Each image in the MS-COCO dataset is associated with 3 questions, and each question has 10 possible answers. The dataset is annotated by different people. There are therefore 248349 QA pairs for training, 121512 QA pairs for validation and 244302 QA pairs for testing. We use the 1000 most frequent answers as our possible answer set, as is commonly done. This covers 82.67% of the train+val answers.
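The split sizes quoted above follow directly from the image counts and the three questions per image. A small check, with variable names chosen here only for illustration:

```python
# Image counts per split, as stated in the text.
images = {"train": 82783, "val": 40504, "test": 81434}
QUESTIONS_PER_IMAGE = 3

# Each image contributes three question-answer pairs.
qa_pairs = {split: n * QUESTIONS_PER_IMAGE for split, n in images.items()}
total_images = sum(images.values())
```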
A.1.2 VQA-HAT (Human Attention) dataset
We used the VQA-HAT dataset, which was developed using a de-blurring task for answering visual questions. This dataset contains human attention maps for 58475 training examples out of the 248349 in the VQA training set, and for 1374 validation examples out of the 121512 question-image pairs in the VQA validation set.
a.2 Evaluation methods
a.2.1 VQA dataset
The VQA dataset contains 3 types of answers: yes/no, number and other. The evaluation is carried out on two test splits, i.e. test-dev and test-standard. The questions in each test split are answered in two ways: Open-Ended and Multiple-Choice. The Open-Ended task requires generating a natural language answer in the form of a single word or phrase. For each question, 10 candidate answers are provided with their respective confidence levels. Our model generates a single-word answer on the Open-Ended task. This answer is evaluated using the accuracy metric provided by Antol et al. as follows:
$$\text{Acc} = \frac{1}{N}\sum_{i=1}^{N}\min\left(\frac{\sum_{t\in T^{i}}\mathbb{I}[a_i = t]}{3},\; 1\right)$$
where $a_i$ is the predicted answer, $t$ is an annotated answer in the target answer set $T^{i}$ of the $i$-th example, and $\mathbb{I}[\cdot]$ is the indicator function. The predicted answer is counted as fully correct if at least 3 annotators agree with it. Otherwise, the accuracy score depends on the number of annotators that agree with the answer. Before computing accuracy, the predicted answer is converted to lowercase, numbers are converted to digits, and punctuation and articles are removed.
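The metric described above can be sketched in a few lines. This is a minimal illustration, not the official evaluation script: the helper names (`normalize`, `example_accuracy`, `vqa_accuracy`) and the small number-word table are our own assumptions, and the official normalization handles more cases.

```python
import re

ARTICLES = {"a", "an", "the"}
# Tiny illustrative number-word table (assumption); a real evaluator covers more.
NUMBER_WORDS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
                "five": "5", "six": "6", "seven": "7", "eight": "8",
                "nine": "9", "ten": "10"}

def normalize(answer):
    """Lowercase, strip punctuation, map number words to digits, drop articles."""
    text = re.sub(r"[^\w\s]", "", answer.lower())
    words = [NUMBER_WORDS.get(w, w) for w in text.split()]
    return " ".join(w for w in words if w not in ARTICLES)

def example_accuracy(predicted, human_answers):
    """min(number of annotators agreeing with the prediction / 3, 1)."""
    pred = normalize(predicted)
    matches = sum(1 for t in human_answers if normalize(t) == pred)
    return min(matches / 3.0, 1.0)

def vqa_accuracy(predictions, targets):
    """Mean per-example accuracy over a list of predictions and answer sets."""
    scores = [example_accuracy(p, t) for p, t in zip(predictions, targets)]
    return sum(scores) / len(scores)
```

For instance, a prediction matched by only 2 of the 10 annotators receives a score of 2/3, while a prediction matched by 3 or more receives 1.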
a.2.2 HAT dataset
We use rank correlation to evaluate the agreement between the human attention map and the DAN attention probability. We first scale the human attention map down to 14x14 so that it has the same size as the DAN attention probability, and then compute the rank correlation between the two. Rank correlation measures the degree of association between two sets of data, and its value lies between -1 and +1. A value close to 1 indicates positive correlation, a value close to -1 indicates negative correlation, and a value close to 0 indicates no correlation. A higher value of rank correlation is better.
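As an illustration of the comparison described above, the following sketch computes a rank correlation between two equally sized 2-D maps. It assumes the Spearman form of rank correlation (Pearson correlation of the ranks, with ties given their average rank); the text does not spell out the exact variant, and the function names are ours. The maps are assumed to be non-constant, otherwise the correlation is undefined.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = avg
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def rank_correlation(map_a, map_b):
    """Spearman correlation between two 2-D maps of equal shape (e.g. 14x14)."""
    flat_a = [v for row in map_a for v in row]
    flat_b = [v for row in map_b for v in row]
    return pearson(average_ranks(flat_a), average_ranks(flat_b))
```

Because only the ordering of values matters, the two maps do not need to be normalized to the same scale before comparison.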
a.3 Training and Model Configuration
We trained the differential attention model using a joint loss in an end-to-end manner. To train the classification network, we used the RMSProp optimizer to update the model parameters, with the following hyper-parameter values: learning rate = 0.0004, batch size = 200, alpha = 0.99 and epsilon = 1e-8. To train the triplet model, we also used RMSProp, with learning rate = 0.001, batch size = 200, alpha = 0.9 and epsilon = 1e-8. We used learning rate decay to decrease the learning rate on every epoch by a factor given by:
where the values a=1500 and b=1250 are set empirically. The choice of the training controlling factor plays a major role during training. A factor equal to 1 means that the triplet and classification network parameters are updated at the same rate. A factor greater than 1 means that the triplet network is updated more frequently than the classification network. Since the triplet loss decreases much more slowly than the classification loss, we used a factor greater than 1, fixed at 10.
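To make the role of the optimizer hyper-parameters concrete, here is a minimal sketch of a single RMSProp update using the two configurations quoted above. It follows the common formulation in which epsilon is added after the square root; the exact variant used in the experiments is not stated, and the function and configuration names are hypothetical.

```python
import math

def rmsprop_step(params, grads, cache, lr, alpha, eps):
    """One RMSProp update; returns new parameters and squared-gradient cache."""
    new_params, new_cache = [], []
    for p, g, c in zip(params, grads, cache):
        c = alpha * c + (1.0 - alpha) * g * g   # running mean of squared gradients
        p = p - lr * g / (math.sqrt(c) + eps)   # scaled gradient step
        new_params.append(p)
        new_cache.append(c)
    return new_params, new_cache

# Hyper-parameters quoted in the text (batch size is not needed for one step).
CLASSIFIER_CFG = dict(lr=0.0004, alpha=0.99, eps=1e-8)
TRIPLET_CFG = dict(lr=0.001, alpha=0.9, eps=1e-8)

params, cache = [1.0, -2.0], [0.0, 0.0]
grads = [0.5, -0.5]
params, cache = rmsprop_step(params, grads, cache, **CLASSIFIER_CFG)
```

A smaller alpha, as in the triplet configuration, makes the running average react faster to recent gradient magnitudes.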
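|Model|Rank correlation (map 1)|Rank correlation (map 2)|Rank correlation (map 3)|Average|
|---|---|---|---|---|
|DAN (K=1, Random)|0.1238|0.1070|0.1163|0.1157|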
Appendix B Results
In this section we provide additional results that were omitted from the main paper due to space limitations.
b.1 Rank Correlation results for DAN and DCN
In this sub-section we provide a few additional columns that were omitted from the results table in the main paper. Table 6 provides the rank correlation statistics on the HAT validation dataset for various differential attention networks (DAN) and differential context networks (DCN). The DAN network is varied by changing the number of nearest supporting and opposing exemplars. We ran experiments with K=1, 2, 3, 4, 5 and with random selection of nearest and farthest neighbors.
This table also reports rank correlation for two types of DCN, i.e. DCN Add and DCN Mul. Each type of network has two training methods: one uses fixed scaling weights (e.g. DCN Mul) and the other uses learnable scaling weights (e.g. DCN Mul_v1). The rank correlation statistics in this table indicate that learnable scaling weights perform better than fixed weights. Further, we observed that the multiplication network performs better than the addition network for differential context. We ran experiments for K=1, 2, 3, 4, but this table only shows the results for K=1 and K=4 nearest and farthest neighbors.
We computed the rank correlation between the attention probability of the differential network (DAN or DCN) and the human attention provided by HAT for the validation set. The table contains 3 rank correlations, one for each of the 3 human attention maps per image in the HAT validation dataset. The first attention map gives better accuracy than the other two. Finally, we take the average of the three rank correlations for each model. We observe that the attention maps of all our models correlate positively with human attention.
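The averaging step can be illustrated with the three per-map rank correlations reported for DAN (K=1, Random), namely 0.1238, 0.1070 and 0.1163. The snippet is only a worked check of the arithmetic; the variable names are ours.

```python
# Rank correlations for the three HAT attention maps of DAN (K=1, Random).
per_map = [0.1238, 0.1070, 0.1163]
average = sum(per_map) / len(per_map)
print(round(average, 4))  # 0.1157
```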
b.2 How important are the supporting and contrasting exemplar?
We carried out an experiment that uses only the supporting exemplar in the triplet loss of equation 2, and obtained consistent results, as shown in figure 6. From the rank correlation results, we conclude that using only the supporting exemplar yields most of the performance gain. The quantitative results of this ablation analysis are shown in table 7, which provides the rank correlation on the HAT validation dataset.
|Model|Rank correlation|
|---|---|
|DAN (K=4) + LQIA|0.312 ± 0.001|
|DAN (K=4) + MCB|0.320 ± 0.001|
b.3 Contribution of different term in DCN
We carried out an experiment that drops the vector projection term in the supporting context of equation 3 and the vector rejection term in the opposing context of equation 4, and obtained consistent results, as shown in figure 7. The contribution of these terms to the corresponding equations is very small. The quantitative results of this ablation analysis are shown in table 8, which provides the rank correlation on the HAT validation dataset.
|Model|Rank correlation|
|---|---|
|DCN Mul_v2 (K=4) + LQIA|0.319 ± 0.001|
|DCN Mul_v2 (K=4) + MCB|0.3287 ± 0.001|
b.4 Attention Visualization DAN and DCN with Supporting and Opposing Exemplar
The first row of figure 8 shows the target image along with a supporting and an opposing image. The second row provides the human attention map and the reference, supporting, opposing, DAN and DCN attention maps, respectively. The third row gives the corresponding attention visualization for all the images. Given the target image and the question "what unusual topping has been added to the hot dog", the reference model's attention map (second row of figure 8) falls somewhere in the yellow part, which differs from the ground-truth human attention map (also in the second row). With the help of the supporting and contrasting exemplar attention maps, the reference model's attention is improved, as shown by the DAN and DCN attention maps in the same row. The attention map of the DCN model is more correlated with the ground-truth human attention map than that of the reference model. Thus we observe that VQA accuracy improves with the help of the supporting and contrasting exemplars. Figure 9 provides further attention visualizations for DAN and DCN with the help of supporting and contrasting attention.
b.5 Attention visualization of DCN with various Human Attention Maps
We compute the rank correlation of our DAN and DCN exemplar models with all three ground-truth human attention maps provided by the VQA-HAT validation dataset, and visualize the attention maps alongside all three ground-truth human attention maps, as shown in figures 10 and 11. We observe that human attention map 1 is better than attention maps 2 and 3 in terms of both visualization and rank correlation, as shown in figures 10 and 11.
b.6 Attention Visualization of DAN and DCN
We provide the attention visualization results in figures 12 and 13. As can be observed there, using the exemplar model (DCN or DAN) yields a significant improvement in the rank correlation of the attention map compared to the SAN method. The DAN and DCN attention maps are more correlated with human attention and have better rank correlation than the SAN attention map.
Appendix C Algorithm for Differential Attention
Algorithm 2 for differential attention illustrates the dimensions of the inputs and outputs.
Appendix D Details of Triplet and Quintuplet Network
d.1 Triplet Model
The concept of triplet loss is motivated by large margin nearest neighbor classification: it minimizes the distance between the target and supporting features and maximizes the distance between the target and contrasting features. Here $f(s_i)$ is the embedding feature of the $i$-th training image example in $n$-dimensional Euclidean space.
$f(s_i)$: The embedding of the target
$f(s_i^+)$: The embedding of the supporting exemplar
$f(s_i^-)$: The embedding of the contrasting exemplar
The objective of the triplet loss is to ensure that the supporting feature and the target have the same identity, while the target and the contrasting feature have different identities. In other words, it brings every supporting feature closer to the target feature than any contrasting feature:
$$D(f(s_i), f(s_i^+)) + \alpha < D(f(s_i), f(s_i^-)), \quad \forall\, (f(s_i), f(s_i^+), f(s_i^-)) \in T$$
where $D(f(s_i), f(s_j)) = \|f(s_i) - f(s_j)\|_2^2$ is defined as the (squared) Euclidean distance between $f(s_i)$ and $f(s_j)$, and $\alpha$ is the margin between the supporting and contrasting features. The default value of $\alpha$ is 0.2. $T$ is the training set, which contains all possible triplets. The objective function for the triplet loss is given by
$$T(s_i, s_i^+, s_i^-) = \max\left(0,\; \|f(s_i) - f(s_i^+)\|_2^2 + \alpha - \|f(s_i) - f(s_i^-)\|_2^2\right)$$
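The gradient of the squared L2 norm is given by
$$\frac{\partial}{\partial x}\|x - y\|_2^2 = 2\,(x - y)$$
When the loss is positive, the gradient of the loss w.r.t. the "Supporting" input $f(s_i^+)$ is:
$$\frac{\partial T}{\partial f(s_i^+)} = 2\,\big(f(s_i^+) - f(s_i)\big)$$
The gradient of the loss w.r.t. the "Opposing" input $f(s_i^-)$ is:
$$\frac{\partial T}{\partial f(s_i^-)} = 2\,\big(f(s_i) - f(s_i^-)\big)$$
The gradient of the loss w.r.t. the "Target" input $f(s_i)$ is:
$$\frac{\partial T}{\partial f(s_i)} = 2\,\big(f(s_i^-) - f(s_i^+)\big)$$
When the loss is zero, all three gradients are zero.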
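The triplet objective $\max(0, \|f(s_i)-f(s_i^+)\|_2^2 + \alpha - \|f(s_i)-f(s_i^-)\|_2^2)$ and its gradients can be checked numerically. The sketch below operates directly on embedding vectors given as plain Python lists; the function names are ours, and the gradients are those obtained by differentiating the objective with respect to each embedding, with a finite-difference helper for verification.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(target, support, oppose, margin=0.2):
    """Hinge on d(target, support) + margin - d(target, oppose)."""
    return max(0.0, sq_dist(target, support) + margin - sq_dist(target, oppose))

def triplet_grads(target, support, oppose, margin=0.2):
    """Analytic gradients w.r.t. the (target, support, oppose) embeddings."""
    if triplet_loss(target, support, oppose, margin) <= 0.0:
        zero = [0.0] * len(target)
        return zero, list(zero), list(zero)
    g_target = [2.0 * (o - s) for s, o in zip(support, oppose)]
    g_support = [2.0 * (s - t) for t, s in zip(target, support)]
    g_oppose = [2.0 * (t - o) for t, o in zip(target, oppose)]
    return g_target, g_support, g_oppose

def numeric_grad(fn, x, h=1e-6):
    """Central finite-difference gradient of fn at x."""
    grad = []
    for i in range(len(x)):
        up, dn = list(x), list(x)
        up[i] += h
        dn[i] -= h
        grad.append((fn(up) - fn(dn)) / (2 * h))
    return grad

t, s, o = [0.1, 0.2], [0.4, 0.1], [0.3, 0.5]
g_t, g_s, g_o = triplet_grads(t, s, o)
```

Note that the gradient with respect to the target does not depend on the target itself: it simply points from the supporting embedding towards the opposing one.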
d.2 Quintuplet Model
Unlike the triplet model, this model considers two supporting and two opposing images along with the target image. We select the supporting and opposing images by clustering: the 2000 nearest neighbors are divided into 20 clusters based on their distance from the target image, so that the first cluster has the minimum mean distance from the target and the 20th cluster has the maximum mean distance from the target.
The quintuplet consists of the following five embeddings:
-  The embedding of the target
-  The embedding of the supporting exemplar from cluster 1
-  The embedding of the opposing exemplar from cluster 20
-  The embedding of the supporting exemplar from cluster 2
-  The embedding of the opposing exemplar from cluster 19
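The exemplar selection described above can be sketched as follows. The text does not specify the clustering procedure, so this sketch makes two loud assumptions: the 2000 nearest neighbors are sorted by distance and split into 20 equal-sized bins, and the first member of each bin is taken as the exemplar. All function names are hypothetical, and at least 2000 candidate distances are assumed.

```python
def split_into_clusters(distances, num_neighbors=2000, num_clusters=20):
    """Return lists of neighbor indices, ordered from nearest to farthest cluster.

    Assumption: equal-sized bins over the distance-sorted nearest neighbors.
    """
    nearest = sorted(range(len(distances)), key=lambda i: distances[i])[:num_neighbors]
    size = len(nearest) // num_clusters
    return [nearest[c * size:(c + 1) * size] for c in range(num_clusters)]

def pick_quintuplet_exemplars(distances):
    """Pick one exemplar index from clusters 1, 2, 19 and 20 (1-based)."""
    clusters = split_into_clusters(distances)
    return {
        "support_cluster_1": clusters[0][0],
        "support_cluster_2": clusters[1][0],
        "oppose_cluster_19": clusters[18][0],
        "oppose_cluster_20": clusters[19][0],
    }
```

With equal-sized bins over sorted distances, the mean distance of each cluster increases monotonically from cluster 1 to cluster 20, which matches the ordering described in the text.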
The objective of the quintuplet loss is to bring the cluster 1 supporting feature closer to the target feature than the cluster 2 supporting feature, which in turn should be closer than the cluster 19 opposing feature, which in turn should be closer than the cluster 20 opposing feature.
Here three margins are defined between the corresponding pairs of features, and T is the training set, which contains all possible quintuplets.
The objective function for the quintuplet loss is defined as:
subject to:
where the objective involves the slack variables, the parameters of the attention network and a regularizing control parameter. The values of the three margins, set experimentally, are 0.006, 0.2 and 0.006.
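One way to read the ordering constraint with three margins is as a chain of hinge terms, each playing the role of a slack variable. The sketch below is an assumption-laden illustration, not the formulation used in the paper: it assumes squared Euclidean distances, assumes that the margins 0.006, 0.2 and 0.006 apply in that order to the successive pairs (cluster 1, cluster 2), (cluster 2, cluster 19) and (cluster 19, cluster 20), and omits the regularization term on the network parameters.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def quintuplet_hinge(target, sup1, sup2, opp19, opp20,
                     margins=(0.006, 0.2, 0.006)):
    """Sum of hinge penalties for violations of the distance ordering.

    Desired ordering (assumed): d(sup1) < d(sup2) < d(opp19) < d(opp20),
    each separated by the corresponding margin.
    """
    d = [sq_dist(target, e) for e in (sup1, sup2, opp19, opp20)]
    return sum(max(0.0, d[k] + m - d[k + 1]) for k, m in enumerate(margins))
```

Under this reading, the loss is zero exactly when every successive pair of distances respects its margin, and each violated pair contributes the amount by which it falls short.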