U-CAM
Visual Explanation using Uncertainty based Class Activation Maps
view repo
Understanding and explaining deep learning models is an imperative task. Towards this, we propose a method that obtains gradient-based certainty estimates that also provide visual attention maps. Particularly, we solve for visual question answering task. We incorporate modern probabilistic deep learning methods that we further improve by using the gradients for these estimates. These have two-fold benefits: a) improvement in obtaining the certainty estimates that correlate better with misclassified samples and b) improved attention maps that provide state-of-the-art results in terms of correlation with human attention regions. The improved attention maps result in consistent improvement for various methods for visual question answering. Therefore, the proposed technique can be thought of as a recipe for obtaining improved certainty estimates and explanation for deep learning models. We provide detailed empirical analysis for the visual question answering task on all standard benchmarks and comparison with state of the art methods.
READ FULL TEXT VIEW PDFVisual Explanation using Uncertainty based Class Activation Maps
To interpret and explain the deep learning models, many approaches have been proposed. One of the approaches uses probabilistic techniques to obtain uncertainty estimates, [17, 18]. Other approaches aim at obtaining visual explanations through methods such as Grad-CAM [9] or by attending to specific regions using hard/soft attention. With the recent probabilistic deep learning techniques by Gal and Ghahramani [17], it became feasible to obtain uncertainty estimates in a computationally efficient manner. This was further extended to data uncertainty and model uncertainty based estimates [25]. Through this work, we focus on using gradients uncertainty losses to improve attention maps while also enhancing the explainability leveraging the Bayesian nature of our approach. The uncertainties that we use are aleatoric and predictive[26].
For the estimated uncertainties, we calculate gradients using the approach similar to gradient-based class activation maps [9]. This provides “certainty maps” which helps in attending to certain regions of the attention maps. Doing this, we report an improvement in attention maps. This is illustrated in the Figure 1.
Our method combines techniques from both the explanation [9] and uncertainty [25]
estimation techniques to obtain improved results. We have provided an extensive evaluation. We show that the results obtained for uncertainty estimates show a strong correlation with misclassification, i.e., when the classifier is wrong, the model is usually uncertain. Further, the attention maps provide state of the art correlation with human-annotated attention maps. We also show that on various VQA datasets, our model provides results comparable to the state of the art while significantly improving the performance of baseline methods on which we incorporated our approach. Our method may be seen as a generic way to obtain Bayesian uncertainty estimates, visual explanation, and as a result, improved accuracy for Visual Question Answering (VQA) task.
Our contributions, therefore, lie in, a) unifying approaches for understanding deep learning methods using uncertainty estimate and explanation b) obtaining visual attention maps that correlate best with human attention regions and c) showing that the improved attention maps result in consistent improvement in results. This is particularly suited for vision and language-based tasks where we are interested in understanding visual grounding, i.e., for instance, if the answer for a question is ‘dog’ (Corresponding question: ‘Who is on the boat?’), it is important to understand whether the model is certain and whether it is focusing on the correct regions containing a dog. This important requirement is met by the proposed approach.
Data uncertainty in a multi-modal setting, Uncertainty in VQA task is two-fold. In the example below, Question, ”Which kind of animal is it?” when asked (irrespective of image), may not be concretely answered. Also, seeing the image alone, in the given setting, the animal (especially the one behind) could easily be mis-classified as a dog or some other animal. These kinds of data uncertainties are tapped & hence minimized best when we consider uncertainties of the fused input (image+question). In Figure 2, we show the resultant attention maps of baseline (not minimizing uncertainty) & when we tried to minimize only-image, only-question & the fused uncertainty respectively.
The task of Visual question answering [36, 2, 41, 19, 37] is well studied in the vision and language community, but it has been relatively less explored for providing explanation[42] for answer prediction. Recently, lot of works that focus on explanation models, one of that is image captioning for provide basic explanation for image [5, 12, 29, 46, 49, 23, 52, 11, 7, 21, 53]. [39] has proposed an exemplar-based explanation method for generating question based on the image. Similarly, [40] has suggested a discriminator based method to obtain an explanation for paraphrase generation in text. In VQA, [56][51] have proposed interesting methods for improving attention in the question. Work that explores image and question jointly and is based on hierarchical co-attention is [34]. [43, 54, 33, 38] have proposed attention-based methods for the explanation in VQA, which use question to attend over specific regions in an image. [15, 28, 27] have suggested exciting works that advocate multimodal pooling and obtain close to state of the art in VQA. [38] has proposed an exemplar-based explanation method to improve attention in VQA. We can do systematic comparison of image-based attention while correlating with human attention maps as shown by [8]. Also, we need to ensure that our approach works in other dataset distributions [30, 32] of images while taking less time [44] to train our explanation model.
Recently a lot of researchers have focused on estimating uncertainty in the predictions using deep learning. [6]
has first proposed a method to learn uncertainty in the weights of the neural network. Kendall
et.al. [24]has proposed method to measure model uncertainty for image segmentation task. They observed that softmax probability function approximates relative probability between the class labels, but does not provide information about the model’s uncertainty. The work by
[17, 14] estimates model uncertainty of the deep network (CNN, RNN) with the help of dropout [47]. [48]has estimated uncertainty for batch normalized deep networks.
[25, 26, 45] have mainly decomposed predictive uncertainty into two major types, namely aleatoric and epistemic uncertainty, which capture uncertainty about the predicted model and uncertainty present in the data itself. [35] suggested a method to measure predictive uncertainty with the help of model and data uncertainty. Recently, [31] proposed a certainty method to bring two data distributions close for the domain adaption task. Here, our objective is to analyze and minimize the uncertainty in attention mask to predict answer in VQA. In our approach, We are proposing a gradient-based certainty explanation mask which minimizes uncertainty in attention regions to improve the correct answer’s predicted probability in VQA. Our method also provides visual explanation based on uncertainty class activation maps, capturing and visualizing the uncertainties present in the attention maps in VQA.We consider two type of uncertainties to model uncertainty present in the network, one due to uncertainty present in the data (Aleatoric), and the other due to model (Epistemic uncertainty).
Given an input the model (
) predicts the logit output
which is then an input to uncertainty network () for obtaining the variance
as shown in Figure-3. To capture Aleatoric uncertainty [25], we learn the observational noise parameter for each input point . Then, Aleatoric uncertainty, is estimated by applying softplus function on the output logit variance. This is given by,(1) |
For calculating the aleotoric uncertainty loss, we perturb the logit value () with Gaussian noise of variance
(diagonal matrix with one element corrosponding to each logits value) before the softmax layer. The logits reparameterization trick
[26] and [16] combines and to give . We then obtain a loss with respect to ground truth. It is expressed as:(2) |
(3) |
where is the aleatoric uncertainty loss (AUL), T is the number of Monte Carlo simulations.
is a the class index of the logit vector
which is defined for all the classes.To obtain the model uncertainty, we measure epistemic uncertainty. However, estimating epistemic uncertainty[35]
is computationally expensive, and thus we measure the predictive uncertainty, having both aleatoric and epistemic uncertainties present in it. To estimate it, we sample weights in the Bayesian networks
and then perform Monte Carlo simulations over the model to obtain the predicted class probabilities . That is,where c is the answer class, and is the aleatoric variance of each logit in the th MC Simulation. The entropy of the sampled logit’s probabilities can be calculated as:
(4) |
The predictive uncertainty contains entropy and aleatoric variance when it’s expectation is taken across number of Monte Carlo simulations:
(5) |
where is the entropy of the probability , which depends on the spread of class probabilities while the variance (second term in the above equation) captures both the spread and the magnitude of logit outputs, . In Equation 2, we can replace with predictive uncertainty (mentioned above in Equation 5) to get the predictive uncertainty loss (PUL).
Task: We solve for VQA task. The key difference in our architecture as compared to the existing VQA models is the introduction of gradient-based certainty maps. A detailed figure of the model is given in the Figure 4. We keep other aspects of the VQA model unchanged. In a typical open-ended VQA task, we have a multi-class classification task. A combined (image and question) input embedding is fed to the model. Then, the output logits are fed to a softmax function, giving probabilities of the predictions in the multiple-choice answer space. That is, , where is a set of all possible answers, I is the image, Q is the corresponding question, and is representing the parameters of the network.
The three main parts of our method are Attention Representation, Uncertainties Estimation, and computing gradients of uncertainty losses. In the following sections, we explain them in detail.
We obtain an embedding, where u is width, v is height of the image and C represents the number of applied filters on the image
in the convolution neural network (CNN). The CNN is parameterized by a function
, where represents the weights. Similarly, for the query question , we obtain a question feature embedding using a LSTM network. This network is parameterized by a function , where represents the weights. Both and are fed to an attention network that combines the image and question embeddings using a weighted softmax function and produces a weighted output attention vector, as illustrated in Figure 4. People have tried with the various kinds of attention networks. In this paper, we tried with SAN [54] and MCB [15]. Finally, we obtain attention feature using attention extractor network . The attended feature is passed through a classifier and the model is trained using the cross-entropy loss. Many a times, model is not certain about the answer class to which the input belongs, which sometimes leads to decrease in accuracy. To tackle this, we have proposed a technique to reduce the class uncertainty by increasing the certainty of the attention mask. Additionally, we also incorporate a loss based on the uncertainty which is described next.The attention feature, obtained from the previous step is fed to the classifier . The output of the classifier is fed to , which produces class probabilities, . ’s output is also fed to a variance predictor network, , which outputs the logits’ variance, as mentioned in the Equation 1. For calculating the aleatoric uncertainty loss, we perturb the logit value () with Gaussian noise of variance before the softmax layer. The Gaussian likelihood for classification is given by , where represents model’s parameters, is the precision, is the attended fused input, and is the output logit producing network as shown in the Figure 4 . The above setting represents the perturbation of model output with the variance of the observed noise, . We make sure that is a positive or positive definite matrix (in case of Multivariate) by using the logit reparameterization trick [26] and [16]. Finally, we then obtain an aleatoric loss, with respect to ground truth as mentioned in the Equation 3. Our proposed model, which uses this loss as one of the components of its uncertainty loss, is called Aleatoric-GCA (A-GCA). Along with aleatoric loss , we combine and as mentioned in the Equation 10 and 11 respectively to get total uncertainty loss . The classifier is trained by jointly minimizing both the classification loss, and the uncertainty loss, . In Equation 2, we can replace with predictive uncertainty (mentioned above in Equation 5) to get the predictive uncertainty loss(PUL). Accordingly, the model which uses this loss as one of the constituents of its uncertainty loss is called Predictive-GCA (P-GCA). Next, we compute the gradients of standard classification loss and uncertainty loss with respect to attended image feature, . Besides training, we also use these gradients to obtain visualizations describing important regions responsible for answer prediction, as mentioned in the qualitative analysis section. (Section 5.6)
Uncertainty present in the attention maps often leads to uncertainty in the predictions and can be attributed to the noise in data and the uncertainty present in the model itself. We improve the certainty in these cases by adding the certainty gradients to the existing Standard Cross-Entropy (SCE) loss gradients for training the model during backpropagation.
Our objective is to improve the model’s attention in the regions where the classifier is more certain. The classifier will perform better by focusing more on certain attention regions, as those regions are more suited for the classification task. We can get an explanation for the classifier output as done in the existing Grad-CAM approach (). But that explanation does not take the model and data uncertainties into the account. We improve this explanation using the certainty gradients (). If we can minimize uncertainty in the VQA explanation, then uncertainties in the image and question features, and thus uncertainties in the attention regions would be subsequently reduced. It is the uncertain regions which are a primary source for errors in the prediction, as shown in Figure 1.
In our proposed method, we compute the gradient of the Standard Classification (cross entropy) loss with respect to attention feature i.e. and and also the gradient of the uncertainty loss i.e. . The obtained uncertainty gradients are passed through a gradient reversal layer, giving us the certainty gradients, i.e., .
(6) |
The positive sign of gradient indicates that the attention certainty is activated on these regions and vice-versa.
(7) |
We apply a ReLU activation function on the product of gradients of the attention map and the gradients of certainty as we are only interested in attention regions that have a positive influence on interested answer class, i.e. attention regions whose intensity should be increased in order to increase answer class probability
, whereas negative values are multiplied by (large negative number) as the negative attention regions are likely to belong to other categories in image. As expected, without this ReLU, localization maps sometimes highlights more than just the desired class and achieves lower localization performance. Then we normalize to get attention regions which are highly activated and giving more weight to certain regions.(8) |
Images with higher uncertainty are equivalent to having lower certainty, so the certain regions of these images should have lower attention values. We use residual gradient connection to obtain the final gradient, which is the sum of gradient mask of (with respect to attention feature) and the gradient certainty mask and is given by:
(9) |
Where is the gradient mask of when gradients are taken with respect to attention feature. More details are given in the Algorithm 1.
We estimate aleatoric uncertainty in logits space by perturbing each logit using the variance obtained from data. The uncertainty present in the logits value can be minimized using cross-entropy loss on Gaussian distorted logits, as shown in the Equation 3. The distorted logit is obtained using a Gaussian multivariate function, having positive diagonal variance. To stabilize the training process [16], we add an additional term to the uncertainty loss, calling it Variance Equalizer(VE) loss, .
(10) |
where is a constant. The uncertainty distorted loss (UDL) is the difference between the typical cross-entropy loss and the aleatoric/predictive loss estimated in the Equation 3. The scalar difference is passed to an activation function to enhance the difference in either direction and is given by :
(11) |
By putting this constraint, we ensure that the predictive uncertainty loss does not deviate much from the actual cross-entropy loss. The total uncertainty loss is the combination of Aleatoric (or prediction uncertainty loss), Uncertainty Distorted Loss, and Variance equalizer loss.
(12) |
The final cost function for the network combines the loss obtained through uncertainty (aleatoric or predictive) loss for the attention network with the cross-entropy.
The cost function used for obtaining the parameters of the attention network, of the classification network, of the prediction network and for uncertainty network is as follows:
where n is the number of examples, and is the hyper-parameter which is fine-tuned using validation set, is standard cross-entropy loss and is the uncertainty loss. We train the model with this cost function until it converges so that the parameters. deliver a saddle point function
(13) |
We evaluate the proposed GCA methods and have provided both quantitative analysis and qualitative analysis. The former includes: i) Ablation analysis of proposed models (Section- 5.2), ii) Analysis of uncertainty effect on answer predictions (Figure- 5 (a,b)), iii) Differences of Top-2 softmax scores for answers for some representative questions (Figure- 5 (c,d)) and iv) Comparison of attention map of our proposed uncertainty model against other variants using Rank correlation (RC) and Earth Mover Distance (EMD) [3] as shown in Table-3 for VQA-HAT [8] and in Table- 2 for VQA-X [20] . Finally, we compare PGCA with state of the art methods as mentioned in Section-5.4 . Qualitative analysis includes visualization of certainty activation maps for some representative images as we move from our basic model to the P-GCA model. (Section 5.6)
VQA-v1 [2]: We conduct our experiments on VQA benchmark VQA-v1 [2] dataset, which contains human-annotated questions and answers based on images on MS-COCO dataset. This dataset includes 2,04,721 images in total, out of which 82,783 images are for training, 40,504 images for validation, and 81,434 images for testing. Each image is associated with three questions, and each question has ten possible answers. There are 248349 Question-Answer pairs for training, 121512 pairs for validation, and 244302 pairs for testing.
VQA-v2[19]: We provide benchmark result on VQA-v2[19] dataset. This dataset removes bias present in VQA-v1 by adding a conjugate image pair. It contains 443,757 image-question pairs on the training set, 214,354 pairs on the validation set and 447,793 pairs on the test set, which is more than twice the first version. All the questions and answers pairs are annotated by human annotators. The benchmark results on VQA-v2 dataset is presented in table-5.
VQA-HAT [8]: To compare our attention map with human-annotated attention maps, we use VQA-HAT [8] dataset. This dataset is developed for image de-blurring for answering the visual question. It contains 58475 human-annotated attention maps out of 248349 training examples and includes three sets of 1374 human-annotated attention maps out of 121512 validation examples of question image pairs in the validation dataset. This dataset is developed for VQA-v1 only.
Models | All | Yes/No | Number | Others |
---|---|---|---|---|
Baseline | 63.8 | 82.2 | 37.3 | 54.2 |
VE | 64.1 | 82.3 | 37.2 | 54.3 |
UDL | 64.4 | 82.6 | 37.2 | 54.5 |
AUL | 64.7 | 82.9 | 37.4 | 54.6 |
PUL | 64.9 | 83.0 | 37.5 | 54.6 |
UDL+VE | 64.8 | 82.8 | 37.4 | 54.5 |
AUL+VE | 65.0 | 83.3 | 37.8 | 54.7 |
PUL+ VE | 65.3 | 83.3 | 37.9 | 54.9 |
AUL +UDL | 65.6 | 83.3 | 37.6 | 55.0 |
PUL + UDL | 65.9 | 83.7 | 37.8 | 55.2 |
A-GCA (ours) | 66.3 | 84.2 | 38.0 | 55.5 |
P-GCA (ours) | 66.5 | 84.7 | 38.4 | 55.9 |
Model | RC() | EMD() |
---|---|---|
Baseline | 0.3017 | 0.3825 |
Deconv ReLU | 0.3198 | 0.3801 |
Guided GradCAM | 0.3275 | 0.3781 |
Aleatoric mask | 0.3571 | 0.3763 |
Predictive mask | 0.3718 | 0.3714 |
(a) Classification error | (b) Misclassified | (c) CD-Others | (d) CD-Yes/No |
---|---|---|---|
![]() |
![]() |
![]() |
![]() |
Our proposed GCA model’s loss consists of undistorted and distorted loss. The undistorted loss is the Standard Cross-Entropy (SCE) loss. The distorted loss consists of uncertain loss (either aleatoric uncertainty loss (AUL), or predictive uncertainly loss (PUL)), Variance Equalizer (VE) loss and Uncertainty Distorted loss (UDL). In the first block of the Table- 1, we report the results when these losses are used individually. (Only SCE loss is there in the Baseline). We use a variant of the MCB [15] model as our baseline method. As seen, PUL, when used individually, outperforms the other 4. This could be attributed to PUL guiding the model to minimize both the data and the model uncertainty. The second block of the Table- 1 depicts the results when we tried while combining two different individual losses. The model variant, which is guided using the combination of PUL and UDL loss performs best among the five variants. Then finally, after combining (AUL+UDL+VE+SCE), denoting it as A-GCA model and combining (PUL+UDL+VE+SCE), indicating it as P-GCA, we report an improvement of around 2.5% and 2.7% accuracy score respectively.
Model | RC() | EMD() | CD() |
---|---|---|---|
SAN [8] | 0.2432 | 0.4013 | – |
CoAtt-W[34] | 0.246 | – | – |
CoAtt-P [34] | 0.256 | – | – |
CoAtt-Q[34] | 0.264 | – | – |
DVQA(K=1)[38] | 0.328 | – | – |
Baseline (MCB) | 0.2790 | 0.3931 | – |
VE (ours) | 0.2832 | 0.3931 | 0.1013 |
UDL (ours) | 0.2850 | 0.3914 | 0.1229 |
AUL (ours) | 0.2937 | 0.3867 | 0.1502 |
PUL(ours) | 0.3012 | 0.3805 | 0.1585 |
PUL + VE (ours) | 0.3139 | 0.3851 | 0.1631 |
PUL + UDL(ours) | 0.3243 | 0.3824 | 0.1630 |
A-GCA (ours) | 0.3311 | 0.3784 | 0.1683 |
P-GCA (ours) | 0.3341 | 0.3721 | 0.1710 |
Human [8] | 0.623 | – | – |
Further, we plotted Predictive uncertainty (Figure- 5(a,b)) of some randomly chosen samples against the Classification error (error=, where p is the probability of misclassification). As seen, when the samples are correct, they are also certain and have less Classification Error (CE). To visualize the direct effect of decreased uncertainty, we plotted (Figure- 5(c, d)). It can be seen that how similar classes like (glasses, sunglasses) and (black, gray), etc., thus leading to uncertainty, got separated more in the logit space in the proposed model.
We compare attention maps produced by our proposed GCA model, and it’s variants with the base model and reports them in Table-3. Rank correlation and EMD score are calculated for the produced attention map against human-annotated attention (HAT) maps [8]. In the table, as we approach the best-proposed GCA model, Rank correlation (RC) is increasing. EMD is also decreasing (Lower the better) as we move towards GCA. To verify our intuition, that we can learn better attention mask by minimizing the uncertainty present in the attention mask, we start with VE and observe that both rank correlation and answer accuracy increase by 0.42 and 0.3 % from baseline respectively. We also observe that with UDL, AUL, and PUL based loss minimization technique, both RC and EMD improves, as shown in the Table- 3. Aleatoric-GCA (A-GCA) improves 5.21% in terms of RC and 2.5% in terms of accuracy. Finally, the proposed Predictive-GCA (P-GCA), which is modeled to consider both data and the model uncertainty improves the RC by 5.51% and accuracy by 2.7% as shown in the Table- 3 and Table- 1. Since HAT maps are only available for VQA-v1 dataset, thus, this ablation analysis has been performed only for VQA-v1. We also providing SOTA results for VQA-v1 and VQA-v2 dataset as shown in Table- 4 and Table- 5 respectively. Also, we compare with our gradient certainty explanation with human explanation present in VQA-v2 dataset for the various model as mentioned in Table- 2. This human explanation mask only available for VQA-v2 dataset. We observe that our attention (P-GCA) mask performs better than others as well.
Models | All | Y/N | Num | Oth |
---|---|---|---|---|
DPPnet [37] | 57.2 | 80.7 | 37.2 | 41.7 |
SMem[[51]] | 58.0 | 80.9 | 37.3 | 43.1 |
SAN [54] | 58.7 | 79.3 | 36.6 | 46.1 |
DMN[50] | 60.3 | 80.5 | 36.8 | 48.3 |
QRU(2)[33] | 60.7 | 82.3 | 37.0 | 47.7 |
HieCoAtt [34] | 61.8 | 79.7 | 38.9 | 51.7 |
MCB [15] | 64.2 | 82.2 | 37.7 | 54.8 |
MLB [28] | 65.0 | 84.0 | 37.9 | 54.7 |
DVQA[38] | 65.4 | 83.8 | 38.1 | 55.2 |
P-GCA + SAN (ours) | 60.4 | 80.7 | 36.6 | 47.9 |
A-GCA + MCB (ours) | 66.3 | 84.2 | 38.0 | 55.5 |
P-GCA + MCB (ours) | 66.5 | 84.6 | 38.4 | 55.9 |
We obtain the initial comparison with the baselines on the rank correlation on human attention (HAT) dataset [8] that provides human attention while solving for VQA. Between humans, the rank correlation is 62.3%. The comparison of various state-of-the-art methods and baselines are provided in Table 3. We use a variant of MCB [15] model as our baseline method. We obtain an improvement of around 5.2% using A-GCA model and 5.51% using P-GCA model in terms of rank correlation with human attention. From this, we justify that our attention map is more similar to human attention map. We also compare with the baselines on the answer accuracy on VQA-v1[2] dataset, as shown in Table- 4. We obtain an improvement of around 2.7% over the comparable MCB baseline. Our MCB based model A-GCA and P-GCA improves by 0.9% and 1.1% accuracy as compared to state of the art model DVQA [38] on VQA-v1. However, using a saliency-based method [22] that is trained on eye-tracking data to obtain a measure of where people look in a task-independent manner, results in more correlation with human attention (0.49), as noted by [8]. However, this is explicitly trained using human attention and is not task-dependent. In our approach, we aim to obtain a method that can simulate human cognitive abilities for solving the tasks. We provide state of the art results for VQA-v2 in Table- 5. This table shows that using GCA method, the VQA result improves. We have provided more results for attention map visualization for both types of uncertainty, training setup, dataset, and evaluation methods here.111https://delta-lab-iitk.github.io/U-CAM/ .
Models | All | Y/N | Num | Oth |
---|---|---|---|---|
SAN-2[54] | 56.9 | 74.1 | 35.5 | 44.5 |
MCB [15] | 64.0 | 78.8 | 38.3 | 53.3 |
Bottom[[1]] | 65.3 | 81.8 | 44.2 | 56.0 |
DVQA[38] | 65.9 | 82.4 | 43.2 | 56.8 |
MLB [28] | 66.3 | 83.6 | 44.9 | 56.3 |
DA-NTN [4] | 67.5 | 84.3 | 47.1 | 57.9 |
Counter[55] | 68.0 | 83.1 | 51.6 | 58.9 |
BAN[27] | 69.5 | 85.3 | 50.9 | 60.2 |
P-GCA + SAN (ours) | 59.2 | 75.7 | 36.6 | 46.8 |
P-GCA + MCB (ours) | 65.7 | 79.6 | 40.1 | 54.7 |
P-GCA + Counter (ours) | 69.2 | 85.4 | 50.1 | 59.4 |
We trained the P-GCA model using classification loss and uncertainty loss in an end-to-end manner. We have used ADAM optimizer to update the classification model parameter and configured hyper-parameter values using validation dataset as follows: {learning rate = 0.0001, batch size = 200, beta = 0.95, alpha = 0.99 and epsilon = 1e-8} to train the classification model. We have used SGD optimizer to update the uncertainty model parameter and configured hyper-parameter values using validation dataset as follows: {learning rate = 0.004, batch size = 200, and epsilon = 1e-8} to train the uncertainty model.
We provide attention map visualization of all models for 5 example images, as shown in Figure- 6. The first raw, the baseline model misclassifies the answer due to high uncertainty value, that gets resolved by our methods(P-GCA). We can see how attention is improved as we go from our baseline model (MCB) to the proposed Gradient Certainty model (P-GCA). For example, in the first row, MCB is unable to focus on any specific portion of the image, but as we go towards the right, it focuses the cup bottom, (indicated by intense orange color in the map). Same can be seen for other images also. We have visualized Grad-CAM maps to support our hypothesis that Grad-CAM is a very good way for visualizing what the network learns as it can focus on right portions of the image even in the baseline model (MCB), and therefore, can be used as a tutor to improve attention maps. For example, in MCB it tries to focus on the right portions but with the focus to other points as well. However, in our proposed model, visualization improves as the models focuses only on the required portion.
In this paper, we provide a method that uses gradient-based certainty attention regions to obtain improved visual question answering. The proposed method yields improved uncertainty estimates that are correspondingly more certain or uncertain, show consistent correlation with misclassification and are focused quantitatively on better attention regions as compared to other states of the art methods. The proposed architecture can be easily incorporated in various existing VQA methods as we show by incorporating the method in SAN [54] and MCB [15] models. The proposed technique could be used as a general means for obtaining improved uncertainty and explanation regions for various vision and language tasks, and in future, we aim to evaluate this further for other tasks such as ‘Visual Dialog’ and image captioning tasks.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
, pages 6077–6086, 2018.Deep attention neural tensor network for visual question answering.
In Proceedings of the European Conference on Computer Vision (ECCV), pages 20–35, 2018.International Conference on Machine Learning
, pages 1613–1622, 2015.Conference on Empirical Methods in Natural Language Processing (EMNLP)
, 2016.Learning cooperative visual dialog agents with deep reinforcement learning.
In IEEE International Conference on Computer Vision (ICCV), 2017.A theoretically grounded application of dropout in recurrent neural networks.
In Advances in neural information processing systems, pages 1019–1027, 2016.Densecap: Fully convolutional localization networks for dense captioning.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4565–4574, 2016.We acknowledge the help provided by our Delta Lab members and our family who have supported us in our research activity.
Question | ID |
---|---|
What does the person in this picture have on his face? | 1 |
How many baby elephants are there? | 2 |
What is in the bowl? | 3 |
Is the television on or off? | 4 |
What color is the walk light? | 5 |
Which way is its head turned? | 6 |
How many people are riding on each bike? | 7 |
What animal is in this picture? | 8 |
What color is the road? | 9 |
What color is the boy’s hair? | 10 |
Question | ID |
---|---|
Is this wheat bread? | 1 |
Is the cat looking at the camera? | 2 |
Is this chair broken? | 3 |
Are these animals monitored? | 4 |
Does the cat recognize someone? | 5 |
Is the figurine life size? | 6 |
Is the smaller dog on a leash? | 7 |
Is this in the mountains? | 8 |
Is the woman sitting on the bench? | 9 |
Is the church empty? | 10 |