Factors visualization: We provide additional visualizations in fig:onecol, showing the scores that each type of factor assigns to each image region. ‘Image-Local-Information,’ ‘Image-Caption’ and ‘Image-Local-Interaction’ are constant across questions, while ‘Image-Question,’ ‘Image-Answer,’ ‘Image-History-Q’ and ‘Image-History-A’ change with every question. We compute the variance of the interactions and observe that ‘Image-Question’ has the highest variance (), while ‘Image-Answer,’ ‘Image-History-Q’ and ‘Image-History-A’ have a variance of . Beyond the importance score, this high variance also suggests that the ‘Image-Question’ cue is the most important.

Attention over dialogs: In fig:res, we present a randomly picked set of 50 images along with their corresponding dialogs. The figures are generated by an automatic script. We highlight that image attention is aware of the scene in the context of the question and is able to attend to the correct foreground or background regions. Question attention attends to informative words, and answer attention frequently correlates with the predicted answer. History attention emphasizes nuances.
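As a rough illustration of the variance comparison described under Factors visualization, the sketch below ranks factors by how much their image-region scores change from question to question. The text does not specify how the variance is computed, so everything here is an assumption: the `(num_questions, num_regions)` array layout, the choice to take the variance across questions per region and then average over regions, and the names `factor_variance` and `rank_factors`. The demo data is synthetic and does not reproduce any reported number.

```python
import numpy as np


def factor_variance(scores):
    """Mean over regions of the variance across questions.

    scores: array-like of shape (num_questions, num_regions), one row of
    image-region scores per question (assumed layout, not from the paper).
    """
    scores = np.asarray(scores, dtype=float)
    # Population variance (ddof=0) along the question axis, then averaged.
    return float(scores.var(axis=0).mean())


def rank_factors(factor_scores):
    """Return (name, variance) pairs sorted by decreasing variance."""
    variances = {name: factor_variance(s) for name, s in factor_scores.items()}
    return sorted(variances.items(), key=lambda kv: kv[1], reverse=True)


# Synthetic demo: one factor that is identical for every question and one
# that changes with every question.
rng = np.random.default_rng(0)
num_questions, num_regions = 10, 49
constant = np.tile(rng.random(num_regions), (num_questions, 1))
varying = rng.random((num_questions, num_regions))

demo = {"Image-Caption": constant, "Image-Question": varying}
ranking = rank_factors(demo)
for name, var in ranking:
    print(f"{name}: {var:.4f}")
```

Under these assumptions, a factor that does not depend on the question has a variance of approximately zero, so it always ranks below any question-dependent factor.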