Factor Graph Attention

by Idan Schwartz, et al.

Dialog is an effective way to exchange information, but subtle details and nuances are extremely important. While significant progress has paved a path to address visual dialog with algorithms, details and nuances remain a challenge. Attention mechanisms have demonstrated compelling results to extract details in visual question answering and also provide a convincing framework for visual dialog due to their interpretability and effectiveness. However, the many data utilities that accompany visual dialog challenge existing attention techniques. We address this issue and develop a general attention mechanism for visual dialog which operates on any number of data utilities. To this end, we design a factor graph based attention mechanism which combines any number of utility representations. We illustrate the applicability of the proposed approach on the challenging and recently introduced VisDial datasets, outperforming recent state-of-the-art methods by 1.1 MRR. Our ensemble model improved the MRR score on VisDial1.0 by more than 6








Qualitative Evaluation

Factors visualization: We provide additional visualizations (fig:onecol) of the scores that each type of factor assigns to every image region. The ‘Image-Local-Information,’ ‘Image-Caption’ and ‘Image-Local-Interaction’ factors are constant across questions, while ‘Image-Question,’ ‘Image-Answer,’ ‘Image-History-Q’ and ‘Image-History-A’ change with every question. We calculated the variance of the interactions and observe that ‘Image-Question’ has the highest variance, while ‘Image-Answer,’ ‘Image-History-Q’ and ‘Image-History-A’ vary less. Beyond the importance score, this high variance also suggests that the ‘Image-Question’ cue is the most important.

Attention over dialogs: In fig:res, we present a randomly picked set of 50 images along with their corresponding dialogs. The figures were generated by an automatic script. We highlight that image attention is aware of the scene in the context of the question and is able to attend to the correct foreground or background regions. Question attention attends to informative words, and answer attention frequently correlates with the predicted answer. History attention emphasizes nuances.
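The variance comparison described under "Factors visualization" can be illustrated with a minimal sketch. It assumes each factor yields one score per image region for every question, stored as an array of shape (num_questions, num_regions). The text does not specify how the variance is aggregated, so the reduction used here (variance across questions, averaged over regions) is an assumption. The helper names `factor_variance` and `rank_factors_by_variance` are hypothetical, and the scores are synthetic placeholders rather than model outputs.

```python
import numpy as np


def factor_variance(scores):
    """Variance across questions, averaged over image regions.

    scores: array-like of shape (num_questions, num_regions).
    This aggregation is an assumption, not taken from the source.
    """
    scores = np.asarray(scores, dtype=float)
    return float(scores.var(axis=0).mean())


def rank_factors_by_variance(factor_scores):
    """Return (factor name, variance) pairs sorted from highest to lowest."""
    variances = {name: factor_variance(s) for name, s in factor_scores.items()}
    return sorted(variances.items(), key=lambda kv: kv[1], reverse=True)


# Synthetic placeholder scores; the sizes below are arbitrary choices.
rng = np.random.default_rng(0)
num_questions, num_regions = 10, 49

constant_scores = rng.random(num_regions)
factor_scores = {
    # A question-independent factor: the same scores repeated for every question.
    "Image-Caption": np.tile(constant_scores, (num_questions, 1)),
    # Question-dependent factors with different (made-up) spreads.
    "Image-Question": rng.normal(0.0, 1.0, (num_questions, num_regions)),
    "Image-Answer": rng.normal(0.0, 0.1, (num_questions, num_regions)),
}

ranking = rank_factors_by_variance(factor_scores)
for name, variance in ranking:
    print(f"{name}: {variance:.4f}")
```

A factor that is constant across questions has near-zero variance under this measure, so only question-dependent factors can rank highly, which matches the distinction drawn in the text.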


Figure: 50 dialogs along with question, answer and history attention. The predicted answer (A) and ground-truth answer (GT) are also provided.