SeqDialN: Sequential Visual Dialog Networks in Joint Visual-Linguistic Representation Space

08/02/2020
by Liu Yang, et al.

In this work, we formulate visual dialog as an information flow in which each piece of information is encoded with the joint visual-linguistic representation of a single dialog round. Based on this formulation, we treat the visual dialog task as a sequence problem over ordered visual-linguistic vectors. For featurization, we use a Dense Symmetric Co-Attention network as a lightweight vision-language joint representation generator to fuse multimodal features (i.e., image and text), yielding better computation and data efficiency. For inference, we propose two Sequential Dialog Networks (SeqDialN): the first uses an LSTM for information propagation (IP) and the second uses a modified Transformer for multi-step reasoning (MR). Our architecture separates the complexity of multimodal feature fusion from that of inference, which allows a simpler design of the inference engine. The IP-based SeqDialN is our baseline, a simple 2-layer LSTM design that achieves decent performance. The MR-based SeqDialN, on the other hand, recurrently refines the semantic question/history representations through the Transformer's self-attention stack and produces promising results on the visual dialog task. On the VisDial v1.0 test-std dataset, our best single generative SeqDialN achieves 62.54 NDCG and 49.98 MRR. We fine-tune the discriminative SeqDialN with dense annotations and boost its performance up to 72.41 NDCG. We report extensive experiments conducted to demonstrate the effectiveness of our model components, provide visualizations of the reasoning process over relevant conversation rounds, and discuss our fine-tuning methods. Our code is available at https://github.com/xiaoxiaoheimei/SeqDialN
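The information-flow formulation described above can be sketched in miniature. This is a hypothetical, heavily simplified illustration, not the paper's implementation: `fuse` stands in for the Dense Symmetric Co-Attention fusion, and `propagate` replaces the 2-layer LSTM of the IP-based SeqDialN with a single tanh recurrence. The point is only the structure: each dialog round yields one joint visual-linguistic vector, and inference runs sequentially over that ordered list.

```python
import math

def fuse(image_feat, text_feat):
    # Stand-in for the Dense Symmetric Co-Attention fusion (hypothetical):
    # combines image and text features of one dialog round into a single
    # joint visual-linguistic vector.
    return [math.tanh(i * t) for i, t in zip(image_feat, text_feat)]

def propagate(rounds, dim=4):
    # Sequential information propagation (IP): the hidden state is updated
    # once per dialog round, so each state summarizes the dialog so far.
    # A real model would use a 2-layer LSTM here.
    h = [0.0] * dim
    states = []
    for v in rounds:
        h = [math.tanh(0.5 * hv + 0.5 * vv) for hv, vv in zip(h, v)]
        states.append(h)
    return states  # one state per round, usable for answer scoring

# Toy inputs: one image feature vector, two question rounds.
image = [0.2, -0.1, 0.7, 0.3]
questions = [[0.5, 0.1, -0.2, 0.9], [0.3, 0.8, 0.4, -0.5]]
rounds = [fuse(image, q) for q in questions]
states = propagate(rounds)
```

Because fusion happens before the sequence model ever runs, the recurrence only has to reason over one vector per round, which is the separation of fusion complexity from inference complexity that the paper argues for.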


