Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline

12/05/2019
by Vishvak Murahari, et al.

Prior work in visual dialog has focused on training deep neural models on the VisDial dataset in isolation, which has led to great progress but is limiting and wasteful. In this work, following recent trends in representation learning for language, we introduce an approach to leverage pretraining on related large-scale vision-language datasets before transferring to visual dialog. Specifically, we adapt the recently proposed ViLBERT (Lu et al., 2019) model for multi-turn visually-grounded conversation sequences. Our model is pretrained on the Conceptual Captions and Visual Question Answering datasets, and finetuned on VisDial with a VisDial-specific input representation and the masked language modeling and next sentence prediction objectives (as in BERT). Our best single model achieves state-of-the-art on Visual Dialog, outperforming prior published work (including model ensembles) by more than 1% absolute on NDCG and MRR. Next, we carefully analyse our model and find that additional finetuning using 'dense' annotations, i.e. relevance scores for all 100 answer options corresponding to each question on a subset of the training set, leads to even higher NDCG (more than 10% over our base model) but hurts MRR (more than 17% below our base model), highlighting a trade-off between the two primary metrics for this task, NDCG and MRR. We find that this is because dense annotations in the dataset do not correlate well with the original ground-truth answers to questions, often rewarding the model for generic responses (e.g. "can't tell").
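To make the tension between the two metrics concrete, the following is a minimal sketch of how MRR and NDCG might be computed for a single question. It is not the authors' evaluation code. The function names, the four-option candidate list, and all scores and relevance values are hypothetical, and the choice to truncate NDCG at the number of options with non-zero relevance is an assumption about the VisDial convention.

```python
import math


def reciprocal_rank(scores, gt_index):
    """Reciprocal of the 1-based rank the model gives the ground-truth answer."""
    gt_score = scores[gt_index]
    # Rank = 1 + number of options scored strictly higher than the ground truth.
    rank = 1 + sum(1 for s in scores if s > gt_score)
    return 1.0 / rank


def ndcg(scores, relevance):
    """NDCG over the top-k ranked options.

    Assumption: k is the number of options with non-zero dense relevance.
    """
    k = sum(1 for r in relevance if r > 0)
    if k == 0:
        return 0.0
    # Option indices sorted by model score, best first.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    dcg = sum(relevance[i] / math.log2(pos + 2) for pos, i in enumerate(order[:k]))
    ideal = sorted(relevance, reverse=True)[:k]
    idcg = sum(r / math.log2(pos + 2) for pos, r in enumerate(ideal))
    return dcg / idcg


# Hypothetical toy question with 4 answer options instead of 100.
# Option 0 is the original ground-truth answer; option 1 stands in for a
# generic reply such as "can't tell" that annotators rated as highly relevant.
relevance = [0.2, 1.0, 0.0, 0.8]
gt_index = 0

specific_model = [0.9, 0.5, 0.1, 0.3]  # ranks the ground-truth answer first
generic_model = [0.3, 0.9, 0.1, 0.5]   # ranks the generic reply first

for name, scores in [("specific", specific_model), ("generic", generic_model)]:
    rr = reciprocal_rank(scores, gt_index)
    print(f"{name}: RR={rr:.3f} NDCG={ndcg(scores, relevance):.3f}")
```

In this toy setup the model that ranks the ground-truth answer first gets the better reciprocal rank, while the model that favours the generic reply gets the better NDCG. That is the kind of disagreement the abstract attributes to dense annotations not correlating well with the original ground-truth answers.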


