Unified Multimodal Model with Unlikelihood Training for Visual Dialog

11/23/2022
by Zihao Wang, et al.

The task of visual dialog requires a multimodal chatbot to answer sequential questions from humans about image content. Prior work performs standard likelihood training for answer generation on positive instances (those involving correct answers). However, the likelihood objective often leads to high-frequency, dull outputs and fails to exploit the useful knowledge in negative instances (those involving incorrect answers). In this paper, we propose a Unified Multimodal Model with UnLikelihood Training, named UniMM-UL, to tackle this problem. First, to improve visual dialog understanding and generation through multi-task learning, our model extends ViLBERT from supporting only answer discrimination to supporting both answer discrimination and answer generation seamlessly through different attention masks. Specifically, to make the original discriminative model compatible with answer generation, we design novel generative attention masks that implement the autoregressive Masked Language Modeling (autoregressive MLM) task. To attenuate the adverse effects of the likelihood objective, we apply unlikelihood training to negative instances so that the model becomes less likely to generate incorrect answers. Then, to utilize dense annotations, we adopt different fine-tuning methods for both generating and discriminating answers, rather than only for discriminating answers as in prior work. Finally, on the VisDial dataset, our model achieves the best generative results (69.23 NDCG score). It also yields discriminative results comparable to the state of the art in both single-model and ensemble settings (75.92 and 76.17 NDCG scores).
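The contrast between likelihood and unlikelihood training can be illustrated with a minimal sketch. The abstract does not give the exact objective used by UniMM-UL, so the code below follows the general token-level unlikelihood formulation, in which a token of an incorrect answer with model probability p is penalized by -log(1 - p). The function names, the `alpha` weight, and the toy probabilities are assumptions made for illustration, not the authors' implementation.

```python
import math

def likelihood_loss(token_probs):
    """Negative log-likelihood of the tokens of a correct (positive) answer.

    token_probs: model probabilities assigned to each gold token.
    """
    return -sum(math.log(p) for p in token_probs)

def unlikelihood_loss(token_probs, eps=1e-8):
    """Unlikelihood penalty on the tokens of an incorrect (negative) answer.

    Each term is -log(1 - p): it grows as the model puts more probability
    on a token it should not generate. eps guards against log(0).
    """
    return -sum(math.log(max(1.0 - p, eps)) for p in token_probs)

def combined_loss(pos_probs, neg_probs, alpha=1.0):
    """Likelihood on the positive answer plus weighted unlikelihood on the negative.

    alpha is a hypothetical mixing weight, not a value taken from the paper.
    """
    return likelihood_loss(pos_probs) + alpha * unlikelihood_loss(neg_probs)

# Toy example: probabilities the model assigns to the tokens of a correct
# answer and of an incorrect answer for the same question.
pos = [0.7, 0.6, 0.9]
neg = [0.4, 0.1]
loss = combined_loss(pos, neg, alpha=1.0)
```

Under this formulation the loss falls either when the model raises the probability of correct-answer tokens or when it lowers the probability of incorrect-answer tokens, which is how negative instances contribute a training signal that pure likelihood training ignores.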

research
04/28/2020

VD-BERT: A Unified Vision and Dialog Transformer with BERT

Visual dialog is a challenging vision-language task, where a dialog agen...
research
02/26/2019

Image-Question-Answer Synergistic Network for Visual Dialog

The image, question (combined with the history for de-referencing), and ...
research
12/05/2019

Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline

Prior work in visual dialog has focused on training deep neural models o...
research
02/21/2022

Audio Visual Scene-Aware Dialog Generation with Transformer-based Video Representations

There have been many attempts to build multimodal dialog systems that ca...
research
04/15/2021

Ensemble of MRR and NDCG models for Visual Dialog

Assessing an AI agent that can converse in human language and understand...
research
04/14/2020

DialGraph: Sparse Graph Learning Networks for Visual Dialog

Visual dialog is a task of answering a sequence of questions grounded in...
research
08/02/2020

SeqDialN: Sequential Visual Dialog Networks in Joint Visual-Linguistic Representation Space

In this work, we formulate a visual dialog as an information flow in whi...
