Bridging Text and Video: A Universal Multimodal Transformer for Video-Audio Scene-Aware Dialog

02/01/2020
by   Zekang Li, et al.

Audio-Visual Scene-Aware Dialog (AVSD) is the task of generating responses in a conversation about a given video; it was organized as a track of the 8th Dialog System Technology Challenge (DSTC8). To solve the task, we propose a universal multimodal transformer and introduce a multi-task learning method to learn joint representations across modalities and to generate informative, fluent responses. Our method extends a pre-trained natural language generation model to the multimodal dialogue generation task. Our system achieves the best performance in both the objective and subjective evaluations of the challenge.


research
10/21/2020

TMT: A Transformer-based Modal Translator for Improving Multimodal Sequence Representations in Audio Visual Scene-aware Dialog

Audio Visual Scene-aware Dialog (AVSD) is a task to generate responses w...
research
02/25/2020

Multimodal Transformer with Pointer Network for the DSTC8 AVSD Challenge

Audio-Visual Scene-Aware Dialog (AVSD) is an extension from Video Questi...
research
02/21/2022

Audio Visual Scene-Aware Dialog Generation with Transformer-based Video Representations

There have been many attempts to build multimodal dialog systems that ca...
research
12/17/2018

From FiLM to Video: Multi-turn Question Answering with Multi-modal Context

Understanding audio-visual content and the ability to have an informativ...
research
01/17/2020

Multi-step Joint-Modality Attention Network for Scene-Aware Dialogue System

Understanding dynamic scenes and dialogue contexts in order to converse ...
research
10/13/2021

Audio-Visual Scene-Aware Dialog and Reasoning using Audio-Visual Transformers with Joint Student-Teacher Learning

In previous work, we have proposed the Audio-Visual Scene-Aware Dialog (...
research
10/26/2022

End-to-End Multimodal Representation Learning for Video Dialog

Video-based dialog task is a challenging multimodal learning task that h...
