Video-Grounded Dialogues with Pretrained Generation Language Models

06/27/2020
by   Hung Le, et al.
0

Pre-trained language models have shown remarkable success in improving various downstream NLP tasks due to their ability to capture dependencies in textual data and generate natural responses. In this paper, we leverage the power of pre-trained language models for improving video-grounded dialogue, which is very challenging and involves complex features of different dynamics: (1) Video features which can extend across both spatial and temporal dimensions; and (2) Dialogue features which involve semantic dependencies over multiple dialogue turns. We propose a framework by extending GPT-2 models to tackle these challenges by formulating video-grounded dialogue tasks as a sequence-to-sequence task, combining both visual and textual representation into a structured sequence, and fine-tuning a large pre-trained GPT-2 network. Our framework allows fine-tuning language models to capture dependencies across multiple modalities over different levels of information: spatio-temporal level in video and token-sentence level in dialogue context. We achieve promising improvement on the Audio-Visual Scene-Aware Dialogues (AVSD) benchmark from DSTC7, which supports a potential direction in this line of research.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
11/19/2020

Are Pre-trained Language Models Knowledgeable to Ground Open Domain Dialogues?

We study knowledge-grounded dialogue generation with pre-trained languag...
research
10/22/2022

Collaborative Reasoning on Multi-Modal Semantic Graphs for Video-Grounded Dialogue Generation

We study video-grounded dialogue generation, where a response is generat...
research
10/20/2020

BiST: Bi-directional Spatio-Temporal Reasoning for Video-Grounded Dialogues

Video-grounded dialogues are very challenging due to (i) the complexity ...
research
10/06/2020

StyleDGPT: Stylized Response Generation with Pre-trained Language Models

Generating responses following a desired style has great potentials to e...
research
03/24/2021

Structured Co-reference Graph Attention for Video-grounded Dialogue

A video-grounded dialogue system referred to as the Structured Co-refere...
research
05/22/2023

VideoLLM: Modeling Video Sequence with Large Language Models

With the exponential growth of video data, there is an urgent need for a...
research
09/26/2022

Towards Parameter-Efficient Integration of Pre-Trained Language Models In Temporal Video Grounding

This paper explores the task of Temporal Video Grounding (TVG) where, gi...

Please sign up or login with your details

Forgot password? Click here to reset