Video Dialog as Conversation about Objects Living in Space-Time

07/08/2022
by Hoang-Anh Pham, et al.

It would be a technological feat to create a system that can hold a meaningful conversation with humans about what they watch. A setup toward that goal is presented as a video dialog task, where the system is asked to generate natural utterances in response to a question in an ongoing dialog. The task poses great visual, linguistic, and reasoning challenges that cannot be easily overcome without an appropriate representation scheme over video and dialog that supports high-level reasoning. To tackle these challenges, we present a new object-centric framework for video dialog that supports neural reasoning, dubbed COST, which stands for Conversation about Objects in Space-Time. Here, dynamic space-time visual content in videos is first parsed into object trajectories. Given this video abstraction, COST maintains and tracks object-associated dialog states, which are updated upon receiving new questions. Object interactions are dynamically and conditionally inferred for each question, and these serve as the basis for relational reasoning among the objects. COST also maintains a history of previous answers, which allows retrieval of relevant object-centric information to enrich the answer-forming process. Language production then proceeds in a step-wise manner, taking into account the context of the current utterance, the existing dialog, and the current question. We evaluate COST on the DSTC7 and DSTC8 benchmarks, demonstrating its competitiveness against state-of-the-art methods.
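
The per-question loop described in the abstract (update object-associated states, infer question-conditioned object interactions, then reason relationally) can be illustrated with a toy sketch. This is not the COST implementation: it assumes each object trajectory has already been summarised as a fixed-length vector, replaces the learned networks with simple element-wise products and a fixed scalar `gate`, and every function name here is hypothetical.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def update_object_states(states, object_feats, question, gate=0.5):
    # Blend each object's previous dialog state with a question-modulated
    # view of its feature vector (assumption: a fixed gate, not a learned one).
    new_states = []
    for prev, feat in zip(states, object_feats):
        cond = [f * q for f, q in zip(feat, question)]
        new_states.append([(1 - gate) * p + gate * c for p, c in zip(prev, cond)])
    return new_states

def question_conditioned_relations(states, question):
    # Row-normalised object-to-object weights that depend on the question.
    weights = []
    for s_i in states:
        query = [a * q for a, q in zip(s_i, question)]
        weights.append(softmax([dot(query, s_j) for s_j in states]))
    return weights

def relational_step(states, weights):
    # One round of message passing: each object aggregates all object states.
    dim = len(states[0])
    return [[sum(w * s[d] for w, s in zip(row, states)) for d in range(dim)]
            for row in weights]

# Toy dialog: three objects with 2-d features, two question vectors in turn.
object_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
states = [[0.0, 0.0] for _ in object_feats]
for question in ([1.0, 0.0], [0.5, 0.5]):
    states = update_object_states(states, object_feats, question)
    weights = question_conditioned_relations(states, question)
    context = relational_step(states, weights)
```

The object states persist across questions while the relation weights are recomputed for each one, which mirrors the distinction the abstract draws between tracked dialog states and interactions inferred per question. Answer-history retrieval and step-wise language production are left out of the sketch.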

