Describing Unseen Videos via Multi-Modal Cooperative Dialog Agents

08/18/2020
by Ye Zhu, et al.

With rising concerns about AI systems being granted direct access to abundant sensitive information, researchers seek to develop more reliable AI that relies on implicit information sources. To this end, in this paper we introduce a new task, video description via two multi-modal cooperative dialog agents, whose ultimate goal is for one conversational agent to describe an unseen video based on a dialog and two static frames. Specifically, one of the intelligent agents, Q-BOT, is given two static frames from the beginning and the end of the video, as well as a finite number of opportunities to ask relevant natural language questions before describing the unseen video. A-BOT, the other agent, has already seen the entire video and assists Q-BOT in accomplishing the goal by answering those questions. We propose a QA-Cooperative Network with a dynamic dialog history update learning mechanism to transfer knowledge from A-BOT to Q-BOT, thus helping Q-BOT to better describe the video. Extensive experiments demonstrate that with the proposed model and cooperative learning method, Q-BOT can effectively learn to describe an unseen video, achieving promising performance comparable to the setting where Q-BOT is given the full ground-truth dialog history.
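To make the cooperative protocol concrete, below is a minimal Python sketch of the question-answer loop the abstract describes: Q-BOT sees only the first and last frames and a finite question budget, A-BOT sees the full video, and the dialog history is updated each round. The QBot/ABot interfaces (observe, ask, answer, describe) and the num_rounds budget are illustrative assumptions; the paper's actual QA-Cooperative Network and its learned components are not reproduced here.

    # Sketch of the two-agent cooperative dialog protocol (assumed interfaces).

    class QBot:
        """Questioner: sees only the first and last frames of the video."""

        def __init__(self):
            self.dialog_history = []  # dynamically updated (question, answer) pairs

        def observe(self, first_frame, last_frame):
            self.first_frame, self.last_frame = first_frame, last_frame

        def ask(self):
            # In the paper, questions are generated conditioned on the two
            # frames and the dialog history; a placeholder stands in here.
            return f"Question {len(self.dialog_history) + 1} about the video?"

        def update(self, question, answer):
            # Dynamic dialog history update: knowledge transferred from A-BOT.
            self.dialog_history.append((question, answer))

        def describe(self):
            # Ultimate goal: describe the unseen video from frames + dialog.
            return "A description grounded in the two frames and the dialog."


    class ABot:
        """Answerer: has access to the entire video."""

        def observe(self, video_frames):
            self.video_frames = video_frames

        def answer(self, question):
            return f"Answer grounded in all {len(self.video_frames)} frames."


    def cooperative_dialog(video_frames, num_rounds=10):
        q_bot, a_bot = QBot(), ABot()
        q_bot.observe(video_frames[0], video_frames[-1])
        a_bot.observe(video_frames)
        for _ in range(num_rounds):  # finite question budget
            question = q_bot.ask()
            answer = a_bot.answer(question)
            q_bot.update(question, answer)
        return q_bot.describe()


    if __name__ == "__main__":
        print(cooperative_dialog(["frame_%d" % i for i in range(100)]))

In the paper, training replaces these placeholders with learned question, answer, and description modules, and the dialog history update mechanism is what transfers A-BOT's knowledge of the full video to Q-BOT.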

