
VGNMN: Video-grounded Neural Module Network for Video-Grounded Language Tasks

04/16/2021
by Hung Le, et al.

Neural module networks (NMN) have achieved success in image-grounded tasks such as Visual Question Answering (VQA) on synthetic images. However, NMN has received very limited study in video-grounded language tasks. These tasks extend traditional visual tasks with the added complexity of temporal variance in video. Motivated by recent NMN approaches on image-grounded tasks, we introduce Video-grounded Neural Module Network (VGNMN) to model the information retrieval process in video-grounded language tasks as a pipeline of neural modules. VGNMN first decomposes all language components of the question to explicitly resolve entity references and detect the corresponding action-based inputs. The detected entities and actions are then used as parameters to instantiate neural module networks and extract visual cues from the video. Our experiments show that VGNMN achieves promising performance on two video-grounded language tasks: video QA and video-grounded dialogues.
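The abstract describes a two-stage pipeline: decompose the question into entities and actions, then instantiate neural modules parameterized by them to pull visual cues from the video. The paper's actual module design is not reproduced here, so the following is only a minimal PyTorch sketch of that idea; the module names (AttendEntity, AttendAction), tensor shapes, and composition scheme are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of a VGNMN-style module pipeline (assumed design, not the paper's code).

class AttendEntity(nn.Module):
    """Hypothetical module: attends over video features conditioned on an entity embedding."""
    def __init__(self, vid_dim, emb_dim):
        super().__init__()
        self.score = nn.Linear(vid_dim + emb_dim, 1)

    def forward(self, video_feats, entity_emb):
        # video_feats: (T, vid_dim); entity_emb: (emb_dim,)
        T = video_feats.size(0)
        joint = torch.cat([video_feats, entity_emb.expand(T, -1)], dim=-1)
        weights = torch.softmax(self.score(joint).squeeze(-1), dim=0)   # (T,)
        return (weights.unsqueeze(-1) * video_feats).sum(dim=0)         # (vid_dim,)

class AttendAction(nn.Module):
    """Hypothetical module: refines a grounded entity vector with an action embedding."""
    def __init__(self, vid_dim, emb_dim):
        super().__init__()
        self.fuse = nn.Linear(vid_dim + emb_dim, vid_dim)

    def forward(self, grounded_entity, action_emb):
        return torch.relu(self.fuse(torch.cat([grounded_entity, action_emb], dim=-1)))

def vgnmn_style_forward(video_feats, entity_embs, action_embs, attend_entity, attend_action):
    """Compose modules per detected (entity, action) pair and pool the resulting visual cues."""
    cues = []
    for ent, act in zip(entity_embs, action_embs):
        grounded = attend_entity(video_feats, ent)   # locate the entity in the video
        cues.append(attend_action(grounded, act))    # condition on the detected action
    return torch.stack(cues).mean(dim=0)             # pooled cue for a downstream answer decoder

# Example with random features (T=20 video segments, vid_dim=512, emb_dim=300):
video = torch.randn(20, 512)
ae, aa = AttendEntity(512, 300), AttendAction(512, 300)
cue = vgnmn_style_forward(video, [torch.randn(300)], [torch.randn(300)], ae, aa)
```

The sketch only illustrates the parameterization idea: each detected entity or action instantiates a module that operates on the video features, and the modules are composed per question rather than fixed in advance.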

