VGNMN: Video-grounded Neural Module Network to Video-Grounded Language Tasks

by Hung Le, et al.

Neural module networks (NMNs) have achieved success in image-grounded tasks such as Visual Question Answering (VQA) on synthetic images. However, NMNs have received very little study in video-grounded language tasks. These tasks add temporal variance to the complexity of traditional visual tasks. Motivated by recent NMN approaches to image-grounded tasks, we introduce the Video-grounded Neural Module Network (VGNMN), which models the information retrieval process in video-grounded language tasks as a pipeline of neural modules. VGNMN first decomposes all language components to explicitly resolve any entity references and to detect the corresponding action-based inputs from the question. The detected entities and actions are then used as parameters to instantiate neural module networks that extract visual cues from the video. Our experiments show that VGNMN achieves promising performance on two video-grounded language tasks: video QA and video-grounded dialogues.
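
To make the "entities and actions as module parameters" idea concrete, here is a minimal toy sketch. It is not the authors' implementation: the module names `where_module` and `when_module`, the dot-product attention, and the tiny hand-written feature vectors are all assumptions chosen for illustration. An entity vector parameterizes a spatial module that pools the regions of each frame, and an action vector parameterizes a temporal module that pools the resulting per-frame summaries.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def where_module(entity_vec, frame):
    """Hypothetical spatial module: attend over the regions of one frame,
    weighting each region by its similarity to the entity vector."""
    weights = softmax([dot(entity_vec, region) for region in frame])
    dim = len(entity_vec)
    return [sum(w * region[i] for w, region in zip(weights, frame))
            for i in range(dim)]


def when_module(action_vec, frame_summaries):
    """Hypothetical temporal module: attend over per-frame summaries,
    weighting each frame by its similarity to the action vector."""
    weights = softmax([dot(action_vec, s) for s in frame_summaries])
    dim = len(action_vec)
    pooled = [sum(w * s[i] for w, s in zip(weights, frame_summaries))
              for i in range(dim)]
    return pooled, weights


def run_program(entity_vec, action_vec, video):
    """Chain the two modules: spatial pooling per frame, then temporal pooling."""
    summaries = [where_module(entity_vec, frame) for frame in video]
    return when_module(action_vec, summaries)


# Toy video (assumed data): 3 frames, 2 regions per frame, 2-d region features.
video = [
    [[1.0, 0.0], [0.0, 0.0]],  # entity present, action absent
    [[1.0, 1.0], [0.0, 0.0]],  # entity present, action present
    [[0.0, 0.0], [0.0, 0.0]],  # nothing relevant
]
entity = [2.0, 0.0]  # stands in for an encoded entity reference
action = [0.0, 2.0]  # stands in for an encoded action phrase

cue, temporal_weights = run_program(entity, action, video)
```

In this toy setting the temporal weights peak on the middle frame, the only one in which the queried entity co-occurs with the queried action. A real system would learn the region features and module parameters instead of hand-coding them.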





Visual Entailment Task for Visually-Grounded Language Learning

We introduce a new inference task - Visual Entailment (VE) - which diffe...

iParaphrasing: Extracting Visually Grounded Paraphrases via an Image

A paraphrase is a restatement of the meaning of a text in other words. P...

All-in-One Image-Grounded Conversational Agents

As single-task accuracy on individual language and image tasks has impro...

Towards Understanding Sample Variance in Visually Grounded Language Generation: Evaluations and Observations

A major challenge in visually grounded language generation is to build r...

Neural Variational Learning for Grounded Language Acquisition

We propose a learning system in which language is grounded in visual per...

Understanding Grounded Language Learning Agents

Neural network-based systems can now learn to locate the referents of wo...

Pano-AVQA: Grounded Audio-Visual Question Answering on 360° Videos

360° videos convey holistic views for the surroundings of a scene. It p...