Video Object Grounding using Semantic Roles in Language Description

03/24/2020
by Arka Sadhu, et al.

We explore the task of Video Object Grounding (VOG), which grounds objects in videos referred to in natural language descriptions. Previous methods apply image-grounding algorithms to VOG; they fail to exploit object relation information and suffer from limited generalization. Here, we investigate the role of object relations in VOG and propose a novel framework, VOGNet, that encodes multi-modal object relations via self-attention with relative position encoding. To evaluate VOGNet, we propose novel contrastive sampling methods that generate more challenging grounding input samples, and we construct a new dataset, ActivityNet-SRL (ASRL), based on existing captioning and grounding datasets. Experiments on ASRL validate the need for encoding object relations in VOG, and VOGNet outperforms competitive baselines by a significant margin.
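The core mechanism the abstract describes is self-attention over object features augmented with relative position encoding. The snippet below is a minimal, hypothetical PyTorch sketch of such a layer: the feature dimension, the (cx, cy, w, h) box parameterization, and the small MLP that turns pairwise box offsets into an additive attention bias are illustrative assumptions, not the paper's actual VOGNet implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationSelfAttention(nn.Module):
    """Single-head self-attention over object proposal features with a
    relative-position bias. Hypothetical sketch; dimensions and the exact
    form of the position encoding are assumptions, not the paper's code."""

    def __init__(self, feat_dim=512, pos_dim=64):
        super().__init__()
        self.query = nn.Linear(feat_dim, feat_dim)
        self.key = nn.Linear(feat_dim, feat_dim)
        self.value = nn.Linear(feat_dim, feat_dim)
        # Maps pairwise box offsets (dcx, dcy, dw, dh) to a scalar attention bias.
        self.rel_pos = nn.Sequential(
            nn.Linear(4, pos_dim), nn.ReLU(), nn.Linear(pos_dim, 1)
        )
        self.scale = feat_dim ** -0.5

    def forward(self, feats, boxes):
        # feats: (N, feat_dim) multi-modal object features
        # boxes: (N, 4) normalized box coordinates (cx, cy, w, h)
        q, k, v = self.query(feats), self.key(feats), self.value(feats)
        attn = (q @ k.t()) * self.scale                 # (N, N) content scores
        rel = boxes.unsqueeze(1) - boxes.unsqueeze(0)   # (N, N, 4) pairwise offsets
        attn = attn + self.rel_pos(rel).squeeze(-1)     # add relative-position bias
        attn = F.softmax(attn, dim=-1)
        return attn @ v                                 # relation-aware object features

# Usage: 10 object proposals with 512-d fused features.
layer = RelationSelfAttention()
feats = torch.randn(10, 512)
boxes = torch.rand(10, 4)
out = layer(feats, boxes)   # (10, 512)
```

The usage lines only illustrate tensor shapes; in practice the input object features would be the fused visual-and-language ("multi-modal") representations the abstract refers to.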

Related research

- 11/17/2022: Language Conditioned Spatial Relation Reasoning for 3D Object Grounding. Localizing objects in 3D scenes based on natural language requires under...
- 04/09/2022: On the Importance of Karaka Framework in Multi-modal Grounding. Computational Paninian Grammar model helps in decoding a natural languag...
- 03/14/2021: Refer-it-in-RGBD: A Bottom-up Approach for 3D Visual Grounding in RGBD Images. Grounding referring expressions in RGBD image has been an emerging field...
- 03/19/2021: ClawCraneNet: Leveraging Object-level Relation for Text-based Video Segmentation. Text-based video segmentation is a challenging task that segments out th...
- 06/02/2021: Rethinking Cross-modal Interaction from a Top-down Perspective for Referring Video Object Segmentation. Referring video object segmentation (RVOS) aims to segment video objects...
- 12/24/2021: Grounding Linguistic Commands to Navigable Regions. Humans have a natural ability to effortlessly comprehend linguistic comm...
- 12/01/2021: MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions. The recent and increasing interest in video-language research has driven...
