Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences

01/19/2020
by Zhu Zhang, et al.

In this paper, we consider a novel task, Spatio-Temporal Video Grounding for Multi-Form Sentences (STVG). Given an untrimmed video and a declarative or interrogative sentence depicting an object, STVG aims to localize the spatio-temporal tube of the queried object. STVG has two challenging settings: (1) spatio-temporal object tubes must be localized in untrimmed videos, where the object may appear in only a very small segment of the video; (2) the sentences are multi-form, including declarative sentences with explicit objects and interrogative sentences with unknown objects. Existing methods cannot tackle the STVG task because of ineffective tube pre-generation and a lack of object relationship modeling. We therefore propose a novel Spatio-Temporal Graph Reasoning Network (STGRN) for this task. First, we build a spatio-temporal region graph to capture region relationships together with temporal object dynamics; the graph comprises implicit and explicit spatial subgraphs in each frame and a temporal dynamic subgraph across frames. We then incorporate textual clues into the graph and develop multi-step cross-modal graph reasoning. Next, we introduce a spatio-temporal localizer with a dynamic selection method to retrieve the spatio-temporal tubes directly, without tube pre-generation. Moreover, we contribute a large-scale video grounding dataset, VidSTG, built on the video relation dataset VidOR. Extensive experiments demonstrate the effectiveness of our method.
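
To make the task output concrete, the following is a minimal sketch of how a spatio-temporal tube could be represented: one bounding box per frame over the segment in which the object appears. The `Tube` class, the `(x1, y1, x2, y2)` box convention, and the `temporal_iou` helper are illustrative assumptions, not the paper's data format or evaluation metric.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Assumed box convention for this sketch: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class Tube:
    """Hypothetical tube: one box per frame over a contiguous frame range."""
    boxes: Dict[int, Box]  # frame index -> box

    @property
    def start(self) -> int:
        return min(self.boxes)

    @property
    def end(self) -> int:
        return max(self.boxes)


def temporal_iou(a: Tube, b: Tube) -> float:
    """Overlap of the two tubes' frame ranges (inclusive), ignoring the boxes."""
    inter = max(0, min(a.end, b.end) - max(a.start, b.start) + 1)
    union = (a.end - a.start + 1) + (b.end - b.start + 1) - inter
    return inter / union


# In an untrimmed video the queried object may occupy only a short segment:
# here, frames 10..19 for the prediction and 15..24 for the annotation.
predicted = Tube({f: (0.0, 0.0, 10.0, 10.0) for f in range(10, 20)})
annotated = Tube({f: (2.0, 2.0, 12.0, 12.0) for f in range(15, 25)})
overlap = temporal_iou(predicted, annotated)
```

The sketch compares only the temporal extent of two tubes; a full comparison would also account for how well the boxes agree on the shared frames.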

research
08/16/2020

Object-Aware Multi-Branch Relation Networks for Spatio-Temporal Video Grounding

Spatio-temporal video grounding aims to retrieve the spatio-temporal tub...
research
11/10/2020

Human-centric Spatio-Temporal Video Grounding With Visual Transformers

In this work, we introduce a novel task - Human-centric Spatio-Temporal V...
research
07/02/2022

Gaussian Kernel-based Cross Modal Network for Spatio-Temporal Video Grounding

Spatial-Temporal Video Grounding (STVG) is a challenging task which aims...
research
02/21/2023

Tracking Objects and Activities with Attention for Temporal Sentence Grounding

Temporal sentence grounding (TSG) aims to localize the temporal segment ...
research
09/27/2022

Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding

Spatio-Temporal video grounding (STVG) focuses on retrieving the spatio-...
research
02/16/2023

MINOTAUR: Multi-task Video Grounding From Multimodal Queries

Video understanding tasks take many forms, from action detection to visu...
research
07/17/2020

Visual Relation Grounding in Videos

In this paper, we explore a novel task named visual Relation Grounding i...
