Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding

09/27/2022
by   Yang Jin, et al.

Spatio-temporal video grounding (STVG) focuses on retrieving the spatio-temporal tube of a specific object depicted by a free-form textual expression. Existing approaches mainly treat this complicated task as a parallel frame-grounding problem and thus suffer from two types of inconsistency drawbacks: feature alignment inconsistency and prediction inconsistency. In this paper, we present an end-to-end one-stage framework, termed Spatio-Temporal Consistency-Aware Transformer (STCAT), to alleviate these issues. Specifically, we introduce a novel multi-modal template as the global objective for this task, which explicitly constrains the grounding region and associates the predictions across all video frames. Moreover, to generate this template under sufficient video-textual perception, an encoder-decoder architecture is proposed for effective global context modeling. Thanks to these critical designs, STCAT enjoys more consistent cross-modal feature alignment and tube prediction without relying on any pre-trained object detectors. Extensive experiments show that our method outperforms previous state-of-the-art methods by clear margins on two challenging video benchmarks (VidSTG and HC-STVG), illustrating the superiority of the proposed framework in better understanding the association between vision and natural language. Code is publicly available at <https://github.com/jy0205/STCAT>.


Related research

03/30/2022
TubeDETR: Spatio-Temporal Video Grounding with Transformers
We consider the problem of localizing a spatio-temporal tube in a video ...

01/19/2020
Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences
In this paper, we consider a novel task, Spatio-Temporal Video Grounding...

08/11/2022
HERO: HiErarchical spatio-tempoRal reasOning with Contrastive Action Correspondence for End-to-End Video Object Grounding
Video Object Grounding (VOG) is the problem of associating spatial objec...

03/23/2021
Co-Grounding Networks with Semantic Attention for Referring Expression Comprehension in Videos
In this paper, we address the problem of referring expression comprehens...

07/23/2023
Iterative Robust Visual Grounding with Masked Reference based Centerpoint Supervision
Visual Grounding (VG) aims at localizing target objects from an image ba...

02/22/2023
Video-SwinUNet: Spatio-temporal Deep Learning Framework for VFSS Instance Segmentation
This paper presents a deep learning framework for medical video segmenta...

04/04/2022
Learning Commonsense-aware Moment-Text Alignment for Fast Video Temporal Grounding
Grounding temporal video segments described in natural language queries ...
