TubeDETR: Spatio-Temporal Video Grounding with Transformers

03/30/2022
by Antoine Yang, et al.

We consider the problem of localizing a spatio-temporal tube in a video corresponding to a given text query. This is a challenging task that requires the joint and efficient modeling of temporal, spatial and multi-modal interactions. To address this task, we propose TubeDETR, a transformer-based architecture inspired by the recent success of such models for text-conditioned object detection. Our model notably includes: (i) an efficient video and text encoder that models spatial multi-modal interactions over sparsely sampled frames and (ii) a space-time decoder that jointly performs spatio-temporal localization. We demonstrate the advantage of our proposed components through an extensive ablation study. We also evaluate our full approach on the spatio-temporal video grounding task and demonstrate improvements over the state of the art on the challenging VidSTG and HC-STVG benchmarks. Code and trained models are publicly available at https://antoyang.github.io/tubedetr.html.
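To make the two ideas in the abstract concrete, the following is a minimal sketch of (a) sparse frame sampling and (b) a spatio-temporal tube represented as one box per frame within a temporal segment. It is an illustration only, not the authors' implementation: the names `Tube`, `sample_sparse_frames`, `boxes_to_tube`, the fixed `stride` parameter, and the `(x1, y1, x2, y2)` box convention are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical box convention: (x1, y1, x2, y2) in pixel coordinates.
Box = Tuple[float, float, float, float]


@dataclass
class Tube:
    """A spatio-temporal tube: one box per frame from start to end (inclusive)."""
    start: int
    end: int
    boxes: Dict[int, Box]


def sample_sparse_frames(num_frames: int, stride: int) -> List[int]:
    """Return every `stride`-th frame index, starting at frame 0."""
    if num_frames < 0 or stride < 1:
        raise ValueError("num_frames must be >= 0 and stride must be >= 1")
    return list(range(0, num_frames, stride))


def boxes_to_tube(boxes: List[Box], start: int, end: int) -> Tube:
    """Keep only the per-frame boxes that fall inside the temporal segment."""
    if not 0 <= start <= end < len(boxes):
        raise ValueError("segment must lie within the video")
    return Tube(start, end, {t: boxes[t] for t in range(start, end + 1)})


if __name__ == "__main__":
    # A 10-frame video sampled with an assumed stride of 4.
    print(sample_sparse_frames(10, 4))
    # Five dummy per-frame boxes, with a segment covering frames 1 to 3.
    dummy = [(float(t), 0.0, float(t) + 1.0, 1.0) for t in range(5)]
    print(sorted(boxes_to_tube(dummy, 1, 3).boxes))
```

In this sketch the expensive encoder would only see the sampled frame indices, while the final output still carries one box for every frame in the selected segment.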


