Gaussian Kernel-based Cross Modal Network for Spatio-Temporal Video Grounding

07/02/2022
by Zeyu Xiong, et al.

Spatio-Temporal Video Grounding (STVG) is a challenging task that aims to localize the spatio-temporal tube of the target object described by a natural language query. Most previous works rely heavily on anchor boxes extracted by Faster R-CNN and treat the video as a series of independent frames, and therefore lack temporal modeling. In this paper, we instead propose the first anchor-free framework for STVG, called Gaussian Kernel-based Cross Modal Network (GKCMN). Specifically, we use learned Gaussian kernel-based heatmaps of each video frame to locate the query-related object. We further develop a mixed serial and parallel connection network that leverages both spatial and temporal relations among frames for better grounding. Experimental results on the VidSTG dataset demonstrate the effectiveness of the proposed GKCMN.
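
The abstract does not spell out how the Gaussian kernel-based heatmaps are constructed. As a rough illustration only, the sketch below builds the kind of per-frame target heatmap commonly used in anchor-free detectors: a 2D Gaussian centred on the object's bounding box, with a spread tied to the box size. The function names `gaussian_heatmap` and `peak_location`, the `sigma_scale` parameter, and the box format are all assumptions made for this example and are not taken from the paper.

```python
import math


def gaussian_heatmap(height, width, box, sigma_scale=0.25):
    """Build a height x width heatmap that peaks at the centre of `box`.

    `box` is (x_min, y_min, x_max, y_max) in pixel coordinates.
    `sigma_scale` is a hypothetical hyperparameter that ties the kernel
    spread to the box size; the paper may choose it differently.
    """
    x_min, y_min, x_max, y_max = box
    cx = (x_min + x_max) / 2.0
    cy = (y_min + y_max) / 2.0
    # Guard against degenerate boxes so the variance is never zero.
    sigma_x = max((x_max - x_min) * sigma_scale, 1e-6)
    sigma_y = max((y_max - y_min) * sigma_scale, 1e-6)
    return [
        [
            math.exp(
                -((x - cx) ** 2 / (2 * sigma_x ** 2)
                  + (y - cy) ** 2 / (2 * sigma_y ** 2))
            )
            for x in range(width)
        ]
        for y in range(height)
    ]


def peak_location(heatmap):
    """Return the (row, column) of the largest heatmap value."""
    best = max(
        (value, y, x)
        for y, row in enumerate(heatmap)
        for x, value in enumerate(row)
    )
    return best[1], best[2]


# One frame of 64 x 96 pixels with an object box centred at x=40, y=25.
heatmap = gaussian_heatmap(64, 96, box=(20, 10, 60, 40))
print(peak_location(heatmap))  # (25, 40)
```

In an anchor-free setup of this kind, a network would be trained to regress such a heatmap for each frame conditioned on the query, and the predicted peak would stand in for the anchor boxes that earlier methods took from Faster R-CNN. Whether GKCMN follows exactly this recipe is not stated in the abstract.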


research
01/19/2020

Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences

In this paper, we consider a novel task, Spatio-Temporal Video Grounding...
research
08/11/2022

HERO: HiErarchical spatio-tempoRal reasOning with Contrastive Action Correspondence for End-to-End Video Object Grounding

Video Object Grounding (VOG) is the problem of associating spatial objec...
research
02/21/2023

Tracking Objects and Activities with Attention for Temporal Sentence Grounding

Temporal sentence grounding (TSG) aims to localize the temporal segment ...
research
07/06/2022

STVGFormer: Spatio-Temporal Video Grounding with Static-Dynamic Cross-Modal Understanding

In this technical report, we introduce our solution to human-centric spa...
research
06/02/2021

Rethinking Cross-modal Interaction from a Top-down Perspective for Referring Video Object Segmentation

Referring video object segmentation (RVOS) aims to segment video objects...
research
03/14/2023

You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos

Given an untrimmed video, temporal sentence grounding (TSG) aims to loca...
research
05/14/2021

Collaborative Spatial-Temporal Modeling for Language-Queried Video Actor Segmentation

Language-queried video actor segmentation aims to predict the pixel-leve...
