Local-Global Video-Text Interactions for Temporal Grounding

04/16/2020
by Jonghwan Mun, et al.

This paper addresses the problem of text-to-video temporal grounding, which aims to identify the time interval in a video that is semantically relevant to a text query. We tackle this problem using a novel regression-based model that learns to extract a collection of mid-level features for semantic phrases in a text query, which correspond to important semantic entities described in the query (e.g., actors, objects, and actions), and to reflect bi-modal interactions between the linguistic features of the query and the visual features of the video at multiple levels. The proposed method effectively predicts the target time interval by exploiting contextual information from local to global during bi-modal interactions. Through in-depth ablation studies, we find that incorporating both local and global context in video-text interactions is crucial to accurate grounding. Our experiments show that the proposed method outperforms the state of the art on the Charades-STA and ActivityNet Captions datasets by large margins: 7.44 and 4.61 percentage points on the Recall@tIoU=0.5 metric, respectively. Code is available at https://github.com/JonghwanMun/LGI4temporalgrounding.
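
The Recall@tIoU=0.5 metric mentioned above can be illustrated with a minimal sketch. It assumes one predicted interval per query and the usual definition of temporal IoU as intersection length over union length; the function names `temporal_iou` and `recall_at_tiou` are hypothetical and are not taken from the authors' code.

```python
from typing import List, Tuple

# An interval is (start_time, end_time) in seconds.
Interval = Tuple[float, float]


def temporal_iou(pred: Interval, gt: Interval) -> float:
    """Temporal intersection-over-union of two time intervals."""
    # Overlap length; zero when the intervals do not intersect.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    # Union length = sum of both lengths minus the overlap.
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_tiou(preds: List[Interval], gts: List[Interval],
                   threshold: float = 0.5) -> float:
    """Fraction of queries whose prediction reaches the tIoU threshold.

    Assumes preds[i] is the single predicted interval for the query
    whose ground-truth interval is gts[i].
    """
    if not gts:
        return 0.0
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


if __name__ == "__main__":
    # Illustrative values only, not results from the paper.
    predictions = [(0.0, 10.0), (2.0, 8.0)]
    ground_truth = [(0.0, 10.0), (5.0, 15.0)]
    print(recall_at_tiou(predictions, ground_truth, threshold=0.5))
```

Under this reading, a reported gain of 7.44 percentage points means that 7.44% more of the test queries receive a predicted interval whose tIoU with the ground truth is at least 0.5.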

research
08/31/2022

Hierarchical Local-Global Transformer for Temporal Sentence Grounding

This paper studies the multimedia problem of temporal sentence grounding...
research
03/22/2021

Context-aware Biaffine Localizing Network for Temporal Sentence Grounding

This paper addresses the problem of temporal sentence grounding (TSG), w...
research
11/20/2019

Learning Cross-modal Context Graph for Visual Grounding

Visual grounding is a ubiquitous building block in many vision-language ...
research
03/18/2022

Local-Global Context Aware Transformer for Language-Guided Video Segmentation

We explore the task of language-guided video segmentation (LVS). Previou...
research
04/12/2022

Position-aware Location Regression Network for Temporal Video Grounding

The key to successful grounding for video surveillance is to understand ...
research
08/11/2023

ViGT: Proposal-free Video Grounding with Learnable Token in Transformer

The video grounding (VG) task aims to locate the queried action or event...
research
03/15/2023

Scanning Only Once: An End-to-end Framework for Fast Temporal Grounding in Long Videos

Video temporal grounding aims to pinpoint a video segment that matches t...
