Jointly Visual- and Semantic-Aware Graph Memory Networks for Temporal Sentence Localization in Videos

03/02/2023
by   Daizong Liu, et al.

Temporal sentence localization in videos (TSLV) aims to retrieve the segment of an untrimmed video that is most relevant to a given sentence query. However, almost all existing TSLV approaches share the same limitations: (1) they focus on either frame-level or object-level visual representation learning and the corresponding correlation reasoning, but fail to integrate the two; (2) they neglect the rich semantic contexts that could further benefit query reasoning. To address these issues, we propose a novel Hierarchical Visual- and Semantic-Aware Reasoning Network (HVSARN), which enables both visual- and semantic-aware query reasoning from the object level to the frame level. Specifically, we present a new graph memory mechanism to perform visual-semantic query reasoning. For visual reasoning, we design a visual graph memory to leverage the visual information of the video. For semantic reasoning, we introduce a semantic graph memory that explicitly leverages the semantic knowledge contained in the classes and attributes of video objects and performs correlation reasoning in the semantic space. Experiments on three datasets demonstrate that HVSARN achieves new state-of-the-art performance.

research
06/16/2020

Exploiting Visual Semantic Reasoning for Video-Text Retrieval

Video retrieval is a challenging research topic bridging the vision and ...
research
01/02/2023

Rethinking the Video Sampling and Reasoning Strategies for Temporal Sentence Grounding

Temporal sentence grounding (TSG) aims to identify the temporal boundary...
research
10/31/2021

Hierarchical Deep Residual Reasoning for Temporal Moment Localization

Temporal Moment Localization (TML) in untrimmed videos is a challenging ...
research
01/05/2023

Hypotheses Tree Building for One-Shot Temporal Sentence Localization

Given an untrimmed video, temporal sentence localization (TSL) aims to l...
research
07/26/2021

Adaptive Hierarchical Graph Reasoning with Semantic Coherence for Video-and-Language Inference

Video-and-Language Inference is a recently proposed task for joint video...
research
06/15/2023

Single-Stage Visual Query Localization in Egocentric Videos

Visual Query Localization on long-form egocentric videos requires spatio...
research
12/17/2017

Probabilistic Semantic Retrieval for Surveillance Videos with Activity Graphs

We present a novel framework for finding complex activities matching use...
