Hypotheses Tree Building for One-Shot Temporal Sentence Localization

01/05/2023
by   Daizong Liu, et al.

Given an untrimmed video, temporal sentence localization (TSL) aims to localize the specific segment described by a given sentence query. Although existing works have achieved decent results on this task, they rely heavily on dense video frame annotations, which require a tremendous amount of human effort to collect. In this paper, we target a more practical and challenging setting: one-shot temporal sentence localization (one-shot TSL), which learns to retrieve the query information from the entire video with only one annotated frame. In particular, we propose an effective and novel tree-structured baseline for one-shot TSL, called Multiple Hypotheses Segment Tree (MHST), to capture query-aware, discriminative frame-wise information under insufficient annotations. Each video frame is taken as a leaf node, and adjacent frames sharing the same visual-linguistic semantics are merged into an upper non-leaf node during tree building. Finally, each root node is an individual segment hypothesis containing the consecutive frames of its leaf nodes. During tree construction, we also introduce a pruning strategy to eliminate the interference of query-irrelevant nodes. With our designed self-supervised loss functions, MHST is able to generate high-quality segment hypotheses for ranking and selection with the query. Experiments on two challenging datasets demonstrate that MHST achieves competitive performance compared to existing methods.
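The bottom-up procedure described in the abstract (frames as leaves, merging of adjacent similar nodes, pruning of query-irrelevant nodes, ranking of the resulting roots) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how similarity is measured or how features are fused, so the cosine similarity, the mean-pooled node features, the fixed thresholds `merge_thr` and `prune_thr`, and all function names below are assumptions chosen only for illustration. The learned components and the self-supervised losses are omitted entirely.

```python
from dataclasses import dataclass
from math import sqrt
from typing import List, Sequence


@dataclass
class Node:
    start: int            # index of the first frame covered (inclusive)
    end: int              # index of the last frame covered (inclusive)
    feature: List[float]  # mean of the covered frame features (assumption)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def merge(left: Node, right: Node) -> Node:
    """Fuse two adjacent nodes into a parent covering both frame ranges."""
    n_l = left.end - left.start + 1
    n_r = right.end - right.start + 1
    feat = [(n_l * x + n_r * y) / (n_l + n_r)
            for x, y in zip(left.feature, right.feature)]
    return Node(left.start, right.end, feat)


def build_hypotheses(frames: Sequence[Sequence[float]],
                     query: Sequence[float],
                     merge_thr: float = 0.9,
                     prune_thr: float = 0.2) -> List[Node]:
    # Every frame starts as a leaf node.
    level = [Node(i, i, list(f)) for i, f in enumerate(frames)]
    # Pruning (hypothetical rule): drop leaves that barely match the query.
    level = [n for n in level if cosine(n.feature, query) >= prune_thr]

    merged = True
    while merged:  # build one tree level per pass until nothing merges
        merged = False
        nxt: List[Node] = []
        i = 0
        while i < len(level):
            can_merge = (
                i + 1 < len(level)
                and level[i].end + 1 == level[i + 1].start  # temporally adjacent
                and cosine(level[i].feature, level[i + 1].feature) >= merge_thr
            )
            if can_merge:
                nxt.append(merge(level[i], level[i + 1]))
                i += 2
                merged = True
            else:
                nxt.append(level[i])
                i += 1
        level = nxt

    # Remaining roots are the segment hypotheses; rank them against the query.
    return sorted(level, key=lambda n: cosine(n.feature, query), reverse=True)


# Toy input: two frames of one "scene", then three frames matching the query.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
query = [0.0, 1.0]
hyps = build_hypotheses(frames, query)
print([(h.start, h.end) for h in hyps])
```

On this toy input the first two frames are pruned and the remaining three frames merge into a single hypothesis spanning frames 2 to 4. Setting `prune_thr` to a negative value disables pruning, in which case the query-irrelevant frames form their own root and are simply ranked last.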


