Progressively Guide to Attend: An Iterative Alignment Framework for Temporal Sentence Grounding

09/14/2021
by   Daizong Liu, et al.
0

A key solution to temporal sentence grounding (TSG) exists in how to learn effective alignment between vision and language features extracted from an untrimmed video and a sentence description. Existing methods mainly leverage vanilla soft attention to perform the alignment in a single-step process. However, such single-step attention is insufficient in practice, since complicated relations between inter- and intra-modality are usually obtained through multi-step reasoning. In this paper, we propose an Iterative Alignment Network (IA-Net) for TSG task, which iteratively interacts inter- and intra-modal features within multiple steps for more accurate grounding. Specifically, during the iterative reasoning process, we pad multi-modal features with learnable parameters to alleviate the nowhere-to-attend problem of non-matched frame-word pairs, and enhance the basic co-attention mechanism in a parallel manner. To further calibrate the misaligned attention caused by each reasoning step, we also devise a calibration module following each attention module to refine the alignment knowledge. With such iterative alignment scheme, our IA-Net can robustly capture the fine-grained relations between vision and language domains step-by-step for progressively reasoning the temporal boundaries. Extensive experiments conducted on three challenging benchmarks demonstrate that our proposed model performs better than the state-of-the-arts.

READ FULL TEXT

page 1

page 8

research
10/21/2022

Fine-grained Semantic Alignment Network for Weakly Supervised Temporal Language Grounding

Temporal language grounding (TLG) aims to localize a video segment in an...
research
02/26/2021

A Universal Model for Cross Modality Mapping by Relational Reasoning

With the aim of matching a pair of instances from two different modaliti...
research
01/02/2023

Rethinking the Video Sampling and Reasoning Strategies for Temporal Sentence Grounding

Temporal sentence grounding (TSG) aims to identify the temporal boundary...
research
02/21/2023

Tracking Objects and Activities with Attention for Temporal Sentence Grounding

Temporal sentence grounding (TSG) aims to localize the temporal segment ...
research
08/18/2023

Language-Guided Diffusion Model for Visual Grounding

Visual grounding (VG) tasks involve explicit cross-modal alignment, as s...
research
09/22/2022

CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding

Video temporal grounding (VTG) targets to localize temporal moments in a...
research
08/23/2023

Sign Language Translation with Iterative Prototype

This paper presents IP-SLT, a simple yet effective framework for sign la...

Please sign up or login with your details

Forgot password? Click here to reset