Fine-grained Iterative Attention Network for Temporal Language Localization in Videos

08/06/2020
by   Xiaoye Qu, et al.

Temporal language localization in videos aims to ground one video segment in an untrimmed video based on a given sentence query. To tackle this task, designing an effective model to extract grounding information from both visual and textual modalities is crucial. However, most previous attempts in this field focus only on unidirectional interactions from video to query, which emphasize which words to listen to and attend to sentence information via vanilla soft attention; clues from query-by-video interactions, which imply where to look, are not taken into consideration. In this paper, we propose a Fine-grained Iterative Attention Network (FIAN) that consists of an iterative attention module for bilateral query-video information extraction. Specifically, in the iterative attention module, each word in the query is first enhanced by attending to each frame in the video through fine-grained attention, then the video iteratively attends to the integrated query. Finally, both video and query information are utilized to provide a robust cross-modal representation for further moment localization. In addition, to better predict the target segment, we propose a content-oriented localization strategy instead of applying recent anchor-based localization. We evaluate the proposed method on three challenging public benchmarks: ActivityNet Captions, TACoS, and Charades-STA. FIAN significantly outperforms the state-of-the-art approaches.
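The bilateral interaction described above can be illustrated with a minimal sketch. It is not the FIAN implementation: it assumes plain dot-product soft attention, a residual update, and a fixed number of iterations, none of which are specified in the abstract, and the helper names (`attend`, `iterative_attention`) are hypothetical. It only shows the order of the two steps: words attend to frames first, then frames attend to the enhanced words.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(queries, keys):
    """For each query vector, return a softmax-weighted sum of the key vectors."""
    dim = len(keys[0])
    attended = []
    for q in queries:
        weights = softmax([dot(q, k) for k in keys])
        attended.append(
            [sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)]
        )
    return attended


def residual_add(a, b):
    """Element-wise sum of two equally shaped lists of vectors (an assumption)."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def iterative_attention(words, frames, steps=2):
    """Alternate query-by-video and video-by-query attention `steps` times."""
    for _ in range(steps):
        # Each word is enhanced by attending to every frame ("where to look").
        words = residual_add(words, attend(words, frames))
        # Each frame then attends to the enhanced query ("which words to listen to").
        frames = residual_add(frames, attend(frames, words))
    return words, frames


# Toy features: 3 word vectors and 4 frame vectors, both of dimension 2.
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
frames = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0], [0.2, 0.8]]
enhanced_words, enhanced_frames = iterative_attention(words, frames)
```

Both outputs keep the shapes of their inputs, so a downstream localization head could consume the enhanced frame features directly. How the two streams are fused for localization is left out here because the abstract does not describe it.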


