Language Guided Networks for Cross-modal Moment Retrieval

06/18/2020
by   Kun Liu, et al.

We address the challenging task of cross-modal moment retrieval, which aims to localize a temporal segment in an untrimmed video that is described by a natural language query. The task poses great challenges for proper semantic alignment between the visual and linguistic domains. Most existing methods extract video and sentence features independently and leverage the sentence only in the multi-modal fusion stage, which does not make full use of the potential of language. In this paper, we present Language Guided Networks (LGN), a new framework that tightly integrates cross-modal features in multiple stages. In the first stage, feature extraction, we introduce an early modulation unit to capture discriminative visual features that cover the complex semantics of the sentence query; specifically, it modulates convolutional feature maps with a linguistic embedding. We then adopt a multi-modal fusion module in the second, fusion stage. Finally, to obtain a precise localizer, the sentence information is used to guide the prediction of temporal positions; specifically, a late guidance module further bridges the vision and language domains via a channel attention mechanism. We evaluate the proposed model on two popular public datasets, Charades-STA and TACoS. The experimental results demonstrate the superior performance of the proposed modules on moment retrieval (improving R1@IoU=0.5 by 5.8% on Charades-STA and 5.2% on TACoS). We include the code in the supplementary material and will make it publicly available.
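To make the two language-guidance stages concrete, the following is a minimal PyTorch-style sketch of a FiLM-like early modulation unit and a channel-attention late guidance module. The layer sizes, tensor shapes, and module names are assumptions for illustration only, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class EarlyModulation(nn.Module):
    """Sketch of an early modulation unit (assumed FiLM-style):
    a sentence embedding produces channel-wise scale and shift terms
    that modulate convolutional video feature maps."""
    def __init__(self, visual_dim: int, lang_dim: int):
        super().__init__()
        self.to_gamma = nn.Linear(lang_dim, visual_dim)  # channel-wise scale
        self.to_beta = nn.Linear(lang_dim, visual_dim)   # channel-wise shift

    def forward(self, visual_feats: torch.Tensor, sentence_emb: torch.Tensor):
        # visual_feats: (B, C, T) temporal feature maps; sentence_emb: (B, D)
        gamma = self.to_gamma(sentence_emb).unsqueeze(-1)  # (B, C, 1)
        beta = self.to_beta(sentence_emb).unsqueeze(-1)    # (B, C, 1)
        return gamma * visual_feats + beta

class LateGuidance(nn.Module):
    """Sketch of a late guidance module: the sentence embedding yields
    per-channel attention weights that re-weight the fused features
    before temporal-boundary prediction (assumed shapes)."""
    def __init__(self, fused_dim: int, lang_dim: int):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(lang_dim, fused_dim), nn.Sigmoid())

    def forward(self, fused_feats: torch.Tensor, sentence_emb: torch.Tensor):
        # fused_feats: (B, C, T); attention weights broadcast over time
        weights = self.attn(sentence_emb).unsqueeze(-1)  # (B, C, 1)
        return fused_feats * weights
```

For example, with batch size 2, 512 visual channels, 128 temporal clips, and a 300-dimensional sentence embedding, `EarlyModulation(512, 300)` maps `(2, 512, 128)` features to the same shape, conditioned on the query.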
