T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval

04/20/2021
by Xiaohan Wang, et al.

Text-video retrieval is a challenging task that aims to retrieve relevant video content given a natural language description. The key to this problem is measuring text-video similarity in a joint embedding space. However, most existing methods consider only the global cross-modal similarity and overlook local details. Some works incorporate local comparisons through cross-modal local matching and reasoning, but these complex operations introduce a tremendous computational cost. In this paper, we design an efficient global-local alignment method. The multi-modal video sequences and text features are adaptively aggregated with a set of shared semantic centers, and local cross-modal similarities are computed between the video and text features assigned to the same center. This design enables meticulous local comparison while reducing the computational cost of the interaction between each text-video pair. Moreover, we propose a global alignment method that provides a global cross-modal measurement complementary to the local perspective. The globally aggregated visual features also provide additional supervision, which is indispensable for optimizing the learnable semantic centers. We achieve consistent improvements on three standard text-video retrieval benchmarks and outperform the state of the art by a clear margin.
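The abstract describes the shared-center alignment only at a high level. Below is a minimal PyTorch sketch of the idea, assuming a NetVLAD-style soft assignment to learnable centers; the names (`SharedCenterAggregation`, `num_centers`, the toy dimensions) are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedCenterAggregation(nn.Module):
    """NetVLAD-style aggregation with K learnable semantic centers.

    The same centers are shared between the video and text branches, so
    the k-th aggregated video descriptor can be compared directly with
    the k-th aggregated text descriptor.
    """

    def __init__(self, dim: int, num_centers: int):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_centers, dim) * 0.02)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, seq_len, dim) -> (batch, num_centers, dim)
        assign = F.softmax(feats @ self.centers.t(), dim=-1)  # soft assignment (B, T, K)
        agg = assign.transpose(1, 2) @ feats                  # sum of assigned features (B, K, D)
        agg = agg - assign.sum(dim=1).unsqueeze(-1) * self.centers  # residuals to centers
        return F.normalize(agg, dim=-1)


def local_similarity(video_agg: torch.Tensor, text_agg: torch.Tensor) -> torch.Tensor:
    # Cosine similarity between video and text descriptors assigned to the
    # SAME center, averaged over centers: no pairwise interaction between
    # every text token and every video frame is required.
    return (video_agg * text_agg).sum(dim=-1).mean(dim=-1)


def global_similarity(video_agg: torch.Tensor, text_agg: torch.Tensor) -> torch.Tensor:
    # Complementary global measurement: flatten the per-center descriptors
    # into one global vector per modality and compare.
    v = F.normalize(video_agg.flatten(1), dim=-1)
    t = F.normalize(text_agg.flatten(1), dim=-1)
    return (v * t).sum(dim=-1)


video_feats = torch.randn(2, 30, 256)  # e.g. 30 video tokens per clip
text_feats = torch.randn(2, 12, 256)   # e.g. 12 word tokens per caption
vlad = SharedCenterAggregation(dim=256, num_centers=8)
sim = (local_similarity(vlad(video_feats), vlad(text_feats))
       + global_similarity(vlad(video_feats), vlad(text_feats)))  # shape (2,)
```

Because both modalities share one set of centers, per-pair cost is linear in the number of centers rather than quadratic in sequence lengths, which is the efficiency argument made above.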

Related research

- 11/17/2017, Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models
  Textual-visual cross-modal retrieval has been a hot research topic in bo...

- 01/30/2023, Tagging before Alignment: Integrating Multi-Modal Tags for Video-Text Retrieval
  Vision-language alignment learning for video-text retrieval arouses a lo...

- 03/29/2021, Memory Enhanced Embedding Learning for Cross-Modal Video-Text Retrieval
  Cross-modal video-text retrieval, a challenging task in the field of vis...

- 02/24/2023, Deep Learning for Video-Text Retrieval: a Review
  Video-Text Retrieval (VTR) aims to search for the most relevant video re...

- 05/20/2023, Text-Video Retrieval with Disentangled Conceptualization and Set-to-Set Alignment
  Text-video retrieval is a challenging cross-modal task, which aims to al...

- 05/23/2023, Faster Video Moment Retrieval with Point-Level Supervision
  Video Moment Retrieval (VMR) aims at retrieving the most relevant events...

- 11/09/2020, Learning the Best Pooling Strategy for Visual Semantic Embedding
  Visual Semantic Embedding (VSE) is a dominant approach for vision-langua...
