X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval

03/28/2022
by   Satya Krishna Gorti, et al.
In text-video retrieval, the objective is to learn a cross-modal similarity function between a text and a video that ranks relevant text-video pairs higher than irrelevant pairs. However, videos inherently express a much wider gamut of information than texts. Instead, texts often capture sub-regions of entire videos and are most semantically similar to certain frames within videos. Therefore, for a given text, a retrieval model should focus on the text's most semantically similar video sub-regions to make a more relevant comparison. Yet, most existing works aggregate entire videos without directly considering text. Common text-agnostic aggregation schemes include mean-pooling or self-attention over the frames, but these are likely to encode misleading visual information not described in the given text. To address this, we propose a cross-modal attention model called X-Pool that reasons between a text and the frames of a video. Our core mechanism is a scaled dot product attention for a text to attend to its most semantically similar frames. We then generate an aggregated video representation conditioned on the text's attention weights over the frames. We evaluate our method on three benchmark datasets, MSR-VTT, MSVD and LSMDC, achieving new state-of-the-art results by up to 12% improvement in Recall@1. Our findings thereby highlight the importance of joint text-video reasoning to extract important visual cues according to text. Full code and demo can be found at: https://layer6ai-labs.github.io/xpool/
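The core mechanism described above, a text query attending over frame embeddings via scaled dot-product attention, then pooling the frames by those attention weights, can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function name, shapes, and the `temperature` parameter are assumptions, and the actual X-Pool model additionally uses learned query/key/value projections on top of CLIP embeddings.

```python
import numpy as np

def text_conditioned_pool(text_emb, frame_embs, temperature=1.0):
    """Aggregate frame embeddings into one video embedding,
    weighting each frame by its scaled dot-product attention
    score against the text embedding (illustrative sketch)."""
    d = text_emb.shape[-1]
    # scaled dot-product score between the text and each frame
    scores = frame_embs @ text_emb / (np.sqrt(d) * temperature)  # (num_frames,)
    # softmax over frames (numerically stabilized)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # video representation conditioned on the text's attention weights
    return weights @ frame_embs  # (d,)

# toy example: 4 frames with 8-dimensional embeddings
rng = np.random.default_rng(0)
text = rng.normal(size=8)
frames = rng.normal(size=(4, 8))
video_emb = text_conditioned_pool(text, frames)
print(video_emb.shape)  # (8,)
```

Because the weights form a softmax distribution over frames, the pooled video embedding is a convex combination of the frame embeddings, pulled toward the frames most similar to the given text; mean-pooling is the special case of uniform weights.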

