Deep Learning for Video-Text Retrieval: a Review

02/24/2023
by   Cunjuan Zhu, et al.

Video-Text Retrieval (VTR) aims to find the video most relevant to the semantics of a given sentence, and vice versa. In general, this retrieval task is composed of four successive steps: video feature extraction, textual feature extraction, feature embedding and matching, and objective functions. Finally, a list of samples retrieved from the dataset is ranked by their matching similarity to the query. In recent years, deep learning techniques have driven significant progress. However, VTR remains a challenging task because of open problems such as how to learn an efficient spatial-temporal video feature and how to narrow the cross-modal gap. In this survey, we review and summarize over 100 research papers related to VTR, report state-of-the-art performance on several commonly benchmarked datasets, and discuss potential challenges and directions, with the aim of providing some insights for researchers in the field of video-text retrieval.
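
The final ranking step described in the abstract can be illustrated with a minimal sketch. It assumes that the text query and each video have already been embedded into a shared vector space and that cosine similarity is the matching score; the abstract does not fix either choice. The function names, video identifiers, and toy vectors below are hypothetical and only serve the illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(query_embedding, candidate_embeddings):
    """Return (candidate_id, score) pairs sorted from most to least similar."""
    scored = [
        (candidate_id, cosine_similarity(query_embedding, embedding))
        for candidate_id, embedding in candidate_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy embeddings (assumed, not taken from any dataset or model).
text_query = [1.0, 0.0, 1.0]
videos = {
    "video_a": [0.9, 0.1, 0.8],
    "video_b": [0.0, 1.0, 0.0],
    "video_c": [0.5, 0.5, 0.5],
}

ranking = rank_candidates(text_query, videos)
for video_id, score in ranking:
    print(f"{video_id}: {score:.3f}")
```

The same function covers the reverse direction (video-to-text retrieval) by passing a video embedding as the query and sentence embeddings as the candidates.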


research
04/20/2021

T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval

Text-video retrieval is a challenging task that aims to search relevant ...
research
07/24/2023

Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature Alignment

Text-to-video retrieval systems have recently made significant progress ...
research
06/11/2019

Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval

Visual-semantic embedding aims to find a shared latent space where relat...
research
10/09/2022

ConTra: (Con)text (Tra)nsformer for Cross-Modal Video Retrieval

In this paper, we re-examine the task of cross-modal clip-sentence retri...
research
09/19/2022

Tree-based Text-Vision BERT for Video Search in Baidu Video Advertising

The advancement of the communication technology and the popularity of th...
research
06/07/2023

MarineVRS: Marine Video Retrieval System with Explainability via Semantic Understanding

Building a video retrieval system that is robust and reliable, especiall...
research
05/13/2023

Mask to reconstruct: Cooperative Semantics Completion for Video-text Retrieval

Recently, masked video modeling has been widely explored and significant...
