Are All Combinations Equal? Combining Textual and Visual Features with Multiple Space Learning for Text-Based Video Retrieval

11/21/2022
by   Damianos Galanopoulos, et al.
0

In this paper we tackle the cross-modal video retrieval problem and, more specifically, we focus on text-to-video retrieval. We investigate how to optimally combine multiple diverse textual and visual features into feature pairs that lead to generating multiple joint feature spaces, which encode text-video pairs into comparable representations. To learn these representations our proposed network architecture is trained by following a multiple space learning procedure. Moreover, at the retrieval stage, we introduce additional softmax operations for revising the inferred query-video similarities. Extensive experiments in several setups based on three large-scale datasets (IACC.3, V3C1, and MSR-VTT) lead to conclusions on how to best combine text-visual features and document the performance of the proposed network. Source code is made publicly available at: https://github.com/bmezaris/TextToVideoRetrieval-TtimesV

READ FULL TEXT

page 1

page 2

page 3

page 4

research
09/18/2023

Unified Coarse-to-Fine Alignment for Video-Text Retrieval

The canonical approach to video-text retrieval leverages a coarse-graine...
research
09/05/2017

Predicting Visual Features from Text for Image and Video Caption Retrieval

This paper strives to find amidst a set of sentences the one best descri...
research
06/11/2019

Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval

Visual-semantic embedding aims to find a shared latent space where relat...
research
07/06/2023

T-MARS: Improving Visual Representations by Circumventing Text Feature Learning

Large web-sourced multimodal datasets have powered a slew of new methods...
research
04/29/2015

Visual Information Retrieval in Endoscopic Video Archives

In endoscopic procedures, surgeons work with live video streams from the...
research
03/17/2023

DiffusionRet: Generative Text-Video Retrieval with Diffusion Model

Existing text-video retrieval solutions are, in essence, discriminant mo...
research
07/11/2022

LaT: Latent Translation with Cycle-Consistency for Video-Text Retrieval

Video-text retrieval is a class of cross-modal representation learning p...

Please sign up or login with your details

Forgot password? Click here to reset