Video-Text Retrieval by Supervised Multi-Space Multi-Grained Alignment

02/19/2023
by Yimu Wang, et al.

Recent progress in video-text retrieval has been driven largely by the exploration of better representation learning. In this paper, we present SUMA, a novel multi-space multi-grained supervised learning framework that learns an aligned representation space shared between video and text for video-text retrieval. The shared aligned space is initialized with a finite number of concept clusters, each of which refers to a number of basic concepts (words). Because text data is available, we can update the shared aligned space in a supervised manner using the proposed similarity and alignment losses. Moreover, to enable multi-grained alignment, we incorporate frame representations to better model the video modality and to compute both fine-grained and coarse-grained similarity. Benefiting from the learned shared aligned space and the multi-grained similarity, SUMA outperforms existing methods in extensive experiments on several video-text retrieval benchmarks.
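To make the distinction between coarse-grained and fine-grained similarity concrete, the following is a minimal sketch and not SUMA's actual formulation, which the abstract does not specify. It assumes that the coarse-grained score compares a mean-pooled video embedding with the text embedding, that the fine-grained score takes the best-matching frame, and that the two are mixed with a fixed weight. The function names, the pooling and aggregation choices, and the `weight` parameter are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_pool(frames):
    """Average a list of frame embeddings into one video-level embedding."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def coarse_similarity(frames, text):
    """Assumed coarse-grained score: whole video (mean-pooled) against the text."""
    return cosine(mean_pool(frames), text)


def fine_similarity(frames, text):
    """Assumed fine-grained score: the single frame that best matches the text."""
    return max(cosine(frame, text) for frame in frames)


def multi_grained_similarity(frames, text, weight=0.5):
    """Hypothetical fixed-weight mix of the coarse and fine scores."""
    coarse = coarse_similarity(frames, text)
    fine = fine_similarity(frames, text)
    return weight * coarse + (1.0 - weight) * fine


# Toy data: two 2-dimensional frame embeddings and one text embedding.
frames = [[1.0, 0.0], [0.0, 1.0]]
text = [1.0, 0.0]
score = multi_grained_similarity(frames, text)
```

In this toy case one frame matches the text exactly while the other is orthogonal to it, so the fine-grained score is higher than the coarse-grained one. This illustrates why frame-level representations can capture matches that a single pooled video embedding dilutes.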


research
07/15/2022

X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval

Video-text retrieval has been a crucial and fundamental task in multi-mo...
research
12/16/2022

HGAN: Hierarchical Graph Alignment Network for Image-Text Retrieval

Image-text retrieval (ITR) is a challenging task in the field of multimo...
research
09/18/2023

Unified Coarse-to-Fine Alignment for Video-Text Retrieval

The canonical approach to video-text retrieval leverages a coarse-graine...
research
10/10/2022

Contrastive Video-Language Learning with Fine-grained Frame Sampling

Despite recent progress in video and language representation learning, t...
research
11/14/2022

Composed Image Retrieval with Text Feedback via Multi-grained Uncertainty Regularization

We investigate composed image retrieval with text feedback. Users gradua...
research
04/22/2022

A Multi-level Alignment Training Scheme for Video-and-Language Grounding

To solve video-and-language grounding tasks, the key is for the network ...
research
10/11/2022

Analyzing Text Representations under Tight Annotation Budgets: Measuring Structural Alignment

Annotating large collections of textual data can be time consuming and e...
