MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization

03/14/2022
by Alexander Kunitsyn, et al.

In this work we present a new state-of-the-art result on the text-to-video retrieval task on MSR-VTT, LSMDC, MSVD, YouCook2 and TGIF, obtained by a single model. Three different data sources are combined: weakly-supervised videos, crowd-labeled text-image pairs and text-video pairs. A careful analysis of available pre-trained networks helps to choose those that provide the best prior knowledge. We introduce a three-stage training procedure that provides high knowledge-transfer efficiency and allows noisy datasets to be used during training without degrading prior knowledge. Additionally, double positional encoding is used for better fusion of different modalities, and a simple method for processing non-square inputs is suggested.

research
03/19/2021

MDMMT: Multidomain Multimodal Transformer for Video Retrieval

We present a new state-of-the-art on the text to video retrieval task on...
research
09/16/2023

In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval

Large-scale noisy web image-text datasets have been proven to be efficie...
research
04/07/2018

Learning a Text-Video Embedding from Incomplete and Heterogeneous Data

Joint understanding of video and language is an active research area wit...
research
07/14/2023

PiTL: Cross-modal Retrieval with Weakly-supervised Vision-language Pre-training via Prompting

Vision-language (VL) Pre-training (VLP) has shown to well generalize VL ...
research
01/19/2023

Multimodal Video Adapter for Parameter Efficient Video Text Retrieval

State-of-the-art video-text retrieval (VTR) methods usually fully fine-t...
research
06/28/2023

ICSVR: Investigating Compositional and Semantic Understanding in Video Retrieval Models

Video retrieval (VR) involves retrieving the ground truth video from the...
research
10/05/2021

Manifold learning-supported estimation of relative transfer functions for spatial filtering

Many spatial filtering algorithms used for voice capture in, e.g., telec...
