
MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization

by   Alexander Kunitsyn, et al.

In this work we present a new state-of-the-art on the text-to-video retrieval task on MSR-VTT, LSMDC, MSVD, YouCook2 and TGIF, obtained by a single model. Three different data sources are combined: weakly-supervised videos, crowd-labeled text-image pairs and crowd-labeled text-video pairs. A careful analysis of the available pre-trained networks helps to choose the ones with the best prior knowledge. We introduce a three-stage training procedure that provides high knowledge-transfer efficiency and allows the use of noisy datasets during training without degradation of the prior knowledge. Additionally, double positional encoding is used for better fusion of the different modalities, and a simple method for processing non-square inputs is suggested.





1 Introduction

The text-to-video retrieval task is defined as the search for the most relevant video segments given an arbitrary natural-language text query. A search query may describe arbitrary actions, objects, sounds, or combinations of them. Note that supporting arbitrary queries implies a zero-shot search mode: a specific query may never occur in the training database, yet the model should still perform the search successfully.

Text-to-video retrieval technology can be used for semantic search within a single long video, for example a full-length movie or a stream recording. After describing the event, the user can easily find the corresponding video segment. A more general task is the search for a relevant video segment within a large gallery, for example an entire video hosting service such as YouTube or Vimeo.

Another application is the search for a specific event in a collection of surveillance-camera recordings or in a real-time video stream. This can be useful for identifying illegal actions, accidents or other important events.

An important requirement for a text-to-video retrieval system is that it scales to a large video gallery. A good example of an efficient architecture is the two-stream model: the video segment and the text query are encoded independently, by a video model and a text model respectively. Separate processing makes it possible to precompute embeddings for the entire video gallery. At inference time, the system computes the embedding of the search query and then the similarity between the query embedding and each embedding in the gallery. The most common choice of similarity function is cosine similarity.
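The two-stream retrieval scheme above can be sketched in a few lines. This is an illustrative sketch only: the embedding dimension, gallery size and random embeddings are placeholders, not the paper's actual models.

```python
import numpy as np

def cosine_sim(a, b):
    # Row-normalize both matrices; the dot product is then cosine similarity.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

rng = np.random.default_rng(0)
gallery = rng.standard_normal((1000, 512))  # precomputed video embeddings
query = rng.standard_normal(512)            # text-query embedding

scores = cosine_sim(query[None, :], gallery)[0]  # one score per gallery video
top10 = np.argsort(-scores)[:10]                 # indices of the 10 best videos
```

Because the gallery embeddings are computed once offline, only the single query embedding and the similarity scores need to be computed per search.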

The data required for training consists of pairs of a video segment and a text description. Noise Contrastive Estimation (NCE) is currently the most common framework for this task [gabeur2020multimodal, miech19howto100m, portilloquintero2021straightforward, CLIP4Clip, clip2video, camoe, gao2021clip2tv]. Within this framework, the model learns to distinguish a positive pair from a set of negative pairs. The most popular losses used with NCE are the bi-directional max-margin ranking loss [Karpathy2014DeepFE] and the symmetric cross-entropy loss [NIPS2016_6b180037, oord, infoNCE].
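As an illustrative sketch of the symmetric cross-entropy (InfoNCE-style) objective mentioned above: given a batch similarity matrix with positives on the diagonal, the loss averages a text-to-video and a video-to-text cross-entropy term. The temperature value here is an arbitrary stand-in.

```python
import numpy as np

def symmetric_infonce(sim, temperature=0.07):
    # sim[i, j]: similarity of text i and video j; diagonal entries
    # are the positive pairs.
    logits = sim / temperature

    def xent_diag(m):
        m = m - m.max(axis=1, keepdims=True)  # numerical stability
        log_probs = m - np.log(np.exp(m).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))   # cross-entropy of positives

    # average the text->video (rows) and video->text (columns) directions
    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))
```

With a strongly diagonal similarity matrix the loss approaches zero; with uniform similarities it equals log(batch size).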

Since a search query may describe a sound or a visual component, it is important to capture information from both the visual and the audio streams of the input video. In this work we fuse information from three modalities: the RGB modality (each frame processed independently), the motion modality (multiple consecutive frames processed together) and the audio modality.

2 Related work

The text-to-video retrieval task originates in 2016 with the work [rohrbach2016movie].

Nowadays there is a large number of high-quality crowd-labeled datasets suitable for the text-to-video retrieval task [xu2016msr-vtt, caba2015activitynet, tgif-cvpr2016, rohrbach2016movie, chen-dolan-2011-collecting, wang2020vatex, lei2019tvqa, 2020trecvidawad, ZhXuCoAAAI18, goyal2017something] and numerous works using these datasets [gao2021clip2tv, camoe, clip2video, CLIP4Clip, laff, mdmmt, gabeur2020multimodal, yang2021taco]. In [miech2020endtoend] the authors leverage a large amount of weakly-supervised data (the HT100M dataset) from YouTube to train a model. In [gabeur2020multimodal, mdmmt] both weakly-supervised data for pre-training and crowd-labeled datasets for fine-tuning are used.

The task requires a large amount of data, so looking for alternative data sources is quite reasonable. Since the visual stream of a video is a sequence of frames (images), any individual image can be considered a one-frame video. In [miech2020learning] the authors successfully use both image-text and video-text datasets.

Impressive results are achieved in text-to-image retrieval by the CLIP model, which is trained with a large amount of web-crawled data.


To create a text-to-video retrieval model for general application (without specialization for a particular domain), a large amount of data is required. The authors of CLIP use hundreds of millions of samples to train a general-purpose text-to-image retrieval model. Most probably the text-to-video retrieval task requires no less, and rather more, data.

Unfortunately, combining all crowd-labeled text-video and text-image datasets does not come close to a high-quality general-purpose model. In [miech2020endtoend] the authors attempt to use a large amount of weakly-supervised data, but the result is still far from a high-quality model.

Transfer-learning-based methods are increasingly popular for this task. One of the first successful applications of transfer learning to text-to-video retrieval can be attributed to [miech2018learning], where several pre-trained networks are used to extract features from video. In [gabeur2020multimodal] the authors additionally adopt the BERT model [bert] as initialization for the text encoder. Later works [gao2021clip2tv, camoe, clip2video, CLIP4Clip, mdmmt] use the CLIP model as initialization for both the text and vision encoders.

Pre-trained models suitable for the text-to-video retrieval task can be divided into two classes. The first class is trained on crowd-labeled datasets such as ImageNet [imagenet_cvpr09] or Kinetics [kay2017kinetics]. Such models usually produce task-specific embeddings, which do not allow high quality in the text-to-video retrieval task. The second class is trained with a large amount of weakly-supervised data collected from the Internet. The most popular are CLIP, BERT and irCSN152 trained on the IG65M dataset (irCSN152-IG65M) [ghadiyaram2019largescale].

The analysis of pre-trained models in [mdmmt] and our own experience show that models trained with a large amount of web-crawled data produce embeddings suitable for general application and allow reaching better quality in the text-to-video retrieval task.

Using CLIP as an initialization or as a feature extractor significantly improves results in the text-to-video retrieval task [gao2021clip2tv, camoe, clip2video, CLIP4Clip, mdmmt]. The CLIP model family has several different architectures; all of them have independent text and visual encoders.

In this work we manage to combine crowd-labeled text-video, crowd-labeled text-image, and weakly-supervised text-video (HT100M) datasets in the same training. In addition, we use the best available pre-trained models. This allows us to achieve state-of-the-art results with a single model on a number of benchmarks.

3 Methodology

Our model follows the idea of MDMMT [gabeur2020multimodal, mdmmt]. However, we suggest an advanced multistage training approach, analyze the available prior knowledge, and choose optimal backbones.

3.1 Architecture

The architecture consists of four parts: pre-trained experts, an aggregator, a text encoder and a text embedding projection.

A pre-trained expert is a frozen pre-trained network that produces a sequence of features for an input video. In this work we use three experts, one per modality. The first is for images (RGB modality) and processes video frames independently. The second is for motion; it deals with several consecutive frames together. The third is for audio. See the pseudocode example in Lst. 1.

The aggregator accepts the embeddings produced by the experts and produces a single embedding for the video. See the pseudocode example in Lst. 2.

The text encoder accepts arbitrary English natural-language text and produces an embedding.

def encode_rgb(V):
  # V: input video
  embs = []
  frames_lst = read_1_frame_per_second(V)
  for frame in frames_lst:
    emb = image_network(frame)
    embs.append(emb)
  rgb_embs = concatenate(embs, dim=0)
  return rgb_embs
Listing 1: Example of pre-trained expert usage
def aggregator(rgb_embs, motion_embs, audio_embs):
  rgb_embs = FC_768_to_512(rgb_embs)
  rgb_cls = rgb_embs.max(dim=0) + rgb_bias  # (1, 512)
  rgb_input = rgb_embs + positional + rgb_bias
  # do the same for the other modalities
  x = concatenate([
    rgb_cls, motion_cls, audio_cls,
    rgb_input, motion_input, audio_input], dim=0)
  x = transformer_encoder(x)
  x = normalize(x)
  video_emb = x[:3].reshape(-1)  # (512*3, )
  return video_emb
Listing 2: Example of aggregator

The text embedding projection part maps the text embedding to a distinct space for each modality. See the example in Lst. 3. GEU1, GEU2 and GEU3 denote Gated Embedding Units [liu2020use], one per modality.

def text_embedding_projection(temb):
  a1, a2, a3 = softmax(FC_512_to_3(temb))
  temb_rgb = a1 * GEU1(temb)
  temb_motion = a2 * GEU2(temb)
  temb_audio = a3 * GEU3(temb)
  return concatenate([
    temb_rgb, temb_motion, temb_audio
  ])  # (512*3, )
Listing 3: Text embedding projection
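Listing 3 relies on Gated Embedding Units. Below is a minimal NumPy sketch of one GEU, following the common formulation in [liu2020use]: a linear projection, an element-wise sigmoid gate, then L2 normalization. The weight initialization and dimensions here are arbitrary stand-ins, not the trained model's parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GatedEmbeddingUnit:
    # Projection, element-wise sigmoid gate, then L2 normalization.
    def __init__(self, dim_in, dim_out, rng):
        self.W1 = rng.standard_normal((dim_in, dim_out)) * 0.02
        self.b1 = np.zeros(dim_out)
        self.W2 = rng.standard_normal((dim_out, dim_out)) * 0.02
        self.b2 = np.zeros(dim_out)

    def __call__(self, x):
        z = x @ self.W1 + self.b1               # linear projection
        z = z * sigmoid(z @ self.W2 + self.b2)  # learned element-wise gate
        return z / np.linalg.norm(z, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
geu_rgb = GatedEmbeddingUnit(512, 512, rng)  # one GEU per modality
temb = rng.standard_normal((1, 512))         # text embedding
temb_rgb = geu_rgb(temb)                     # unit-norm projection, (1, 512)
```

In Lst. 3 three such units are applied to the same text embedding, and the softmax weights a1, a2, a3 balance the modality-specific projections.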

Note that this architecture is flexible. It is possible to remove or add additional modalities. Also it is possible to replace a given pre-trained text encoder with another one. For example, it is possible to use CLIP ViT-B/32 as RGB expert and text part of CLIP ViT-B/16 as text encoder.

3.2 Double positional encoding

Each expert takes a different type and shape of data as input. For example, CLIP takes a single image frame to produce an embedding. irCSN152-IG65M produces a single embedding from a sequence of 32 consecutive frames. Slow-Fast (SF) [kazakos2021slowfast] takes the mel-spectrogram of a 5-second audio segment to produce an embedding.

Positional encoding is used in the transformer encoder architecture to provide information about the order of tokens in the input sequence. In our case, positional (temporal) encoding has to provide information not only about the order of tokens, but also about the time span of each individual token.

We introduce double positional encoding: for each embedding we add two biases, the first standing for the timestamp of the beginning of the video segment and the second for the timestamp of its end; see the pseudocode in Lst. 4.

# nsec: video duration in seconds
positions_beg = nn.Parameter(32, 512)
positions_end = nn.Parameter(32, 512)
# audio expert: one embedding per 5-second segment
audio_embs = audio_embs +
             positions_beg[0::5][:nsec//5] +
             positions_end[4::5][:nsec//5]
# RGB expert: one embedding per second
rgb_embs = rgb_embs +
           positions_beg[:nsec] +
           positions_end[:nsec]
Listing 4: Pseudocode for double positional encoding

This way we make sure that the different time spans of the expert embeddings are processed correctly. The results in Tab. 1 support this design choice.

Temporal embedding  Text→video retrieval
                    R@1   R@5   R@10  MdR
Single              22.1  48.2  60.0  6.0
Double              22.2  48.5  60.3  6.0
Table 1: Comparison of standard positional encoding with the proposed double positional encoding. Dataset: MSR-VTT full clean split (see Sec. 3.3); Text backbone: CLIP ViT-B/32; Experts: CLIP ViT-L/14, irCSN152-IG65M, SF

3.3 Datasets

A list of the datasets used in this work is provided in Tab. 2. Only the training splits of the listed datasets are used in the training set. Note that we use both text-video and text-image datasets. In Sec. 4.4 we show results for video-only datasets and for image plus video datasets. Since each dataset has a different number of videos and captions, it is important to combine datasets properly [mdmmt].

In the following experiments the MSR-VTT full clean split is used. This split is introduced in [mdmmt]. The test part of the full clean split is the same as the test part of the full split. The training part of the full clean split is mostly similar to the full split, but some videos are removed; each removed video has a duplicate in the test part.

Dataset  Num videos (images)  Num pairs  Num unique captions
MSR-VTT [xu2016msr-vtt] 10k 200k 167k
ActivityNet [caba2015activitynet] 14k 70k 69k
LSMDC [rohrbach2016movie] 101k 101k 101k
TwitterVines [2020trecvidawad] 6.5k 23k 23k
YouCook2 [ZhXuCoAAAI18] 1.5k 12k 12k
MSVD [chen-dolan-2011-collecting] 2k 80k 64k
TGIF [tgif-cvpr2016] 102k 125k 125k
SomethingV2 [goyal2017something] 193k 193k 124k
VATEX [wang2020vatex] 28k 278k 278k
TVQA [lei2019tvqa] 20k 179k 178k
Sum above 477k 1261k
Flickr30k [young-etal-2014-image] 32k 159k 158k
COCO [chen2015microsoft] 123k 617k 592k
Conceptual Captions [Sharma2018ConceptualCA] 3M 3M 2M
Table 2: The "Num videos" column represents the number of video clips (images) in the dataset, the "Num pairs" column represents the total number of video-caption (image-caption) pairs, the "Num unique captions" column represents the number of unique captions in the dataset

3.4 Loss

MDMMT-2 is trained with the bi-directional max-margin ranking loss [Karpathy2014DeepFE]:

L = (1/B) * sum_{i=1}^{B} sum_{j != i} [ max(0, s_{ij} - s_{ii} + m) + max(0, s_{ji} - s_{ii} + m) ],

where B denotes the batch size, s_{ij} the similarity score between the i-th query and the j-th video of the given batch, and m a predefined margin. We set B and m to fixed values in all our experiments.
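The bi-directional max-margin ranking loss can be sketched directly; `sim` is the batch similarity matrix with matched pairs on the diagonal, and the margin value below is illustrative, not the paper's setting.

```python
import numpy as np

def bidirectional_max_margin(sim, margin=0.2):
    # sim[i, j]: similarity of the i-th query and the j-th video in the
    # batch; the diagonal s_ii holds the matched (positive) pairs.
    B = sim.shape[0]
    pos = np.diag(sim)
    # text->video direction: negatives are other videos for query i
    t2v = np.maximum(0.0, margin + sim - pos[:, None])
    # video->text direction: negatives are other queries for video j
    v2t = np.maximum(0.0, margin + sim - pos[None, :])
    off_diag = 1.0 - np.eye(B)  # exclude the positive pair itself
    return ((t2v + v2t) * off_diag).sum() / B
```

When every positive pair scores at least `margin` above all its negatives, the loss is exactly zero.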

4 Experiments

In sections 4.1 - 4.3 all experiments are run on the MSR-VTT full clean split (see Sec. 3.3) for 50 epochs with 60k examples per epoch. The initial learning rate is 5e-5 and is multiplied by a constant decay factor after each epoch. In these experiments we freeze the text backbone and train only the aggregator and the text embedding projection part.

For training on MSR-VTT, we use an aggregator with 4 layers and 4 heads. On the larger dataset (see Secs. 4.4 - 4.6) the aggregator has 9 layers and 8 heads.

Results are reported as the mean over 3 runs.

4.1 CLIP

In [mdmmt] it is shown that CLIP works as a strong visual feature extractor and outperforms other available models by a large margin. We found that the CLIP text backbone also works better than other available text models, such as BERT [bert], which was originally used in [gabeur2020multimodal], or GPT [gpt].

Currently there are several publicly available CLIP models. In this section we compare their performance to make sure that we use the best possible combination. Results are presented in Tab. 3.

Our observations:

  • Suppose we have a pre-trained CLIP: a text backbone and the corresponding visual backbone. We observe that replacing the original visual backbone with a bigger/deeper one yields a better video retrieval system.

  • If we use the same visual backbone with different text backbones, the text backbone of a bigger/deeper model does not necessarily show better results.
    In fact, looking at the RN50(xN) models in Tab. 3, the best result is achieved by the combination of the deepest visual backbone (RN50x64) and the text backbone from the most shallow model (RN50).

  • CLIP ViT-L/14 shows the best performance both as a visual and as a text backbone.










Visual\Text RN50  RN50x4 RN50x16 RN50x64 ViT-B/32 ViT-B/16 ViT-L/14
RN50        40.1  38.7   39.3    39.3    40.1     39.8     39.8
RN50x4      42.8  41.9   42.5    42.5    43.2     43.1     43.2
RN50x16     43.9  43.5   43.6    43.0    44.4     44.5     44.4
RN50x64     44.6  43.9   44.1    44.2    44.8     45.2     45.4
ViT-B/32    42.0  41.2   40.9    40.9    42.5     42.4     42.2
ViT-B/16    44.4  43.8   43.4    43.3    44.8     45.4     44.9
ViT-L/14    46.2  45.7   45.3    45.3    46.5     46.8     47.2
Table 3: Comparison of CLIP visual (rows) and text (columns) backbone combinations. Experts: CLIP; Metric: R@5
Audio expert                     Text→video
                                 R@1   R@5   R@10  MdR
vggish [hershey2017cnn]          19.3  44.3  56.3  7.0
Slow-Fast [kazakos2021slowfast]  19.6  44.9  57.0  7.0
Table 4: Experiments on different audio experts. Text backbone: CLIP ViT-B/32; Experts: CLIP ViT-B/32, irCSN152-IG65M, audio

4.2 Experts combination

Experts Text Video
CLIP irCSN152-IG65M SF R@1 R@5 MdR
10.2 29.3 17.3
11.2 31.5 15.0
21.3 46.5 7.0
21.5 46.7 7.0
22.0 47.8 6.0
22.2 48.5 6.0
Table 5: Experts combinations. Text backbone: CLIP ViT-B/32

Using a combination of different experts achieves better performance. In Tab. 5 various combinations of experts are presented. Using all three modalities gives the best result.

4.3 Dealing with non-square videos

Train \ Test  Squeeze  Center crop  Padding  Mean
Squeeze 46.3 46.0 46.0 47.1
Center Crop 46.0 46.5 46.0 47.3
Padding 46.0 46.2 46.7 47.0
Mean 45.9 46.4 45.9 47.4
Table 6: Comparison of different techniques for extracting features from non-square videos. Text backbone: CLIP ViT-B/32; Experts: CLIP ViT-L/14; Metric: R@5

Both irCSN152-IG65M and CLIP take videos (images) of square shape as input. Therefore it is not possible to use information from the whole video directly. Some object or action may take place in a corner of the frame, outside the center crop; if we use a center crop to compute embeddings, the information from the corners is lost. There are several possible solutions to this problem:

  • Squeeze a video to a square without preserving the aspect ratio (squeeze)

  • Pad a video to a square with blackbars (padding)

  • Take several crops from the video, average the embeddings of these crops, and use this average as embedding (mean)

For the mean technique we take three crops: left or bottom, center, right or top (depending on video orientation) and then average embeddings of these crops.
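The mean technique described above can be sketched as follows. The `encode` function is a placeholder for any square-input expert such as CLIP; the toy mean-pool encoder in the usage example is purely illustrative.

```python
import numpy as np

def three_crop_mean_embedding(frame, encode):
    # frame: (H, W, 3) array. Take three square crops along the longer
    # side, encode each, and average the embeddings ("mean" technique).
    h, w = frame.shape[:2]
    s = min(h, w)
    long_side = max(h, w)
    crops = []
    for start in (0, (long_side - s) // 2, long_side - s):
        if w >= h:   # landscape: left, center, right crops
            crops.append(frame[:, start:start + s])
        else:        # portrait: top, center, bottom crops
            crops.append(frame[start:start + s, :])
    embs = np.stack([encode(c) for c in crops])
    return embs.mean(axis=0)

# toy encoder: mean-pools the crop into a 3-d "embedding"
encode = lambda crop: crop.mean(axis=(0, 1))
frame = np.ones((100, 200, 3))
emb = three_crop_mean_embedding(frame, encode)
```

For a square video all three crops coincide and the result equals the single-crop embedding.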

Experiments in Tab. 6 show that squeeze works worse than center crop, padding works slightly better than center crop, and mean works the best.

We want to emphasize that using mean at test time improves video-retrieval performance even if another method was used during training.

4.4 Adding images

Dataset Weight Type
MSR-VTT 140 Text-video datasets (10V)
ActivityNet 100
Twitter Vines 60
YouCook2 20
TGIF 102
SomethingV2 169
TVQA 150
COCO 280 Text-image datasets (3I)
Flickr30k 200
Conceptual Captions 160
Table 7: Datasets used in the training procedure. The "Weight" column describes how often we sample examples from the dataset: the probability of obtaining an example from a dataset with weight w equals w divided by the sum of all weights
Dataset Text Video
R@1 R@5 R@10 MdR
10V 30.2 56.6 67.1 4.0
10V+3I 30.9 57.4 67.8 4.0
Table 8: Test results on MSR-VTT full clean split. Text backbone: CLIP ViT-B/32; Experts: CLIP ViT-L/14, irCSN152-IG65M, SF

In [mdmmt] it is shown that a proper combination of datasets allows training a single model that captures the knowledge of all the used datasets, and in most cases the model trained on the combination of datasets is better than a model trained on a single dataset.

In Tab. 8 we show that a proper combination of text-video and text-image datasets improves video-retrieval performance. Hyperparameters are specified in Sec. 4.5, stage S2.

The weights for combining all datasets are specified in Tab. 7. The first rows are video datasets (denoted as 10V) and the last 3 rows are image datasets (denoted as 3I).
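The weighted mixing can be sketched as drawing a dataset name with probability proportional to its weight, then drawing an example from that dataset. The weight values below are copied from Tab. 7; datasets whose weights are not listed there are omitted.

```python
import random

# Sampling weights from Tab. 7
weights = {
    "MSR-VTT": 140, "ActivityNet": 100, "Twitter Vines": 60,
    "YouCook2": 20, "TGIF": 102, "SomethingV2": 169, "TVQA": 150,
    "COCO": 280, "Flickr30k": 200, "Conceptual Captions": 160,
}

def sample_dataset(weights, rng):
    # Pick a dataset name with probability weight / sum(all weights).
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

rng = random.Random(0)
samples = [sample_dataset(weights, rng) for _ in range(5000)]
```

Over many draws, each dataset's share of the training stream converges to its normalized weight.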

4.5 Pre-training and fine-tuning

Note that in our work the aggregator is initialised from scratch, while the text backbone is pre-trained. If we simultaneously train the randomly initialised aggregator and the pre-trained text backbone, then by the time the aggregator is trained the text backbone may have degraded. That is why for the final result we introduce a training procedure that consists of three stages (denoted as S1, S2 and S3).

During stage S1 we use the noisy HT100M dataset. The text backbone is frozen; only the aggregator and the text embedding projection part are trained.

During stage S2 we use the crowd-labeled datasets 10V+3I. As in S1, the text backbone is frozen; only the aggregator and the text embedding projection part are trained.

During stage S3, as in S2, we use the crowd-labeled datasets 10V+3I. Now, however, we unfreeze the text backbone and train all three main components: aggregator, text backbone and text embedding projection.
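The freeze schedule of the three stages can be expressed schematically. `Component` below is a toy stand-in for a real trainable module, not the paper's training code.

```python
class Component:
    # Toy stand-in for a trainable module with a freeze flag.
    def __init__(self, name):
        self.name = name
        self.trainable = True

def configure_stage(stage, modules):
    # Stages S1 and S2 keep the text backbone frozen; stage S3
    # unfreezes it and trains all three components.
    for m in modules.values():
        m.trainable = True
    if stage in ("S1", "S2"):
        modules["text_backbone"].trainable = False
    return sorted(m.name for m in modules.values() if m.trainable)

modules = {name: Component(name)
           for name in ("aggregator", "text_backbone", "text_projection")}
stage1 = configure_stage("S1", modules)
stage3 = configure_stage("S3", modules)
```

In a real framework the same schedule would toggle gradient computation for the corresponding parameter groups.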

Hyperparameters for these stages are listed in Tab. 9. Results for different combinations of stages are listed in Tab. 10.

Train  Examples   Num.    Learning  LR decay   Datasets
stage  per epoch  epochs  rate      per epoch
S1     60k        200     5e-5      0.98       HT100M
S2     380k       45      5e-5      0.95       10V+3I
S3     200k       20      2e-5      0.8        10V+3I
Table 9: Hyperparameters for the three training stages
Train stages  Text→video
S1 S2 S3      R@1  R@5  MdR
7.7 19.0 60.0
29.0 55.3 4.0
30.5 56.9 4.0
31.2 57.8 4.0
32.5 59.4 3.0
Table 10: Test results for train stages on MSR-VTT full clean split. Text backbone: CLIP ViT-B/32; Experts: CLIP ViT-L/14, irCSN152-IG65M, SF

4.6 Final result

In this section we compare our solution with the prior art. Our best solution uses three modalities: CLIP ViT-L/14 (RGB modality), irCSN152-IG65M (motion modality) and Slow-Fast trained on VGG-Sound (audio modality). The text backbone is taken from CLIP ViT-L/14. To fuse the modalities we use an aggregator with 9 layers and 8 heads. The training procedure is described in Sec. 4.5. Results are shown in Tab. 11 - Tab. 16.

Center crop is used for visual feature extraction during training and testing for all datasets except MSR-VTT (see Tab. 12), where we report two results on the test set: center crop and the mean method (see Sec. 4.3).

Results on MSR-VTT, LSMDC, MSVD, YouCook2 and TGIF are obtained using a single model. Our model outperforms the SOTA by 1.6, 0.6, 3.9, 4.3 and 1.1% R@5 respectively. On MSR-VTT-1k-A (see Tab. 11) we report two results with different training splits: full (7k) and 1k-A (9k). The first result approaches the SOTA and the second outperforms it by 0.8% R@5.

Model MSR-VTT-1k-A text video
R@1 R@5 R@10 MnR MdR
JSFusion [yu2018joint] 10.2 31.2 43.2 13.0
E2E [miech2020endtoend] 9.9 24.0 32.4 29.5
HT [miech19howto100m] 14.9 40.2 52.8 9.0
CE [liu2020use] 20.9 48.8 62.4 28.2 6.0
CLIP [radford2learning] 22.5 44.3 53.7 61.7 8.0
MMT  [gabeur2020multimodal] 26.6 57.1 69.6 24.0 4.0
AVLnet[rouditchenko2020avlnet] 27.1 55.6 66.6 4.0
SSB [patrick2021supportset] 30.1 58.5 69.3 3.0
CLIP agg [portilloquintero2021straightforward] 31.2 53.7 64.2 4.0
MDMMT [mdmmt] 38.9 69.0 79.7 16.5 2.0
CLIP4Clip [CLIP4Clip] 44.5 71.4 81.6 15.3 2.0
CLIP2Video [clip2video] 45.6 72.6 81.7 14.6 2.0
LAFF [laff] 45.8 71.5 82.0
CAMoE [camoe] 44.6 72.6 81.8 13.3 2.0
MDMMT-2 full (Ours) 46.5 74.3 83.3 14.1 2.0
QB-Norm+CLIP2Video [qbnorm] 47.2 73.0 83.0 2.0
CLIP2TV [gao2021clip2tv] 48.3 74.6 82.8 14.9 2.0
MDMMT-2 1k-A (Ours) 48.5 75.4 83.9 13.8 2.0
Table 11: Test results on the MSR-VTT-1k-A dataset. Results obtained using the original testing protocol (without dual softmax [camoe, gao2021clip2tv] at inference) are shown. Results are collected from the corresponding articles
Model Split MSR-VTT text video
R@1 R@5 R@10 MnR MdR
VSE [mithun2018learning] full 5.0 16.4 24.6 47.0
VSE++ [mithun2018learning] 5.7 17.1 24.8 65.0
Multi Cues [mithun2018learning] 7.0 20.9 29.7 38.0
W2VV [Dong_2018] 6.1 18.7 27.5 45.0
Dual Enc. [dong2019dual] 7.7 22.0 31.8 32.0
CE [liu2020use] 10.0 29.0 41.2 86.8 16.0
MMT  [gabeur2020multimodal] 10.7 31.1 43.4 88.2 15.0
CLIP [radford2learning] 15.1 31.8 40.4 184.2 21.0
CLIP agg [portilloquintero2021straightforward] 21.5 41.1 50.4 4.0
MDMMT [mdmmt] 23.1 49.8 61.8 52.8 6.0
TACo [yang2021taco] 24.8 52.1 64.0 5.0
LAFF [laff] 29.1 54.9 65.8
CLIP2Video [clip2video] 29.8 55.5 66.2 45.4 4.0
CAMoE [camoe] 32.9 58.3 68.4 42.6 3.0
CLIP2TV [gao2021clip2tv] 33.1 58.9 68.9 44.7 3.0
MDMMT-2 (Ours) 33.4 60.1 70.5 39.2 3.0
MDMMT-2 test mean (Ours) 33.7 60.5 70.8 37.8 3.0
MMT [gabeur2020multimodal] full clean 10.4 30.2 42.3 89.4 16.0
MDMMT [mdmmt] 22.8 49.5 61.5 53.8 6.0
MDMMT-2 (Ours) 33.3 59.8 70.2 38.7 3.0
Table 12: Test results on the MSR-VTT dataset. Results are collected from the corresponding articles
Model LSMDC text video
R@1 R@5 R@10 MnR MdR
CT-SAN [yu2017endtoend] 5.1 16.3 25.2 46.0
JSFusion [yu2018joint] 9.1 21.2 34.1 36.0
MEE [miech2020learning] 9.3 25.1 33.4 27.0
MEE-COCO [miech2020learning] 10.1 25.6 34.6 27.0
CE [liu2020use] 11.2 26.9 34.8 96.8 25.3
CLIP agg [portilloquintero2021straightforward] 11.3 22.7 29.2 56.5
CLIP [radford2learning] 12.4 23.7 31.0 142.5 45.0
MMT  [gabeur2020multimodal] 12.9 29.9 40.1 75.0 19.3
MDMMT [mdmmt] 18.8 38.5 47.9 58.0 12.3
CLIP4Clip [CLIP4Clip] 21.6 41.8 49.8 58.0
QB-Norm+CLIP4Clip [qbnorm] 22.4 40.1 49.5 11.0
CAMoE [camoe] 25.9 46.1 53.7 54.4
MDMMT-2 (Ours) 26.9 46.7 55.9 48.0 6.7
Table 13: Test results on the LSMDC dataset. Results are collected from the corresponding articles
Model MSVD text video
R@1 R@5 R@10 MnR MdR
LAFF [laff] 45.4 76.0 84.6
CLIP4Clip [CLIP4Clip] 46.2 76.1 84.6 10.0 2.0
CLIP2Video [clip2video] 47.0 76.8 85.9 9.6 2.0
QB-Norm+CLIP2Video [qbnorm] 48.0 77.9 86.2 2.0
CAMoE [camoe] 49.8 79.2 87.0 9.4
MDMMT-2 (Ours) 56.8 83.1 89.2 8.8 1.0
Table 14: Test results on the MSVD dataset. Results are collected from the corresponding articles
Model YouCook2 text video
R@1 R@5 R@10 MnR MdR
Text-Video Embedding [miech19howto100m] 8.2 24.5 35.3 24.0
COOT [coot] 16.7 52.3
UniVL [univl] 28.9 57.6 70.0 4.0
TACo [yang2021taco] 29.6 59.7 72.7 4.0
MDMMT-2 (Ours) 32.0 64.0 74.8 12.7 3.0
Table 15: Test results on the YouCook2 dataset. Results are collected from the corresponding articles
Model TGIF text video
R@1 R@5 R@10 MnR MdR
W2VV++ [w2vv++] 9.4 22.3 29.8
SEA [sea] 11.1 25.2 32.8
LAFF [laff] 24.5 45.0 54.5
MDMMT-2 (Ours) 25.5 46.1 55.7 94.1 7.0
Table 16: Test results on the TGIF dataset. Results are collected from the corresponding articles

5 Conclusions

We performed a refined study of each conceptual part of the transformer application to the text-to-video retrieval task. The analysis of prior knowledge allows choosing the best existing backbone experts. Combining different types of data sources significantly increases the overall amount of training data. We also suggest a multi-stage training procedure without expert fine-tuning, which prevents overfitting to a particular domain. Using the expanded data and the best experts greatly increases generalization ability: we obtain a model that simultaneously performs well in multiple domains and benefits from increasing domain diversity. This demonstrates that SOTA results in different domains can be obtained by the same model, instead of preparing a domain-specific model for each. In particular, we obtained new SOTA results on MSR-VTT, LSMDC, MSVD, YouCook2 and TGIF with a single model trained only once.