Multi-Modal Transformer for Video Retrieval
The task of retrieving video content relevant to natural language queries plays a critical role in effectively handling internet-scale datasets. Most of the existing methods for this caption-to-video retrieval problem do not fully exploit cross-modal cues present in video. Furthermore, they aggregate per-frame visual features with limited or no temporal information. In this paper, we present a multi-modal transformer to jointly encode the different modalities in video, which allows each of them to attend to the others. The transformer architecture is also leveraged to encode and model the temporal information. On the natural language side, we investigate the best practices to jointly optimize the language embedding together with the multi-modal transformer. This novel framework allows us to establish state-of-the-art results for video retrieval on three datasets. More details are available at http://thoth.inrialpes.fr/research/MMT.
Video is one of the most popular forms of media due to its ability to capture dynamic events and its natural appeal to our visual and auditory senses. Online video platforms are playing a major role in promoting this form of media. However, the billions of hours of video available on such platforms are unusable if we cannot access them effectively, for example, by retrieving relevant content through queries.
In this paper, we tackle the tasks of caption-to-video and video-to-caption retrieval. In the first task of caption-to-video retrieval, we are given a query in the form of a caption (e.g., “How to build a house”) and the goal is to retrieve the videos best described by it (i.e., videos explaining how to build a house). In practice, given a test set of caption-video pairs, our aim is to provide, for each caption query, a ranking of all the video candidates such that the video associated with the caption query is ranked as high as possible. On the other hand, the task of video-to-caption retrieval focuses on finding among a collection of caption candidates the ones that best describe the query video.
A common approach for the retrieval problem is similarity learning, where we learn a function of two elements (a query and a candidate) that best describes their similarity. All the candidates can then be ranked according to their similarity with the query. In order to perform this ranking, the captions as well as the videos are represented in a common multi-dimensional embedding space, wherein similarities can be computed as a dot product of their corresponding representations. The critical question here is how to learn accurate representations of both caption and video to base our similarity estimation on.
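As a minimal sketch of this ranking step, assuming captions and videos have already been embedded in the common space, candidates can be sorted by their dot product with the query. The function names and toy vectors below are illustrative only and do not come from the paper.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_candidates(query, candidates):
    """Return candidate indices sorted by decreasing similarity with the query."""
    scores = [dot(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


# Toy 3-dimensional embeddings (hypothetical values).
caption = [1.0, 0.0, 0.5]
videos = [[0.1, 0.9, 0.0], [0.9, 0.1, 0.4], [0.5, 0.5, 0.5]]
ranking = rank_candidates(caption, videos)
print(ranking)
```

The same function covers video-to-caption retrieval by passing a video embedding as the query and caption embeddings as the candidates.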
The problem of learning representations of text has been extensively studied, leading to various methods [34, 18, 25, 7, 3], which can be used to encode captions. In contrast to these advances, learning effective video representations continues to be a challenge, and forms the focus of our work. This is in part due to the multimodal and temporal nature of video. Video data not only varies in terms of appearance, but also in possible motion, audio, overlaid text, speech, etc. Leveraging cross-modal relations thus forms a key to building effective video representations. As illustrated in Fig. 1, cues jointly extracted from all the constituent modalities are more informative than handling each modality independently. Hearing a motor sound right after seeing someone starting a bike tells us that the running bike is the visible one and not a background one. Another example is a video of “a crowd listening to a talk”: neither the “appearance” nor the “audio” modality can fully describe the scene on its own, but when they are processed together, higher-level semantics can be obtained.
Recent work on video retrieval does not fully exploit such cross-modal high-level semantics. Existing methods either ignore the multi-modal signal, treat modalities separately, or only use a gating mechanism to modulate certain modality dimensions. Another challenge in representing video is its temporality. Due to the difficulty in handling the variable duration of videos, current approaches [16, 14] discard long-term temporal information by aggregating descriptors extracted at different moments in the video. We argue that this temporal information can be important to the task of video retrieval. As shown in Fig. 1, a video of “someone walking to an object” and one of “someone walking away from an object” will have the same representation once pooled temporally; however, the movement of the person relative to the object is potentially important in the query.
We address the temporal and multi-modal challenges posed in video data by introducing our multi-modal transformer. It performs the task of processing features extracted from different modalities at different moments in video and aggregates them in a compact representation. Building on the transformer architecture, our multi-modal transformer exploits the self-attention mechanism to gather valuable cross-modal and temporal cues about events occurring in a video. We integrate our multi-modal transformer in a cross-modal framework, as illustrated in Fig. 2, which leverages both captions and videos, and estimates their similarity.
In this work, we make the following three contributions: (i) We introduce a novel video encoder architecture for retrieval: our multi-modal transformer effectively processes multiple modality features extracted at different times. (ii) We thoroughly investigate different architectures for language embedding, and show the superiority of the BERT model for the task of video retrieval. (iii) By leveraging our novel cross-modal framework, we outperform the prior state of the art for the task of video retrieval on the MSRVTT, ActivityNet and LSMDC datasets. It is also the winning solution in the CVPR 2020 Video Pentathlon Challenge.
We present previous work on language and video representation learning, as well as on visual-language retrieval.
Language representations. Earlier work on language representations includes bag of words and Word2Vec. A limitation of these representations is that they do not capture the sequential properties of a sentence. LSTM was one of the first successful deep learning models to handle this. More recently, the transformer architecture has shown impressive results for text representation by implementing a self-attention mechanism where each word (or wordpiece) of the sentence can attend to all the others. The transformer architecture, consisting of self-attention layers alternately stacked with fully-connected layers, forms the base of the popular language modeling network BERT. Burns et al. perform an analysis of the different word embeddings and language models (Word2Vec, LSTM, BERT, etc.) used in vision-language tasks. They show that the pretrained and frozen BERT model performs relatively poorly compared to an LSTM or even a simpler average embedding model. In this work, we show that for video retrieval, a pretrained BERT outperforms other language models, but it needs to be finetuned.
Video representations. With a two-stream network, Simonyan et al.  have used complementary information from still frames and motion between frames to perform action recognition in videos. Carreira et al.  incorporated 3D convolutions in a two-stream network to better attend the temporal structure of the signal. S3D  is an alternative approach, which replaced the expensive 3D spatio-temporal convolutions by separable 2D and 1D convolutions. More recently, transformer-based methods, which leverage BERT pretraining , have been applied to S3D features in VideoBERT  and CBT . While these works focus on visual signals, they have not studied how to encode the other multi-modal semantics, such as audio signals.
Visual-language retrieval. Harwath et al.  perform image and audio-caption retrieval by embedding audio segments and image regions in the same space and requiring high similarity between each audio segment and its corresponding image region. The method presented in  takes a similar approach for image-text retrieval by embedding image regions and words in a joint space. A high similarity is obtained for images that have matching words and image regions.
For videos, JSFusion  estimates video-caption similarity through dense pairwise comparisons between each word of the caption and each frame of the video. In this work, we instead estimate both a video embedding and a caption embedding and then compute the similarity between them. Zhang et al.  perform paragraph-to-video retrieval by assuming a hierarchical decomposition of the video and paragraph. Our method does not assume that the video can be decomposed into clips that align with sentences of the caption. A recent alternative is creating separate embedding spaces for different parts of speech (e.g., noun or verb) . In contrast to this method, we do not pre-process the sentences but encode them directly through BERT.
Another work  leverages the large number of instructional videos in the HowTo100M dataset, but does not fully exploit the temporal relations. Our work instead relies on longer segments extracted from HowTo100M videos in order to learn temporal dependencies and address the problem of misalignment between speech and visual features. Mithun et al. [19, 20] use three experts (Object, Activity and Place) to compute three corresponding text-video similarities. These experts however do not collaborate together as their respective similarities are simply summed together. A related approach  uses precomputed features from experts for text to video retrieval, where the overall similarity is obtained as a weighted sum of each expert’s similarity. A recent extension  to this mixture of experts model uses a collaborative gating mechanism for modulating each expert feature according to the other experts. However, this collaborative gating mechanism only strengthens (or weakens) some dimensions of the input signal in a single step, and is therefore not able to capture high level inter-modality information. Our multi-modal transformer overcomes this limitation by attending to all available modalities over multiple self-attention layers.
Our overall method relies on learning a function to compute the similarity between two elements: text and video, as shown in Fig. 2. We then rank all the videos (or captions) in the dataset, according to their similarity with the query caption (or video) in the case of text-to-video (or video-to-text) retrieval. In other words, given a dataset of $n$ video-caption pairs $\{(v_1, c_1), \dots, (v_n, c_n)\}$, the goal of the learnt similarity function $s(v_i, c_j)$, between video $v_i$ and caption $c_j$, is to provide a high value if $i = j$, and a low one if $i \neq j$. Estimating this similarity (described in Section 3.3) requires accurate representations for the video as well as the caption. Fig. 2 shows the two parts focused on producing these representations (presented in Sections 3.1 and 3.2 respectively) in our cross-modal framework.
The video-level representation is computed by our proposed multi-modal transformer (MMT). MMT follows the architecture of the transformer encoder presented in . It consists of stacked self-attention layers and fully-connected layers. MMT’s input $\Omega(v)$ is a set of embeddings, all of the same dimension $d_{model}$. Each of them embeds the semantics of a feature, its modality, and the time in the video when the feature was extracted. This input is given by:
$$\Omega(v) = F(v) + E(v) + T(v). \tag{1}$$
In the following, we describe those three components.
Features $F$. In order to learn an effective representation from different modalities inherent in video data, we begin with video feature extractors called “experts” [19, 31, 16, 14]. In contrast to previous methods, we learn a joint representation leveraging both cross-modal and long-term temporal relationships among the experts. We use $N$ pretrained experts $\{F^n\}_{n=1}^{N}$. Each expert is a model trained for a particular task that is then used to extract features from video. For a video $v$, each expert extracts a sequence of $K$ features $F^n(v) = [F^n_1, \dots, F^n_K]$.
The features extracted by our experts encode the semantics of the video. Each expert $F^n$ outputs features in $\mathbb{R}^{d_n}$. In order to project the different expert features into a common dimension $d_{model}$, we learn $N$ linear layers (one per expert) to project all the features into $\mathbb{R}^{d_{model}}$.
A transformer encoder produces an embedding for each of its feature inputs, resulting in several embeddings for an expert. In order to obtain a unique embedding for each expert, we define an aggregated embedding $F^n_{agg}$ that will collect and contextualize the expert’s information. We initialize this embedding with a max pooling aggregation of all the corresponding expert’s features as $F^n_{agg} = \mathrm{maxpool}(\{F^n_k\}_{k=1}^{K})$. The sequence of input features to our video encoder then takes the form:
$$F(v) = [F^1_{agg}, F^1_1, \dots, F^1_K, \dots, F^N_{agg}, F^N_1, \dots, F^N_K]. \tag{2}$$
Expert embeddings $E$. In order to process cross-modality information, our MMT needs to identify which expert it is attending to. We learn $N$ embeddings $\{E^1, \dots, E^N\}$ of dimension $d_{model}$ to distinguish between embeddings of different experts. Thus, the sequence of expert embeddings to our video encoder takes the form:
$$E(v) = [E^1, E^1, \dots, E^1, \dots, E^N, E^N, \dots, E^N]. \tag{3}$$
Temporal embeddings $T$. They provide our multi-modal transformer with temporal information about the time in the video where each feature was extracted. Considering videos of a maximum duration of $t_{max}$ seconds, we learn $D$ embeddings $\{T_1, \dots, T_D\}$ of dimension $d_{model}$. Each expert feature that has been extracted in the time range $[t, t+1)$ will be temporally embedded with $T_{t+1}$. For example, a feature extracted at 7.4s in the video will be temporally encoded with temporal embedding $T_8$. We learn two additional temporal embeddings $T_{agg}$ and $T_{unk}$, which encode aggregated features and unknown temporal information features (for experts whose temporal information is unknown), respectively. The sequence of temporal embeddings of our video encoder then takes the form:
$$T(v) = [T_{agg}, T_1, \dots, T_D, \dots, T_{agg}, T_1, \dots, T_D]. \tag{4}$$
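Multi-modal Transformer. The video embeddings $\Omega(v)$, defined as the sum of features, expert and temporal embeddings in (1), as shown in Fig. 3, are input to the transformer. They are given by:
$$\Omega(v) = F(v) + E(v) + T(v) = [\omega^1_{agg}, \omega^1_1, \dots, \omega^1_K, \dots, \omega^N_{agg}, \omega^N_1, \dots, \omega^N_K].$$
MMT contextualises its input $\Omega(v)$ and produces the video representation $\Psi_{agg}(v)$. As illustrated in Fig. 2, we only keep the aggregated embedding per expert. Thus, our video representation consists of the output embeddings corresponding to the aggregated features, i.e.,
$$\Psi_{agg}(v) = \mathrm{MMT}(\Omega(v)) = [\psi^1_{agg}, \dots, \psi^N_{agg}]. \tag{5}$$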
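The assembly of the transformer input can be sketched as follows, under stated assumptions: features are already projected to the common dimension, embeddings are plain Python lists rather than learnt tensors, and the temporal table is 0-indexed (so the entry at index 7 plays the role of the embedding for the range 7s to 8s). `build_inputs` and its argument names are hypothetical, not the authors' code.

```python
import math


def max_pool(features):
    """Element-wise max over a list of equal-length vectors."""
    return [max(column) for column in zip(*features)]


def add(*vectors):
    """Element-wise sum of equal-length vectors."""
    return [sum(values) for values in zip(*vectors)]


def build_inputs(expert_feats, expert_times, E, T, T_agg, T_unk):
    """Build the input sequence: for each expert, one aggregated token then its features.

    expert_feats[n]: list of projected features for expert n
    expert_times[n]: extraction times in seconds, or None when unknown
    E[n]: expert embedding; T[d]: temporal embedding for the range [d, d+1)
    """
    omega = []
    for n, feats in enumerate(expert_feats):
        # Aggregated token: max-pooled features + expert embedding + "agg" temporal embedding.
        omega.append(add(max_pool(feats), E[n], T_agg))
        times = expert_times[n]
        for k, feat in enumerate(feats):
            t_emb = T_unk if times is None else T[math.floor(times[k])]
            omega.append(add(feat, E[n], t_emb))
    return omega


# Toy example with two experts and 2-dimensional embeddings (hypothetical values).
T = [[d, 0] for d in range(10)]
omega = build_inputs(
    expert_feats=[[[1, 5], [3, 2]], [[0, 0]]],
    expert_times=[[0.2, 7.4], None],
    E=[[10, 10], [20, 20]],
    T=T, T_agg=[100, 100], T_unk=[-1, -1],
)
print(len(omega))
```

A transformer encoder would then consume this sequence, and only the outputs at the aggregated positions would be kept as the video representation.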
The advantage of our MMT over the state-of-the-art collaborative gating mechanism  is two-fold: First, the input embeddings are not simply modulated in a single step but iteratively refined through several layers featuring multiple attention heads. Second, we do not limit our video encoder with a temporally aggregated feature for each expert, but provide all the extracted features instead, along with a temporal encoding describing at what moment of the video they were extracted from. Thanks to its self-attention modules, each layer of our multi-modal transformer is able to attend to all its input embeddings, thus extracting semantics of events occurring in the video over several modalities.
We compute our caption representation $\Phi(c)$ in two stages: first, we obtain an embedding $h(c)$ of the caption, and then project it with a function $g$ into $N$ different spaces as $\Phi = g \circ h$. For the embedding function $h$, we use a pretrained BERT model. Specifically, we extract our single caption embedding $h(c)$ from the [CLS] output of BERT. In order to match the size of this caption representation with that of video, we learn for function $g$ as many gated embedding modules as there are video experts. Our caption representation then consists of $N$ embeddings, represented by $\Phi(c) = \{\phi^i\}_{i=1}^{N}$.
We compute our final video-caption similarity $s$ as a weighted sum of each expert $i$’s video-caption similarity $\langle \phi^i, \psi^i_{agg} \rangle$. It is given by:
$$s(v, c) = \sum_{i=1}^{N} w_i(c) \, \langle \phi^i, \psi^i_{agg} \rangle, \tag{6}$$
where $w_i(c)$ represents the weight for the $i$th expert. To obtain these mixture weights, we follow  and process our caption representation $h(c)$ through a linear layer and then perform a softmax operation, i.e.,
$$w_i(c) = \frac{\exp(h(c)^\top a_i)}{\sum_{j=1}^{N} \exp(h(c)^\top a_j)}, \tag{7}$$
where $(a_1, \dots, a_N)$ are the weights of the linear layer. The intuition behind using a weighted sum is that a caption may not describe all the inherent modalities in video uniformly. For example, in the case of a video with a person in a red dress singing opera, the caption “a person in a red dress” provides no information relevant for audio. On the contrary, the caption “someone is singing” should focus on the audio modality for computing similarity. Note that $w_i(c)$, $\phi^i$ and $\psi^i_{agg}$ can all be precomputed offline for each caption and for each video, and therefore the retrieval operation only involves dot product operations.
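The weighted-sum similarity described above can be illustrated with a small sketch. It assumes one caption embedding and one aggregated video embedding per expert, and mixture weights obtained by a softmax over a bias-free linear layer applied to the caption embedding. All names and numbers are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_weights(h_c, A):
    """One weight per expert: softmax over the linear-layer outputs for caption embedding h_c."""
    return softmax([dot(h_c, a) for a in A])


def similarity(phi, psi_agg, weights):
    """Weighted sum over experts of the caption/video dot products."""
    return sum(w * dot(p, q) for w, p, q in zip(weights, phi, psi_agg))


# Toy example with two experts (hypothetical values).
h_c = [1.0, -1.0]                   # caption embedding
A = [[2.0, 0.0], [0.0, 2.0]]        # one linear-layer weight vector per expert
phi = [[1.0, 0.0], [0.0, 1.0]]      # caption embeddings, one per expert
psi = [[2.0, 0.0], [0.0, 4.0]]      # aggregated video embeddings, one per expert
w = mixture_weights(h_c, A)
print(w, similarity(phi, psi, w))
```

Because the weights depend only on the caption, they can be folded into the caption embeddings ahead of time, which is what makes retrieval reduce to dot products.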
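We train our model with the bi-directional max-margin ranking loss:
$$\mathcal{L} = \frac{1}{B} \sum_{i=1}^{B} \sum_{j \neq i} \Big[ \max(0, s_{ij} - s_{ii} + m) + \max(0, s_{ji} - s_{ii} + m) \Big], \tag{8}$$
where $B$ is the batch size, $s_{ij} = s(v_i, c_j)$ is the similarity score between video $v_i$ and caption $c_j$, and $m$ is the margin. This loss enforces the similarity for true video-caption pairs $s_{ii}$ to be higher than the similarity of negative samples $s_{ij}$ or $s_{ji}$, for all $i \neq j$, by at least $m$.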
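A plain-Python sketch of this loss follows, assuming the usual form of the bi-directional max-margin ranking loss: hinge terms over every in-batch negative in both retrieval directions, averaged over the batch. The function name is illustrative; the default margin of 0.05 is the value reported in the implementation details.

```python
def max_margin_ranking_loss(S, margin=0.05):
    """Bi-directional max-margin ranking loss over a batch.

    S[i][j] is the similarity between video i and caption j; matching pairs
    lie on the diagonal. Every off-diagonal entry in row i or column i is a
    negative for pair i.
    """
    batch_size = len(S)
    total = 0.0
    for i in range(batch_size):
        for j in range(batch_size):
            if i == j:
                continue
            # Negative caption j for video i.
            total += max(0.0, S[i][j] - S[i][i] + margin)
            # Negative video j for caption i.
            total += max(0.0, S[j][i] - S[i][i] + margin)
    return total / batch_size


# Well-separated pairs give zero loss; a violated margin gives a positive loss.
print(max_margin_ranking_loss([[1.0, 0.0], [0.0, 1.0]]))
print(max_margin_ranking_loss([[0.5, 0.6], [0.2, 0.5]]))
```

In practice the same computation would be vectorised over the batch similarity matrix; the nested loops are kept here only to mirror the double sum.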
HowTo100M . It is composed of more than 1 million YouTube instructional videos, along with automatically-extracted speech transcriptions, which form the captions. These captions are naturally noisy, and often do not describe the visual content accurately or are temporally misaligned with it. We use this dataset only for pre-training.
MSRVTT . This dataset is composed of 10K YouTube videos, collected using 257 queries from a commercial video search engine. Each video is 10 to 30s long, and is paired with 20 natural sentences describing it, obtained from Amazon Mechanical Turk workers. We use this dataset for training from scratch and also for fine-tuning. We report results on the train/test splits introduced in  that uses 9000 videos for training and 1000 for test. We refer to this split as “1k-A”. We also report results on the train/test split in  that we refer to as “1k-B”. Unless otherwise specified, our MSRVTT results are with “1k-A”.
ActivityNet Captions . It consists of 20K YouTube videos temporally annotated with sentence descriptions. We follow the approach of , where all the descriptions of a video are concatenated to form a paragraph. The training set has 10009 videos. We evaluate our video-paragraph retrieval on the “val1” split (4917 videos). We use ActivityNet for training from scratch and fine-tuning.
LSMDC . It contains 118,081 short video clips ( 4–5s) extracted from 202 movies. Each clip is annotated with a caption, extracted from either the movie script or the audio description. The test set is composed of 1000 videos, from movies not present in the training set. We use LSMDC for training from scratch and also fine-tuning.
Metrics. We evaluate the performance of our model with standard retrieval metrics: recall at rank $k$ (R@$k$, higher is better), median rank (MdR, lower is better) and mean rank (MnR, lower is better). For each metric, we report the mean and the standard deviation over experiments with 3 random seeds. In the main paper, we only report recall@5, median and mean ranks, and refer the reader to the supplementary material for additional metrics.
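These metrics can be computed from a query-by-candidate similarity matrix as sketched below. The sketch assumes that the correct candidate for query `i` is candidate `i` and that ties are resolved in favour of the correct candidate; both are assumptions of this example rather than details stated in the text.

```python
import statistics


def ranks_from_similarities(S):
    """Return the 1-based rank of the correct candidate for each query.

    S[i][j] is the similarity of query i with candidate j.
    """
    ranks = []
    for i, row in enumerate(S):
        # Rank = 1 + number of candidates scored strictly higher than the correct one.
        ranks.append(1 + sum(1 for score in row if score > row[i]))
    return ranks


def retrieval_metrics(ranks, ks=(1, 5, 10)):
    """Recall at rank k (in percent), median rank and mean rank."""
    n = len(ranks)
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / n for k in ks}
    metrics["MdR"] = statistics.median(ranks)
    metrics["MnR"] = statistics.mean(ranks)
    return metrics


# Toy similarity matrix for three queries and three candidates (hypothetical values).
S = [[0.9, 0.1, 0.2],
     [0.8, 0.5, 0.3],
     [0.1, 0.9, 0.2]]
print(retrieval_metrics(ranks_from_similarities(S)))
```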
Pre-trained experts. Recall that our video encoder uses pre-trained expert models for extracting features from each video modality. We use the following seven experts. Motion features are extracted from S3D  trained on the Kinetics action recognition dataset. Audio features are extracted using the VGGish model  trained on YT8M. Scene embeddings are extracted from DenseNet-161  trained for image classification on the Places365 dataset . OCR features are obtained in three stages. Overlaid text is first detected using the pixel link text detection model. The detected boxes are then passed through a text recognition model trained on the Synth90K dataset. Finally, each character sequence is encoded with word2vec  embeddings. Face features are extracted in two stages. An SSD face detector is used to extract bounding boxes, which are then passed through a ResNet50 trained for face classification on the VGGFace2 dataset. Speech transcripts are extracted using the Google Cloud Speech to Text API, with the language set to English. The detected words are then encoded with word2vec. Appearance features are extracted from the final global average pooling layer of SENet-154  trained for classification on ImageNet. For scene, OCR, face, speech and appearance, we use the features publicly released by, and compute the other features ourselves.
For each dataset, we run a grid search on the corresponding validation set to estimate the hyperparameters. We use the Adam optimizer for all our experiments, and set the margin of the bidirectional max-margin ranking loss to 0.05. We also freeze our pre-trained expert models.
When pre-training on HowTo100M, we use a batch size of 64 video-caption pairs, an initial learning rate of 5e-5, which we decay by a multiplicative factor 0.98 every 10K optimisation steps, and train for 2 million steps. Given the long duration of most of the HowTo100M videos, we randomly sample 100 consecutive words in the caption, and keep 100 consecutive seconds of video data, closest in time to the selected words.
When training from scratch or finetuning on MSRVTT or LSMDC, we use a batch size of 32 video-caption pairs, an initial learning rate of 5e-5, which we decay by a multiplicative factor 0.95 every 1K optimisation steps. We train for 50K steps. We use the same settings when training from scratch or finetuning on ActivityNet, except for 0.90 as the multiplicative factor.
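The learning-rate schedules above can be written as a one-line function. The sketch assumes a staircase decay, i.e. the rate is multiplied by the factor once every full interval of optimisation steps; the function and parameter names are illustrative.

```python
def learning_rate(step, initial_lr=5e-5, factor=0.95, interval=1_000):
    """Staircase exponential decay: multiply by `factor` every `interval` steps."""
    return initial_lr * factor ** (step // interval)


# MSRVTT / LSMDC setting: factor 0.95 every 1K steps, 50K steps in total.
print(learning_rate(0), learning_rate(1_000), learning_rate(50_000))
# HowTo100M pre-training setting: factor 0.98 every 10K steps.
print(learning_rate(10_000, factor=0.98, interval=10_000))
# ActivityNet setting: same as MSRVTT but with factor 0.90.
print(learning_rate(1_000, factor=0.90))
```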
To compute our caption representation, we use the “BERT-base-cased” checkpoint of the BERT model and finetune it with a dropout probability of 10%. To compute our video representation, we use MMT with 4 layers and 4 attention heads, a dropout probability of 10%, a hidden size of 512, and an intermediate size of 3072.
For datasets with short videos (MSRVTT and LSMDC), we use all the 7 experts and limit video input to 30 features per expert, and BERT input to the first 30 wordpieces. For datasets containing longer videos (HowTo100M and ActivityNet), we only use motion and audio experts, and limit our video input to 100 features per expert and our BERT input to the first 100 wordpieces. In cases where an expert is unavailable for a given video, e.g., no speech was detected, we set the aggregated feature of that expert to a zero vector. We refer the reader to the supplementary material for a study of the model complexity.
We will first show the advantage of pretraining our model on a large-scale, uncurated dataset. We will then perform ablations on the architecture used for our language and video encoders. Finally, we will present the relative importance of the pretrained experts used in this work, and compare with related methods.
Pretraining. Table 1 shows the advantage of pretraining on HowTo100M, before finetuning on the target dataset (MSRVTT in this case). We also evaluated the impact of pretraining on ActivityNet and LSMDC; see Table 5 and Table 6.
Table 1. Settings compared on MSRVTT: pretraining without finetuning (zero-shot setting), training from scratch on MSRVTT, and pretraining then finetuning on MSRVTT, each with captions using all words or without stop words.
Language encoder. We evaluated several architectures for caption representation, as shown in Table 2. Similar to the observation made in , we obtain poor results from a frozen, pretrained BERT. Using the [CLS] output from a pretrained and frozen BERT model is in fact the worst result. We suppose this is because the output was not trained for caption representation, but for a very different task: next sentence prediction. Finetuning BERT greatly improves performance; it is the best result. We also compare with GrOVLE  embeddings, frozen or finetuned, aggregated with a max-pooling operation or a 1-layer LSTM and a fully-connected layer. We show that pretrained BERT embeddings aggregated by a max-pooling operation perform better than GrOVLE embeddings processed by a LSTM (best results from  for the text-to-clip task).
We also analysed the impact of removing stop words from the captions in Table 1. In a zero-shot setting, i.e., trained on HowTo100M, evaluated on MSRVTT without finetuning, removing the stop words helps generalize, by bridging the domain gap—HowTo100M speech is very different from MSRVTT captions. This approach was adopted in . However, we observe that when finetuning, it is better to keep all the words as they contribute to the semantics of the caption.
Video encoder. We evaluated the influence of different architectures for computing video embeddings on the MSRVTT 1k-A test split.
In Table 3(a), we evaluate variants of our encoder architecture and its input. Similar to , we experiment with directly computing the caption-video similarities on each max-pooled expert features, i.e., no video encoder (NONE in the table). We compare this with the collaborative gating architecture (COLL)  and our MMT variant using only the aggregated features as input. For the first two variants without MMT, we adopt the approach of  to deal with missing modalities by re-weighting . We also show the superior performance of our multi-modal transformer in contextualising the different modality embeddings compared to the collaborative gating approach. We argue that our MMT is able to extract cross-modal information in a multi-stage architecture compared to collaborative gating, which is limited to modulating the input embeddings. Table 3(a) also highlights the advantage of providing MMT with all the extracted features, instead of only aggregated ones. Temporally aggregating each expert’s features ignores information about multiple events occurring in a same video (see the last three rows). As shown by the influence of ordered and randomly shuffled features on the performance, MMT has the capacity to make sense of the relative ordering of events in a video.
Table 3(b) shows the importance of initialising the expert aggregation feature . Since the output of our video encoder is extracted from the “agg” columns, it is important to initialise them with an appropriate representation of the experts’ features. The transformer being a residual network architecture, initializing input embeddings with a zero vector leads to a low performance. Initializing with max pooling aggregation of each expert performs better than mean pooling. Finally, we analyze the impact of the size of our multi-modal transformer model in Table 3(c). A model with 4 layers and 4 attention heads outperforms both a smaller model (2 layers and 2 attention heads) and a larger model (8 layers and 8 attention heads).
Comparison of the different experts. In Figure 4, we show an ablation study when training our model on MSRVTT using only one expert (left), using all experts but one (middle), or gradually adding experts by greedy search (right). In the case of using only one expert, we note that the motion expert provides the best results. We attribute the poor performance of OCR, speech and face to the fact that they are absent from many videos, thus resulting in a zero vector input to our video encoder. While the scene expert shows a decent performance, if used alone, it does not contribute when used along other experts, perhaps due to the semantics it encodes being captured already by other experts like appearance or motion. On the contrary, the audio expert alone does not provide a good performance, but it contributes the most when used in conjunction with the others, most likely due to the complementary cues it provides, compared to the other experts.
Comparison to prior state of the art. We compare our method on three datasets: MSRVTT (Table 4), ActivityNet (Table 5) and LSMDC (Table 6). While MSRVTT and LSMDC contain short video-caption pairs (average video duration of 13s for MSRVTT, one-sentence captions), ActivityNet contains much longer videos (several minutes) and each video is captioned with multiple sentences. We consider the concatenation of all these sentences as the caption. We show that our method obtains state-of-the-art results on all the three datasets. The gains obtained through MMT’s long term temporal encoding are particularly noticeable on the long videos of ActivityNet.
Tables 4, 5 and 6 (MSRVTT, ActivityNet and LSMDC): Text → Video and Video → Text retrieval results.
|Method|Text → Video R@5|MdR|MnR|Video → Text R@5|MdR|MnR|
|CCA  (rep. by )|21.7|33|-|-|-|-|
We introduced multi-modal transformer, a transformer-based architecture capable of attending multiple features extracted at different moments, and from different modalities in video. This leverages both temporal and cross-modal cues, which are crucial for accurate video representation. We incorporate this video encoder along with a caption encoder in a cross-modal framework to perform caption-video matching and obtain state-of-the-art results for video retrieval. As future work, we would like to improve temporal encoding for video and text.
We thank the authors of  for sharing their codebase and features, and Samuel Albanie, in particular, for his help with implementation details. This work was supported in part by the ANR project AVENUE.
Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8) (Nov 1997)
Zhang, Y., Jin, R., Zhou, Z.H.: Understanding bag-of-words model: a statistical framework. International Journal of Machine Learning and Cybernetics 1, 43–52 (2010)
Zhou, B., Lapedriza, À., Khosla, A., Oliva, A., Torralba, A.: Places: A 10 million image database for scene recognition. IEEE Trans. Pattern Analysis and Machine Intelligence 40, 1452–1464 (2018)
As shown below, using multiple modalities does not impact the number of parameters significantly. Interestingly, the majority of the parameters correspond to the BERT caption encoding module. We also note that the difference in the video encoder comes from the projections. The number of parameters of a transformer encoder is independent of the number of input embeddings, just as the number of parameters of a CNN is independent of the image size.
Our cross-modal architecture using 7 modalities has 133.3M parameters, including caption encoder: 112.9M, video encoder: 20.4M (Projections: 3.3M, MMT: 17.1M). Our cross-modal architecture using 2 modalities has 127.3M parameters, including caption encoder: 109.6M (a decrease compared to 7 modalities due to using fewer gated embedding modules), video encoder: 17.7M (Projections: 0.6M, MMT: 17.1M).
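As a quick consistency check on these figures, the component counts can be summed and compared with the reported totals. The dictionary keys and helper names are illustrative; the numbers are copied from the paragraph above, in millions of parameters.

```python
# Parameter counts in millions, copied from the text.
seven_modalities = {"caption_encoder": 112.9, "projections": 3.3, "mmt": 17.1}
two_modalities = {"caption_encoder": 109.6, "projections": 0.6, "mmt": 17.1}


def video_encoder_params(parts):
    """Video encoder size: expert projections plus the multi-modal transformer."""
    return round(parts["projections"] + parts["mmt"], 1)


def total_params(parts):
    """Total size of the cross-modal architecture, rounded like the reported figures."""
    return round(sum(parts.values()), 1)


for name, parts in [("7 modalities", seven_modalities), ("2 modalities", two_modalities)]:
    print(name, video_encoder_params(parts), total_params(parts))
```

The transformer itself contributes the same 17.1M in both settings, so the difference between the two configurations comes from the projections and the gated embedding modules of the caption encoder.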
Training our full cross-modal architecture from scratch on MSRVTT takes about 4 hours on a single V100 16GB GPU.
If we replace our multi-modal transformer by collaborative gating , we reduce the number of parameters from 133.3M to 123.9M. However, the gain in inference time is minimal, from 1.1s to 0.8s, and is negligible compared to feature extraction, as detailed below.
Inference time for 1k videos and 1k text queries from MSRVTT on a single V100 GPU is as follows: approximately 3000s to extract features of 7 experts on 1k videos (480s just for S3D motion features), 1.1s to process videos with MMT, 0.9s to process 1k captions with BERT+gated embedding modules, 0.05s to compute similarities and rank the video candidates for the 1k queries.
Here, we report our results for the additional metrics R@1, R@10, R@50. Table 7 complements the results reported for the MSRVTT  dataset in Table 4 of the main paper. Similarly, Table 8 and Table 9 report the additional evaluations for Table 5 and Table 6 of the main paper on ActivityNet  and LSMDC  datasets respectively. We observe that the results on these additional metrics are in line with the conclusions of the main paper.
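Tables 7, 8 and 9 (MSRVTT, ActivityNet and LSMDC): Text → Video and Video → Text retrieval results with additional metrics.
|Method|Text → Video R@1|R@5|R@10|MdR|MnR|Video → Text R@1|R@5|R@10|MdR|MnR|
|CCA  (rep. by )|7.5|21.7|31.0|33|-|-|-|-|-|-|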