Support-set bottlenecks for video-text representation learning

The dominant paradigm for learning video-text representations – noise contrastive learning – increases the similarity of the representations of pairs of samples that are known to be related, such as text and video from the same sample, and pushes away the representations of all other pairs. We posit that this last behaviour is too strict, enforcing dissimilar representations even for samples that are semantically related – for example, visually similar videos or ones that share the same depicted action. In this paper, we propose a novel method that alleviates this by leveraging a generative model to naturally push these related samples together: each sample's caption must be reconstructed as a weighted combination of other support samples' visual representations. This simple idea ensures that representations are not overly specialized to individual samples, are reusable across the dataset, and explicitly encode semantics shared between samples, unlike noise contrastive learning. Our proposed method outperforms others by a large margin on MSR-VTT, VATEX and ActivityNet, for video-to-text and text-to-video retrieval.


1 Introduction

Noise contrastive learning (Gutmann & Hyvärinen, 2010) is emerging as one of the best approaches to learn data representations both for supervised (Khosla et al., 2020) and unsupervised regimes (Chen et al., 2020c). The idea is to learn a representation that discriminates any two data samples while being invariant to certain data transformations. For example, one might learn a representation that identifies a specific image up to arbitrary rotations (Misra & van der Maaten, 2020). In a multi-modal setting, the transformations can separate different modalities, for example, by extracting the audio and visual signals from a video. The resulting noise contrastive representation associates audio and visual signals that come from the same source video, differentiating others (Patrick et al., 2020).

The noise contrastive approach is motivated by the fact that the transformations that are applied to the data samples leave their ‘meaning’ unchanged. For example, rotating an image does not change the fact that it contains a cat or not (Gidaris et al., 2018). However, in most cases, we expect to find many data samples that share the same content without being necessarily related by simple transformations (e.g. think of any two images of cats). Existing noise contrastive formulations are unaware of these relationships and still try to assign different representations to these samples (Wu et al., 2018), despite the fact that they are semantically equivalent. If the representation is learned for a downstream task such as semantic video retrieval, this might degrade performance.

This suggests that there might be other learning signals that could complement and improve pure contrastive formulations. In this paper, we explore this idea in the case of learning from two modalities: videos and text, in the form of video transcripts or captions. Given a state-of-the-art contrastive formulation that learns from these two modalities, we investigate complementary pretext objectives to improve it. First, we consider the (instance) captioning task, namely mapping a video to the corresponding text, casting this as a conditional stochastic text generation problem. We show that this brings only a modest benefit.

We observe that the captioning task is highly sample-specific, as the goal is to produce a caption which describes a specific video and not any other video, and thus it suffers from the same disadvantages (discouraging concept sharing among samples) as contrastive learning. Thus, we propose to address this issue by switching to a different text generation task. The idea is to modify the text generator to take as input a learnable mixture of a support-set of videos, which we call cross-instance captioning. The mixture weights are generated by comparing the learned video representations to captions’ representations in an online way over the batch. The limited set of support samples acts as a bottleneck that encourages extraction of shared semantics. In this manner, the embeddings can associate videos that share similar captions even if the contrastive loss tries to push them apart.

We show that, when the captioning task is added in this manner, it brings a noticeable improvement to already very strong video representation learning results, further improving our own state-of-the-art baseline by a significant margin.

Figure 1: Cross-modal discrimination and cross-captioning. Our model learns from two complementary losses: (a) Cross-modal contrastive learning learns strong joint video-text embeddings, but every other sample is considered a negative, pushing away even semantically related captions (orange arrows). (b) We introduce a generative task of cross-captioning, which alleviates this by learning to reconstruct a sample’s text representation as a weighted combination of a support-set, composed of video representations from other samples.

2 Related Works

Learning data representations from unlabelled data has been a long-standing goal of machine learning. These approaches are called “self-supervised learning” because the learning signals, termed pretext tasks, are obtained from the data itself. In the image and video domain, pretext tasks include colorization (Zhang et al., 2016), rotation (Gidaris et al., 2018), or clustering (Caron et al., 2018; Asano et al., 2020b, a; Ji et al., 2018), while in the natural language domain, masked language modeling (Devlin et al., 2019) and next word prediction (Mikolov et al., 2013; Pennington et al., 2014) are extremely popular. These pretext tasks can be broadly classified into two classes: generative and discriminative.

Discriminative approaches learn representations by differentiating input samples, using objectives such as the contrastive loss (Hadsell et al., 2006; Gutmann & Hyvärinen, 2010). Discriminative approaches have proven to be particularly successful for image (Wu et al., 2018; He et al., 2020; Misra & van der Maaten, 2020; Chen et al., 2020c) and video (Han et al., 2019; Patrick et al., 2020; Morgado et al., 2020) representation learning. Generative approaches, on the other hand, try to reconstruct their input. GANs (Goodfellow et al., 2014; Donahue & Simonyan, 2019; Radford et al., 2015), autoencoders (Hinton & Salakhutdinov, 2006) and sequence-to-sequence models (Sutskever et al., 2014) are popular generative models. In this work, we show the importance of combining both discriminative and generative objectives to learn effective video-text representations.

The success of representation learning has also been due to advances in model architectures, such as the Transformer (Vaswani et al., 2017). BERT (Devlin et al., 2019) demonstrated that a transformer architecture pretrained on large-scale textual data can learn transferable text representations that can be fine-tuned on a variety of downstream tasks. Subsequent works (Radford et al., 2019; Raffel et al., 2019; Lewis et al., 2020b; Clark et al., 2020; Lewis et al., 2020a) have improved upon the transformer architecture or training objective to learn even better representations. Inspired by the success of transformers in the NLP domain, several works have leveraged transformers to learn transferable image (Desai & Johnson, 2020; Sariyildiz et al., 2020; Chen et al., 2020a) or multi-modal image-text representations (Li et al., 2019; Lu et al., 2019; Tan & Bansal, 2019; Su et al., 2019; Li et al., 2020; Chen et al., 2019). In this work, we leverage the transformer architecture to better encode and represent text and video.

Large-scale training data has enabled more effective pretraining of image (Yalniz et al., 2019; Sun et al., 2017), video (Ghadiyaram et al., 2019; Thomee et al., 2016) and textual representations (Raffel et al., 2019). The release of the HowTo100M dataset (Miech et al., 2019), a large-scale instructional video dataset, has spurred significant interest in leveraging large-scale pretraining to improve video-text representations for tasks such as video question-answering (Lei et al., 2018), text-video retrieval (Liu et al., 2019) and video captioning (Zhou et al., 2018). Although semantically rich and diverse, instructional videos from the web are very noisy, and several approaches have been proposed to combat this. A few works (Sun et al., 2019b; Zhu & Yang, 2020; Sun et al., 2019a) extend the BERT model to accept both visual and textual tokens to learn high-level semantic video-text representations. Other works have leveraged the contrastive loss (Miech et al., 2020) and show that raw audio (Rouditchenko et al., 2020; Alayrac et al., 2020) and other modalities (Gabeur et al., 2020) can be used to better align and improve video-text representations. While all these approaches rely on a contrastive objective, VidTranslate (Korbar et al., 2020) shows that a generative objective can also be used to learn joint video-text representations. In contrast to Korbar et al. (2020), we show that combining contrastive and generative objectives to pre-train video-text representations on large-scale data such as HowTo100M is very effective.

3 Method

Figure 2: (a) Our cross-modal framework with the discriminative (contrastive) objective and the generative objective. The model learns to associate video-text pairs in a common embedding space with text and video encoders (top). Meanwhile, the text must also be reconstructed as a weighted combination of video embeddings from a support-set (bottom), selected via attention, which enforces representation sharing between different samples. (b) Weight matrices (attention maps) used in each cross-captioning objective (see section 3.1.2).

We consider the problem of learning multimodal representations from a corpus of video-text pairs, where each pair consists of a video and its corresponding text (caption or transcription). Our goal is to learn a pair of representation maps, one for video and one for text, with outputs in a shared embedding space where semantically similar instances are close to each other.

3.1 Objective for Learning Multimodal Representations

We consider two learning objectives, also illustrated in Figure 2. The first is the contrastive objective, pushing the video and text embeddings to be close if text and video come from the same sample and pushing them apart otherwise. This assumes that every sample is its own class and does not benefit from modelling similarities across instances. The second objective is generative captioning. In its most basic variant, it maximizes the probability of generating the text given the corresponding video. However, we suggest that variants that explicitly promote concept sharing between instances will result in better downstream performance, in tasks such as video retrieval. These variants, illustrated in Figure 2, have in common that the caption is reconstructed from a learned weighted combination over other videos. This is a form of attention (Bahdanau et al., 2014) which encourages the network to learn which videos share similar semantics, compensating for the contrastive loss and grouping them implicitly.

In the following, a batch of multi-modal samples denotes a finite collection of video-text pairs drawn from the corpus.

3.1.1 Contrastive objective

To define the contrastive objective, we use a similarity measure between pairs of embedding vectors. Following Faghri et al. (2018), we adopt the hinge-based triplet ranking loss with hard negative mining:
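For reference, the max-of-hinges loss of Faghri et al. (2018) can be written as below. The notation is assumed for illustration: $\Psi$ and $\Phi$ stand for the video and text representation maps, $s$ for the similarity, $\alpha$ for the margin, $B$ for the batch size and $[x]_+ = \max(0, x)$ for the hinge. Averaging over the batch is likewise an assumption.

```latex
\mathcal{L}^{\text{contrast}}
= \frac{1}{B} \sum_{i=1}^{B} \Big(
    \max_{j \neq i} \big[\alpha - s(\Psi(v_i), \Phi(t_i)) + s(\Psi(v_i), \Phi(t_j))\big]_+
  + \max_{j \neq i} \big[\alpha - s(\Psi(v_i), \Phi(t_i)) + s(\Psi(v_j), \Phi(t_i))\big]_+
\Big)
```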


where is the correlation margin between positive and negative pairs and is the hinge function. In our experiments, we set .
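As a minimal sketch of the hinge-based ranking loss with hard negative mining, the following pure-Python function computes it for one batch of aligned pairs. The function and argument names are hypothetical, the dot product stands in for the similarity measure, and the average over the batch is an assumption rather than the paper's exact normalisation.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot-product similarity (an assumption; cosine similarity is also common)."""
    return sum(x * y for x, y in zip(a, b))


def hardest_negative_triplet_loss(
    video_emb: List[Vector], text_emb: List[Vector], margin: float
) -> float:
    """Max-of-hinges ranking loss, averaged over a batch of aligned pairs.

    video_emb[i] and text_emb[i] are assumed to come from the same sample.
    """
    n = len(video_emb)
    if n != len(text_emb) or n < 2:
        raise ValueError("need at least two aligned video-text pairs")
    # sim[i][j] compares video i with text j; the diagonal holds the positives.
    sim = [[dot(v, t) for t in text_emb] for v in video_emb]
    total = 0.0
    for i in range(n):
        pos = sim[i][i]
        # Hardest negative caption for video i and hardest negative video for caption i.
        hardest_text = max(sim[i][j] for j in range(n) if j != i)
        hardest_video = max(sim[j][i] for j in range(n) if j != i)
        total += max(0.0, margin - pos + hardest_text)
        total += max(0.0, margin - pos + hardest_video)
    return total / n
```

Only the single hardest negative in each direction contributes a gradient, which is what distinguishes this loss from a sum over all negatives.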

3.1.2 Cross-captioning objectives

In conventional captioning, the decoder minimizes the negative log-likelihood of a text sequence given its corresponding video:


Here, the log-likelihood is obtained via auto-regressive decoding (Vaswani et al., 2017) from an intermediate video embedding. For the cross-captioning objective, we modify this loss to condition the generation process on a weighted average of the embeddings of the other videos in the batch, which we call the support-set. The weights themselves, which can be interpreted as a batch-wise attention, are obtained as a softmax distribution with temperature over batch indices based on the video embeddings, as follows:
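One form consistent with this description is sketched below. All symbols are assumed for illustration: $e_j$ is the intermediate embedding of video $v_j$, $\Psi$ and $\Phi$ are the video and text representation maps, $T$ is the temperature, $\mathcal{S}_i$ is the support set for sample $i$, and $B$ is the batch size. The use of an inner product between text and video representations as the attention score is also an assumption.

```latex
w_{ij} = \frac{\exp\!\big(\langle \Phi(t_i), \Psi(v_j) \rangle / T\big)}
              {\sum_{k \in \mathcal{S}_i} \exp\!\big(\langle \Phi(t_i), \Psi(v_k) \rangle / T\big)},
\qquad
\bar{e}_i = \sum_{j \in \mathcal{S}_i} w_{ij}\, e_j,
\qquad
\mathcal{L}^{\text{cross}} = -\frac{1}{B} \sum_{i=1}^{B} \log p\big(t_i \mid \bar{e}_i\big)
```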


By default, the summation in the softmax is conducted over a support set containing all batch indices except that of the sample being captioned. In the experiments, we consider the following attention types for reconstruction. Identity captioning generates the caption from the corresponding video only and reduces to the standard captioning objective, eq. 2. Full support considers all videos in the batch, including the corresponding one, as possible candidates for captioning. Hybrid captioning sets the weights in eq. 3 as the average of the weights for identity captioning and full support. Cross-captioning considers all but the video that one wishes to caption. This variant forces the network to extract all information required for captioning from other videos in the batch. Figure 2 compares these attention mechanisms graphically.
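The four attention types can be sketched as different masks over the same batch-wise softmax. This is a minimal illustration, not the authors' implementation: the names `support_weights` and `mix_support` are hypothetical, and scoring each caption against each video with a dot product is an assumption.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def _softmax_row(query: Vector, keys: List[Vector], support: List[int],
                 temperature: float) -> List[float]:
    """Softmax over the indices in `support`; all other indices get weight 0."""
    logits = {j: sum(q * k for q, k in zip(query, keys[j])) / temperature
              for j in support}
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {j: math.exp(v - peak) for j, v in logits.items()}
    norm = sum(exps.values())
    return [exps.get(j, 0.0) / norm for j in range(len(keys))]


def support_weights(text_emb: List[Vector], video_emb: List[Vector],
                    temperature: float, mode: str = "cross") -> List[List[float]]:
    """Row i holds the mixture weights used to reconstruct caption i."""
    n = len(video_emb)
    everyone = list(range(n))
    identity = [[1.0 if j == i else 0.0 for j in everyone] for i in everyone]
    if mode == "identity":
        return identity
    if mode == "cross":
        if n < 2:
            raise ValueError("cross-captioning needs at least one other video")
        # The video being captioned is excluded from its own support set.
        return [_softmax_row(text_emb[i], video_emb,
                             [j for j in everyone if j != i], temperature)
                for i in everyone]
    full = [_softmax_row(text_emb[i], video_emb, everyone, temperature)
            for i in everyone]
    if mode == "full":
        return full
    if mode == "hybrid":
        # Average of the identity weights and the full-support weights.
        return [[0.5 * (a + b) for a, b in zip(r_id, r_full)]
                for r_id, r_full in zip(identity, full)]
    raise ValueError(f"unknown mode: {mode}")


def mix_support(weights_row: List[float], video_emb: List[Vector]) -> List[float]:
    """Weighted average of video embeddings, fed to the caption decoder."""
    dim = len(video_emb[0])
    return [sum(w * v[d] for w, v in zip(weights_row, video_emb))
            for d in range(dim)]
```

In the cross mode, the diagonal of the weight matrix is exactly zero, so the decoder input for a caption never contains its own video.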

Considering both discriminative and generative objectives for learning multimodal representations, our full objective is a weighted sum of the contrastive and cross-captioning losses, where a scalar weight balances the two objectives. We set this weight so that both losses have similar magnitudes in our experiments. In the training phase, we use Adam (Kingma & Ba, 2015) to minimize our loss. At inference time, we directly use the video and text representation maps to encode video and text for retrieval.

3.2 Model Architecture

We now discuss the details of the encoder and decoder components of our architecture, illustrated in fig. 2. For the text decoder in eqs. 2 and 3, we use a pre-trained T5 decoder (Raffel et al., 2019).

For the video representation, we use a video encoder followed by a multi-layer transformer pooling head. The encoder concatenates the outputs of pretrained ResNet-152 (He et al., 2016) and R(2+1)D-34 (Tran et al., 2018) networks applied to individual video frames, resulting in a sequence of codes whose length is bounded by the maximum duration of a video clip. For the pooling head, we consider a transformer architecture to attend to important context and summarize it into a fixed-length representation. For this, we follow MMT (Gabeur et al., 2020), but with two important differences. First, while MMT uses 7 expert features, which lengthens the input sequence accordingly, we only use a transformer to attend to early-fused motion and appearance features as the video representation, thus significantly reducing the sequence length and computational cost. Second, instead of stacking 6 transformer layers to encode the visual stream as in MMT, we only use a shallow two-layer transformer architecture with additional pre-encoders, further increasing model efficiency. As temporal 1D-convolutional neural networks (CNNs) (LeCun et al., 1998) were shown to effectively capture temporal dependencies in videos (Dong et al., 2019), we integrate CNNs into our transformer pooling heads to better capture video temporal signals. In more detail, we compute the pooled representation by chaining two transformer layers, each of the type:


Here, a pre-encoder refines the video representation; we found empirically that a 1D CNN works well for this purpose. Then, we apply multi-head self-attention (MHA) (Vaswani et al., 2017) followed by a feed-forward network (FFN) with batch normalization (BN) (Ioffe & Szegedy, 2015). The architecture maps the input sequence to a new ‘contextualized’ sequence of representation vectors; we take the first one as the video representation.
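One way to write such a layer compactly is given below. This is a sketch under assumptions: $E$ denotes the input sequence, $f$ the pre-encoder and $\psi$ the layer, all of which are assumed symbols, and the residual connections and the placement of batch normalization follow a standard Transformer block rather than anything confirmed in the text above.

```latex
e_{\text{attn}} = \mathrm{BN}\big(\mathrm{MHA}(f(E)) + f(E)\big),
\qquad
\psi(E) = \mathrm{BN}\big(\mathrm{FFN}(e_{\text{attn}}) + e_{\text{attn}}\big)
```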

The text representation decomposes in the same way as the video representation. The text encoder uses a pretrained T5 network, resulting in a sequence of codes whose length is bounded by the maximum length of a sentence. The pooling head follows the same design as the video case, but the pre-encoder is set to a recurrent neural network (RNN) instead of a CNN. Please refer to the appendix for details.

In practice, for computational reasons, we use eq. 3 to finetune the parameters of all networks except the video encoder, which is kept fixed.

4 Experiments

We validate empirically the ability of our method to learn better representations for the downstream tasks of text-to-video and video-to-text retrieval. First, in sec. 4.2 we ablate various model components on the MSR-VTT dataset. Then, in sec. 4.3 we show that our best model significantly outperforms state-of-the-art retrieval systems on three datasets, MSR-VTT, ActivityNet and VATEX. Finally, in sec. 4.4 we analyse qualitatively the effect of the attention mechanism used during training.

4.1 Experimental Setup


HowTo100M (Miech et al., 2019) is a large-scale instructional video collection of 1.2 million YouTube videos, along with automatic speech recognition transcripts. We use this dataset for our pre-training experiments.

MSR-VTT (Xu et al., 2016) contains 10,000 videos, where each video is annotated with 20 descriptions. We report results on the 1k-A split (9,000 training, 1,000 testing) as in Liu et al. (2019). VATEX (Wang et al., 2019) is a multilingual (Chinese and English) video-text dataset with 34,911 videos. We use the official training split with 25,991 videos and report on the validation split as in HGR (Chen et al., 2020b). The ActivityNet Caption (Krishna et al., 2017) dataset consists of densely annotated temporal segments of 20K YouTube videos. We use the 10K training split to train from scratch or finetune the model and report the performance on the 5K ‘val1’ split.

Evaluation Metrics. To measure the text-to-video and video-to-text retrieval performance, we choose Recall at K (R@K) and Median Rank (MedR), which are common metrics in information retrieval.
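Both metrics can be computed from a query-by-candidate similarity matrix, as in the sketch below. It assumes a single ground-truth candidate per query, stored at the same index as the query; datasets with several captions per video need a multi-target variant for the video-to-text direction. The function names are hypothetical.

```python
from statistics import median
from typing import List


def ranks_of_matches(sim: List[List[float]]) -> List[int]:
    """1-based rank of the ground-truth candidate (index i) for each query i."""
    ranks = []
    for i, row in enumerate(sim):
        # Count candidates scored strictly higher than the correct one.
        better = sum(1 for score in row if score > row[i])
        ranks.append(better + 1)
    return ranks


def recall_at_k(ranks: List[int], k: int) -> float:
    """Percentage of queries whose correct candidate is ranked in the top k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


def median_rank(ranks: List[int]) -> float:
    """Median position of the correct candidate; lower is better."""
    return float(median(ranks))
```

Higher R@K and lower MedR both indicate better retrieval; transposing the similarity matrix switches between the text-to-video and video-to-text directions.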

(a) Video Encoder. Stronger features and their combination improve performance.
    Feature source: R(2+1)D-18 + R-152

(b) Feature Aggregation. Learning temporal attention yields strong gains over pooling.
    Temporal reduction: Multi-Head Attn

(c) Text Encoder. Stronger encoding of text improves retrieval.
    Text Encoder: W2V (GloVe)

(d) Contrastive Loss. Inter-modal Triplet loss yields the best performance.
    InfoNCE (inter+intra); InfoNCE (inter); Triplet (inter+intra); Triplet (inter)

(e) Support-set Size. Retrieval degrades when reconstructing from sets that are too small or too large.
    Size               R@1    R@5
    8                  18.5   45.6
    16                 20.7   49.9
    32                 25.2   54.6
    64                 27.2   55.7
    128                28.0   56.1
    256                26.9   55.0
    512                25.3   53.5
    2048 (w/ memory)   26.8   54.7

Table 1: Model Architecture and Training Details Ablation. Text→Video retrieval performance on MSR-VTT. Recall@K and Median Rank are shown.

Table 2: Effect of learning objectives. Text→Video retrieval on MSR-VTT.

4.2 Ablations

In Tab. 1, we first only ablate the cross-modal retrieval part of our network architecture, while the generative objectives are analysed in Tab. 2.

Video Encoder. In Tab. 1(a), we show the effect of the choice of visual input features. We find that, for text-to-video retrieval at Recall at 1 and 5, features obtained from a video R(2+1)D-34 ResNet achieve higher performance than image-frame based features from a ResNet-152 alone. Concatenating both features yields a further gain and the strongest results.

Feature Aggregation. While the features from both video and image-based visual encoders have reduced spatial extent after a fully-connected layer, the temporal dimension can be reduced in various ways. In Tab. 1(b), we find that our multi-head, parameterized attention reduction yields strong gains over the mean- or max-pooling baselines. This shows that learning attention over the temporal dimension of fixed feature sets can give strong gains even without fine-tuning the encoder.

Text Encoder. In Tab. 1(c), we find gains in both R@1 and R@5 from using T5-base instead of T5-small. We do not use the T5-Large model, as in Korbar et al. (2020), due to the prohibitively large relative model size increase of +220%.

Contrastive Loss. To validate the choice of a triplet loss in eq. 1, in Tab. 1(d), we compare the InfoNCE contrastive loss (Oord et al., 2018) with a triplet loss, each in its inter-modality and inter+intra-modality variants. We find that the InfoNCE loss does not work well in our case, likely due to the difficulty of tuning this loss to have the right combination of temperature and batch size.

Captioning Objective. In Tab. 2, we show the effect of the different variants of our learning objective, eq. 3. First, we find that the naive addition of a reconstruction objective (“Identity”) does not improve the contrastive-only baseline (“None”) much. Considering reconstruction from other videos improves the performance more. In particular, the “Hybrid” variant, which combines “Identity” and “Full” (sec. 3.1.2), improves Recall at 1 and 5 over this baseline. However, the best result by far is obtained by forcing captions to be reconstructed only from other videos, via our cross-instance attention mechanism (“Cross”). This variant cannot use information contained in a video to generate the corresponding caption and thus relies entirely on the model to discover meaningful relationships between different videos. This newly-proposed scheme seems to have the most beneficial effect for semantic retrieval.

Support-Set Size. Lastly, in Tab. 1(e), we show the effect of the size of the support set used for cross-instance captioning. We find that our reconstruction loss indeed acts as a bottleneck: R@1 peaks at 28.0 with a support set of 128, and drops to 18.5 at size 8 and to 25.3 at size 512.

Columns: Text→Video (recall columns, Md) | Video→Text (recall columns, Md)
Random Baseline
JSFusion (Yu et al., 2018)
HT100M (Miech et al., 2019)
JPoSE (Wray et al., 2019)
CE (Liu et al., 2019)
MMT (Gabeur et al., 2020)
VidTranslate (Korbar et al., 2020)
HT100M (Miech et al., 2019)
NoiseEstimation (Amrani et al., 2020)
AVLnet (Rouditchenko et al., 2020)
MMT (Gabeur et al., 2020)
Table 3: Retrieval performance on the MSR-VTT dataset. Models in the second group are additionally pretrained on HowTo100M.

Columns: Text→Video (recall columns, Md) | Video→Text (recall columns, Md)
Random Baseline
VSE (Kiros et al., 2014)
VSE++ (Faghri et al., 2018)
Dual (Dong et al., 2019)
HGR (Chen et al., 2020b)
Ours 44.6 81.8 89.5 1.0 | 58.1 83.8 90.9 1.0
Table 4: Retrieval performance on the VATEX dataset.

Columns: Text→Video (recall columns, Md) | Video→Text (recall columns, Md)
Random Baseline
FSE (Zhang et al., 2018)
CE (Liu et al., 2019)
HSE (Zhang et al., 2018)
MMT (Gabeur et al., 2020)
Ours 93.5 93.5
MMT-pretrained (Gabeur et al., 2020)
Ours-pretrained 29.2 61.6 94.7 3.0 94.8 2.0
Table 5: Retrieval performance on ActivityNet.

4.3 Comparison to State-of-the-Art

In this section, we compare the results of our method to other recent text-to-video and video-to-text retrieval approaches on various datasets. In Tabs. 3, 4 and 5, we show the results of our model applied to text-to-video and video-to-text retrieval on MSR-VTT, VATEX and ActivityNet, with and without pre-training on HowTo100M. Without pre-training, our method outperforms all others in all metrics and datasets. In particular, for the VATEX dataset, our retrieval performance at recall at 1 and 5 exceeds recent state-of-the-art methods (Chen et al., 2020b). For ActivityNet, our model outperforms MMT by a margin of 4% at recall at 1. With pre-training on HowTo100M, our performance further increases across the board. Notably, unlike MMT, which uses 7 features, our model uses only 2 features and achieves state-of-the-art in most metrics.

Figure 3: Support-set attention map. Attention scores of all pairs in a batch (top-left square) and a subset of rows/columns (other squares) on MSR-VTT.

4.4 Analysis

To better understand the effect of our learning objective, we visualize the soft attention of our best-performing cross-instance reconstruction model in fig. 3. As the top-left square shows, the pairwise attention between all pairs of videos in the batch is highly focused, with the model mostly attending to one or two other instances in the batch.

For the first video’s caption reconstruction (second row), we find that the model solely attends to another musical performance video that is in the batch, ignoring the others. For the second video (third row), the model focuses on another sample that shows the sea but differs in most other aspects since there are no semantically-equivalent clips in the batch. The third video shares a similar scenario. These examples show that the bottleneck is effective at forcing the model to avoid memorising the video-caption association of each clip in isolation, and attempt to match other clips more broadly, since an exact (or very close) match is not guaranteed.

5 Conclusion

In this work, we studied classic contrastive learning methods such as the triplet loss to learn video-text representations for cross-modal retrieval. We suggested that the contrastive approach might pull apart videos and captions even when they are semantically equivalent, which can hinder downstream retrieval performance. To mitigate this effect, we proposed a captioning pretext task as an additional learning objective. In particular, we showed that cross-instance captioning can encourage the representation to pull together videos that share a similar caption, and are thus likely to be equivalent for retrieval. Leveraging these ideas, our model achieves state-of-the-art performance on the text-to-video and video-to-text retrieval tasks on three datasets.

While we demonstrated these ideas in the specific case of text-to-video retrieval, they can in principle generalize to any setting that utilizes a contrastive loss, including self-supervised learning, provided that it is possible to learn reasonable conditional generators of a modality or data stream given another.


  • Alayrac et al. (2020) Jean-Baptiste Alayrac, A. Recasens, Rosália G. Schneider, R. Arandjelović, Jason Ramapuram, J. Fauw, Lucas Smaira, S. Dieleman, and Andrew Zisserman. Self-supervised multimodal versatile networks. ArXiv, abs/2006.16228, 2020.
  • Amrani et al. (2020) Elad Amrani, Rami Ben-Ari, Daniel Rotman, and Alex Bronstein. Noise estimation using density estimation for self-supervised multimodal learning. arXiv preprint arXiv:2003.03186, 2020.
  • Asano et al. (2020a) Yuki M. Asano, Mandela Patrick, Christian Rupprecht, and Andrea Vedaldi. Labelling unlabelled videos from scratch with multi-modal self-supervision, 2020a.
  • Asano et al. (2020b) Yuki M Asano, Christian Rupprecht, and Andrea Vedaldi. Self-labelling via simultaneous clustering and representation learning. In ICLR, 2020b.
  • Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
  • Caron et al. (2018) Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In ECCV, 2018.
  • Chen et al. (2020a) Mark Chen, Alec Radford, Rewon Child, Jeff Wu, and Heewoo Jun. Generative pretraining from pixels. In ICML, 2020a.
  • Chen et al. (2020b) Shizhe Chen, Yida Zhao, Qin Jin, and Qi Wu. Fine-grained video-text retrieval with hierarchical graph reasoning. 2020b.
  • Chen et al. (2020c) Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020c.
  • Chen et al. (2019) Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter: Learning universal image-text representations. arXiv preprint arXiv:1909.11740, 2019.
  • Clark et al. (2020) Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. Electra: Pre-training text encoders as discriminators rather than generators, 2020.
  • Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, pp. 248–255, 2009.
  • Denkowski & Lavie (2014) Michael Denkowski and Alon Lavie. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the ninth workshop on statistical machine translation, pp. 376–380, 2014.
  • Desai & Johnson (2020) Karan Desai and Justin Johnson. Virtex: Learning visual representations from textual annotations. arXiv preprint arXiv:2006.06666, 2020.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.
  • Donahue & Simonyan (2019) Jeff Donahue and Karen Simonyan. Large scale adversarial representation learning. In NeurIPS, 2019.
  • Dong et al. (2019) Jianfeng Dong, Xirong Li, Chaoxi Xu, Shouling Ji, Yuan He, Gang Yang, and Xun Wang. Dual encoding for zero-example video retrieval. In CVPR, 2019.
  • Faghri et al. (2018) Fartash Faghri, David J Fleet, Jamie Ryan Kiros, and Sanja Fidler. Vse++: Improving visual-semantic embeddings with hard negatives. In BMVC, 2018.
  • Gabeur et al. (2020) Valentin Gabeur, Chen Sun, Karteek Alahari, and Cordelia Schmid. Multi-modal transformer for video retrieval. In ECCV, 2020.
  • Ghadiyaram et al. (2019) Deepti Ghadiyaram, Du Tran, and Dhruv Mahajan. Large-scale weakly-supervised pre-training for video action recognition. In CVPR, 2019.
  • Gidaris et al. (2018) Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. ICLR, 2018.
  • Goodfellow et al. (2014) Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks, 2014.
  • Gutmann & Hyvärinen (2010) Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In AISTATS, 2010.
  • Hadsell et al. (2006) Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR, 2006.
  • Han et al. (2019) Tengda Han, Weidi Xie, and Andrew Zisserman. Video representation learning by dense predictive coding. In ICCV Workshops, 2019.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
  • He et al. (2020) Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning, 2020.
  • Hinton & Salakhutdinov (2006) Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 2006.
  • Hou et al. (2019) Jingyi Hou, Xinxiao Wu, Wentian Zhao, Jiebo Luo, and Yunde Jia. Joint syntax representation learning and visual cue translation for video captioning. In ICCV, 2019.
  • Ioffe & Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
  • Ji et al. (2018) Xu Ji, João F. Henriques, and Andrea Vedaldi. Invariant information clustering for unsupervised image classification and segmentation, 2018.
  • Khosla et al. (2020) Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning, 2020.
  • Kingma & Ba (2015) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • Kiros et al. (2014) Ryan Kiros, Ruslan Salakhutdinov, and Richard S Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014.
  • Korbar et al. (2020) Bruno Korbar, Fabio Petroni, Rohit Girdhar, and Lorenzo Torresani. Video understanding as machine translation, 2020.
  • Krishna et al. (2017) Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. In CVPR, 2017.
  • LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Lei et al. (2018) Jie Lei, Licheng Yu, Mohit Bansal, and Tamara Berg. Tvqa: Localized, compositional video question answering. In EMNLP, pp. 1369–1379, 2018.
  • Lewis et al. (2020a) Mike Lewis, Marjan Ghazvininejad, Gargi Ghosh, Armen Aghajanyan, Sida Wang, and Luke Zettlemoyer. Pre-training via paraphrasing, 2020a.
  • Lewis et al. (2020b) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In ACL, 2020b.
  • Li et al. (2020) Gen Li, Nan Duan, Yuejian Fang, Ming Gong, Daxin Jiang, and Ming Zhou. Unicoder-vl: A universal encoder for vision and language by cross-modal pre-training. In AAAI, 2020.
  • Li et al. (2019) Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. Visualbert: A simple and performant baseline for vision and language, 2019.
  • Li et al. (2018) Yehao Li, Ting Yao, Yingwei Pan, Hongyang Chao, and Tao Mei. Jointly localizing and describing events for dense video captioning. In CVPR, 2018.
  • Lin (2004) Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pp. 74–81, 2004.
  • Liu et al. (2019) Yang Liu, Samuel Albanie, Arsha Nagrani, and Andrew Zisserman. Use what you have: Video retrieval using representations from collaborative experts. In BMVC, 2019.
  • Lu et al. (2019) Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS, 2019.
  • Miech et al. (2019) Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In ICCV, 2019.
  • Miech et al. (2020) Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, and Andrew Zisserman. End-to-end learning of visual representations from uncurated instructional videos. In CVPR, 2020.
  • Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
  • Misra & van der Maaten (2020) Ishan Misra and Laurens van der Maaten. Self-supervised learning of pretext-invariant representations. In CVPR, 2020.
  • Morgado et al. (2020) Pedro Morgado, Nuno Vasconcelos, and Ishan Misra. Audio-visual instance discrimination with cross-modal agreement. arXiv preprint arXiv:2004.12943, 2020.
  • Oord et al. (2018) Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp. 311–318, 2002.
  • Patrick et al. (2020) Mandela Patrick, Yuki M. Asano, Polina Kuznetsova, Ruth Fong, João F. Henriques, Geoffrey Zweig, and Andrea Vedaldi. Multi-modal self-supervision from generalized data transformations, 2020.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543, 2014.
  • Radford et al. (2015) Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
  • Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019.
  • Raffel et al. (2019) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.
  • Rouditchenko et al. (2020) Andrew Rouditchenko, Angie Boggust, David Harwath, Dhiraj Joshi, Samuel Thomas, Kartik Audhkhasi, Rogerio Feris, Brian Kingsbury, Michael Picheny, Antonio Torralba, et al. Avlnet: Learning audio-visual language representations from instructional videos. arXiv preprint arXiv:2006.09199, 2020.
  • Sariyildiz et al. (2020) Mert Bulent Sariyildiz, Julien Perez, and Diane Larlus. Learning visual representations with caption annotations, 2020.
  • Su et al. (2019) Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. Vl-bert: Pre-training of generic visual-linguistic representations, 2019.
  • Sun et al. (2017) Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era.
  • Sun et al. (2019a) Chen Sun, Fabien Baradel, Kevin Murphy, and Cordelia Schmid. Learning video representations using contrastive bidirectional transformer, 2019a.
  • Sun et al. (2019b) Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. Videobert: A joint model for video and language representation learning. In ICCV, 2019b.
  • Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In NeurIPS, pp. 3104–3112, 2014.
  • Tan & Bansal (2019) Hao Tan and Mohit Bansal. Lxmert: Learning cross-modality encoder representations from transformers. In EMNLP, 2019.
  • Thomee et al. (2016) Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. Yfcc100m: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016.
  • Tran et al. (2018) Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In CVPR, 2018.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
  • Vedantam et al. (2015) Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4566–4575, 2015.
  • Wang et al. (2019) Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and William Yang Wang. Vatex: A large-scale, high-quality multilingual dataset for video-and-language research. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4581–4591, 2019.
  • Wray et al. (2019) Michael Wray, Diane Larlus, Gabriela Csurka, and Dima Damen. Fine-grained action retrieval through multiple parts-of-speech embeddings. In ICCV, 2019.
  • Wu et al. (2018) Zhirong Wu, Yuanjun Xiong, Stella X. Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In CVPR, 2018.
  • Xu et al. (2016) Jun Xu, Tao Mei, Ting Yao, and Yong Rui. MSR-VTT: A large video description dataset for bridging video and language. In CVPR, 2016.
  • Yalniz et al. (2019) I. Zeki Yalniz, Hervé Jégou, Kan Chen, Manohar Paluri, and Dhruv Mahajan. Billion-scale semi-supervised learning for image classification, 2019.
  • Yu et al. (2018) Youngjae Yu, Jongseok Kim, and Gunhee Kim. A joint sequence fusion model for video question answering and retrieval. In ECCV, 2018.
  • Zhang et al. (2018) Bowen Zhang, Hexiang Hu, and Fei Sha. Cross-modal and hierarchical modeling of video and text. In ECCV, pp. 374–390, 2018.
  • Zhang et al. (2016) Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In ECCV, 2016.
  • Zhang et al. (2020) Ziqi Zhang, Yaya Shi, Chunfeng Yuan, Bing Li, Peijin Wang, Weiming Hu, and Zheng-Jun Zha. Object relational graph with teacher-recommended learning for video captioning. In CVPR, 2020.
  • Zhou et al. (2018) Luowei Zhou, Yingbo Zhou, Jason J Corso, Richard Socher, and Caiming Xiong. End-to-end dense video captioning with masked transformer. In CVPR, 2018.
  • Zhu & Yang (2020) Linchao Zhu and Yi Yang. Actbert: Learning global-local video-text representations. In CVPR, 2020.

6 Appendix

The appendix is organized as follows: First, we provide more details about our model. Then we introduce the datasets and the experimental setup. Finally, we provide additional qualitative and quantitative experimental results for video-text retrieval and captioning.

6.1 Model Details

Implementation details and hyperparameters.

For our text encoder, we use the T5-base model pre-trained on the “Colossal Clean Crawled Corpus” (C4) (Raffel et al., 2019). We use its corresponding text tokenizer and encode a sentence into a sequence of 1024-dimensional vectors.

For our visual encoder, our model uses only motion and appearance features. For the motion feature, we use a 34-layer R(2+1)-D (Tran et al., 2018) model pre-trained on IG65M (Ghadiyaram et al., 2019) and apply spatial-temporal average pooling over the last convolutional layer, resulting in a 512-dimensional vector. For the appearance feature, we use the 2048-dimensional flattened pool-5 layer of the standard ResNet152 (He et al., 2016) pre-trained on ImageNet (Deng et al., 2009). We extract features at a rate of 1 feature per second and simply concatenate the two features, resulting in a 2560-dimensional visual input stream. Notably, instead of using 9 and 7 different types of visual features as in CE (Liu et al., 2019) and MMT (Gabeur et al., 2020), respectively, we use only the above 2 features and achieve on-par or superior performance. Also, with early fusion, our model does not incur the additional computation required for the extended sequence length in MMT. For the text decoder, we use the T5-base model decoder, also pre-trained on C4.
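The early fusion described above can be illustrated with a minimal sketch. It assumes that both feature streams are already extracted and aligned at one vector per second; the function name `fuse_features` and the placeholder values are hypothetical and only serve to show how the 512- and 2048-dimensional vectors combine into the 2560-dimensional input.

```python
MOTION_DIM = 512        # R(2+1)-D feature size stated in the text
APPEARANCE_DIM = 2048   # ResNet152 pool-5 feature size stated in the text


def fuse_features(motion, appearance):
    """Concatenate per-second motion and appearance vectors (early fusion).

    Both inputs are lists with one feature vector per second of video.
    """
    if len(motion) != len(appearance):
        raise ValueError("both streams need one feature per second")
    fused = []
    for m, a in zip(motion, appearance):
        if len(m) != MOTION_DIM or len(a) != APPEARANCE_DIM:
            raise ValueError("unexpected feature dimension")
        fused.append(list(m) + list(a))  # motion first, then appearance
    return fused


# Hypothetical 3-second clip with placeholder feature values.
motion = [[0.0] * MOTION_DIM for _ in range(3)]
appearance = [[1.0] * APPEARANCE_DIM for _ in range(3)]
fused = fuse_features(motion, appearance)
```

Because the two streams are joined along the feature axis, the sequence length stays at one step per second, which is the property the comparison with MMT relies on.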

Figure 4: Transformer pooling head.

As illustrated in Fig. 4, our transformer pooling head is composed of a pre-encoder, a multi-head self-attention, and a FFN layer. For pre-encoders, we use a one-layer MLP with a -dimensional output for mapping video features into the common embedding space. We use 1024-dimension bi-directional GRU as the text pre-encoder. For the 1D-CNN prior, we use kernels with size as the visual and text pre-encoders. We set the embedding dimension to 1024 and use 4 attention heads in the transformer pooling layers. The hidden dimension of FFN is 2048.

Training and Inference time.

Pre-training on 1.2 million HowTo100M videos takes around 160 GPU hours (NVIDIA V100) for 20 epochs. We speed up the pre-training process by distributing the workload over 8 GPUs. We use 1 GPU for the fine-tuning or training from scratch experiments. For the MSR-VTT 1k-A split, it takes 12 GPU hours to train our full model on 180K video-text pairs for 20 epochs. For VATEX, it takes 32 GPU hours to train on 260K video-text pairs for 30 epochs. For ActivityNet, it takes 2.5 GPU hours to train on 10K video-text pairs for 28 epochs.

For inference, the encoding speed is around 250-300 video/sec and 200-250 text query/sec. The overall text-to-video search speed on 5,000 video-text pairs (5,000 text queries over 5,000 videos) is 30-34 seconds including encoding. The speed of text-to-video retrieval is similar to video-to-text retrieval.
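To make the search step concrete, the following sketch scores every text query against every video and ranks the videos per query. It is an illustration under stated assumptions: the dot product as similarity function, the helper names `rank_videos` and `recall_at_k`, and the toy embeddings are all hypothetical, and recall@K is used here only as a common retrieval measure.

```python
def rank_videos(text_embs, video_embs):
    """For each text query, return video indices sorted by decreasing
    dot-product similarity (assumed similarity function)."""
    rankings = []
    for t in text_embs:
        scores = [sum(x * y for x, y in zip(t, v)) for v in video_embs]
        order = sorted(range(len(video_embs)), key=lambda j: -scores[j])
        rankings.append(order)
    return rankings


def recall_at_k(rankings, k):
    """Fraction of queries whose matching video (same index as the query)
    appears among the top-k ranked videos."""
    hits = sum(1 for i, order in enumerate(rankings) if i in order[:k])
    return hits / len(rankings)


# Toy 2-dimensional embeddings; query i is paired with video i.
text_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
video_embs = [[0.9, 0.1], [0.2, 0.8], [0.3, 0.3]]
rankings = rank_videos(text_embs, video_embs)
```

An exhaustive search like this costs one similarity evaluation per query-video pair, so 5,000 queries over 5,000 videos require 25 million evaluations once the embeddings are computed.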

6.2 Experiment Details

The margin of the max-margin loss is 0.2. We use the Adam (Kingma & Ba, 2015) optimizer with an initial learning rate and clip gradients greater than 0.2 during the training phase. The dropout rate is 0.3 for all datasets except ActivityNet (0.0).
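As an illustration of how the margin enters training, here is a minimal sketch of a bidirectional max-margin ranking loss over a batch similarity matrix. The exact formulation and normalization used in the main paper are not restated in this appendix, so the form below (hinge terms in both retrieval directions, averaged over the batch) and the name `max_margin_loss` are assumptions.

```python
def max_margin_loss(sim, margin=0.2):
    """Bidirectional max-margin ranking loss (assumed form).

    sim[i][j] is the similarity between text i and video j; matching
    pairs lie on the diagonal. Every off-diagonal entry is a negative.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Text i should prefer its own video i over video j.
            total += max(0.0, margin + sim[i][j] - sim[i][i])
            # Video i should prefer its own text i over text j.
            total += max(0.0, margin + sim[j][i] - sim[i][i])
    return total / n  # averaging over the batch is an assumption


# Toy 2x2 batch: the second pair is poorly separated from its negatives.
loss = max_margin_loss([[0.9, 0.1], [0.8, 0.5]])
separated = max_margin_loss([[1.0, 0.0], [0.0, 1.0]])
```

A negative pair contributes nothing once it scores at least 0.2 below the matching pair, which is why the well-separated batch yields zero loss.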

As the average video/text lengths and the number of videos available differ considerably across datasets, we adjust our training scheme accordingly. When training on MSR-VTT, ActivityNet and VATEX, the batch size is set to 64. For MSR-VTT training, we sample and truncate videos to 32 seconds and text to 100 tokens, and train for 20 epochs. For VATEX, videos are at most 64 seconds and we train for 30 epochs. For ActivityNet training, videos are at most 512 seconds and text is at most 256 tokens. We train for 28 epochs on ActivityNet. When fine-tuning the HowTo100M pre-trained model, we reduce the number of training epochs to a quarter.
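The per-dataset limits above can be collected in a small configuration, sketched below. The dictionary layout and the function `truncate_sample` are hypothetical; the sketch keeps the first seconds and tokens of each sample, which is an assumption, since the sampling strategy is not specified here. `None` marks a limit the text does not state.

```python
# Limits taken from the text; None marks a value the text does not give.
LIMITS = {
    "MSR-VTT": {"max_seconds": 32, "max_tokens": 100, "epochs": 20},
    "VATEX": {"max_seconds": 64, "max_tokens": None, "epochs": 30},
    "ActivityNet": {"max_seconds": 512, "max_tokens": 256, "epochs": 28},
}


def truncate_sample(video_feats, tokens, dataset):
    """Cut a sample to the dataset limits.

    video_feats holds one feature vector per second, so slicing by
    max_seconds bounds the video duration.
    """
    cfg = LIMITS[dataset]
    video = video_feats[: cfg["max_seconds"]]
    if cfg["max_tokens"] is not None:
        tokens = tokens[: cfg["max_tokens"]]
    return video, tokens


# Hypothetical 40-second video with a 120-token description.
video, text = truncate_sample(list(range(40)), ["tok"] * 120, "MSR-VTT")
```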

6.3 Dataset Details

HowTo100M (Miech et al., 2019) is a large-scale instructional video collection of 1.2 million YouTube videos, along with automatic speech recognition transcripts. There are more than 100 million clips (ASR segments) defined in HowTo100M. We use this dataset for pre-training.

MSR-VTT (Xu et al., 2016) contains 10,000 videos, where each video is annotated with 20 descriptions. For retrieval experiments and ablation studies, we follow the training protocol defined in Gabeur et al. (2020); Miech et al. (2019); Liu et al. (2019) and evaluate on text-to-video and video-to-text search tasks on the 1k-A testing split with 1,000 video or text candidates defined by Yu et al. (2018). For the captioning task, we evaluate on the standard testing split with 2,990 videos.

VATEX (Wang et al., 2019) is a multilingual (Chinese and English) video-text dataset with 34,911 videos. We use the official split with 25,991 videos for training. As the testing annotations are private in VATEX, we follow the protocol in Chen et al. (2020b) to split the validation set equally (1,500 validation and 1,500 testing videos) for model selection and testing. For each video, 10 English and 10 Chinese descriptions are available, and we only use the English annotations.

The ActivityNet Dense Caption dataset consists of densely annotated temporal segments of 20K YouTube videos. Following Zhang et al. (2018); Gabeur et al. (2020), we concatenate the descriptions of the segments in a video to construct a “video-paragraph” for retrieval and captioning. We use the 10K training split to train from scratch or fine-tune the model and report the performance on the 5K ‘val1’ split.
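The “video-paragraph” construction can be sketched as follows. The segment format (a dictionary with `start` and `caption` keys), the function name `build_video_paragraph`, and the choice to order segments by start time before joining are assumptions made for illustration.

```python
def build_video_paragraph(segments):
    """Join the segment descriptions of one video into a single paragraph.

    Segments are ordered by start time (an assumption) so that the
    paragraph follows the video chronologically.
    """
    ordered = sorted(segments, key=lambda s: s["start"])
    return " ".join(s["caption"].strip() for s in ordered)


# Hypothetical annotations for one video, given out of order.
segments = [
    {"start": 12.0, "caption": "He pours the batter into a pan. "},
    {"start": 0.0, "caption": "A man mixes batter in a bowl."},
]
paragraph = build_video_paragraph(segments)
```

Each video then contributes exactly one text to the retrieval pool, paired with the full untrimmed video.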

6.4 Additional Qualitative Results

We provide additional qualitative text-to-video retrieval results on MSR-VTT, VATEX, and ActivityNet in Fig. 5. Given a text query, in most cases, our model successfully retrieves the correct videos marked in green.

(c) ActivityNet
Figure 5: Examples of top-3 text-to-video retrieval results and similarities on the MSR-VTT, VATEX, and ActivityNet testing sets. Only one video (colored in green) is correct for each text query shown at the top.

6.5 Video Captioning Experiments

To measure captioning/text generation performance, we report BLEU4 (Papineni et al., 2002), METEOR (Denkowski & Lavie, 2014), ROUGE-L (Lin, 2004) and CIDEr (Vedantam et al., 2015) metrics. We report results on the MSR-VTT, VATEX and ActivityNet datasets.

VidTranslate (Korbar et al., 2020)
POS+VCT (Hou et al., 2019)
ORG (Zhang et al., 2020)
Ours, MSR-VTT only
Ours, HT100M + MSR-VTT
Table 6: Captioning performance on the MSR-VTT dataset
BLEU@4 METEOR ROUGE-L CIDEr
Shared Enc-Dec (Wang et al., 2019)
ORG (Zhang et al., 2020)
Ours, Vatex only 32.8 24.4 49.1 51.2
Ours, HT100M + Vatex
Table 7: Captioning performance on the VATEX dataset
BLEU@4 METEOR ROUGE-L CIDEr
DENSE (Krishna et al., 2017)
DVC-D-A (Li et al., 2018)
Bi-LSTM+TempoAttn (Zhou et al., 2018)
Masked Transformer (Zhou et al., 2018)
Ours, ActivityNet only
Ours, HT100M + ActivityNet
Table 8: Captioning performance on the ActivityNet dataset