What we see changes what we know.
What we know changes what we see. Jean Piaget
Vision and language play an important role in the way humans learn to associate visual entities to abstract concepts and vice versa. This has also become the de facto way to successfully train computer vision models. Indeed, from classification, where images are categorized based on a fixed list of words, to the recent captioning tasks, where images or videos are annotated with rich language descriptions, this interplay is one of the driving forces behind recent progress in the field. However, one of the main limitations of this approach is that it requires manually annotating large collections of visual data.
Manual annotation is both cumbersome and expensive. Moreover, for videos, which are the main focus of this work, annotation is even more challenging due to the ambiguities of choosing the right vocabulary of actions and annotating action intervals in video. This significantly limits the scale at which fully supervised video datasets can be obtained and hence slows down the quest to improve visual representations. Recent work has proposed a promising alternative to this fully supervised approach: leveraging narrated videos that are readily available at scale on the web.
Of particular interest, the recent HowTo100M dataset  contains more than 100 million pairs of video clips and associated narrations. It was automatically collected by querying YouTube for instructional videos. Such videos usually depict someone explaining orally how to perform a complex human activity, e.g. preparing a particular meal or repairing a car. Our objective in this paper is to learn strong video representations using only this narrated material.
End-to-end learning from instructional videos is a highly challenging task. Indeed, these videos are generally made with the goal of maximizing the number of views, and with no specific intention to provide a training signal for machine learning algorithms. This means that the supervision present in the narration is only weak and noisy. Among typical sources of noise, the most prominent one by far is the weak alignment between the video and the language: although for the most part the spoken words correlate with what is happening in these videos, this alignment is far from perfect. People might talk about something before actually demonstrating it, but they might also omit to talk about something that is happening because it is clear enough visually. Conversely, they could merely mention an action without showing it when the step is not essential or is trivial to convey with language alone. This is without even considering the irrelevant information given throughout the video (e.g. jokes or credits) as well as the general difficulty of working with spoken language obtained from potentially erroneous speech recognition algorithms as opposed to written text.
In this work, we propose a bespoke training loss, dubbed MIL-NCE as it inherits from Multiple Instance Learning (MIL) and Noise Contrastive Estimation (NCE). Our method is capable of addressing visually misaligned narrations from uncurated instructional videos as illustrated in Figure 1. Equipped with this novel training scheme and a simple joint video and text embedding model, we show that we can successfully train video representations from scratch directly from pixels on the HowTo100M dataset. To demonstrate the quality of the learnt representations, we employ an extensive set of evaluation benchmarks on a wide variety of video understanding tasks: action recognition (HMDB-51, UCF-101, Kinetics-700), text-to-video retrieval (YouCook2, MSR-VTT), action localization (YouTube-8M Segments, CrossTask) and action segmentation (COIN). Notably, our learnt video representations outperform fully supervised baselines trained on Kinetics or ImageNet for several of the tasks. We also show improvements over other self-supervised approaches on HMDB-51 and UCF-101 even without fine-tuning the learnt representations. Finally, by leveraging the joint video and text representations, our off-the-shelf trained model also reaches state-of-the-art results on YouCook2 and CrossTask, without any training on the target datasets.
Contributions. The contributions of this work are threefold. (i) We propose a method to learn a joint text video embedding in an end-to-end fashion from unlabelled, uncurated narrated videos using the recently introduced HowTo100M  dataset. In particular, we introduce a specific loss, dubbed MIL-NCE for Multiple Instance Learning Noise Contrastive Estimation, that enables the learning to cope with the highly misaligned narration descriptions. (ii) We provide a thorough ablation study to quantitatively assess the importance of the different design choices of the approach. (iii) Finally, we demonstrate that the representations thus obtained are competitive with their strongly supervised counterparts on four downstream tasks over eight video datasets.
2 Related work
Learning visual representations from unlabeled videos. As labeling videos is cumbersome, expensive and not scalable, a significant number of prior works have studied the task of learning visual representations from unlabeled videos. Currently, the most effective approach is to collect a large amount of data from social media and use the available metadata as supervision [1, 25]. However, this metadata is often in the form of keywords or tags, rather than the (spoken) natural language considered in this work. In addition, the metadata is often platform dependent and rarely publicly available. Self-supervised approaches do not suffer from these issues as the idea is to define a supervised proxy task using labels directly generated from videos. Some of these tasks include: temporal ordering of video clips or frames [23, 44, 52, 82], predicting geometric transformations, predicting motion and appearance, predicting the future, the past or a portion of masked input in the feature space [29, 68, 72], colorizing videos, predicting 3D geometry from synthetic data, predicting the audio in a feature space [7, 41] or tasks leveraging temporal cycle consistency [22, 77]. In our work, we leverage as supervision for our proxy task the output of automatic speech recognition (ASR) run on narrated instructional videos. The nature of this supervision has the potential to also provide semantic information [50, 65], which is often missing in works that only exploit pixel-wise cues. Moreover, most of the top performing prior works only study their method on curated video datasets (e.g. Kinetics) where labels have been removed. However, this is not truly learning from unlabeled data as these videos have been carefully selected and verified to belong to classes of interest. Caron et al. further explain the performance gap between training on such curated data versus uncurated data, which is what is truly available at scale. Instead, our approach focuses on the learning of representations only from uncurated videos.
Vision, speech and language. A common alternative to training visual models using manually defined sets of labels is to exploit semantic supervision from natural language or speech. Numerous prior works [18, 20, 26, 27, 40, 49, 53, 58, 60, 84, 75, 76, 79, 80] have used image / video description datasets [46, 61, 63, 83, 86] to learn an embedding space where visual and textual data are close only if they are semantically similar. These methods either rely on manually annotated image / video description datasets, or leverage representations already pre-trained on manually labelled datasets (e.g. ImageNet  or Kinetics ). In contrast, in this work no manually annotated visual data is involved at any stage of our approach. To avoid labelling visual data, several approaches have leveraged audio transcripts obtained from narrated videos using automatic speech recognition (ASR) as a way to supervise video models for object detection [3, 15, 54], captioning [33, 69], classification [2, 42, 47, 85], summarization  or retrieval  using large-scale narrated video datasets such as How2  or HowTo100M . Others [10, 30] have investigated learning from narrated videos by directly using the raw speech waveform instead of generating transcriptions. Most related to us is the work of Miech et al.  who trained a joint video and text embedding from uncurated instructional videos . However, as opposed to our work, they do not model any misalignment issue encountered when training on such videos and rely on visual representations pretrained on Kinetics-400 and ImageNet. Building on this work, Sun et al.  have used a contrastive bidirectional transformer (CBT) to learn long term contextual video representations from instructional videos. All these works use a visual representation pre-trained on either Kinetics or ImageNet when training on such narrated videos. 
In contrast, the key innovation of our work is that we demonstrate learning a generic video representation as well as a joint video-text embedding from scratch, without pre-training on manually annotated video or image datasets.
Multiple instance learning for video understanding. Multiple instance learning methods have been employed in many weakly-supervised video understanding problems including: person recognition in movies using scripts [11, 48, 59], weakly supervised action classification [45, 66] and localization [16, 21, 78], co-reference resolution of characters in TV series or object tracking. These methods often rely on some form of max-pooling (e.g. MIL-SVM) or discriminative clustering (e.g. DIFFRAC) to resolve the label ambiguities, and have used mostly linear (or shallow) models. In this work, we present MIL-NCE, a new approach marrying the noise contrastive estimation (NCE) framework with multiple instance learning. We show that MIL-NCE is well-suited to learn deep visual representations from scratch using weak and noisy training signals available in uncurated instructional videos.
3 Leveraging Uncurated Instructional Videos
This section describes the proposed approach to train joint video and text embeddings from unlabeled narrated videos in an end-to-end fashion. To start with, we are given pairs of video clips and associated narrations. In practice, a pair is composed of a short 3.2-second video clip (32 frames at 10 FPS) together with a small number of words (not exceeding 16) that correspond to what the person is saying in the video. For example, someone might be sanding wood while mentioning the action “sanding down” or the object “sander” as illustrated in Figure 1(a). Given this input, our goal is to learn a joint embedding space where the similarity between the narration and video embeddings is high when the text and visual content are semantically similar and low otherwise, and we wish to learn this starting from raw pixels in the video and text descriptions. As illustrated in Figure 1, this is a very challenging problem due to the often severely misaligned visual descriptions.
In this work, we address this issue by introducing the MIL-NCE objective:
$$\max_{f,g} \sum_{i=1}^{n} \log \left( \frac{\sum_{(x,y)\in\mathcal{P}_i} e^{f(x)^\top g(y)}}{\sum_{(x,y)\in\mathcal{P}_i} e^{f(x)^\top g(y)} + \sum_{(x',y')\sim\mathcal{N}_i} e^{f(x')^\top g(y')}} \right) \qquad (1)$$
where $x$ represents a video clip and $y$ a narration. $f$ and $g$ are the two embedding functions that respectively operate over video and text. Given a specific $i$-th sample, we construct $\mathcal{P}_i$ to be a valid set of positive video/narration candidate pairs (see Figure 2) while conversely $\mathcal{N}_i$ refers to an associated set of negative video/narration pairs. This objective function comes down to maximizing the ratio of the sum of the positive candidate scores from $\mathcal{P}_i$ to the sum of the scores of all negatives sampled from $\mathcal{N}_i$, where the score is measured by the exponentiated dot product of the corresponding video and language embeddings, $f(x)$ and $g(y)$.
In the following, we describe more precisely the motivation behind the MIL-NCE objective (1). First, Section 3.1 introduces the chosen probabilistic model for joint text and video embedding. Given that model, Section 3.2 details the choice behind the training objective (1), explaining how it is specifically adapted to handle the misalignment noise inherent in narrated videos in comparison with existing approaches.
3.1 A simple joint probabilistic model
In the following, $x \in \mathcal{X}$ stands for a video clip and $y \in \mathcal{Y}$ for a narration. Given a set of $n$ pairs of video clips and associated narrations $\{(x_i, y_i)\}_{i=1}^{n}$ sampled from the joint data distribution $P(\mathcal{X} \times \mathcal{Y})$, our goal is to learn a joint embedding space where semantically related videos and texts are close and far away otherwise.
Formally, we learn two parametrized mappings: $f: \mathcal{X} \to \mathbb{R}^d$ maps a video clip $x$ into a $d$-dimensional vector $f(x)$, and $g: \mathcal{Y} \to \mathbb{R}^d$ maps a narration $y$ into the same $d$-dimensional vector space, $g(y) \in \mathbb{R}^d$. We assume that we can estimate up to a constant factor the joint probability of a pair of video and narration $(x, y)$ by exponentiating the dot product of the two embeddings:
$$p(x, y; f, g) \propto e^{f(x)^\top g(y)}. \qquad (2)$$
In this work, $f$ takes the form of a CNN that runs over a fixed-length clip. For $g$, we consider simple sentence-based models that transform a set of words into a single vector. Note that, for simplicity and with a slight abuse of notation, we refer to $f$ (or $g$) as both a function and the parameters that define it. Also, we will refer to (2) as simply $p(x, y)$, i.e. we keep the dependence on $f$ and $g$ implicit for the sake of simpler equations. More details about the exact architecture of the models are provided in Section 4.
3.2 Learning from uncurated data: MIL-NCE
Recall that our goal is to learn a joint video and text representation only from uncurated narrated videos. In this section, we start by detailing why this is a highly challenging endeavor due to misalignments present in that data. Next, we explain how the introduced MIL-NCE objective (1) enables learning despite that noise. Finally, we contrast our proposed approach with similar works in the self-supervised domain.
Misalignment in narrated videos. In , the authors estimate that around 50% of clip-narration pairs from the HowTo100M dataset are not aligned. In fact, people are likely to describe an event after or before performing it in the video as illustrated in Figure 1. This visual misalignment makes it more challenging to learn video representations than with manually annotated and aligned labels.
How to learn despite noisy supervision? To address the aforementioned issues, we propose to consider multiple options for matching a video and a narration instead of only comparing a single video with a single narration as done in prior work. Consider the example illustrated in Figure 1(a). Given a clip $x$, $K$ narrations $\{y_k\}_{k=1}^{K}$ that happen close in time within the same video can be considered as positive candidates. By doing so, the chance that spoken words correlate with what is happening in the video increases. In that case, we would like to match at least one of the narrations $\{y_k\}_{k=1}^{K}$ with video $x$. Given the probabilistic model (2), a natural way to express this is by computing the joint probability of $x$ happening with any of the $y_k$. Because we can make the assumption that the $y_k$'s are mutually exclusive (i.e. neighbouring narrations are never repeated twice), this can be expressed mathematically by (2) as follows:
$$p\Big(\bigcup_{k} \{(x, y_k)\}\Big) = \sum_{k} p(x, y_k) \propto \sum_{k} e^{f(x)^\top g(y_k)}. \qquad (3)$$
This is a MIL-like extension; but note that it allows multiple $y_k$'s to be matched with a single video $x$, i.e. it does not restrict the match to only one $y_k$ from the set $\{y_k\}_{k=1}^{K}$. More generally, and symmetrically, the case where several video clips are candidates for a given narration can also be envisioned. Hence, for generality, we assume that instead of having a single pair $(x, y)$, we have a set of candidate positive pairs $\mathcal{P} = \{(x^k, y^k)\}_{k=1}^{K}$, and we can simply repurpose (3) as $p(\mathcal{P}) \propto \sum_{(x,y)\in\mathcal{P}} e^{f(x)^\top g(y)}$. We denote by $\{\mathcal{P}_i\}_{i=1}^{n}$ the training set of candidate positives deduced from the original training set $\{(x_i, y_i)\}_{i=1}^{n}$. With this extension, we have the tools to address the misalignment problem. Practical details about how to construct $\mathcal{P}_i$ are given in Section 4.1.
How to train this model? MIL-NCE. We wish to learn a video representation based on the previously described probabilistic model $p(\mathcal{P})$. However, this is challenging as one cannot directly apply standard generative techniques such as maximum likelihood due to the intractability of computing the normalization constant over all possible pairs of videos and narrations. Instead, we rely on a discriminative technique, namely the noise-contrastive estimation (NCE) approach [28, 37], that has recently been shown to be effective in the context of feature learning [31, 55]. The core idea is to directly optimize the unnormalized probabilistic model (3) to discriminate between data obtained from the true joint distribution and some artificially generated noise data, a.k.a. “negatives”. In this work, we use the softmax version of NCE:
$$\max_{f,g} \sum_{i=1}^{n} \log \left( \frac{e^{f(x_i)^\top g(y_i)}}{e^{f(x_i)^\top g(y_i)} + \sum_{(x',y')\sim\mathcal{N}_i} e^{f(x')^\top g(y')}} \right) \qquad (4)$$
and replacing the probability of a single positive match, $e^{f(x_i)^\top g(y_i)}$, with our MIL-like extension, $\sum_{(x,y)\in\mathcal{P}_i} e^{f(x)^\top g(y)}$, gives our proposed MIL-NCE training objective (1). Given this, we can simply estimate the parameters of our model by maximizing the objective (1), where $\mathcal{N}_i$ is a specific set of negatives for the $i$-th sample. Next, we discuss how our approach differs from previous related work.
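The objective for a single sample can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names `dot`, `logsumexp` and `mil_nce_loss` are hypothetical, embeddings are plain lists of floats, and the denominator is assumed to contain both the positive and the negative scores, as in the softmax version of NCE. The function returns the negative log of the ratio, so minimizing it corresponds to maximizing the objective.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def mil_nce_loss(positives, negatives):
    """Negative log of the MIL-NCE ratio for one sample.

    positives, negatives: lists of (video_embedding, text_embedding) pairs.
    The score of a pair is the dot product of its two embeddings.
    """
    pos_scores = [dot(v, t) for v, t in positives]
    neg_scores = [dot(v, t) for v, t in negatives]
    # log(sum over positives and negatives) - log(sum over positives)
    return logsumexp(pos_scores + neg_scores) - logsumexp(pos_scores)
```

With a single positive pair this reduces to the standard softmax NCE loss; adding a well-matched candidate to the positive bag can only lower the loss, which is how the bag absorbs misaligned narrations.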
NCE objectives for self-supervised learning. NCE has recently been successfully applied to self-supervision. In particular, CPC [31, 55] introduces the InfoNCE loss to enforce the model to maximize the conditional probability of some targets (e.g. the bottom part of the image) conditioned on some context (e.g. the top part of the image). Differently from CPC, which creates an asymmetric set of negatives by fixing the context and only sampling negative targets, we instead use NCE to model the symmetric joint probability between text and video (2). For that reason, we construct $\mathcal{N}_i$ so that it contains both negatives for video $x_i$ and narration $y_i$. In Section 4, we describe precisely how $\mathcal{N}_i$ is obtained as well as evaluate the benefit of this symmetric approach.
4 Experiments
We first describe implementation details of our method in Section 4.1. The eight datasets used in our evaluation are outlined in Section 4.2. We present a thorough ablation study emphasizing key ingredients of our approach in Section 4.3. Finally, we compare our learnt representations to previous self-supervised and fully-supervised methods in Section 4.4.
| Global avg pool | 1×1×1×1024 |
| Linear + ReLU |
4.1 Implementation details
Model and Inputs. For the 3D CNN backbone, we use the standard I3D implementation from . We use the Google News self-supervised pre-trained word2vec (d=300) embedding from  for our word representation. Each video clip at training contains 32 frames sampled at 10 fps (3.2 seconds) with a 200x200 resolution (224x224 at test time). For each narration, we take a maximum of 16 words. More details about the model architecture and input dimensions are provided in Table 1. A detailed illustration of the architecture is also given in Appendix B.
Visual representations evaluation. We evaluate our visual representations at two different semantic levels. First, we use the output of the I3D Global avg pool (see Table 1), to evaluate our representation for action recognition, action segmentation and action localization. Next, the output of the last I3D Linear layer (see Table 1), which maps the video to the joint text-video semantic space, is used in conjunction with the output of the language model for the text-video retrieval tasks.
Training dataset. We train our model using the HowTo100M narrated video dataset. It consists of more than 1.2M videos accompanied with automatically generated speech transcription. We use the provided transcription to create pairs of video / caption defined by each caption time stamp. Note that while the original dataset consists of 136M pairs, we only used 120M pairs to comply with the YouTube wipeout policy. Each video shorter than 5 seconds is extended symmetrically in time so that the duration is at least 5 seconds. Then we randomly sample a fixed-length clip of 3.2 seconds within each video at training. For each clip-narration training pair $(x, y)$ sampled, we construct the bag of positive candidate pairs $\mathcal{P}$ by considering the nearest captions in time to $y$ as depicted in Figure 1(a). For example, if we set the number of positive candidate pairs to 3, we would have $\mathcal{P} = \{(x, y), (x, y^1), (x, y^2)\}$ where $y^1$ and $y^2$ are the 2 closest narrations in time to $y$. For the negative set $\mathcal{N}$, we first fix $x$ and take negative narrations from other samples within the batch, and symmetrically fix the narration $y$ and take negative videos from other samples in the batch.
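The construction of the positive bag can be illustrated with a small sketch. It assumes that each caption is stored as a `(start_time_in_seconds, text)` tuple and that temporal distance is measured between caption start times; both are assumptions made for illustration, and the function name `nearest_captions` is hypothetical.

```python
def nearest_captions(captions, anchor_index, num_candidates):
    """Return the texts of the `num_candidates` captions closest in time
    to the anchor caption (the anchor itself comes first).

    captions: list of (start_time_in_seconds, text) tuples from one video.
    anchor_index: index of the caption originally paired with the clip.
    """
    anchor_time = captions[anchor_index][0]
    # Sort caption indices by temporal distance to the anchor;
    # ties are broken by position in the video.
    order = sorted(
        range(len(captions)),
        key=lambda j: (abs(captions[j][0] - anchor_time), j),
    )
    return [captions[j][1] for j in order[:num_candidates]]


captions = [
    (0.0, "welcome back"),
    (4.0, "grab the sander"),
    (6.0, "sanding down"),
    (9.0, "wipe off the dust"),
    (30.0, "thanks for watching"),
]
bag = nearest_captions(captions, anchor_index=2, num_candidates=3)
# bag == ["sanding down", "grab the sander", "wipe off the dust"]
```

Each text in the returned bag is then paired with the same clip to form the candidate positive pairs.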
Optimization. We use the ADAM optimizer with a linear warm-up of the learning rate over 5k steps. The learning rate is decayed twice by a factor of 10. We train our model using Cloud TPUs v3 (https://cloud.google.com/tpu/), each Cloud TPU having a batch size of 128 videos. Given the high computational load required for training on HowTo100M, we run ablation studies on 4 Cloud TPUs and train our model for 500k steps (about 3 days). For our final evaluation in Section 4.4, we pick the best parameters based on our ablation study and then use 64 Cloud TPUs for 400k steps (also about 3 days), as we observed that training with a bigger batch size, and thus for more epochs, had a positive impact on performance.
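A schedule of this shape (linear warm-up followed by two step decays by a factor of 10) could be written as below. This is only a sketch: the base learning rate and the steps at which the decays occur are not given in the text, so `base_lr` and `decay_steps` are hypothetical parameters that the caller must supply.

```python
def learning_rate(step, base_lr, warmup_steps=5000, decay_steps=(), decay_factor=10.0):
    """Learning rate at a given training step.

    Linear warm-up from base_lr / warmup_steps up to base_lr, then a
    division by `decay_factor` at each boundary listed in `decay_steps`.
    `base_lr` and `decay_steps` are assumptions, not values from the paper.
    """
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    num_decays = sum(1 for boundary in decay_steps if step >= boundary)
    return base_lr / (decay_factor ** num_decays)


# Example with made-up decay boundaries at 200k and 300k steps.
schedule = [learning_rate(s, 1.0, decay_steps=(200_000, 300_000))
            for s in (0, 4_999, 100_000, 250_000, 350_000)]
```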
4.2 Downstream tasks
To show the generality of our learnt representations, we perform evaluation on five diverse downstream tasks using eight datasets described below.
Action Recognition: HMDB-51 , UCF-101 , Kinetics-700 . We evaluate our video-only representation on the traditional HMDB-51 / UCF-101 as well as the recent Kinetics-700 action recognition tasks.
Text-to-Video retrieval: YouCook2 , MSR-VTT . We use the YouCook2 and MSR-VTT text-to-video retrieval benchmarks to evaluate our off-the-shelf learnt joint text-video representation. We follow the same evaluation protocol as described in . We report the retrieval performance using the recall at K (R@K) metric (with K=1,5,10) which measures the percentage of clips retrieved at the top K (the higher the better). We also report the median rank (MedR) of videos to be retrieved (the lower the better).
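The two retrieval metrics can be computed from a matrix of text-to-video similarity scores. The sketch below is illustrative only: the function names are hypothetical, each query is assumed to have exactly one correct video, and ties are resolved optimistically (a video with the same score as the correct one does not push it down the ranking).

```python
import statistics


def rank_of_correct(similarities, correct_index):
    """1-based rank of the correct video for one text query.

    similarities: scores of this query against every candidate video
    (higher means more similar).
    """
    target = similarities[correct_index]
    return 1 + sum(1 for s in similarities if s > target)


def recall_at_k(ranks, k):
    """Percentage of queries whose correct video is ranked in the top k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


def median_rank(ranks):
    """Median rank of the correct video over all queries (lower is better)."""
    return statistics.median(ranks)


# Three text queries against three videos; query i matches video i.
scores = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.5],
    [0.1, 0.7, 0.4],
]
ranks = [rank_of_correct(row, i) for i, row in enumerate(scores)]
# ranks == [1, 3, 2]
```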
Action Localization: YouTube-8M Segments. We evaluate our video representation on YouTube-8M Segments (https://research.google.com/youtube8m), a subset of YouTube-8M with precise temporal annotation. We follow the YouTube-8M Segments challenge evaluation protocol (https://www.kaggle.com/c/youtube8m-2019/overview/evaluation) and report the mAP metric.
Action Step Localization: CrossTask. We use the recently released CrossTask instructional video dataset to evaluate our off-the-shelf learnt joint text-video representation on the task of action step localization. We follow the same evaluation protocol as in prior work and report the average recall (CTR) metric for the localization task.
4.3 Ablation studies
We perform the ablation studies on the following downstream tasks: MSR-VTT R@10 (MR10), YouCook2 R@10 (YR10), HMDB-51 and UCF-101 recognition accuracy on split 1 and CrossTask average recall (CTR). This subset of downstream tasks has been chosen for their simplicity of evaluation and because they cover a wide range of tasks.
Which loss is better for learning the joint embedding? In this ablation study (Table 1(a)), we compare different losses for matching the text and video embeddings in the standard single-instance learning setting where we pair each video clip to its closest narration in time. We compare the NCE-based approach (ours) to the frequently used max margin ranking loss [18, 20, 32, 38, 49, 53, 75, 76, 79, 80] and a binary classification loss (i.e. sigmoid cross entropy loss) that has been shown to be effective in video-audio matching [6, 7]. Overall, the NCE loss outperforms the other losses or performs similarly on all five tested datasets.
The more negatives, the better. We keep the same single-instance learning setting and assess the quality of our representations trained with different numbers of sampled negative examples per positive pair in Table 1(b). We can see that the overall performance increases with the number of negatives. For the rest of the ablation studies, we use 512 negative samples per positive.
How many positive candidate pairs to consider? We evaluate the benefit of going from the single-instance learning approach to the proposed multiple-instance based approach in Table 1(c). In this experiment, we vary the number of positive candidate training pairs for each video clip from 1 (i.e. single-instance learning setting) up to 33 candidates. Adding candidates significantly improves the performance upon the single-instance learning baseline. Moreover, we observe a trade-off between having too many candidates and not having enough of them, as we reach the best results by considering 3 to 5 positive candidates. We believe that adding too many contextual narrations increases the chance of including irrelevant ones, as they are sampled further in time from the considered video clip. For the rest of the paper we fix the number of positive candidate pairs to 5.
MIL-NCE vs other MIL-based approaches. In Table 1(d), we compare our MIL-NCE approach with methods that can also handle multiple possible candidate captions at training time. The max-pool based approach [4, 7, 56] (Max+NCE) only optimizes over the clip-caption pair with the highest similarity score among the positive candidates. On the other hand, the attention-based approach (Attn+NCE) computes cross-modal attention weights between all the clip-caption pairs and performs a weighted average of the similarity scores in order to consider the most relevant positive candidate pairs. More details about these baselines are provided in Appendix A. Finally, we also compare to the single-instance learning baseline where we concatenate all of the candidate narrations as one longer narration (Cat+NCE). Our proposed MIL-NCE method outperforms these two standard approaches on five out of six tasks. Figure 3 illustrates examples of selected pairs from a hold-out set of HowTo100M videos, using MIL-NCE.
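The difference between the max-pool baseline and MIL-NCE can be made concrete by looking at how the positive scores enter the loss. The sketch below is one reading of the descriptions above, not the baselines as implemented by the authors: `nce_loss_with_pooling` is a hypothetical name, scores are assumed to be precomputed dot products, and the attention-based and concatenation baselines are not covered.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def nce_loss_with_pooling(pos_scores, neg_scores, pooling):
    """Softmax NCE loss where the positive bag is reduced to one term.

    pooling="mil": log-sum-exp over all positive candidates (MIL-NCE).
    pooling="max": only the highest-scoring candidate (Max+NCE).
    """
    if pooling == "mil":
        pos_term = logsumexp(pos_scores)
    elif pooling == "max":
        pos_term = max(pos_scores)
    else:
        raise ValueError(f"unknown pooling: {pooling}")
    return logsumexp([pos_term] + list(neg_scores)) - pos_term


mil = nce_loss_with_pooling([2.0, 1.0], [0.0], "mil")
mx = nce_loss_with_pooling([2.0, 1.0], [0.0], "max")
```

With max pooling only the best candidate contributes (and receives gradient), whereas the log-sum-exp lets every candidate contribute in proportion to its exponentiated score; with a single candidate the two coincide.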
Symmetric or asymmetric negative sampling? Recall that given a pair of video/narration $(x, y)$, we create $\mathcal{N}$ in a symmetric manner by sampling negative narrations for the video $x$ and negative videos for the narration $y$. Table 1(e) compares that approach to asymmetric alternatives: (i) by fixing the video $x$ and only sampling negative captions $(y|x)$ and (ii) by fixing the narration $y$ and only sampling negative videos $(x|y)$. Overall, the best results are achieved when sampling jointly the negatives $(x, y)$, i.e. when we equally sample both video and narration negatives.
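The three sampling schemes can be expressed as index pairs into a batch. This is a simplified sketch under the assumption that negatives are drawn from the other samples of the same batch and that each sample contributes one video and one narration; the function name and the mode strings are hypothetical.

```python
def negative_pairs(batch_size, i, mode="symmetric"):
    """Index pairs (video_index, narration_index) used as negatives for sample i.

    mode="fixed_video":     keep video i, pair it with other narrations.
    mode="fixed_narration": keep narration i, pair it with other videos.
    mode="symmetric":       the union of both sets.
    """
    neg_narrations = [(i, j) for j in range(batch_size) if j != i]
    neg_videos = [(j, i) for j in range(batch_size) if j != i]
    if mode == "symmetric":
        return neg_narrations + neg_videos
    if mode == "fixed_video":
        return neg_narrations
    if mode == "fixed_narration":
        return neg_videos
    raise ValueError(f"unknown mode: {mode}")


pairs = negative_pairs(batch_size=3, i=0)
# pairs == [(0, 1), (0, 2), (1, 0), (2, 0)]
```

In the symmetric case each sample gets twice as many negatives as in either asymmetric case for the same batch size.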
Which language model? Finally, we also experiment with different language models (1 layer LSTM or GRU, 1 layer and 8 attention heads Transformer and NetVLAD with 32 clusters) and compare them to our simple model (see Table 1) in Table 1(f). Even though our language model is similar to a simple bag-of-words approach, it performs better on average and is more consistent over the five tasks than the other models. In particular, our model significantly outperforms the other language models on the text-to-video retrieval tasks (YR10 and MR10), where language plays the most important role. We believe that sophisticated language understanding is not key for our learning task. Instead, most of the time, detecting and matching the main keywords in each narration is enough.
|Shuffle and Learn *||S3D||✗||35.8||68.7|
|Wang et al. ||C3D||✗||33.4||61.2|
|Fernando et al. ||AlexNet||✗||32.5||60.3|
4.4 Comparison to the state-of-the-art
Video only representation. In Table 3, we evaluate our learnt representations on the HMDB-51 and UCF-101 action recognition benchmarks by extracting average-pooled Mixed_5c features from the HowTo100M pretrained I3D. More specifically, we compare to self-supervised approaches which, similarly to our work, do not make use of any annotated video or image dataset when training the visual representation. For AVTS, we report performance with the same I3D backbone as ours. Our learnt representation significantly outperforms prior work. Importantly, we achieve state-of-the-art over self-supervised approaches on HMDB-51 only by training a linear classifier on top of the frozen representation. In contrast, all the other top performing approaches require fine-tuning their representation. This result is significant as it demonstrates the generalization of our representation to diverse sets of actions despite being trained on uncurated instructional videos. Finally, fine-tuning I3D leads to further improvements.
We split videos into subsequent clips of 1.5 seconds and represent them by concatenating three features: the local representation from I3D, the global average pooled representation across the entire video and the relative positional embedding of the video clip. We train a logistic regression to predict the label for each clip. We compare our HowTo100M pretrained I3D network to an I3D fully supervised on Kinetics-400 or Kinetics-700 as well as a ResNet-50 fully supervised on ImageNet. We also compare to the state-of-the-art approach on COIN, CBT, which relies on a fully supervised S3D trained on Kinetics-600. Our learnt representation performs better than representations trained on Kinetics-400, Kinetics-700 or ImageNet. Moreover, our method also significantly outperforms the state-of-the-art CBT despite their use of a fully-supervised representation trained on Kinetics-600 and a Transformer model.
We also report performance on the recently released YouTube-8M Segments dataset in Table 3(b). Since no results have been published for this benchmark yet, we only compare the classification performance using different fully-supervised representations (i.e. I3D trained on Kinetics-400 / Kinetics-700 or ResNet-50 trained on ImageNet). Here again, our learnt representation outperforms all of the fully-supervised counterparts despite the domain gap between YouTube-8M and uncurated instructional videos.
Finally, in Table 3(c) we investigate the benefit of initializing an I3D model with our learned weights for a large-scale action recognition dataset (i.e. Kinetics-700). We compare to a randomly initialized I3D and one inflated from an Inception network pretrained on ImageNet. We obtain an improvement over a randomly initialized I3D, indicating that our representation provides a good initialization. More interestingly, it is also a better initialization than an I3D inflated from an ImageNet pretrained network.
Joint text-video representation. We report text-to-video retrieval results on the YouCook2 (Table 4(a)) and MSR-VTT (Table 4(b)) using our off-the-shelf model trained on HowTo100M. Note that our model has not seen any YouCook2 nor MSR-VTT annotated videos, hence for fair comparison on the MSR-VTT dataset we only compare to prior work  that did not finetune on MSR-VTT. On YouCook2, our off-the-shelf model significantly outperforms all prior work. More specifically, it performs better than prior state-of-the-art  which uses visual representation trained on Kinetics-400 + ImageNet and trains the joint text-video representation on both HowTo100M and YouCook2. On MSR-VTT, our method performs slightly better than  but without using any manually annotated dataset. Finally, we also evaluate our off-the-shelf model trained on HowTo100M on the CrossTask  action localization benchmark in Table 3(d). We perform localization via a video-to-text retrieval approach similarly to . Our method outperforms state-of-the-art approaches on this benchmark, here again, without using manual supervision.
We have addressed the challenging task of learning visual representations from scratch using uncurated instructional videos. Our approach did not rely on any manually annotated video nor image dataset. To deal with highly misaligned narrations and videos, we have introduced MIL-NCE, a multiple instance learning approach derived from the noise contrastive estimation framework. We have applied MIL-NCE to the uncurated HowTo100M dataset and obtained strong visual representations that outperformed self-supervised as well as fully-supervised representations on many downstream tasks. More generally, we believe MIL-NCE can be applicable in many multiple instance learning problems where representation learning is key.
We would like to thank Relja Arandjelović, Pauline Luc, and Gunnar Sigurdsson for helpful discussions. The project was partially supported by Antoine Miech's Google PhD fellowship, the ERC grant LEAP (No. 336845), the CIFAR Learning in Machines & Brains program, and the European Regional Development Fund under the project IMPACT (reg. no. CZ.02.1.01/0.0/0.0/15_003/0000468).
Appendix A Max+NCE and Attn+NCE baselines
We use the same notation as in the main paper for this section.
Max+NCE. This baseline aims at reproducing the standard max-pool based approach often used in multiple instance learning, but here combined with the NCE loss. Formally, this can be written as maximizing the following objective:
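A sketch of this objective, assuming the main paper's notation in which $f$ and $g$ are the video and text embedding functions, $\mathcal{P}_i$ is the set of candidate positive clip-narration pairs of the $i$-th training sample, and $\mathcal{N}_i$ is its set of negative pairs:

```latex
\max_{f,g} \sum_{i=1}^{n} \log \left(
  \frac{\max_{(x,y)\in\mathcal{P}_i} e^{f(x)^\top g(y)}}
       {\max_{(x,y)\in\mathcal{P}_i} e^{f(x)^\top g(y)}
        + \sum_{(x',y')\in\mathcal{N}_i} e^{f(x')^\top g(y')}}
\right)
```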
Intuitively, this corresponds to choosing the best positive candidate pair among all pairs according to the model.
Attn+NCE. This other baseline aims at selecting the best candidate pairs via a cross-modal soft-attention mechanism between the clips and narrations. The cross-modal attention mechanism is defined as follows:
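Assuming the attention weights form a softmax over the candidate positive pairs in $\mathcal{P}_i$, computed from two attention functions written here as $f_a$ and $g_a$ (notation assumed for this sketch), the definition can be written as:

```latex
a(x, y, \mathcal{P}_i) =
  \frac{e^{f_a(x)^\top g_a(y)}}
       {\sum_{(x',y')\in\mathcal{P}_i} e^{f_a(x')^\top g_a(y')}}
```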
where $f_a$ and $g_a$ are two parametrized functions. In practice, $f_a$ and $g_a$ share parameters with $f$ and $g$, respectively, except for the last ‘Linear’ layer (see Figure 4). Given this cross-modal attention mechanism, the Attn+NCE objective is:
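A sketch of this objective, assuming the attention weights $a(x, y, \mathcal{P}_i)$ sum to one over the candidate positive pairs in $\mathcal{P}_i$ and are used to pool the positive scores, while the negatives in $\mathcal{N}_i$ are handled as in the standard NCE loss:

```latex
\max_{f,g,a} \sum_{i=1}^{n} \log \left(
  \frac{e^{\sum_{(x,y)\in\mathcal{P}_i} a(x,y,\mathcal{P}_i)\, f(x)^\top g(y)}}
       {e^{\sum_{(x,y)\in\mathcal{P}_i} a(x,y,\mathcal{P}_i)\, f(x)^\top g(y)}
        + \sum_{(x',y')\in\mathcal{N}_i} e^{f(x')^\top g(y')}}
\right)
```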
The intuition behind this approach is to allow the model to have a separate selection mechanism for the positive candidate pairs.
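The two baselines differ only in how the scores of the candidate positive pairs are pooled before entering the NCE ratio. The sketch below makes this concrete for a single training sample, assuming the pair scores (dot products between clip and narration embeddings) and the attention logits have already been computed. All function and argument names are hypothetical, and each function returns the negative of the per-sample log term, i.e. a loss to minimize.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def softmax(logits):
    """Softmax over a list of floats, shifted by the max for stability."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def max_nce_loss(pos_scores, neg_scores):
    """Max+NCE: keep only the highest-scoring candidate positive pair."""
    pooled = max(pos_scores)
    # -log( e^pooled / (e^pooled + sum over negatives of e^score) )
    return log_sum_exp([pooled] + list(neg_scores)) - pooled


def attn_nce_loss(pos_scores, attn_logits, neg_scores):
    """Attn+NCE: pool positive scores with soft-attention weights."""
    weights = softmax(attn_logits)  # one weight per candidate positive pair
    pooled = sum(w * s for w, s in zip(weights, pos_scores))
    return log_sum_exp([pooled] + list(neg_scores)) - pooled


if __name__ == "__main__":
    pos, neg = [0.0, 2.0], [1.0]
    print(max_nce_loss(pos, neg))
    print(attn_nce_loss(pos, [0.0, 0.0], neg))
```

When the attention concentrates all its mass on the best-scoring positive pair, the Attn+NCE value coincides with the Max+NCE value; with uniform attention it reduces to average pooling of the positive scores.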
Appendix B Model architecture
Figure 4 provides an illustration of the video model and text model used in the main paper.