Communicating about the visual world using language is a key ability of humans as intelligent beings. A three-year-old child can manipulate objects, observe its own actions and describe them to others using language, while adults can learn new skills by reading books or watching videos. This interplay between video and language extends naturally to artificial agents that need to understand the visual world and communicate about it with people. Examples of tasks that still represent a significant challenge for current artificial systems include text-to-video retrieval [21, 28, 48, 49, 57], text-based action or event localization, video captioning [30, 55], and video question answering [45, 57]. Yet, progress on these problems is important for a host of applications, from searching video archives to human-robot communication.
A common approach to modeling visual concepts described with language is to learn a mapping of text and video into a shared embedding space, where related text fragments and video clips are close to each other [11, 28, 31, 32, 53]. Learning a good representation often requires a large set of paired video clips and text captions. In fact, given the huge variability of video scenes and their textual descriptions, learning a generic embedding space may require millions of paired video clips and text captions. However, existing datasets (e.g., MSR-VTT, DiDeMo, EPIC-KITCHENS) are on the scale of tens to hundreds of thousands of such pairs, all annotated manually. Manual collection of such datasets is expensive and hard to scale. It is also subjective, since video annotation can often be an ill-defined task with low annotator consistency.
In this work, we explore a different source of supervision to obtain paired video clips and text captions for learning joint representations of video and language. We observe that narrated instructional videos are available in large quantities (e.g., on YouTube) and provide a large amount of visual and language data. In particular, instructional videos [1, 26, 62] often contain narration with an explicit intention of explaining the visual content on screen. To leverage this rich source of data, we collect a new large-scale dataset containing 136 million video clips sourced from 1.22 million narrated instructional videos depicting humans performing more than 23,000 different tasks. Each clip is paired with a text annotation in the form of an automatically transcribed narration.
Contributions. The contributions of this work are three-fold. First, we collect a new dataset of closed-captioned video clips, HowTo100M, that is orders of magnitude larger than any other existing video-text dataset (Section 3). Second, we show that such data can be used to learn powerful video-language representations. Our model (Section 4), trained on HowTo100M, sets a new state-of-the-art for text-based action localization and text-to-video retrieval on existing datasets of instructional videos, YouCook2 and CrossTask. Finally, we explore the ability of models trained on our data to transfer to non-instructional videos. In particular, we demonstrate that models pretrained on HowTo100M can be successfully transferred by fine-tuning on the MSR-VTT dataset (generic YouTube videos) and the LSMDC dataset (movies).
2 Related work
A significant number of computer vision applications rely on a joint understanding of visual and textual cues. These applications include automatic image and video captioning [16, 30, 54, 55], visual question answering [6, 25, 45, 57], visual content retrieval based on textual queries [28, 50, 57], temporal localization of events in videos using natural language [11, 22], and video summarization with natural language.
Vision, language and speech. A common approach to modeling vision and language is learning a joint embedding space where visual and textual cues are adjacent if and only if they are semantically similar [7, 21, 28, 31, 32, 53, 48, 49, 51]. Most of these works rely on medium-scale, well-annotated datasets in which descriptive captions are collected for each video clip. This process is costly as it requires considerable human annotation effort, making these datasets hard to scale (see Table 1). In this work, we train a joint video and language model without a single manually annotated video description by leveraging automatically transcribed narrated videos. Using the spoken text from narrated videos to supervise vision models has seen some recent interest [1, 4, 9, 26, 39, 56]. Harwath et al. utilize the raw speech waveform to supervise the visual model; however, their method does not scale, as annotators were paid to record audio descriptions for thousands of images. Chen et al. use subtitles from documentaries to automatically obtain object labels, but their focus is on learning object detectors rather than text-video embeddings, and their dataset contains only 9 documentary movies, compared to about 15 years of video content considered in this work.
[Table 1 excerpt: ANet Captions — 100k clips, 100k captions, 20,000 videos, 849h, YouTube, 2017.]
Learning from instructional videos. Instructional videos are rising in popularity in the context of learning steps of complex tasks [1, 12, 35, 36, 40, 62], visual-linguistic reference resolution [13, 14], action segmentation in long untrimmed videos, and joint learning of object states and actions. Related to our work, [1, 26, 56] also consider automatically generated transcriptions of narrated instructional videos as a source of supervision. However, as opposed to our work, these works typically extract only a small number of predefined labels from the transcriptions.
Numerous datasets of web instructional videos have been proposed over the past years [1, 26, 39, 41, 44, 61, 62]. Among the first to harvest instructional videos, Sener et al. use WikiHow, an encyclopedia of "how to" articles, to collect 17 popular physical tasks, and obtain videos by querying these tasks on YouTube. In a similar vein, the COIN and CrossTask datasets are collected by first searching for tasks on WikiHow and then for videos of each task on YouTube. We use the same approach for collecting HowTo100M. The main distinction between our dataset and previous efforts is the unprecedented scale, both in terms of variety (more than 23,000 tasks from 12 different domains) and size (136 million clips sourced from 1.2 million instructional videos).
Large scale data for model pretraining. The use of large-scale and potentially noisy data from the web is an exciting prospect for pretraining language and vision models. In natural language processing, BERT, GPT, and GPT-2 are examples of language models trained on large-scale data that achieve state-of-the-art results on many tasks. In fact, training GPT-2 on WebText, a dataset of 40GB of text from Reddit, achieves state-of-the-art results even in zero-shot settings. In vision, [24, 43] explore the use of image metadata such as Instagram hashtags to pretrain image classifiers.
We are inspired by these works and focus our efforts on learning a strong embedding for joint understanding of video and language. We demonstrate that our video-language embedding learned from millions of YouTube videos not only outperforms previous work on tasks related to instructional videos without fine-tuning, but also generalizes well to non-instructional videos with some fine-tuning. We release our dataset, feature extraction pipeline, and model parameters as a resource that the video and language community can build on.
3 The HowTo100M dataset
We collect a new dataset of narrated videos with an emphasis on instructional videos where content creators teach complex tasks. This ensures that most narrations describe the observed visual content. HowTo100M features 1.22 million videos from YouTube, with activities from domains such as cooking, hand crafting, personal care, gardening, etc. Each video is associated with a narration available as subtitles that are either written manually or are the output of an Automatic Speech Recognition (ASR) system.
3.1 Data collection
Visual tasks. With the aim of obtaining instructional videos that describe how to perform certain activities, we first start by acquiring a large list of activities using WikiHow (https://www.wikihow.com) – an online resource that contains 120,000 articles on "How to …" for a variety of domains, ranging from cooking to human relationships, structured in a hierarchy. We are primarily interested in "visual tasks" that involve some interaction with the physical world (e.g., Making peanut butter, Pruning a tree) as opposed to others that are more abstract (e.g., Ending a toxic relationship, Choosing a gift). To obtain predominantly visual tasks, we limit them to one of 12 categories (listed in Table 2). We exclude categories, such as Relationships and Finance and Business, that may be more abstract.
We further refine the set of tasks by filtering them in a semi-automatic way. In particular, we restrict the primary verb to physical actions, such as make, build and change, and discard non-physical verbs, such as be, accept and feel. This procedure yields 23,611 visual tasks in total.
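The verb-based filter above can be sketched in a few lines. Note that the verb lists here are tiny illustrative stand-ins; the actual lists used for HowTo100M are not given in this text.

```python
# Illustrative sketch of the semi-automatic task filtering described above.
# The verb sets below are small hypothetical examples, not the real lists.
PHYSICAL_VERBS = {"make", "build", "change", "cook", "install", "paint"}
ABSTRACT_VERBS = {"be", "accept", "feel", "choose", "understand"}

def is_visual_task(task_title):
    """Keep a WikiHow task if its primary (first) verb denotes a physical action.
    Verbs in neither list would need manual review; here they are dropped."""
    first_word = task_title.lower().split()[0]
    if first_word in ABSTRACT_VERBS:
        return False
    return first_word in PHYSICAL_VERBS

tasks = ["Make peanut butter", "Be a good friend", "Build a bookshelf"]
visual = [t for t in tasks if is_visual_task(t)]
```

In the real pipeline this filtering was only semi-automatic, i.e., the resulting task list was also manually checked.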
Instructional videos. We search for YouTube videos related to each task by forming a query with "how to" preceding the task name (e.g., how to paint furniture). We choose videos that have English subtitles: either uploaded manually, generated automatically by YouTube ASR, or generated automatically after translation from a different language by the YouTube API.
We improve the quality and consistency of the dataset by adopting the following criteria. We restrict to the top 200 search results, as lower-ranked results may not be related to the query task. Videos with fewer than 100 views are removed, as they are often of poor quality or amateurish. We also ignore videos whose subtitles contain fewer than 100 words, as that may be insufficient text to learn a good video-language embedding. Finally, we remove videos longer than 2,000 seconds.
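The filtering criteria above amount to a simple predicate over video metadata. A minimal sketch, where the metadata field names are hypothetical:

```python
# Sketch of the video filtering criteria listed above.
# The metadata dict keys are assumptions for illustration only.
def keep_video(meta):
    return (meta["search_rank"] <= 200          # keep only top 200 search results
            and meta["views"] >= 100            # drop rarely watched videos
            and meta["num_subtitle_words"] >= 100  # need enough text to learn from
            and meta["duration_s"] <= 2000)        # drop overly long videos

videos = [
    {"search_rank": 3, "views": 5000, "num_subtitle_words": 450, "duration_s": 380},
    {"search_rank": 250, "views": 5000, "num_subtitle_words": 450, "duration_s": 380},
    {"search_rank": 3, "views": 20, "num_subtitle_words": 450, "duration_s": 380},
]
kept = [v for v in videos if keep_video(v)]
```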
As some videos may appear in several tasks, we de-duplicate videos based on YouTube IDs. However, note that the dataset may still contain duplicates if a video was uploaded several times or edited and re-uploaded. Nevertheless, this is not a concern at our scale.
3.2 Paired video clips and captions
Subtitles are often organized as a list of text chunks (lines) and need not form complete sentences. Each line is associated with a time interval in the video, typically the duration in which the line is uttered. We select each line of the subtitles as a caption and pair it with the video clip from the time interval corresponding to that line. We show some examples of our clip-caption pairs in Figure 2.
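The pairing step above is mechanical: each timed subtitle line yields one clip-caption pair. A minimal sketch:

```python
# Minimal sketch of turning timed subtitle lines into clip-caption pairs.
# Each subtitle line keeps its own time interval; the clip is simply the
# video segment spanned by that interval.
def subtitles_to_pairs(video_id, subtitle_lines):
    """subtitle_lines: list of (start_s, end_s, text) tuples."""
    pairs = []
    for start, end, text in subtitle_lines:
        clip = {"video_id": video_id, "start": start, "end": end}
        pairs.append((clip, text))
    return pairs

pairs = subtitles_to_pairs("abc123", [(0.0, 3.9, "today we are making bread"),
                                      (3.9, 7.2, "first knead the dough")])
```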
Different from other datasets with clip-caption pairs (e.g., MSR-VTT), our captions are not manually annotated but automatically obtained through the narration. Thus, they can only be thought of as weakly paired. Typical examples of incoherence include the content producer asking viewers to subscribe to their channel, talking about something unrelated to the video, or describing something before or after it happens. Furthermore, our captions are often incomplete, lack punctuation, or are grammatically incorrect sentences, as they come from continuous narration and often ASR. We manually inspected 400 randomly sampled clip-caption pairs and found that in 51% of them, at least one object or action mentioned in the caption is visually seen in the video clip.
| Category | Tasks | Videos | Clips |
| --- | --- | --- | --- |
| Food and Entertaining | 11,504 | 497k | 54.4M |
| Home and Garden | 5,068 | 270k | 29.5M |
| Hobbies and Crafts | 4,273 | 251k | 29.8M |
| Cars & Other Vehicles | 810 | 68k | 7.8M |
| Pets and Animals | 552 | 31k | 3.5M |
| Holidays and Traditions | 411 | 27k | 3.0M |
| Personal Care and Style | 181 | 16k | 1.6M |
| Sports and Fitness | 205 | 16k | 2.0M |
| Education and Communications | 239 | 15k | 1.6M |
| Arts and Entertainment | 138 | 10k | 1.2M |
| Computers and Electronics | 58 | 5k | 0.6M |
Statistics. The initial set of visual tasks is obtained by focusing on 12 WikiHow categories. Table 2 shows the number of collected WikiHow tasks and the corresponding videos and clips per category. Figure 8 shows the first two levels of the WikiHow hierarchy: the twelve categories and their subcategories, along with the number of chosen tasks and corresponding videos in our dataset.
We compare the sizes of existing clip-caption paired datasets in Table 1. HowTo100M is several orders of magnitude larger than existing datasets and contains an unprecedented duration (15 years) of video data. However, unlike previous datasets, HowTo100M does not have clean annotated captions.
As the videos contain complex activities, they are relatively long with an average duration of 6.5 minutes. On average, a video produces 110 clip-caption pairs, with an average duration of 4 seconds per clip and 4 words (after excluding stop-words) per caption. Figure 9 shows the distribution of nouns and verbs in the captions.
Our data collection procedure assumes that searching with "How to" queries on YouTube results in mostly instructional videos. We verify this by randomly selecting 100 videos and labeling their type. 71% of the videos are found to be instructional, 12% are vlogs, and another 7% are product reviews or advertisements. Note that vlogs, reviews and ads may also contain correspondences between visual content and narration. In particular, we noticed that objects shown on screen are often mentioned in the narration. We do not discard such non-instructional videos, as they may still be useful for learning the joint embedding. Additional statistics are provided in Appendix A.
4 Text-video joint embedding model
We now present our model for learning a joint text-video embedding from the automatically paired video clips and captions in our dataset. More formally, we are given a set of n video clips and associated captions {(V_i, C_i)}_{i=1}^{n}. We denote by V ∈ R^{d_v} and C ∈ R^{d_c} the d_v- and d_c-dimensional feature representations of a video clip and a caption, respectively. Given this, our goal is to learn two mapping functions f : R^{d_v} → R^{d} and g : R^{d_c} → R^{d} that respectively embed video and caption features into a common d-dimensional space, such that the cosine similarity

    s(V, C) = <f(V), g(C)> / (||f(V)||_2 ||g(C)||_2)        (1)

is high when caption C describes the video clip V, and low otherwise.

In this work, we use a class of non-linear embedding functions from prior work on text-video embeddings, which are given by:

    f(V) = (W_1^v V + b_1^v) ∘ σ(W_2^v (W_1^v V + b_1^v) + b_2^v),        (2)
    g(C) = (W_1^c C + b_1^c) ∘ σ(W_2^c (W_1^c C + b_1^c) + b_2^c),        (3)

where W_1^v ∈ R^{d×d_v}, W_1^c ∈ R^{d×d_c}, W_2^v, W_2^c ∈ R^{d×d} and the biases b_1^v, b_1^c, b_2^v, b_2^c ∈ R^{d} are learnable parameters, σ is an element-wise sigmoid activation and ∘ is the element-wise multiplication (Hadamard product). In practice, the chosen feature and embedding dimensions result in a model composed of 67M parameters. Note that the first term on the right-hand side of Equations (2) and (3) is a linear fully-connected layer, and the second term corresponds to a context gating function with an output ranging between 0 and 1, whose role is to modulate the output of the linear layer. As a result, this embedding function can model non-linear multiplicative interactions between the dimensions of the input feature vector, which has proven effective in other text-video embedding applications.
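As an illustration, the gated embedding function described above (linear layer modulated by a sigmoid context gate) can be sketched in pure Python. This is a toy version with tiny dimensions and arbitrary weights; a real implementation would use a deep learning framework.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def linear(W, b, x):
    """W: list of rows; returns W x + b."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def gated_embedding(x, W1, b1, W2, b2):
    """Linear layer followed by context gating (elementwise):
       f(x) = (W1 x + b1) * sigmoid(W2 (W1 x + b1) + b2)."""
    h = linear(W1, b1, x)                      # linear fully-connected layer
    gate = [sigmoid(g) for g in linear(W2, b2, h)]  # gate in (0, 1)
    return [hi * gi for hi, gi in zip(h, gate)]     # modulate the linear output

# Toy example: identity linear layer, zero gate weights -> gate is 0.5 everywhere.
emb = gated_embedding([2.0, -4.0],
                      W1=[[1.0, 0.0], [0.0, 1.0]], b1=[0.0, 0.0],
                      W2=[[0.0, 0.0], [0.0, 0.0]], b2=[0.0, 0.0])
```

With zero gate weights the sigmoid outputs 0.5 for every dimension, so the toy example simply halves the input, making the gating behavior easy to check.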
Loss. We train our embedding model using the max-margin ranking loss [17, 28, 48, 49, 58]. At each iteration of our training algorithm, we sample a mini-batch B of caption-clip training pairs (V_i, C_i) and update the model parameters with a gradient step on the following loss:

    Σ_{i ∈ B} Σ_{(V', C') ∈ N_i} [ max(0, δ + s(V_i, C') − s(V_i, C_i)) + max(0, δ + s(V', C_i) − s(V_i, C_i)) ],        (7)

where s(V, C) is the similarity score (1) between video clip V and caption C, N_i is a set of negative pairs for caption-clip pair i, and δ is the margin. The first term in Equation (7) corresponds to the ranking loss when sampling a negative caption, while the second term corresponds to sampling a negative video clip. We fix the margin δ to a constant value in practice. Our model parameters are updated using Adam with a fixed learning rate. Implementation details of the loss are provided in Appendix B.
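The ranking loss can be sketched directly from its two hinge terms. A pure-Python illustration; the margin value 0.1 below is an arbitrary placeholder, since the value used in practice is not stated in this text.

```python
def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def ranking_loss(pos_pairs, negatives, delta=0.1):
    """Max-margin ranking loss over a mini-batch.
    pos_pairs: list of (clip_emb, caption_emb); negatives[i]: list of
    (neg_clip_emb, neg_caption_emb) for pair i.  delta is the margin
    (0.1 here is a placeholder, not the paper's value)."""
    loss = 0.0
    for i, (v, c) in enumerate(pos_pairs):
        s_pos = cosine(v, c)
        for nv_, nc_ in negatives[i]:
            loss += max(0.0, delta + cosine(v, nc_) - s_pos)  # negative caption
            loss += max(0.0, delta + cosine(nv_, c) - s_pos)  # negative clip
    return loss
```

A well-aligned pair whose negatives are orthogonal to it incurs zero loss, while a misaligned pair with similar negatives is penalized by both hinge terms.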
Sampling strategy. Similar to prior work, we apply an intra-video negative sampling strategy to define the negative sets N_i. We show in Section 5.3 that this approach is critical for good performance. More precisely, half of our negative pairs (V', C') are selected such that the video clip V' and the caption C' belong to the same original YouTube video as the positive pair (but are not aligned with each other), while the other half are sampled from other YouTube videos. We apply intra-negative sampling to ensure that the learned embedding focuses on relevant aspects of the video clip (e.g., the hands of the person showing how to knead dough) rather than on irrelevant background features (e.g., the kitchen). In Appendix C, we also provide an empirical analysis of the positive pair sampling strategy. We show that, even though the training data is noisy, our attempts to automatically select positive pairs during training did not yield improvements so far.
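The half intra-video, half cross-video split can be sketched as follows. This is a simplified illustration: the (video_id, clip, caption) tuple format and the exact 50/50 split logic are assumptions for the example.

```python
import random

def sample_negatives(pairs, i, n_neg, rng=random.Random(0)):
    """For anchor pair i, draw half of the negatives from clip-caption pairs of
    the same source video (intra-video negatives) and half from other videos.
    pairs: list of (video_id, clip, caption) tuples.  A sketch of the strategy."""
    vid = pairs[i][0]
    same = [j for j in range(len(pairs)) if j != i and pairs[j][0] == vid]
    other = [j for j in range(len(pairs)) if pairs[j][0] != vid]
    k = n_neg // 2
    idx = (rng.sample(same, min(k, len(same)))
           + rng.sample(other, min(n_neg - k, len(other))))
    return [pairs[j] for j in idx]

pairs = [("a", "clip0", "cap0"), ("a", "clip1", "cap1"), ("a", "clip2", "cap2"),
         ("b", "clip3", "cap3"), ("b", "clip4", "cap4"), ("b", "clip5", "cap5")]
negs = sample_negatives(pairs, 0, 4)
```

Intra-video negatives are hard negatives: they share the background and scene of the anchor clip, forcing the embedding to discriminate on the action rather than on the setting.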
5 Experiments

In this section, we demonstrate that a strong joint representation for video and text can be learned from our unlabeled HowTo100M dataset. We provide experimental results for a variety of domains, ranging from instructional videos in CrossTask and cooking videos in YouCook2 to generic YouTube videos in MSR-VTT and movie clips in LSMDC. Specifically, we evaluate our learned embedding on the tasks of localizing steps in instructional videos of CrossTask and text-based video retrieval on the YouCook2, MSR-VTT and LSMDC datasets.
Our key findings are the following: (i) For instructional video datasets, such as CrossTask and YouCook2, our off-the-shelf embedding trained on HowTo100M significantly outperforms state-of-the-art models trained on much smaller and manually annotated datasets. (ii) On generic YouTube videos (MSR-VTT), our HowTo100M embedding provides competitive retrieval performance compared to state-of-the-art methods trained on MSR-VTT. Moreover, we show that fine-tuning our pre-trained embedding model on just a fifth of the annotated videos from MSR-VTT outperforms state-of-the-art. (iii) We show that fine-tuning our embedding on the LSMDC dataset enables generalization to movie videos and scripts despite the large domain gap. (iv) Finally, we demonstrate the importance of the scale of HowTo100M for learning better joint video-text embeddings.
[Table 3: effect of the negative sampling strategy, reporting R@10 on MSR-VTT (M), LSMDC (L) and YouCook2 (Y), and average recall on CrossTask (C).]
5.1 Implementation details
Video features. We extract frame-level and video-level features with pre-trained 2D and 3D CNNs. 2D features are extracted with the ImageNet pre-trained ResNet-152 at a rate of one frame per second. 3D features are extracted with the Kinetics pre-trained ResNeXt-101 16-frame model, yielding 1.5 features per second. We aggregate features from longer video clips by temporal max-pooling and concatenate the 2D and 3D features to form a single 4096-dimensional vector for each video clip.
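The aggregation step (temporal max-pooling of per-timestep features, then concatenation of the 2D and 3D streams) can be sketched as follows; the CNN feature extraction itself is not shown, and the toy vectors here are 2-dimensional rather than the real 2048-dimensional outputs.

```python
# Sketch of clip-level feature aggregation: temporal max-pooling over
# per-timestep 2D and 3D CNN features, then concatenation.
def temporal_max_pool(features):
    """features: list of equally sized per-timestep vectors.
    Returns the dimension-wise maximum over time."""
    return [max(col) for col in zip(*features)]

def clip_feature(feats_2d, feats_3d):
    """Max-pool each stream over time, then concatenate 2D and 3D features."""
    return temporal_max_pool(feats_2d) + temporal_max_pool(feats_3d)

f = clip_feature([[0.1, 0.9], [0.4, 0.2]],   # two timesteps of 2D features
                 [[1.0, 0.0], [0.5, 2.0]])   # two timesteps of 3D features
```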
Text pre-processing. We preprocess the transcribed video narrations by discarding common English stop-words. For the word representations, we use the GoogleNews pre-trained word2vec embedding model.
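The pre-processing amounts to stop-word removal followed by an embedding-table lookup. A sketch, where the tiny stop-word set and word-vector dictionary stand in for a full stop-word list and the GoogleNews word2vec table:

```python
# Sketch of caption pre-processing: drop stop-words, then look up each
# remaining word in a pre-trained embedding table (here a tiny stand-in dict).
STOP_WORDS = {"the", "a", "to", "is", "and", "of"}  # illustrative subset

def caption_tokens(caption):
    return [w for w in caption.lower().split() if w not in STOP_WORDS]

def caption_vectors(caption, word2vec):
    """Return the embedding of each in-vocabulary, non-stop-word token."""
    return [word2vec[w] for w in caption_tokens(caption) if w in word2vec]

toy_w2v = {"knead": [0.1, 0.2], "dough": [0.3, 0.4]}  # hypothetical 2-d vectors
vecs = caption_vectors("Knead the dough", toy_w2v)
```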
Training time. Once the video and text features are extracted, training our embedding model on the full HowTo100M dataset is relatively fast and takes less than three days on a single Tesla P100 GPU.
5.2 Datasets and evaluation setups
Action step localization. We evaluate localization of action steps in instructional videos on the recent CrossTask dataset. CrossTask includes 18 tasks and 2.7k instructional videos with manually annotated action segments. Each video may contain multiple segments corresponding to different actions. It also provides, for each task, an ordered list of action steps with short natural language descriptions. We apply our model trained only on HowTo100M to the problem of step localization by computing the similarity between every frame in the video and the action label names of CrossTask. To compare with prior work on CrossTask, we follow a similar inference procedure and use the same recall metric, defined as the number of step assignments that fall into the correct ground truth interval, divided by the total number of steps in the video. Videos from the test set of CrossTask are removed from the HowTo100M training set to ensure that they are not observed at training time.
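A simplified sketch of this zero-shot localization: score every frame against every step description with the learned similarity, then assign each step to its best-scoring frame. This ignores the ordering constraints used in the actual inference procedure and uses toy embeddings in place of the learned ones.

```python
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / ((sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5))

def localize_steps(frame_embs, step_embs, similarity=cosine):
    """Assign each step description to its highest-scoring frame.
    frame_embs: one embedding per video frame (1 fps);
    step_embs: one embedding per step name."""
    return [max(range(len(frame_embs)),
                key=lambda t: similarity(frame_embs[t], s))
            for s in step_embs]

# Toy example: two frames, two steps; each step matches a different frame.
assignments = localize_steps([[1.0, 0.0], [0.0, 1.0]],
                             [[0.0, 2.0], [3.0, 0.1]])
```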
Text-based video retrieval. We also evaluate our learned embedding on the task of video clip retrieval using natural language queries. Given a textual description, the goal is to retrieve the corresponding video clip from a large pool of videos. We evaluate our learned embedding using the standard recall metrics R@1, R@5, R@10 and the median rank (Median R). We provide experimental results for the following domain-specific video description datasets.
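These metrics are easy to state precisely: given, for each text query, the rank of its ground-truth clip in the retrieved list (1 = best), R@K is the fraction of queries whose ground truth lands in the top K, and Median R is the median of those ranks.

```python
# Standard retrieval metrics computed from per-query ground-truth ranks.
def recall_at_k(ranks, k):
    """Fraction of queries whose ground-truth clip is ranked in the top k."""
    return sum(r <= k for r in ranks) / len(ranks)

def median_rank(ranks):
    """Median of the ground-truth ranks (lower is better)."""
    s = sorted(ranks)
    n = len(s)
    return (s[n // 2 - 1] + s[n // 2]) / 2 if n % 2 == 0 else s[n // 2]

ranks = [1, 3, 12, 7]  # toy ranks for four queries
```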
YouCook2 is a cooking video dataset collected from YouTube. It features 89 different recipes and 14k video clips, all annotated with textual descriptions collected from paid human workers. Since no descriptions are provided for the test set clips, we evaluate the YouCook2 clip retrieval task on the validation clips (3.5k in total). Note that we have taken care to remove the few YouCook2 validation videos that are also present in HowTo100M.
MSR-VTT is a dataset of generic videos collected from 257 popular video queries depicting 20 categories (including music, sports, movies, etc.) from YouTube. It contains 200k unique video clip-caption pairs, all annotated by paid human workers. We evaluate our model on the MSR-VTT clip retrieval test set used in prior work, as the performance of several other methods is reported on it.
LSMDC is a dataset of movie clips. It features 101k unique video clip-caption pairs. All clips are associated with a description that comes either from the movie script or from the audio description. We evaluate our model on the official LSMDC test set (https://sites.google.com/site/describingmovies/lsmdc-2016/movieretrieval), which contains 1000 video-caption pairs.
5.3 Study of negative pair sampling strategy
We first study the effect of alternative strategies for sampling negative caption-video clip pairs when training our embedding. Table 3 shows that using negatives from the same video (intra-negatives) is beneficial as compared to randomly sampling them from other YouTube videos. The improvement is particularly significant on YouCook2 and CrossTask which are more fine-grained datasets than MSR-VTT and LSMDC. For the rest of the paper, we report numbers using our model trained with the intra-negative sampling strategy.
5.4 Scale matters
A natural question is whether the large scale of our dataset is truly required to achieve high performance. To answer this, we train our embedding model on smaller subsets of our dataset. These smaller subsets of HowTo100M are created by gradually decreasing the allowed YouTube search rank (see the paragraph on data collection in Section 3.1 for more details) of training videos. We experiment with the following rank thresholds: top 2 (15k videos), top 3 (28k videos), top 5 (52k videos), top 10 (104k videos), top 20 (197k videos), top 40 (364k videos), top 80 (648k videos) and top 200 (the entire HowTo100M dataset). This process ensures that, as we reduce the size of the training dataset, we subsample training videos that are more likely to be relevant to the queried task. Figure 3 shows the average recall on CrossTask and the R@10 clip retrieval results on LSMDC, MSR-VTT and YouCook2 when varying the size of the training dataset. There is a clear and consistent improvement on all evaluated tasks with the gradual increase in the amount of training data. Interestingly, we do not observe any saturation; hence we can expect further improvements by collecting even more readily available and unlabeled video data.
| Method | Per-task recall (%) on the 18 CrossTask tasks | Average |
| --- | --- | --- |
| Fully-supervised upper-bound | 19.1, 25.3, 38.0, 37.5, 25.7, 28.2, 54.3, 25.8, 18.3, 31.2, 47.7, 12.0, 39.5, 23.4, 30.9, 41.1, 53.4, 17.3 | 31.6 |
| Ours trained on HowTo100M only | 33.5, 27.1, 36.6, 37.9, 24.1, 35.6, 32.7, 35.1, 30.7, 28.5, 43.2, 19.8, 34.7, 33.6, 40.4, 41.6, 41.9, 27.4 | 33.6 |
5.5 Comparison with state-of-the-art
CrossTask. We compare our off-the-shelf embedding trained on HowTo100M against the methods proposed by Alayrac et al. and Zhukov et al., the latter being the current state-of-the-art on CrossTask among weakly supervised methods. Note that Zhukov et al. have access to the ordered list of action labels at the task level, and narrations are the only form of supervision during training. We also report the fully-supervised upper bound obtained with a model trained on action segments with ground-truth annotation. The results are shown in Table 4. Our approach significantly outperforms the state-of-the-art, even though it has not been specifically designed for the task of step localization in videos. The improvement made by our method is consistent across all tasks (with the exception of Make Meringue), showing that the trained model is not biased towards any specific domain. The recall is above 30% for most tasks, with the most significant improvement observed for the "Add Oil to a Car" task (a boost in recall from 6.4% to 30.7%). Note that our method also outperforms the fully-supervised upper bound on average. Thus, we conclude that training on a large amount of narrated videos is better than training a step localization model on a small but carefully annotated training set.
| Method | Training data | R@1 | R@5 | R@10 | Median R |
| --- | --- | --- | --- | --- | --- |
| HGLMM FV CCA | YouCook2 | 4.6 | 14.3 | 21.6 | 75 |
| Ours | PT: HowTo100M, FT: YouCook2 | 8.2 | 24.5 | 35.3 | 24 |
YouCook2 does not provide an official benchmark nor any reported numbers for clip retrieval. As a consequence, we have applied a state-of-the-art text-video embedding model from Klein et al. (HGLMM FV CCA) to YouCook2 using our features. We also report results of our model trained on YouCook2 instead of HowTo100M in Table 5. First, we notice that our off-the-shelf model trained on HowTo100M significantly outperforms both the exact same model trained directly on YouCook2 and the HGLMM FV CCA baseline. Furthermore, fine-tuning our HowTo100M pre-trained model on YouCook2 yields a significant improvement of 13.7% in R@10 over this baseline. In conclusion, we show that the off-the-shelf HowTo100M-trained model can outperform the state-of-the-art on this domain-specific instructional video dataset. Moreover, we demonstrate that our model can reap further benefits from fine-tuning.
| Method | Training data | R@1 | R@5 | R@10 | Median R |
| --- | --- | --- | --- | --- | --- |
| Ours | PT: HowTo100M, FT: MSR-VTT | 14.9 | 40.2 | 52.8 | 9 |
MSR-VTT. In Table 6, we compare our model trained on (i) HowTo100M only, (ii) MSR-VTT only and (iii) pre-trained on HowTo100M and then fine-tuned on MSR-VTT against prior work that directly uses MSR-VTT for training (most of these methods have been reproduced in prior work). Our off-the-shelf HowTo100M model outperforms [18, 20, 47, 58, 59], which are directly trained on MSR-VTT. Here again, after fine-tuning the HowTo100M pre-trained model on MSR-VTT, we observe a significant improvement over the state-of-the-art JSFusion trained on MSR-VTT. However, as opposed to instructional videos (CrossTask) and cooking videos (YouCook2), training our model directly on MSR-VTT performs better than our off-the-shelf model trained on HowTo100M. We believe this is because MSR-VTT videos are generic YouTube videos that differ from the instructional or VLOG types of videos that dominate HowTo100M. In Figure 4, we also investigate the impact on performance of various amounts of supervision when fine-tuning our pre-trained model. It shows that state-of-the-art performance can be attained with only a fifth of the MSR-VTT samples. This has great practical implications, as comparable performance can be obtained with significantly reduced annotation (and its implied cost).
| Method | Training data | R@1 | R@5 | R@10 | Median R |
| --- | --- | --- | --- | --- | --- |
| Ours | PT: HowTo100M, FT: LSMDC | 7.1 | 19.6 | 27.9 | 40 |
LSMDC. Finally, we compare to the state-of-the-art on LSMDC in Table 7. This dataset is even more challenging, as movie clips are quite distinct from HowTo100M videos. We compare against several prior works that have been reproduced elsewhere and are trained directly on LSMDC. Here again, pre-training our model on HowTo100M and fine-tuning it on LSMDC provides improvements over a model trained directly on LSMDC. This finding is interesting, as it shows that a HowTo100M pre-trained model can still be useful when fine-tuned on videos from a different domain.
5.6 Cross-dataset fine-tuning evaluation
In this section, we evaluate the advantage of HowTo100M for pre-training compared to pre-training on other smaller datasets. Figure 5 shows evaluation on YouCook2, MSR-VTT and LSMDC clip retrieval (R@10) using no pre-training (No PT), using pre-training on YouCook2, MSR-VTT, LSMDC and HowTo100M datasets while fine-tuning to the target dataset. For all evaluated datasets, pre-training on HowTo100M prior to fine-tuning on the target dataset consistently yields best results.
5.7 Qualitative results
Figure 6 illustrates examples of video clips retrieved from HowTo100M using our trained joint text-video embedding. Our learned representation is successful at retrieving videos for fine-grained queries such as Orchids and at finding the differences between Cut paper and Cut wood. A demo of the retrieval system is available at the project webpage: https://www.di.ens.fr/willow/research/howto100m/.
We introduced a novel video description dataset, HowTo100M, which contains more than 130M video clips extracted from 1.2M narrated web videos of people performing complex visual tasks. Our data collection method is fast, scalable and does not require any manual annotation. We use this dataset to learn a joint text-video embedding by leveraging more than 130M video clip-caption pairs. We have shown through various experiments that our learned embedding performs better than models trained on existing, carefully annotated but smaller video description datasets. We plan to release our dataset, pre-trained models and code to stimulate further research in joint video and text understanding.
The project was partially supported by Antoine Miech Google PhD fellowship, the MSR-Inria joint lab, the Louis Vuitton - ENS Chair on Artificial Intelligence, the ERC grant LEAP (No. 336845), the CIFAR Learning in Machines&Brains program, and the European Regional Development Fund under the project IMPACT (reg. no. CZ.02.1.01/0.0/0.0/15_003/0000468).
-  (2016) Unsupervised learning from narrated instruction videos. In CVPR, Cited by: §1, §2, §2, §2, §5.5, Table 4.
-  (2017) Joint discovery of object states and manipulation actions. In ICCV, Cited by: §2.
-  (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, Cited by: §5.1.
-  (2017) Discover and learn new objects from documentaries. In CVPR, Cited by: §2.
-  (2018) Scaling egocentric vision: the epic-kitchens dataset. In ECCV, Cited by: §1, Table 1.
-  (2016) Multimodal compact bilinear pooling for visual question answering and visual grounding. In EMNLP, pp. 457–468. Cited by: §2.
-  (2014) A multi-view embedding space for modeling internet images, tags, and their semantics. IJCV. Cited by: §2.
-  (2018) Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet?. In CVPR, Cited by: §5.1.
-  (2018) Jointly discovering visual objects and spoken words from raw sensory input. In ECCV, Cited by: §2.
-  (2016) Deep Residual Learning for Image Recognition. In CVPR, Cited by: §5.1.
Localizing moments in video with natural language. ICCV. Cited by: Appendix B, §1, §1, Table 1, §2, §4.
-  (2016) Connectionist temporal modeling for weakly supervised action labeling. In ECCV, Cited by: §2.
-  (2017) Unsupervised visual-linguistic reference resolution in instructional videos. In CVPR, Cited by: §2.
-  (2018) Finding ”it”: weakly-supervised reference-aware visual grounding in instructional video. In CVPR, Cited by: §2.
-  (2018) BERT: pre-training of deep bidirectional transformers for language understanding. preprint. Cited by: §2.
DenseCap: fully convolutional localization networks for dense captioning. In CVPR, Cited by: §2.
-  (2014) Deep fragment embeddings for bidirectional image sentence mapping. In NIPS, Cited by: §4.
-  (2017) Temporal tessellation: a unified approach for video analysis. In ICCV, Cited by: §5.5, Table 6, Table 7.
-  (2015) Adam: a method for stochastic optimization. In ICLR, Cited by: §4.
-  (2014) Unifying visual-semantic embeddings with multimodal neural language models. TACL. Cited by: §5.5, Table 6, Table 7.
-  (2015) Associating neural word embeddings with deep image representations using fisher vectors. In CVPR, Cited by: §1, §2, §5.5, Table 5.
-  (2017) Dense-captioning events in videos. In ICCV, Cited by: Table 1, §2.
-  (2016) TGIF: A New Dataset and Benchmark on Animated GIF Description. In CVPR, Cited by: Table 1.
-  (2018) Exploring the limits of weakly supervised pretraining. In ECCV, Cited by: §2.
-  (2015) Ask your neurons: a neural-based approach to answering questions about images. In ICCV, Cited by: §2.
-  (2015) What’s cookin’? interpreting cooking videos using text, speech and vision. NAACL. Cited by: §1, §2, §2, §2.
-  (2017) Learnable pooling with context gating for video classification. arXiv preprint arXiv:1706.06905. Cited by: §4.
-  (2018) Learning a Text-Video Embedding from Incomplete and Heterogeneous Data. arXiv preprint arXiv:1804.02516. Cited by: §1, §1, §2, §2, §4, §4.
-  (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. Cited by: §5.1.
-  (2016) Hierarchical recurrent neural encoder for video representation with application to captioning. In CVPR, pp. 1029–1038. Cited by: §1, §2.
-  (2016) Jointly modeling embedding and translation to bridge video and language. In CVPR, Cited by: §1, §2.
-  (2017) Enhancing video summarization via vision-language embedding. In CVPR, Cited by: §1, §2, §2.
-  (2018) Improving Language Understanding by Generative Pre-Training. preprint. Cited by: §2.
-  (2019) Language models are unsupervised multitask learners. preprint. Cited by: §2.
-  (2017) Weakly supervised action learning with rnn based fine-to-coarse modeling. In CVPR, Cited by: §2.
-  (2018) Action sets: weakly supervised action segmentation without ordering constraints. In CVPR, Cited by: §2.
-  (2015) A dataset for movie description. In CVPR, Cited by: Table 1.
-  (2017) Movie description. IJCV. Cited by: Table 1, §5.2, §5.
-  (2018) How2: a large-scale dataset for multimodal language understanding. In Proceedings of the Workshop on Visually Grounded Interaction and Language (ViGIL), Cited by: Table 1, §2, §2.
-  (2018) Unsupervised learning and segmentation of complex activities from video. In CVPR, Cited by: §2.
-  (2015) Unsupervised semantic parsing of video collections. In ICCV, Cited by: §2.
-  (2016) Hollywood in homes: crowdsourcing data collection for activity understanding. In ECCV, Cited by: Table 1.
-  (2017) Revisiting unreasonable effectiveness of data in deep learning era. In ICCV, Cited by: §2.
-  (2019) COIN: a large-scale dataset for comprehensive instructional video analysis. In CVPR, Cited by: §2.
-  (2016) MovieQA: understanding stories in movies through question-answering. In CVPR, Cited by: §1, §2.
-  (2015) Using descriptive video services to create a large data source for video annotation research. arXiv preprint arXiv:1503.01070. Cited by: Table 1.
-  (2016) Learning language-visual embedding for movie understanding with natural-language. arXiv preprint arXiv:1609.08124. Cited by: §5.5, Table 6, Table 7.
Learning two-branch neural networks for image-text matching tasks. PAMI. Cited by: §1, §2, §4.
-  (2016) Learning deep structure-preserving image-text embeddings. In CVPR, pp. 5005–5013. Cited by: §1, §2, §4.
-  (2018) Learning to compose topic-aware mixture of experts for zero-shot video captioning. In AAAI, Cited by: §2.
-  (2017) Sampling matters in deep embedding learning. ICCV. Cited by: §2.
-  (2016) MSR-VTT: a large video description dataset for bridging video and language. In CVPR, Cited by: §1, Table 1, §5.2, §5, §5.
-  (2015) Jointly modeling deep video and compositional text to bridge vision and language in a unified framework.. In AAAI, Vol. 5, pp. 6. Cited by: §1, §2.
-  (2016) Image captioning with semantic attention. In CVPR, pp. 4651–4659. Cited by: §2.
-  (2016) Video paragraph captioning using hierarchical recurrent neural networks. In CVPR, pp. 4584–4593. Cited by: §1, §2.
-  (2014) Instructional videos for unsupervised harvesting and learning of action examples. In ACM, Cited by: §2, §2.
-  (2018) A joint sequence fusion model for video question answering and retrieval. In ECCV, Cited by: §1, §2, §5.2, §5.5, §5.5, Table 6, Table 7.
-  (2016) Video captioning and retrieval models with semantic attention. In ECCV LSMDC2016 Workshop, Cited by: §4, §5.5, Table 6, Table 7.
-  (2017) End-to-end concept word detection for video captioning, retrieval, and question answering. In CVPR, Cited by: §5.5, Table 6, Table 7.
-  (2018) Towards automatic learning of procedures from web instructional videos. In AAAI, Cited by: §1, Table 1, §2, §5.2, §5.5, §5, §5.
-  (2019) Cross-task weakly supervised learning from instructional videos. In CVPR, Cited by: §1, §1, §2, §2, §5.2, §5.5, Table 4, §5, §5.
Overview of Appendix
Appendix A Additional details of the HowTo100M dataset
Our HowTo100M dataset is based on the hierarchy of WikiHow (https://www.wikihow.com/) tasks and spans a total of 23,611 tasks. Figure 8 visualizes the first two levels of the WikiHow hierarchy: the twelve categories and their subcategories, together with the number of underlying tasks and corresponding videos.
HowTo100M comes with transcribed narrations that often describe the content of the videos. Figure 9 shows the frequencies of nouns and verbs in the transcribed video narrations. We used the MaxEnt Treebank POS tagger to extract the nouns and verbs; please see the figure captions for additional analysis.
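As a toy illustration of how such frequency counts are obtained from tagged narrations (the tokens below are made-up examples; in practice the tags come from a MaxEnt Treebank POS tagger rather than being given by hand):

```python
from collections import Counter

# Toy transcribed narration, already POS-tagged with Penn Treebank tags
# (made-up tokens; in the paper, tags come from a MaxEnt Treebank POS tagger).
tagged = [("add", "VB"), ("the", "DT"), ("sugar", "NN"), ("and", "CC"),
          ("stir", "VB"), ("the", "DT"), ("mixture", "NN"), ("slowly", "RB")]

# Penn Treebank noun tags start with "NN" (NN, NNS, NNP, NNPS),
# verb tags with "VB" (VB, VBD, VBG, VBN, VBP, VBZ).
nouns = Counter(word for word, tag in tagged if tag.startswith("NN"))
verbs = Counter(word for word, tag in tagged if tag.startswith("VB"))
```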
Appendix B Ranking loss implementation details
In the main paper, we have defined our mini-batch ranking loss as:

$$\mathcal{L} = \sum_{i \in \mathcal{B}} \sum_{j \in \mathcal{N}(i)} \Big[ \max\big(0, \delta + s_{i,j} - s_{i,i}\big) + \max\big(0, \delta + s_{j,i} - s_{i,i}\big) \Big],$$

where $\mathcal{B}$ is the mini-batch, $\mathcal{N}(i)$ the set of negative pairs for pair $i$, $s_{i,j}$ the similarity score between clip $i$ and caption $j$, and $\delta$ the margin.
We explain next how the set of negatives $\mathcal{N}(i)$ is constructed to improve computational efficiency.
At each training iteration, we first sample $B$ unique YouTube video ids. We then sample with replacement $k$ clip-caption pairs from each of these videos. We are therefore left with a mini-batch containing $n = Bk$ clip-caption pairs, with fixed values of $B$ and $k$ in practice. In order not to waste computation, we use every other sampled mini-batch pair as a negative for each anchor pair, i.e. $\mathcal{N}(i) = \mathcal{B} \setminus \{i\}$.
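This sampling procedure can be sketched as follows (a minimal illustration; the function name, the toy `videos` dictionary, and the demo values `B=32`, `k=64` are our own placeholders, not the paper's settings):

```python
import random

def sample_minibatch(videos, B, k, rng=random):
    """Sample B unique videos, then k clip-caption pairs per video
    (with replacement), giving a mini-batch of n = B * k pairs."""
    video_ids = rng.sample(list(videos), B)  # B unique YouTube video ids
    batch = []
    for vid in video_ids:
        pairs = videos[vid]  # clip-caption pairs available in this video
        batch.extend((vid, rng.choice(pairs)) for _ in range(k))
    return batch

# Toy usage: 100 fake videos with 10 clip-caption pairs each.
videos = {"video_%d" % i: list(range(10)) for i in range(100)}
batch = sample_minibatch(videos, B=32, k=64)
assert len(batch) == 32 * 64  # n = Bk pairs per mini-batch
```

Every pair in the returned batch then serves as a negative for every other pair, so no extra sampling pass is needed.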
Doing so, the proportion of negative examples coming from the same video (intra-video) is $\frac{k-1}{Bk-1}$, while the proportion of negatives from different videos (inter-video) is $\frac{(B-1)k}{Bk-1}$. A problem with this is that the ratio between intra- and inter-video negative examples depends on the number of unique videos sampled and the number of clip-caption pairs collected per video (respectively $B$ and $k$). To address this, we re-weight the inter-video and intra-video contributions inside the triplet loss. For example, in order to sample intra-video triplets with probability $p$ (and inter-video triplets with probability $1-p$), one can equivalently weight the intra-video triplet losses by $\frac{p}{1-p}\cdot\frac{(B-1)k}{k-1}$ (thus ensuring a ratio of $\frac{p}{1-p}$ between intra-video and inter-video negative examples). This allows us to fix the intra-video to inter-video negative sampling ratio regardless of $B$ and $k$. Formally, we define the following weighting function:

$$\omega(i,j) = \begin{cases} \frac{p}{1-p}\cdot\frac{(B-1)k}{k-1} & \text{if pairs } i \text{ and } j \text{ come from the same video}, \\ 1 & \text{otherwise}. \end{cases}$$
We then use this weighting function to define the loss:

$$\mathcal{L} = \sum_{i \in \mathcal{B}} \sum_{j \in \mathcal{N}(i)} \omega(i,j) \Big[ \max\big(0, \delta + s_{i,j} - s_{i,i}\big) + \max\big(0, \delta + s_{j,i} - s_{i,i}\big) \Big].$$
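The re-weighting described above can be sketched in plain Python as follows (the parameter names `p` and `delta`, and the nested-list similarity matrix, are our own conventions for illustration, not the paper's implementation):

```python
def intra_weight(B, k, p=0.5):
    """Weight applied to intra-video negatives so that they effectively
    contribute with probability p (and inter-video ones with 1 - p),
    regardless of B and k."""
    return (p / (1.0 - p)) * ((B - 1) * k) / (k - 1)

def ranking_loss(s, video_of, B, k, delta=0.1, p=0.5):
    """Re-weighted mini-batch triplet ranking loss.
    s[i][j]: similarity between clip i and caption j;
    video_of[i]: source video of clip-caption pair i."""
    n = len(s)
    loss = 0.0
    for i in range(n):
        for j in range(n):
            if j == i:
                continue  # pair i is its own positive, not a negative
            w = intra_weight(B, k, p) if video_of[i] == video_of[j] else 1.0
            loss += w * (max(0.0, delta + s[i][j] - s[i][i])    # caption j as negative
                         + max(0.0, delta + s[j][i] - s[i][i]))  # clip j as negative
    return loss
```

With `p = 0.5`, intra- and inter-video negatives contribute equally in expectation, whatever the values of `B` and `k`.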
Appendix C Sampling strategy for positive pairs
As discussed in the main paper, narrations do not necessarily describe what is seen in the video. As a consequence, some captions from HowTo100M do not correlate with their corresponding video clips (see Figure 7). To deal with this noisy data, we experimented with a sampling strategy for positive pairs that aims to discard non-relevant video-caption pairs during training. Inspired by multiple instance learning, the idea is to select a subset of top-scoring clip-caption training pairs within each video.
In particular, given a video with $m$ clip-caption pairs, we first compute the similarity scores $s_{i,i}$ of all its pairs using the current model parameters. We then use a pre-defined max-pool rate $r \in (0, 1]$ and retain only the fraction $r$ of highest scoring positive training pairs within each video. For example, at $r = 0.5$ we retain the highest scoring half of all pairs for training.
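The selection step for one video can be sketched as follows (a minimal illustration; the function name and list-based interface are our own):

```python
def select_positive_pairs(scores, r):
    """Keep the indices of the top-scoring fraction r of one video's
    clip-caption pairs; `scores` holds their current similarities s_ii."""
    m = max(1, int(round(r * len(scores))))  # number of pairs to keep
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:m])

# With r = 0.5, only the highest scoring half of the pairs is retained:
assert select_positive_pairs([0.9, 0.1, 0.7, 0.3], r=0.5) == [0, 2]
# With r = 1.0, no filtering is applied:
assert select_positive_pairs([0.9, 0.1, 0.7, 0.3], r=1.0) == [0, 1, 2, 3]
```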
Table 8 shows results of our positive sampling strategy when varying the max pool rate $r$, with evaluation on video clip retrieval. $r = 1.0$ means that no sampling strategy is applied, as we keep all pairs as potential candidates. Interestingly, in our case, carefully selecting the positive pairs does not improve our model, as the best results are obtained with $r = 1.0$. Note that decreasing the max pool rate also decreases the number of triplet losses computed within a mini-batch by the same rate. To show that the number of triplet losses computed for each mini-batch does not impact the overall performance, we performed a sanity check experiment in Table 9, in which we replaced the max pool sampling by random sampling of pairs at the same rate. The results with random sampling are very similar to the results obtained with no max pool sampling (r=1.0) in Table 8, which confirms our finding that our model is relatively robust to noisy positive pairs. We attribute this to the fact that our model is shallow and is trained on a large amount of data.
| Max pool rate (r) | M (R@10) | L (R@10) | Y (R@10) |
| --- | --- | --- | --- |
| 1.0 (no max pool) | 29.6 | 14.0 | 24.8 |
| MP rate | RS rate | M (R@10) | L (R@10) | Y (R@10) |
| --- | --- | --- | --- | --- |