Learning Video Representations from Textual Web Supervision

by   Jonathan C. Stroud, et al.
University of Michigan

Videos found on the Internet are paired with pieces of text, such as titles and descriptions. This text typically describes the most important content in the video, such as the objects in the scene and the actions being performed. Based on this observation, we propose to use such text as a method for learning video representations. To accomplish this, we propose a data collection process and use it to collect 70M video clips shared publicly on the Internet, and we then train a model to pair each video with its associated text. We fine-tune the model on several downstream action recognition tasks, including Kinetics, HMDB-51, and UCF-101. We find that this approach is an effective method of pretraining video representations. Specifically, it leads to improvements over from-scratch training on all benchmarks, outperforms many methods for self-supervised and webly-supervised video representation learning, and achieves an improvement of 2.2% on HMDB-51.


1 Introduction

Video representations are typically learned in a fully-supervised fashion. For this approach to be successful, we require large amounts of labeled data, typically on the order of hundreds of thousands of labels. Acquiring these labels can cost tens of thousands of hours of human time to annotate [23, 10], and furthermore, when datasets become large, the benefit of gathering more labels appears to diminish [26]. At a certain point, it becomes too costly to simply label more data to improve performance. In this regime, we look to alternative sources of supervision to learn video representations without costly manual labels.

In our work, we draw this supervision from textual metadata available publicly on the Internet. Specifically, we use web videos from popular sites, where videos are associated with freeform text in the form of titles, descriptions, tags, and channel/creator names. These four pieces of textual metadata provide rich information about each video’s content. Frequently, they describe the exact types of information which labelers are asked to annotate in labeled datasets, such as objects, scenes, and human actions. For example, consider the title, “Learning how to swim!” or the channel name “PotteryMaker”. Both of these indicate what actions will take place in their respective videos, and we can leverage this information to learn representations, in much the same way we use labels in supervised learning.

The primary idea behind our approach is to use these pieces of text directly. This stands in contrast to recent work [20] in which the metadata is used indirectly, by using it to infer a class label for each example. Class labels seem like a natural choice for webly-supervised learning, as these are the most common form of supervision in strongly-supervised learning. However, class labels come from a closed vocabulary, while text is open-ended and therefore is necessarily more descriptive. Consider the title “Outdoor free-climbing in Yosemite”. If we reduce this title to the class label “rock climbing”, we are ignoring important information about the scene and the specific type of action, potentially missing out on valuable supervisory signal. In our experiments, we demonstrate that using text, and using multiple sources of text, translates into improved downstream performance. We compare our method with other webly-supervised approaches, showing that our method produces video representations which improve downstream performance by 2.2% on HMDB-51 [35] (Section 5).

Another advantage of this approach is that the amount of available data is immense; e.g. over 500 hours of content is uploaded every minute to YouTube alone [24], and each video is labeled with text. To leverage this data, we propose a data collection process called Web Videos and Text (WVT). In our approach, we use a text-based video search engine to query for common words and collect a large, uncurated dataset of videos and their matched pieces of text. Using this process, we collect 70 million videos, and the resulting dataset, WVT-70M, is 100 times larger than Kinetics [31]. This dataset is, to the best of our knowledge, the largest existing video dataset for webly-supervised learning (Section 3).

Our goal with this data is to learn video representations (feature vectors which encode a video clip) which are then useful for downstream tasks. To learn these representations, we propose a training scheme in which the video representations are used to pair each video with its associated metadata. We use powerful 3D Convolutional Neural Network (3D CNN) architectures to produce these representations, and train the video representations end-to-end on WVT-70M (Section 4). We evaluate the representations’ effectiveness by fine-tuning them on a suite of downstream tasks. We find that pre-training with this approach significantly improves downstream performance, and that webly-supervised pre-training is complementary to strongly-supervised pre-training (Section 5).

Our key finding is that textual metadata is a rich source of supervision which can be acquired freely from public sources. Specifically, in this work, we make the following contributions:

  • We propose a data collection process (WVT), which uses text-based search to gather a large, uncurated dataset of web video clips and their associated metadata, including titles, descriptions, tags, and channel names.

  • We propose a method for learning video representations by learning to match these representations with their associated metadata.

  • We demonstrate that our approach outperforms other webly-supervised and self-supervised approaches, achieving an improvement of 2.2% on HMDB-51.

2 Related Work

Webly-Supervised Learning. Many prior works have leveraged webly-labeled data for visual representation learning, both for images as well as videos. In general, these approaches use metadata found on the Internet to infer weak labels for a set of images or videos, and they differ in how these weak labels are created. The most commonly-used approach is to use image search results, and label each image with the query that was used to find it [64, 16, 54, 6, 12, 14, 22, 11, 19, 34]. Another approach is to use captions, and label each image with key words present in the caption [49, 29, 38, 47]. Other approaches use user-defined keywords or tags [21, 27, 42, 20] or algorithmically-generated topics [1, 30] to the same end. These approaches have consistently demonstrated that webly-supervised learning is scalable and that it improves performance on downstream tasks, suggesting that webly-acquired class labels provide a valuable source of supervision.

A key observation in our work is that one does not need to infer class labels in order to learn from webly-acquired metadata. In our approach, we instead use the textual metadata directly, allowing for richer information to be used as supervision. This approach is similar to that of concurrent work [39] which uses titles as a form of textual supervision. Our work differs in that we also use other forms of metadata, such as descriptions. In addition, this prior work uses curated data from Kinetics-400, while we introduce an uncurated dataset as our source of videos. These videos provide a more realistic reflection of the webly-supervised videos available in the wild.

Unsupervised and Self-Supervised Learning. Our work is also related to methods of unsupervised and self-supervised learning, which do not use metadata from the Internet, and instead only use the video and its associated audio. Video (without audio) is already a valuable source of self-supervision, and varied approaches have successfully leveraged supervision from clip and frame ordering [46, 17, 37, 65, 67, 32], geometry [18, 28], motion [52, 36], colorization [61], cycle consistency [15, 63], and video prediction [43, 41, 60, 59, 62]. Generally, these approaches are outperformed by those leveraging supervision mined from external metadata, or from the audio channel.

Audio is a convenient and strong source of supervision: convenient because videos are almost always paired with an audio channel, and strong because the audio is tightly correlated with what is happening in the video. Prior works have leveraged ambient sound [51, 5, 4, 69, 33, 50, 53], dialogue [48], and narration [68, 2, 70, 71, 45, 44, 3], all of which serve as useful signals. Those approaches using narration typically do so with instructional videos, such as in the recent HowTo100M dataset [45], since instructional videos typically contain narration which describes the actions being performed. These approaches, like ours, reap the benefits offered by rich, descriptive supervision. However, they rely on a specific genre of video content (instructional videos), which poses a potential limitation. Our work, by contrast, can work with any genre of videos.

3 Data Collection

Figure 1: Examples of video frames and metadata collected using WVT. Metadata typically contains references to actions (mowing, bbqing) as well as objects (grill, lipstick, bacon) which are present in the scene. We collect four types of metadata for each video: titles, descriptions, tags, and channel names. Metadata is truncated where necessary for ease of visualization. All videos used under CC BY 2.0 license.

To benchmark our approach, we propose a data collection process in which we search for common action categories using a text-based web-video search engine. We begin by manually selecting the set of action categories; in our experiments we use the 700 action categories in Kinetics-700 [8]. We choose these categories because they cover a broad range of human actions, and also because this allows for fair comparison with fully-supervised approaches which pre-train on Kinetics (since the specific class categories used are known to have an effect on downstream performance [20]). We use the class names as search terms and collect the resulting videos from the web. We then apply two selection criteria to filter videos. First, we discard videos which are less than 10 seconds long, since we use 10-second clips during training (a choice also made to match Kinetics). Second, we discard videos which were uploaded in the past 90 days, because older videos are less likely to be deleted in the future, allowing for improved reproducibility of our experiments. In total, we collect 100K videos from each of the 700 queries, resulting in a dataset of 70M videos. From each video, we randomly select a 10-second clip to download and later use for training.
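The two selection criteria and the clip sampling step can be sketched as follows. This is a minimal illustration, not the collection pipeline itself: the `VideoRecord` type and its fields are hypothetical, and the handling of videos that are exactly 90 days old is an assumption.

```python
import random
from dataclasses import dataclass
from datetime import date, timedelta

CLIP_SECONDS = 10   # clip length used during training
MIN_AGE_DAYS = 90   # videos uploaded in the past 90 days are discarded


@dataclass
class VideoRecord:
    """Hypothetical record for one search result."""
    video_id: str
    duration_s: float
    upload_date: date


def keep_video(video, today):
    """Apply both selection criteria: long enough and old enough."""
    long_enough = video.duration_s >= CLIP_SECONDS
    # Assumption: a video exactly 90 days old still counts as recent.
    old_enough = (today - video.upload_date) > timedelta(days=MIN_AGE_DAYS)
    return long_enough and old_enough


def sample_clip(video, rng):
    """Choose a random 10-second window; returns (start, end) in seconds."""
    start = rng.uniform(0.0, video.duration_s - CLIP_SECONDS)
    return start, start + CLIP_SECONDS
```

Passing an explicit `random.Random` instance keeps the clip selection reproducible, which matters if the collection is repeated later.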

Each video is paired with four pieces of textual metadata: its title, description, tags, and channel name. These were chosen for two reasons. First, these pieces of text are all manually written by the user, as opposed to being automatically generated. This is desirable because the inner workings of automatically generated metadata (such as YouTube’s “topics” [1]) are unknown, and could potentially be generated via content-based models trained on our target datasets, allowing these labels to leak into the training set. By relying only on manually-annotated metadata, we avoid this potential issue. Second, from manual inspection, we see that these pieces of text consistently contain informative references to content in the video. These references are written deliberately by the user, who generally will choose a title, description, and tags which help other users find their video. The user will also select a channel name (an identifier used to represent the user) which is informative, typically one which is indicative of the types of videos that the channel contains. The channel name provides context which the other signals may not. For example, a channel for guitar lessons, “Jeff’s Guitar Lessons”, may not explicitly say “guitar lesson” in each video title, but the channel name makes this obvious. For some examples of videos and their metadata, see Figure 1.

Like many approaches towards webly-supervised learning, we rely on a search engine to collect data. This again raises the question of “leakage” from the test set into the training set: if content-based models (possibly trained on our target datasets) are used to generate the search results, does this introduce the possibility of labels (in the form of search terms) leaking into our training set? In our case, no: since we do not use the search terms as labels, labels cannot leak into the training set through the search engine. This still allows for the possibility that the videos are indirectly “curated”, that is, the resulting videos may be more neatly divided into class categories than what could be achieved without content-based search. However, it is still standard practice to use search results for “uncurated” data collection [45], because search provides a practical method for acquiring large amounts of data from the Internet.

We note that WVT is a data collection process, and the datasets used in this work (denoted WVT-X) are not intended to serve as static datasets. This is made possible by the fact that our data collection process is entirely automatic, and does not rely on any manual annotation. Therefore, it is possible for WVT to be repeated flexibly at any scale and with any desired action vocabulary, depending on the needs of the downstream tasks. This provides an additional advantage over web-scale labeled datasets (such as Kinetics [31]), in which videos can be deleted by their owners at any time. When a labeled video is deleted, it leads to a decay in the number of labeled videos, but in our case, no videos are labeled, so we can simply repeat WVT to account for the lost videos.

In Table 1, we compare WVT-70M to other webly-supervised datasets for video representation learning. In terms of the number of videos, WVT-70M is on par with the largest datasets in prior work, with 5M more unique source videos than [20]. We acknowledge that, conceptually, any of these prior datasets could be scaled to much larger sizes simply by collecting more data, making dataset size a dubious method of comparison. However, it is still important to study how these methods behave when scaled to extreme dataset sizes, and therefore our experiments on 70M videos are a valuable contribution in this space. These experiments are particularly important because there are non-trivial issues associated with scaling webly-supervised learning to extreme dataset sizes. The key issue is that we use search results to collect data, and the quality of these results declines as we move deeper into the search rankings to collect more videos.

Dataset            #Videos   Duration (hrs)   Supervision
Sports-1M [30]     1.1M      15K              Topics
Youtube-8M [1]     8M        500K             Topics
HowTo100M [45]     1.2M      136K             Speech
IG-Kinetics [20]   65M       72K              Hashtags
WVT-70M (ours)     70M       194K             Text Metadata

Table 1: Datasets for webly-supervised video representation learning. WVT-70M contains 70 million clips, each from a unique source video, and each video is paired with textual metadata.

Figure 2: Scaling properties of WVT. Left: Rate of missing descriptions and tags, and number of tags. Both descriptions and tags are empty for a large number of videos, at all dataset sizes. Right: Mean length (in words) of each metadata type. Descriptions and tags tend to get shorter with larger dataset sizes, but titles and channel names tend to get longer.

To analyze the scaling properties of WVT, we collect increasingly-large subsets of the dataset and measure indicators of their quality, shown in Figure 2. The dataset size is scaled up as one would do in practice, by selecting more and more of the top search results from each query, rather than by performing a random sample from the full WVT-70M dataset. The indicators measure, for each piece of metadata, the mean length (in words), the rate of missing-ness (for descriptions and tags, which can be omitted by the user), and the mean number of tags. We find that search results are imbalanced in terms of how these indicators are distributed. Specifically, descriptions and tags tend to get shorter with larger dataset sizes, but titles and channel names in fact get longer. We also find that the percentage of videos which have any tags or a description stays relatively constant, but the average number of unique tags drops. These analyses indicate that the quality of descriptions and tags tends to decrease, that is, they get shorter and therefore less descriptive, for larger dataset sizes. Notably, we do not see the same for titles or channel names, indicating that these may be a more reliable source of supervision at the largest dataset sizes. This is reflected in our experiments in Section 5.2, where we find that using all sources of metadata is helpful for smaller dataset sizes, but that these additional sources of metadata reduce performance when scaled to the largest dataset sizes.
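The indicators described above could be computed along the following lines. This is a rough sketch under stated assumptions: each video is a plain dictionary with hypothetical keys `title`, `description`, `channel`, and `tags`; words are split on whitespace; and the length of the tags field is measured after joining all tags into one string.

```python
def word_count(text):
    """Number of whitespace-separated words in a piece of text."""
    return len(text.split())


def quality_indicators(videos):
    """Summarize metadata quality for a list of video dictionaries."""
    n = len(videos)
    return {
        "mean_title_words": sum(word_count(v["title"]) for v in videos) / n,
        "mean_description_words": sum(word_count(v["description"]) for v in videos) / n,
        "mean_channel_words": sum(word_count(v["channel"]) for v in videos) / n,
        # Assumption: tag length is measured on the space-joined tag list.
        "mean_tag_words": sum(word_count(" ".join(v["tags"])) for v in videos) / n,
        # Missing-ness applies to the fields a user may leave empty.
        "missing_description_rate": sum(1 for v in videos if not v["description"].strip()) / n,
        "missing_tags_rate": sum(1 for v in videos if not v["tags"]) / n,
        "mean_unique_tags": sum(len(set(v["tags"])) for v in videos) / n,
    }
```

Running this on each nested subset (top results first) would give one curve per indicator, which is the kind of comparison Figure 2 reports.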

Implementation Details. Since Kinetics videos are also collected from the Internet, we discard videos from WVT which appear in the Kinetics validation or test sets. Since many videos do not contain a description or tags, we code the missing information as an empty string, rather than discarding these videos. We perform all searches in English, so WVT contains primarily (though not exclusively) English-language videos and metadata. However, our approach is extensible to any language.

4 Model

Figure 3: Model architecture for webly-supervised learning from textual metadata. We encode the video using S3D-G [66], and the metadata using BERT [13]. We then train the video representation by matching it with the correct metadata representation.

At a high level, our approach (Figure 3) learns video representations by creating representations of the video’s metadata, and encouraging the video representations to match these metadata representations. The video representation is a vector $v \in \mathbb{R}^{d_v}$, and the metadata representation is a vector $m \in \mathbb{R}^{d_m}$, where the vector dimensions $d_v$ and $d_m$ are dependent on the models used to extract each representation and do not need to be the same.

Intuitively, the video and its metadata contain similar information, and therefore their representations $v$ (video) and $m$ (metadata) should contain similar information. However, the information contained in the video and its metadata is not exactly the same. The video will always contain information which is not present in the metadata. For example, the description of a rock climbing video will not list every hold the climber uses on their route. Likewise, the text will provide context which is not present in the video, such as listing the time and location where the video was shot. With our approach, we leverage this observation by encouraging the video representations to be similar to, but not the same as, the corresponding metadata representation.

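Specifically, the video representations are trained by predicting the metadata representations. We predict the metadata representation from the video representation $v$ by applying a simple linear transformation, that is, $\hat{m} = Wv + b$, where $W \in \mathbb{R}^{d_m \times d_v}$ and $b \in \mathbb{R}^{d_m}$, with $d_v$ and $d_m$ the dimensions of the video and metadata representations. We then apply a ranking loss which penalizes the model if $\hat{m}$ is similar to the metadata representation $m^-$ of another video. That is, $\ell(\hat{m}, m, m^-) = \max\left(0,\, d(\hat{m}, m) - d(\hat{m}, m^-) + \delta\right)$, where $m$ is the metadata representation of the same video, $d$ is a distance metric, and $\delta$ is the minimum allowable margin between $d(\hat{m}, m)$ and $d(\hat{m}, m^-)$. In our experiments, we set $d$ as the cosine distance, $d(a, b) = 1 - \frac{a \cdot b}{\|a\| \, \|b\|}$, and choose the margin $\delta$ using the validation set.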

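Negative Examples. For the loss, we require a “negative” metadata representation $m^-$, that is, one drawn from a different video than the positive metadata representation $m$. We draw the negative example from another video in the dataset uniformly at random. In addition, we use multiple negative examples for each positive example, and take the mean of their respective losses to get the loss,

$$L = \frac{1}{K} \sum_{k=1}^{K} \max\left(0,\, d(\hat{m}, m) - d(\hat{m}, m^-_k) + \delta\right), \tag{1}$$

where $\hat{m}$ is the metadata representation predicted from the video, $d$ is the distance metric, $\delta$ is the margin, and $m^-_1, \dots, m^-_K$ are the negative examples. In practice, we use $K = 15$, giving a ratio of 1 positive example for every 15 negative examples. We do not perform any hard-negative mining; we find that uniformly sampled negatives are sufficient. These negative examples are taken from the same batch of SGD training for convenience of implementation.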
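To make the loss concrete, the following is a small pure-Python sketch of a margin ranking loss with cosine distance and uniformly sampled in-batch negatives. It is an illustration under assumptions rather than the training code: the hinge form, the function names, and the margin value passed by the caller are all assumptions (the margin is tuned on a validation set), and vectors are plain lists assumed to be non-zero.

```python
import math
import random


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def linear(W, b, v):
    """Linear map from a video representation to a predicted metadata vector."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def ranking_loss(pred, pos, negatives, margin):
    """Mean hinge loss over the negatives for one positive pair."""
    d_pos = cosine_distance(pred, pos)
    losses = [max(0.0, d_pos - cosine_distance(pred, neg) + margin) for neg in negatives]
    return sum(losses) / len(losses)


def batch_loss(W, b, videos, metas, num_neg, margin, rng):
    """Average loss over a batch, drawing negatives uniformly from the batch."""
    total = 0.0
    for i, v in enumerate(videos):
        others = [j for j in range(len(metas)) if j != i]
        neg_idx = rng.sample(others, num_neg)  # no hard-negative mining
        pred = linear(W, b, v)
        total += ranking_loss(pred, metas[i], [metas[j] for j in neg_idx], margin)
    return total / len(videos)
```

With 16 videos per accelerator, `num_neg=15` would use every other item in the local batch as a negative, which matches the 1:15 ratio described above.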

Multiple Sources of Metadata. When using more than one source of metadata for pre-training, we compute separate metadata representations for each source. Then, for each source $s$, we apply a different set of linear transformation parameters $W_s$ and $b_s$ to the video representation $v$, to compute a source-specific prediction $\hat{m}_s$. We then separately compute a loss for each source as in Equation 1. The final loss is the sum of these losses.
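The per-source heads and summed loss can be sketched as below. This is a minimal illustration with hypothetical names: `heads` maps each metadata source to its own linear parameters, and the hinge form and margin value are assumptions, as in any reimplementation of a ranking loss.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def hinge(pred, pos, negatives, margin):
    """Mean margin ranking loss over the negatives."""
    d_pos = cosine_distance(pred, pos)
    return sum(max(0.0, d_pos - cosine_distance(pred, n) + margin) for n in negatives) / len(negatives)


def multi_source_loss(v, heads, positives, negatives, margin):
    """Sum of per-source losses; every source has its own (W, b) head."""
    total = 0.0
    for source, (W, b) in heads.items():
        pred = [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]
        total += hinge(pred, positives[source], negatives[source], margin)
    return total
```

Because the heads are separate, a source with weak or missing text only affects its own term in the sum, while the shared video representation receives gradients from all sources.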

End-to-End Training. We train the video representation end-to-end with the linear transformation parameters $W$ and $b$. Since our goal is to learn video representations, not text representations, we do not train the metadata representations end-to-end. Instead, we use a pre-trained state-of-the-art text feature extractor to generate these embeddings (Section 4.2).

We train the model using stochastic gradient descent, with Nesterov momentum of 0.9 [57] and a weight decay of 1e-5. We apply dropout with a rate of 0.5 to the video features. We use a batch size of 2048 split into chunks of 16 videos across each of 128 accelerators, trained synchronously. The learning rate schedule begins with 1500 warmup steps (exponentially increasing from 0.001 to 1.0), followed by a cosine-decaying [40] schedule for the remaining steps. We train on 70M videos for only 140K steps in total, which translates into just over 4 full epochs. Due to the accelerators and large batch size, this model takes less than 4 days to train.
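The schedule and the epoch count can be checked with a short sketch. The shape below is an assumption where the text is not specific: the warmup is modeled as geometric interpolation from 0.001 to 1.0, and the cosine phase is assumed to decay to zero at the final step.

```python
import math

WARMUP_STEPS = 1500
TOTAL_STEPS = 140_000
BATCH_SIZE = 16 * 128          # 16 videos on each of 128 accelerators
DATASET_SIZE = 70_000_000


def warmup_cosine(step):
    """Scheduled value at a given step: exponential warmup, then cosine decay."""
    if step < WARMUP_STEPS:
        # Geometric interpolation: 0.001 at step 0, approaching 1.0 at the end.
        return 0.001 ** (1.0 - step / WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    # Assumption: the cosine phase decays all the way to zero.
    return 0.5 * (1.0 + math.cos(math.pi * progress))


# Number of passes over the data implied by the step count and batch size.
epochs = TOTAL_STEPS * BATCH_SIZE / DATASET_SIZE
```

Here `epochs` evaluates to about 4.1, consistent with "just over 4 full epochs".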

4.1 Video Representation

We create the video representation using a 3D Convolutional Neural Network (3D CNN) which operates directly on the RGB video frames. The input to the 3D CNN is therefore a tensor of shape frames × height × width × 3, which represents the video clip. To get the video representation, we take the final hidden layer of the network and (when necessary) mean-pool across the spatial and temporal dimensions, resulting in a vector $v$ of length $d_v$.

In our work, we use S3D-G [66] as the backbone 3D CNN architecture. We choose this architecture because it outperforms the commonly-used I3D architecture [9] at lower computational cost. We do not train on larger-capacity models such as R(2+1)D-152 (118M params, 10x that of S3D-G) due to the significant computational cost of training such a model on 70M videos. In addition, our goal in this work is to demonstrate the utility of textual metadata, rather than of any particular backbone 3D CNN. Prior work has shown that pre-training a higher-capacity model on a large dataset leads to a similar change in accuracy as pre-training a lower-capacity model [20], suggesting that our results could also be applied to larger-capacity models. However, this comparison is beyond the scope of this work.

During training (both pre-training and fine-tuning), we apply the 3D CNN on 64-frame clips drawn uniformly at random from the video at 25fps. We resize the frames to 256px on the shortest edge, and then take a random crop. We additionally perform random brightness, contrast, and flipping augmentation. During inference, we use 250-frame clips (using circular padding where necessary), and take a center crop.


4.2 Metadata Representation

For each piece of textual metadata, we create a metadata representation using BERT [13], a state-of-the-art text encoder. BERT returns a 768-dimensional embedding for each token in the text, and we take the mean of these token-level embeddings to get a single 768-dimensional representation of the metadata, that is, $m \in \mathbb{R}^{768}$.

Specifically, we use the multilingual, cased version of BERT which was pre-trained on 104 languages, and has 12 layers and 110M parameters. We use the multilingual version because non-English text appears in WVT-70M. Since our goal is to learn video representations, we do not fine-tune the BERT model. This also significantly alleviates the computational cost of training; otherwise fine-tuning the text model would dominate the computational cost.

When computing features for tags (where each video can have zero to many tags), we compute a BERT embedding for each individual tag and take the mean of the results. For videos with no tags, we embed an empty string instead. Each of the three other pieces of metadata (titles, descriptions, and channel names) is treated the same way.
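The two levels of pooling for tags can be sketched as follows. The `embed_text` function here is a toy, hypothetical stand-in for a frozen text encoder (it is not BERT and its vectors carry no meaning); only the pooling logic is the point of the example.

```python
def mean_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def embed_text(text, dim=4):
    """Toy stand-in for a frozen encoder: mean of per-token vectors."""
    tokens = text.split() or [""]  # an empty string still yields one vector

    def token_vector(token):
        # Deterministic placeholder features; a real encoder goes here.
        seed = sum(map(ord, token))
        return [float((seed + 31 * k) % 7) for k in range(dim)]

    return mean_vectors([token_vector(t) for t in tokens])


def embed_tags(tags):
    """Embed each tag separately, then average; empty string if no tags."""
    if not tags:
        return embed_text("")
    return mean_vectors([embed_text(tag) for tag in tags])
```

Averaging per-tag embeddings keeps the result independent of tag order, which seems reasonable since tags are an unordered set.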

5 Experiments

For many of our experiments, we use a subset of the full 70M-video dataset. These subsets are denoted by the approximate number of videos they include: 500K, 1M, 6M, 12M, 40M, and 70M. These subsets are not selected at random; instead, each subset is chosen by selecting a smaller number of the top search results from each query, such that the 500K subset contains approximately the top 700 results per query (recall that the 70M dataset contains 100K results per query). This reflects the way that such a method could be used in practice; one would search for queries relevant to their particular downstream task and collect as many of the top search results as they can, subject to space or bandwidth constraints.
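Building a nested subset from ranked search results amounts to truncating each query's result list. The sketch below assumes a hypothetical mapping from each query to its ranked list of video ids (best first) and does not show de-duplication across queries.

```python
def top_k_subset(results_by_query, k):
    """Keep the top-k ranked results of every query."""
    subset = []
    for query, ranked_ids in results_by_query.items():
        subset.extend(ranked_ids[:k])
    return subset


# With 700 queries, roughly 700 results per query gives about 500K videos.
approx_500k = 700 * 700
```

Because each smaller subset is a prefix of the larger ones, the 500K subset is contained in the 1M subset, and so on up to the full collection.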

We do not segment WVT-70M into a validation or test split, and instead evaluate our learned model purely by its performance on downstream tasks. We evaluate on four downstream video classification tasks:

HMDB-51. HMDB-51 [35] is an action recognition dataset consisting of short video clips associated with one of 51 classes. It contains 7000 videos, and is commonly used as a benchmark for video representation learning. We report results on the first test split, except where otherwise noted. When fine-tuning on HMDB-51, we use a learning rate of 1e-3 with a cosine decay schedule, a weight decay of 1e-4, and we train for 1000 iterations.

UCF-101. UCF-101 [55] is a similar action recognition dataset consisting of video clips associated with one of 101 classes. It is larger than HMDB-51, consisting of over 13,000 videos. We report results on the first test split, except where otherwise noted. When fine-tuning on UCF-101, we use a learning rate of 1e-3 with a cosine decay schedule, a weight decay of 1e-3, and we train for 1000 iterations.

Kinetics-400, 600, 700. Kinetics is a widely-used action recognition dataset consisting of 10-second clips drawn from videos annotated with action categories [31]. Kinetics-400, 600, and 700 are increasingly larger versions of the dataset, containing 400, 600, and 700 action categories, respectively [7, 8]. Kinetics contains over 545,000 videos, and due to its scale, it is commonly used to pre-train video representations. We compare against Kinetics as a pre-training scheme, in addition to using it as a downstream task.

Kinetics videos can be deleted by their uploaders at any time, and afterwards can no longer be recovered by researchers. Therefore, Kinetics gradually deteriorates over time, which creates discrepancies between training and evaluation runs performed at different times. Our experiments were conducted using a snapshot of the Kinetics dataset collected in February 2020, when Kinetics-400 contained 225K of the original 247K training examples (-8.9%), Kinetics-600 contained 378K of the original 393K training examples (-3.8%), and Kinetics-700 contained 541K of the original 545K training examples (-0.7%).
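The reported decay percentages follow directly from the snapshot counts, as the short check below illustrates. The counts are the rounded figures quoted above, so the results are approximate.

```python
# (remaining, original) training examples, rounded to the nearest thousand
snapshots = {
    "Kinetics-400": (225_000, 247_000),
    "Kinetics-600": (378_000, 393_000),
    "Kinetics-700": (541_000, 545_000),
}


def percent_change(remaining, original):
    """Relative change in dataset size, in percent (negative means loss)."""
    return 100.0 * (remaining - original) / original


changes = {name: round(percent_change(*counts), 1) for name, counts in snapshots.items()}
```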

5.1 Different Forms of Metadata

We collect four types of metadata for each video: the title, description, tags, and channel name (Section 3). We observe that each type of metadata contains a different level of detail and is affected by different sources of noise (Figure 1). Therefore, we expect the different types of metadata to have different impacts on downstream performance. We investigate which of these are the most useful for pre-training in Table 2. For these experiments, we pre-train the model on WVT-500K and fine-tune on HMDB-51.

Supervision             HMDB-51
Scratch                 27.9
Titles                  43.2
Descriptions            37.7
Tags                    36.2
Channel Name            29.1
Titles + Desc.          43.9
Titles + Desc. + Tags   46.5
All                     50.0

Table 2: Sources of metadata used. Left: Their effect on downstream performance, as measured on HMDB-51. Each source of metadata contributes individually to the final accuracy. For these experiments, we pre-train on WVT-500K. All reported accuracies are on HMDB-51 split 1. Right: Additional examples of metadata, demonstrating complementary information. One source of metadata is not usually sufficient to fully understand the video content. All metadata used under CC BY 2.0 license.

We find that all types of metadata are useful sources of supervisory signal for pre-training. Titles are the most effective, achieving an increase in downstream accuracy of 15.3% over a from-scratch baseline. Channel names are the least effective, resulting in only a 1.2% improvement over the baseline. However, we find that these sources of supervision provide complementary signals, and that we achieve the best performance by including all of them during pre-training. This achieves a downstream accuracy of 50.0% on HMDB-51, a 22.1% improvement over the from-scratch baseline.

In addition, these experiments can be used to show the relative utility of webly-supervised learning and fully-supervised learning. These experiments are conducted using WVT-500K, which is approximately the same size as Kinetics-700 (545K videos). For comparison, a fully-supervised model pre-trained on Kinetics-700 achieves 67.4% accuracy on HMDB-51, a 17.4% improvement over training on all four sources of metadata. As expected, web supervision suffers from noise and therefore is not as effective, video for video, as supervised pre-training. However, web supervision does not incur any labeling cost, making it an effective option for pre-training.

5.2 Scaling to 70M Videos

To demonstrate the scalability of our method, we apply it to increasingly large subsets of the full 70M-video dataset in Figure 4. We compare two metadata configurations for this experiment: (1) only titles, and (2) all metadata. We find that the titles-only approach scales significantly better than the all-metadata approach; although using all metadata leads to higher downstream accuracy with 500K pre-training videos, this is reversed when using more than 1M pre-training videos. This is likely due to the poor scaling properties of tags and descriptions as shown in Figure 2, and suggests that too much noise can become a burden on training.

Dataset   Iters   HMDB-51
Scratch   N/A     27.9
500K      20K     43.2
1M        25K     50.5
6M        30K     58.9
12M       50K     63.2
40M       100K    65.2
70M       140K    67.4
K700      30K     67.4

Figure 4: Performance of our approach on HMDB-51 (split 1) for increasingly larger pre-training dataset sizes, compared to a baseline model trained from scratch and a model pre-trained on Kinetics-700. Left: Comparison of titles-only and all-metadata approaches. Titles-only scales better than all-metadata. Right: Number of pre-training iterations and resulting accuracy. K700 = Kinetics-700. Our approach with 70M videos matches that of fully-supervised pre-training.

For the titles-only approach, we find that using more pre-training data sharply improves performance. Using all 70M videos for pre-training achieves an HMDB-51 accuracy of 67.4%, a 24.2% improvement over the 43.2% achieved using 500K videos. In addition, this accuracy is the same as that of an equivalent model trained on Kinetics-700, demonstrating that our approach can match the performance of fully-supervised pre-training, without any labeling cost.

We do not expand the model capacity or adjust the pre-training hyperparameters when scaling to 70M videos. The only difference is the number of pre-training iterations, which we list in Figure 4. For the smaller-scale datasets, we found that increasing the number of iterations further lowered down-stream performance. Interestingly, we achieve good performance on the 70M dataset using only 4 epochs of training, while on the 500K dataset we require over 80 epochs of training. This suggests that increased model capacity and longer training could further improve performance for the 70M dataset.
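The relationship between iterations and epochs quoted above can be checked with a short calculation. The sketch below uses the iteration counts from Figure 4; the batch size is not stated in this section, so the value of 2048 is an assumption chosen only because it is consistent with the quoted epoch counts. The ratio between the two epoch counts does not depend on this choice.

```python
# Pre-training iterations per dataset size, copied from the right panel of Figure 4.
ITERS = {
    500_000: 20_000,
    1_000_000: 25_000,
    6_000_000: 30_000,
    12_000_000: 50_000,
    40_000_000: 100_000,
    70_000_000: 140_000,
}

# ASSUMPTION: the batch size is not given in this section; 2048 is illustrative.
BATCH_SIZE = 2048


def epochs(num_videos, num_iters, batch_size=BATCH_SIZE):
    """Approximate number of passes over a dataset of num_videos clips."""
    return num_iters * batch_size / num_videos


for num_videos, num_iters in ITERS.items():
    print(f"{num_videos:>11,d} videos: {num_iters:>7,d} iters "
          f"~ {epochs(num_videos, num_iters):5.1f} epochs")

# The 500K run sees each video 20x more often than the 70M run,
# regardless of the assumed batch size.
ratio = epochs(500_000, 20_000) / epochs(70_000_000, 140_000)
```

Under this assumed batch size, the 70M run corresponds to roughly 4 epochs and the 500K run to a little over 80, in line with the text.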

Method Data Model HMDB-51 UCF-101
Baseline None S3D-G 27.9 58.5
Video-only:
Geometry [18] FC FlowNet 23.3 55.1
OPN [37] UCF VGG 23.8 59.6
CMC [58] UCF CaffeNet 26.7 59.1
ClipOrder [67] UCF R(2+1)D 30.9 72.4
O3N [17] UCF AlexNet 32.5 60.3
MASN [62] K400 C3D 33.4 61.2
DPC [25] K400 3D-R34 35.7 75.7
Shuffle&Learn [46]* K600 S3D 35.8 68.7
3DRotNet [28]* K600 S3D 40.0 75.3
CBT [56] K600 S3D 44.6 79.5
Video+Audio:
AVTS [33] K600 I3D 53.0 83.7
MIL-NCE [44] HT100M S3D 61.0 91.3
XDC [3] IG65M R(2+1)D 63.1 91.5
Webly-Sup.:
Sports-1M [30] S-1M AlexNet - 65.4
Gan et al. [19] YouTube VGG - 69.3
CPD [39] K400 3D-R34 57.7 88.7
Ours WVT-70M S3D-G 65.3 90.3
Table 3: Comparison with prior work on self-supervised and webly-supervised pre-training on HMDB-51 and UCF-101. “Data” refers to the source of pre-training videos; these approaches do not use the available labels. All numbers are quoted directly from the original authors. Our results are averaged across all three splits of HMDB-51 and UCF-101. *Reimplemented by [56].

5.3 Comparison with Prior Work

In Table 3, we compare our approach, pre-trained on WVT-70M, against other methods for self-supervised and webly-supervised learning. We strongly outperform all existing self-supervised methods that use video as the only source of supervision, suggesting that the textual metadata provides a supervisory signal that cannot be obtained from video alone. We find that our approach outperforms all prior webly-supervised approaches on HMDB-51, and performs on par with state-of-the-art methods on UCF-101 that use audio as a primary source of supervision. Notably, we outperform MIL-NCE [44], a recent method for learning video representations from instructional videos in the HowTo100M dataset [45], on HMDB-51 (+4.3%). We also outperform two prior approaches on UCF-101 (+24.9%, +21.0%) that learn video representations using web supervision from YouTube [30, 19].
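The margins quoted in this paragraph follow directly from Table 3. As a minimal sketch, the snippet below recomputes them from a handful of rows copied from the table; the dictionary layout and the helper name `margin` are illustrative choices, not part of the original work.

```python
# (HMDB-51, UCF-101) accuracies copied from Table 3; None marks an unreported entry.
TABLE3 = {
    "CBT [56]": (44.6, 79.5),
    "MIL-NCE [44]": (61.0, 91.3),
    "XDC [3]": (63.1, 91.5),
    "Sports-1M [30]": (None, 65.4),
    "Gan et al. [19]": (None, 69.3),
    "CPD [39]": (57.7, 88.7),
}
OURS = (65.3, 90.3)


def margin(ours, theirs):
    """Difference in accuracy (percentage points), or None if unreported."""
    if theirs is None:
        return None
    return round(ours - theirs, 1)


margins = {
    method: (margin(OURS[0], hmdb), margin(OURS[1], ucf))
    for method, (hmdb, ucf) in TABLE3.items()
}

for method, (d_hmdb, d_ucf) in margins.items():
    print(f"{method:<16} HMDB-51: {d_hmdb}  UCF-101: {d_ucf}")
```

A negative value, such as the UCF-101 margin against XDC, corresponds to the cases described above as performing on par rather than outperforming.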

In Table 4, we present results on Kinetics. We find that our pre-training improves performance by 1-3% over a from-scratch baseline, depending on the particular version of Kinetics. These improvements are much smaller than those on HMDB-51 and UCF-101. This is to be expected, as Kinetics is already a large-scale dataset and therefore has less to gain from pre-training. We compare against two prior works on Kinetics, MIL-NCE [44] (which uses supervision from narration) and IG65M [20] (which uses hashtags from Instagram). We outperform MIL-NCE on Kinetics-700 but underperform IG65M on Kinetics-400. This suggests that hashtags may be a stronger source of supervision than textual metadata, which could be due to a number of factors, such as the relative amount of noise in the two types of signals.

Method Model K400 K600 K700
Baseline S3D-G 68.9 74.3 62.2
Baseline [44] I3D - - 57.0
Baseline [20] R(2+1)D-18 69.3 - -
MIL-NCE [44] I3D - - 61.1
IG65M [20] R(2+1)D-18 76.0 - -
Ours S3D-G 72.0 76.0 63.4

Pre-training HMDB-51
70M 67.4
K700 67.4
70M+K400 72.2
70M+K600 74.5
70M+K700 75.9
Table 4: Experiments on Kinetics. KX = Kinetics-X. Left: Comparison with prior work on webly-supervised learning on Kinetics-400, -600, and -700. We use numbers quoted directly from the authors. Right: Complementary nature of webly-supervised and fully-supervised learning. We pre-train the model on WVT-70M, then fine-tune it on Kinetics, then apply it to HMDB-51 (split 1).

5.4 Complementary Strong- and Web-Supervision

Webly-supervised learning has the capacity to meet the performance of strongly-supervised learning, without any labels (Section 5.2). However, in practice, one would use all sources of supervision available, including labeled datasets. Therefore, we ask whether webly-supervised and strongly-supervised learning can be applied in combination, to further improve the performance on down-stream tasks. We test this in Table 4 by training in a three-step process: first, we pre-train our model on WVT-70M. Then, we fine-tune this model on Kinetics. Finally, we apply the resulting model to HMDB-51.

We find that strongly-supervised learning and webly-supervised learning are indeed complementary. When WVT-70M and Kinetics-700 are used in combination, the down-stream accuracy on HMDB-51 increases by a further 8.5%. This demonstrates that our method is effective even in situations where labeled data is already plentiful.
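The gain from combining the two sources of supervision can be read off the right panel of Table 4. The sketch below recomputes it for each Kinetics version; the helper name `gain_over` is hypothetical and used only for illustration.

```python
# HMDB-51 (split 1) accuracies copied from the right panel of Table 4.
HMDB51 = {
    "70M": 67.4,
    "K700": 67.4,
    "70M+K400": 72.2,
    "70M+K600": 74.5,
    "70M+K700": 75.9,
}


def gain_over(baseline, setting, table=HMDB51):
    """Accuracy gain (percentage points) of `setting` relative to `baseline`."""
    return round(table[setting] - table[baseline], 1)


# Gain of each combined setting over web supervision alone.
gains = {
    name: gain_over("70M", name)
    for name in HMDB51
    if name.startswith("70M+")
}
print(gains)
```

Because WVT-70M alone and Kinetics-700 alone both reach 67.4%, the gain for the combined setting is the same relative to either single source.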

6 Conclusions

We demonstrate that textual metadata serves as a useful signal for pre-training video representations, without the need for any manually annotated labels. Specifically, we find that each textual signal is complementary (Section 5.1), and that this approach matches the performance of supervised pre-training when scaled to tens of millions of videos (Section 5.2). We also show that it outperforms competitive approaches for both self-supervised and webly-supervised learning (Section 5.3). Finally, we demonstrate that it is complementary to existing supervised pre-training methods (Section 5.4). These findings suggest that textual metadata can be used as an effective pre-training strategy for a wide variety of downstream tasks.

7 Acknowledgements

This work was completed while JCS was an intern at Google Research. We additionally thank Meghana Thotakuri, Arsha Nagrani, Bryan Seybold, and Austin Meyers for helpful discussions and assistance with experiments.


  • [1] S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, and S. Vijayanarasimhan (2016) Youtube-8m: a large-scale video classification benchmark. arXiv preprint arXiv:1609.08675. Cited by: §2, Table 1, §3.
  • [2] J. Alayrac, P. Bojanowski, N. Agrawal, J. Sivic, I. Laptev, and S. Lacoste-Julien (2016) Unsupervised learning from narrated instruction videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4575–4583. Cited by: §2.
  • [3] H. Alwassel, D. Mahajan, L. Torresani, B. Ghanem, and D. Tran (2019) Self-supervised learning by cross-modal audio-video clustering. arXiv preprint arXiv:1911.12667. Cited by: §2, Table 3.
  • [4] R. Arandjelovic and A. Zisserman (2017) Look, listen and learn. In Proceedings of the IEEE International Conference on Computer Vision, pp. 609–617. Cited by: §2.
  • [5] Y. Aytar, C. Vondrick, and A. Torralba (2016) Soundnet: learning sound representations from unlabeled video. In Advances in neural information processing systems, pp. 892–900. Cited by: §2.
  • [6] A. Bergamo and L. Torresani (2010) Exploiting weakly-labeled web images to improve object classification: a domain adaptation approach. In Advances in neural information processing systems, pp. 181–189. Cited by: §2.
  • [7] J. Carreira, E. Noland, A. Banki-Horvath, C. Hillier, and A. Zisserman (2018) A short note about kinetics-600. arXiv preprint arXiv:1808.01340. Cited by: §5.
  • [8] J. Carreira, E. Noland, C. Hillier, and A. Zisserman (2019) A short note on the kinetics-700 human action dataset. arXiv preprint arXiv:1907.06987. Cited by: §3, §5.
  • [9] J. Carreira and A. Zisserman (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308. Cited by: §4.1.
  • [10] J. Carreira and E. Noland (2020) Personal correspondence. Cited by: §1.
  • [11] X. Chen and A. Gupta (2015) Webly supervised learning of convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1431–1439. Cited by: §2.
  • [12] X. Chen, A. Shrivastava, and A. Gupta (2013) Neil: extracting visual knowledge from web data. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1409–1416. Cited by: §2.
  • [13] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: Figure 3, §4.2.
  • [14] S. K. Divvala, A. Farhadi, and C. Guestrin (2014) Learning everything about anything: webly-supervised visual concept learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3270–3277. Cited by: §2.
  • [15] D. Dwibedi, Y. Aytar, J. Tompson, P. Sermanet, and A. Zisserman (2019) Temporal cycle-consistency learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1801–1810. Cited by: §2.
  • [16] R. Fergus, L. Fei-Fei, P. Perona, and A. Zisserman (2010) Learning object categories from internet image searches. Proceedings of the IEEE 98 (8), pp. 1453–1466. Cited by: §2.
  • [17] B. Fernando, H. Bilen, E. Gavves, and S. Gould (2017) Self-supervised video representation learning with odd-one-out networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3636–3645. Cited by: §2, Table 3.
  • [18] C. Gan, B. Gong, K. Liu, H. Su, and L. J. Guibas (2018) Geometry guided convolutional neural networks for self-supervised video representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5589–5597. Cited by: §2, Table 3.
  • [19] C. Gan, C. Sun, L. Duan, and B. Gong (2016) Webly-supervised video recognition by mutually voting for relevant web images and web video frames. In European Conference on Computer Vision, pp. 849–866. Cited by: §2, §5.3, Table 3.
  • [20] D. Ghadiyaram, D. Tran, and D. Mahajan (2019) Large-scale weakly-supervised pre-training for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12046–12055. Cited by: §1, §2, Table 1, §3, §3, §4.1, §5.3, Table 4.
  • [21] Y. Gong, Q. Ke, M. Isard, and S. Lazebnik (2014) A multi-view embedding space for modeling internet images, tags, and their semantics. International journal of computer vision 106 (2), pp. 210–233. Cited by: §2.
  • [22] Y. Gong, L. Wang, M. Hodosh, J. Hockenmaier, and S. Lazebnik (2014) Improving image-sentence embeddings using large weakly annotated photo collections. In European conference on computer vision, pp. 529–545. Cited by: §2.
  • [23] C. Gu, C. Sun, D. A. Ross, C. Vondrick, C. Pantofaru, Y. Li, S. Vijayanarasimhan, G. Toderici, S. Ricco, R. Sukthankar, et al. (2018) Ava: a video dataset of spatio-temporally localized atomic visual actions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6047–6056. Cited by: §1.
  • [24] J. Hale (2019) Website. Cited by: §1.
  • [25] T. Han, W. Xie, and A. Zisserman (2019) Video representation learning by dense predictive coding. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 0–0. Cited by: Table 3.
  • [26] G. Hyams, D. Malowany, A. Biller, and G. Axler (2019) Website. Cited by: §1.
  • [27] H. Izadinia, B. C. Russell, A. Farhadi, M. D. Hoffman, and A. Hertzmann (2015) Deep classifiers from image tags in the wild. In Proceedings of the 2015 Workshop on Community-Organized Multimodal Mining: Opportunities for Novel Solutions, pp. 13–18. Cited by: §2.
  • [28] L. Jing and Y. Tian (2018) Self-supervised spatiotemporal feature learning by video geometric transformations. arXiv preprint arXiv:1811.11387 2 (7), pp. 8. Cited by: §2, Table 3.
  • [29] A. Joulin, L. van der Maaten, A. Jabri, and N. Vasilache (2016) Learning visual features from large weakly supervised data. In European Conference on Computer Vision, pp. 67–84. Cited by: §2.
  • [30] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei (2014) Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 1725–1732. Cited by: §2, Table 1, §5.3, Table 3.
  • [31] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. (2017) The kinetics human action video dataset. arXiv preprint arXiv:1705.06950. Cited by: §1, §3, §5.
  • [32] D. Kim, D. Cho, and I. S. Kweon (2019) Self-supervised video representation learning with space-time cubic puzzles. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 8545–8552. Cited by: §2.
  • [33] B. Korbar, D. Tran, and L. Torresani (2018) Cooperative learning of audio and video models from self-supervised synchronization. In Advances in Neural Information Processing Systems, pp. 7763–7774. Cited by: §2, Table 3.
  • [34] H. Kuehne, A. Iqbal, A. Richard, and J. Gall (2019) Mining youtube-a dataset for learning fine-grained action concepts from webly supervised video data. arXiv preprint arXiv:1906.01012. Cited by: §2.
  • [35] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre (2011) HMDB: a large video database for human motion recognition. In 2011 International Conference on Computer Vision, pp. 2556–2563. Cited by: §1, §5.
  • [36] Z. Lai and W. Xie (2019) Self-supervised learning for video correspondence flow. arXiv preprint arXiv:1905.00875. Cited by: §2.
  • [37] H. Lee, J. Huang, M. Singh, and M. Yang (2017) Unsupervised representation learning by sorting sequences. In Proceedings of the IEEE International Conference on Computer Vision, pp. 667–676. Cited by: §2, Table 3.
  • [38] A. Li, A. Jabri, A. Joulin, and L. van der Maaten (2017) Learning visual n-grams from web data. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4183–4192. Cited by: §2.
  • [39] T. Li and L. Wang (2020) Learning spatiotemporal features via video and text pair discrimination. arXiv preprint arXiv:2001.05691. Cited by: §2, Table 3.
  • [40] I. Loshchilov and F. Hutter (2016) Sgdr: stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983. Cited by: §4.
  • [41] W. Lotter, G. Kreiman, and D. Cox (2016) Deep predictive coding networks for video prediction and unsupervised learning. arXiv preprint arXiv:1605.08104. Cited by: §2.
  • [42] D. Mahajan, R. Girshick, V. Ramanathan, K. He, M. Paluri, Y. Li, A. Bharambe, and L. van der Maaten (2018) Exploring the limits of weakly supervised pretraining. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 181–196. Cited by: §2.
  • [43] M. Mathieu, C. Couprie, and Y. LeCun (2015) Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440. Cited by: §2.
  • [44] A. Miech, J. Alayrac, L. Smaira, I. Laptev, J. Sivic, and A. Zisserman (2019) End-to-end learning of visual representations from uncurated instructional videos. arXiv preprint arXiv:1912.06430. Cited by: §2, §5.3, §5.3, Table 3, Table 4.
  • [45] A. Miech, D. Zhukov, J. Alayrac, M. Tapaswi, I. Laptev, and J. Sivic (2019) Howto100M: learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2630–2640. Cited by: §2, Table 1, §3, §5.3.
  • [46] I. Misra, C. L. Zitnick, and M. Hebert (2016) Shuffle and learn: unsupervised learning using temporal order verification. In European Conference on Computer Vision, pp. 527–544. Cited by: §2, Table 3.
  • [47] N. C. Mithun, R. Panda, E. E. Papalexakis, and A. K. Roy-Chowdhury (2018) Webly supervised joint embedding for cross-modal image-text retrieval. In Proceedings of the 26th ACM international conference on Multimedia, pp. 1856–1864. Cited by: §2.
  • [48] A. Nagrani, S. Chen, D. Ross, R. Sukthankar, C. Schmid, and A. Zisserman (2020) Speech2Action: cross-modal supervision for action recognition. CVPR. Cited by: §2.
  • [49] V. Ordonez, G. Kulkarni, and T. L. Berg (2011) Im2text: describing images using 1 million captioned photographs. In Advances in neural information processing systems, pp. 1143–1151. Cited by: §2.
  • [50] A. Owens and A. A. Efros (2018) Audio-visual scene analysis with self-supervised multisensory features. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 631–648. Cited by: §2.
  • [51] A. Owens, J. Wu, J. H. McDermott, W. T. Freeman, and A. Torralba (2016) Ambient sound provides supervision for visual learning. In European conference on computer vision, pp. 801–816. Cited by: §2.
  • [52] D. Pathak, R. Girshick, P. Dollár, T. Darrell, and B. Hariharan (2017) Learning features by watching objects move. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2701–2710. Cited by: §2.
  • [53] A. Rouditchenko, H. Zhao, C. Gan, J. McDermott, and A. Torralba (2019) Self-supervised audio-visual co-segmentation. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2357–2361. Cited by: §2.
  • [54] F. Schroff, A. Criminisi, and A. Zisserman (2010) Harvesting image databases from the web. IEEE transactions on pattern analysis and machine intelligence 33 (4), pp. 754–766. Cited by: §2.
  • [55] K. Soomro, A. R. Zamir, and M. Shah (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2. Cited by: §5.
  • [56] C. Sun, F. Baradel, K. Murphy, and C. Schmid (2019) Contrastive bidirectional transformer for temporal representation learning. arXiv preprint arXiv:1906.05743. Cited by: Table 3.
  • [57] I. Sutskever, J. Martens, G. Dahl, and G. Hinton (2013) On the importance of initialization and momentum in deep learning. In International conference on machine learning, pp. 1139–1147. Cited by: §4.
  • [58] Y. Tian, D. Krishnan, and P. Isola (2019) Contrastive multiview coding. arXiv preprint arXiv:1906.05849. Cited by: Table 3.
  • [59] C. Vondrick, H. Pirsiavash, and A. Torralba (2016) Anticipating visual representations from unlabeled video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 98–106. Cited by: §2.
  • [60] C. Vondrick, H. Pirsiavash, and A. Torralba (2016) Generating videos with scene dynamics. In Advances in neural information processing systems, pp. 613–621. Cited by: §2.
  • [61] C. Vondrick, A. Shrivastava, A. Fathi, S. Guadarrama, and K. Murphy (2018) Tracking emerges by colorizing videos. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 391–408. Cited by: §2.
  • [62] J. Wang, J. Jiao, L. Bao, S. He, Y. Liu, and W. Liu (2019) Self-supervised spatio-temporal representation learning for videos by predicting motion and appearance statistics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4006–4015. Cited by: §2, Table 3.
  • [63] X. Wang, A. Jabri, and A. A. Efros (2019) Learning correspondence from the cycle-consistency of time. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2566–2576. Cited by: §2.
  • [64] X. Wang, L. Zhang, X. Li, and W. Ma (2008) Annotating images by mining image search results. IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (11), pp. 1919–1932. Cited by: §2.
  • [65] D. Wei, J. J. Lim, A. Zisserman, and W. T. Freeman (2018) Learning and using the arrow of time. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8052–8060. Cited by: §2.
  • [66] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy (2018) Rethinking spatiotemporal feature learning: speed-accuracy trade-offs in video classification. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 305–321. Cited by: Figure 3, §4.1.
  • [67] D. Xu, J. Xiao, Z. Zhao, J. Shao, D. Xie, and Y. Zhuang (2019) Self-supervised spatiotemporal learning via video clip order prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10334–10343. Cited by: §2, Table 3.
  • [68] S. Yu, L. Jiang, and A. Hauptmann (2014) Instructional videos for unsupervised harvesting and learning of action examples. In Proceedings of the 22nd ACM international conference on Multimedia, pp. 825–828. Cited by: §2.
  • [69] H. Zhao, C. Gan, A. Rouditchenko, C. Vondrick, J. McDermott, and A. Torralba (2018) The sound of pixels. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 570–586. Cited by: §2.
  • [70] L. Zhou, C. Xu, and J. J. Corso (2018) Towards automatic learning of procedures from web instructional videos. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §2.
  • [71] D. Zhukov, J. Alayrac, R. G. Cinbis, D. Fouhey, I. Laptev, and J. Sivic (2019) Cross-task weakly supervised learning from instructional videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3537–3545. Cited by: §2.


We present additional analyses and examples of the metadata in the WVT-70M dataset in Tables 5, 6, and 7.

Metadata Num. Unique % Unique
Titles 43.0M 61.5
Descriptions 29.3M 41.9
Tags 34.0M 48.6
Channel Name 21.0M 29.9
Table 5: Number of unique instances for each metadata type in WVT-70M. All metadata types contain repeats, though some are repeated more often than others. Many channels are repeated; on average, we collect 3.3 videos per channel.
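As a quick consistency check on Table 5, the percentages and the videos-per-channel figure can be approximately recovered from the rounded counts. The sketch below assumes a dataset size of exactly 70M videos, which is an approximation; because the inputs are rounded, the recomputed percentages can differ from the table by about 0.1.

```python
# ASSUMPTION: WVT-70M is treated as exactly 70M videos for this check.
TOTAL_VIDEOS = 70_000_000

# Number of unique instances per metadata type, copied from Table 5.
UNIQUE = {
    "Titles": 43.0e6,
    "Descriptions": 29.3e6,
    "Tags": 34.0e6,
    "Channel Name": 21.0e6,
}


def percent_unique(num_unique, total=TOTAL_VIDEOS):
    """Share of videos whose metadata value is distinct, in percent."""
    return 100.0 * num_unique / total


for name, num_unique in UNIQUE.items():
    print(f"{name:<13} {percent_unique(num_unique):5.1f}% unique")

# Average number of collected videos per distinct channel name.
videos_per_channel = TOTAL_VIDEOS / UNIQUE["Channel Name"]
print(f"{videos_per_channel:.1f} videos per channel")
```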
Metadata Min 25% 50% 75% Max
Titles 0 2 4 6 158
Descriptions 0 0 3 12 4249
Tags 0 0 0 5 161
Channel Name 0 1 2 2 306
Table 6: Quartiles of length (in words) of each metadata type. All have a long-tailed distribution, meaning that in extreme cases, the metadata may be hundreds or thousands of words long. However, all metadata types also contain examples which are empty or contain zero words. Titles are shortest in the most extreme cases, but longest in the median case.
Metadata Text # of Instances
Titles “Free fire” 92K
“Dance” 50K
“Dancing” 47K
“Baby” 34K
“Bottle flip” 31K
“Free Fire” 29K
“Cute baby” 29K
“Playing games” 27K
“Games” 21K
“Snow” 20K
Tags “PlayStation 4” 752K
“Sony Interactive Entertainment” 695K
“funny” 672K
“video” 547K
“mobile” 539K
“YouTube Capture” 523K
“#PS4Live” 490K
“how to” 467K
“tutorial” 442K
“fun” 371K
Table 7: Top ten most frequently repeated titles and tags. The titles are descriptive and reflect the content of the video. The tags often contain automatically-generated metadata that reflects the method by which the video was uploaded.