Representation learning from videos in-the-wild: An object-centric approach

by   Rob Romijnders, et al.

We propose a method to learn image representations from uncurated videos. We combine a supervised loss from off-the-shelf object detectors and self-supervised losses which naturally arise from the video-shot-frame-object hierarchy present in each video. We report competitive results on 19 transfer learning tasks of the Visual Task Adaptation Benchmark (VTAB), and on 8 out-of-distribution generalization tasks, and discuss the benefits and shortcomings of the proposed approach. In particular, it improves over the baseline on 18 of the 19 few-shot learning tasks and on all 8 out-of-distribution generalization tasks. Finally, we perform several ablation studies and analyze the impact of the pretrained object detector on the performance across this suite of tasks.




1 Introduction

Learning transferable visual representations is a key challenge in computer vision. The aim is to learn a representation function once, such that it may be transferred to a plethora of downstream tasks. In the context of image classification, models trained on large amounts of labeled data excel in transfer learning [42], but there is a growing concern that this approach may not be effective for more challenging downstream tasks [79]. Recent advances, such as contrastive self-supervised learning combined with strong data augmentation, present a promising avenue.


Figure 1: We propose to learn from objects in video. An off-the-shelf pretrained object detector tags each frame with bounding boxes and class labels. A contrastive loss encourages pairs of objects with the same class label (positives) to be embedded closer to each other than those from disparate ones (negatives). We augment the hierarchy of frame and shot level contrastive losses with a loss at the object level.

We consider the problem of learning image representations from uncurated videos. While these videos are noisy and unlabeled, they contain abundant natural variations of the present objects. Furthermore, videos decompose temporally into a hierarchy of videos, shots, and frames which can be used to define pretext tasks for self-supervised learning [68]. We extend this hierarchy to its natural continuation, namely, the spatial decomposition of frames into objects. We then use the “video, shot, frame, object” hierarchy to define a more holistic pretext task. In this setting we are hence given uncurated videos and an off-the-shelf pre-trained object detector, and we propose a method of supplementing the loss function with cues at the object level.

Videos, at the frame and shot level, convey global scene structure, and different frames and shots provide a natural data augmentation of the scene. This makes videos a good fit for contrastive learning losses that rely on heavy data augmentation for learning scene representations [6]. At the object level, videos also provide rich information about the structure of individual objects. This can be valuable for tasks such as orientation estimation, counting, and object detection. Furthermore, object-centric representations can generalize to scenes constituted as a novel combination of known objects. Intuitively, each occurrence of the object forms a natural augmentation for objects of that class. Finally, one can make use of the fact that the same object appears in consecutive frames to enable representations which are more robust to perturbations and distribution shifts. Contrastive learning in this setting is illustrated in Figure 1.

Our contributions:

  1. We extend the framework from VIVI [68] to include object level cues using an off-the-shelf object detector.

  2. We demonstrate improvements using object level cues on recent few-shot transfer learning benchmarks and out-of-distribution generalization benchmarks. In particular, the method improves over the baseline on 18 of the 19 few-shot learning tasks and on all 8 out-of-distribution generalization tasks.

  3. We ablate various aspects of the setup to reveal the source of the observed benefits, including (i) randomizing the object classes and locations, (ii) using only the object labels, (iii) using the detector as a classifier in a semi-supervised setup, (iv) cotraining with ImageNet labels, and (v) using larger ResNet models.

2 Related work

Self-supervised image representation learning.

The self-supervised signal is provided through a pretext task (e.g. converting the problem to a supervised problem), such as reconstructing the input [31], predicting the spatial context [12, 56], learning to colorize the image [80], or predicting the rotation of the image [21]. Other popular approaches include clustering [5, 81, 17] and recently generative modeling [14, 41, 15]. A promising recent line of work casts the problem as mutual information maximization of representations of different views of the same image [69]. These views can come from augmentations or corruptions of the same input image [33, 2, 6, 27], or by considering different color channels as separate views [67].

Representation learning from videos.

The order of frames in a video provides a useful learning signal [54, 45, 18, 74]. Information from temporal ordering can be combined with spatial ordering to infer object relations [72] or co-occurrence statistics [34]. Other pretext tasks include predicting the playback rate [77], or clustering [46].

Orthogonal to the pretext tasks, one could use the paradigm of slow feature learning in videos. This line of work dates back to [75], which developed a method to learn slowly varying signals in time series, and inspired several recent works [26, 38, 84, 23]. Our loss at the frame level uses insights from slow feature learning in the form of temporal coherence [55, 58].

Tracking patches is an alternative form of supervision which has some similarity with our object level loss. For example, [70, 72, 19] learn temporally coherent representations for the patches. Our method learns representations for the objects within a fully convolutional neural network. Other approaches have investigated learning specific structures to represent the objects in video [52, 25].

Predicting the next frame or learning to synthesize a (future) frame were also considered as pretext tasks [24, 65, 51]. Given that frame prediction requires learning of fine-detailed features, one can predict only the moving pixels [77], or turn to time-agnostic prediction [37].

Object level supervision.

In terms of self-supervision we follow [68] to learn from the natural hierarchy present in the videos and make use of the losses studied therein. In contrast to [68] we incorporate object-level information in the final loss and show that it leads to benefits both for few-shot transfer learning and out-of-distribution generalization. Incorporating the pixel, object, and patch information for learning and improving video representations was also considered in [73, 70, 84, 83, 19]. In contrast to these works, we do not rely on a strong tracker trained on a similar distribution, but on an off-the-shelf, parameter efficient object detector. Furthermore, we learn representations for images, not videos. Contemporary works also use object information for learning video representations [44] or for training graph neural networks on videos.


Figure 2: Learning from the natural hierarchy present in the videos. Each video in a dataset consists of multiple shots, and each shot consists of multiple frames, which can be used to formulate a contrastive loss for learning image representations [68] (cf. Section 3). We extend this hierarchy to the object level by using an off-the-shelf detector.

3 Method

Self-supervision via video-shot-frame hierarchy.

A video can be decomposed into a hierarchy of shots, frames and objects which is illustrated in Figure 2. For the first two levels in the hierarchy, we follow the setup from [68], named VIVI, which we summarize here.

At the shot level, VIVI learns, in a contrastive manner, to embed shots such that they are predictive of other shots in the same video [69]. At the frame level, VIVI learns to embed frames such that frames from the same shot are closer to each other relative to frames from other shots. In particular, the shot level loss is an instance of the InfoNCE loss [69] between shot representations. VIVI (1) maps each frame in a shot to a frame representation, (2) aggregates these frame representations into a shot representation, and (3) predicts the representation of the next shot given the sequence of preceding shots. The resulting loss uses a function called the critic to compute similarity between shots, and depends on the total number of videos in a mini-batch and on the number of prediction steps into the future. In practice, optimization is more stable when contrasting against shot representations from the entire batch of videos.
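As an illustration of the shot level loss described above, the following is a minimal sketch of an InfoNCE-style loss with a bilinear critic, written in plain Python. The function names, the list-of-lists data layout, and the exact normalization are assumptions made for this sketch; it is not the implementation of [68] and considers a single prediction step only.

```python
import math


def bilinear_critic(pred, target, weight):
    """Bilinear similarity pred^T W target between two shot embeddings."""
    return sum(pred[i] * weight[i][j] * target[j]
               for i in range(len(pred)) for j in range(len(target)))


def info_nce(preds, targets, weight):
    """Mean InfoNCE loss over a batch of videos (one prediction step).

    preds[k] is the predicted embedding of the next shot in video k and
    targets[k] is the embedding of the shot that actually follows. The
    targets of all other videos in the batch act as negatives.
    """
    losses = []
    for k, pred in enumerate(preds):
        scores = [bilinear_critic(pred, t, weight) for t in targets]
        # Numerically stable log-sum-exp over all candidates in the batch.
        top = max(scores)
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        # Negative log-probability assigned to the true next shot.
        losses.append(log_norm - scores[k])
    return sum(losses) / len(losses)
```

With an uninformative critic (all-zero weights) the loss equals the logarithm of the batch size, and it approaches zero as the critic scores each prediction far above the shots of other videos.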

Contrastive learning is also applied at the frame level based on the intuition that frames within a shot typically contain similar scenes. VIVI learns to assign a higher similarity to frame representations coming from the same shot by applying a triplet loss defined in [62] (cf. Figure 2). In particular, for a fixed frame, frames coming from the same shot are considered as positives, while the frames from other shots are negatives, and the loss is the semi-hard triplet loss of [62].
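For reference, a triplet loss of this kind is commonly written in the margin form below. The symbols are assumptions introduced for illustration (anchor embedding $e_a$, positive $e_p$, negative $e_n$, distance function $d$, margin $\alpha$) and are not necessarily the notation used in [62] or [68].

```latex
\mathcal{L}_{\text{triplet}}(e_a, e_p, e_n)
  = \max\bigl(0,\; d(e_a, e_p) - d(e_a, e_n) + \alpha\bigr),
\qquad
\text{semi-hard negatives: } d(e_a, e_p) < d(e_a, e_n) < d(e_a, e_p) + \alpha .
```

Semi-hard negatives are those that are farther from the anchor than the positive but still within the margin, so they contribute a non-zero loss without being the hardest examples.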

Extending the hierarchy with object-level losses.

Data augmentation is a key novelty behind recent advances in representation learning [53, 6, 67, 8]. These augmentations are usually obtained by applying synthetic transformations like random cropping, left-right flipping, color distortion, blurring, or adding noise. However, independent non-rigid movement of an object against background, as seen in real video data, is hard to expect from synthetic augmentations.

In order to exploit these natural augmentations, which occur in video, we use a contrastive loss that encourages representations of objects of the same category to be closer together as opposed to representations of different categories (cf. Figure 1). To construct this loss, we apply an off-the-shelf object detector to all frames and extract the bounding boxes and class labels. Given the representations of each bounding box (will be discussed later), we use a triplet loss where objects from the same class form positive pairs, and objects from different classes form negative pairs.

In particular, given the embedding of the b-th bounding box in a frame, together with the embeddings of a positive and a negative box (with respect to that box), we apply a triplet loss per frame.

To scale the triplet loss to a large number of boxes, we follow insights from the literature [83, 43, 64] and use semi-hard negative mining [62].
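The object level loss can be sketched as follows: boxes with the same detector label form positive pairs, and for each pair a negative is picked by semi-hard mining. This is a hypothetical plain-Python sketch under stated assumptions: squared Euclidean distance, a margin of 1.0, and the common mining rule "closest negative that is farther than the positive, otherwise the farthest negative". The source does not specify these details.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def semi_hard_triplet_loss(embeddings, labels, margin=1.0):
    """Mean triplet loss over all (anchor, positive) box pairs in one frame."""
    losses = []
    n = len(embeddings)
    for a in range(n):
        # Distances from the anchor to every box of a different class.
        d_neg = [sq_dist(embeddings[a], embeddings[j])
                 for j in range(n) if labels[j] != labels[a]]
        if not d_neg:
            continue
        for p in range(n):
            if p == a or labels[p] != labels[a]:
                continue
            d_ap = sq_dist(embeddings[a], embeddings[p])
            # Prefer the closest negative that is farther than the positive.
            farther = [d for d in d_neg if d > d_ap]
            # Otherwise fall back to the farthest available negative.
            d_an = min(farther) if farther else max(d_neg)
            losses.append(max(0.0, d_ap - d_an + margin))
    return sum(losses) / len(losses) if losses else 0.0
```

When the classes are well separated every triplet satisfies the margin and the loss is zero; boxes of a different class that sit closer to the anchor than its positives produce a positive loss.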

An alternative to the contrastive loss is the classic cross-entropy loss – learning a representation such that a linear classifier can recognize the target class. We formalize the former given the empirical evidence from recent research [40], but we present ablation studies in Section 4.

Representations of bounding boxes.

A simple approach to obtain the representation would be to extract all the bounding boxes and feed them through the network. However, this is computationally prohibitive, and we instead propose a method which reuses the feature grid present in ResNet50 models [28] illustrated in Figure 3.

Figure 3: Illustration of the feature grid output by a ResNet50. Object bounding boxes are mapped to feature columns that correspond to the center pixel of that bounding box. Correspondence is illustrated using matching colors.

Consider an image described by its height, width, and number of channels. A fully convolutional ResNet50 maps this image to a spatial grid of feature vectors. Given a bounding box, we represent it by the single feature vector (feature column) at the grid location that corresponds to the center of the box. This approach is conceptually similar to max pooling as used in Fast-RCNN [22]. Given the computational efficiency and the fact that the effective receptive field is concentrated at the center [50], we chose this simple alternative.
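The mapping from a bounding box to a feature column can be sketched as below. The box format (ymin, xmin, ymax, xmax in pixels), the function names, and the nested-list feature grid are assumptions for illustration only.

```python
def box_center_to_cell(box, image_hw, grid_hw):
    """Return the (row, col) of the feature grid cell containing the box center."""
    ymin, xmin, ymax, xmax = box
    center_y = (ymin + ymax) / 2.0
    center_x = (xmin + xmax) / 2.0
    # Rescale pixel coordinates to grid coordinates and clip to the last cell.
    row = min(int(center_y * grid_hw[0] / image_hw[0]), grid_hw[0] - 1)
    col = min(int(center_x * grid_hw[1] / image_hw[1]), grid_hw[1] - 1)
    return row, col


def box_representation(feature_grid, box, image_hw):
    """Pick the feature column at the box center from a grid[row][col] of vectors."""
    grid_hw = (len(feature_grid), len(feature_grid[0]))
    row, col = box_center_to_cell(box, image_hw, grid_hw)
    return feature_grid[row][col]
```

For example, with a 224 x 224 input and a 7 x 7 feature grid, a box whose center lies at pixel (32, 32) is represented by the feature column at grid cell (1, 1). All boxes in a frame share one forward pass, which is what makes this cheaper than cropping and re-encoding each box.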

Final loss function.

We combine the shot, frame, and object level losses into a single objective, using two positive coefficients to weigh the loss contributions (Equation 3).


This formulation enables a study of the benefits of each of the losses and leads to practical recommendations.
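A combined objective of this kind can be written as a weighted sum. The form below is a sketch under the assumption that the two coefficients scale the frame level and object level terms while the shot level term is unweighted; the symbol names $\lambda_{\text{frame}}$ and $\lambda_{\text{object}}$ are introduced here for illustration.

```latex
\mathcal{L}
  = \mathcal{L}_{\text{shot}}
  + \lambda_{\text{frame}}\,\mathcal{L}_{\text{frame}}
  + \lambda_{\text{object}}\,\mathcal{L}_{\text{object}},
\qquad \lambda_{\text{frame}},\, \lambda_{\text{object}} > 0 .
```

Under this assumption, setting $\lambda_{\text{object}} = 0$ recovers a two-level objective without object cues, which is what makes per-loss ablations straightforward.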

Headroom analysis using ImageNet.

In addition to the proposed method, we analyze the benefits of co-training the proposed network with a large labeled data source [68, 4]. This data source provides a vast quantity of labeled images and should help the model improve the performance on tasks which require fine-grained detail of specific object classes. In particular, we consider an affine map of the representation extracted by the network, followed by a softmax layer and a corresponding cross-entropy loss. This loss is added to the training objective, weighted by a hyperparameter that balances the impact of this additional loss.

4 Experimental setup

4.1 Architectures and training details

Unless otherwise specified, all experiments are performed on a ResNet50 V2 [28] with batch normalization. For the shot prediction function, we use an LSTM with 256 hidden units. We parameterize the critic function as a bilinear form. All frames are augmented using the same policy as [66], using random cropping, left-right flipping and color distortions. The coordinates of object bounding boxes are recalculated accordingly. All models are trained using batch size 512 for 120k iterations of stochastic gradient descent with momentum. The learning rate decreases by a factor of 10 after 90k and 110k training steps. When cotraining, we train for 100k iterations and decrease the learning rate after 70k, 85k and 95k iterations. Shots and frames are sampled using the same method as [68]: for each video, we sample a sequence of four shots, and we sample eight frames from each shot.
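The step-wise learning rate schedule can be sketched as a small helper. The initial learning rate is left as a parameter, and two details are assumptions of this sketch: the decay applies from the boundary step onward, and the cotraining schedule uses the same factor of 10 at each of its boundaries.

```python
def learning_rate(step, base_lr, boundaries=(90_000, 110_000), decay=0.1):
    """Piecewise-constant schedule: multiply by `decay` at each boundary passed."""
    lr = base_lr
    for boundary in boundaries:
        if step >= boundary:
            lr *= decay
    return lr


# Hypothetical cotraining variant with the boundaries mentioned in the text.
COTRAIN_BOUNDARIES = (70_000, 85_000, 95_000)
```

For instance, `learning_rate(100_000, base_lr)` returns one tenth of `base_lr` under the default boundaries, and passing `boundaries=COTRAIN_BOUNDARIES` gives the cotraining variant.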

The coefficients and weigh the loss contributions in Equation 3. We set following [68], and , although we found that a wider range of values leads to the same performance (cf. Figure 6 in the Appendix).

Cotraining details.

The experiments on cotraining use group normalization [76] with weight standardization [57], instead of batch normalization, for a fair comparison to [68]. When cotraining, we sample at every step a batch from each dataset: we compute the three-level loss (Equation 3) on the sampled videos, and the classification log-loss on the sampled ImageNet images. Cotrained models train with batch size of 512 for videos and 2048 for images for 100k iterations, using the learning rate schedule described above. Images are preprocessed using the inception crop function from [66].


We train on videos from the YT8M dataset and cotrain with ImageNet [9]. The videos are sampled at Hz and we run the detector, a MobileNet [61], with a single shot multi box detector [49], trained on OpenImagesV4 [1]. The detector runs at ms per frame on a V100 GPU. Table 6 in the appendix shows how often common objects are detected in the video frames. As the detector has been trained on OpenImagesV4, we use its 600 category label space for constructing positive and negative pairs for the object level loss. We use the feature grid from the ResNet block 4 to construct representations for objects in a frame and limit the number of objects in each frame to a maximum of . We discard objects with detection score below , which accounts for approximately % of the detected objects. Figure 5 shows a histogram of the detection scores. Finally, given that the YT8M dataset is a dynamic dataset, our video training set contains those videos still available in May 2020, for a total of million training and one million validation videos. The baselines were re-trained on this new dataset.
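The per-frame filtering described above (discarding low-score detections and capping the number of objects) could look like the following sketch. The function name, the dictionary layout of a detection, and the choice to keep the highest-scoring boxes when the cap is exceeded are assumptions for illustration; the score threshold and the cap are left as parameters.

```python
def select_detections(detections, min_score, max_objects):
    """Keep detections scoring at least `min_score`, at most `max_objects` of them."""
    kept = [d for d in detections if d["score"] >= min_score]
    # Assumed tie-breaking rule: prefer the most confident detections.
    kept.sort(key=lambda d: d["score"], reverse=True)
    return kept[:max_objects]
```

The surviving boxes and their class labels are what the object level loss uses to form positive and negative pairs.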

Method                       Dataset              Signal      VTAB  Natural  Specialized  Structured
Transitive Invariance [72]   YouTube 100k         Tracklets   44.2  35.0     61.8         43.4
MT [13]                      ImageNet & SoundNet  Tracklets   59.2  51.9     78.9         55.8
Supervised (ResNet50)        ImageNet             None        68.5  71.3     83.0         58.9
Detector backbone            OpenImagesV4         None        61.6  60.0     80.4         53.5
VIVI [68]                    YT8M                 None        60.9  55.0     79.5         56.7
OURS                         YT8M                 Detector    64.1  59.0     81.6         59.8
Boxes and labels at random   YT8M                 None        60.3  55.2     78.0         55.0
Boxes at random coordinates  YT8M                 Detector    63.4  57.5     81.1         59.7
Distilling from ImageNet     YT8M                 Classifier  63.1  59.6     81.6         57.0
Also predict cross entropy   YT8M                 Detector    64.9  60.5     81.3         60.5
Table 1: Evaluation on the Visual Task Adaptation Benchmark. Each number indicates the average classification accuracy over all data sets in the corresponding category. indicates a statistical significant difference between the baseline [68] and the proposed method. The last 4 methods show results of 4 ablation studies which investigate the benefits of the corresponding training signals.

4.2 Evaluation

We evaluate two aspects of the learned representations: how well they transfer to novel classification tasks, and how robust the resulting classifiers are to distribution shifts.


The main objective of this work is learning image representations that transfer well to novel, previously unseen tasks. To empirically validate our approach, we report the results on the Visual Task Adaptation Benchmark (VTAB), a suite of 19 image classification tasks [79]. The tasks are organized into three categories: Natural, containing commonly used classification datasets (Caltech101, Cifar-100, DTD, Flowers102, Pets, SUN397 and SVHN); Specialized, comprising images recorded with specialized equipment (Resisc45, EuroSAT, Patch Camelyon, Diabetic Retinopathy); and Structured, containing scene understanding tasks (CLEVR-dist, CLEVR-count, dSPRITES-orient, dSPRITES-pos, sNORB-azimuth, sNORB-elevation, DMLab, KITTI). For more details please refer to [79].


We consider transfer learning in the low data regime, when each task has only 1000 labeled samples available. The evaluation protocol is the same as in [68, 42, 79, 60]: for each dataset we (i) train on samples and validate on samples using our learned model as initialization, (ii) sweep over two learning rates (, ) and two learning rate schedules (10K steps with decay every 3K, or 2.5K steps with decay every 750), and then (iii) pick the learning rate and learning rate schedule according to the highest validation accuracy and retrain the model using all 1000 samples. We report statistical significance at the level on a Welch’s two-sided t-test based on 12 independent runs. The error bars in the diagrams indicate bootstrapped 95% confidence intervals.
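As a reminder of what the significance test computes, the sketch below evaluates Welch's t-statistic and the Welch-Satterthwaite degrees of freedom for two sets of run accuracies, using only the standard library. The function name is hypothetical; converting the statistic to a two-sided p-value requires the Student t distribution, which is left to a statistics package.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic and degrees of freedom for two independent samples."""
    # Squared standard errors of the two sample means (unequal variances allowed).
    se_a = variance(a) / len(a)
    se_b = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(a) - 1) + se_b ** 2 / (len(b) - 1)
    )
    return t, df
```

In the setting above, `a` and `b` would each hold the 12 accuracies obtained from independent runs of the two methods on one dataset.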


As discussed in Section 3, we were guided by the intuition that the model should learn to be more invariant to natural augmentations. We thus expect our model to be more robust and generalize better to out-of-distribution (OOD) images.

We follow two recent OOD studies [29, 11] and evaluate robustness as accuracy on a suite of 8 datasets measuring various robustness aspects. These datasets are defined on the ImageNet label space: (1) ImageNet-A [30] measures the accuracy on samples from the web that were adversarial to a ResNet50 trained from scratch on ImageNet. (2) ImageNet-C [29] measures the accuracy on samples from ImageNet under perturbations such as blur, pixelation, and compression artifacts. (3) ImageNet-V2 [59] presents a new test set for the ImageNet dataset. (4) ObjectNet [3] consists of images collected by crowd sourcing, where participants were asked to photograph objects in unusual poses and unusual backgrounds. (5-8) ImageNet-Vid, ImageNet-Vid-pm-k, YT-BB-Robust, and YT-BB-Robust-pm-k present frames from video sequences [63]. We measure both accuracy of the anchor frame, denoted as anchor accuracy, and worst-case accuracy in the 20 neighboring frames, denoted as pm-k.
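The difference between anchor accuracy and pm-k accuracy can be made concrete with a short sketch. It assumes that a video counts as correct under pm-k only if the anchor frame and all of its neighboring frames are classified correctly; the data layout and function name are hypothetical.

```python
def anchor_and_pmk_accuracy(videos):
    """Return (anchor accuracy, pm-k accuracy) over a list of videos.

    Each video is a dict with 'anchor' (bool, anchor frame classified
    correctly) and 'neighbors' (list of bool, one entry per neighboring frame).
    """
    n = len(videos)
    anchor_acc = sum(v["anchor"] for v in videos) / n
    # Worst case: a single misclassified neighbor makes the video incorrect.
    pmk_acc = sum(v["anchor"] and all(v["neighbors"]) for v in videos) / n
    return anchor_acc, pmk_acc
```

The gap between the two numbers measures how often a correct prediction on the anchor frame is broken by a nearby, visually similar frame.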

We also evaluate our models on the texture-shape data set from [20]. Our method uses a contrastive loss to learn specifically from objects. Learning with our loss encourages objects in different appearances to have similar representations. As such, we hypothesize that our models have higher shape bias, compared to texture bias. [20] provide a dataset to measure the texture-shape bias. The test set consists of images whose texture has been stylized. Each image has a label according to its shape, and a label according to the stylization of its texture. We report the fraction of correct predictions based on shape, as proposed by the authors. For further details we refer to the paper [20].
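One way to compute such a shape fraction is sketched below. It assumes the convention that only predictions matching either the shape label or the texture label are counted, and that the fraction is the share of those matching the shape label; the exact definition should be taken from [20]. The function name and list-based inputs are hypothetical.

```python
def shape_fraction(predictions, shape_labels, texture_labels):
    """Share of shape-consistent predictions among shape- or texture-consistent ones."""
    shape_hits = sum(p == s for p, s in zip(predictions, shape_labels))
    texture_hits = sum(p == t for p, t in zip(predictions, texture_labels))
    total = shape_hits + texture_hits
    # Return 0.0 when no prediction matches either label.
    return shape_hits / total if total else 0.0
```

A value above 0.5 means the classifier follows shape more often than texture on images where the two cues conflict.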

5 Results

5.1 Transferability

Table 1 shows our results on the Visual Task Adaptation Benchmark (VTAB). We observe statistically significant improvements over the baseline [68] which demonstrate the benefits of supplementing the self-supervised hierarchy with object level supervision. The detailed results are presented in Figure 4.

Figure 4: Relative increase in VTAB accuracy per dataset: blue for Natural, orange for Specialized and green for Structured datasets. Stars indicate statistical significance. Relative increase refers to the increase in accuracy by learning from objects, divided by accuracy of VIVI [68], which only learns at two levels of the hierarchy.
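The relative increase plotted in Figure 4 is a simple ratio. The sketch below applies it to the VTAB averages from Table 1 rather than to per-dataset accuracies, purely as an illustration; the function name is hypothetical.

```python
def relative_increase(ours, baseline):
    """Increase in accuracy from learning with objects, relative to the baseline."""
    return (ours - baseline) / baseline


# VTAB averages from Table 1: OURS 64.1 versus VIVI 60.9.
vtab_gain = relative_increase(64.1, 60.9)
```

On the Table 1 averages this gives a relative increase of roughly 5%.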

Rows 1 and 2 in Table 1 compare against two prior works on representation learning from videos: Transitive Invariance (TI) [72] and Multi-task Self-Supervised Visual Learning (MT) [13]. TI uses context based self-supervision together with tracking in videos to formulate a pretext task for representation learning and row 1 shows the performance of their pre-trained VGG-16 checkpoint. MT uses a variety of pretext tasks, including motion segmentation, coloring and exemplar learning [16] and row 2 shows the performance of their ResNet101 (up to block 3) checkpoint.

Ablation 1: Randomizing the location and the class.

The object level loss is made possible through additional supervision provided via an object detector pre-trained on OpenImagesV4. The detector contributes to representation learning by annotating object positions and object category labels in video frames, and here we ablate these two sources. We evaluate the performance (1) when the class of an object is known, but not its coordinates, and (2) when neither the class nor the location is known. The results are detailed in Table 1. Randomizing both the label and the coordinates of the objects destroys all signal from the detector. Row Boxes and labels at random shows the results of this ablation and we observe that the performance is below the VIVI baseline, as expected. In contrast, when we randomize the object locations, but maintain the correct labels, we obtain an improvement over the baseline (row Boxes at random coordinates). Interestingly, the VTAB score on structured datasets, 59.7%, nearly equals the accuracy where both the class and location are known (59.8%).

Ablation 2: Frame-level labels from an ImageNet-pretrained model.

We further investigate the effectiveness of knowing frame-level labels by obtaining soft-labels using an ImageNet-pretrained model, effectively distilling the ImageNet model on YT8M frames [32]. Its performance is noted in Table 1, row distilling from ImageNet. Interestingly, this distilled model scores higher in natural datasets, but lower in structured datasets than the proposed method. These differences show how various upstream signals affect different downstream tasks differently.

Ablation 3: Distilling the object detector.

We distill a ResNet50 on YT8M where the training instances are cropped objects and the labels are those assigned by the object detector. The distilled ResNet50 achieves a VTAB score of 57.1%, compared to 64.1% for the proposed method. At the same time, using a non-pretrained ResNet of the same capacity achieves % when trained on 1000 downstream labels. Hence, the detector clearly provides a strong training signal, but it can be exploited to a higher degree by coupling it with a self-supervised loss as in the proposed method.

Ablation 4: Semi-supervised learning.

One can also utilize the detector to label the frames and use the labeled data as additional training data [40]. To this end, we apply a linear classifier on the bounding box representations in order to classify the object as one of 600 OpenImagesV4 classes. The predictions of the object detector are treated as ground truth labels and a binary cross-entropy loss is added to the loss in Equation 3. This approach increases the VTAB score from 64.1% to 64.9% (Table 1, row Also predict cross entropy). We also investigated using this loss as a replacement for the object level loss in Equation 3. However, this performed worse, scoring %, which highlights the advantage of the contrastive formulation.

Method k k k
VIVI 39.1 49.0 58.9 59.6
OURS 39.8 54.5 63.4 66.0
Table 2: Fraction of objects whose nearest neighbor in representation space is an object with the same class label for increasing number of training steps. indicates a statistically significant difference at using Fisher’s exact test for all objects in batches of videos. As training progresses, our method has significantly more objects with matching neighbors than the vanilla VIVI model.
Method mAP(%)
Our method () 40.4
VIVI () 35.1
Only object level () 39.3
Table 3: RetinaNet performance on the MS-COCO dataset using various pre-trained backbones.

Effect of the contrastive loss.

Lastly, we present a diagnostic for our training procedure. The object level loss is designed to embed objects of the same class closer together. We verify whether this is indeed the case, in comparison to VIVI, by measuring the fraction of nearest neighbors per object that belong to the same category. Table 2 shows the progression of this metric during training. Our method results in significantly more nearest neighbors belonging to the same class as the query object. This verifies that our loss function and training procedure achieve the desired outcome.
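The nearest-neighbor diagnostic can be sketched in a few lines. Squared Euclidean distance in representation space is an assumption of this sketch, as are the function name and the list-based inputs.

```python
def same_class_nn_fraction(embeddings, labels):
    """Fraction of objects whose nearest neighbor carries the same class label."""
    hits = 0
    for i, query in enumerate(embeddings):
        # Nearest other object under squared Euclidean distance.
        nearest = min(
            (j for j in range(len(embeddings)) if j != i),
            key=lambda j: sum((x - y) ** 2 for x, y in zip(query, embeddings[j])),
        )
        hits += labels[nearest] == labels[i]
    return hits / len(embeddings)
```

A representation that clusters objects by class yields a value close to 1, while a representation that ignores class membership yields a value near the chance level given by the class frequencies.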

Evaluation on detection.

Our model learns from videos at the object level. It is natural to expect that a ResNet50 backbone pre-trained using our method will perform well when fine-tuned for downstream object detection. To this end, we fine-tune a RetinaNet architecture [47] on the MS COCO object detection dataset [48]. Images are rescaled and randomly cropped during training. We train the model for 60 epochs with batch size 256.

Results are shown in Table 3. Pre-training using our method improves upon the VIVI baseline by 5.3 mAP points (40.4% versus 35.1%). Training on only the object level loss is 1.1 mAP points behind using all three levels of the hierarchy (39.3% versus 40.4%). These results suggest that the learned representations are indeed more object centric and that learning from all three levels combined yields representations more effective for downstream object detection.

Co-training with ImageNet.

Table 4 shows the resulting accuracies on VTAB when cotraining with ImageNet. Compared to the cotrained VIVI baseline, our method with its object-level loss increases the VTAB score from 69.0% to 69.4%. This increase in accuracy is modest in comparison to those in Table 1. We argue that ImageNet is a clean curated dataset whereas YT8M is noisy. Adding cotraining with clean ImageNet improves the accuracy on natural datasets from % to %. It is not surprising that adding more noisy supervision, at the object level, does not give massive gains in this setting. We repeat the experiment with a higher capacity, 3x wider ResNet50. Again we observe modest, but statistically significant, improvements over VIVI. The largest improvement is on the structured datasets, which increase from 62.2% to 63.0%. These experiments highlight an interesting dichotomy between natural and structured subsets of VTAB: learning with ImageNet yields improvements on natural datasets, while using the detector yields improvements for structured datasets.

Method              VTAB  Natural  Spec.  Struc.
VIVI [68]           69.0  70.3     82.7   60.9
OURS                69.4  70.8     82.9   61.4
3x wider ResNet-50
VIVI [68]           70.2  71.4     83.7   62.2
OURS                70.5  71.6     83.6   63.0
Table 4: VTAB scores of the models cotrained on YT8M and ImageNet. The presented numbers are the average image classification accuracy of the fine tuned models over the respective VTAB category. notes a statistically significant difference between VIVI and our method.

5.2 Robustness

Table 5 presents the classification accuracies on the eight robustness datasets. To get predictions in the ImageNet label space, we fine tune our learned representation and report results in row fine tuning. Our method compares favorably to the baseline on all datasets, which confirms the intuition that extending the video-shot-frame hierarchy to objects results in more robust image representations. The robustness results for the cotrained models are presented in Table 5, row cotraining. As expected, the results improve across all datasets. The final two columns of Table 5 note the delta between anchor accuracy and pm-k accuracy, where in three out of four cases our method scores a lower (better) delta.

Model  Method       ImageNet [10]  ImageNet-A [30]  ImageNet-C [29]  ImageNet-V2 [59]  ObjectNet [3]  ImageNet-Vid [63]  ImageNet-Vid-pm-k [63]  YT-BB [63]  YT-BB-pm-k [63]  Delta ImageNet-Vid  Delta YT-BB
VIVI   Fine tuning  62.6           0.5              6.8              51.1              16.2           57.9               36.5                    58.0        39.9             21.4                18.1
OURS   Fine tuning  65.2           0.6              9.5              53.4              18.4           61.7               43.4                    60.8        42.3             18.3                18.6
VIVI   Cotraining   73.1           1.1              24.3             59.8              20.9           58.7               41.7                    49.4        35.4             17.0                14.0
OURS   Cotraining   73.3           1.2              24.4             60.8              21.0           59.7               44.0                    50.0        37.6             15.7                12.4
Table 5: Accuracy on robustness datasets from literature. These datasets are typically a perturbed version of ImageNet-like images and videos. Each dataset indicates a specific aspect of robustness. Higher accuracy corresponds to better robustness. A lower delta between anchor accuracy and pm-k accuracy corresponds to better robustness. Lin. eval refers to a model trained on YT8M, with a post-hoc linear layer trained on the ImageNet training dataset. Cotraining refers to a model trained on YT8M and ImageNet.

We evaluate our models on the texture-shape data set from [20]. As the evaluation is done using the ImageNet label space, we use the same models that we evaluated on the robustness datasets. First, we evaluate the models that trained an additional linear layer on the ImageNet training set. The VIVI model, using only the video-shot-frame hierarchy, scores shape fraction on the provided dataset. Using our method to learn from the video-shot-frame-object hierarchy, the shape fraction increases to . A higher shape fraction indicates a better model, as the network has higher relative accuracy according to the shape of the object. Similarly, cotrained models improve from to when using our method to learn from objects in video. These results indicate a promising direction for future research.

6 Discussion

We have presented a hierarchy, videos-shots-frames-objects, to learn representations from video at multiple levels of granularity. Through our method, the learned representations transfer better to downstream image classification tasks and exhibit higher accuracy on out-of-distribution datasets. We identify three aspects for future research.

A taxonomy for learning transferable representations.

Our results show that each learning signal present in videos benefits transfer learning to Natural, Specialized or Structured image classification tasks in a specific manner. We consider our work part of a larger line of research that creates a taxonomy for indexing learning methods and their effect on transfer learning to specific datasets, similar to [78], which outlined a taxonomy of multimodal learning. To give examples: we have evaluated our method on the VTAB benchmark, which consists of Natural, Specialized and Structured image classification tasks. Using the noisy videos from YT8M mainly improves transferability to Specialized and Structured tasks. Using the clean images from ImageNet improves transferability to Natural tasks. Our method, which receives implicit supervision from OpenImagesV4, shows the highest improvement on Natural tasks. Thus, different sources of supervision improve transferability to different tasks. We believe that understanding how different data and learning methods improve performance on different groups of datasets is a central research question in transfer learning today, and that this work contributes to this grand challenge by providing insight into the benefits of learning from uncurated video data.

Learning about objects without labels.

Our method uses an off-the-shelf detector to identify the objects. As the detector was trained on labeled data, learning at the object level of the hierarchy uses implicit supervision. Contemporary literature has focused on other self-supervised methods to improve learning from video. For example, one could derive signals about objects using optical flow or keypoint detection [35, 36]. Combining these ideas with our paradigm of learning in the hierarchy might provide a useful research direction.

Learning about entire videos instead of image representations.

Our method improves transfer learning and robustness of image models for single images. This improvement raises the question of how these results will translate to video understanding. Recently, there has been interest in video recognition [39] and video action localization [82]. We look forward to testing our learning methods on these tasks.

Improved robustness from learning about objects.

We have shown that our method results in more robust image classifiers. This observation suggests that learning about objects, invariant to other parts of the image, improves robustness. Several computer vision tasks concern objects. Therefore, we suggest that object-centered representations will contribute to developments in robustness.


Appendix A Statistics on the annotated YT8M

This section reports statistics on YT8M annotated with the object detector [1]. We annotate each frame of YT8M with the object detector and store the five objects with the highest detection scores. Our method relies on objects recurring within a video, and it works better when objects occur multiple times in the selected frames. Therefore, Table 6 displays statistics for objects that occur in most videos. For each object, we count how often the object recurs in the frames sampled with the strategy from [68]. For example, in percent of videos, an object with class Footwear occurs. Each of those videos has, on average, instances of the Footwear class.
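The two quantities described here (the percentage of videos in which a class occurs, and the average number of instances in those videos) can be computed roughly as follows. This is a minimal sketch: the nested-list input format, the function name and the toy annotations are assumptions for illustration, not the pipeline used to produce Table 6.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def recurrence_stats(
    videos: List[List[List[str]]],
) -> Dict[str, Tuple[float, float]]:
    """Per-class recurrence statistics.

    videos[v][f] holds the class names detected in sampled frame f of
    video v. Returns class -> (percent of videos containing the class,
    mean number of instances in the videos that contain it).
    """
    videos_with = defaultdict(int)  # number of videos containing the class
    instances = defaultdict(int)    # total instances over those videos
    for frames in videos:
        counts = defaultdict(int)
        for labels in frames:
            for label in labels:
                counts[label] += 1
        for label, n in counts.items():
            videos_with[label] += 1
            instances[label] += n
    return {
        label: (
            100.0 * videos_with[label] / len(videos),
            instances[label] / videos_with[label],
        )
        for label in videos_with
    }


# Hypothetical annotations for four short videos.
toy_videos = [
    [["Person", "Footwear"], ["Person"]],
    [["Footwear"], []],
    [["Person"], ["Person"], ["Person"]],
    [["Tree"]],
]
for name, (pct, rec) in sorted(recurrence_stats(toy_videos).items()):
    print(f"{name}: {pct:.0f}% of videos, {rec:.1f} instances on average")
```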

We discard objects with a low detection score. Figure 5 shows the fraction of boxes below a certain threshold. All methods in this work use a threshold of , which discards about 3 percent of the objects. We experimented with higher thresholds, but this resulted in worse VTAB scores.
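The selection described above (keep at most the five highest-scoring boxes per frame, drop boxes below a score threshold) and the cumulative fraction plotted in Figure 5 can be sketched as below. The function names, the `(label, score)` box format and the threshold of 0.3 are placeholders chosen for this example; they are not the values or code used in our experiments.

```python
from typing import List, Sequence, Tuple

Box = Tuple[str, float]  # (class label, detection score); assumed format


def keep_detections(boxes: Sequence[Box], min_score: float, top_k: int = 5) -> List[Box]:
    """Drop boxes scoring below min_score, then keep the top_k highest."""
    kept = [box for box in boxes if box[1] >= min_score]
    kept.sort(key=lambda box: box[1], reverse=True)
    return kept[:top_k]


def fraction_below(scores: Sequence[float], threshold: float) -> float:
    """Fraction of detection scores strictly below threshold."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(score < threshold for score in scores) / len(scores)


# Hypothetical detections for a single frame.
toy_boxes = [
    ("Person", 0.9), ("Tree", 0.1), ("Car", 0.5), ("Dog", 0.4),
    ("Cat", 0.35), ("Hat", 0.8), ("Shoe", 0.6),
]
PLACEHOLDER_THRESHOLD = 0.3  # illustrative only, not the paper's value
print(keep_detections(toy_boxes, PLACEHOLDER_THRESHOLD))
print(fraction_below([score for _, score in toy_boxes], PLACEHOLDER_THRESHOLD))
```

Evaluating `fraction_below` over a grid of thresholds yields the kind of cumulative curve shown in Figure 5.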

Figure 5: Cumulative histogram of the detection scores from the object detector. Histogram measured on the videos from YT8M, annotated with our detector [1]. In our experiments, we exclude boxes with scores below , which applies to approximately 3 percent of the objects.
Label name Videos (%) Recurrence
Street light
Land vehicle
Human face
Table 6: Recurrence of objects within the 32 frames sampled for learning from one video. For example, on average, % of the videos contain an object labeled Person. In each video where a Person occurs, the detector annotated an average of instances. We show averages over ten thousand videos that we randomly sampled from the training set.
Figure 6: VTAB scores on the respective validation sets when changing the weight for the object-level loss. The optimum accuracy occurs at , which is the value we use in all experiments. The VTAB scores drop away from the optimum, but remain relatively stable compared to the baseline (see Table 1). The error bars indicate bootstrapped 95% confidence intervals.

Appendix B Sensitivity to hyperparameters

Our experiments use three important hyperparameters. We used the validation sets from the VTAB benchmark to set them. This section shows the sweeps we ran, so that one can judge the sensitivity to each hyperparameter. Figure 6 shows the search for hyperparameter from Equation 3. Figure 7 shows the search for a positive coefficient to include the cross entropy loss in the experiment for Table 1, row Also predict cross entropy. Figure 8 shows the search for a positive coefficient for the cross entropy loss when learning from the soft labels from ImageNet for Table 1, row Distilling from ImageNet.

Figure 7: VTAB scores on respective validation sets when changing the weight for the additional supervised loss on the objects. The optimum accuracy occurs at , which is the value we use in the ablation experiment. The error bars indicate bootstrapped 95% confidence intervals.
Figure 8: VTAB scores on respective validation sets when changing the weight for cross entropy loss on the soft labels from an ImageNet model. The optimum accuracy occurs at , which is the value we use in the ablation experiment. The error bars indicate bootstrapped 95% confidence intervals.
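For readers unfamiliar with the error bars in Figures 6 to 8, the sketch below shows one common way to obtain a bootstrapped 95% confidence interval for a mean score, the percentile bootstrap. The text does not specify which quantities are resampled or which bootstrap variant is used, so the choice of resampling a list of scores, the function name and the toy scores are all assumptions.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def bootstrap_ci(
    values: Sequence[float],
    n_resamples: int = 2000,
    level: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of values."""
    if not values:
        raise ValueError("values must not be empty")
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lower_index = int((1 - level) / 2 * n_resamples)
    upper_index = int((1 + level) / 2 * n_resamples) - 1
    return means[lower_index], means[upper_index]


# Hypothetical validation scores, e.g. one per task or per run.
toy_scores = [61.2, 63.5, 64.1, 62.8, 65.0, 63.3, 62.1, 64.7]
low, high = bootstrap_ci(toy_scores)
print(f"mean={mean(toy_scores):.2f}, 95% CI=[{low:.2f}, {high:.2f}]")
```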

Columns, left to right: All tasks (mean over the 19 tasks), the Natural, Specialized and Structured group means, then the 7 Natural, 4 Specialized and 8 Structured tasks.

Comparison
ResNet50 from scratch 42.1 26.9 65.8 43.6 37.7 11.0 23.0 40.2 13.3 3.9 59.3 63.1 84.8 41.6 73.5 54.8 38.5 35.8 37.3 87.9 20.9 36.9 36.9
Transitive Invariance [72] 44.2 35.0 61.8 43.3 54.9 7.1 38.3 28.2 32.3 7.4 77.0 63.1 84.1 50.0 50.0 61.7 12.7 35.0 59.3 86.1 21.1 29.2 41.6
MS [13] 47.2 33.4 68.4 47.9 52.3 12.7 37.3 32.6 15.8 6.8 81.8 57.3 89.7 49.7 76.8 55.7 43.2 38.4 46.4 81.2 34.8 35.1 48.4
MT [13] 59.2 51.9 78.9 55.8 76.2 26.2 49.3 63.5 48.5 10.6 89.1 71.7 93.3 70.2 80.3 62.1 55.6 44.3 43.2 86.6 39.1 38.9 76.3
MobileNetV2 65.9 69.5 81.9 54.8 88.5 45.4 59.1 87.3 86.7 32.0 87.3 71.1 94.3 80.5 81.6 55.8 44.8 46.6 51.6 90.0 37.5 38.7 73.4
ImageNet supervised 68.5 71.3 83.0 58.9 84.7 60.0 68.2 87.3 90.3 36.1 72.2 75.4 95.2 81.4 80.0 57.7 73.9 45.6 59.7 88.5 29.1 34.2 82.3
ImageNet supervised (3x) 69.5 72.6 83.8 59.5 85.6 61.0 69.6 88.8 90.9 37.4 75.0 78.0 95.7 82.5 78.9 61.4 64.6 45.3 60.5 93.2 32.9 36.6 81.5
Detector backbone [1] 61.6 60.0 80.4 53.5 84.3 38.2 48.4 77.4 58.6 25.2 88.1 70.6 94.0 73.4 83.5 58.2 42.8 47.8 46.4 73.4 39.4 42.9 77.4
BigBiGAN [15] 59.1 56.6 79.1 51.3 80.8 39.2 56.6 77.9 44.4 20.3 76.8 69.3 95.6 74.0 77.4 55.6 53.9 38.7 46.7 70.6 27.2 46.3 71.4

Video only
VIVI [68] 60.8 55.1 80.0 56.3 74.8 29.2 48.6 76.9 54.8 13.6 87.6 71.4 94.4 74.1 80.1 59.0 54.0 47.1 50.9 91.7 37.0 42.4 68.2
OURS 64.0 58.9 81.8 59.7 81.5 35.9 51.6 76.9 60.1 17.1 89.3 72.7 94.7 76.9 82.7 62.6 61.5 50.8 53.3 92.2 41.5 39.0 76.6
Rand boxes and labels 60.3 55.1 80.0 54.9 75.7 28.2 49.6 76.7 53.1 14.7 88.0 71.3 93.8 74.0 80.8 60.8 55.7 34.5 50.7 94.0 37.2 37.0 69.5
Rand boxes 63.4 57.5 81.1 59.7 79.4 31.4 51.3 77.0 58.8 16.2 88.5 72.2 94.3 74.5 83.5 61.3 60.0 48.0 52.2 95.0 40.6 42.1 78.2
Distilling from ImNet 63.1 59.6 81.5 56.9 81.3 35.4 58.4 75.5 54.3 24.9 87.5 73.0 95.4 75.8 82.0 61.0 50.0 47.0 50.7 89.3 36.5 41.8 79.3
Include CE loss 64.9 60.5 81.3 60.5 83.9 38.9 55.2 76.2 59.2 20.7 89.3 71.2 94.7 77.0 82.3 63.7 65.6 50.8 52.8 94.2 34.7 40.4 81.7
Distilling detector 57.1 52.2 77.2 51.3 78.2 29.6 49.1 56.9 47.2 21.0 83.5 70.0 91.8 66.3 80.5 58.1 44.9 41.4 44.9 77.5 30.9 32.4 80.2
Filter half of detections 61.9 57.6 80.3 56.4 79.7 31.8 50.4 78.3 59.2 14.5 89.2 71.0 94.3 74.9 81.0 58.4 50.6 48.4 52.2 90.9 34.9 41.6 74.4

Cotraining
VIVI [68] 69.0 70.0 83.5 60.8 87.2 51.5 64.6 88.2 85.9 32.3 79.9 72.5 95.5 81.0 84.9 61.2 74.6 44.7 61.9 90.6 29.4 43.7 80.5
OURS 69.5 70.7 83.2 61.5 88.0 53.1 64.9 88.1 86.3 33.5 81.0 72.2 94.5 80.5 85.6 56.5 79.3 46.6 60.5 92.7 28.6 45.2 82.3
VIVI (3x) [68] 70.5 72.6 83.8 62.0 88.0 54.3 69.4 89.6 87.9 34.6 84.2 72.9 95.3 82.3 84.9 58.3 74.5 46.3 67.8 92.1 33.1 44.1 80.2
OURS (3x) 70.8 72.2 83.4 63.4 87.1 55.3 67.8 90.0 87.7 35.6 81.6 72.0 95.0 82.4 84.1 63.2 80.5 47.3 66.0 87.8 33.9 46.0 82.6
Table 7: VTAB accuracies for each method and dataset considered in our work. Each number represents the accuracy after transferring the model learned with the method to the specific dataset. Each dataset has only 1000 labeled samples. We follow the transfer protocol from [60].
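As a worked check on how the rows of Table 7 can be read, the sketch below recomputes the summary numbers for the ResNet50 from scratch row. It assumes that the first four values of a row are the mean over all 19 tasks followed by the Natural, Specialized and Structured group means, and that the remaining 19 values are the per-task accuracies in that group order (7, 4 and 8 tasks). This layout is inferred from the numbers and is not stated in the caption; the variable names are illustrative.

```python
from statistics import mean

# Per-task accuracies of the "ResNet50 from scratch" row, split under the
# assumed 7 / 4 / 8 grouping into Natural, Specialized and Structured.
natural = [37.7, 11.0, 23.0, 40.2, 13.3, 3.9, 59.3]
specialized = [63.1, 84.8, 41.6, 73.5]
structured = [54.8, 38.5, 35.8, 37.3, 87.9, 20.9, 36.9, 36.9]

# Summary values reported at the start of the same row.
reported = {"all": 42.1, "natural": 26.9, "specialized": 65.8, "structured": 43.6}

computed = {
    "all": mean(natural + specialized + structured),
    "natural": mean(natural),
    "specialized": mean(specialized),
    "structured": mean(structured),
}

for key, value in computed.items():
    # Each recomputed mean agrees with the reported value up to rounding.
    print(f"{key}: computed {value:.2f}, reported {reported[key]}")
```

Under this reading, the All tasks column is an unweighted mean over the 19 tasks, so the Structured group with 8 tasks contributes more to it than the Specialized group with 4 tasks.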