Supervised deep learning necessitates the collection and manual annotation of large amounts of data, which is often expensive, hard to scale, and may require domain expertise (e.g., in the context of medical data). Expensive data annotation hence presents a bottleneck which impedes the application of deep learning methods to diverse, previously under-explored problems. Learning transferable visual representations, namely representations obtained by training a model on one task (or collection of tasks) which can then be used as a starting point for multiple unseen downstream tasks using few samples, is therefore a key research challenge.
An emerging body of work based on self-supervision has demonstrated that it is possible to learn such transferable visual representations. The idea is to carefully construct a pretext task which does not rely on manual annotation, yet encourages the model to compute useful features of the input. Videos are a rich source of such pretext tasks as they capture the variations of instances over time which are not present in images. In addition, there is an abundance of videos available on the Internet covering almost any imaginable domain. As a result, and with the recent emergence of research video data sets, videos have been investigated in the context of self-supervision (for example, [37, 60, 59, 27, 61, 69, 38, 48, 39, 3, 2]). We believe that a holistic approach which captures these diverse efforts can be coupled with image-based pretext tasks to further improve the performance of self-supervised models.
In this work we propose a versatile video-based self-supervision framework for learning image representations. We divide a video data set into its natural hierarchy of frames, shots, and videos. The intuition is that the model can leverage (1) the frames to learn to be robust to color perturbations or contrast changes, (2) the shot information to be robust to rigid and non-rigid transformations of objects in a scene, and that (3) explicitly accounting for the video-level context should encourage the model to capture semantic relationships of scenes across shots/clips. In contrast to individual frame, shot, or video-level self-supervision objectives, our holistic approach can yield a representation that transfers better to a large set of downstream tasks. As an additional benefit, our approach does not need to pre-compute optical flow or motion segmentation masks, nor does it rely on object tracking.
In contrast to most previous work, our goal is to learn feature representations for downstream image classification as opposed to action recognition. We train the proposed model on the yt8m data set (without using video-level labels) and show that this approach leads to state-of-the-art self-supervised results on the 19 diverse downstream tasks of the vtab . We then show how to co-train the model jointly with labeled images, outperforming an ImageNet-pretrained ResNet-50 with fewer labeled images. We also investigate the robustness of our co-training models to natural perturbations as induced by the variations across nearby frames in videos .
In summary, our contributions are:
We propose a versatile framework to learn image representations from non-curated videos by learning frame, shot, and video-level invariances.
We train a variety of models on M videos from the yt8m data set and achieve a % absolute improvement over image/frame-based baselines across the 19 diverse tasks of the vtab benchmark, which sets a new state of the art among unsupervised methods.
We augment the ssl training framework with a supervised classification loss using data from ImageNet. The resulting models outperform an ImageNet-pretrained network using only 10% labeled ImageNet images (and no additional unlabeled ones), and achieve a new state of the art when co-trained with the full ImageNet data set, outperforming the best previous supervised result by points.
2 Related work
Self-supervised learning of image representations
ssl is an active topic of research in the computer vision community. Recent methods [63, 24, 4, 42, 23, 56] have advanced the state of the art in terms of learning representations that can linearly separate between the 1000 ImageNet categories. Prior work has explored diverse self-supervision cues such as spatial context [67] and equivariance to transformations [17, 41], alongside unsupervised techniques such as clustering [6, 68], generative modelling [13, 31], and exemplar learning.
Learning image representations from videos
More relevant to our contribution is the body of literature on ssl of image representations from videos. The temporal context of frames in video data has been widely exploited. For example, [37, 34, 15, 5, 60] make use of the order in which frames appear in a video to learn representations. Other forms of temporal context include its combination with spatial context, and the use of spatio-temporal co-occurrence statistics. Orthogonal to these efforts, which attempt to be selective of the differences between frames, prior work along the lines of slow feature analysis [61, 69] also exploited videos as a means of learning invariant representations. Temporal coherence was exploited in a co-training setting by early work on learning cnn for visual object recognition and face recognition. Slow and steady feature analysis attempts to learn representations that exhibit higher-order temporal coherence. This object deformation signal can be separated from global camera motion by tracking objects using unsupervised methods. These tracked patches have been used to learn image representations. Tracking in this context may be replaced by spatio-temporally matched region proposals.
Some of the earliest work making use of temporal consistency used future frame prediction as a pretext task. A more challenging version of this task is single-frame future synthesis. The ambiguity in single-frame prediction has been side-stepped via time-agnostic prediction, motion segmentation, cross-pixel matching, and by giving the model a motion cue as input. The latter two require distilling the temporal information from video pixels into optical-flow fields.
Optical flow has been treated as a separate modality from the RGB pixels in a multi-modal setting [48, 56]. Beyond optical flow, videos as found on the Internet are inherently multi-modal, as they contain audio as well as subtitles. Thus relevant here are multi-modal learning methods that combine vision and audio [39, 8, 43, 3], and vision and text, to achieve better performance than their uni-modal baselines. In a robotics setting, RGB pixels may be considered together with ego-motion [2, 30]. Time-contrastive networks consider two views of the same action to learn view-invariant representations, also applied in a robotics setting.
Doersch et al. show that motion-based ssl may be combined with other self-supervision cues, namely exemplar, colorization, and spatial-context, to pre-train models that perform better than each of these cues individually. Taking inspiration from their success, our framework presents a synergistic combination of ssl methods.
Fine-tuning models trained on ImageNet labels is a popular strategy for transferring representations to new tasks. Kornblith et al. show that better supervised models tend to transfer better when fine-tuned. Other supervised learning benchmarks focus on performance on multiple data sets, either via transfer learning, meta-learning, or multi-task learning [46, 57]. In the representation learning literature, models are usually evaluated in-domain, typically on ImageNet [66, and references therein]. However, self-supervised models are now performing well on tasks such as surface normal estimation, detection, and navigation. The vtab benchmark evaluates the transferability of representations beyond object classification in the natural image domain to many domains and task semantics, such as counting and localization. Similarly, recent developments in nlp have led to representations that transfer effectively to many diverse tasks.
3 Learning video-induced visual invariances
We consider a data set containing $N$ videos, each composed of multiple shots. For simplicity of exposition we assume that each video consists of $K$ shots, and each shot has $F$ frames. If we denote the $j$-th frame in the $k$-th shot of video $i$ by $x_{i,k,j}$, we can write the data set as $\{x_{i,k,j}\}_{i,k,j}$. Our framework consists of a frame encoder $f$, a frame embedding pooling function $P$, and one or multiple shot-level prediction functions. The embedding $e_{i,k}$ of the $k$-th shot in video $i$ is computed by feeding each frame through the frame encoder and applying the pooling function,
$$e_{i,k} = P\big(f(x_{i,k,1}), \ldots, f(x_{i,k,F})\big). \quad (1)$$
The pooling function can have different forms, ranging from simple average pooling to attention pooling taking the values of the individual frame embeddings into account. Shot-level prediction functions are trained to predict pretext (label-free) targets from shot embeddings.
More formally, to learn invariances at different levels of abstraction, we define a frame/shot-level loss and a video-level loss. The frame/shot-level loss takes the form
$$\mathcal{L}_S = \frac{1}{NK} \sum_{i=1}^{N} \sum_{k=1}^{K} \ell_S\big(e_{i,k}, y^S_{i,k}\big), \quad (2)$$
where the $y^S_{i,k}$ are shot-level pretext labels and $\ell_S$ is a shot-level loss that can be instantiated as only acting on the frame level in the sense of decomposing into a sum over the frames (see Sec. 3.2 for concrete instantiations of losses). The video-level loss is given by
$$\mathcal{L}_V = \frac{1}{N} \sum_{i=1}^{N} \ell_V\big(e_{i,1}, \ldots, e_{i,K}, y^V_i\big), \quad (3)$$
where the $y^V_i$ are video-level pretext labels and $\ell_V$ is a video-level loss (see Sec. 3.3 for concrete losses). The total loss is then given by $\mathcal{L} = \mathcal{L}_S + \lambda \mathcal{L}_V$, where $\lambda > 0$ balances the shot-level and video-level losses. $\mathcal{L}$ is minimized jointly w.r.t. the parameters of $f$, $P$, and the prediction functions.
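To make the pipeline concrete, the shot-embedding computation and loss combination can be sketched in a few lines of NumPy. The toy dimensions, the linear "encoder", and all function names below are our own illustrative choices, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (our choice): N videos, K shots, F frames per shot,
# frames flattened to D-dim vectors, embeddings of size E.
N, K, F, D, E = 3, 2, 4, 8, 5
frames = rng.normal(size=(N, K, F, D))
W = rng.normal(size=(D, E)) * 0.1  # stand-in for the frame encoder's weights

def encode_frame(x):
    """Frame encoder: a single linear map + nonlinearity instead of a ResNet."""
    return np.tanh(x @ W)

def pool_shot(frame_embeddings):
    """Pooling function: parameter-free average pooling over the frame axis."""
    return frame_embeddings.mean(axis=0)

# Shot embedding: pool the encoded frames of each shot.
shot_emb = np.stack([
    np.stack([pool_shot(encode_frame(frames[i, k])) for k in range(K)])
    for i in range(N)
])

def total_loss(shot_loss, video_loss, lam=1.0):
    """Total objective: shot-level term plus a weighted video-level term."""
    return shot_loss + lam * video_loss
```

In the real model the encoder is a ResNet-50 and the shot- and video-level loss values come from the pretext tasks described in the following subsections.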
3.1 Co-training with labeled images
We also consider the case where one has access to a limited number of labeled images in addition to the video data. Combining image-based ssl losses with a supervised loss applied to a subset of the images was studied previously; this approach was found to lead to state-of-the-art semi-supervised models, and to improve the performance of supervised models when all images are labeled. Here, we consider the related setup where the ssl loss is computed on video data and the supervised loss is based on image data from a different data set. Specifically, we additionally apply the frame encoder followed by a linear classifier to mini-batches of labeled images and compute the cross-entropy loss $\mathcal{L}_{\text{sup}}$ between the predictions and the image labels. The total loss is then computed as $\mathcal{L} + \mu \mathcal{L}_{\text{sup}}$, where $\mathcal{L}$ is the self-supervised loss defined above and $\mu$ balances the contributions of the self-supervised and supervised loss terms.
3.2 Learning shot-level invariances
To define the frame/shot-level loss, we propose to build on any ssl loss designed for images, such as classifying exemplars, solving jigsaw puzzles of image patches, or rotation prediction. For learning shot-induced invariances, one can take two approaches:
(i) apply the image-based ssl loss independently to each frame, so that the shot-induced invariances are learned implicitly through the combination of the pooling function and the video-level prediction task, or
(ii) explicitly ensure that the embeddings of frames from the same shot are similar by adding a triplet or a contrastive loss to the image-based ssl loss.
In this work, in the spirit of approach (i), we consider ssl by rotation prediction without an additional explicit shot-level loss. To explore approach (ii), we rely on a variant of exemplar ssl, where each image is associated with a different class, and a feature extractor is trained to classify each image into its own class after heavily augmenting it (random cropping, rotation, contrast, and color shifts). Following [11, 32], to scale this approach to hundreds of millions of images (frames), we employ a triplet loss encouraging augmentations of the same image to be close and augmentations of different images to be far apart. To learn invariances from different frames of the same shot, rather than picking a random frame from the shot and applying random augmentations to it, we pick consecutive frames from the same shot and augment each frame once. As a result, our feature extractor learns both the invariances induced by temporal variation in video and those induced by the data augmentation.
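A toy sketch of the triplet objective behind this exemplar variant follows. The margin value and the two-dimensional embeddings are hypothetical; in the actual model, the anchor and positive are embeddings of augmented consecutive frames of the same shot, and the negative comes from a different shot:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Standard triplet loss: pull frames of the same shot together and
    push frames of other shots away by at least `margin` (our choice)."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

# Embeddings of two consecutive frames of one shot (anchor/positive)
# and one frame of a different shot (negative) -- toy numbers.
anchor   = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])   # nearby frame of the same shot
negative = np.array([-1.0, 0.5])  # frame of another shot

loss = triplet_loss(anchor, positive, negative)  # already satisfied: 0.0
```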
3.3 Learning video-level invariances
In contrast to action recognition networks, which learn video representations that have to be discriminative w.r.t. changes between frames, our framework targets representations that are invariant to such changes. Nevertheless, discriminative tasks useful for learning action recognition representations, such as predicting whether a sequence of frames is played forward or backward, verifying whether the frames are ordered or shuffled, or predicting features corresponding to future frames, can be useful for learning abstract transferable representations when applied to sensibly chosen groups of aggregated frames. Following this intuition, our framework allows applying any of these tasks to shot embeddings, rather than individual frame embeddings. For example, determining whether a sequence of shot embeddings is played forward or backward requires understanding of the high-level semantics of the scene and objects in each shot. Similarly, predicting future shot embeddings from past ones encourages learning an abstract summary of each shot. In this work we explore exactly these two approaches.
For shot order prediction, we randomly reverse the order of the shot embeddings and train a prediction function to predict the shot order from the concatenated shot embeddings, i.e., $\ell_V$ in (3) is the cross-entropy loss and $y^V_i$ is $1$ if the sequence of shot embeddings is reversed and $0$ otherwise. To train the model to predict future shot embeddings, we rely on noise-contrastive estimation. Specifically, we use the embeddings of the shots $e_{i,1}, \ldots, e_{i,k}$ to obtain a prediction $\hat{e}_{i,k+t}$ of the embedding of the shot $t$ steps in the future. Then, $\ell_V$ should quantify the quality of the prediction, which we accomplish using the InfoNCE loss
$$\mathcal{L}_{\text{NCE}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{h(e_{i,k+t},\, \hat{e}_{i,k+t})}}{\sum_{j=1}^{N} e^{h(e_{j,k+t},\, \hat{e}_{i,k+t})}}, \quad (4)$$
where the scoring function $h$ is trained to assign high scores to pairs of shot embeddings from the same video, and low values to embeddings computed from different videos.² (²In practice, we use all shot embeddings from the other videos, not only those at time step $k+t$, which is known to improve performance.) Note that the terms in (4) can, up to an additive constant, be seen as the cross-entropy loss of an $N$-class classification problem where the correct label is $i$, so that we could reformulate the loss in the form (3) using class labels $y^V_i = i$.
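The InfoNCE objective can be sketched as follows. The plain dot-product critic and the toy orthogonal embeddings are our simplification; in the paper the scoring function is itself trained:

```python
import numpy as np

def info_nce(pred, targets):
    """InfoNCE over a batch of N videos: pred[i] is the predicted future
    shot embedding of video i, targets[j] the true one of video j;
    dot-product scores, with video i as the positive for row i."""
    logits = pred @ targets.T                    # (N, N) score matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))        # N-class cross-entropy

# Orthogonal toy embeddings: a confident, correct predictor drives the
# loss toward 0, while an uninformative one sits at chance level log(N).
targets = np.eye(4, 8)
confident = info_nce(5.0 * targets, targets)   # log(1 + 3*exp(-5)) ~ 0.02
chance = info_nce(np.zeros((4, 8)), targets)   # exactly log(4) ~ 1.386
```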
4 Experimental setup
Our experiments encompass two training phases, which we refer to as upstream and downstream. First, in the upstream phase, we train our models on video (and image) data using the methods proposed in the previous section. Then, we fine-tune those trained models on a set of downstream problems in the second phase. We focus on the challenging scenario in which the downstream data is limited, and use only 1000 examples for each downstream data set. To understand the limits of the proposed approaches, we also experimented with using the full downstream data sets; we provide these results in the supplementary material, as our main focus is the low-data regime.
We train on the videos in the yt8m data set, which consists of millions of YouTube video IDs spanning over 3800 visual entities. We downloaded approximately M of these videos sampled at Hz and split them into a training set of M and a testing set of M videos. We further split the videos into shots using a simple strategy based on color histograms, similar to prior work. We also present results for several baseline approaches applied to a data set obtained by selecting a single random frame from each video, which we refer to as yt8m frames.
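The text does not detail the histogram-based splitting; the following is a minimal sketch of one such strategy, where the number of bins and the boundary threshold are our own arbitrary choices:

```python
import numpy as np

def shot_boundaries(frames, n_bins=16, threshold=0.5):
    """Mark a shot boundary wherever the L1 distance between the
    normalized RGB color histograms of consecutive frames is large.
    `frames` has shape (T, H, W, 3) with values in [0, 1]."""
    hists = []
    for frame in frames:
        h = [np.histogram(frame[..., c], bins=n_bins, range=(0, 1))[0]
             for c in range(3)]
        h = np.concatenate(h).astype(float)
        hists.append(h / h.sum())
    hists = np.array(hists)
    dists = np.abs(hists[1:] - hists[:-1]).sum(axis=1)
    return [t + 1 for t, d in enumerate(dists) if d > threshold]

# Two dark frames followed by two bright frames -> one boundary at t=2.
dark = np.full((2, 8, 8, 3), 0.1)
bright = np.full((2, 8, 8, 3), 0.9)
frames = np.concatenate([dark, bright])
```

A production pipeline would additionally smooth the distance signal and merge very short shots, but the thresholding idea is the same.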
Furthermore, in the co-training experiments we also use (a class-balanced fraction of) the ImageNet (ILSVRC-2012) training set , which contains 1.2M images classified into 1000 categories.
To evaluate the learned representations, we use the data sets and follow the protocol of the vtab . This protocol consists of 19 data sets categorized into three groups as follows (details and references in the appendix).
Natural — Seven classical image classification problems on natural images (Caltech101, CIFAR-100, DTD, Flowers102, Pets, Sun397, and SVHN).
Specialized — Image classification on data captured using specialist equipment, from the remote-sensing (Resisc45, EuroSAT) and the medical (Patch Camelyon, Diabetic Retinopathy) domains.
Structured — Eight tasks to predict properties of the objects appearing in an image (how many there are, their relative position and distance), on both rendered (Clevr, dSprites, SmallNORB, DMLAB) and real (KITTI) data.
For each of these 19 data sets and each model that we propose, we launch a sweep over 4 hyper-parameters (learning rates and schedules, as in the vtab lightweight mode). Then, we choose the models that had the best validation accuracy when averaged over these 19 tasks. These best-performing models were then re-trained for each data set on 1000 random points from the union of the train and validation sets and evaluated on the test set. To account for the randomness coming from the initialization of the fresh classification head and the order in which the data appears, we repeated this evaluation scheme three times and report the median test set accuracy, following the vtab protocol.
Architectures and training details
The frame encoder is modeled using the ResNet-50 v2 architecture with BatchNorm. We also investigated the effect of model capacity by widening the network by a factor of three. To avoid a mismatch in batch statistics between the two data sources, in the co-training experiments we replace BatchNorm with GroupNorm and also standardize the weights of the convolutions. We construct mini-batches by sampling either 2 or 4 consecutive shots from each video (dropping videos with fewer shots), and randomly select 8 consecutive frames for exemplar-based shot-level ssl and 4 consecutive frames for rotation-based frame-level ssl. For the future-shot-prediction loss, when we sample 2 shots, we predict the embedding of one from the embedding of the other using a mlp, i.e., the function $h$ in (4) has the form $h(e, e') = \phi_1(e)^\top \phi_2(e')$, where $\phi_1, \phi_2$ are mlp with a single hidden layer with 256 units. In the experiments with 4 shots, we use a lstm prediction function with 256 hidden units to predict every shot embedding from the previous ones. We use temporal order prediction only together with exemplar-based ssl and for data with 2 shots per video, relying on a single-hidden-layer mlp with 512 hidden units as the prediction function. Throughout, we rely on (parameter-free) average pooling for the pooling function. For both frame- and shot-level ssl approaches we use the same augmentation mechanism as the exemplar baseline. For models co-trained with a supervised loss based on a fraction of ImageNet, we additionally use the same HSV-space color randomization as prior work.
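The mini-batch construction described above (consecutive shots, consecutive frames, dropping videos with too few shots) can be sketched as follows; the function and its interface are our illustration:

```python
import numpy as np

def sample_clips(video_shots, n_shots=4, n_frames=8, rng=None):
    """Sample `n_shots` consecutive shots from a video and, from each,
    `n_frames` consecutive frames. Videos with too few shots (or shots
    with too few frames) are dropped by returning None."""
    rng = rng or np.random.default_rng()
    if len(video_shots) < n_shots:
        return None
    s0 = int(rng.integers(0, len(video_shots) - n_shots + 1))
    clips = []
    for shot in video_shots[s0:s0 + n_shots]:
        if len(shot) < n_frames:
            return None
        f0 = int(rng.integers(0, len(shot) - n_frames + 1))
        clips.append(shot[f0:f0 + n_frames])
    return np.stack(clips)  # shape: (n_shots, n_frames, ...)

# Toy video: 5 shots of 10 "frames", where frame values encode the shot.
video = [np.arange(10) + 100 * s for s in range(5)]
clips = sample_clips(video, rng=np.random.default_rng(0))
```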
We also perform experiments where we replace this augmentation mechanism with aa, an augmentation policy learned with a reinforcement learning algorithm from the full ImageNet data set. While aa can cause label leakage when applied to unsupervised methods, we investigate it to understand how these automatically learned invariances compare to those induced by shot-based augmentation, which are label-free.
In all cases we choose the batch size such that the product of the number of videos and the number of shots per batch is 2048. We train all unsupervised models for 120k iterations, using sgd with a learning rate of 0.8 and momentum 0.9, multiplying the learning rate by 0.1 after 90k and 110k iterations. The co-trained models are trained for 100k iterations, and the schedule as well as the batch size is chosen depending on the amount of labeled data used. For the weight of the video-level loss (and the weight of the supervised loss for co-trained models) we sweep over at most four different values. A complete description of all hyper-parameters and architectures can be found in the appendix.
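The step schedule for the unsupervised models can be written as a small helper; treating the decay as taking effect at the stated iteration is our reading of "after 90k and 110k iterations":

```python
def learning_rate(step, base_lr=0.8, decay_steps=(90_000, 110_000), factor=0.1):
    """Piecewise-constant schedule: start at `base_lr` and multiply
    by `factor` at each boundary in `decay_steps`."""
    lr = base_lr
    for boundary in decay_steps:
        if step >= boundary:
            lr *= factor
    return lr
```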
We train rotation and exemplar baseline models on ImageNet and on a data set obtained by sampling one frame from each video in our training set (yt8m frames). We use the same training protocol as the respective original methods, except that we increase the batch size to 2048 and stretch the schedule to 120k iterations to be comparable to our methods. Furthermore, for the exemplar-based model we ablate the video-level prediction task, which amounts to treating the shots independently and only using the frames from the same shot as exemplars. In addition, we consider three baselines from prior work: a vanilla ResNet-50 v2 pretrained on ImageNet (achieving top-1/top-5 accuracy of %/% on the ImageNet validation set), the exemplar model trained on ImageNet with 10% class-balanced labeled data (Semi-Ex-10%), which achieves state-of-the-art semi-supervised accuracy on ImageNet, and the rotation model trained on ImageNet with all labels (Sup-Rot-100%).
We further compare against three prior works that learn image representations from video data: the ms and mt-ssl models, and the ti model. ms learns representations based on a foreground-background segmentation pretext task, where the segmentation maps are derived using an off-the-shelf offline video segmentation algorithm. mt-ssl combines ms and three other self-supervision objectives to train a multi-task network; its representation derives from a combination of colorization, spatial context, and motion segmentation cues. The ms and mt-ssl models fine-tuned in this evaluation have a ResNet-101 architecture up to block3. ti builds a graph combining intra-instance and inter-instance edges and exploits transitivity to learn invariant representations; the intra-instance edges are obtained by tracking patches in videos. We fine-tune their publicly available pre-trained VGG-16 checkpoint. We refer the reader to the supplementary material for implementation details regarding the evaluation of these baselines.
5 Results

In this section we focus on the low sample-size regime, i.e., when each downstream data set consists of 1000 samples, and discuss the performance on the full data sets in the supplementary material (Table 4). In brief, the ranking of the methods according to the vtab mean score using all examples is similar to the ranking according to the vtab 1000 example mean score. Further, here we only present the best configuration (w.r.t. the number of shots and choice of prediction function) for each of our Video-Induced Visual Invariance (VIVI) learning approaches, and defer the results for other configurations to the supplementary material (Table 4).
5.1 Self-supervised learning
Fig. 2 shows the results for our models and the exemplar-based baselines. The baseline trained on yt8m frames only (Ex-YT-F), without leveraging any temporal information, achieves a mean vtab 1000 example score of %. Exploiting the temporal variations within shots to create exemplars (Ex-YT-S) increases that score by about points. Adding the video-level prediction loss on top adds another points. It hence appears that leveraging both shot- and video-level invariances using our approach leads to significant gains over just using frames. In addition, increasing the model capacity (using a wider model) leads to another increase by points. Note that this model is only points behind the semi-supervised Semi-Ex-10% model, which uses 128k labeled images from ImageNet for training (cf. Table 1). The gains mostly come from improvements on the natural and structured data sets, whereas video-level losses do not notably improve the score on the specialized data sets (see Fig. 2). We observed the largest gains when using the future-prediction loss with shots, and more modest improvements for the future-prediction loss and for temporal order prediction with shots (see Table 4 in the supplementary material).
Similarly to the exemplar experiments, we observe gains of points in the mean vtab 1000 example score over the frame-based baseline (Rot-YT-F) when using a video-level prediction task (VIVI-Rot(4); see Table 2). The gains are smaller for than for shots when combined with the future-prediction loss, and temporal order prediction was not effective when combined with rotation prediction as the frame-level loss for either number of shots. We emphasize that the frame encoder trained via rotation ssl on yt8m frames performs considerably worse than the same model trained on ImageNet. This is not surprising, as ImageNet images are carefully cropped and the data has a balanced class distribution. By contrast, frames sampled from yt8m are less balanced in terms of content and arguably provide many shortcuts for the rotation task, such as black borders, overlaid logos, or frames with text on a uniform background; moreover, many frames lack any orientation cues.
Effect of AutoAugment (AA)
Table 2 shows the effect of using aa instead of the standard augmentation mechanism. The effect is strongest on the frame-based baselines, increasing the vtab 1000 example score by at least 2 points, and weakest on models involving shot- and video-level losses, where the increase is between and points. Hence, the invariances induced by aa are, to some degree, complementary to the proposed shot- and video-level losses. However, note that aa is trained on labeled ImageNet images, which might introduce label leakage. Hence, methods relying on aa should not be considered fully unsupervised.
Comparison with related work
Fig. 3 presents a summary of the comparison with baselines. We omit ms and ti as they obtain a vtab 1000 example mean score comparable to relative patch location prediction and jigsaw ssl trained on ImageNet. These two methods have a significantly lower vtab 1000 example score than the mt-ssl model, as well as the rotation and exemplar ssl baselines (see Table 4 in the supplementary material). Our VIVI models clearly outperform both the ImageNet baseline and the mt-ssl model. The score obtained by mt-ssl is comparable to that obtained by rotation-based ssl trained on ImageNet, which in turn scores points higher than exemplar-based ssl. Both our models and mt-ssl significantly outperform rotation- and exemplar-based ssl on the structured data sets, whereas the ImageNet-based exemplar baseline obtains the highest mean score on the specialized data sets.
5.2 Co-training with ImageNet
In Table 1 we compare the scores obtained by our exemplar-based co-training models with the baselines described above. Our model with frame/shot-level and video-level losses and a wider architecture (VIVI-Ex(4)-Big) reduces the gap between the exemplar model trained on ImageNet and the strong Semi-Ex-10% semi-supervised baseline by more than a factor of 2. Moreover, our model co-trained with 10% labeled ImageNet examples (class-balanced; no additional unlabeled ImageNet examples are used) outperforms both the Semi-Ex-10% baseline and the ImageNet-pretrained ResNet-50 on the vtab 1000 example mean score. Using the entire labeled ImageNet training set for co-training yields an increase of points. Finally, scaling up the architecture and applying aa to preprocess the ImageNet data adds points, leading to a clear new state of the art on the vtab benchmark. The largest gains from using (a subset of) ImageNet can generally be observed on the natural data sets, whereas the gains on the specialized and structured data sets are significantly lower. This result is not surprising, given that many data sets in the natural category are semantically similar to ImageNet. Fig. 4 shows the per-data-set increase/decrease in the vtab 1000 example score when adding a classification loss computed on the entire ImageNet data set to VIVI-Ex(4).
Robustness to video perturbations
Our co-trained models are trained to both recognize 1000 ImageNet categories and be invariant to deformations found in video data. We therefore expect model predictions to be stable across neighboring frames in a video. To measure whether this is indeed the case, we evaluate our VIVI-Ex(4)-Co(100%) model on the ImageNet-Vid-Robust benchmark. This benchmark measures the drop in accuracy under a stricter definition of the 0-1 loss, using videos from the ImageNet-Vid data set. Given a set of frames, the prediction on an "anchor" frame is considered correct only if all neighboring frames are predicted correctly. Intuitively, the drop in performance going from standard top-1 accuracy on anchor frames to this stricter loss function is indicative of a lack of model robustness: the lower the drop, the more robust the model. In Table 3 we observe that our co-trained model is slightly more robust than its purely supervised counterpart, although the results are still within error bars. This is similar to the difference in performance drop observed for fine-tuning on ImageNet-Vid, as reported in the benchmark paper itself [51, Table 1]. These initial results suggest that our co-training approach leads to a similar effect as fine-tuning, despite the domain shift between yt8m and ImageNet-Vid. It seems that robustness to natural perturbations in videos is extremely challenging and worth investigating in the future.
|Model Type|Accuracy Original|Accuracy Perturbed|Δ|
|ImageNet|68.0 [65.2, 70.7]|49.9 [46.9, 52.9]|18.1|
|VIVI-Ex(4)-Co(100%)|62.2 [59.3, 65.1]|46.3 [43.3, 49.2]|15.9|

Table 3: Accuracy on original and perturbed frames; Δ is the absolute difference between these two. On this benchmark, a lower difference is better. Values in brackets are Clopper-Pearson confidence intervals.
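The stricter 0-1 metric can be sketched as follows. This is our simplified version; the actual benchmark fixes a specific neighborhood size and includes a human review step:

```python
def strict_accuracy(pred_labels, true_label, anchors, k=1):
    """ImageNet-Vid-Robust-style stricter 0-1 metric (our sketch): an
    anchor frame counts as correct only if the anchor and all frames
    within `k` steps of it are classified correctly."""
    correct = 0
    for a in anchors:
        window = pred_labels[max(0, a - k): a + k + 1]
        if all(p == true_label for p in window):
            correct += 1
    return correct / len(anchors)

# Toy example: true label 7, one flickering misprediction at frame 3.
preds = [7, 7, 7, 2, 7, 7]
standard = strict_accuracy(preds, 7, anchors=[1, 4], k=0)  # anchors only: 1.0
strict = strict_accuracy(preds, 7, anchors=[1, 4], k=1)    # neighbors too: 0.5
```

The gap between `standard` and `strict` is exactly the accuracy drop reported in Table 3.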
6 Conclusion

We propose and evaluate a versatile framework for learning transferable, data-efficient image representations by exploiting video-induced visual invariances at different levels of granularity. The framework can be instantiated with any image-based ssl loss at the frame/shot level and arbitrary sequence prediction proxy tasks at the video level. Our experiments reveal that purely self-supervised models benefit greatly from exploiting video-induced invariances, outperforming the ssl baselines trained on ImageNet by a large margin, in particular on problems that require predicting the structural properties of the data. Moreover, when augmenting the proposed framework with a supervised classification loss, the resulting models outperform a vanilla ImageNet-pretrained model using fewer labeled examples, and set a new state of the art on the vtab benchmark when co-trained with the full ImageNet data set.
Future research could target better understanding of how the choice of losses and data sets used for upstream training impacts the performance on different tasks in downstream evaluation. While we found our co-trained models to be somewhat more robust to natural perturbations induced by videos than models trained only on images, further research is needed on learning models that overcome robustness-issues related to perturbations induced by videos.
We would like to thank Xiaohua Zhai for inspiring discussions, in particular on how to learn from video shots, and for contributions to preliminary experiments that led to this paper. Further, we would like to thank Raphael Marinier for help with preparing the yt8m data set. Finally, we are grateful to Lucas Beyer for his implementation of GroupNorm with weight standardization.
-  Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. Youtube-8m: A large-scale video classification benchmark. arXiv:1609.08675, 2016.
-  Pulkit Agrawal, Joao Carreira, and Jitendra Malik. Learning to see by moving. In Proc. ICCV, pages 37–45, 2015.
-  Relja Arandjelovic and Andrew Zisserman. Look, listen and learn. In Proc. ICCV, pages 609–617, 2017.
-  Philip Bachman, R Devon Hjelm, and William Buchwalter. Learning representations by maximizing mutual information across views. In NeurIPS, 2019.
-  Uta Buchler, Biagio Brattoli, and Bjorn Ommer. Improving spatiotemporal self-supervision by deep reinforcement learning. In Proc. ECCV, pages 770–786, 2018.
-  Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In Proc. ECCV, 2018.
-  Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment: Learning augmentation policies from data. In Proc. CVPR, 2019.
-  Virginia R de Sa. Learning classification with unlabeled data. In NeurIPS, pages 112–119, 1994.
-  Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proc. CVPR, pages 248–255. IEEE, 2009.
-  Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. NAACL-HLT, 2018.
-  Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In Proc. ICCV, pages 1422–1430, 2015.
-  Carl Doersch and Andrew Zisserman. Multi-task self-supervised visual learning. In ICCV, 2017.
-  Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. In Proc. ICLR, 2017.
-  Alexey Dosovitskiy, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. Discriminative unsupervised feature learning with convolutional neural networks. In NeurIPS, pages 766–774, 2014.
-  Basura Fernando, Hakan Bilen, Efstratios Gavves, and Stephen Gould. Self-supervised video representation learning with odd-one-out networks. In Proc. CVPR, 2017.
-  Ruohan Gao, Dinesh Jayaraman, and Kristen Grauman. Object-centric representation learning from unlabeled videos. In Proc. ACCV, pages 248–263. Springer, 2016.
-  Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. Proc. ICLR, 2018.
-  Priya Goyal, Dhruv Mahajan, Abhinav Gupta, and Ishan Misra. Scaling and benchmarking self-supervised visual representation learning. arXiv:1905.01235, 2019.
-  Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proc. AISTATS, 2010.
-  Tengda Han, Weidi Xie, and Andrew Zisserman. Video representation learning by dense predictive coding. In Proc. ICCV Workshops, 2019.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proc. CVPR, 2016.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In Proc. ECCV. Springer, 2016.
-  Olivier J Hénaff, Ali Razavi, Carl Doersch, SM Eslami, and Aaron van den Oord. Data-efficient image recognition with contrastive predictive coding. arXiv:1905.09272, 2019.
-  R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. Proc. ICLR, 2019.
-  Minyoung Huh, Pulkit Agrawal, and Alexei A Efros. What makes imagenet good for transfer learning? arXiv:1608.08614, 2016.
-  Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. Proc. ICML, 2015.
-  Phillip Isola, Daniel Zoran, Dilip Krishnan, and Edward H Adelson. Learning visual groups from co-occurrences in space and time. arXiv:1511.06811, 2015.
-  Dinesh Jayaraman, Frederik Ebert, Alexei A Efros, and Sergey Levine. Time-agnostic prediction: Predicting predictable video frames. Proc. ICLR, 2019.
-  Dinesh Jayaraman and Kristen Grauman. Slow and steady feature analysis: higher order temporal coherence in video. In Proc. CVPR, pages 3852–3861, 2016.
-  Dinesh Jayaraman and Kristen Grauman. Learning image representations tied to egomotion from unlabeled video. IJCV, 125(1):136–161, Dec 2017.
-  Diederik P. Kingma, Danilo Jimenez Rezende, Shakir Mohamed, and Max Welling. Semi-supervised learning with deep generative models. arXiv:1406.5298, 2014.
-  Alexander Kolesnikov, Xiaohua Zhai, and Lucas Beyer. Revisiting self-supervised visual representation learning. In Proc. CVPR, pages 1920–1929, 2019.
-  Simon Kornblith, Jonathon Shlens, and Quoc V Le. Do better imagenet models transfer better? CVPR, 2019.
-  Hsin-Ying Lee, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. Unsupervised representation learning by sorting sequences. In Proc. ICCV, pages 667–676, 2017.
-  Aravindh Mahendran, James Thewlis, and Andrea Vedaldi. Cross pixel optical-flow similarity for self-supervised learning. In Proc. ACCV, pages 99–116. Springer, 2018.
-  Jordi Mas and Gabriel Fernandez. Video shot boundary detection based on color histogram. Notebook Papers TRECVID2003, NIST, 15, 2003.
-  Ishan Misra, C Lawrence Zitnick, and Martial Hebert. Shuffle and learn: unsupervised learning using temporal order verification. In Proc. ECCV. Springer, 2016.
-  Hossein Mobahi, Ronan Collobert, and Jason Weston. Deep learning from temporal coherence in video. In Proc. ICML, pages 737–744. ACM, 2009.
-  Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y Ng. Multimodal deep learning. In Proc. ICML, pages 689–696, 2011.
-  Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In Proc. ECCV, pages 69–84, 2016.
-  Mehdi Noroozi, Hamed Pirsiavash, and Paolo Favaro. Representation learning by learning to count. In Proc. ICCV, 2017.
-  Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv:1807.03748, 2018.
-  Andrew Owens, Jiajun Wu, Josh H McDermott, William T Freeman, and Antonio Torralba. Ambient sound provides supervision for visual learning. In Proc. ECCV, pages 801–816. Springer, 2016.
-  Deepak Pathak, Ross Girshick, Piotr Dollár, Trevor Darrell, and Bharath Hariharan. Learning features by watching objects move. In Proc. CVPR, pages 2701–2710, 2017.
-  Siyuan Qiao, Huiyu Wang, Chenxi Liu, Wei Shen, and Alan Yuille. Weight standardization. arXiv:1903.10520, 2019.
-  Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. Learning multiple visual domains with residual adapters. In Advances in Neural Information Processing Systems, pages 506–516, 2017.
-  Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
-  Nawid Sayed, Biagio Brattoli, and Björn Ommer. Cross and learn: Cross-modal self-supervision. In German Conference on Pattern Recognition, pages 228–243. Springer, 2018.
-  Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proc. CVPR, 2015.
-  Pierre Sermanet, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, Sergey Levine, and Google Brain. Time-contrastive networks: Self-supervised learning from video. In ICRA, 2018.
-  Vaishaal Shankar, Achal Dave, Rebecca Roelofs, Deva Ramanan, Benjamin Recht, and Ludwig Schmidt. A systematic framework for natural perturbations from videos. arXiv:1906.02168, 2019.
-  Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.
-  Nitish Srivastava, Elman Mansimov, and Ruslan Salakhudinov. Unsupervised learning of video representations using lstms. In Proc. ICML, pages 843–852, 2015.
-  Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. Videobert: A joint model for video and language representation learning. arXiv:1904.01766, 2019.
-  Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proc. CVPR, 2015.
-  Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. arXiv:1906.05849, 2019.
-  Eleni Triantafillou, Tyler Zhu, Vincent Dumoulin, Pascal Lamblin, Kelvin Xu, Ross Goroshin, Carles Gelada, Kevin Swersky, Pierre-Antoine Manzagol, and Hugo Larochelle. Meta-dataset: A dataset of datasets for learning to learn from few examples. arXiv:1903.03096, 2019.
-  Xiaolong Wang and Abhinav Gupta. Unsupervised learning of visual representations using videos. In Proc. ICCV, 2015.
-  Xiaolong Wang, Kaiming He, and Abhinav Gupta. Transitive invariance for self-supervised visual representation learning. In Proc. ICCV, pages 1329–1338, 2017.
-  Donglai Wei, Joseph J Lim, Andrew Zisserman, and William T Freeman. Learning and using the arrow of time. In Proc. CVPR, pages 8052–8060, 2018.
-  Laurenz Wiskott and Terrence J Sejnowski. Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4):715–770, 2002.
-  Yuxin Wu and Kaiming He. Group normalization. In Proc. ECCV, pages 3–19, 2018.
-  Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In Proc. CVPR, pages 3733–3742, 2018.
-  Xiaohua Zhai, Avital Oliver, Alexander Kolesnikov, and Lucas Beyer. S4L: Self-supervised semi-supervised learning. In Proc. ICCV, 2019.
-  Xiaohua Zhai, Joan Puigcerver, Alexander Kolesnikov, Pierre Ruyssen, Carlos Riquelme, Mario Lucic, Josip Djolonga, Andre Susano Pinto, Maxim Neumann, Alexey Dosovitskiy, et al. The Visual Task Adaptation Benchmark. arXiv:1910.04867, 2019.
-  Xiaohang Zhan, Xingang Pan, Ziwei Liu, Dahua Lin, and Chen Change Loy. Self-supervised learning via conditional motion propagation. arXiv:1903.11412, 2019.
-  Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In Proc. ECCV, pages 649–666, 2016.
-  Chengxu Zhuang, Alex Lin Zhai, and Daniel Yamins. Local aggregation for unsupervised learning of visual embeddings. In Proc. ICCV, pages 6002–6012, 2019.
-  Will Y Zou, Andrew Y Ng, and Kai Yu. Unsupervised learning of visual invariance with temporal coherence. In NIPS 2011 Workshop on Deep Learning and Unsupervised Feature Learning, volume 3, 2011.
Appendix A Architectures
Here we expand on the short description in Section 4. The frame encoder is modeled using the ResNet-50 v2 architecture with BatchNorm. In several experiments we also investigate the effect of model capacity by widening the network by a factor of three. To avoid a mismatch in batch statistics between the two data sources, in the co-training experiments we replace BatchNorm with GroupNorm and additionally standardize the weights of the convolutions.
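As a rough illustration of these two normalization components, the following NumPy sketch standardizes a convolution kernel per output channel and applies GroupNorm to an NHWC activation tensor. This is a minimal sketch, not the implementation used in the paper; the learned scale and shift parameters of GroupNorm are omitted for brevity.

```python
import numpy as np

def standardize_weights(w, eps=1e-5):
    """Weight standardization: zero mean and unit variance per output channel.

    w has shape (kh, kw, c_in, c_out); statistics are computed over all axes
    except the output-channel axis.
    """
    mean = w.mean(axis=(0, 1, 2), keepdims=True)
    var = w.var(axis=(0, 1, 2), keepdims=True)
    return (w - mean) / np.sqrt(var + eps)

def group_norm(x, num_groups=32, eps=1e-5):
    """GroupNorm over an NHWC activation tensor (no learned scale/shift here).

    Channels are split into num_groups groups, and each (sample, group) pair
    is normalized by its own mean and variance, independent of the batch.
    """
    n, h, w, c = x.shape
    x = x.reshape(n, h, w, num_groups, c // num_groups)
    mean = x.mean(axis=(1, 2, 4), keepdims=True)
    var = x.var(axis=(1, 2, 4), keepdims=True)
    x = (x - mean) / np.sqrt(var + eps)
    return x.reshape(n, h, w, c)
```

Because neither operation depends on batch statistics, the normalization behaves identically for the image and video data streams, which is exactly the mismatch the co-training setup needs to avoid.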
For each prediction task, we attach a different linear head to the 2048-dimensional pre-logits ResNet representation before applying the respective loss or prediction function. For exemplar SSL, following prior work, we use a linear head with 1000 outputs and L2-normalize the features before feeding them into the triplet loss. For rotation prediction we rely on a linear head with 4 outputs. For the video-level losses (prediction across shots and temporal order prediction) we project the pre-logits, average-pooled across the frames of the same shot, to 512 dimensions using a linear head, and feed this representation to the prediction functions. Finally, in the experiments with co-training, we rely on an additional linear classification head with 1000 outputs.
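The head layout described above can be sketched as follows. The output dimensions follow the text; the initialization, function names, and bias-free linear heads are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
PRELOGITS_DIM = 2048

# One linear head per task; output sizes follow the description above.
heads = {
    "exemplar": rng.normal(scale=0.01, size=(PRELOGITS_DIM, 1000)),  # triplet embedding
    "rotation": rng.normal(scale=0.01, size=(PRELOGITS_DIM, 4)),     # 4 rotation classes
    "video":    rng.normal(scale=0.01, size=(PRELOGITS_DIM, 512)),   # shot embedding
    "class":    rng.normal(scale=0.01, size=(PRELOGITS_DIM, 1000)),  # supervised co-training
}

def exemplar_embedding(prelogits):
    """Project pre-logits and L2-normalize before the triplet loss."""
    z = prelogits @ heads["exemplar"]
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def shot_embedding(frame_prelogits):
    """Average-pool frame features within a shot, then project to 512 dims."""
    pooled = frame_prelogits.mean(axis=0)  # (frames, 2048) -> (2048,)
    return pooled @ heads["video"]
```

All heads read from the same shared 2048-dimensional representation, so only the encoder, not the heads, is reused for downstream transfer.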
For the shot-prediction loss, when we sample 2 shots, we predict one from the other using MLPs, i.e., the prediction function in (4) is composed of MLPs with a single hidden layer of 256 units and 128 outputs. In the experiments with 4 shots, we use a 2-layer LSTM prediction function with 256 hidden units to predict every shot embedding from the previous ones. To match the dimension of the LSTM output (256) and that of the future shot embeddings (512) we employ an additional linear layer. We use temporal order prediction only together with exemplar-based SSL and for data with 2 shots per video, relying on an MLP with a single hidden layer of 512 units as the prediction function.
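A minimal sketch of the two-shot case follows. Since the exact functional form of the prediction function is not spelled out in this excerpt, the sketch assumes that the two 512-dimensional shot embeddings are projected by separate single-hidden-layer MLPs (256 hidden units, 128 outputs, as stated above) and compared via a dot product; that pairing is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(d_in, d_hidden, d_out):
    """Parameters of a single-hidden-layer MLP."""
    return {
        "w1": rng.normal(scale=0.05, size=(d_in, d_hidden)),
        "b1": np.zeros(d_hidden),
        "w2": rng.normal(scale=0.05, size=(d_hidden, d_out)),
        "b2": np.zeros(d_out),
    }

def mlp(params, x):
    """ReLU MLP: d_in -> d_hidden -> d_out."""
    h = np.maximum(x @ params["w1"] + params["b1"], 0.0)
    return h @ params["w2"] + params["b2"]

# Two-shot case: each 512-d shot embedding is projected into a shared 128-d
# space by its own MLP (256 hidden units), and a scalar prediction score is
# taken to be the dot product of the two projections.
proj_a = init_mlp(512, 256, 128)
proj_b = init_mlp(512, 256, 128)

shot_a = rng.normal(size=(512,))
shot_b = rng.normal(size=(512,))
score = float(mlp(proj_a, shot_a) @ mlp(proj_b, shot_b))
```

In the four-shot case the dot-product pairing above would be replaced by the 2-layer LSTM that autoregressively predicts each shot embedding from its predecessors.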
For both frame- and shot-level SSL approaches we use the standard data augmentation mechanism from prior work. For models co-trained with a supervised loss based on a fraction of ImageNet, we additionally use the same HSV-space color randomization as in prior work. We also perform experiments in which we replace this augmentation mechanism with AutoAugment (AA), an augmentation policy learned with a reinforcement learning algorithm from the full ImageNet data set. More specifically, we rely on the TF-Hub module publicly available at https://tfhub.dev/google/image_augmentation/nas_imagenet/1.
Appendix B Training details
Table 5 provides details about the schedules, batch sizes, loss weights, etc. used for the individual methods. When exploring the effect of AA we reduce the weight of the video-level loss by a factor of 2. The schedule for VIVI-Ex(4)-Co(10%) is motivated as follows. We take the schedule and batch size used in prior work for the ImageNet exemplar co-training experiments with 10% labeled ImageNet examples, stretch the schedule to 100k iterations, and reduce the batch size (as well as the learning rate) so that the number of epochs over the 10% (128k-example) data set matches that of the original setup.
We set the margin parameter of the semi-hard triplet loss to 0.5. For rotation-based SSL, following common practice [17, 32], we append 3 rotated copies of the mini-batch along the batch dimension and compute the rotation loss over all rotated copies.
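The rotation mini-batch construction can be sketched as follows (a NumPy sketch assuming square NHWC images; `make_rotation_batch` is an illustrative name, not from the paper's codebase):

```python
import numpy as np

def make_rotation_batch(images):
    """Append 3 rotated copies of a mini-batch along the batch dimension.

    images: array of shape (n, h, w, c) with h == w. Returns a (4n, h, w, c)
    batch together with rotation labels (0, 1, 2, 3 for 0, 90, 180, and 270
    degrees) that serve as the targets for the 4-way rotation head.
    """
    n = images.shape[0]
    rotated = [np.rot90(images, k=k, axes=(1, 2)) for k in range(4)]
    return np.concatenate(rotated, axis=0), np.repeat(np.arange(4), n)
```

The rotation loss is then an ordinary 4-class cross-entropy over the enlarged batch, so every image contributes one example per rotation.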
We train all models on 128 cores of a Google TPU v3 Pod. For exemplar SSL the triplet loss is computed per core. For all frame/shot-level loss variants, the video-level prediction loss is computed across all cores when prediction is across 4 shots, and per core when prediction is across 2 shots, as computing the loss across all cores led to instabilities in that case.
Appendix C Baseline fine-tuning details
As mentioned in the main manuscript, we compared against two baseline methods: MT-SSL (Multi-Task Self-Supervised Learning) and TI (Transitive Invariance). For MT-SSL we considered two variants: MS, which was pre-trained on motion segmentation only, and MT-SSL, which combined MS with three other tasks in a multi-task setting. We obtained pre-trained checkpoints for all three models (MS, MT-SSL, and TI) from the authors of the respective prior works.
C.1 Fine-tuning motion segmentation and multi-task SSL baselines
MS and MT-SSL pre-trained a ResNet-101 up to block3. The representation at block3 is too large to feed directly into a linear predictor. The original authors therefore used max-pooling to down-sample this representation and then trained a linear predictor for ImageNet classification. We experimented with this approach for VTAB evaluation. The default VTAB evaluation protocol sweeps over two initial learning rates; these were too high for the MS and MT-SSL models, and fine-tuning diverged for several downstream evaluation tasks. We therefore minimally modified the evaluation sweep to use lower initial learning rates. We also evaluated a simpler alternative: global average pooling of the block3 representation into a single feature vector. We found that global average pooling achieved the best results on the VTAB validation set. It also did not diverge at higher learning rates, so we could use the default learning rate schedule in this case. We therefore used this setting for the final evaluation on test data.
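The two pooling alternatives can be sketched as follows. This is an illustrative NumPy sketch; the 14x14x1024 block3 feature-map shape and the 3x3 max-pooling grid used in the test are assumptions for illustration, not values taken from the paper.

```python
import numpy as np

def global_avg_pool(feat):
    """Average a (h, w, c) feature map into a c-dimensional vector."""
    return feat.mean(axis=(0, 1))

def max_pool_downsample(feat, out_hw=3):
    """Max-pool a (h, w, c) feature map to an out_hw x out_hw grid, flattened.

    The map is cropped so that both spatial dimensions divide evenly into
    out_hw blocks, then each block is reduced by its maximum.
    """
    h, w, c = feat.shape
    hs, ws = h // out_hw, w // out_hw
    cropped = feat[: hs * out_hw, : ws * out_hw]
    grid = cropped.reshape(out_hw, hs, out_hw, ws, c).max(axis=(1, 3))
    return grid.reshape(-1)
```

Global average pooling yields a much smaller feature vector (c dimensions instead of out_hw * out_hw * c), which is consistent with it being less prone to divergence at higher learning rates.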
C.2 Fine-tuning the transitive invariance baseline
at a fixed stride from the input image and then average the representations across all of these. We found that (c) was computationally extremely expensive. Option (b) performed best, and we report results for that approach on the VTAB test set.