Self-Supervised Learning of Video-Induced Visual Invariances

by Michael Tschannen, et al.

We propose a general framework for self-supervised learning of transferable visual representations based on video-induced visual invariances (VIVI). We consider the implicit hierarchy present in the videos and make use of (i) frame-level invariances (e.g. stability to color and contrast perturbations), (ii) shot/clip-level invariances (e.g. robustness to changes in object orientation and lighting conditions), and (iii) video-level invariances (semantic relationships of scenes across shots/clips), to define a holistic self-supervised loss. Training models using different variants of the proposed framework on videos from the YouTube-8M data set, we obtain state-of-the-art self-supervised transfer learning results on the 19 diverse downstream tasks of the Visual Task Adaptation Benchmark (VTAB), using only 1000 labels per task. We then show how to co-train our models jointly with labeled images, outperforming an ImageNet-pretrained ResNet-50 by 0.8 points with 10x fewer labeled images, as well as the previous best supervised model by 3.7 points using the full ImageNet data set.






1 Introduction

Supervised deep learning necessitates the collection and manual annotation of large amounts of data, which is often expensive, hard to scale, and may require domain expertise (e.g., in the context of medical data). Expensive data annotation hence presents a bottleneck which impedes the application of deep learning methods to diverse, previously under-explored, problems. Learning transferable visual representations, namely representations obtained by training a model on one task (or collection of tasks) which can then be used as a starting point for multiple unseen downstream tasks using few samples, is therefore a key research challenge [65].

An emerging body of work based on self-supervision has demonstrated that it is possible to learn such transferable visual representations. The idea is to carefully construct a pretext task which does not rely on manual annotation, yet encourages the model to compute useful features of the input. Videos are a rich source of such pretext tasks as they capture the variations of instances over time which are not present in images. In addition, there is an abundance of videos available on the Internet covering almost any imaginable domain. As a result, and with the recent emergence of research video data sets [1], videos have been investigated in the context of self-supervision (for example, [37, 60, 59, 27, 61, 69, 38, 48, 39, 3, 2]). We believe that a holistic approach which captures these diverse efforts can be coupled with image-based pretext tasks to further improve the performance of self-supervised models.

Method                      Mean           Nat.    Spec.   Str.
Ex-ImageNet                 59.5           50.5    81.4    56.4
VIVI-Ex(4)                  62.5 (+3.0)    55.9    80.9    59.1
VIVI-Ex(4)-Big              63.3 (+3.8)    57.5    81.0    59.5
Semi-Ex-10% [65]            65.3           70.2    81.9    52.7
VIVI-Ex(4)-Co(10%)          67.2 (+1.9)    63.3    82.6    62.9
Sup-100% [65]               66.4           73.5    82.5    52.1
Sup-Rot-100% [65]           68.0 (+1.6)    73.6    83.1    55.5
VIVI-Ex(4)-Co(100%)         69.4 (+3.0)    69.9    83.3    62.1
VIVI-Ex(4)-Co(100%)-Big     71.7 (+5.3)    72.5    84.3    64.7

Table 1: Mean testing accuracy and per-category mean accuracy (Nat. = Natural, Spec. = Specialized, Str. = Structured data sets) for models fine-tuned on the 19 diverse downstream tasks from the VTAB benchmark [65], using only 1000 labels per task. The proposed unsupervised models (VIVI-Ex(4) / VIVI-Ex(4)-Big) trained on raw YT8M videos and variants co-trained with 10%/100% labeled ImageNet data (VIVI-Ex(4)-Co(10%) / VIVI-Ex(4)-Co(100%)) outperform the corresponding unsupervised (Ex-ImageNet), semi-supervised (Semi-Ex-10%) and fully supervised (Sup-100%, Sup-Rot-100%) baselines by a large margin.

In this work we propose a versatile video-based self-supervision framework for learning image representations. We divide a video data set into its natural hierarchy of frames, shots, and videos. The intuition is that the model can leverage (1) the frames to learn to be robust to color perturbations or contrast changes, (2) the shot information to be robust to rigid and non-rigid transformations of objects in a scene, and that (3) explicitly accounting for the video-level context should encourage the model to capture semantic relationships of scenes across shots/clips. In contrast to individual frame, shot, or video-level self-supervision objectives, our holistic approach can yield a representation that transfers better to a large set of downstream tasks. As an additional benefit, our approach does not need to pre-compute optical flow or motion segmentation masks, nor does it rely on object tracking.

In contrast to most previous work, our goal is to learn feature representations for downstream image classification as opposed to action recognition. We train the proposed model on the YT8M data set (without using video-level labels) and show that this approach leads to state-of-the-art self-supervised results on the 19 diverse downstream tasks of the VTAB benchmark [65]. We then show how to co-train the model jointly with labeled images, outperforming an ImageNet-pretrained ResNet-50 with fewer labeled images. We also investigate the robustness of our co-training models to natural perturbations as induced by the variations across nearby frames in videos [51].

In summary, our contributions are:

  • We propose a versatile framework to learn image representations from non-curated videos by learning frame, shot, and video-level invariances.

  • We train a variety of models on millions of videos from the YT8M data set and achieve a 3.8% absolute improvement over image/frame-based baselines across the 19 diverse tasks of the VTAB benchmark [65] (cf. Table 1), which sets a new state of the art among unsupervised methods.

  • We augment the SSL training framework with a supervised classification loss using data from ImageNet. The resulting models outperform an ImageNet-pretrained network using only 10% labeled ImageNet images (and no additional unlabeled ones), and achieve a new state of the art when co-trained with the full ImageNet data set, outperforming the best previous supervised result by 3.7 points.

2 Related work

Self-supervised learning of image representations

SSL is an active topic of research in the computer vision community. Recent methods [63, 24, 4, 42, 23, 56] have advanced the state of the art in terms of learning representations that can linearly separate between the 1000 ImageNet categories [47]. Prior work has explored diverse self-supervision cues such as spatial context [11], colorization [67], and equivariance to transformations [17, 41], alongside unsupervised techniques such as clustering [6, 68], generative modelling [13, 31], and exemplar learning [14].

Learning image representations from videos

More relevant to our contribution is the body of literature on SSL of image representations from videos. The temporal context of frames in video data has been widely exploited. For example, [37, 34, 15, 5, 60] make use of the order in which frames appear in a video to learn representations. Other forms of temporal context include its combination with spatial context [59], and the use of spatio-temporal co-occurrence statistics [27]. Orthogonal to these efforts, which attempt to be selective of the differences between frames, prior work along the lines of slow feature analysis [61, 69] also exploited videos as a means of learning invariant representations. Temporal coherence was exploited in a co-training setting by early work [38] on learning CNNs for visual object recognition and face recognition. Slow and steady feature analysis [29] attempts to learn representations that exhibit higher order temporal coherence. This object deformation signal can be separated from global camera motion by tracking objects using unsupervised methods. These tracked patches have been used to learn image representations [58]. Tracking in this context may be replaced by spatio-temporally matched region proposals [16].

Some of the earliest work making use of temporal consistency used future frame prediction [53] as a pretext task. A more challenging version of this task is single frame future synthesis. The ambiguity in single-frame prediction has been side-stepped via time-agnostic prediction [28], motion segmentation [44], cross-pixel matching [35], and by giving the model a motion cue as input [66]. The latter two require distilling the temporal information from video pixels into optical-flow fields.

Optical-flow has been treated as a separate modality from the RGB pixels in a multi-modal setting [48, 56]. Even beyond optical-flow, videos, as found on the Internet, are inherently multi-modal, as they contain audio as well as subtitles. Thus relevant here are multi-modal learning methods that combine vision and audio [39, 8, 43, 3], and vision and text [54] to achieve better performance than their uni-modal baselines. In a robotics setting, RGB pixels may be considered together with ego-motion [2, 30]. Time-contrastive networks [50] consider two views of the same action to learn view invariant representations also applied in a robotics setting.

Doersch et al. [12] show that motion-based SSL may be combined with other self-supervision cues, namely exemplar, colorization, and spatial context, to pre-train models that perform better than each of these cues individually. Taking inspiration from their success, our framework presents a synergistic combination of SSL methods.

Figure 1: (left) Illustration of the frame, shot, and video-level encoding pipeline used in this work. Each frame is encoded using the frame encoder $f$. The frame embeddings are then aggregated for each shot using a pooling function $p$ to obtain shot embeddings $e^i_k$. Predictions on the video level are then computed using the prediction functions $g_m$. (right) Intuitively, we want to choose frame/shot- and video-level losses that embed frames from the same shot close to each other and frames from different shots or videos far apart, while encouraging shot embeddings from the same video to be predictive of each other using (simple) prediction functions.

Transferable representations

Fine-tuning models trained on ImageNet labels is a popular strategy for transferring representations to new tasks [25]. Kornblith et al. [33] show that better supervised models tend to transfer better when fine-tuned. Other supervised learning benchmarks focus on performance on multiple data sets, either via transfer learning, meta-learning, or multitask learning [46, 57]. In the representation learning literature, models are usually evaluated in-domain, typically on ImageNet [66, and references therein]. However, self-supervised models are now performing well on tasks such as surface normal estimation, detection, and navigation [18]. The VTAB benchmark evaluates the transferability of representations beyond object classification in the natural image domain to many domains and task semantics such as counting and localization [65]. Similarly, recent developments in NLP have led to representations that transfer effectively to many diverse tasks [10].

3 Learning video-induced visual invariances

We start by giving an overview of the proposed framework in Sec. 3.1, and discuss frame/shot-level and video-level losses in detail in Sec. 3.2 and Sec. 3.3, respectively.

3.1 Overview

We consider a data set $\mathcal{X}$ containing $N$ videos, each composed of multiple shots. For simplicity of exposition we assume that each video consists of $K$ shots, and each shot has $T$ frames. If we denote the $t$-th frame in the $k$-th shot of video $i$ by $x^i_{k,t}$, we can write the data set as $\mathcal{X} = \{x^i_{k,t}\}$ with $i = 1, \dots, N$, $k = 1, \dots, K$, and $t = 1, \dots, T$. Our framework consists of a frame encoder $f$, a frame embedding pooling function $p$, and one or multiple shot-level prediction functions $g_m$, $m = 1, \dots, M$. The pooling function computes an embedding $e^i_k$ of the $k$-th shot in video $i$ by feeding each frame through the frame encoder and applying the pooling function,

$$e^i_k = p\big(f(x^i_{k,1}), \dots, f(x^i_{k,T})\big). \tag{1}$$

The pooling function can have different forms, ranging from simple average pooling to attention pooling taking the values of the individual frame embeddings into account. Shot-level prediction functions are trained to predict pretext (label-free) targets from shot embeddings.

More formally, to learn invariances at different levels of abstraction, we define a frame/shot-level loss and a video-level loss. The frame/shot-level loss takes the form

$$\mathcal{L}_S(f; \mathcal{X}) = \frac{1}{NK} \sum_{i=1}^{N} \sum_{k=1}^{K} \ell_S\big(f(x^i_{k,1}), \dots, f(x^i_{k,T}); y^i_k\big), \tag{2}$$

where the $y^i_k$ are shot-level pretext labels and $\ell_S$ is a shot-level loss that can be instantiated as only acting on the frame level in the sense of decomposing into a sum over the frames (see Sec. 3.2 for concrete instantiations of losses). The video-level loss is given by

$$\mathcal{L}_V(f, p, g; \mathcal{X}) = \frac{1}{NM} \sum_{m=1}^{M} \sum_{i=1}^{N} \ell_V\big(g_m(e^i_1, \dots, e^i_K); y^i_m\big), \tag{3}$$

where the $y^i_m$ are video-level pretext labels and $\ell_V$ is a video-level loss (see Sec. 3.3 for concrete losses). The total loss is then given by $\mathcal{L}_{SSL} = \mathcal{L}_S + \lambda \mathcal{L}_V$, where $\lambda > 0$ balances the shot-level and video-level losses. $\mathcal{L}_{SSL}$ is minimized jointly w.r.t. the parameters of $f$, $p$, and the $g_m$.
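The encoding pipeline described above (encode every frame, pool per shot, combine the two loss terms with a weight) can be sketched in a few lines. This is only an illustration under stated assumptions: `toy_encoder` is a hypothetical stand-in for the frame encoder, frames are plain lists of numbers, and the loss values passed to `total_ssl_loss` are placeholders rather than outputs of the actual losses.

```python
from statistics import fmean


def average_pool(frame_embeddings):
    """Average-pool equal-length frame embeddings into one shot embedding."""
    dim = len(frame_embeddings[0])
    return [fmean(e[d] for e in frame_embeddings) for d in range(dim)]


def encode_video(video, frame_encoder, pool=average_pool):
    """video: list of shots, each shot a list of frames.

    Returns one pooled embedding per shot.
    """
    return [pool([frame_encoder(frame) for frame in shot]) for shot in video]


def total_ssl_loss(shot_loss, video_loss, lam):
    """Weighted sum of the frame/shot-level and video-level loss values."""
    return shot_loss + lam * video_loss


def toy_encoder(frame):
    """Hypothetical encoder (assumption): maps a frame to (mean, max)."""
    return [sum(frame) / len(frame), max(frame)]


video = [
    [[0.0, 1.0], [1.0, 1.0]],  # shot 1 with two tiny "frames"
    [[2.0, 4.0], [4.0, 4.0]],  # shot 2 with two tiny "frames"
]
shot_embeddings = encode_video(video, toy_encoder)
print(shot_embeddings)
```

Average pooling is used here because it is parameter-free; an attention-based pooling function would take the same list of frame embeddings and return a vector of the same size.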

Co-training with labeled images

We also consider the case where one has access to a limited number of labeled images in addition to the video data. Combining image-based SSL losses with a supervised loss applied to a subset of the images was studied previously by [64]. They found that this approach leads to state-of-the-art semi-supervised models, and improves the performance of supervised models when all images are labeled. Here, we consider the related setup where the SSL loss is computed on video data, and the supervised loss is based on image data from a different data set. Specifically, we additionally apply the frame encoder $f$ followed by a linear classifier to mini-batches of labeled images and compute the cross-entropy loss $\mathcal{L}_{SUP}$ between the predictions and the image labels. The total loss is then computed as $\mathcal{L}_{SSL} + \gamma \mathcal{L}_{SUP}$, where $\gamma > 0$ balances the contributions of the self-supervised and supervised loss terms.

3.2 Learning shot-level invariances

To define the frame/shot-level loss $\mathcal{L}_S$, we propose to build on any SSL loss designed for images, such as classifying exemplars [14], solving jigsaw puzzles of image patches [40], or rotation prediction [17]. For learning shot-induced invariances, one can take two approaches:

  (i) apply the image-based SSL loss independently to each frame so that the shot-induced invariances are learned implicitly through the combination of the pooling function and the video-level prediction task, or

  (ii) explicitly ensure that the embeddings of the frames from the same shot are similar by adding a triplet or a contrastive loss to the image-based SSL loss.

In this work, in the spirit of approach (i) we consider SSL by rotation prediction [17] without additional explicit shot-level loss. To explore approach (ii) we rely on a variant of exemplar SSL [14], where each image is associated with a different class, and a feature extractor is trained to classify each image into its own class after heavily augmenting it (random cropping, rotation, contrast, and color shifts). Following [11, 32], to scale this approach to hundreds of millions of images (frames), we employ a triplet loss [49] encouraging augmentations of the same image to be close and augmentations of different images to be far apart. To learn invariances from different frames of the same shot, rather than picking a random frame from the shot and applying random augmentations to it, we pick consecutive frames from the same shot and augment each frame once. As a result, our feature extractor learns both the invariances induced by temporal variation in video as well as those induced by the data augmentation.
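To make the shot-based exemplar idea concrete, the sketch below forms triplets in which the anchor and the positive are embeddings of different frames from the same shot, and the negative comes from another shot. The hinge form, the margin of 0.5, the Euclidean distance, and the exhaustive enumeration of triplets are all assumptions made for illustration; the text only states that a triplet loss [49] is used.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two embeddings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge triplet loss: the positive should be closer than the negative.

    The margin value is an assumption, not taken from the paper.
    """
    return max(
        0.0,
        l2_distance(anchor, positive) - l2_distance(anchor, negative) + margin,
    )


def shot_triplets(batch):
    """batch: list of shots, each a list of frame embeddings.

    Yields (anchor, positive, negative): anchor and positive are distinct
    frames of the same shot, the negative is a frame of a different shot.
    """
    for i, shot in enumerate(batch):
        for j, other in enumerate(batch):
            if i == j:
                continue
            for a, anchor in enumerate(shot):
                for p, positive in enumerate(shot):
                    if a == p:
                        continue
                    for negative in other:
                        yield anchor, positive, negative


batch = [
    [[0.0, 0.0], [0.0, 1.0]],  # embeddings of two frames from shot A
    [[3.0, 4.0], [3.0, 5.0]],  # embeddings of two frames from shot B
]
losses = [triplet_loss(a, p, n) for a, p, n in shot_triplets(batch)]
mean_loss = sum(losses) / len(losses)
print(len(losses), mean_loss)
```

In this toy batch the two shots are already well separated, so every triplet satisfies the margin and contributes zero loss.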

3.3 Learning video-level invariances

In contrast to action recognition networks, which learn video representations that have to be discriminative w.r.t. changes between frames, our framework targets learning representations that are invariant to such changes. Nevertheless, discriminative tasks useful for learning representations for action recognition, such as predicting whether a sequence of frames is played forward or backward [60], verifying whether the frames are ordered or shuffled [37], or predicting features corresponding to future frames [20], can be useful to learn abstract transferable representations when applied to sensibly chosen groups of aggregated frames. Following this intuition, our framework allows applying any of these tasks to shot embeddings, rather than individual frame embeddings. For example, determining whether a sequence of shot embeddings is played forward or backward requires understanding of the high-level semantics of the scene and objects in each shot. Similarly, predicting future shot embeddings from the past ones encourages learning an abstract summary of each shot. In this work we will explore exactly these two approaches.
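As an illustration of the forward/backward task on shot embeddings, the following sketch builds one training example by randomly reversing a sequence of shot embeddings and concatenating them. The label convention (1 for reversed, 0 for forward) and the reversal probability of 0.5 are arbitrary choices made here, not taken from the paper.

```python
import random


def make_order_example(shot_embeddings, rng):
    """Randomly reverse a sequence of shot embeddings.

    Returns (features, label), where features is the concatenation of the
    (possibly reversed) embeddings and label is 1 if the sequence was
    reversed, 0 otherwise (label convention is an assumption).
    """
    reverse = rng.random() < 0.5
    sequence = list(reversed(shot_embeddings)) if reverse else list(shot_embeddings)
    features = [value for embedding in sequence for value in embedding]
    return features, int(reverse)


rng = random.Random(0)
embeddings = [[1.0, 2.0], [3.0, 4.0]]  # two toy shot embeddings
features, label = make_order_example(embeddings, rng)
print(features, label)
```

A classifier trained with a cross-entropy loss on `features` would then play the role of the video-level prediction function.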

Figure 2: VTAB 1000 example mean score and per-category mean score of exemplar SSL from YT8M frames (Ex-YT-F), with additional shot-level self-supervision (Ex-YT-S), the proposed method with InfoNCE video-level prediction across 4 frames (VIVI-Ex(4)) and additionally with a 3× wider architecture (VIVI-Ex(4)-Big). Both shot- and video-level losses improve the overall score, with the gains coming mostly from higher mean accuracy on the natural and structured subsets.

For shot order prediction, we randomly reverse the order of the shot embeddings and train a prediction function $g$ to predict the shot order from concatenated shot embeddings, i.e., $\ell_V$ in (3) is the cross-entropy loss and the pretext label $y^i$ is a binary indicator of whether the sequence of shot embeddings is reversed. To train $g$ to predict future shot embeddings, we rely on noise-contrastive estimation [19]. Specifically, we use the embeddings of the shots $e^i_1, \dots, e^i_k$ to obtain a prediction $\hat{e}^i_{k+m}$ of the embedding of the shot $m$ steps in the future. Then, $\ell_V$ should quantify the quality of the prediction, which we accomplish using the InfoNCE loss [42]

$$\mathcal{L}_{NCE} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\big(g(\hat{e}^i_{k+m}, e^i_{k+m})\big)}{\frac{1}{N} \sum_{j=1}^{N} \exp\big(g(\hat{e}^i_{k+m}, e^j_{k+m})\big)}, \tag{4}$$

where $g$ is trained to assign high scores to pairs of shot embeddings from the same video, and low values to embeddings computed from different videos. (In practice, we use all shot embeddings from the other videos, not only those at time step $k+m$, which is known to improve performance [42].) Note that the terms in (4) can, up to an additive constant, be seen as the cross-entropy loss of an $N$-class classification problem where the correct label is $i$, so that we could reformulate the loss in the form (3) using class labels $y^i = i$.
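The remark that the InfoNCE terms equal a cross-entropy loss up to an additive constant can be checked numerically. The sketch below assumes a plain dot product as the scoring function (a hypothetical stand-in for the learned critic) and uses only the future shots of the other videos in the batch as negatives; under these assumptions the constant is the logarithm of the number of videos.

```python
import math


def dot(a, b):
    """Dot-product score; a stand-in (assumption) for the learned critic."""
    return sum(x * y for x, y in zip(a, b))


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def cross_entropy(predictions, targets, score=dot):
    """N-class cross-entropy where the correct class for row i is i."""
    n = len(predictions)
    total = 0.0
    for i in range(n):
        scores = [score(predictions[i], targets[j]) for j in range(n)]
        total += log_sum_exp(scores) - scores[i]
    return total / n


def info_nce(predictions, targets, score=dot):
    """InfoNCE with the denominator averaged over the N candidates.

    Averaging instead of summing shifts the cross-entropy by log(N).
    """
    return cross_entropy(predictions, targets, score) - math.log(len(predictions))


# Toy batch: predicted vs. actual future shot embeddings for two videos.
predicted = [[1.0, 0.0], [0.0, 1.0]]
actual = [[2.0, 0.0], [0.0, 2.0]]
print(info_nce(predicted, actual), cross_entropy(predicted, actual))
```

When all scores are identical the predictions carry no information about which video a shot came from, and `info_nce` evaluates to zero.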

4 Experimental setup

Our experiments encompass two training phases, which we refer to as upstream and downstream. First, in the upstream phase, we train our models on video (and image) data using the methods proposed in the previous section. Then, we fine-tune those trained models on a set of downstream problems in the second phase. We focus on the challenging scenario in which the downstream data is limited, and use only 1000 examples for each downstream data set [65]. To understand the limits of the proposed approaches we have also experimented using the full downstream data sets. We provide these results in the supplementary material as our main focus is the low data regime.

Upstream training

We train on the videos in the YT8M data set [1], which consists of millions of YouTube video IDs with over 3800 visual entities. We downloaded approximately M of these videos sampled at Hz and split them into a training set of M and a testing set of M videos. We further split the videos into shots using a simple strategy based on color histograms, similarly to [36]. We also present results of several baseline approaches applied to a data set obtained by selecting a single random frame from each video, which we refer to as YT8M frames.
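The exact shot-splitting procedure is not spelled out here beyond being "based on color histograms". One simple possibility, shown purely as a sketch, is to start a new shot whenever the histogram of a frame differs strongly from that of the previous frame. The single intensity channel, the number of bins, the L1 distance, and the threshold are all assumptions for illustration.

```python
def histogram(frame, bins=4, max_value=256):
    """Normalized intensity histogram of a frame.

    frame: flat list of integer pixel values in [0, max_value).
    """
    counts = [0] * bins
    for value in frame:
        counts[min(value * bins // max_value, bins - 1)] += 1
    return [c / len(frame) for c in counts]


def split_into_shots(frames, threshold=0.5):
    """Start a new shot when consecutive histograms differ by more than
    `threshold` in L1 distance (threshold is a hypothetical choice)."""
    shots = [[frames[0]]]
    previous = histogram(frames[0])
    for frame in frames[1:]:
        current = histogram(frame)
        if sum(abs(a - b) for a, b in zip(previous, current)) > threshold:
            shots.append([])
        shots[-1].append(frame)
        previous = current
    return shots


dark = [10, 20, 30, 40]
bright = [200, 210, 220, 230]
shots = split_into_shots([dark, dark, bright, bright])
print([len(shot) for shot in shots])
```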

Furthermore, in the co-training experiments we also use (a class-balanced fraction of) the ImageNet (ILSVRC-2012) training set [9], which contains 1.2M images classified into 1000 categories.

Downstream evaluation

To evaluate the learned representations, we use the data sets and follow the protocol of VTAB [65]. This protocol consists of 19 data sets categorized into three groups as follows (details and references in the appendix).

  • Natural: Seven classical image classification problems on natural images (Caltech101, CIFAR-100, DTD, Flowers102, Pets, Sun397 and SVHN).

  • Specialized: Image classification on data captured using specialist equipment, from the remote-sensing (Resisc45, EuroSAT) and the medical (Patch Camelyon, Diabetic Retinopathy) domains.

  • Structured: Eight tasks to predict properties of the objects appearing in an image (how many there are, their relative position and distance), on both rendered (Clevr, dSprites, SmallNORB, DMLAB) and real (KITTI) data.
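Because the three groups contain 7, 4, and 8 tasks respectively (19 in total), the overall mean score is a task-weighted average of the per-category means rather than their plain average. The short check below reproduces the mean of the Ex-ImageNet row of Table 1 from its three category means; small deviations for other rows can arise because the table entries are rounded.

```python
TASKS_PER_GROUP = {"natural": 7, "specialized": 4, "structured": 8}


def overall_mean(group_means):
    """Task-weighted mean over the groups, from per-group mean accuracies."""
    total_tasks = sum(TASKS_PER_GROUP.values())
    weighted = sum(TASKS_PER_GROUP[g] * m for g, m in group_means.items())
    return weighted / total_tasks


# Per-category means of Ex-ImageNet as listed in Table 1.
ex_imagenet = {"natural": 50.5, "specialized": 81.4, "structured": 56.4}
print(round(overall_mean(ex_imagenet), 1))  # 59.5
```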

For each of these 19 data sets and each model that we propose, we launch a sweep over 4 hyper-parameters (learning rates and schedules, as in the lightweight mode of [65]). Then, we choose the models that had the best validation accuracy when averaged over these 19 tasks. These best-performing models were then re-trained for each data set on 1000 random points from the union of the train and validation set and evaluated on the testing set. To account for the randomness coming from the initialization of the fresh classification head and the order in which the data appears, we repeated this evaluation scheme three times and report the median test set accuracy (following [65]).
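The selection and reporting steps above can be sketched as follows. The sketch assumes that selection picks a single configuration by its validation accuracy averaged over tasks, and that the reported number for a data set is the median over the repeated runs; the configuration names and accuracy values are made up for illustration.

```python
from statistics import fmean, median


def select_best_config(val_acc):
    """val_acc[config][task] -> validation accuracy.

    Returns the config with the highest accuracy averaged over tasks.
    """
    return max(val_acc, key=lambda config: fmean(val_acc[config].values()))


def report_score(test_acc_runs):
    """Median test accuracy over repeated runs of the same setup."""
    return median(test_acc_runs)


# Hypothetical sweep results on two tasks (values are illustrative only).
val_acc = {
    "lr=0.1": {"task_a": 0.60, "task_b": 0.70},
    "lr=0.01": {"task_a": 0.65, "task_b": 0.60},
}
best = select_best_config(val_acc)
score = report_score([61.0, 62.5, 62.0])
print(best, score)
```

Reporting the median rather than the mean makes the score less sensitive to a single unlucky initialization among the three runs.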

Figure 3: Comparison of the VTAB 1000 example mean score of the proposed method with exemplar frame/shot-level SSL and InfoNCE video-level prediction across 4 frames (VIVI-Ex(4), and with a 3× wider architecture (VIVI-Ex(4)-Big)), with ImageNet-based exemplar (Ex-ImageNet) and rotation (Rot-ImageNet) baselines, as well as the multi-task SSL model from [12]. Our models outperform all baselines on average, and in particular on the structured data sets.

Architectures and training details

The frame encoder $f$ is modeled using the ResNet-50 v2 [22] architecture with BatchNorm [26]. We also investigated the effect of model capacity by widening the network by a factor of three. To avoid mismatch in batch statistics between the two data sources, in the co-training experiments we replace BatchNorm with GroupNorm [62] and also standardize [45] the weights of the convolutions. We construct mini-batches by sampling either 2 or 4 consecutive shots from each video (dropping those videos with fewer shots), and randomly select 8 consecutive frames for exemplar-based shot-level SSL and 4 consecutive frames for rotation-based frame-level SSL. For the InfoNCE loss, when we sample 2 shots, we predict the embedding of one from the embedding of the other one using an MLP, i.e., the function $g$ in (4) has the form $g(e, e') = \phi_1(e)^\top \phi_2(e')$, where $\phi_1, \phi_2$ are MLPs with a single hidden layer with 256 units. In the experiments with 4 shots, we use an LSTM prediction function with 256 hidden units to predict every shot embedding from the previous ones. We use temporal order prediction only together with exemplar-based SSL and for data with 2 shots per video, relying on a single-hidden-layer MLP with 512 hidden units as prediction function. Throughout, we rely on (parameter-free) average pooling for the pooling function $p$. For both frame and shot-level SSL approaches we use the augmentation mechanism from [55]. For models co-trained with a supervised loss based on a fraction of ImageNet we additionally use the same HSV-space color randomization as [64].

We also perform experiments where we replace the augmentation mechanism from [55] with AutoAugment (AA), which is an augmentation policy learned using a reinforcement learning algorithm from the full ImageNet data set. While it can cause label leakage when applied to unsupervised methods, we investigate it to understand how these automatically learned invariances compare to those induced by shot-based augmentation, which are label-free.

In all cases we choose the batch size such that the product of the number of videos $N$ and the number of shots $K$ is 2048, i.e., $NK = 2048$. We train all unsupervised models for 120k iterations, using SGD with a learning rate of 0.8 and momentum 0.9, multiplying the learning rate by 0.1 after 90k and 110k iterations. The co-trained models are trained for 100k iterations, and the schedule as well as the batch size is chosen depending on the amount of labeled data used. For the weight $\lambda$ (and $\gamma$ for co-trained models) we sweep over at most four different values. A complete description of all hyper-parameters and architectures can be found in the appendix.
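The step schedule and batch-size constraint for the unsupervised models can be written down directly. The sketch assumes the decay takes effect exactly at the stated iteration counts and that no warm-up is used, neither of which is specified here.

```python
def learning_rate(step, base_lr=0.8, boundaries=(90_000, 110_000), decay=0.1):
    """Piecewise-constant schedule: multiply by `decay` at each boundary.

    Whether the change applies at or just after a boundary is an assumption.
    """
    lr = base_lr
    for boundary in boundaries:
        if step >= boundary:
            lr *= decay
    return lr


def videos_per_batch(shots_per_video, shots_per_batch=2048):
    """Number of videos per batch so that videos x shots equals 2048."""
    return shots_per_batch // shots_per_video


for step in (0, 90_000, 110_000):
    print(step, learning_rate(step))
print(videos_per_batch(2), videos_per_batch(4))  # 1024 512
```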

Baselines

We train a rotation and an exemplar baseline model on ImageNet and on a data set obtained by sampling one frame from each video in our training set (YT8M frames). We use the same training protocol as [32] for the respective methods, except that we increase the batch size to 2048 and stretch the schedule to 120k iterations to be comparable to our methods. Furthermore, for the exemplar-based model we ablate the video-level prediction task, which amounts to treating the shots independently and only using the frames from the same shot as exemplars. In addition, we consider 3 baselines from [65]: a vanilla ResNet-50 v2 pretrained on ImageNet (achieving top-1/top-5 accuracy of %/% on the ImageNet validation set), the exemplar model trained on ImageNet with 10% class-balanced labeled data from [64] (Semi-Ex-10%), which achieves state-of-the-art semi-supervised accuracy on ImageNet, and the rotation model trained on ImageNet with all labels [64] (Sup-Rot-100%).

We further compare against three prior works that learn image representations from video data: the MS and the MT-SSL models from [11], and the TI model from [59]. MS learns representations based on a foreground-background segmentation pretext task. The segmentation maps are derived using an off-the-shelf offline video segmentation algorithm. MT-SSL combines MS and three other self-supervision objectives to train a multi-task network. Its representation derives from a combination of colorization, spatial context, and motion segmentation cues. The MS and MT-SSL models fine-tuned in this evaluation have a ResNet-101 [21] architecture up to block3. TI builds a graph combining intra-instance and inter-instance edges and exploits transitivity to learn invariant representations. The intra-instance edges are obtained by tracking patches in videos. We fine-tune their publicly available pre-trained VGG-16 [52] checkpoint. We refer the reader to the supplementary material for implementation details regarding the evaluation of these baselines.

5 Results

In this section we focus on the low sample-size regime, i.e., when each downstream data set consists of 1000 samples, and discuss the performance on the full data sets in the supplementary material (Table 4). In brief, the ranking of the methods according to the VTAB mean score using all examples is similar to the ranking according to the VTAB 1000 example mean score. Further, here we only present the best configuration (w.r.t. the number of shots and choice of prediction function) for each of our Video-Induced Visual Invariance (VIVI) learning approaches, and defer the results for other configurations to the supplementary material (Table 4).

5.1 Self-supervised learning


Fig. 2 shows the results for our models and the exemplar-based baselines. The baseline trained on YT8M frames only (Ex-YT-F), without leveraging any temporal information, achieves a mean VTAB 1000 example score of 59.4%. Exploiting the temporal variations within shots to create exemplars (Ex-YT-S) increases that score by about 1.9 points. Further, adding the video-level prediction loss on top adds another 1.2 points. It hence appears that leveraging both shot- and video-level invariances using our approach leads to significant gains over just using frames. In addition, increasing the model capacity (using a 3× wider model) leads to another increase by 0.8 points. Note that this model is only 2.0 points behind the semi-supervised model from [64] (Semi-Ex-10%), which uses 128k labeled images from ImageNet for training (cf. Table 1). The gains mostly come from improvements on the natural and structured data sets, whereas video-level losses do not notably improve the score on the specialized data sets (see Fig. 2). We observed the largest gains when using the InfoNCE loss with 4 shots and more modest improvements for the InfoNCE loss and temporal order prediction with 2 shots (see Table 4 in the supplementary material).


Similarly to the exemplar experiments, we observe gains of 2.0 points in the mean VTAB 1000 example score over the frame-based baseline (Rot-YT-F) when using a video-level prediction task (VIVI-Rot(4); see Table 2). The gains are smaller for 2 than for 4 shots when combined with the InfoNCE loss, and temporal order prediction was not effective when combined with rotation prediction as frame-level loss for either number of shots. We emphasize that the frame encoder trained via rotation SSL on YT8M frames performs considerably worse than the same model trained on ImageNet. This is not surprising as ImageNet images are carefully cropped and the data has a balanced class distribution. By contrast, frames sampled from YT8M are less balanced in terms of content and arguably either provide many shortcuts for the rotation task, such as black borders, overlaid logos, or frames with text on a uniform background, or lack any orientation cues.

Effect of AutoAugment (AA)

Table 2 shows the effect of using AA [7] instead of the augmentation mechanism from [55]. The effect is strongest on the frame-based baselines, increasing the VTAB 1000 example score by at least 2 points, and weakest on models involving shot- and video-level losses, where the increase is between 0.5 and 1.5 points. Hence, the invariances induced by AA are, to some degree, complementary to the proposed shot- and video-level losses. However, note that AA is trained on labeled ImageNet images, which might introduce label leakage. Hence, methods relying on AA should not be considered fully unsupervised.

            Exemplar                                   Rotation
            YT-F    YT-S    VIVI(4)    VIVI(4)-Big     YT-F    VIVI
w/o AA      59.4    61.3    62.5       63.3            56.9    58.9
AA          61.8    62.8    63.0       64.4            58.9    59.9

Table 2: Effect of replacing the data augmentation mechanism from [55] with AA. Video-induced invariances learned by our method are complementary to AA in the sense that applying AA to different variants of our method consistently leads to improvements.
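The per-model gains from AA follow directly from the two rows of Table 2. The snippet below recomputes them; the model names used as dictionary keys are shorthand chosen here to label the table columns.

```python
# Scores copied from Table 2 (VTAB 1000 example mean score).
without_aa = {
    "Ex-YT-F": 59.4, "Ex-YT-S": 61.3, "VIVI-Ex(4)": 62.5,
    "VIVI-Ex(4)-Big": 63.3, "Rot-YT-F": 56.9, "VIVI-Rot": 58.9,
}
with_aa = {
    "Ex-YT-F": 61.8, "Ex-YT-S": 62.8, "VIVI-Ex(4)": 63.0,
    "VIVI-Ex(4)-Big": 64.4, "Rot-YT-F": 58.9, "VIVI-Rot": 59.9,
}

# Gain in points from switching to AA, rounded to one decimal.
gains = {name: round(with_aa[name] - without_aa[name], 1) for name in without_aa}
for name, gain in gains.items():
    print(f"{name}: +{gain}")
```

The frame-based baselines gain 2.4 and 2.0 points, while the shot- and video-level variants gain between 0.5 and 1.5 points.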

Comparison with related work

Fig. 3 presents a summary of the comparison with baselines. We omit MS and TI as they obtain a VTAB 1000 example mean score comparable to relative patch location prediction [11] and jigsaw [40] SSL trained on ImageNet. These two methods have a significantly lower VTAB 1000 example score than the MT-SSL model, as well as the rotation and exemplar SSL baselines (see Table 4 in the supplementary material). Our VIVI models clearly outperform both the ImageNet baseline and the MT-SSL model. The score obtained by MT-SSL is comparable to that obtained by rotation-based SSL trained on ImageNet, which in turn scores points higher than exemplar-based SSL. Both our models and MT-SSL significantly outperform rotation and exemplar-based SSL on the structured data sets, whereas the ImageNet-based exemplar baseline obtains the highest mean score on the specialized data sets.

5.2 Co-training with ImageNet

Figure 4: Per-data set comparison of our exemplar-based unsupervised model (VIVI-Ex(4)) and its counterpart co-trained with the full ImageNet data set (VIVI-Ex(4)-Co(100%)). The accuracy on most of the natural (red) and specialized (green) data sets improves, with the largest improvements observed on the latter, while the accuracy decreases for about half of the structured data sets (blue).

In Table 1 we compare the scores obtained by our exemplar-based co-training models with the baselines from [65]. Our model with frame/shot-level and video-level losses and a wider architecture (VIVI-Ex(4)-Big) reduces the gap between exemplar trained on ImageNet and the strong Semi-Ex-10% semi-supervised baseline model by more than a factor of 2. Moreover, our model co-trained with 10% labeled ImageNet examples (class-balanced, no additional unlabeled ImageNet examples are used) outperforms both the Semi-Ex-10% baseline and the ImageNet pre-trained ResNet-50 on the VTAB 1000-example mean score. Using the entire labeled ImageNet training set for co-training yields a further increase. Finally, scaling up the architecture and applying AA to preprocess the ImageNet data adds further gains, leading to a clear new state of the art on the VTAB benchmark. The largest gains from using (a subset of) ImageNet can generally be observed on the natural data sets, whereas the gains on the specialized and structured data sets are significantly lower. This result is not surprising given that many data sets in the natural category are semantically similar to ImageNet. Fig. 4 shows the per-data set increase/decrease in the VTAB 1000-example score when adding a classification loss computed on the entire ImageNet data set to VIVI-Ex(4).

Robustness to video perturbations

Our co-trained models are trained to both recognize 1000 ImageNet categories and be invariant to deformations found in video data. We therefore expect model predictions to be stable across neighboring frames in a video. To measure if this is indeed the case, we evaluate our VIVI-Ex(4)-Co(100%) model on the ImageNet-Vid-Robust [51] benchmark. This benchmark measures the drop in accuracy under a stricter definition of the 0-1 loss using videos from the ImageNet-Vid data set [47]. Given a set of frames, the prediction on an “anchor” frame is considered correct only if all neighboring frames are predicted correctly. Intuitively, the drop in performance going from standard top-1 accuracy on anchor frames to this stricter loss function is indicative of a lack of model robustness. The lower the drop, the more robust the model. In Table 3 we observe that our co-trained model is slightly more robust than its purely supervised counterpart, although the results are still within error bars. This is similar to the difference in performance drop observed for fine-tuning on ImageNet-Vid as reported in the benchmark paper itself [51, Table 1]. These initial results suggest that our co-training approach leads to a similar effect as fine-tuning, despite the domain shift between YT8M and ImageNet-Vid. It seems that robustness to natural perturbations in videos is extremely challenging and worth investigating in the future.

Model Type            Accuracy Original    Accuracy Perturbed    Δ
ImageNet              68.0 [65.2, 70.7]    49.9 [46.9, 52.9]     18.1
VIVI-Ex(4)-Co(100%)   62.2 [59.3, 65.1]    46.3 [43.3, 49.2]     15.9

Table 3: ImageNet-Vid-Robust: We evaluate our VIVI-Ex(4)-Co(100%) model (co-trained using all labeled images available in the ImageNet training set) on the ImageNet-Vid-Robust benchmark [51]. Accuracy Original is the top-1 accuracy measured on “anchor” frames. Accuracy Perturbed is the PM-10 accuracy from the benchmark, i.e., the worst-case accuracy defined over the 20 neighboring frames [51] around each “anchor” frame. Δ is the absolute difference between these two. On this benchmark, a lower difference is better. Values in brackets report the Clopper-Pearson confidence interval.
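The stricter 0-1 loss used by this benchmark can be illustrated with a minimal sketch. It assumes that per-frame correctness flags are already available for each video and, as a simplification, treats every frame within k positions of the anchor as a neighbor. The function name and the input layout are hypothetical and are not the benchmark's API.

```python
import numpy as np

def anchor_and_pm_k_accuracy(videos, k=10):
    """Anchor accuracy, worst-case (PM-k style) accuracy, and their difference.

    videos: iterable of (frame_correct, anchor_index) pairs, where
    frame_correct is a sequence of booleans, one per frame (hypothetical layout).
    """
    anchor_hits, robust_hits = [], []
    for frame_correct, anchor in videos:
        frame_correct = np.asarray(frame_correct, dtype=bool)
        lo = max(0, anchor - k)
        hi = min(len(frame_correct), anchor + k + 1)
        # Standard top-1 correctness on the anchor frame.
        anchor_hits.append(frame_correct[anchor])
        # Stricter loss: correct only if the anchor and all neighbors are correct.
        robust_hits.append(frame_correct[lo:hi].all())
    acc_original = float(np.mean(anchor_hits))
    acc_perturbed = float(np.mean(robust_hits))
    return acc_original, acc_perturbed, acc_original - acc_perturbed
```

With k=10 the window covers up to 20 neighboring frames around each anchor, which corresponds to the PM-10 setting reported in Table 3.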

6 Conclusion

We propose and evaluate a versatile framework for learning transferable, data-efficient image representations by exploiting video-induced visual invariances at different levels of granularity. The framework can be instantiated with any image-based SSL loss at the frame/shot-level and arbitrary sequence prediction proxy tasks at the video-level. Our experiments reveal that purely self-supervised models benefit greatly from exploiting video-induced invariances, outperforming the SSL baselines trained on ImageNet by a large margin, in particular on problems that require predicting the structural properties of the data. Moreover, when augmenting the proposed framework with a supervised classification loss, the resulting models outperform a vanilla ImageNet-pretrained model using fewer labeled examples, and set a new state of the art on the VTAB benchmark when co-trained with the full ImageNet data set.

Future research could target a better understanding of how the choice of losses and data sets used for upstream training impacts the performance on different tasks in downstream evaluation. While we found our co-trained models to be somewhat more robust to natural perturbations induced by videos than models trained only on images, further research is needed on learning models that overcome robustness issues related to perturbations induced by videos.


We would like to thank Xiaohua Zhai for inspiring discussions, in particular on how to learn from video shots, and for contributions to preliminary experiments that led to this paper. Further, we would like to thank Raphael Marinier for help with preparing the YT8M data set. Finally, we are grateful to Lucas Beyer for his implementation of GroupNorm with weight standardization.


Appendix A Architectures

Here we expand on the short description in Section 4. The frame encoder is modelled using the ResNet-50 v2 [22] architecture with BatchNorm [26]. We also investigate in several experiments the effect of model capacity by widening the network by a factor of three. To avoid mismatch in batch statistics between the two data sources, in the co-training experiments we replace the BatchNorm with GroupNorm [62] and also standardize [45] the weights of the convolutions.
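To make the normalization change concrete, the following is a minimal NumPy sketch of weight standardization and group normalization as they are commonly defined. It is not the implementation used for the experiments: the tensor layouts, the group count of 32, and the omission of learnable scale and offset parameters are assumptions made for illustration.

```python
import numpy as np

def standardize_conv_weights(w, eps=1e-5):
    """Standardize each output filter of a conv kernel.

    w: array of shape [kh, kw, c_in, c_out] (assumed layout).
    """
    mean = w.mean(axis=(0, 1, 2), keepdims=True)
    var = w.var(axis=(0, 1, 2), keepdims=True)
    return (w - mean) / np.sqrt(var + eps)

def group_norm(x, num_groups=32, eps=1e-5):
    """Normalize per example and per channel group (no learnable parameters).

    x: array of shape [n, h, w, c] with c divisible by num_groups.
    """
    n, h, w, c = x.shape
    g = x.reshape(n, h, w, num_groups, c // num_groups)
    mean = g.mean(axis=(1, 2, 4), keepdims=True)
    var = g.var(axis=(1, 2, 4), keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(n, h, w, c)
```

Because the statistics in `group_norm` are computed separately for every example, the output for one image does not depend on which other images (video frames or ImageNet images) share its batch, which is the property that motivates the replacement.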

For each prediction task, we attach a different linear head to the 2048-dimensional pre-logits ResNet representation before applying the respective loss or prediction function. For exemplar, following [32], we use a linear head with 1000 outputs with L2-normalization of the features before feeding into the triplet loss. For rotation prediction we rely on a linear head with 4 outputs. For the video-level loss (prediction across shots and temporal order prediction) we project the pre-logits, average-pooled across the frames of the same shot, to 512 dimensions using a linear head, and feed this representation to the prediction functions. Finally, in the experiments with co-training, we rely on an additional linear classification head with 1000 outputs.
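The head layout described above can be sketched as follows. This is an illustrative NumPy sketch only: the random initialization, the helper names, and the choice to apply L2-normalization to the output of the exemplar head are assumptions, and the ResNet encoder producing the 2048-dimensional pre-logits is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
PRELOGITS_DIM = 2048  # dimension of the pre-logits ResNet representation

def init_linear(in_dim, out_dim):
    """Random weights and zero bias for a linear head (illustrative init)."""
    return rng.normal(0.0, 0.01, size=(in_dim, out_dim)), np.zeros(out_dim)

# One linear head per prediction task, with the output sizes from the text.
heads = {
    "exemplar": init_linear(PRELOGITS_DIM, 1000),
    "rotation": init_linear(PRELOGITS_DIM, 4),
    "video": init_linear(PRELOGITS_DIM, 512),
    "classification": init_linear(PRELOGITS_DIM, 1000),
}

def apply_head(name, features):
    w, b = heads[name]
    return features @ w + b

def exemplar_embedding(prelogits):
    """L2-normalized features that would be fed into the triplet loss."""
    z = apply_head("exemplar", prelogits)
    return z / np.maximum(np.linalg.norm(z, axis=-1, keepdims=True), 1e-12)

def shot_embedding(prelogits):
    """prelogits: [num_shots, frames_per_shot, 2048] -> [num_shots, 512]."""
    pooled = prelogits.mean(axis=1)  # average-pool across frames of a shot
    return apply_head("video", pooled)
```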

For the shot-prediction loss, when we sample 2 shots, we predict one from the other using MLPs, i.e., the function in (4) is built from MLPs with a single hidden layer with 256 units and 128 outputs. In the experiments with 4 shots, we use a 2-layer LSTM prediction function with 256 hidden units to predict every shot embedding from the previous ones. To match the dimension of the LSTM output (256) and that of the future shot embeddings (512), we employ another linear layer. We use temporal order prediction only together with exemplar-based SSL and for data with 2 shots per video, relying on a single-hidden-layer MLP with 512 hidden units as prediction function.

For both frame and shot-level SSL approaches we use the augmentation mechanism from [55]. For models co-trained with a supervised loss based on a fraction of ImageNet we additionally use the same HSV-space color randomization as [64]. We also perform experiments where we replace the augmentation mechanism from [55] with AA, which is an augmentation policy learned using a reinforcement learning algorithm from the full ImageNet data set. More specifically, we rely on a publicly available TF-Hub module.

Appendix B Training details

Table 5 provides details about the schedules, batch size, loss weights, etc. used for the individual methods. When exploring the effect of AA we reduce the weight of the video-level loss by a factor of 2. The schedule for VIVI-Ex(4)-Co(10%) is motivated as follows. We take the schedule and batch size used for the ImageNet exemplar co-training experiments for 10% labeled ImageNet examples from [64], stretch the schedule to 100k iterations, and reduce the batch size (as well as the learning rate) so that the number of epochs over the 10% (128k example) data set matches that of the original schedule.


We set the margin parameter in the semi-hard triplet loss [49] to 0.5. For rotation-based SSL, following common practice [17, 32], we compute the predicted rotation after appending to the mini-batch 3 rotated copies of the mini-batch along the batch dimension and compute the rotation loss for all rotated copies.
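The batch construction for rotation prediction can be sketched as below. This is a minimal NumPy illustration under the assumption of square images in [n, height, width, channels] layout; the function names are introduced here, and the logits would come from the 4-way rotation head of the encoder, which is not shown.

```python
import numpy as np

def make_rotation_batch(images):
    """Append 3 rotated copies of the mini-batch along the batch dimension.

    images: [n, s, s, c] square images.
    Returns the [4n, s, s, c] batch and the rotation labels in {0, 1, 2, 3},
    where label k stands for a rotation by k * 90 degrees.
    """
    if images.shape[1] != images.shape[2]:
        raise ValueError("square images expected")
    copies = [np.rot90(images, k=k, axes=(1, 2)) for k in range(4)]
    labels = np.repeat(np.arange(4), images.shape[0])
    return np.concatenate(copies, axis=0), labels

def rotation_loss(logits, labels):
    """Mean softmax cross-entropy over all rotated copies; logits: [4n, 4]."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())
```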

We train all models on 128 cores of a Google TPU v3 Pod. For exemplar SSL the triplet loss is computed per core. For all frame/shot-level loss variants, the video-level prediction loss is computed across all cores when prediction is across 4 shots, and computed per core when prediction is across 2 shots, as computing the loss across all cores led to instabilities in that case.

Model                    LR   #it.  w. #it.  LR schedule      WD, loss weights  batch size     #exemp.
Ex-ImageNet              0.8  120k  17k      0.1@52k;86k      - -               2048           8
Ex-YT-F                  0.8  120k  17k      0.1@52k;86k      - -               2048           8
Ex-YT-S                  0.8  120k  5k       0.1@90k;110k     - -               2048           8 (sh.)
VIVI-Ex(2)-Ord           0.8  120k  5k       0.1@90k;110k     -                 sh.            8 (sh.)
VIVI-Ex(2)               0.8  120k  5k       0.1@90k;110k     -                 sh.            8 (sh.)
VIVI-Ex(4)               0.8  120k  5k       0.1@90k;110k     -                 sh.            8 (sh.)
VIVI-Ex(4)-Big           0.8  120k  5k       0.1@90k;110k     0.04 -            sh.            8 (sh.)
VIVI-Ex(4)-Co(10%)       0.1  100k  3k       0.1@76k;88k;96k  0.04              sh., 256 im.   8 (sh.)
VIVI-Ex(4)-Co(100%)      0.8  100k  3k       0.1@70k;85k;95k  0.04              sh., 2048 im.  8 (sh.)
VIVI-Ex(4)-Co(100%)-Big  0.8  100k  3k       0.1@70k;85k;95k  0.04              sh., 2048 im.  8 (sh.)
Rot-ImageNet             0.8  120k  17k      0.1@52k;86k      - -               2048           1
Rot-YT-F                 0.8  120k  17k      0.1@52k;86k      - -               2048           1
VIVI-Rot(2)              0.8  120k  5k       0.1@90k;110k     -                 sh.            4 (sh.)
VIVI-Rot(4)              0.8  120k  5k       0.1@90k;110k     -                 sh.            4 (sh.)

Table 5: Learning rate (LR), number of training iterations (#it.), number of linear warm-up iterations (w. #it.), learning rate schedule (LR schedule), weight decay (WD), video-level loss weight, supervised cross-entropy loss weight, batch size, and the number of exemplars (#exemp.) for the different models considered in this paper. Lists of values indicate values explored in the parameter sweep, with the optimal value (in terms of validation VTAB 1000-example score) underlined. For the co-training methods we indicate video (suffix “sh.”) and image (suffix “im.”) batch size. If the number of exemplars is followed by “(sh.)” we use consecutive frames of the same shot to create exemplars.
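The schedule notation in Table 5 can be read as a piecewise-constant learning rate with linear warm-up. The sketch below assumes that an entry such as 0.1@90k;110k means "multiply the learning rate by 0.1 at iterations 90k and 110k"; this reading and the function names are assumptions made for illustration. The second helper reproduces the epoch arithmetic for the 10% co-training setting.

```python
def learning_rate(step, base_lr, warmup_steps, boundaries, decay=0.1):
    """Linear warm-up to base_lr, then multiply by `decay` at each boundary."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    num_decays = sum(step >= boundary for boundary in boundaries)
    return base_lr * decay ** num_decays

def epochs(num_iterations, batch_size, num_examples):
    """Number of passes over a data set for a given schedule and batch size."""
    return num_iterations * batch_size / num_examples

# Schedule of the Ex-YT-S row: LR 0.8, 5k warm-up iterations, 0.1@90k;110k.
for step in (2_500, 50_000, 100_000, 115_000):
    print(step, learning_rate(step, 0.8, 5_000, [90_000, 110_000]))

# VIVI-Ex(4)-Co(10%): 100k iterations with an image batch size of 256
# over the 128k-example ImageNet subset.
print(epochs(100_000, 256, 128_000))
```

Under these assumptions the image stream of VIVI-Ex(4)-Co(10%) sees the 128k-example subset 200 times over the course of training.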

Appendix C Baseline fine-tuning details

As mentioned in the main manuscript, we compared against two baseline methods: MT-SSL (Multi-Task Self-Supervised Learning) [12] and TI (Transitive Invariance) [59]. For MT-SSL we considered two variants: MS, which was pre-trained on motion segmentation only, and MT-SSL, which combined MS with three other tasks in a multi-task setting. We obtained pre-trained checkpoints for all three methods (MS, MT-SSL, and TI) from the authors of their respective prior works.

C.1 Fine-tuning motion segmentation and multi-task SSL baselines

MS and MT-SSL pre-trained a ResNet-101 up to block3. The representation at block3 is too large to be used directly. In [12], the authors used max-pooling to down-sample this representation and then trained a linear predictor for ImageNet classification. We experimented with this approach for VTAB evaluation. The default evaluation protocol for VTAB sweeps over a fixed set of initial learning rates. These were too high for the MS and MT-SSL models: for several downstream evaluation tasks, fine-tuning diverged. We therefore modified the evaluation sweep minimally to cover lower initial learning rates. We also evaluated a simpler alternative: global average pooling the block3 representation into a single feature vector. We found that global average pooling the representation achieved the best results on the VTAB validation set. It also did not diverge at higher learning rates, so we could use the default learning rate schedule in this case. We therefore used this setting for the final evaluation on test data.
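The two ways of reducing the block3 feature map compared above can be sketched in NumPy as follows. The [n, height, width, channels] layout, the use of non-overlapping pooling windows, and the function names are assumptions made for illustration; the exact pooling configuration of [12] is not reproduced here.

```python
import numpy as np

def global_average_pool(features):
    """Average over the spatial dimensions: [n, h, w, c] -> [n, c]."""
    return features.mean(axis=(1, 2))

def max_pool_flatten(features, window):
    """Non-overlapping max-pooling followed by flattening.

    features: [n, h, w, c] with h and w divisible by `window`.
    Returns an array of shape [n, (h // window) * (w // window) * c].
    """
    n, h, w, c = features.shape
    blocks = features.reshape(n, h // window, window, w // window, window, c)
    pooled = blocks.max(axis=(2, 4))
    return pooled.reshape(n, -1)
```

Global average pooling always yields one value per channel regardless of the spatial size, whereas the max-pooled and flattened representation grows with the number of remaining spatial positions.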

C.2 Fine-tuning the transitive invariance baseline

We exported the pre-trained Caffe checkpoint into TensorFlow using the caffe-tensorflow tool. We found that the pre-trained VGG-16 backbone diverges at higher learning rates when fine-tuning downstream on VTAB tasks. We therefore manually adjusted the sweep over initial learning rates until we found a value that worked well. Another challenge with transferring this baseline model to several downstream data sets was that it is a patch-based model that expects input patches of a fixed size, whereas the VTAB benchmark scales all images to a different resolution. We experimented with three ways of deploying this downstream: (a) resize the input image to the patch size expected by the model, (b) apply the model fully convolutionally and compute a global average pool at the end, and (c) crop patches of the expected size at a fixed stride from the input image and then average the representations across all of these. We found that (c) was computationally extremely expensive. (b) performed best, and we report results for that approach on the VTAB test set.
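Option (c) can be sketched as a sliding-window loop, which also shows why it is costly: the encoder is invoked once per patch. The `encode` callable stands in for the patch-based model and is hypothetical, as are the function name and the [height, width, channels] layout; patch size and stride are left as parameters because the values used in the experiments are not given here.

```python
import numpy as np

def average_patch_representation(image, encode, patch_size, stride):
    """Encode all patches cropped at a fixed stride and average the features.

    image: [h, w, c] array; encode: callable mapping a
    [patch_size, patch_size, c] patch to a 1-D feature vector (hypothetical).
    """
    h, w, _ = image.shape
    features = [
        encode(image[top:top + patch_size, left:left + patch_size])
        for top in range(0, h - patch_size + 1, stride)
        for left in range(0, w - patch_size + 1, stride)
    ]
    return np.mean(features, axis=0)
```

The number of encoder calls grows quadratically as the stride shrinks, in contrast to option (b), which processes the whole image in a single forward pass.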

Appendix D Additional results

In Figs. 5 to 9 we provide per-data set comparisons of different model pairs to better understand the effect of increasing the model size, using AA, and co-training with different amounts of labeled images. All numbers are accuracies when using 1000 labels for fine-tuning.

Figure 5: Per-data set comparison of ImageNet-based exemplar SSL (Ex-ImageNet) with VIVI-Ex(4). Training on YT8M rather than ImageNet and exploiting temporal information mostly helps on natural (red) and structured (blue) data sets, and slightly hurts for some specialized (green) data sets.
Figure 6: Per-data set comparison of VIVI-Ex(4) and a 3× wider counterpart (VIVI-Ex(4)-Big). Increasing model capacity leads to an increase in accuracy for all natural (red) data sets and some structured (blue) and specialized (green) data sets. However, some structured and specialized data sets also incur a reduction in accuracy.
Figure 7: Per-data set comparison of VIVI-Ex(4) and a variant using AA. AA seems to benefit all data set categories similarly, and also leads to reductions in accuracy for a few data sets from all categories.
Figure 8: Per-data set comparison of VIVI-Ex(4) and its counterpart co-trained with 10% class-balanced ImageNet data (VIVI-Ex(4)-Co(10%)). Most data sets from each category incur an increase in accuracy, but one data set from each of the natural and structured categories suffers a significant loss in accuracy.
Figure 9: Effect of increasing the number of ImageNet images used for co-training from 10% (VIVI-Ex(4)-Co(10%)) to 100% (VIVI-Ex(4)-Co(100%)). The accuracy on the majority of natural (red) data sets is significantly increased, whereas most of the structured data sets incur a slight drop in accuracy.