Video representation learning is an important problem which benefits high-level perception tasks, including action recognition and video object detection [39, 6, 40]. It has many key applications, such as web-video retrieval, robot perception, and smart homes and cities. However, learning visual representations generally requires a large number of labeled training examples. This is even more so for videos, as videos are higher-dimensional inputs than images and video CNNs have more learnable parameters than 2D ones. At the same time, videos are more expensive to collect and annotate than images, as they require additional, and often ambiguous, temporal annotations. Additionally, in rare event detection, very few examples may be available to begin with. Thus, obtaining a good video representation without relying on domain-specific, annotated video samples has significant impact on real-world scenarios where large-scale video data curation and labeling is prohibitive.
In this paper, we present a new, principled method for unsupervised learning of video representations from unlabeled video data. It is based on the observation that an optimized combination of multiple self-supervised tasks (in this work we use unsupervised and self-supervised interchangeably), additionally encouraged by multi-modal distillation, is often sufficient to learn good feature representations. Importantly, we demonstrate that such a combination can be found without per-class or per-video labeling, by instead matching the representation statistics to a general power distribution of video classes, e.g., Zipf's law.
Our approach is to train the network so that its intermediate representations reflect not just information directly obtained from its own input modality (e.g., RGB images) but also information from different modalities (e.g., grayscale, optical flow, and audio). The idea is that synchronized multi-modal data sources should benefit the representation learning of each other, as they correspond to the same content. This is done by introducing 'distillation' losses between multiple streams of networks. The distillation losses, as well as the self-supervised tasks, do not rely on human annotation or supervision. As a result, our approach is formulated as multi-modal, multi-task unsupervised learning, where the tasks include single-modality tasks like frame ordering as well as multi-modality tasks like video-audio alignment.
However, combining multiple self-supervised task losses and distillation losses for unlabeled representation learning is a challenging problem, as certain tasks and modalities are more relevant to the final task than others, and different loss functions have different scales. Thus, we introduce the concept of using an evolutionary algorithm to obtain a better multi-modal, multi-task loss function that appropriately combines all the losses to train the network. AutoML has successfully been applied to architecture search and data augmentation. Here we extend this concept to unsupervised learning by automatically finding the weighting of self-supervised tasks for video representation learning. The 'fitness' of this evolution could naturally be measured with task-specific labels (e.g., accuracy). However, we instead propose a purely unsupervised alternative based on power law distribution matching between the datasets, using KL divergence constraints. These constraints do not require any labeled data, enabling fully unsupervised and unlabeled learning.
Our goal is to find video feature representations, based on a single RGB network, that can seamlessly improve supervised or unsupervised tasks without additional annotations. The main contributions are:
-  Formulation of unsupervised learning as multi-modal, multi-task learning, including distillation tasks to transfer features across modalities into a single-stream network. Once learned, it allows for faster representation computation while still capturing multi-modal features.
-  Evolutionary search for a loss function that automatically combines self-supervised and distillation tasks beneficial for unsupervised representation learning.
-  Introduction of an unsupervised representation evaluation metric based on power law distribution matching, which requires no labels and performs similarly to the label-guided one.
This work makes the surprising finding that large amounts of unlabeled data, combined with self-supervised tasks and power law distribution matching, produce very powerful feature representations, rivaled only by large datasets with very extensive labeling: our feature representations (obtained with zero labels) outperform ImageNet pre-training, as well as pre-training on small and medium-size labeled video datasets; they are only outperformed by Kinetics pre-training with full annotations, based on human labeling of more than 200,000 videos. Further, the proposed representations outperform Kinetics training when fine-tuned with Kinetics labels. We refer to the model as 'ELo', as it is based on evolving unsupervised losses.
2 Related Work
Unsupervised Video Representation Learning:
Obtaining labeled video data is expensive while unlabeled video data is plentiful, and many methods have been proposed for self-supervised learning of video representations. Some tasks take advantage of the temporal structure in videos, such as predicting whether frames appear in order, in reverse order, or shuffled, or enforcing color consistency across frames [12, 26, 30, 24, 31, 46, 19, 21, 44, 45, 41]. Other work has explored using the spatial structure present in images, such as predicting the relative spatial location of image patches or tracking patches over time, showing promising results. Reconstruction or prediction of future frames, or time-contrastive learning [18, 33], has also been successful for obtaining representations. Learning representations that take advantage of audio and video features has been explored by predicting whether an audio clip is from a video or whether audio and video are temporally aligned [29, 7, 22, 3].
Multi-task self-supervised learning has also shown promising results [10, 32, 47], where tasks are assumed to have equal weights and are not multi-modal. Generating weak labels using $k$-means clustering on CNN features [4, 5] or using clustering with meta-learning has also been explored. In this paper, we propose a generalized approach to unsupervised representation learning, allowing for multi-modal inputs and automatic discovery of the tasks that benefit recognition performance.
With the introduction of large activity recognition datasets (e.g., Kinetics and Moments in Time [20, 27]), much more accurate deep video CNNs have become possible. We show here that they can be further improved by unsupervised representation learning.
3 Method

We formulate unsupervised video representation learning as a combination of multi-task, multi-modal learning. The objective is not only to take advantage of multiple self-supervised tasks for learning a good representation space, but also to do so across multiple modalities. The idea is that models from synchronized multi-modal data, sharing the same semantic content, will benefit the representation learning of each other; we encourage this by introducing 'distillation' losses. At the same time, each modality may have multiple self-supervised tasks with their corresponding losses. Fig. 2 illustrates the multi-task, multi-modal formulation with multiple losses and distillation; Section 3.1 has the details.
To facilitate multi-task, multi-modal learning, importantly, we introduce the new concept of automatically evolving the main loss function. Certain tasks and modalities are more relevant to the final task so the representation needs to focus on those more than the others. The idea is to computationally search for how different multi-task and distillation losses should be combined, instead of constructing a loss function by trial-and-error. We discuss this more in Section 3.2.
One key technical question is how one can guide the evolution without a pre-defined task or a fitness function. We propose an unsupervised method to evaluate each loss function, based on matching of the power law distribution of activity classes (Section 3.2.2).
3.1 Unsupervised multi-modal learning
We construct a CNN for each modality. Each network is trained using several tasks not requiring labeled video data, and the information from each modality is combined using distillation (Fig. 2). More specifically, we take advantage of multiple self-supervised tasks, such as frame reconstruction, future frame prediction, and frame temporal ordering (Section 3.3 discusses them in detail). Each of these tasks yields an unsupervised loss for training. Learning with multiple self-supervised tasks makes our representations more generic, as they need to generalize to many tasks, and more transferable to unseen tasks.
For each modality $m$ and its input $x_m$, we build an embedding network $E_m$, which generates an embedded representation of the input: $z_m = E_m(x_m)$; $z_m$ is the feature representation for modality $m$. Our embedding networks are (2+1)D ResNet-50 models, which take advantage of both 2D spatial convolutions and 1D temporal convolutions to represent videos; they provide state-of-the-art performance on video understanding tasks. As mentioned, for each modality we consider several learning tasks, for example, frame reconstruction. Each of the tasks per modality has its own loss function: $L_{m,t}$ is the loss for task $t$ and modality $m$, and $T_m$ is the set of tasks for the modality.
Further, to better take advantage of the multi-modal representations, we use distillation to ‘infuse’ the other modalities into the RGB network at different locations. Our final objective is to train a single RGB network that provides a strong representation for video understanding. Our formulation allows the RGB network to learn representations from various tasks and modalities.
We combine the multi-task losses for unsupervised training, per modality, by a weighted sum, and further combine them with a number of distillation losses which fuse or synchronize the multiple modalities:

$$\mathcal{L} = \sum_{m} \sum_{t \in T_m} \lambda_{m,t} L_{m,t} + \sum_{d \in D} \gamma_d L_d \qquad (1)$$

where $\lambda_{m,t}$ and $\gamma_d$ are the weights of the task and distillation losses, and $D$ is the set of distillation connections. This weighted sum is the loss we use to train the entire model.
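As a concrete illustration, the weighted combination above can be sketched as a plain weighted sum; the dictionary keys, loss values, and weight values below are hypothetical placeholders, not the paper's actual tasks or evolved weights:

```python
def combined_loss(task_losses, distill_losses, task_weights, distill_weights):
    """Weighted sum of per-modality self-supervised task losses and
    cross-modal distillation losses, in the spirit of Eq. 1 (a sketch)."""
    total = 0.0
    for key, loss in task_losses.items():
        total += task_weights[key] * loss
    for key, loss in distill_losses.items():
        total += distill_weights[key] * loss
    return total

# Hypothetical per-task losses and weights, for illustration only.
task_losses = {("rgb", "reconstruction"): 0.8, ("flow", "ordering"): 0.5}
distill_losses = {"flow->rgb": 0.3}
task_weights = {("rgb", "reconstruction"): 0.2, ("flow", "ordering"): 0.7}
distill_weights = {"flow->rgb": 0.9}
total = combined_loss(task_losses, distill_losses, task_weights, distill_weights)
```

In the paper's setting, these weights are not hand-tuned but evolved (Section 3.2).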
Distillation was introduced to train smaller networks by matching the representations of deeper ones, or for generally transferring knowledge from pre-trained networks. Here, we use distillation to 'infuse' representations of different modalities into the main, RGB network. Note that we distill representations jointly while training. The distillation losses learn features by transferring information across modalities. More specifically, our formulation allows for the distillation of audio, optical flow, and temporal information into a single, RGB-based convolutional neural network.
The distillation loss is the difference between the activations of a layer in the main network and a layer in another modality's network, e.g., $L_d = \| a^{(l)} - a_m^{(l)} \|^2$, where $a^{(l)}$ and $a_m^{(l)}$ are the activations at layer $l$ of the main network and of the network for modality $m$, respectively. Such a constraint encourages the activations of the main network to match the activations of the other network, infusing the other features into the main network.
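A minimal sketch of such an activation-matching loss, assuming a mean squared error over the layer activations (the exact layers and norm used for distillation are not specified here, so treat both as assumptions):

```python
import numpy as np

def distillation_loss(main_activations, other_activations):
    """Mean squared difference between a layer's activations in the
    main (RGB) network and the corresponding layer in another
    modality's network."""
    return float(np.mean((main_activations - other_activations) ** 2))

rgb_act = np.zeros((2, 4))   # stand-in activations from the RGB stream
flow_act = np.ones((2, 4))   # stand-in activations from the flow stream
loss = distillation_loss(rgb_act, flow_act)
```

Minimizing this term drives the RGB stream's intermediate features toward those of the other modality, while the main gradient still flows through the RGB network.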
Distillation has previously been used for combining networks such as ensembles, or for learning to predict optical flow features from RGB; here, we extend its usage to multi-modal representation learning from unlabeled video data. While in principle distillation can happen across all modalities, we distill only into the RGB stream, so as to obtain a final single-tower, efficient representation for learning subsequent tasks. Using the learned weights of the RGB network, we can then extract representations for a set of videos.
3.2 Evolving an unsupervised loss function
Our representation learning is governed by the weight coefficients of the loss in Equation 1, and they need to be appropriately determined. The weighting should reflect the importance or relevance of each task and modality to the main task; for example, the optical flow modality may be important for tracking, whereas audio may give more information for temporal segmentation of videos in certain settings.
3.2.1 Unsupervised loss construction
Instead of manually constructing a loss function, we evolve the loss function by taking advantage of well-established evolutionary algorithms. More specifically, our search space consists of all the weights of the loss function, both task weights and distillation weights. Each weight is constrained to be in $[0, 1]$. Our evolutionary algorithm maintains a pool (i.e., population) of individuals, where each individual is a set of weight values that compose the final loss function.
3.2.2 Unsupervised Zipf distribution matching
The evolutionary algorithm requires evaluation of the loss function (i.e., a fitness measure) at each round to optimize the loss weight coefficients. We propose a new, unsupervised method for this. In order to measure the fitness of each individual (i.e., the set of weights combining the tasks and modalities into the final loss), we apply $k$-means clustering on the representation learned with the corresponding loss function and analyze the cluster distributions. We first train the network using a smaller subset (100k) of unlabeled, random YouTube videos for 10,000 iterations (using the corresponding loss function). We then take a subset of random YouTube videos and similarly extract their representations. Given these representations, we can then cluster them into $k$ clusters.
$k$-means clustering can be viewed as a Gaussian mixture model with fixed variances, and we calculate the probability of each feature vector belonging to a cluster, which reduces to computing a distance. Specifically, for the cluster centroids $c_j$, where $j = 1, \ldots, k$ (each a $D$-dimensional vector), we can compute the probability of a feature vector $z$ belonging to cluster $j$ as:

$$p(z \mid c_j) \propto \exp\left(-\|z - c_j\|^2\right)$$

Since we are (naively) assuming all clusters have the same variance (for simplicity, let $2\sigma^2 = 1$) and an equal prior over all clusters, we can use Bayes' rule to rewrite $p(c_j \mid z)$ as:

$$p(c_j \mid z) = \frac{\exp(-\|z - c_j\|^2)}{\sum_{i=1}^{k} \exp(-\|z - c_i\|^2)}$$

which we note is the standard softmax function applied to the negative squared distances from a feature $z$ to the cluster centers.
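In code, this soft assignment is just a softmax over negative squared distances; a NumPy sketch (the toy features and centroids are placeholders):

```python
import numpy as np

def cluster_probabilities(features, centroids):
    """p(c_j | z): softmax over negative squared Euclidean distances
    from each feature to each cluster center (equal-variance Gaussians,
    uniform prior over clusters)."""
    # Pairwise squared distances via broadcasting: (N, 1, D) - (1, k, D).
    d2 = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    logits = -d2
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

feats = np.array([[0.0, 0.0], [5.0, 5.0]])
cents = np.array([[0.0, 0.0], [5.0, 5.0]])
probs = cluster_probabilities(feats, cents)  # each row sums to 1
```

Each feature lands almost entirely in its nearest cluster here, since the distance gap is large relative to the unit variance.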
As observed in prior studies of activity recognition datasets, the activity classes of videos follow a Zipf distribution. We can use this as a prior constraint on the distribution of the videos in these clusters. Specifically, given the above probability of each video belonging to each cluster, and the Zipf distribution, we compute the prior probability of each class as

$$p(j) = \frac{1/j^s}{H_{k,s}},$$

where $H_{k,s} = \sum_{i=1}^{k} 1/i^s$ is the $k$th generalized harmonic number and $s$ is some real constant. We then let $q(j) = \frac{1}{N} \sum_{v=1}^{N} p(c_j \mid z_v)$, the average over all $N$ videos in the set. Using these two probability functions over the classes/clusters, we can minimize the KL divergence:

$$\text{KL}(q \,\|\, p) = \sum_{j=1}^{k} q(j) \log \frac{q(j)}{p(j)}$$
By using this as a fitness metric (maximizing the negative KL divergence), we pose a prior constraint over the distribution of (learned) video representations in clusters to follow the Zipf distribution. Note that this method requires no labeled data and is fully unsupervised. We refer to this entirely unsupervised method as 'ELo'.
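A sketch of this fitness computation, assuming clusters are ranked by their empirical mass before comparison to the Zipf prior (the value of $s$ and the ranking convention are assumptions, not details from the paper):

```python
import numpy as np

def zipf_prior(k, s=1.0):
    """Zipf distribution over k ranked classes: p(j) = (1/j^s) / H_{k,s}."""
    ranks = np.arange(1, k + 1)
    p = 1.0 / ranks ** s
    return p / p.sum()

def zipf_fitness(cluster_probs, s=1.0):
    """Negative KL divergence between the empirical cluster distribution
    (averaged soft assignments, sorted largest-first) and the Zipf prior.
    Higher is better; a perfectly Zipf-shaped clustering scores ~0."""
    q = np.sort(cluster_probs.mean(axis=0))[::-1]
    p = zipf_prior(len(q), s)
    return float(-np.sum(q * np.log(q / p + 1e-12)))

# Soft cluster assignments for two videos over three clusters (toy values).
probs = np.array([[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]])
fitness = zipf_fitness(probs)  # <= 0; closer to 0 means closer to Zipf
```

This score can serve as the evolutionary fitness without touching any labels: it only looks at the shape of the cluster-size distribution.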
As an upper bound and an alternative to our approach, we use a handful of class labels to evaluate the clustering (referred to as ELo-weak). This is done for the sake of comparison, and it is also a good alternative for aligning the final loss to a downstream video classification task. A subset of HMDB is used, and $k$-means clustering is applied to the output representation of the RGB stream. The clusters are used for nearest-neighbor classification, and the accuracy is the fitness of the individual. Due to randomness in $k$-means clustering, in both settings, we run this process 20 times and average the fitness across all trials.
3.2.3 Loss evolution
As is typical with evolutionary approaches, the evolution of the loss is driven by mutations. Since our search space consists of continuous values in $[0, 1]$, we compare two different evolutionary strategies: tournament selection and CMA-ES. For tournament selection, we mutate an individual loss function by randomly picking one weight and assigning it a new value uniformly sampled from $[0, 1]$. For CMA-ES, at each iteration, all the components are changed based on the fitness of all the individuals in the evolution pool. We evolve the loss function for 2000 rounds with tournament selection, generating and evaluating 2000 different loss functions, and use 250 rounds with CMA-ES, which converges much faster. Fig. 3 shows an example of how the weights evolve over the rounds, and Table 4 compares the performance of the different search methods. Since everything is differentiable, we could in principle learn these weights with gradient descent; however, we leave this for future exploration, as taking the derivative of the entire network w.r.t. the task weights is non-trivial.
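A toy sketch of the tournament-selection loop over loss-weight vectors; the population size, tournament size, and the stand-in fitness function are illustrative assumptions, with the toy fitness taking the place of the Zipf-matching measure:

```python
import random

def tournament_step(population, fitness_fn, tournament_size=5, rng=random):
    """One evolution round: sample a random tournament, copy its fittest
    member, mutate one randomly chosen weight to a fresh value in [0, 1],
    and replace the weakest individual in the pool with the child."""
    tournament = rng.sample(population, tournament_size)
    child = list(max(tournament, key=fitness_fn))
    child[rng.randrange(len(child))] = rng.uniform(0.0, 1.0)
    weakest = min(range(len(population)), key=lambda i: fitness_fn(population[i]))
    population[weakest] = child
    return child

# Toy fitness standing in for the unsupervised measure: prefer weight
# vectors whose entries sum close to 1.
fitness = lambda w: -abs(sum(w) - 1.0)
rng = random.Random(42)
population = [[rng.uniform(0.0, 1.0) for _ in range(4)] for _ in range(20)]
best_before = max(map(fitness, population))
for _ in range(200):
    tournament_step(population, fitness, rng=rng)
best_after = max(map(fitness, population))
```

Because the weakest individual is replaced each round, the best fitness in the pool never decreases, which is the property that makes the search converge toward well-weighted loss functions.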
3.3 Self-supervised tasks
Many tasks have been designed for unsupervised learning; we briefly describe the tasks we employ for our representation learning. Importantly, we allow many possible tasks and let the evolved loss function automatically discover which tasks are important, along with their optimal relative weightings. We also use tasks like DeepCluster and local aggregation.
Reconstruction and prediction tasks: Given the representation $z_m$ for a modality, we use a decoder CNN to generate the output. As the reconstruction is only used as a supervision signal, we do not need to generate extremely high quality reconstructions. Thus, we use a small, cheap decoder with only 6 convolutional layers, saving both memory and training time. Once the unsupervised learning is complete, we discard the decoders. Further, following previous work, our decoders have no temporal convolutions, forcing the representations to contain all needed temporal information, which is desirable for video understanding. Each modality (e.g., RGB, optical flow, and audio) is reconstructed. We also use several cross-modality transfer tasks: RGB to flow, flow to RGB, etc.
Another way to learn video temporal structure is to train a decoder to predict the next frames given the preceding ones. Future prediction of frames has previously been used for representation learning, and we perform this task for each modality. For these tasks, we minimize the $L_2$ distance between the ground truth $y$ and the predicted output $\hat{y}$:

$$L = \| y - \hat{y} \|^2$$
Temporal ordering: We use two tasks to learn representations that take advantage of temporal structure: binary classification of ordered vs. shuffled frames and binary classification of forward vs. backward videos. We use a single fully-connected layer to make a binary classification of the representation: $p = \sigma(W z_m)$, where $z_m$ is the representation for a modality. These are trained to minimize the binary cross-entropy:

$$L = -\left( y \log p + (1 - y) \log (1 - p) \right)$$

where $p$ is the output of the binary classifier and $y$ is the ground truth.
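A sketch of this classifier head and its loss; a single linear layer plus sigmoid is the usual choice, and the representation and weight values below are random stand-ins rather than learned parameters:

```python
import numpy as np

def ordering_loss(representation, weights, is_ordered):
    """Single fully-connected layer + sigmoid predicting whether the
    frames are in order, trained with binary cross-entropy."""
    logit = float(representation @ weights)
    p = 1.0 / (1.0 + np.exp(-logit))      # sigmoid
    y = 1.0 if is_ordered else 0.0
    return -(y * np.log(p + 1e-12) + (1.0 - y) * np.log(1.0 - p + 1e-12))

z = np.array([0.5, -1.0, 2.0])   # stand-in clip representation
w = np.array([1.0, 0.0, 1.0])    # stand-in classifier weights
loss_pos = ordering_loss(z, w, is_ordered=True)
```

With a positive logit, the ordered-frames label is cheap and the shuffled-frames label is expensive, which is exactly the gradient signal the task provides.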
Multi-modal contrastive loss: As videos contain multiple modalities, we want to take advantage of these different representations to learn a generic representation by using a multi-modal embedding space. Given the representations $z_m$ for each modality, we minimize a contrastive loss between the various embedding spaces:

$$L = \| z_m - z_n \|^2 + \max\left(0, \alpha - \| z_m - z_d \|^2\right)$$

where $z_m$ and $z_n$ are representations from the same video but different modalities, $z_d$ is a representation from a different video, and $\alpha$ is a margin. This task encourages the representations from the same video, but different modalities, to be close in the representation space, while representations from different videos are pushed further apart.
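A minimal sketch of this margin-based contrastive loss; the margin value and the toy embeddings are assumptions for illustration:

```python
import numpy as np

def multimodal_contrastive_loss(z_a, z_b, z_neg, margin=1.0):
    """Pulls representations of the same video from two modalities
    together and pushes a different video's representation at least
    `margin` away (hinge on the negative pair)."""
    pos = float(np.sum((z_a - z_b) ** 2))
    neg = float(np.sum((z_a - z_neg) ** 2))
    return pos + max(0.0, margin - neg)

z_rgb = np.array([0.1, 0.2])     # video A, RGB stream
z_audio = np.array([0.1, 0.2])   # video A, audio stream (aligned)
z_other = np.array([5.0, -3.0])  # a different video
loss = multimodal_contrastive_loss(z_rgb, z_audio, z_other)
```

When the positive pair already coincides and the negative is far beyond the margin, the loss is zero; swapping the roles of the positive and negative embeddings makes it large.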
Multi-modal alignment: We can further take advantage of both the temporal information and the multi-modal data by performing a multi-modal alignment task, illustrated in Fig. 4. The networks take as input temporally aligned samples from two modalities, together with a sample from one modality drawn from a temporally different region. The model is trained to make a binary prediction of whether the two samples are temporally aligned.
Table 1: HMDB classification accuracy (%) under the three evaluation settings of Section 4.2.

| Method | Clustering | Fixed + FC | Fine-tuned |
| --- | --- | --- | --- |
| Supervised using additional labeled data | | | |
| Scratch (No Pretraining) | 15.7 | 17.8 | 35.2 |
| Unsupervised using unlabeled videos | | | |
| Frame Shuffle | 22.3 | 24.3 | 28.4 |
| Reverse Detection | 21.3 | 24.3 | 27.5 |
| Audio/RGB Align [29, 22] | 32.4 | 36.8 | 40.2 |
| RGB to Flow | 31.5 | 36.4 | 39.9 |
| Predicting 4 future frames | 31.8 | 35.8 | 39.2 |
| Ours, weakly-sup clustering, using unlabeled videos | | | |
| Evolved Loss - ELo-weak | 45.7 | 64.3 | 67.8 |
| Ours, unsupervised, using unlabeled videos | | | |
| Random Loss (unsup.) | 26.4 | 26.9 | 31.2 |
| Evolved Loss - ELo (unsup.) | 43.4 | 64.5 | 67.4 |
4 Experiments

Unsupervised data source.
We use two million unlabeled YouTube video clips sampled randomly from the YouTube-8M dataset (limiting the size for computational reasons). Previous works on self-supervised learning used videos from existing datasets (e.g., Kinetics or AudioSet) while ignoring the labels, leading to a bias in the data, as those videos are trimmed to intervals containing specific activities. Using a random sample from YouTube is less prone to bias, as the videos are user generated and their labels are automatically tagged (with no human verification), and it potentially offers a very large training set (up to 8M videos). We have verified there is no overlap between those datasets and the ones the models are evaluated on (e.g., HMDB).
Evaluation datasets and protocols. We used the following widely used datasets for evaluation: HMDB, UCF101, and Kinetics. We also used ImageNet and Kinetics for reporting results with pre-training, as is customary in previous work. We use the standard protocols when evaluating video classification results on the labeled datasets, as adopted by prior work. Please see the sup. material for dataset details.
We use a (2+1)D ResNet-50 as our backbone network. Given a loss function, we train the network for 100 epochs on 2 million unlabeled videos. The learning rate is set to 0.1 (during both the evolution and the final training). We use cosine learning rate decay with a warmup period of 2 epochs.
During search, we used smaller networks, similar to a ResNet-18, for faster training; the fitness of each candidate model can be evaluated in 4 hours using 8 GPUs. The final model uses 64 GPUs for 3 days to train (equivalent to the time for training I3D/(2+1)D ResNet-50 on Kinetics).
Table 2: Comparison on HMDB and UCF101 (classification accuracy, %).

| Method | HMDB | UCF101 |
| --- | --- | --- |
| (2+1)D ResNet-50, Scratch | 35.2 | 63.1 |
| (2+1)D ResNet-50, ImageNet | 49.8 | 84.5 |
| (2+1)D ResNet-50, Kinetics | 74.3 | 95.1 |
| Weakly guided, HMDB | | |
| Evolved Loss (ours) | 67.8 | 94.1 |
| Evolved Loss (ours, no distillation) | 53.7 | 84.2 |
| Evolved Loss - ELo (ours) | 67.4 | 93.8 |
4.2 Comparison to previous methods
We evaluate our method in comparison to prior unsupervised and supervised representation learning. Specifically, we evaluate the representations in three settings: (1) $k$-means clustering of the representations, (2) fixing the weights of the network and training a single fully-connected layer for classification, and (3) fine-tuning the entire network.
We find that while all approaches outperform randomly initialized networks, only our evolved loss function outperforms ImageNet pretraining, and it performs comparably to the network pretrained on labeled Kinetics data (Table 1). Furthermore, we outperform all prior unsupervised methods. Our approach performs similarly to the weakly supervised version of our evolution method, despite being unsupervised. We also compare to a loss function randomly sampled from our search space, which performs poorly. We find that some tasks are not beneficial to representation learning; thus the evolution is quite important, as it automatically finds the best tasks and weightings.
In Table 2, we compare our approach to previously reported methods. We find that even though our approach uses more difficult unlabeled data, we still outperform the existing methods by a significant margin.
We also find that distillation is extremely important. Without it, the RGB network can only take advantage of the other modalities through a limited number of tasks (e.g., RGB-to-flow and audio/RGB alignment). To learn a single RGB network, many important and high-performing self-supervised tasks, such as flow-based and shuffling tasks, can only influence the weights through distillation.
[Table 3: classification accuracy for each method as a function of the number of labeled Kinetics samples: 400, 2k, 4k, 8k, 20k, 40k, 80k, 120k, 160k, and 225k (all samples).]
4.3 Improving supervised learning
Once we have learned a representation space using large amounts of unlabeled data, we want to determine how much labeled data is needed to achieve competitive performance. In Fig. 5 and Table 3, we compare various approaches trained using our unlabeled videos then fine-tuned on Kinetics using different amounts of labeled data. The Kinetics dataset has 225k labeled samples and we find that using only 25k (10%) yields reasonable performance (58.1% accuracy), only 11% lower than our baseline, fully-supervised model using all samples.
We are able to match performance using only 120k samples, about half the dataset. Using the entire dataset, we outperform the baseline network, due to better initializations and the distillation of modalities into the RGB stream.
4.4 Benefit of additional unlabeled data
We explore the effect of using different amounts of unlabeled data. Given a loss function, we train a network using varying numbers of unlabeled samples. As adding more data while keeping the number of epochs fixed increases the number of iterations, we compare training both with the number of iterations fixed and with the number of epochs fixed.
The results on HMDB are shown in Fig. 6. When fixing the number of iterations to 100k, the performance increases as we add more data, even though the number of epochs (e.g., number of times each sample is seen) decreases. This suggests that during unsupervised training, the use of more, diverse data is beneficial, even when samples are seen fewer times. When fixing the number of epochs to 100, we find that adding more data further improves performance, suggesting that more training plus more data is best.
4.5 Additional Analysis
Examining the weights of the evolved loss function allows us to check which tasks are more important for the target task. Fig. 7 illustrates the weights for several tasks over the 250 evolution rounds. We observe that tasks such as RGB frame shuffle get very low weights, suggesting they are not very useful for the action recognition task, while tasks such as audio alignment are quite important. The final fully evolved loss is shown in Fig. 8.
Table 4 compares different search methods. As seen, CMA-ES converges the fastest and to the best fitness. In Fig. 9, we compare the two different fitness measures, finding strong correlation. This suggests that Zipf matching is suitable for unsupervised representation evaluation.
5 Conclusion

We proposed a unified framework for multi-task, multi-modal unsupervised video representation learning and found that it benefits recognition tasks. We further introduced the concept of loss function evolution to automatically find the weights of the self-supervised tasks and modalities, with an unsupervised fitness measure. We find powerful unsupervised video representations that outperform prior self-supervised tasks and can match or improve the performance of networks trained on supervised data.
-  Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. Youtube-8m: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675, 2016.
-  Relja Arandjelovic and Andrew Zisserman. Look, listen and learn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 609–617, 2017.
-  Relja Arandjelovic and Andrew Zisserman. Objects that sound. In Proceedings of European Conference on Computer Vision (ECCV), 2018.
-  Miguel A Bautista, Artsiom Sanakoyeu, Ekaterina Tikhoncheva, and Bjorn Ommer. Cliquecnn: Deep unsupervised exemplar learning. In Advances in Neural Information Processing Systems (NIPS), pages 3846–3854, 2016.
-  Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In Proceedings of European Conference on Computer Vision (ECCV), pages 132–149, 2018.
-  Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
-  J. S. Chung and A. Zisserman. Out of time: automated lip sync in the wild. In Workshop on Multi-view Lip-reading, ACCV, 2016.
-  Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501, 2018.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.
-  Carl Doersch and Andrew Zisserman. Multi-task self-supervised visual learning. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
-  Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. Slowfast networks for video recognition. arXiv preprint arXiv:1812.03982, 2018.
-  Basura Fernando, Hakan Bilen, Efstratios Gavves, and Stephen Gould. Self-supervised video representation learning with odd-one-out networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
-  David E. Goldberg and Kalyanmoy Deb. A comparative analysis of selection schemes used in genetic algorithms. In Foundations of Genetic Algorithms, pages 69–93. Morgan Kaufmann, 1991.
-  Chunhui Gu, Chen Sun, Sudheendra Vijayanarasimhan, Caroline Pantofaru, David A. Ross, George Toderici, Yeqing Li, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, and Jitendra Malik. AVA: A video dataset of spatio-temporally localized atomic visual actions. arXiv preprint arXiv:1705.08421, 2017.
-  Nikolaus Hansen, Sibylle D Müller, and Petros Koumoutsakos. Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (cma-es). Evolutionary computation, 2003.
-  Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
-  Kyle Hsu, Sergey Levine, and Chelsea Finn. Unsupervised learning via meta-learning. In International Conference on Learning Representations, 2019.
-  Aapo Hyvarinen and Hiroshi Morioka. Unsupervised feature extraction by time-contrastive learning and nonlinear ICA. In Advances in Neural Information Processing Systems (NIPS) 29, 2016.
-  Dinesh Jayaraman and Kristen Grauman. Slow and steady feature analysis: higher order temporal coherence in video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
-  Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
-  Dahun Kim, Donghyeon Cho, and In So Kweon. Self-supervised video representation learning with space-time cubic puzzles. In AAAI, 2018.
-  Bruno Korbar, Du Tran, and Lorenzo Torresani. Cooperative learning of audio and video models from self-supervised synchronization. In Advances in Neural Information Processing Systems (NIPS), pages 7774–7785, 2018.
-  Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre. Hmdb: a large video database for human motion recognition. In Proceedings of the IEEE International Conference on Computer Vision (ICCV). IEEE, 2011.
-  Hsin-Ying Lee, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. Unsupervised representation learning by sorting sequences. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 667–676, 2017.
-  Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. In Proceedings of European Conference on Computer Vision (ECCV), 2018.
-  Ishan Misra, C Lawrence Zitnick, and Martial Hebert. Shuffle and learn: unsupervised learning using temporal order verification. In Proceedings of European Conference on Computer Vision (ECCV), 2016.
-  Mathew Monfort, Alex Andonian, Bolei Zhou, Kandan Ramakrishnan, Sarah Adel Bargal, Tom Yan, Lisa Brown, Quanfu Fan, Dan Gutfreund, Carl Vondrick, et al. Moments in time dataset: one million videos for event understanding. arXiv preprint arXiv:1801.03150, 2018.
-  Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In Proceedings of European Conference on Computer Vision (ECCV), pages 69–84, 2016.
-  Andrew Owens and Alexei A Efros. Audio-visual scene analysis with self-supervised multisensory features. In Proceedings of European Conference on Computer Vision (ECCV), 2018.
-  Dilip Krishnan Phillip Isola, Daniel Zoran and Edward H Adelson. Learning visual groups from co-occurrences in space and time. 2015.
-  Lyndsey C Pickup, Zheng Pan, Donglai Wei, YiChang Shih, Changshui Zhang, Andrew Zisserman, Bernhard Scholkopf, and William T Freeman. Seeing the arrow of time. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
-  Zhongzheng Ren and Yong Jae Lee. Cross-domain self-supervised multi-task feature learning using synthetic imagery. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
-  Pierre Sermanet, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, and Sergey Levine. Time-contrastive networks: Self-supervised learning from pixels. 2017.
-  Gunnar A Sigurdsson, Olga Russakovsky, and Abhinav Gupta. What actions are needed for understanding human actions in videos? arXiv preprint arXiv:1708.02696, 2017.
-  Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems (NIPS), pages 568–576, 2014.
-  K. Soomro, A. Roshan Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. In CRCV-TR-12-01, 2012.
-  Nitish Srivastava, Elman Mansimov, and Ruslan Salakhutdinov. Unsupervised learning of video representations using LSTMs. In International Conference on Machine Learning (ICML), pages 843–852, 2015.
-  Jonathan C Stroud, David A Ross, Chen Sun, Jia Deng, and Rahul Sukthankar. D3d: Distilled 3d networks for video action recognition. arXiv preprint arXiv:1812.08249, 2018.
-  Du Tran, Lubomir D Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. C3d: generic features for video analysis. CoRR, abs/1412.0767, 2(7):8, 2014.
-  Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6450–6459, 2018.
-  Carl Vondrick, Abhinav Shrivastava, Alireza Fathi, Sergio Guadarrama, and Kevin Murphy. Tracking emerges by colorizing videos. In Proceedings of European Conference on Computer Vision (ECCV), 2018.
-  Heng Wang, Alexander Kläser, Cordelia Schmid, and Cheng-Lin Liu. Action recognition by dense trajectories. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3169–3176. IEEE, 2011.
-  Xiaolong Wang and Abhinav Gupta. Unsupervised learning of visual representations using videos. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2794–2802, 2015.
-  X. Wang, K. He, and A. Gupta. Transitive invariance for self-supervised visual representation learning. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
-  Xiaolong Wang, Allan Jabri, and Alexei A. Efros. Learning correspondence from the cycle-consistency of time. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
-  Donglai Wei, Joseph Lim, Andrew Zisserman, and William T. Freeman. Learning and using the arrow of time. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
-  Amir Zamir, Alexander Sax, William Shen, Leonidas Guibas, Jitendra Malik, and Silvio Savarese. Taskonomy: Disentangling task transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
-  Chengxu Zhuang, Alex Lin Zhai, and Daniel Yamins. Local aggregation for unsupervised learning of visual embeddings. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 6002–6012, 2019.
-  George K. Zipf. Human behavior and the principle of least effort. Addison-Wesley, Cambridge, Massachusetts, 1949.
Appendix A Datasets
We compare our approach, referred to as ELo in the paper, on three standard video recognition datasets. Kinetics is a large-scale video dataset with over 200k labeled video clips for 400 different activity classes. Each clip is 10 seconds long, yielding over 500 hours of annotated video data. HMDB is a smaller dataset with around 3,000 training and 1,500 validation video clips for 51 different activities; on average, each video is 3 seconds long. UCF-101 is similar to HMDB, with 101 different actions and about 13,000 videos split into training and test sets.
Evaluating on both large-scale and smaller datasets shows that the representation obtained from unsupervised learning is general and works well even when labeled data is limited.
Appendix B Visualization of loss evolution
Fig. 10 shows the t-SNE embedding for our unsupervised approach compared to random weights, ImageNet-trained CNNs, and Kinetics-trained CNNs. ELo produces clearer video clusters than random or ImageNet weights, and is more comparable to the model trained with supervised Kinetics videos and labels.
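As a rough illustration of how such an embedding figure can be produced (a sketch, not our exact pipeline: it assumes clip-level features have already been extracted by the network, and uses a synthetic feature array as a stand-in), scikit-learn's TSNE projects the features to 2D points that can then be scatter-plotted:

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical stand-in for clip-level features from a video CNN:
# 60 clips, 512-dimensional embeddings (random data for illustration only).
rng = np.random.default_rng(0)
features = rng.normal(size=(60, 512))

# Project to 2D for visualization; perplexity must be smaller than the
# number of clips, and PCA initialization makes runs more stable.
tsne = TSNE(n_components=2, perplexity=10, init="pca", random_state=0)
embedding = tsne.fit_transform(features)

print(embedding.shape)  # (60, 2): one 2D point per clip, ready to scatter-plot
```

With real features, coloring each point by its (held-out) class label is what reveals whether the representation forms class-consistent clusters.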
Appendix C Supplemental Results
Fig. 11 visualizes the filters learned by our approach compared to other approaches: it shows filters from random initialization, supervised learning on large labeled data, previous self-supervised learning, and our unsupervised method ELo. ELo filters are quite similar to those learned with labeled data, but do show some differences.
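One simple way to produce such a filter figure (a sketch under the assumption that the first-layer convolution weights are available as an array of shape `(num_filters, height, width, 3)`; the weights below are random stand-ins) is to normalize each filter independently and tile them into a single image grid:

```python
import numpy as np

def filter_mosaic(weights, cols=8, pad=1):
    """Tile first-layer conv filters into one grid image for display.

    weights: array of shape (num_filters, h, w, 3), e.g. first-layer RGB filters.
    Returns an image of shape (rows*(h+pad)+pad, cols*(w+pad)+pad, 3) in [0, 1].
    """
    n, h, w, c = weights.shape
    rows = int(np.ceil(n / cols))
    grid = np.ones((rows * (h + pad) + pad, cols * (w + pad) + pad, c))
    for i in range(n):
        f = weights[i]
        # Rescale each filter to [0, 1] so all filters are visible regardless
        # of their absolute weight magnitudes.
        f = (f - f.min()) / (f.max() - f.min() + 1e-8)
        r, col = divmod(i, cols)
        y, x = pad + r * (h + pad), pad + col * (w + pad)
        grid[y:y + h, x:x + w] = f
    return grid

# Random stand-in for 64 learned 7x7 RGB filters.
rng = np.random.default_rng(0)
mosaic = filter_mosaic(rng.normal(size=(64, 7, 7, 3)))
print(mosaic.shape)  # (65, 65, 3): an 8x8 grid of 7x7 filters with 1-pixel padding
```

The resulting array can be shown directly with `matplotlib.pyplot.imshow`; qualitative comparisons like Fig. 11 simply place such mosaics from differently trained models side by side.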