Video Representation Learning with Visual Tempo Consistency

by   Ceyuan Yang, et al.

Visual tempo, which describes how fast an action goes, has shown its potential in supervised action recognition. In this work, we demonstrate that visual tempo can also serve as a self-supervision signal for video representation learning. We propose to maximize the mutual information between representations of slow and fast videos via hierarchical contrastive learning (VTHCL). Specifically, by sampling the same instance at slow and fast frame rates respectively, we can obtain slow and fast video frames which share the same semantics but differ in visual tempo. Video representations learned from VTHCL achieve competitive performance under the self-supervision evaluation protocol for action recognition on UCF-101 (82.1%) and HMDB-51 (49.2%). Moreover, we show that the learned representations also generalize well to other downstream tasks, including action detection on AVA and action anticipation on Epic-Kitchen. Finally, our empirical analysis suggests that a more thorough evaluation protocol is needed to verify the effectiveness of self-supervised video representations across network structures and downstream tasks.





1 Introduction

In recent years, great progress has been made in representation learning, especially self-supervised learning from images. The visual features obtained in a self-supervised manner are getting very close to those from supervised training on ImageNet Deng et al. (2009). Meanwhile, representing videos in a compact and informative way is also crucial for many analysis tasks, since videos are redundant and noisy in their raw forms. However, supervised video representation learning demands a huge number of annotations, which in turn encourages researchers to investigate self-supervised learning schemes to harvest the massive amount of unlabelled videos.

Videos contain rich motion dynamics along the temporal dimension. Thus, if we can make the best of the underlying consistency as well as the causal dependencies in the activities occurring in videos, we can better leverage the large amount of unlabelled data for representation learning. For instance, previous attempts learn video representations by predicting the correct order of shuffled frames Misra et al. (2016), the arrow of time Wei et al. (2018), and the frames and motion dynamics in the future Wang et al. (2019b); Han et al. (2019). Considering the recent success of exploiting visual tempo in action recognition tasks Feichtenhofer et al. (2019); Yang et al. (2020), in this work we aim at exploring visual tempo for self-supervised video representation learning.

Visual tempo, which describes how fast an action goes, is an essential variation factor of video semantics. In particular, an action instance can be performed and observed at different tempos due to multiple factors, including the mood and age of the performer and the configuration of the observer, so the resulting video varies from case to case. Nonetheless, the same instance observed at different tempos is supposed to share high similarity in terms of its discriminative semantics, which is exactly the underlying consistency we exploit for self-supervised representation learning.

While visual tempo could be utilized by directly predicting the correct tempo of a given action instance, as in previous attempts Benaim et al. (2020); Misra et al. (2016); Wei et al. (2018); Wang et al. (2019b), we argue that such a predictive approach may force the learned representations to capture the information that distinguishes between visual tempos, which is not necessarily related to the discriminative semantics we are looking for. Therefore, we propose an alternative approach based on contrastive learning Hadsell et al. (2006); He et al. (2020); Tian et al. (2019); Chen et al. (2020b), which maximizes the mutual information between representations across videos from the same action instance but with different visual tempos. Specifically, we formulate self-supervised video representation learning as a consistency measurement between a pair of videos, which contains frames from the same action instance sampled at a slow and a fast visual tempo respectively. The learning is conducted by adopting a slow and a fast video encoder, and taking in turn a video from each pair as the query to distinguish its counterpart from a set of negative samples. In this way, the resulting video representations are expected to capture the shared information and better retain its discriminative power.

As shown in the literature Yang et al. (2020), the feature hierarchy inside a video network (e.g. I3D Carreira and Zisserman (2017)) already reflects semantics at various visual tempos, so we further propose a hierarchical contrastive learning scheme, where we use network features across multiple depths as queries. Such a scheme not only leverages the variation of visual tempo more effectively but also provides a stronger supervision for deeper networks. Evaluated thoroughly on a wide variety of downstream action understanding tasks, including action recognition on UCF-101 Soomro et al. (2012) and HMDB-51 Kuehne et al. (2011), action detection on AVA Gu et al. (2018), and action anticipation on Epic-Kitchen Damen et al. (2018), we find that the representations learned by exploiting visual tempo consistency are highly discriminative and generalizable, leading to competitive performance in self-supervised video representation learning.

We summarize our contributions as follows: a) We demonstrate that visual tempo can serve as a strong supervision signal for unsupervised video representation learning, and we exploit it via the proposed hierarchical contrastive learning scheme. b) We show that our proposed framework achieves competitive performance for action recognition on UCF-101 and HMDB-51, and generalizes well to other downstream tasks such as action detection and action anticipation. c) We point out the limitations of the current evaluation protocol for video representations, which should be improved to include more detailed ablation studies across network structures and diverse downstream tasks.

2 Related Work

Self-supervised video representation learning. Various pretext tasks have been explored for self-supervised video representation learning, such as modeling the cycle-consistency between two videos of the same category Dwibedi et al. (2019), modeling the cycle-consistency of time Wang et al. (2019c), predicting the temporal order of frames Fernando et al. (2017); Lee et al. (2017); Misra et al. (2016); Wei et al. (2018), predicting future motion dynamics and frames Wang et al. (2019b); Han et al. (2019); Oord et al. (2018) as well as predicting the color of frames Vondrick et al. (2018). In this work, we explore a different pretext task, which models the consistency between videos from the same action instance but with different visual tempos. There are also works that learn video representations using not only videos themselves but also corresponding text Sun et al. (2019a, b); Miech et al. (2019a) and audios Korbar et al. (2018); Arandjelovic and Zisserman (2017); Alwassel et al. (2019); Patrick et al. (2020). In contrast to those works, we learn compact video representations from RGB frames only.

Contrastive learning. Due to their promising performances, contrastive learning and its variants Bachman et al. (2019); Hénaff et al. (2019); Hjelm et al. (2019); Oord et al. (2018); Tian et al. (2019); Wu et al. (2018); He et al. (2020); Chen et al. (2020a) are considered as an important direction for self-supervised representation learning. Particularly, the most related work is the contrastive multiview coding (CMC) Tian et al. (2019), which learns video representations by maximizing the mutual information between RGB and flow data of the same frames. The difference is that in this work we learn video representations via the consistency between videos of the same action instance but with different visual tempos. Moreover, we further introduce a hierarchical scheme to leverage such a consistency at different depths of the encoding network, providing a stronger supervision for training deeper networks.

3 Learning from Visual Tempo Consistency

The goal of self-supervised video representation learning is to learn a video encoder that is able to produce compact and informative video representations, by regarding the structural knowledge and the consistency among a set of unlabelled videos as the self-supervision signal. The discriminativeness of the learned representations is often verified through a set of downstream tasks (e.g. action classification, action detection and action anticipation). While various supervisions have been proposed by previous attempts, in this work we introduce the visual tempo consistency, a novel and effective self-supervision signal for video representation learning. We start by discussing what the visual tempo consistency is and why it is a strong supervision signal, and then introduce its learning process.

3.1 Visual Tempo Consistency as a Self-supervision Signal

Following Yang et al. (2020), we refer to visual tempo as how fast an action goes in an action video. As an internal variation factor of these videos, the visual tempos of actions across different classes have a large variance. In previous literature Yang et al. (2020); Feichtenhofer et al. (2019), the benefits of considering the variance of visual tempo in supervised recognition tasks have been well explored. A question then arises: can such a variance also benefit self-supervised learning? With a proper formulation, we show that the variance of visual tempo can serve as an effective and promising self-supervision signal.

Specifically, as shown in Feichtenhofer et al. (2019), we can adjust the sampling rate of frames to get videos of the same action instance but with different visual tempos. Without loss of generality, we use videos of two different sampling rates and refer to them as fast and slow videos, x^f and x^s. To avoid mixing inputs with different distributions in a single backbone Huang et al. (2018), we introduce two encoders, f and g, respectively for fast and slow videos, and learn them by matching the representations of an action instance's corresponding fast and slow videos.
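The sampling scheme described above can be sketched as follows; the defaults (64-frame raw clips, slow stride 8) follow the setup reported later in Sec.4.1, but the helper itself is illustrative:

```python
import numpy as np

def sample_slow_fast(raw_clip_len=64, slow_stride=8, tau=2):
    """Sample frame indices for a slow/fast clip pair from one raw clip.

    Both clips span the same raw frames, so they share semantics and
    differ only in visual tempo (the fast clip is tau-times denser).
    """
    slow_idx = np.arange(0, raw_clip_len, slow_stride)         # e.g. 8 frames
    fast_idx = np.arange(0, raw_clip_len, slow_stride // tau)  # e.g. 16 frames
    return slow_idx, fast_idx
```

With these defaults, the slow clip contains 8 frames and the fast clip 16 frames covering the same temporal extent, matching the inputs of the two encoders.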

The intuition behind such an approach for video representation learning is that, first, learning via the consistency between multiple representations has been shown to be more effective than learning by prediction Tian et al. (2019); He et al. (2020); Chen et al. (2020b, a). Moreover, while previous attempts resort to matching representations of different patches Wu et al. (2018) or different views (e.g. RGB and optical flow) Tian et al. (2019) of the same instance, the inputs behind these representations intrinsically have different semantics. On the contrary, the semantics of an instance's fast and slow videos are almost identical, with visual tempo being the only difference. Encouraging the representation consistency between videos of the same instance but with different visual tempos thus provides a stronger supervision signal.

Figure 1: Framework. (a) The same instance sampled at different tempos (slow and fast) should share high similarity in terms of their discriminative semantics, while being dissimilar to other instances (grey dots). (b) The features at various depths of the networks allow us to construct hierarchical representation spaces.

3.2 Adopting Visual Tempo Consistency via Contrastive Learning

We apply contrastive learning to train our encoders f and g. Specifically, given two sets of videos X^f = {x_1^f, ..., x_N^f} and X^s = {x_1^s, ..., x_N^s}, where the i-th pair (x_i^f, x_i^s) contains two videos of the same i-th instance but with different visual tempos, we can get their corresponding representations z_i^f and z_i^s by

z_i^f = f(x_i^f), \qquad z_i^s = g(x_i^s),

where we refer to z_i^f and z_i^s as the fast and slow representations of the i-th instance. Learning f and g based on the visual tempo consistency involves two directions. For each fast representation z_i^f, we encourage the similarity between z_i^f and its slow representation counterpart z_i^s, while decreasing the similarities between it and other slow representations. This process also applies to each slow representation. Subsequently we can obtain the loss functions:

\mathcal{L}^f = -\sum_{i=1}^{N} \log \frac{h(z_i^f, z_i^s)}{\sum_{j=1}^{N} h(z_i^f, z_j^s)}, \qquad \mathcal{L}^s = -\sum_{i=1}^{N} \log \frac{h(z_i^s, z_i^f)}{\sum_{j=1}^{N} h(z_i^s, z_j^f)},

where h(\cdot, \cdot) is a function measuring the similarity between two representations. h can be calculated by

h(u, v) = \exp\left( \frac{\phi(u)^\top \phi(v)}{\lVert \phi(u) \rVert \, \lVert \phi(v) \rVert \cdot \tau} \right),

where \tau is the temperature hyperparameter Wu et al. (2018) and \phi is a learnable non-linear mapping. As suggested by Chen et al. (2020a, b), applying such a mapping function can substantially improve the learned representations.
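A minimal numpy sketch of the symmetric contrastive objective above (using only in-batch negatives and no memory banks; the function and variable names are ours, and the non-linear mapping phi is omitted for brevity):

```python
import numpy as np

def vthcl_loss(z_fast, z_slow, temperature=0.07):
    """Symmetric contrastive loss between fast and slow representations.

    z_fast, z_slow: (N, d) arrays where row i of each array encodes the
    same instance at a different tempo. Cosine similarity plays the role
    of the similarity function h.
    """
    zf = z_fast / np.linalg.norm(z_fast, axis=1, keepdims=True)
    zs = z_slow / np.linalg.norm(z_slow, axis=1, keepdims=True)
    logits = zf @ zs.T / temperature                    # (N, N) similarities
    # Fast-to-slow direction: positive pair (i, i), negatives (i, j != i).
    loss_f = -(logits.diagonal() - np.log(np.exp(logits).sum(axis=1))).mean()
    # Slow-to-fast direction uses the transposed similarity matrix.
    loss_s = -(logits.diagonal() - np.log(np.exp(logits.T).sum(axis=1))).mean()
    return loss_f + loss_s
```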

Memory bank. It is non-trivial to scale up if we extract the features of all videos at each iteration. Consequently, we reduce the computation overhead by maintaining two memory banks B^f and B^s of size N x C as in Wu et al. (2018); Tian et al. (2019), where C is the dimension of representations. B^f and B^s respectively store the approximated representations of fast and slow videos. Representations stored in B^f and B^s are accumulated over iterations as

m_i \leftarrow \alpha \, m_i + (1 - \alpha) \, z_i,

where m_i is an entry of B^f or B^s, z_i is the corresponding fast or slow representation, and \alpha is the momentum coefficient that ensures smoothness and stability. Based on B^f and B^s, the learning process thus becomes taking a mini-batch of fast videos as queries, computing the loss function \mathcal{L}^f based on their representations obtained via f and the slow representations stored in B^s. \mathcal{L}^s can be computed in a similar manner. It is worth noting that one can further reduce the computation overhead by sampling representations from each bank rather than using the entire bank when computing \mathcal{L}^f and \mathcal{L}^s, or by adopting noise contrastive estimation as in Wu et al. (2018); Tian et al. (2019).
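The momentum update above can be sketched as follows (the momentum value and the helper name are illustrative, not taken from the paper):

```python
import numpy as np

def update_bank(bank, indices, new_reps, momentum=0.5):
    """Momentum update of a memory bank (one such bank per view).

    bank: (num_videos, d) stored representations; new_reps: (batch, d)
    freshly computed representations for the videos at `indices`.
    """
    bank[indices] = momentum * bank[indices] + (1.0 - momentum) * new_reps
    # Keep stored entries unit length so dot products remain cosine similarities.
    bank[indices] /= np.linalg.norm(bank[indices], axis=1, keepdims=True)
    return bank
```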

3.3 Learning from Visual Tempo via Hierarchical Contrastive Learning

While we usually use the final output of f and g as the representation of an input video, it is known Feichtenhofer et al. (2019); Yang et al. (2020) that popular choices of f and g (e.g. I3D Carreira and Zisserman (2017)) contain a rich temporal hierarchy inside their architectures, i.e. features of these networks at different depths already encode various temporal information due to their varying temporal receptive fields. Inspired by this observation, we propose to extend the loss functions above to a hierarchical scheme, so that we can provide f and g a stronger supervision. The framework is shown in Fig.1. Particularly, the original contrastive learning can be regarded as a special case where only the final feature is used. Specifically, we use features at different depths of f and g as multiple representations of an input video, i.e. replacing z_i^f and z_i^s of the i-th fast and slow videos with {z_{i,d}^f}_{d \in D} and {z_{i,d}^s}_{d \in D}, where D is the set of depths we choose to extract features from f and g. For instance, we could collect the output of each residual stage in 3D-ResNet Carreira and Zisserman (2017) to construct the set D. Accordingly, the original two memory banks are extended to a total of 2|D| memory banks, and the final loss function is extended to \mathcal{L} = \sum_{d \in D} \mathcal{L}_d, where

\mathcal{L}_d = \mathcal{L}_d^f + \mathcal{L}_d^s

is the contrastive loss of Sec.3.2 computed with the depth-d features and their corresponding memory banks.
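Assuming per-depth projections that map features to a common comparison space, the hierarchical objective can be sketched as follows (a simplification of the scheme, with in-batch negatives instead of memory banks; names are ours):

```python
import numpy as np

def contrastive_at_depth(zf, zs, temperature=0.07):
    # Symmetric per-depth contrastive loss (cosine similarity as h).
    zf = zf / np.linalg.norm(zf, axis=1, keepdims=True)
    zs = zs / np.linalg.norm(zs, axis=1, keepdims=True)
    logits = zf @ zs.T / temperature
    loss_f = -(logits.diagonal() - np.log(np.exp(logits).sum(axis=1))).mean()
    loss_s = -(logits.diagonal() - np.log(np.exp(logits.T).sum(axis=1))).mean()
    return loss_f + loss_s

def hierarchical_loss(fast_feats, slow_feats):
    """Sum the contrastive loss over features from several network depths.

    fast_feats / slow_feats: lists of (N, d_k) arrays, one per chosen depth,
    e.g. projected outputs of successive residual stages.
    """
    return sum(contrastive_at_depth(zf, zs)
               for zf, zs in zip(fast_feats, slow_feats))
```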


4 Experiments

We conduct a series of comprehensive experiments following the standard protocol of evaluating video representations from self-supervised learning. Specifically, we pretrain video encoders with the proposed VTHCL on a large-scale dataset (i.e. Kinetics-400 Carreira and Zisserman (2017)) and then finetune the encoders on the target dataset corresponding to a certain downstream task (e.g. UCF-101 and HMDB-51 for action recognition). In practice, we regard the encoder for slow videos as our main encoder used for evaluation. To ensure reproducibility, all implementation details are included in Sec.4.1. Main results on action recognition are presented in Sec.4.2 with comparisons to prior approaches. Sec.4.3 includes ablation studies on the components of VTHCL. To further demonstrate the effectiveness of VTHCL and show the limitation of the current evaluation protocol, we evaluate VTHCL on diverse downstream tasks, including action detection on AVA Gu et al. (2018) and action anticipation on Epic-Kitchen Damen et al. (2018), in Sec.4.4.1 and Sec.4.4.2 respectively. It is worth noting that all experiments are conducted on a single modality (i.e. RGB frames) and evaluated on the corresponding validation set unless stated otherwise.

4.1 Implementation Details

Backbone. The two paths of SlowFast Feichtenhofer et al. (2019) without lateral connections are adopted as the fast encoder f and the slow encoder g, which are modified from 2D ResNet He et al. (2016) by inflating 2D kernels Carreira and Zisserman (2017). The main difference between the two encoders is the network width and the number of inflated blocks. Importantly, after self-supervised training, only the slow encoder is adopted for the various tasks.

Training Protocol. Following Tian et al. (2019); He et al. (2020); Wu et al. (2018); Chen et al. (2020a), video encoders in VTHCL are randomly initialized by default. Synchronized SGD serves as our optimizer, whose weight decay and momentum are set to 0.0001 and 0.9 respectively. The initial learning rate is set to 0.03 with a total batch size of 256. The half-period cosine schedule Loshchilov and Hutter (2017) is also adopted to adjust the learning rate (200 epochs in total). Following the hyperparameters in Wu et al. (2018); Tian et al. (2019), the temperature τ is set to 0.07 and the number of sampled representations is 16384.

Dataset. Kinetics-400 Carreira and Zisserman (2017) contains around 240k training videos and serves as the large-scale benchmark for self-supervised representation learning. We extract video frames at the raw frame rate (FPS) and sample 64 consecutive frames as a raw clip, which can be re-sampled to produce slow and fast clips at specific strides s and s/τ (τ > 1) separately. Unless stated otherwise, the sample stride s is 8, i.e. our model will take 8 frames (64/8) as input.

4.2 Action Recognition

Setup. In order to conduct a fair comparison, following prior works we finetune the learned video encoders of VTHCL on the UCF-101 Soomro et al. (2012) and HMDB-51 Kuehne et al. (2011) datasets for action recognition. Particularly, we obtain the video accuracy via the standard protocol Feichtenhofer et al. (2019); Yang et al. (2020); Wang et al. (2018), i.e. uniformly sampling 10 clips of the whole video and averaging the softmax probabilities of all clips as the final prediction. We train our models for 100 epochs with a total batch size of 64 and an initial learning rate of 0.1, which is reduced by a factor of 10 at the 40th and 80th epochs respectively. Moreover, when pre-training on Kinetics-400 Carreira and Zisserman (2017), a three-level contrastive hierarchy is constructed, i.e. we collect features from the outputs of the last three residual stages, due to the limitation of GPU resources. Unless stated otherwise, τ is set to 2 for the fast clips by default (the sample stride of the fast encoder is s/τ = 4). Namely, the slow and fast encoders take 8 and 16 frames as input separately.
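The 10-clip evaluation protocol can be sketched as follows (the helper is ours; it assumes raw per-clip logits are already computed by the finetuned encoder):

```python
import numpy as np

def video_prediction(clip_logits):
    """Average softmax probabilities over uniformly sampled clips.

    clip_logits: (num_clips, num_classes) raw scores, one row per clip
    sampled from the same video; returns the predicted class index.
    """
    shifted = clip_logits - clip_logits.max(axis=1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    return int(probs.mean(axis=0).argmax())
```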

Main Results. Table 1 illustrates the comparison between ours and other state-of-the-art approaches. Here all the methods utilize only a single modality. Besides, the results using different types of initialization (i.e. random, ImageNet inflated and Kinetics pretrained) are also included to serve as the lower/upper bounds. In particular, our method equipped with the shallower network (3D-ResNet18) achieves top-1 accuracies of 80.6% and 48.6% on UCF-101 and HMDB-51 respectively, outperforming previous works with a similar setting by large margins. Furthermore, increasing the capacity of the network from 3D-ResNet18 to 3D-ResNet50 introduces a consistent improvement, achieving 82.1% and 49.2% top-1 accuracies. Compared to the supervised results of similar backbones obtained using a random initialization (i.e. 68.0% and 61.1% on UCF-101 Soomro et al. (2012) for 3D-ResNet18 and 3D-ResNet50), our method significantly decreases the gap between self-supervised and supervised video representation learning.

Method Architecture UCF-101 Soomro et al. (2012) HMDB-51 Kuehne et al. (2011)
Random/ImageNet/Kinetics 3D-ResNet18 68.0/83.0/92.6 30.8/48.2/66.7
Random/ImageNet/Kinetics 3D-ResNet50 61.1/86.2/94.8 21.7/51.8/69.3
MotionPred Wang et al. (2019a) C3D 61.2 33.4
RotNet3D Jing and Tian (2018) 3D-ResNet18 62.9 33.7
ST-Puzzle Kim et al. (2019) 3D-ResNet18 65.8 33.7
ClipOrder Xu et al. (2019) R(2+1)D-18 72.4 30.9
DPC Han et al. (2019) 3D-ResNet34 75.7 35.7
AoT Wei et al. (2018) T-CAM 79.4 -
SpeedNet Benaim et al. (2020) I3D 66.7 43.7
VTHCL-R18 (Ours) 3D-ResNet18 80.6 48.6
VTHCL-R50 (Ours) 3D-ResNet50 82.1 49.2
Table 1: Comparison with other state-of-the-art methods on UCF-101 and HMDB-51. Note that only the top-1 accuracies are reported

Effect of Architectures. Beyond the competitive performances, Table 1 also raises awareness of the effect of various backbones. Intuitively, when increasing network capacity, the learned representations should become better. For example, works in image representation learning Kolesnikov et al. (2019); He et al. (2020); Chen et al. (2020b); Tian et al. (2019) confirm that networks with larger capacities can boost the quality of learned representations. As for video representation learning, it can be seen from Table 1 that, when networks are well initialized (e.g. supervised pretraining on ImageNet or Kinetics, or self-supervised training using VTHCL on Kinetics), the one with a larger capacity indeed outperforms its counterpart. Particularly, when randomly initialized, 3D-ResNet50 performs worse on UCF-101 and HMDB-51 than 3D-ResNet18, although it has a relatively larger capacity. It indicates that the number of parameters of 3D-ResNet50 is too large compared to the scale of UCF-101 and HMDB-51, so that it suffers from overfitting. Therefore, while prior works usually employed a relatively shallow model (e.g. 3D-ResNet18) in the evaluation, it is important to also test a heavy backbone to see whether a proposed method performs consistently across backbones. Further discussion on other downstream tasks can be found in Sec.4.4.

Backbone ImageNet pretrained ImageNet inflated VTHCL (Ours)
3D-ResNet18 68.5 45.9 33.8
3D-ResNet50 74.9 53.1 37.8
Table 2: Linear classification on Kinetics-400 Carreira and Zisserman (2017). Top-1 accuracy is reported

Linear classification. The linear protocol on a large-scale benchmark (e.g. ImageNet Deng et al. (2009)) provides a direct and precise evaluation of the learned representations, and is widely used in image representation learning Tian et al. (2019); He et al. (2020); Chen et al. (2020b, a). Specifically, the features are frozen and only a linear classifier is trained in a supervised manner, whereas the current evaluation protocol for videos mainly finetunes all parameters on small datasets. Therefore, we also conduct linear classification experiments on Kinetics-400 Carreira and Zisserman (2017). Table 2 presents the results under different training protocols. Specifically, ImageNet pretrained denotes that all parameters, including the backbone and the linear classifier, are learnable after being initialized with ImageNet weights, which serves as the reference bound. ImageNet inflated and Ours only train the linear classifier on the features extracted by the corresponding models. Obviously, there exists a performance gap between fully- and self-supervised methods, especially on the large-scale benchmark.
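The linear protocol amounts to multinomial logistic regression on frozen features; this plain-numpy trainer is an illustrative sketch, not the evaluation code used in the paper:

```python
import numpy as np

def train_linear_probe(feats, labels, num_classes, lr=0.1, epochs=100):
    """Fit a linear classifier on frozen (N, d) backbone features."""
    n, d = feats.shape
    W = np.zeros((d, num_classes))
    onehot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        logits = feats @ W
        shifted = logits - logits.max(axis=1, keepdims=True)
        probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
        W -= lr * feats.T @ (probs - onehot) / n  # cross-entropy gradient step
    return W
```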

τ=1 τ=2 τ=4
R18 78.2/45.2 79.5/47.4 80.0/48.2
R50 80.3/47.3 80.9/47.7 80.6/48.0
(a) Various visual tempos. τ denotes the relative coefficient of the sample stride for fast clips
k=1 k=2 k=3
R18 79.5/47.4 80.3/47.9 80.6/48.6
R50 80.9/47.7 81.5/48.5 82.1/49.2
(b) Various levels of contrastive formulation. k denotes the number of levels of the contrastive formulation
Table 3: Ablation studies on visual tempo and the hierarchical contrastive formulation. We report top-1 accuracy on UCF-101 Soomro et al. (2012) and HMDB-51 Kuehne et al. (2011) respectively

4.3 Ablation Studies

Here we include the ablation study to investigate the effect of different VTHCL components.

Effect of relative visual tempo difference. Although Table 1 shows that VTHCL obtains competitive results on UCF-101 Soomro et al. (2012) and HMDB-51 Kuehne et al. (2011), it remains uncertain whether the relative visual tempo difference between slow and fast videos significantly affects the performance of VTHCL. We thus conduct multiple experiments by adjusting the relative coefficient τ of the sample stride. Specifically, 8, 16 and 32 frames (τ = 1, 2, 4) are respectively fed into the fast encoder while maintaining the number of frames for the slow encoder at 8. When τ is 1, the input is exactly the same for both slow and fast encoders. In this case, VTHCL actually turns into the instance discrimination task Wu et al. (2018), which distinguishes video instances mainly via appearance instead of utilizing visual tempo consistency. Such a setting thus serves as our baseline to tell whether visual tempo could help learn better video representations. Moreover, to avoid unexpected effects, we do not apply the hierarchical scheme here; only the final features of the two encoders are used, as in Sec.3.2.

Results are included in Table 3(a), which suggests that a larger τ generally leads to a better performance for both 3D-ResNet18 and 3D-ResNet50. This verifies that the visual tempo difference between slow and fast videos indeed enforces the video encoders to learn discriminative semantics by utilizing the underlying consistency. Visual tempo as a source of supervision signal can thus help self-supervised video representation learning.

Effect of hierarchical contrastive learning. We study the effect of the hierarchical contrastive formulation with a varying number of levels. Here k refers to the number of elements in D, the set of depths from which features are extracted. For example, we collect the features from the last two residual stages and build up a two-level contrastive formulation when k = 2. Furthermore, when k is 1, the hierarchical scheme degrades into the general contrastive formulation shown in Sec.3.2. The relative coefficient τ is set to 2 for a fair comparison.

Results are included in Table 3(b), showing that an increasing number of levels in the contrastive formulation significantly boosts performance, even when the model is quite heavy and tends to overfit. These results verify the effectiveness of utilizing the rich hierarchy inside a deep network, which correlates well with previous studies Yang et al. (2020). Besides, from the perspective of optimization, such a hierarchical scheme provides a stronger supervision, effectively preventing the learning process from encountering issues such as vanishing gradients Szegedy et al. (2015), especially when a deep network serves as the encoder.

4.4 Evaluation on Other Downstream Tasks

Representations learned via supervised learning on large-scale datasets such as ImageNet Deng et al. (2009) and Kinetics-400 Carreira and Zisserman (2017) have been shown to generalize well to a variety of tasks. While previous methods for unsupervised video representation learning tend to study the quality of learned representations only on the action recognition task, it is important to include other downstream tasks for a comprehensive evaluation, since encoders may overfit to the action recognition benchmarks (e.g. UCF-101 Soomro et al. (2012) and HMDB-51 Kuehne et al. (2011)). Therefore, we also benchmark VTHCL on other downstream tasks, including action detection on AVA Gu et al. (2018) and action anticipation on Epic-Kitchen Damen et al. (2018).

4.4.1 Action Detection on AVA

Dataset. AVA Gu et al. (2018) provides a benchmark for the spatio-temporal localization of actions. Different from traditional video detection (e.g. the ImageNet VID dataset), whose labels are categories of given bounding boxes, annotations of AVA are provided for one frame per second and describe the action over time. AVA contains around 235 training videos, 64 validation videos and 80 'atomic' actions.

Setup. We follow the standard setting as in Feichtenhofer et al. (2019); Wu et al. (2019) for training and validation, and conduct the same pre-processing for region proposals. The slow encoder is employed as the backbone network with 8 frames as input. Besides, the spatial stride of the last residual stage is set to 1, with a dilation of 2, to increase the spatial size of the output feature. The region-of-interest (RoI) features are computed by 3D RoIAlign He et al. (2017) and then fed into a per-class, sigmoid-based classifier for prediction. The slight difference in the training protocol is that we train our model for 24 epochs and the learning rate is decayed by a factor of 10 at the 16th and 22nd epochs, which is the standard scheduler for object detection. Note that BatchNorm (BN) layers Ioffe and Szegedy (2015) are not frozen. SGD is adopted as our optimizer with an initial learning rate of 0.1 and a small weight decay.

Results. Table 4(a) provides the mean Average Precision (mAP) for several common initializations. A similar observation appears: with a proper initialization (ImageNet, Kinetics and Ours), overfitting is somewhat prevented, such that 3D-ResNet50 can make the best of its increased capacity and achieve a better performance than 3D-ResNet18. It is worth noting that our method equipped with the same backbone (13.9 mAP) can beat 3D-ResNet18 trained via supervised learning on ImageNet (13.4 mAP). However, in the action detection task there exists a clear gap between video representations learned by self-supervised and supervised frameworks, even though self-supervised approaches have obtained higher and higher results on action recognition. It is thus beneficial and necessary to include additional downstream tasks for evaluating self-supervised video representation learning.

4.4.2 Action Anticipation on Epic-Kitchen

Dataset. Epic-Kitchen Damen et al. (2018) provides a large-scale cooking dataset, recorded by 32 subjects in 32 kitchens. It contains 125 verb and 352 noun categories. Following Damen et al. (2018), we randomly select 232 videos (23439 segments) for training and 40 videos (4979 segments) for validation. Action anticipation requires forecasting the category of a future action before it happens, given a video clip as the observation. Following the original baseline of Epic-Kitchen Damen et al. (2018), we refer to τ_a as the anticipation time and τ_o as the observation time. In our experiments, both τ_a and τ_o are set to 1 second.

Setup. In order to validate the learned representations themselves, we introduce no reasoning modules as in Ke et al. (2019); Miech et al. (2019b). Similar to Damen et al. (2018), we apply a shared MLP after the backbone network and then attach two separate classification heads for noun and verb predictions. The slow encoder is employed as the backbone network with 8 frames as input. Our models are trained for 80 epochs with an initial learning rate of 0.1 (which is divided by 10 at the 40th and 60th epochs respectively).
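The two-head design above can be sketched as follows (the hidden dimension and weight shapes are illustrative assumptions; only the 352-noun/125-verb category counts come from the dataset description):

```python
import numpy as np

def anticipation_heads(clip_feat, w_shared, w_noun, w_verb):
    """Shared MLP followed by separate noun and verb linear heads.

    clip_feat: (d,) backbone feature of the observed clip; returns raw
    logits for the noun and verb categories.
    """
    hidden = np.maximum(clip_feat @ w_shared, 0.0)   # shared MLP with ReLU
    return hidden @ w_noun, hidden @ w_verb          # noun logits, verb logits
```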

Results. The top-1 accuracy of noun/verb prediction obtained by various models is presented in Table 4(b). Although our method obtains consistent improvements over the randomly initialized baseline, the gap between the results of models learned with self-supervised and supervised schemes indicates that the discriminative quality of the learned representations can be further improved.

     Random  ImageNet  Kinetics  Ours
R18  11.1    13.4      16.6      13.9
R50  7.9     16.8      21.4      15.0
(a) Action Detection on AVA. Mean average precision (mAP) is reported.

     Random     ImageNet    Kinetics    Ours
R18  8.9/26.3   13.5/28.0   14.2/28.8   11.2/27.0
R50  8.2/26.3   15.7/27.8   15.8/30.2   11.9/27.6
(b) Action Anticipation on Epic-Kitchen. Top-1 accuracy of Noun/Verb prediction is reported.

Table 4: Representation Transfer. Results on action detection and anticipation are reported.

4.5 Discussion

Heavy Backbones. Intuitively, heavier backbones are expected to outperform lighter ones owing to their increased capacity. However, our results on action recognition, detection and anticipation reveal that heavy backbones are prone to overfitting when they are not well initialized. Therefore, when evaluating various methods of video representation learning, we should check carefully whether they bring consistent improvements on heavy backbones.

Thorough Evaluation. From our results, we argue that learned video representations need a more thorough evaluation across architectures, benchmarks and downstream tasks to study their consistency and generalization ability. The reasons are two-fold. a) Models with large capacity tend to overfit on UCF-101 Soomro et al. (2012) and HMDB-51 Kuehne et al. (2011) due to the limited scale and diversity of these datasets, so that augmentation and regularization can sometimes matter more than the representations themselves. Linear classification on a large-scale dataset (Kinetics-400 Carreira and Zisserman (2017)) would therefore be a more direct and precise way to evaluate the learned representations. b) Action recognition should not be the only evaluation target. Our study on diverse downstream tasks shows that there remain gaps between video representations learned by self-supervised and supervised learning schemes, especially on action detection and action anticipation. The learned representations should facilitate as many downstream tasks as possible.
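The linear-classification protocol mentioned above keeps the pretrained backbone frozen and trains only a linear classifier on its features, so the score reflects the representations rather than fine-tuning tricks. A toy sketch of such a linear probe (the synthetic "features", dimensions, and training loop are our own illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for frozen backbone features of a labeled dataset:
# three well-separated clusters, one per class.
n, feat_dim, n_classes = 600, 64, 3
labels = rng.integers(0, n_classes, size=n)
centers = rng.normal(size=(n_classes, feat_dim))
feats = centers[labels] + 0.3 * rng.normal(size=(n, feat_dim))

# Linear probe: a single softmax layer trained on top of the frozen features.
W = np.zeros((feat_dim, n_classes))
b = np.zeros(n_classes)
for _ in range(200):
    logits = feats @ W + b
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    probs[np.arange(n), labels] -= 1.0               # softmax cross-entropy gradient
    W -= 0.1 * feats.T @ probs / n
    b -= 0.1 * probs.mean(axis=0)

acc = ((feats @ W + b).argmax(axis=1) == labels).mean()
print(acc)  # close to 1.0 on this linearly separable toy data
```

Since only W and b are updated, any difference in probe accuracy between two pretraining schemes is attributable to the quality of the features themselves.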

5 Conclusion

In this work, we leverage videos of the same instance but with varying visual tempos to learn video representations in a self-supervised manner, adopting the contrastive learning framework and extending it to hierarchical contrastive learning. On a variety of downstream tasks, including action recognition, detection and anticipation, we demonstrate the effectiveness of the proposed framework, which obtains competitive results on action recognition, outperforming previous approaches by a clear margin. Moreover, our experiments suggest that, when learning general visual representations of videos, the learned features should be evaluated more thoroughly and carefully under different network architectures, benchmarks and tasks.


Acknowledgements. We thank Zhirong Wu and Yonglong Tian for their public implementations of previous works.


  • H. Alwassel, D. Mahajan, L. Torresani, B. Ghanem, and D. Tran (2019) Self-supervised learning by cross-modal audio-video clustering. arXiv preprint arXiv:1911.12667. Cited by: §2.
  • R. Arandjelovic and A. Zisserman (2017) Look, listen and learn. In Proc. ICCV, Cited by: §2.
  • P. Bachman, R. D. Hjelm, and W. Buchwalter (2019) Learning representations by maximizing mutual information across views. In Advances in Neural Information Processing Systems, pp. 15509–15519. Cited by: §2.
  • S. Benaim, A. Ephrat, O. Lang, I. Mosseri, W. T. Freeman, M. Rubinstein, M. Irani, and T. Dekel (2020) SpeedNet: learning the speediness in videos. Proc. CVPR. Cited by: §1, Table 1.
  • J. Carreira and A. Zisserman (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In Proc. CVPR, Cited by: §1, §3.3, §4.1, §4.1, §4.2, §4.2, §4.4, §4.5, Table 2, §4.
  • T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020a) A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709. Cited by: §2, §3.1, §3.2, §4.1, §4.2.
  • X. Chen, H. Fan, R. Girshick, and K. He (2020b) Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297. Cited by: §1, §3.1, §3.2, §4.2, §4.2.
  • D. Damen, H. Doughty, G. Maria Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, et al. (2018) Scaling egocentric vision: the epic-kitchens dataset. In Proc. ECCV, Cited by: §1, §4.4.2, §4.4.2, §4.4, §4.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In Proc. CVPR, Cited by: §1, §4.2, §4.4.
  • D. Dwibedi, Y. Aytar, J. Tompson, P. Sermanet, and A. Zisserman (2019) Temporal cycle-consistency learning. In Proc. CVPR, Cited by: §2.
  • C. Feichtenhofer, H. Fan, J. Malik, and K. He (2019) Slowfast networks for video recognition. In Proc. ICCV, Cited by: Video Representation Learning with Visual Tempo Consistency, §1, §3.1, §3.1, §3.3, §4.1, §4.2, §4.4.1.
  • B. Fernando, H. Bilen, E. Gavves, and S. Gould (2017) Self-supervised video representation learning with odd-one-out networks. In Proc. CVPR, Cited by: §2.
  • C. Gu, C. Sun, D. A. Ross, C. Vondrick, C. Pantofaru, Y. Li, S. Vijayanarasimhan, G. Toderici, S. Ricco, R. Sukthankar, et al. (2018) Ava: a video dataset of spatio-temporally localized atomic visual actions. In Proc. CVPR, Cited by: §1, §4.4.1, §4.4, §4.
  • R. Hadsell, S. Chopra, and Y. LeCun (2006) Dimensionality reduction by learning an invariant mapping. In Proc. CVPR, Cited by: §1.
  • T. Han, W. Xie, and A. Zisserman (2019) Video representation learning by dense predictive coding. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Cited by: §1, §2, Table 1.
  • K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020) Momentum contrast for unsupervised visual representation learning. Proc. CVPR. Cited by: §1, §2, §3.1, §4.1, §4.2, §4.2.
  • K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proc. ICCV, Cited by: §4.4.1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proc. CVPR, Cited by: §4.1.
  • O. J. Hénaff, A. Razavi, C. Doersch, S. Eslami, and A. v. d. Oord (2019) Data-efficient image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272. Cited by: §2.
  • R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio (2019) Learning deep representations by mutual information estimation and maximization. International Conference on Learning Representations. Cited by: §2.
  • D. Huang, V. Ramanathan, D. Mahajan, L. Torresani, M. Paluri, L. Fei-Fei, and J. Carlos Niebles (2018) What makes a video a video: analyzing temporal information in video understanding models and datasets. In Proc. CVPR, Cited by: §3.1.
  • S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, Cited by: §4.4.1.
  • L. Jing and Y. Tian (2018) Self-supervised spatiotemporal feature learning by video geometric transformations. arXiv preprint arXiv:1811.11387. Cited by: Table 1.
  • Q. Ke, M. Fritz, and B. Schiele (2019) Time-conditioned action anticipation in one shot. In Proc. CVPR, Cited by: §4.4.2.
  • D. Kim, D. Cho, and I. S. Kweon (2019) Self-supervised video representation learning with space-time cubic puzzles. In Proc. AAAI, Cited by: Table 1.
  • A. Kolesnikov, X. Zhai, and L. Beyer (2019) Revisiting self-supervised visual representation learning. In Proc. CVPR, Cited by: §4.2.
  • B. Korbar, D. Tran, and L. Torresani (2018) Cooperative learning of audio and video models from self-supervised synchronization. In Advances in Neural Information Processing Systems, Cited by: §2.
  • H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre (2011) HMDB: a large video database for human motion recognition. In Proc. ICCV, Cited by: §1, §4.2, §4.3, §4.4, §4.5, Table 1, Table 3.
  • H. Lee, J. Huang, M. Singh, and M. Yang (2017) Unsupervised representation learning by sorting sequences. In Proc. ICCV, Cited by: §2.
  • I. Loshchilov and F. Hutter (2017) SGDR: stochastic gradient descent with warm restarts. International Conference on Learning Representations. Cited by: §4.1.
  • A. Miech, J. Alayrac, L. Smaira, I. Laptev, J. Sivic, and A. Zisserman (2019a) End-to-end learning of visual representations from uncurated instructional videos. arXiv preprint arXiv:1912.06430. Cited by: §2.
  • A. Miech, I. Laptev, J. Sivic, H. Wang, L. Torresani, and D. Tran (2019b) Leveraging the present to anticipate the future in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Cited by: §4.4.2.
  • I. Misra, C. L. Zitnick, and M. Hebert (2016) Shuffle and learn: unsupervised learning using temporal order verification. In Proc. ECCV, Cited by: §1, §2.
  • A. v. d. Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. Cited by: §2, §2.
  • M. Patrick, Y. M. Asano, R. Fong, J. F. Henriques, G. Zweig, and A. Vedaldi (2020) Multi-modal self-supervision from generalized data transformations. arXiv preprint arXiv:2003.04298. Cited by: §2.
  • K. Soomro, A. R. Zamir, and M. Shah (2012) UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402. Cited by: §1, §4.2, §4.2, §4.3, §4.4, §4.5, Table 1, Table 3.
  • C. Sun, F. Baradel, K. Murphy, and C. Schmid (2019a) Contrastive bidirectional transformer for temporal representation learning. arXiv preprint arXiv:1906.05743. Cited by: §2.
  • C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid (2019b) Videobert: a joint model for video and language representation learning. In Proc. ICCV, Cited by: §2.
  • C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In Proc. CVPR, Cited by: §4.3.
  • Y. Tian, D. Krishnan, and P. Isola (2019) Contrastive multiview coding. arXiv preprint arXiv:1906.05849. Cited by: §1, §2, §3.1, §3.2, §4.1, §4.2, §4.2.
  • C. Vondrick, A. Shrivastava, A. Fathi, S. Guadarrama, and K. Murphy (2018) Tracking emerges by colorizing videos. In Proc. ECCV, Cited by: §2.
  • J. Wang, J. Jiao, L. Bao, S. He, Y. Liu, and W. Liu (2019a) Self-supervised spatio-temporal representation learning for videos by predicting motion and appearance statistics. In Proc. CVPR, Cited by: Table 1.
  • J. Wang, J. Jiao, L. Bao, S. He, Y. Liu, and W. Liu (2019b) Self-supervised spatio-temporal representation learning for videos by predicting motion and appearance statistics. In Proc. CVPR, Cited by: §1, §1, §2.
  • X. Wang, R. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In Proc. CVPR, Cited by: §4.2.
  • X. Wang, A. Jabri, and A. A. Efros (2019c) Learning correspondence from the cycle-consistency of time. In Proc. CVPR, Cited by: §2.
  • D. Wei, J. J. Lim, A. Zisserman, and W. T. Freeman (2018) Learning and using the arrow of time. In Proc. CVPR, Cited by: §1, §1, §2, Table 1.
  • C. Wu, C. Feichtenhofer, H. Fan, K. He, P. Krahenbuhl, and R. Girshick (2019) Long-term feature banks for detailed video understanding. In Proc. CVPR, Cited by: §4.4.1.
  • Z. Wu, Y. Xiong, S. X. Yu, and D. Lin (2018) Unsupervised feature learning via non-parametric instance discrimination. In Proc. CVPR, Cited by: §2, §3.1, §3.2, §3.2, §4.1, §4.3.
  • D. Xu, J. Xiao, Z. Zhao, J. Shao, D. Xie, and Y. Zhuang (2019) Self-supervised spatiotemporal learning via video clip order prediction. In Proc. CVPR, Cited by: Table 1.
  • C. Yang, Y. Xu, J. Shi, B. Dai, and B. Zhou (2020) Temporal pyramid network for action recognition. In Proc. CVPR, Cited by: Video Representation Learning with Visual Tempo Consistency, §1, §1, §3.1, §3.3, §4.2, §4.3.