Hierarchical Contrastive Motion Learning for Video Action Recognition

by   Xitong Yang, et al.

One central question for video action recognition is how to model motion. In this paper, we present hierarchical contrastive motion learning, a new self-supervised learning framework to extract effective motion representations from raw video frames. Our approach progressively learns a hierarchy of motion features that correspond to different abstraction levels in a network. This hierarchical design bridges the semantic gap between low-level motion cues and high-level recognition tasks, and promotes the fusion of appearance and motion information at multiple levels. At each level, explicit motion self-supervision is provided via contrastive learning to enforce the motion features at the current level to predict the future ones at the previous level. Thus, the motion features at higher levels are trained to gradually capture semantic dynamics and become more discriminative for action recognition. Our motion learning module is lightweight and flexible enough to be embedded into various backbone networks. Extensive experiments on four benchmarks show that the proposed approach consistently achieves superior results.


1 Introduction

Motion provides abundant and powerful cues for understanding the dynamic visual world. A broad range of video understanding tasks can benefit from the introduction of motion information, such as action recognition Wang and Schmid (2013); Simonyan and Zisserman (2014), activity detection Gu et al. (2018); Yang et al. (2019), and object tracking Feichtenhofer et al. (2017); Lu et al. (2020). Thus, how to extract and model temporal motion is one of the fundamental research problems in video understanding. While early work in this field mostly relies on off-the-shelf pre-computed motion features (e.g., optical flow), recent work on action recognition has been actively exploiting convolutional neural networks (CNNs) for more effective motion learning from raw video frames Ng et al. (2018), encouraged by the tremendous success of end-to-end learning in various vision tasks.

A key challenge for end-to-end motion representation learning is to design effective supervision. Unlike many other tasks that afford plenty of well-defined annotations, “ground truth motion” is usually unavailable or even undefined for motion learning in practice. One popular idea is to extract motion features by means of the action recognition supervision. However, the classification loss is shown to be sub-optimal in this task as it only provides implicit supervision to guide motion learning Stroud et al. (2020). This supervision is also prone to be biased towards appearance information since some video action benchmarks can be mostly solved by considering static images without temporal modeling Sevilla-Lara et al. (2019). Recently, some efforts have been made to explore pretext tasks with direct supervision for motion learning, such as optical flow prediction Ng et al. (2018) and video frame reconstruction Zhu et al. (2018). Although it has shown promising results, such supervision is restricted to pixel-wise and short-term motion, as it hinges on a pixel photometric loss and movement between adjacent frames.

In light of the above observations, we introduce a novel self-supervised learning framework that enables more explicit motion supervision at multiple feature abstraction levels, which we term hierarchical contrastive motion learning. Specifically, given preliminary motion cues as a bootstrap, our approach progressively learns a hierarchy of motion features in a bottom-up manner, as illustrated in Figure 1(a). This hierarchical design is proposed to bridge the semantic gap between the low-level preliminary motion and the high-level recognition task, analogous to the findings in neuroscience that humans perceive motion patterns in a hierarchical way Giese and Poggio (2003); Bill et al. (2019). At each level, a discriminative contrastive loss Chopra et al. (2005); Hadsell et al. (2006) provides explicit self-supervision that enforces the motion features at the current level to predict the future ones at the previous level. In contrast to the aforementioned pretext tasks that focus on low-level image details, contrastive learning encourages the model to learn useful semantic dynamics from previously learned motion features at a lower level, and is more favorable for motion learning at higher levels where the spatial and temporal resolutions of feature maps are low Oord et al. (2018); Chen et al. (2020); Han et al. (2019). To acquire the preliminary motion cues that initialize the hierarchical motion learning, we exploit video frame reconstruction Jason et al. (2016); Ren et al. (2017) as an auxiliary task so that the whole motion representation learning enjoys a unified self-supervised setup.

The proposed motion learning module is realized via a side network branch, which is lightweight and flexible to embed into a variety of backbone CNNs. In particular, our hierarchical design promotes the appearance and motion fusion by integrating the learned motion features into the backbone network at multiple abstraction levels Feichtenhofer et al. (2016). Such a multi-level fusion paradigm is unachievable for previous motion learning methods Diba et al. (2019); Zhu et al. (2018) that depend solely on low-level motion supervisions. It is also noteworthy that our approach only introduces a small overhead to the computation of a backbone network at inference time. As shown in Figure 1(b), the side branches (i.e., the shaded region) for self-supervised motion learning are discarded after training, and only the learned motion features are retained in the form of residual connections He et al. (2016). Compared with the two-stream methods Simonyan and Zisserman (2014) that require an additional temporal stream operating on the pre-computed optical flow, our approach is capable of boosting action recognition with a minor computational increase.

We summarize our main contributions as follows. First, we propose a new framework for motion representation learning from raw video frames. Second, we advance contrastive learning to a hierarchical design and, to the best of our knowledge, provide the first attempt to apply contrastive learning to motion representation learning for large-scale video action recognition. Third, our approach achieves superior results on four benchmarks without relying on off-the-shelf motion features or supervised pre-training. Our source code and model will be released.

Figure 1: (a) Illustration of the hierarchy of progressively learned motion cues: from pixel-level and short-range movement to semantic temporal dynamics. (b) Architecture of the proposed motion learning module embedded in a backbone network. (c) Overview of the contrastive learning, which uses higher-level motion features to predict future ones at the lower level.

2 Related Work

Motion Extraction for Action Recognition. A large body of research focuses on motion modeling. The two-stream networks operate on low-level motion inputs such as optical flow or even frame difference Mahasseni et al. (2018); Yang et al. (2016). However, the short-range two-frame motion is ineffective at capturing high-level action semantics. It is also popular to employ 3D CNNs Hara et al. (2018); Xie et al. (2018) and/or RNNs Ng et al. (2015); Yang et al. (2018) to simultaneously model appearance and motion cues and capture long-term temporal context.

Recently, SlowFast Feichtenhofer et al. (2019) was introduced to capture motion through a fast pathway at a fine temporal resolution and appearance through a slow pathway at a low frame rate. STM Jiang et al. (2019) proposes channel-wise spatio-temporal and motion modules to enhance motion feature encoding. ActionFlowNet Ng et al. (2018) uses pre-computed optical flow as an additional supervision to encode motion together with appearance in a single-stream network. To integrate optical flow with action networks, a differentiable representation flow layer is developed in Piergiovanni and Ryoo (2019). By formulating the TV-L1 algorithm Zach et al. (2007) in customized network layers, TVNet Fan et al. (2018) outputs optical-flow-like motion features to complement static appearance.

All these studies use classification loss as an indirect supervision or mimic optical flow design to learn motion extraction, while in this paper we aim to explicitly realize motion learning through the proposed hierarchical contrastive learning in a self-supervised way.

Self-Supervised Learning for Video Representation.

To take advantage of abundantly available unlabeled videos, numerous methods have been developed for video representation learning through different self-supervised signals, such as frame interpolation Long et al. (2016); Niklaus and Liu (2018); Janai et al. (2018) and sequence ordering Misra et al. (2016); Lee et al. (2017); Fernando et al. (2017). Compared to the two tasks above, video prediction has been more extensively explored.

Among video prediction methods, future frame prediction Srivastava et al. (2015); Zhou et al. (2016); Xie et al. (2016); Lotter et al. (2017); Mathieu et al. (2016) accounts for the majority. While these methods require neither pre-trained networks nor video annotations, the learned models may not effectively capture the high-level and long-term temporal dynamics. In contrast, another group of methods conducts prediction in the latent space Vondrick et al. (2016); Diba et al. (2019); Han et al. (2019), where the learned representations capture more abstract motion and allow better generalization. Note that the prediction of feature embeddings usually requires a well-trained network for feature extraction Vondrick et al. (2016); Diba et al. (2019), or complex training schemes to avoid trivial solutions Han et al. (2019).

Compared to existing methods, we conduct motion learning at all levels across the hierarchical structure of a backbone network. We present a progressive training strategy to build the higher-level motion from prediction of the well-learned lower-level one through the proposed hierarchical contrastive learning so that our model can obtain multi-level motion features from scratch.

3 Method

Our approach has two phases: a self-supervised learning phase for hierarchical contrastive motion learning from raw video frames, and a joint learning phase that leverages the learned motion features to improve action recognition. In this section, we describe the two learning phases in detail.

3.1 Formulation

We denote the convolutional features at different levels of a backbone network as $\{F^l\}_{l=1}^{L}$, where $L$ is the number of abstraction levels. Our goal is to learn a hierarchy of motion representations $\{M^l\}_{l=1}^{L}$ that correspond to the different levels. As the first step, we employ video frame reconstruction to obtain the preliminary motion cues $M^0$, which function as a bootstrap for the following hierarchical motion learning. With that, we progressively learn the motion features in a bottom-up manner. At each level $l \geq 1$, we learn the motion features $M^l$ by enforcing them to predict the future motion features at the previous level $M^{l-1}$, as described in Sec. 3.2. We use the contrastive loss $\mathcal{L}^l_{\text{contrast}}$ as an objective so that $M^l$ is trained to capture semantic temporal dynamics from $M^{l-1}$. The learned motion features at each level are integrated into a backbone network via residual connections to perform appearance and motion feature fusion: $Z^l = F^l + g^l(M^l)$, where $g^l$ is used to match the feature dimensions. After learning motion at all levels, we jointly train the whole network for action recognition in a multi-tasking manner, as presented in Sec. 3.4.

3.2 Self-Supervised Motion Learning

Prime Motion Block.

We first introduce a lightweight prime motion block to transform the convolutional features of a backbone network to more discriminative representations for motion learning. The key component of this block is a cost volume layer, which is inspired by the success of using cost volumes in stereo matching Hosni et al. (2012) and optical flow estimation Sun et al. (2019). A cost volume was originally used to store the costs that measure how well a pixel in one frame matches other pixels in another frame, capturing the inter-frame pixel-wise relations that indicate rough motion.

In our case, given a sequence of convolutional features $\{F_t\}_{t=1}^{T}$ with length $T$, we first conduct a convolution to reduce the input channels by a ratio $\beta$, denoted as $\{\tilde{F}_t\}_{t=1}^{T}$. This operation significantly reduces the computational overhead of the prime motion block, and provides more compact representations that preserve the essential information needed to compute cost volumes. The adjacent features are then re-organized into feature pairs $(\tilde{F}_t, \tilde{F}_{t+1})$, which are used to construct the cost volumes. The matching cost between two features is defined as:

$$c_t(\mathbf{p}_1, \mathbf{p}_2) = \mathrm{sim}\big(\tilde{F}_t(\mathbf{p}_1), \tilde{F}_{t+1}(\mathbf{p}_2)\big),$$

where $\tilde{F}_t(\mathbf{p})$ denotes the feature vector at time $t$ and position $\mathbf{p}$, and the cosine similarity is used as the similarity function: $\mathrm{sim}(u, v) = \frac{u^\top v}{\|u\| \, \|v\|}$. Note that we replicate the last feature map $\tilde{F}_T$ to compute its cost volume in order to keep the original temporal resolution.

Since constructing a full cost volume over the whole feature map is computationally expensive, we construct a “partial” cost volume following the practice in Sun et al. (2018). Specifically, we limit the search range to a max displacement of $d$ and use a striding factor $s$ to handle large displacements without increasing the computation. As a result, the cost volume layer outputs a feature tensor of size $T \times (2\lfloor d/s \rfloor + 1)^2 \times H \times W$, where $H$ and $W$ denote the height and width of a feature map. It is noteworthy that computing cost volumes is lightweight as it has no learnable parameters and much fewer FLOPs than 3D convolutions. Finally, we combine the cost volumes with the features obtained after dimension reduction, motivated by the observation that these two features provide complementary information for localizing the motion boundaries. We present the architecture and more details of the prime motion block in the appendix.
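The partial cost volume described above can be illustrated with a small sketch. This is not the authors' implementation: the function names, the nested-list feature layout (height x width x channels), and the choice to leave out-of-bounds matches at zero are all assumptions made for readability, and a practical version would run on tensors.

```python
import math


def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity between two feature vectors (eps avoids division by zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def partial_cost_volume(feat_a, feat_b, max_disp, stride):
    """Match every position of feat_a against a strided window in feat_b.

    feat_a, feat_b: nested lists of shape H x W x C (hypothetical layout).
    Returns K planes of shape H x W, with K = (2 * (max_disp // stride) + 1) ** 2.
    Matches that fall outside the feature map are left at 0.0 (an assumption).
    """
    height, width = len(feat_a), len(feat_a[0])
    radius = max_disp // stride
    offsets = [(dy * stride, dx * stride)
               for dy in range(-radius, radius + 1)
               for dx in range(-radius, radius + 1)]
    volume = []
    for dy, dx in offsets:
        plane = [[0.0] * width for _ in range(height)]
        for y in range(height):
            for x in range(width):
                yy, xx = y + dy, x + dx
                if 0 <= yy < height and 0 <= xx < width:
                    plane[y][x] = cosine_similarity(feat_a[y][x], feat_b[yy][xx])
        volume.append(plane)
    return volume


# Toy 3 x 3 feature map with 2 channels, matched against itself.
features = [[[float(y + 1), float(x + 1)] for x in range(3)] for y in range(3)]
cost = partial_cost_volume(features, features, max_disp=2, stride=2)
print(len(cost), "displacement planes")
```

The sketch has no learnable parameters, which is consistent with the remark that the cost volume layer is lightweight; the stride enlarges the search range without adding displacement planes.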

Hierarchical Contrastive Learning.

Although the prime motion block extracts rough motion features from convolutional features, we find that such features are easily biased towards appearance information when jointly trained with the backbone network. Thus, an explicit motion supervision is of vital importance for more effective motion learning at each level.

We propose a multi-level self-supervised objective based on the contrastive loss Chopra et al. (2005); Hadsell et al. (2006); Gutmann and Hyvärinen (2010), inspired by the recent success of contrastive predictive coding for self-supervised representation learning Oord et al. (2018); Han et al. (2019); Wu et al. (2018). Our goal is to employ the higher-level motion features as a conditional input to guide the prediction of the future lower-level motion features that are well-learned from a previous step. In this way, the higher-level features are forced to understand a more abstract trajectory that summarizes motion dynamics from the lower-level ones. This objective therefore allows us to extract slowly varying features that progressively correspond to high-level semantic concepts Oord et al. (2018); Han et al. (2019); Zhang and Tao (2012).

Formally, let us denote the motion features generated by the prime motion block at level $l$ as $M^l = \{M^l_t\}_{t=1}^{T^l}$, where $T^l$ indicates the sequence length. In order to train $M^l$, we enforce $M^l_t$ to predict the future motion features at the previous level (i.e., $M^{l-1}_{>t}$), conditioned on the motion feature at the start time $M^{l-1}_t$, as illustrated in Figure 1(c). In practice, a predictive function $f_\delta$ is applied for the motion feature prediction at time step $t+\delta$: $\hat{M}^{l-1}_{t+\delta} = f_\delta([M^l_t, M^{l-1}_t])$, where $[\cdot, \cdot]$ denotes channel-wise concatenation. We use a multi-layer perceptron with one hidden layer for the prediction function: $f_\delta(x) = W^{(2)}_\delta \, \sigma(W^{(1)} x)$, where $\sigma$ is ReLU and $W^{(1)}$ is shared across all prediction steps for leveraging their common information.

We define the objective function of each level as a contrastive loss that encourages the predicted $\hat{M}^{l-1}_i$ to be close to the ground truth $M^{l-1}_i$ while being far away from the negative samples:

$$\mathcal{L}^l_{\text{contrast}} = -\sum_{i \in \mathcal{S}} \log \frac{\exp\big(\mathrm{sim}(\hat{M}^{l-1}_i, M^{l-1}_i)/\tau\big)}{\sum_{j \in \mathcal{S}} \exp\big(\mathrm{sim}(\hat{M}^{l-1}_i, M^{l-1}_j)/\tau\big)},$$

where the similarity function is the same cosine similarity used in computing cost volumes, $\tau$ is the temperature parameter, and $\mathcal{S}$ denotes the sampling space of positive and negative samples. As shown in Figure 1(c), the positive sample of the predicted feature is the ground-truth feature that corresponds to the same video and is located at the same position in both space and time as the predicted one. As for the negative samples, considering efficiency, we randomly sample $N$ spatial locations for each video within a mini-batch to compute the loss, so the numbers of spatial negatives, temporal negatives and easy negatives for a predicted feature are respectively $N-1$, $N(T-1)$ and $(B-1)NT$, where $B$ is the batch size and $T$ is the sequence length of ground-truth features. See the appendix for more sampling details. As illustrated in Figure 1(b), the contrastive motion learning is performed for multiple levels until the motion hierarchy of the whole network is built up.
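As a rough illustration of this objective, the sketch below computes an InfoNCE-style contrastive loss with cosine similarity over a flat list of sampled features. It is a simplified stand-in, not the authors' code: the function names are hypothetical, the loss is averaged over predictions (an assumption), and the default temperature of 0.07 is the value reported in the experimental setup. The helper `negative_counts` assumes that spatial negatives share the video and time step of the prediction, temporal negatives share only the video, and easy negatives come from other videos in the mini-batch.

```python
import math


def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def contrastive_loss(predictions, targets, temperature=0.07):
    """InfoNCE-style loss: predictions[i] should match targets[i];
    every other entry of targets acts as a negative sample."""
    total = 0.0
    for i, pred in enumerate(predictions):
        logits = [cosine_similarity(pred, tgt) / temperature for tgt in targets]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_denominator = max_logit + math.log(
            sum(math.exp(z - max_logit) for z in logits))
        total += log_denominator - logits[i]
    return total / len(predictions)  # averaging is an assumption


def negative_counts(batch_size, seq_len, num_locations):
    """Number of (spatial, temporal, easy) negatives for one predicted feature."""
    spatial = num_locations - 1
    temporal = num_locations * (seq_len - 1)
    easy = (batch_size - 1) * num_locations * seq_len
    return spatial, temporal, easy


targets = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
print(contrastive_loss(targets, targets))            # well-aligned predictions
print(contrastive_loss(targets[1:] + targets[:1], targets))  # misaligned predictions
```

The three negative counts always add up to one less than the total number of sampled features, since exactly one sample is the positive.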

Progressive Training.

Training the multi-level self-supervised learning framework simultaneously from the beginning is infeasible, as the lower-level motion features are initially not well-learned and the higher-level prediction would be arbitrary. To facilitate the optimization process, we propose a progressive training strategy that learns motion features for one level at a time, propagating from low-level to high-level. In practice, after the convergence of training at level $l-1$, we freeze all network parameters up to level $l-1$ (therefore fixing the motion features $M^{l-1}$), and then start the training for level $l$. In this way, the higher-level motion features can be stably trained with the well-learned lower-level ones.
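The progressive strategy amounts to a simple schedule over levels, sketched below. The function name and the dictionary layout are hypothetical, and level 0 stands for the preliminary motion cues; the actual training loop and convergence criterion are not specified here.

```python
def progressive_schedule(num_levels):
    """Return one training stage per level, from low to high.

    At the stage for a given level, every lower level (including level 0,
    the preliminary motion cues) is frozen, so its motion features stay fixed.
    """
    stages = []
    for level in range(1, num_levels + 1):
        stages.append({
            "train_level": level,
            "frozen_levels": list(range(level)),
        })
    return stages


for stage in progressive_schedule(3):
    print(stage)
```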

3.3 Preliminary Motion Cues

To initialize the progressive training, the preliminary motion cues, i.e., $M^0$, are required as a bootstrap. They should encode some low-level but valid movement information to facilitate the following motion learning. Therefore, we adopt video frame reconstruction to guide the extraction of preliminary motion cues. This task can be formulated as a self-supervised optical flow estimation problem Jason et al. (2016); Ren et al. (2017), aiming to produce optical flow to allow frame reconstruction from neighboring frames. Motivated by the success of recent work on estimating optical flow with CNNs Sun et al. (2018), we build a simple optical flow estimation module using 5 convolutional layers that are stacked sequentially with dense connections. We make use of the optical flow output to warp video frames through bilinear interpolation. The loss function consists of a photometric term that measures the error between the warped frame and the target frame, and a smoothness term that addresses the aperture problem that causes ambiguity in motion estimation: $\mathcal{L}_{\text{recon}} = \mathcal{L}_{\text{photo}} + \mathcal{L}_{\text{smooth}}$. We define the photometric error as:

$$\mathcal{L}_{\text{photo}} = \sum_{t} \sum_{\mathbf{p}} \mathbb{1}\big(m_t(\mathbf{p}) = 0\big) \, \rho\big(I_t(\mathbf{p}) - \hat{I}_t(\mathbf{p})\big),$$

where $\hat{I}_t$ indicates the warped frame at time $t$ and $\rho(x) = (x^2 + \epsilon^2)^{\alpha}$ is the generalized Charbonnier penalty function Sun et al. (2014) with parameters $\epsilon$ and $\alpha$. We use a binary mask $m_t$ to indicate the positions of invalid warped pixels (i.e., out-of-boundary) and apply an indicator function $\mathbb{1}(\cdot)$ to exclude those invalid positions. We compute the smoothness term as:

$$\mathcal{L}_{\text{smooth}} = \sum_{t} \sum_{\mathbf{p}} \Big[ \rho\big(\nabla_x U_t(\mathbf{p})\big) + \rho\big(\nabla_y U_t(\mathbf{p})\big) \Big],$$

where $\nabla_x U_t$ and $\nabla_y U_t$ denote the gradients of the estimated flow field $U_t$ in the $x$ and $y$ directions.
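To make the two reconstruction terms concrete, the following sketch evaluates a masked photometric error and a smoothness penalty with the generalized Charbonnier function on small grids. It is only an approximation under stated assumptions: frames and a single flow component are plain 2D lists, flow gradients are taken as forward differences between neighboring pixels, the function names are hypothetical, and the values of `epsilon` and `alpha` in the usage lines are illustrative placeholders rather than the settings used in the paper.

```python
def charbonnier(x, epsilon, alpha):
    """Generalized Charbonnier penalty: (x^2 + epsilon^2)^alpha."""
    return (x * x + epsilon * epsilon) ** alpha


def photometric_loss(target, warped, valid_mask, epsilon, alpha):
    """Sum the penalty of (target - warped) over pixels marked as valid."""
    total = 0.0
    for row_t, row_w, row_m in zip(target, warped, valid_mask):
        for t, w, is_valid in zip(row_t, row_w, row_m):
            if is_valid:  # skip out-of-boundary (invalid) warped pixels
                total += charbonnier(t - w, epsilon, alpha)
    return total


def smoothness_loss(flow, epsilon, alpha):
    """Penalize forward differences of one flow component in x and y."""
    height, width = len(flow), len(flow[0])
    total = 0.0
    for y in range(height):
        for x in range(width):
            if x + 1 < width:
                total += charbonnier(flow[y][x + 1] - flow[y][x], epsilon, alpha)
            if y + 1 < height:
                total += charbonnier(flow[y + 1][x] - flow[y][x], epsilon, alpha)
    return total


# Illustrative placeholder parameters (not taken from the paper).
EPSILON, ALPHA = 1e-3, 0.5
target = [[1.0, 2.0]]
warped = [[1.0, 5.0]]
valid = [[True, False]]  # the second pixel was warped from outside the frame
print(photometric_loss(target, warped, valid, EPSILON, ALPHA))
print(smoothness_loss([[0.5, 0.5], [0.5, 0.5]], EPSILON, ALPHA))
```

Masking matters in the toy example: the large error at the second pixel is ignored because that pixel is flagged as invalid.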

3.4 Joint Training for Action Recognition

Our ultimate goal is to improve video action recognition with the learned hierarchical motion features. To integrate the learned motion features into a backbone network, we wrap our prime motion block into a residual block: $Z^l = F^l + g^l(M^l)$, where $F^l$ is the convolutional features at level $l$, $M^l$ is the corresponding motion features obtained in Sec. 3.2, and $g^l$ is a convolution. This seamless integration enables end-to-end fusion of appearance and motion information over multiple levels throughout a single unified network, instead of learning them disjointly like two-stream networks Simonyan and Zisserman (2014). After the motion representations are learned via self-supervision at all levels, we add the classification loss to jointly optimize the total objective, which is a weighted sum of the following losses:

$$\mathcal{L} = \mathcal{L}_{\text{action}} + \lambda \sum_{l=1}^{L} \mathcal{L}^l_{\text{contrast}} + \gamma \, \mathcal{L}_{\text{recon}},$$

where $\lambda$ and $\gamma$ are the weights to balance related loss terms. As shown in Figure 1(b), our multi-level self-supervised learning is performed via a side network branch, which can be flexibly embedded into standard CNNs. Furthermore, this self-supervised learning side branch is discarded after training so that our final network maintains its efficiency at inference time.

4 Experiments

In this section, we first describe our experimental setup in detail. We then present a variety of ablation studies to understand the contribution of each individual component in our design, and provide in-depth analysis with qualitative visualization results. Finally, we report comparisons to the state-of-the-art methods on four benchmark datasets.

4.1 Experimental Setup

We evaluate the proposed approach on four benchmarks: Kinetics-400 Carreira and Zisserman (2017), Something-Something (V1&V2) Goyal et al. (2017b) and UCF-101 Soomro et al. (2012). Our motion learning module is generic and can be instantiated with various 2D and 3D action networks Wang et al. (2018); Carreira and Zisserman (2017); Tran et al. (2018). In our experiments, we use the standard networks R2D Wang et al. (2018) and R(2+1)D Tran et al. (2018) as our backbones. We make a few changes to the backbones to improve efficiency, e.g., using bottleneck layers, dropping the temporal convolutions in and , and starting temporal striding from . For the prime motion block, we set and , and the channel reduction ratio . We set the temperature parameter in contrastive loss to 0.07 following the practice in previous work Wu et al. (2018); He et al. (2020). We follow the standard recipe in Goyal et al. (2017a); Feichtenhofer et al. (2019) for model training. Note that all models are trained from scratch or self-supervised pre-trained without any additional video annotations or pre-computed optical flow. More details on dataset, architecture and implementation are available in appendix.

4.2 Ablation Study and Model Analysis

Table 1: Comparison of efficacy scores of the motion features learned at different levels under different supervisions.

| Input | Score (action) | Score (preliminary) | Score (hierarchical) |
| --- | --- | --- | --- |
| Level 1 | 2.3 | 3.1 | - |
| Level 2 | 1.7 | 2.5 | 3.0 |
| Level 3 | 2.4 | 1.7 | 3.0 |

Figure 2: Top-1 accuracy on UCF-101 with incrementally adding motion learning blocks.

| Methods | PMB | Self | FLOPs (relative) | UCF-101 | Something-V1 | Kinetics-400 |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline: R2D | - | - | 1.00 | 66.0 / 86.0 | 36.1 / 68.1 | 64.8 / 85.1 |
| Ours: R2D | ✓ | - | 1.18 | 71.6 / 89.7 | 43.6 / 74.7 | 65.6 / 85.5 |
| Ours: R2D | ✓ | ✓ | 1.18 | 79.8 / 94.4 | 44.3 / 75.8 | 67.3 / 86.4 |
| Baseline: R(2+1)D | - | - | 1.00 | 68.0 / 88.2 | 48.5 / 78.1 | 66.8 / 86.6 |
| Ours: R(2+1)D | ✓ | - | 1.11 | 73.4 / 92.1 | 49.2 / 77.9 | 67.4 / 86.9 |
| Ours: R(2+1)D | ✓ | ✓ | 1.11 | 80.7 / 95.6 | 50.4 / 78.9 | 68.3 / 87.4 |

Table 2: Ablation study on the prime motion block (PMB) and self-supervision (Self) for action recognition. We report computational cost and top-1 / top-5 accuracy on the three benchmarks.

Supervision for Motion Learning.

We first compare the motion features learned at different levels under different supervisions. Since our motion learning is based on self-supervisions that are not directly correlated with the final action recognition performance, we first seek to define a measurement to reflect the efficacy of the learned motion features. Towards this goal, we take the extracted motion feature as input and train a lightweight classifier for action recognition on UCF-101. We define the efficacy score as:

, where and indicate the top-1 accuracy on the training and test sets. Intuitively, a higher score implies that the representation is more discriminative (with higher training accuracy) and generalizes better (with a lower performance gap between training and testing).

Table 1 shows the efficacy scores of motion features at different levels with different supervisions, where “action” indicates the supervision by action classification, and “preliminary” and “hierarchical” refer to the supervisions by frame reconstruction and contrastive learning. Levels 1, 2 and 3 correspond to the motion features extracted after , and of R2D. We observe that the self-supervision of low-level frame reconstruction is particularly effective at level 1, but its performance degrades dramatically at higher levels due to lower spatial/temporal resolutions and higher abstraction of convolutional features. In contrast, the proposed self-supervision by hierarchical contrastive learning is more stable over different levels and more effective at modeling motion dynamics. We also observe that the self-supervision, with the correct choice at each level, consistently outperforms the supervision by action classification, which is consistent with the findings in Ng et al. (2018); Stroud et al. (2020); Diba et al. (2019). In Figure 3, we visualize the estimated optical flow, the by-product of frame reconstruction, at each level and find that more accurate optical flow is indeed obtained at lower levels.

Figure 3: Visualization of the estimated optical flow at different feature abstraction levels. For each group, columns 1-2 are adjacent frames; column 3 is the reference optical flow extracted by Liu (2009); columns 4-6 are the estimated optical flow at levels 1, 2 and 3.
Methods Pre-Training Two-Stream Kinetics-400 Something-V1 Something-V2
I3D Carreira and Zisserman (2017) ImageNet 75.7
R(2+1)D Tran et al. (2018) Sports1M 74.3 45.7
R(2+1)D Tran et al. (2018) Sports1M 75.4
NL I3D-50 Wang et al. (2018) ImageNet 76.5 44.4
S3D-G Xie et al. (2018) ImageNet 77.2 48.2
TRN Zhou et al. (2018) ImageNet 42.0 55.5
TSM Lin et al. (2019) ImageNet 75.7 49.7 63.4
TSM Lin et al. (2019) ImageNet 52.6 66.0
ECO Zolfaghari et al. (2018) 70.0 49.5
SlowFast Feichtenhofer et al. (2019) 77.9
Disentangling Zhao et al. (2018) ImageNet 71.5
D3D Stroud et al. (2020) ImageNet 75.9
STM Jiang et al. (2019) ImageNet 73.7 50.7 64.2
Rep. Flow Piergiovanni and Ryoo (2019) 77.1
MARS Crasto et al. (2019) 72.7
DynamoNet Diba et al. (2019) 77.9
Ours, R2D-50 74.8 46.2 59.4
Ours, R(2+1)D-101 78.3 52.8 64.4
Table 3: Comparison with the state-of-the-art methods on Kinetics-400 and Something-V1&V2.

Contributions of Individual Components. Here we verify the contributions of proposed components by comparing against the baseline, as shown in Table 2. We use one clip per video and the center crop for evaluation to eliminate the impact of test-time augmentation. We report both top-1 and top-5 accuracy for each setting.

It is obvious that our approach consistently and significantly improves the action recognition accuracy for both 2D and 3D action networks. Our prime motion block provides complementary motion features at multiple levels, and the self-supervision further enhances the representations to encode semantic dynamics. In particular, for the dataset that heavily depends on temporal information like Something-V1, our approach remarkably improves the performance of baseline R2D by 8.2%. For the dataset that is small-scale and tends to overfit to the appearance information like UCF-101, our method improves model generalization and achieves 13.8% improvement. Moreover, our motion learning module only introduces a small overhead to FLOPs of the backbone network.

We next validate the contribution of our motion learning at each level by incrementally adding the proposed motion feature learning block to the baseline. Figure 2 shows the results of R2D and R(2+1)D on UCF-101. We observe that notable gains can be obtained at multiple levels, and the performance gain does not vanish as more motion learning blocks are added, suggesting the importance of leveraging hierarchical motion information across all levels.

4.3 Comparison with State-of-the-Art Results

We compare our approach with the state-of-the-art methods on the four action recognition benchmarks. We report results with the standard test-time augmentations Wang et al. (2018): increasing the spatial resolution to and sampling multiple clips per video (9 clips for Something-V1&V2 and 30 clips for Kinetics-400 and UCF-101).

Table 3 shows the comparisons on Kinetics-400 and Something-V1&V2. Without using optical flow or supervised pre-training, our model based on the R(2+1)D-101 backbone achieves the best results among the single-stream methods on all three datasets. Our approach also outperforms most two-stream methods, apart from the recent two-stream TSM Lin et al. (2019) on Something-V2. For datasets that focus more on temporal modeling, like Something-V1&V2, 2D networks usually cannot achieve results as good as 3D models. However, when equipped with the proposed motion learning module, our method based on the R2D-50 backbone outperforms some 3D models, such as R(2+1)D and NL I3D. Our approach also achieves superior results compared with the most recent works that are specifically designed for temporal motion modeling (the second group in Table 3). More importantly, our motion learning is fully self-supervised from raw video frames without any supervision from optical flow or a pre-trained temporal stream.

For the experiment on UCF-101, we fine-tune the models trained only on Kinetics-400 for classification following the standard setting in previous work, and report the averaged accuracy over all 3 splits. As shown in Table 4, our approach achieves 97.8% top-1 accuracy, a new state-of-the-art result among the single-stream methods on UCF-101.

Figure 4: Visualization of the learned features by Grad-CAM++ Chattopadhyay et al. (2017). Fonts in green and red indicate correct recognition and misclassification. (a-b) Features learned by our approach (bottom) are more sensitive to the regions with motion cues. (c) Our motion learning module (bottom) equips the 2D backbone with the ability of reasoning the temporal order of video frames.
| Methods | Pre-Training | UCF-101 |
| --- | --- | --- |
| I3D Carreira and Zisserman (2017) | ImageNet + Kinetics-400 | 95.4 |
| R(2+1)D Tran et al. (2018) | Kinetics-400 | 96.8 |
| S3D-G Xie et al. (2018) | ImageNet + Kinetics-400 | 96.8 |
| Disentangling Zhao et al. (2018) | ImageNet + Kinetics-400 | 95.9 |
| D3D Stroud et al. (2020) | ImageNet + Kinetics-400 | 97.0 |
| STM Jiang et al. (2019) | ImageNet + Kinetics-400 | 96.2 |
| MARS Crasto et al. (2019) | Kinetics-400 | 97.0 |
| DynamoNet Diba et al. (2019) | YouTube8M + Kinetics-400 | 97.8 |
| Ours, R(2+1)D-101 | Kinetics-400 | 97.8 |

Table 4: Comparison with the state-of-the-art methods on UCF-101.

4.4 Qualitative Results

To qualitatively verify the impact of the learned motion features, we utilize Grad-CAM++ Chattopadhyay et al. (2017) to visualize the class activation map of the last convolution layer. Figure 4 shows the comparison between the baseline and our model with the R2D-50 backbone on UCF-101 and Something-V1. Our model attends more to the regions with informative motion, while the baseline tends to be distracted by the static appearance. For instance, in Figure 4(a), our method focuses on the moving hands of the person, while the baseline concentrates on the static human body. Our motion learning module also equips the 2D backbone with effective temporal modeling ability. As shown in Figure 4(c), our model is capable of reasoning about the temporal order of the video and predicting the correct action, while the baseline outputs the opposite prediction as it fails to capture the chronological relationship.

5 Conclusion

We have presented hierarchical contrastive motion learning, a multi-level self-supervised framework that progressively learns a hierarchy of motion features from raw video frames. A discriminative contrastive loss at each level provides explicit self-supervision for motion learning. This hierarchical design bridges the semantic gap between low-level motion cues and high-level recognition tasks, and promotes effective fusion of appearance and motion information, which ultimately boosts action recognition. Extensive experiments on four benchmarks show that our approach compares favorably against state-of-the-art methods without requiring optical flow or supervised pre-training.


References

  • J. Bill, H. Pailian, S. J. Gershman, and J. Drugowitsch (2019) Hierarchical structure is employed by humans during visual motion perception. Journal of Vision. Cited by: §1.
  • J. Carreira and A. Zisserman (2017) Quo vadis, action recognition? A new model and the kinetics dataset. In CVPR, Cited by: §B.1, Table 6, §4.1, Table 3, Table 4.
  • A. Chattopadhyay, A. Sarkar, P. Howlader, and V. N. Balasubramanian (2017) Grad-CAM++: improved visual explanations for deep convolutional networks. arXiv:1710.11063. Cited by: Figure 4, §4.4.
  • T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. In ICML, Cited by: §1.
  • S. Chopra, R. Hadsell, and Y. LeCun (2005) Learning a similarity metric discriminatively, with application to face verification. In CVPR, Cited by: §1, §3.2.
  • N. Crasto, P. Weinzaepfel, K. Alahari, and C. Schmid (2019) MARS: motion-augmented RGB stream for action recognition. In CVPR, Cited by: Table 3, Table 4.
  • A. Diba, V. Sharma, L. V. Gool, and R. Stiefelhagen (2019) DynamoNet: dynamic action and motion network. In ICCV, Cited by: §B.1, Table 7, Appendix D, §1, §2, §4.2, Table 3, Table 4.
  • L. Fan, W. Huang, C. Gan, S. Ermon, B. Gong, and J. Huang (2018) End-to-end learning of motion representation for video understanding. In CVPR, Cited by: §2.
  • C. Feichtenhofer, H. Fan, J. Malik, and K. He (2019) SlowFast networks for video recognition. In ICCV, Cited by: §2, §4.1, Table 3.
  • C. Feichtenhofer, A. Pinz, and A. Zisserman (2016) Convolutional two-stream network fusion for video action recognition. In CVPR, Cited by: §1.
  • C. Feichtenhofer, A. Pinz, and A. Zisserman (2017) Detect to track and track to detect. In ICCV, Cited by: §1.
  • B. Fernando, H. Bilen, E. Gavves, and S. Gould (2017) Self-supervised video representation learning with odd-one-out networks. In CVPR, Cited by: Table 7, Appendix D, §2.
  • M. A. Giese and T. Poggio (2003) Neural mechanisms for the recognition of biological movements. Nature Reviews Neuroscience. Cited by: §1.
  • P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He (2017a) Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv:1706.02677. Cited by: §B.2, §4.1.
  • R. Goyal, S. E. Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel, I. Fruend, P. Yianilos, and M. Mueller-Freitag (2017b) The “something something” video database for learning and evaluating visual common sense. In ICCV, Cited by: §B.1, §4.1.
  • C. Gu, C. Sun, S. Vijayanarasimhan, C. Pantofaru, D. A. Ross, G. Toderici, Y. Li, S. Ricco, R. Sukthankar, and C. Schmid (2018) AVA: a video dataset of spatio-temporally localized atomic visual actions. In CVPR, Cited by: §1.
  • M. Gutmann and A. Hyvärinen (2010) Noise-contrastive estimation: a new estimation principle for unnormalized statistical models. In AISTATS, Cited by: §3.2.
  • R. Hadsell, S. Chopra, and Y. LeCun (2006) Dimensionality reduction by learning an invariant mapping. In CVPR, Cited by: §1, §3.2.
  • T. Han, W. Xie, and A. Zisserman (2019) Video representation learning by dense predictive coding. In ICCV Workshop, Cited by: §A.2, Table 7, Appendix D, §1, §2, §3.2.
  • K. Hara, H. Kataoka, and Y. Satoh (2018) Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet?. In CVPR, Cited by: §2.
  • K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020) Momentum contrast for unsupervised visual representation learning. In CVPR, Cited by: §4.1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §A.3, §1.
  • A. Hosni, C. Rhemann, M. Bleyer, C. Rother, and M. Gelautz (2012) Fast cost-volume filtering for visual correspondence and beyond. TPAMI. Cited by: §3.2.
  • S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML, Cited by: §A.3.
  • J. Janai, F. Guney, A. Ranjan, M. Black, and A. Geiger (2018) Unsupervised learning of multi-frame optical flow with occlusions. In ECCV, Cited by: §2.
  • J. Y. Jason, A. W. Harley, and K. G. Derpanis (2016) Back to basics: unsupervised learning of optical flow via brightness constancy and motion smoothness. In ECCV, Cited by: §1, §3.3.
  • B. Jiang, M. Wang, W. Gan, W. Wu, and J. Yan (2019) STM: spatiotemporal and motion encoding for action recognition. In ICCV, Cited by: §2, Table 3, Table 4.
  • D. Kim, D. Cho, and I. Kweon (2019) Self-supervised video representation learning with space-time cubic puzzles. In AAAI, Cited by: Table 7, Appendix D.
  • H. Lee, J. Huang, M. Singh, and M. Yang (2017) Unsupervised representation learning by sorting sequences. In ICCV, Cited by: Table 7, Appendix D, §2.
  • J. Lin, C. Gan, and S. Han (2019) TSM: temporal shift module for efficient video understanding. In ICCV, Cited by: §4.3, Table 3.
  • C. Liu (2009) Beyond pixels: exploring new representations and applications for motion analysis. Ph.D. Thesis, MIT. Cited by: Figure 3.
  • G. Long, L. Kneip, J. M. Alvarez, H. Li, X. Zhang, and Q. Yu (2016) Learning image matching by simply watching video. In ECCV, Cited by: §2.
  • I. Loshchilov and F. Hutter (2017) SGDR: stochastic gradient descent with warm restarts. In ICLR, Cited by: §B.2.
  • W. Lotter, G. Kreiman, and D. Cox (2017) Deep predictive coding networks for video prediction and unsupervised learning. In ICLR, Cited by: §2.
  • Z. Lu, V. Rathod, R. Votel, and J. Huang (2020) RetinaTrack: online single stage joint detection and tracking. In CVPR, Cited by: §1.
  • B. Mahasseni, X. Yang, P. Molchanov, and J. Kautz (2018) Budget-aware activity detection with a recurrent policy network. In BMVC, Cited by: §2.
  • M. Mathieu, C. Couprie, and Y. LeCun (2016) Deep multi-scale video prediction beyond mean square error. In ICLR, Cited by: §2.
  • I. Misra, C. L. Zitnick, and M. Hebert (2016) Shuffle and learn: unsupervised learning using temporal order verification. In ECCV, Cited by: Table 7, Appendix D, §2.
  • Y. Ng, J. Choi, J. Neumann, and L. Davis (2018) ActionFlowNet: learning motion representation for action recognition. In WACV, Cited by: §B.1, Table 7, Appendix D, §1, §1, §2, §4.2.
  • Y. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici (2015) Beyond short snippets: deep networks for video classification. In CVPR, Cited by: §2.
  • S. Niklaus and F. Liu (2018) Context-aware synthesis for video frame interpolation. In CVPR, Cited by: §2.
  • A. Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv:1807.03748. Cited by: §1, §3.2.
  • A. Piergiovanni and M. S. Ryoo (2019) Representation flow for action recognition. In CVPR, Cited by: §2, Table 3.
  • Z. Qiu, T. Yao, C. Ngo, X. Tian, and T. Mei (2019) Learning spatio-temporal representation with local and global diffusion. In CVPR, Cited by: §B.1.
  • Z. Ren, J. Yan, B. Ni, B. Liu, X. Yang, and H. Zha (2017) Unsupervised deep learning for optical flow estimation. In AAAI, Cited by: §1, §3.3.
  • L. Sevilla-Lara, S. Zha, Z. Yan, V. Goswami, M. Feiszli, and L. Torresani (2019) Only time can tell: discovering temporal data for temporal modeling. arXiv:1907.08340. Cited by: §1.
  • K. Simonyan and A. Zisserman (2014) Two-stream convolutional networks for action recognition in videos. In NeurIPS, Cited by: §1, §1, §3.4.
  • K. Soomro, A. R. Zamir, and M. Shah (2012) UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv:1212.0402. Cited by: §B.1, §4.1.
  • N. Srivastava, E. Mansimov, and R. Salakhutdinov (2015) Unsupervised learning of video representations using LSTMs. In ICML, Cited by: §2.
  • J. C. Stroud, D. A. Ross, C. Sun, J. Deng, and R. Sukthankar (2020) D3D: distilled 3D networks for video action recognition. In WACV, Cited by: §B.1, §1, §4.2, Table 3, Table 4.
  • D. Sun, S. Roth, and M. J. Black (2014) A quantitative analysis of current practices in optical flow estimation and the principles behind them. IJCV. Cited by: §3.3.
  • D. Sun, X. Yang, M. Liu, and J. Kautz (2018) PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In CVPR, Cited by: §3.2, §3.3.
  • D. Sun, X. Yang, M. Liu, and J. Kautz (2019) Models matter, so does training: an empirical study of CNNs for optical flow estimation. TPAMI. Cited by: §3.2.
  • D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri (2018) A closer look at spatiotemporal convolutions for action recognition. In CVPR, Cited by: §A.1, Table 6, §4.1, Table 3, Table 4.
  • C. Vondrick, H. Pirsiavash, and A. Torralba (2016) Anticipating visual representations from unlabeled video. In CVPR, Cited by: §2.
  • H. Wang and C. Schmid (2013) Action recognition with improved trajectories. In ICCV, Cited by: §1.
  • X. Wang, R. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In CVPR, Cited by: §A.1, Table 6, §4.1, §4.3, Table 3.
  • Z. Wu, Y. Xiong, S. X. Yu, and D. Lin (2018) Unsupervised feature learning via non-parametric instance discrimination. In CVPR, Cited by: §3.2, §4.1.
  • J. Xie, R. Girshick, and A. Farhadi (2016) Deep3D: fully automatic 2D-to-3D video conversion with deep convolutional neural networks. In ECCV, Cited by: §2.
  • S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy (2018) Rethinking spatiotemporal feature learning: speed-accuracy trade-offs in video classification. In ECCV, Cited by: §A.1, Table 6, §2, Table 3, Table 4.
  • X. Yang, P. Molchanov, and J. Kautz (2016) Multilayer and multimodal fusion of deep neural networks for video classification. In ACM MM, Cited by: §2.
  • X. Yang, P. Molchanov, and J. Kautz (2018) Making convolutional networks recurrent for visual sequence learning. In CVPR, Cited by: §2.
  • X. Yang, X. Yang, M. Liu, F. Xiao, L. Davis, and J. Kautz (2019) STEP: spatio-temporal progressive learning for video action detection. In CVPR, Cited by: §1.
  • C. Zach, T. Pock, and H. Bischof (2007) A duality based approach for realtime TV-L1 optical flow. In Joint Pattern Recognition Symposium, Cited by: §2.
  • Z. Zhang and D. Tao (2012) Slow feature analysis for human action recognition. TPAMI. Cited by: §3.2.
  • Y. Zhao, Y. Xiong, and D. Lin (2018) Recognize actions by disentangling components of dynamics. In CVPR, Cited by: Table 3, Table 4.
  • B. Zhou, A. Andonian, A. Oliva, and A. Torralba (2018) Temporal relational reasoning in videos. In ECCV, Cited by: Table 3.
  • T. Zhou, S. Tulsiani, W. Sun, J. Malik, and A. A. Efros (2016) View synthesis by appearance flow. In ECCV, Cited by: §2.
  • Y. Zhu, Z. Lan, S. Newsam, and A. Hauptmann (2018) Hidden two-stream convolutional networks for action recognition. In ACCV, Cited by: §1, §1.
  • M. Zolfaghari, K. Singh, and T. Brox (2018) Eco: efficient convolutional network for online video understanding. In ECCV, Cited by: Table 3.


Section A introduces more details of our approach, including the architectures of backbone networks, the sampling strategy for contrastive motion learning and the prime motion block. Section B describes the benchmark datasets used in our experiments and more training details. Section C reports the computational cost of our approach. Our self-supervised learning framework can also serve as pre-training of a network, and we present the results in Section D.

Appendix A Model Details

A.1 Backbones

We adopt the standard networks R2D-50 Wang et al. (2018) and R(2+1)D-101 Tran et al. (2018) as the backbones used in our experiments. A few changes are made to improve computational efficiency, as shown in Table 5. Compared with the original R(2+1)D network Tran et al. (2018), our backbone supports a higher input resolution and applies bottleneck layers with a consistent number of channels. We start temporal striding from rather than , and employ the top-heavy design of Xie et al. (2018) for R(2+1)D, i.e., only using temporal convolutions in and .

Layer R2D-50 R(2+1)D-101 Output Size
input T 224 224
T 112 112
3 3 T 56 56
4 4 28 28
6 23 14 14
3 3 7 7
Table 5: Architectures of R2D-50 and R(2+1)D-101 used in our experiments.

A.2 Sampling Strategy for Contrastive Motion Learning

We denote the predicted motion feature at level as , where is the temporal index, and is the spatial index. The only positive pair is , which is the ground-truth feature that corresponds to the same video and is located at the same position in both space and time as the predicted one. Following the terminology used in Han et al. (2019), we define three types of negative samples for all the prediction and ground-truth pairs :

Spatial negatives are the ground-truth features that come from the same video as the predicted one but at a different spatial position, i.e., . For efficiency, we randomly sample spatial locations for each video within a mini-batch to compute the loss, so the number of spatial negatives is .

Temporal negatives are the ground-truth features that come from the same video and the same spatial position, but from different time steps, i.e., . They are the hardest negative samples to classify, and the number of temporal negatives is .

Easy negatives are the ground-truth features that come from different videos, and the number of easy negatives is , where is the batch size.
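The three negative types above can be illustrated with a short NumPy sketch. It is not the authors' implementation: the function names, the integer labels, and the dot-product InfoNCE-style loss (in the spirit of the contrastive predictive coding work cited in the paper) are assumptions made for illustration. In particular, the sketch treats every same-video feature at a different spatial position as a spatial negative regardless of its time step, which the text does not state explicitly.

```python
import numpy as np

# Hypothetical integer labels for the four pair types.
POSITIVE, SPATIAL, TEMPORAL, EASY = 0, 1, 2, 3

def pair_types(batch, steps, locs):
    """Label every (prediction, ground truth) pair; returns an (N, N) int array
    with N = batch * steps * locs."""
    vid, t, k = np.meshgrid(
        np.arange(batch), np.arange(steps), np.arange(locs), indexing="ij"
    )
    vid, t, k = vid.ravel(), t.ravel(), k.ravel()
    same_vid = vid[:, None] == vid[None, :]
    same_t = t[:, None] == t[None, :]
    same_k = k[:, None] == k[None, :]
    types = np.full((vid.size, vid.size), EASY)        # different video
    types[same_vid & ~same_k] = SPATIAL                # same video, other position
    types[same_vid & same_k & ~same_t] = TEMPORAL      # same position, other time
    types[same_vid & same_k & same_t] = POSITIVE       # the single positive
    return types

def info_nce(pred, target):
    """InfoNCE-style loss for pred, target of shape (N, C).
    The positive for row i of pred is row i of target; all other rows are negatives."""
    logits = pred @ target.T
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

# Count the pair types seen by the first prediction in a toy batch.
types = pair_types(batch=2, steps=3, locs=4)
counts = np.bincount(types[0], minlength=4)
```

Here `counts` holds, in order, the number of positive, spatial, temporal, and easy pairs for one prediction under the stated assumptions; the batch, time, and location sizes are toy values, not the paper's settings.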

Figure 5: Architecture of the prime motion block.

A.3 Prime Motion Block

Here we describe the prime motion block in more detail. As shown in Figure 5, the prime motion block is wrapped as a residual block He et al. (2016) so that the motion features can be inserted into a backbone network seamlessly. For the cost volume layer, we limit the search range with the maximum displacement and the stride , which is equivalent to covering a region of pixels with a stride 2. To combine the complementary information provided by the cost volumes and the convolutional features (after dimension reduction), we concatenate the two features along the channel dimension and then perform a 2D convolution. We use batch normalization Ioffe and Szegedy (2015) and ReLU after each convolutional layer and cost volume layer.
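As a rough illustration of the cost volume layer, the following NumPy sketch correlates the features of two frames over a limited, strided search window. The parameter names `max_disp` and `stride`, the zero padding at the borders, and the normalization by the channel count (as done in PWC-Net, which the paper cites) are assumptions; the displacement and stride values used by the authors are not recoverable from the text above, so the values in the usage lines are placeholders.

```python
import numpy as np

def cost_volume(feat_a, feat_b, max_disp, stride):
    """Correlate feat_a with shifted copies of feat_b.

    feat_a, feat_b: arrays of shape (C, H, W) from two frames.
    Returns an array of shape (D * D, H, W), where D is the number of
    offsets in range(-max_disp, max_disp + 1, stride).
    """
    _, h, w = feat_a.shape
    # Zero-pad the second feature map so every shift stays in bounds.
    padded = np.pad(feat_b, ((0, 0), (max_disp, max_disp), (max_disp, max_disp)))
    offsets = range(-max_disp, max_disp + 1, stride)
    costs = []
    for dy in offsets:
        for dx in offsets:
            shifted = padded[:, max_disp + dy : max_disp + dy + h,
                             max_disp + dx : max_disp + dx + w]
            # Channel-averaged dot product as the matching cost.
            costs.append((feat_a * shifted).mean(axis=0))
    return np.stack(costs)

# Placeholder sizes: 8 channels, 6 x 6 feature map, search range 4 with stride 2.
rng = np.random.default_rng(0)
a = rng.standard_normal((8, 6, 6))
b = rng.standard_normal((8, 6, 6))
cv = cost_volume(a, b, max_disp=4, stride=2)
```

With these placeholder settings there are 5 offsets per axis, so `cv` has 25 channels; in the block described above, such a volume would then be concatenated with the reduced convolutional features before the 2D convolution.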

Appendix B Experiment Details

B.1 Datasets

We evaluate the proposed approach on four benchmarks: Kinetics-400 Carreira and Zisserman (2017), Something-Something (V1&V2) Goyal et al. (2017b) and UCF-101 Soomro et al. (2012). Kinetics-400 is a large-scale video dataset with 400 action categories. As some videos have been deleted by their owners over time, our experiments are conducted on a subset of the original dataset with approximately 238K training videos (96%) and 19.6K validation videos (98%). In practice, we notice a slight accuracy drop for the same model when using our collected dataset with fewer training samples, suggesting that our results could be improved with the full original dataset. Something-Something (V1&V2) are more sensitive to temporal motion modeling. Something-V1 contains about 100K videos covering 174 classes, and Something-V2 increases the number of videos to 221K and improves annotation quality and video resolution. UCF-101 includes about 13K videos with 101 classes. As the number of training videos in UCF-101 is small, it is often used for evaluating unsupervised representation learning Diba et al. (2019); Ng et al. (2018) and transfer learning Qiu et al. (2019); Stroud et al. (2020).

B.2 Model Training

We use synchronized SGD with a cosine learning rate schedule Loshchilov and Hutter (2017) and a linear warm-up Goyal et al. (2017a) for all model training. The spatial size of the input is , randomly cropped and horizontally flipped from a scaled video with the shorter side randomly sampled in [256, 320] pixels for R2D-50, and [256, 340] pixels for R(2+1)D-101. We apply temporal jittering when sampling clips from a video. The balancing weights for the joint training in Eq. 5 are set to , respectively. We describe the training details for different benchmarks as follows.
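The spatial augmentation described above can be sketched as a function that only samples the augmentation parameters. The function name and return format are hypothetical, the crop size of 224 is an assumption taken from the input size listed in Table 5, and the flip probability of 0.5 is a common default rather than a value stated in the paper.

```python
import random

def sample_crop(height, width, short_side_range=(256, 320), crop=224, rng=random):
    """Sample scale, crop offset, and flip for one clip.

    Returns (new_h, new_w, top, left, flip): the scaled frame size, the
    top-left corner of the crop inside the scaled frame, and a flip flag.
    """
    # Rescale so that the shorter side has a randomly chosen length.
    short = rng.randint(*short_side_range)
    scale = short / min(height, width)
    new_h, new_w = round(height * scale), round(width * scale)
    # Pick a random crop window inside the scaled frame.
    top = rng.randint(0, new_h - crop)
    left = rng.randint(0, new_w - crop)
    flip = rng.random() < 0.5  # assumed flip probability
    return new_h, new_w, top, left, flip

# Example for a 240 x 320 source video with the R2D-50 scale range.
params = sample_crop(240, 320, rng=random.Random(0))
```

For R(2+1)D-101, the same sketch would be called with `short_side_range=(256, 340)`. The same parameters would be applied to every frame of a clip so that the crop stays temporally consistent, which is the usual practice and an assumption here.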

Kinetics-400. We sample a clip of frames with a temporal stride 2 for the experiments using the backbone R2D-50 and frames for those with the backbone R(2+1)D-101. We train all models using distributed SGD on GPU clusters with 8 clips per GPU. We set the learning rate per GPU to 0.0025, and linearly scale the learning rate according to the number of GPUs. For the self-supervised training phase, all models are trained for 80 epochs with the first 10 epochs for warm-up, and global batch normalization is used to avoid trivial solutions. For the joint training phase, the models are trained for 200 epochs with the first 40 epochs for warm-up, and BN statistics are computed within each group of 8 clips.
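The schedule described above (linear scaling with the number of GPUs, linear warm-up, then cosine decay) can be sketched as follows. This is a minimal sketch under several assumptions: the rate is updated once per epoch, warm-up ramps linearly up to the peak rate, the cosine decays to zero without restarts, and the GPU count of 8 in the usage lines is a placeholder, since the paper does not state the cluster size here.

```python
import math

def learning_rate(epoch, total_epochs, warmup_epochs, lr_per_gpu, num_gpus):
    """Learning rate for a zero-based epoch index."""
    peak = lr_per_gpu * num_gpus  # linear scaling with the number of GPUs
    if epoch < warmup_epochs:
        # Linear warm-up: reach the peak at the last warm-up epoch.
        return peak * (epoch + 1) / warmup_epochs
    # Cosine decay from the peak towards zero over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * peak * (1.0 + math.cos(math.pi * progress))

# Self-supervised phase on Kinetics-400: 80 epochs, 10 warm-up epochs,
# 0.0025 per GPU; num_gpus=8 is a hypothetical cluster size.
schedule = [learning_rate(e, 80, 10, 0.0025, 8) for e in range(80)]
```

For the joint training phase, the same function would be called with 200 total epochs and 40 warm-up epochs.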

Something-V1&V2. Since this dataset has a higher frame rate than Kinetics-400, we sample a clip of frames with a temporal stride 1 for all experiments. Models are trained for 150 epochs with the first 50 epochs for warm-up and the learning rate per GPU is also 0.0025.

UCF-101. For the experiments described in Table 4 in the paper, the models are initialized with the weights pre-trained on Kinetics-400 for classification, and then are fine-tuned for 30 epochs with a batch size of 32 and a learning rate of 0.002.

Appendix C Computational Cost

Our motion representation learning module can be plugged into a standard network and introduces only a small computational overhead to the backbone. We compare the computational cost in GFLOPs of our models with the state-of-the-art methods in Table 6. As the computational cost is highly affected by the backbone architecture, we only show methods that use backbone networks similar to ours for a fair comparison. Our models achieve superior results while maintaining a low inference-time cost, especially compared with the methods using two-stream networks.

Methods | Flow | GFLOPs × Crops | Top-1 | Top-5
I3D Carreira and Zisserman (2017) |  | 108 × N/A | 72.1 | 90.3
I3D Carreira and Zisserman (2017) |  | 216 × N/A | 75.7 | 92.0
S3D-G Xie et al. (2018) |  | 143 × N/A | 77.2 | 93.0
NL I3D-50 Wang et al. (2018) |  | 282 × 30 | 76.5 | 92.6
NL I3D-101 Wang et al. (2018) |  | 359 × 30 | 77.7 | 93.3
R(2+1)D Tran et al. (2018) |  | 152 × 115 | 74.3 | 91.4
R(2+1)D Tran et al. (2018) |  | 304 × 115 | 75.4 | 91.9
Ours, R2D |  | 49 × 30 | 74.8 | 91.6
Ours, R(2+1)D |  | 150 × 30 | 78.3 | 93.3
Table 6: Comparison of the computational cost and the top-1 / top-5 accuracy on Kinetics-400.
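To make the inference-time comparison concrete, the per-video cost can be estimated by multiplying the two cost figures in Table 6. This sketch assumes that the two numeric cost columns give the GFLOPs for a single crop and the number of crops evaluated per video; rows whose crop count is listed as N/A are omitted, and the row labels are shortened for illustration.

```python
# (method, GFLOPs per crop, number of crops), copied from Table 6.
rows = [
    ("NL I3D-50", 282, 30),
    ("NL I3D-101", 359, 30),
    ("R(2+1)D", 152, 115),
    ("R(2+1)D, second row", 304, 115),
    ("Ours, R2D", 49, 30),
    ("Ours, R(2+1)D", 150, 30),
]

def total_gflops(rows):
    """Total GFLOPs per video: per-crop cost times the number of crops."""
    return {name: gflops * crops for name, gflops, crops in rows}

totals = total_gflops(rows)
for name, total in totals.items():
    print(f"{name}: {total} GFLOPs per video")
```

Under this reading, the R(2+1)D-based model of this paper costs 150 × 30 = 4500 GFLOPs per video, compared with 152 × 115 = 17480 GFLOPs for the first R(2+1)D row.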

Appendix D Self-Supervised Pre-Training

Self-supervised learning of video representations without a large amount of labeled data has been gaining increasing attention in recent years Misra et al. (2016); Fernando et al. (2017); Lee et al. (2017); Kim et al. (2019); Han et al. (2019). In addition to hierarchical motion learning, our multi-level self-supervised learning can also serve as pre-training for a network. As an example, after self-supervised pre-training on Kinetics-400, we fine-tune the network on UCF-101 for 80 epochs with a batch size of 16 and a learning rate of 0.01. We compare our approach with other state-of-the-art self-supervised methods in Table 7. Interestingly, even though it is not specifically designed for network pre-training, our approach is capable of learning effective video representations that generalize well to the small dataset. Note that previous work requires optical flow Ng et al. (2018) or a much larger pre-training dataset (YouTube8M) Diba et al. (2019) to achieve state-of-the-art results.

Method | Architecture | Dataset | Flow | Accuracy
Shuffle and Learn Misra et al. (2016) | CaffeNet | UCF-101 |  | 50.2
OPN Lee et al. (2017) | VGG-M-2048 | UCF-101 |  | 59.8
Odd-one-out Fernando et al. (2017) | AlexNet | UCF-101 |  | 60.3
ActionFlowNet Ng et al. (2018) | 3DResNet-18 | UCF-101 |  | 83.9
3D-Puzzle Kim et al. (2019) | 3DResNet-18 | Kinetics-400 |  | 65.8
DPC Han et al. (2019) | 3DResNet-34 | Kinetics-400 |  | 75.7
DynamoNet Diba et al. (2019) | STC-ResNet-101 | YouTube8M |  | 88.1
Ours (random init.) | R2D-50 |  |  | 69.9
Ours | R2D-50 | Kinetics-400 |  | 82.2
Ours (random init.) | R(2+1)D-101 |  |  | 70.9
Ours | R(2+1)D-101 | Kinetics-400 |  | 85.1
Table 7: Comparison with self-supervised methods on UCF-101 (split-1).