Exploring Relations in Untrimmed Videos for Self-Supervised Learning

08/06/2020 ∙ by Dezhao Luo, et al. ∙ 0

Existing video self-supervised learning methods mainly rely on trimmed videos for model training. However, trimmed datasets are manually annotated from untrimmed videos, so in this sense these methods are not truly self-supervised. In this paper, we propose a novel self-supervised method, referred to as Exploring Relations in Untrimmed Videos (ERUV), which can be applied directly to untrimmed videos (which are genuinely unlabeled) to learn spatio-temporal features. ERUV first generates single-shot videos by shot change detection. A designed sampling strategy is then used to model relations between video clips; the strategy itself serves as the self-supervision signal. Finally, the network learns representations by predicting the category of the relation between two video clips. ERUV thus learns to compare the differences and similarities of videos, which is an essential capability for action and video related tasks. We validate our learned models on action recognition and video retrieval tasks with three kinds of 3D CNNs. Experimental results show that ERUV learns richer representations and outperforms state-of-the-art self-supervised methods by significant margins.




I Introduction

Convolutional Neural Networks (CNNs) have achieved great success in the computer vision field, especially for image-related tasks [14]. In general, CNNs are trained with large-scale labeled image datasets such as ImageNet [22]. When transferred to downstream tasks (e.g., object detection, instance segmentation), the pre-trained models can boost performance because of their stronger feature extraction capabilities. However, manual annotation of large-scale datasets is time-consuming and expensive, particularly for video-related tasks.

Fig. 1: Illustration of the necessity of designing proxy tasks for untrimmed videos. Previous methods may be confused by shot changes in untrimmed videos, while ERUV can take advantage of them.

Self-supervised representation learning, which extracts the supervisory signal from raw unlabeled data automatically as the learning target, has attracted unprecedented attention in recent years. First, labels are generated from the raw unlabeled data by pre-processing. Then, the labels together with the raw data are input to the network for model training. In this manner, models are trained to learn representations without human annotations. The trained models are then used to promote downstream tasks.

In previous studies on self-supervised learning for images, the relative location of image patches [3, 20] or image color [15] is used as the supervision signal. Recently, many approaches have been proposed for video self-supervised learning: [19, 16, 6, 31, 7, 1] aim to learn representations for 2D CNNs. However, state-of-the-art performance on video-related tasks is mostly achieved with 3D CNNs [25, 26, 4]. To better represent the spatio-temporal dynamics of videos with 3D CNNs in a self-supervised manner, 3D video cubic puzzles [13], video motion and appearance statistics [28], video clip orders [32] and video cloze procedures [18] have been used as supervisory signals.

Existing video self-supervised methods learn spatio-temporal representations from trimmed video datasets (e.g., UCF101 [24], HMDB51 [9], and Kinetics [12]). Nevertheless, trimmed video datasets are not truly unlabeled, because the start and end frames of action instances are annotated manually. As shown in Fig. 1, genuine untrimmed videos may contain both action foreground frames and background frames, and may include multi-view camera shots of the same action instance. As a result, previous self-supervised learning methods for 3D CNNs are invalid on untrimmed videos. Inspired by work exploring the relations of objects in object recognition [5, 8], we argue that the relations between actions in different shots of untrimmed videos can supply informative supervisory signals for self-supervised representation learning.

In this paper, we propose a novel self-supervised representation learning approach referred to as Exploring Relations in Untrimmed Videos (ERUV), which targets learning representations by comparing differences and similarities between video clips. In ERUV, we generate video clips with a designed sampling strategy to model different relations, then train a 3D CNN to identify the categories of these relations. This mechanism was explored by [8] for image object detection, while we extend it in a self-supervised manner to model the relations between video clips. Moreover, modeling rich and complicated relations of videos can promote the network's spatio-temporal representation capability.

Specifically, ERUV consists of three components: shot editing, relation modeling and video comparing. The first component generates single-shot videos with shot change detection, since single-shot videos focus on consistent motions. The second component strengthens representation capability by modeling cooccurrence and relevance relations between sampled video clips. Finally, the video comparing component learns representations by predicting the categories of the relations.

The contributions of this work include:

  • We propose ERUV to capture video appearance and temporal representations. To the best of our knowledge, this is the first self-supervised representation learning work utilizing untrimmed video datasets, making ERUV a genuinely self-supervised video representation learning method.

  • We propose a novel feature learning strategy. By modeling relations between video clips, we can integrate current self-supervised methods with our designed relations.

  • Extensive experimental results demonstrate that the trained networks learn rich spatio-temporal representations, and the proposed method outperforms other recently proposed self-supervised learning methods considerably.

Fig. 2: Illustration of the ERUV framework. Given an untrimmed video, ERUV first generates video segments by shot change detection and long-duration shot breakdown. Relation modeling is then used to sample video segments with different relations, and the sampling strategy is stored as the label. Finally, the clips extracted from the video segments are fed to 3D models for spatio-temporal representation learning.

II Related Work

The most relevant works to ours are video action recognition and self-supervised representation learning, which are introduced in the following two subsections respectively.

II-A Video Action Recognition

Great progress has been achieved in video action recognition with deep neural networks.

Earlier studies treat videos as sequences of frames and apply 2D CNNs to extract features. [23] proposes two-stream convolutional networks in which RGB frames and optical flows are processed by 2D CNNs. The temporal segment network [30] is based on a sparse temporal sampling strategy and enables efficient and effective learning with video-level supervision of the whole action. In [35], temporal relation reasoning is proposed to learn and reason about temporal dependencies among video frames at different time scales.

Recently, methods based on 3D CNNs have become mainstream because they model spatial and temporal features simultaneously. C3D [25] extends 2D convolution kernels to 3D kernels to model temporal features across frames. [26] proposes R3D and R(2+1)D: R3D is an extension of ResNet to 3D, while in R(2+1)D, the 3D convolution kernels are decomposed into a spatial convolution and a temporal convolution. SlowFast [4] uses a slow pathway with a low frame rate to model spatial semantics and a fast pathway with a high frame rate to capture temporal motion information.

II-B Self-Supervised Representation Learning

Self-supervised learning extracts information from unlabeled data to train models. Previous methods usually encourage models to learn rich representations by predicting information hidden within unannotated data; afterwards, the learned models can be used to promote the performance of downstream tasks. Recently, self-supervised learning methods whose supervision signal is obtained automatically from unlabeled images or videos have attracted much attention.

II-B1 Image Representation Learning

To produce supervision signals for images, spatial transforms are usually applied to pre-process the unlabeled images [15, 17, 21]. For instance, [20, 3] leverage image information by predicting the relative positions of image patches. [2] takes image color as the label, encouraging the network to learn statistical features via an image colorization task.

II-B2 Video Representation Learning

Generally, in self-supervised video representation learning, the supervisory signal is generated automatically from unlabeled videos without manual intervention.

Early approaches mostly focus on self-supervised learning for 2D CNNs. [19, 16] utilize the order of frames as the supervision signal. [6] proposes an odd-one-out network to identify the unrelated clip within a set of video clips.

[31] exploits the arrow of time as a supervisory signal. [7] extracts pixel-wise geometry information as flow fields and disparity maps and uses them as auxiliary supervision.

Recently, several 3D CNN self-supervised learning methods have been proposed to learn complicated spatial and temporal representations [34, 27]. [13] proposes a video representation learning method for 3D CNNs based on solving 3D video cubic puzzles. [11] proposes 3DRotNet, in which the supervisory signals are rotation angles. [28] proposes to learn 3D CNN representations by predicting the motion and appearance statistics of unlabeled videos. In [32], the order of video clips is used as the supervisory signal for training 3D CNNs. [18] proposes to complete a video cloze procedure for representation learning.

Despite the effectiveness of existing methods, they are all based on trimmed video datasets such as UCF101 [24], which are not truly unlabeled. In real applications, however, videos are untrimmed, and previous self-supervised learning methods are invalid on them, as shown in Fig. 1. Therefore, we argue that it is of great significance to develop a self-supervised learning method that trains 3D CNNs with untrimmed video datasets.

III Exploring Relations in Untrimmed Videos

Modeling relations between objects helps object recognition [5]: specifically, one predicts how likely two object classes are to appear in the same image. Modeling relations means comparing the differences and similarities between objects, which can boost the feature extraction capability of the network. Motivated by this success, we propose a novel representation learning method referred to as ERUV.

ERUV consists of three components, shot editing, relation modeling, and video comparing, as shown in Fig. 2. In shot editing, the untrimmed videos are cut into single-shot videos based on shot change detection, and long-duration shots are further cut into several video segments. Whether two video segments come from the same untrimmed video or the same shot is recorded for generating the supervisory signals. In relation modeling, clips are sampled from the video segments with designed sampling strategies so as to generate video clips with different relations. In video comparing, we use 3D CNNs to extract spatio-temporal representations for the sampled clips; the extracted features are then concatenated and fed to a fully-connected layer to predict the relation category.

III-A Shot Editing

Different from trimmed videos, an untrimmed video often exhibits extremely complex dynamics. It may consist of both action foreground frames and action background frames, and the action foreground frames may only occupy small portions of the whole video sequence. However, the network needs to learn spatio-temporal features of continuous motion patterns, in which no shot change should exist. In order to generate video clips focusing on a continuous motion pattern, we first edit videos with shot-based processing.

Given an untrimmed video V from the dataset, we take V as a sequence of frames {f_1, f_2, ..., f_N}, where N is the total number of frames in V. HOG features are first extracted for each frame, and the HOG feature difference between each pair of adjacent frames f_t and f_{t+1} is calculated as a metric of the change in frame appearance. If the absolute value of the HOG difference is larger than a given threshold, we mark a shot change between f_t and f_{t+1}. After shot change detection, to break down long-duration actions, the shots are further cut into short video segments with a fixed length of l frames. Suppose a shot spans frames [b, e], where b and e denote its beginning and ending locations; we then produce consecutive l-frame video segments starting at b, b+l, b+2l, and so on until e. Finally, all segments from different shots are merged into a set G = {g_1, g_2, ..., g_M}, where M is the total number of video segments. We allocate a shot id and a video id to each video segment to mark the source shot and the source untrimmed video it was extracted from.

Thus, we obtain single-shot video segments with continuous motion patterns without human annotation. Note that we only clip the untrimmed videos with automatic shot change detection: since we neither discard background frames nor break down actions at their temporal borders, the resulting segment set differs from human-trimmed videos.
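As an illustration of the shot editing step, the sketch below marks a shot change wherever an appearance difference between adjacent frames exceeds a threshold, then cuts each shot into fixed-length segments. It is a minimal approximation: a grayscale-histogram distance stands in for the paper's HOG feature difference, and the function names and threshold value are illustrative, not taken from the authors' code.

```python
import numpy as np

def detect_shot_changes(frames, threshold=0.5):
    """Mark a shot boundary wherever the appearance difference between
    adjacent frames exceeds `threshold`.  A normalized grayscale-histogram
    distance stands in for the paper's HOG-feature difference."""
    boundaries = []
    prev_hist = None
    for i, frame in enumerate(frames):
        hist, _ = np.histogram(frame, bins=16, range=(0, 256))
        hist = hist / hist.sum()
        if prev_hist is not None and np.abs(hist - prev_hist).sum() > threshold:
            boundaries.append(i)  # shot change between frame i-1 and i
        prev_hist = hist
    return boundaries

def split_into_segments(n_frames, boundaries, seg_len=300, min_len=48):
    """Cut each detected shot into fixed-length segments (300 frames in the
    paper) and drop segments shorter than `min_len` (48 in the paper)."""
    starts = [0] + boundaries
    ends = boundaries + [n_frames]
    segments = []
    for b, e in zip(starts, ends):
        for s in range(b, e, seg_len):
            if min(s + seg_len, e) - s >= min_len:
                segments.append((s, min(s + seg_len, e)))
    return segments
```

Each resulting `(start, end)` pair would then be tagged with its shot id and video id, as described above.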

III-B Relation Modeling

In order to train the network to compare differences and similarities between video clips and thereby learn spatio-temporal features, we design spatial cooccurrence and temporal pattern relations for self-supervised learning, corresponding to the learning of spatial and temporal features respectively. The relations designed for self-supervised learning should be simple yet effective, so that the model is able to learn rich representations. As shown in Fig. 2, we design 7 kinds of relations: shot cooccurrence, video cooccurrence, dataset cooccurrence and rotation cooccurrence for spatial learning; and inverted pattern, disorder pattern and sped-up pattern for temporal learning.

ERUV takes two video segments as input and outputs the predicted relation category. The segments, denoted v_a and v_b, are sampled from the segment set, and the relation between v_a and v_b is stored as the label. In our implementation, v_a is randomly selected at each training step, while v_b is sampled according to its relation with v_a. The relation details and the corresponding sampling strategies for v_b are described as follows.

III-B1 Spatial Cooccurrence Relation

To provide relations that focus on spatial representation learning, we introduce cooccurrence relations. ERUV can predict a specific cooccurrence relation by measuring the spatial similarity between the inputs. In order to generate videos of different apparent similarity, we propose shot cooccurrence, video cooccurrence, dataset cooccurrence and rotation cooccurrence.

Shot cooccurrence denotes that the actions in v_a and v_b occur in the same shot, which means their spatial features are almost the same. Given a video segment v_a during training, ERUV randomly samples v_b from the video segments labeled as extracted from the same shot as v_a.

Video cooccurrence denotes that the actions in v_a and v_b occur in the same video but in different shots; they may describe the same action captured by different cameras from different angles, or two semantically related actions.

For dataset cooccurrence, v_a and v_b are taken from different untrimmed videos, which means the actions in them do not occur in the same untrimmed video.

For rotation cooccurrence, v_a is randomly rotated by 90, 180, or 270 degrees to generate v_b, so that the model is forced to learn orientation-related features.

It is worth noting that predicting cooccurrence relations between video segments requires high-level semantic appearance information; understanding the structure of objects or colors alone is not enough to tackle this task.
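A minimal sketch of how the four spatial cooccurrence relations could drive pair sampling is given below, assuming each segment carries the shot id and video id recorded during shot editing. The dictionary fields and function name are hypothetical, not from the paper's implementation.

```python
import random
import numpy as np

def sample_spatial_pair(segments, relation):
    """Pick a pair of clips exhibiting one spatial cooccurrence relation.
    `segments` is a list of dicts {'clip': ndarray(T, H, W), 'shot': int,
    'video': int}; field names are illustrative."""
    a = random.choice(segments)
    if relation == 'shot':       # same shot as the anchor
        pool = [s for s in segments if s['shot'] == a['shot']]
    elif relation == 'video':    # same untrimmed video, different shot
        pool = [s for s in segments
                if s['video'] == a['video'] and s['shot'] != a['shot']]
    elif relation == 'dataset':  # a different untrimmed video
        pool = [s for s in segments if s['video'] != a['video']]
    elif relation == 'rotation': # the anchor rotated by 90/180/270 degrees
        k = random.choice([1, 2, 3])
        return a['clip'], np.rot90(a['clip'], k=k, axes=(1, 2))
    else:
        raise ValueError(relation)
    return a['clip'], random.choice(pool)['clip']
```

In practice the candidate pools would be indexed by shot and video id rather than scanned per step; the linear scans above are only for clarity.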

III-B2 Temporal Pattern Relation

To provide relations that focus on temporal features, we further introduce three kinds of temporal pattern relations, in which the video clips have similar appearance but different temporal patterns of action: the inverted pattern, the disorder pattern and the sped-up pattern. Given a video segment v_a, we apply a temporal transformation to it to generate v_b. The transformation corresponding to each relation is described as follows.

For the inverted pattern, v_b is a temporally inverted version of v_a. For the disorder pattern, v_b is a disordered version of v_a, generated by randomly shuffling the frames of v_a. For the sped-up pattern, v_b is a fast-forward version of v_a: we apply uniform sampling with a fixed interval on v_a, which we call dilated sampling, producing v_b with a fast-forward playback rate. In our implementation, the interval can be 2 or 4. Fig. 2 shows an example. To distinguish the minor differences between temporal patterns, the network has to learn temporal representations.
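The three temporal transformations can be sketched in a few lines, assuming a clip is an array of frames; the function name and argument names are illustrative.

```python
import numpy as np

def temporal_transform(clip, relation, interval=2):
    """Generate v_b from v_a (an array of shape (T, H, W, C)) for the three
    temporal pattern relations.  In practice the sped-up clip would be drawn
    from a longer window so the frame count is preserved; here we simply
    subsample for illustration."""
    if relation == 'invert':    # play the clip backwards
        return clip[::-1]
    if relation == 'disorder':  # randomly shuffle the frame order
        return clip[np.random.permutation(len(clip))]
    if relation == 'speed':     # dilated sampling with interval 2 or 4
        return clip[::interval]
    raise ValueError(relation)
```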

Fig. 3: Illustration of C3D, R3D, and R(2+1)D blocks.

III-C Video Comparing

Given v_a, v_b and their relation from the previous section, ERUV randomly extracts a 16-frame clip from each video segment as the learning sample fed to the backbones.

To learn a feature representation from video comparing, we cast it as a classification task and use a simple siamese network. This network has two parallel stacks of layers with shared parameters. Each stack takes C3D [25], R3D or R(2+1)D [26] as its backbone to extract both spatial and temporal features, taking a video clip as input and producing a representation as output.

The structure of the backbones is shown in Fig. 3. While 2D CNNs capture spatial information, C3D extends them for spatio-temporal representation learning, as it can also model the temporal information of videos; it stacks five C3D blocks, each consisting of a classic 3D convolution. R3D is an extension of C3D with residual connections: an R3D block consists of two C3D blocks whose input and output are connected by a residual unit. The R(2+1)D block differs from the R3D block in that the 3D convolution is decomposed into a spatial 2D convolution and a temporal 1D convolution.

The features extracted from the 3D backbones are concatenated and fed to a linear classification layer, whose output is a probability distribution over the different relations. With z_i the i-th output of the fully-connected layer over the K relations, the probabilities are:

p_i = exp(z_i) / Σ_{j=1}^{K} exp(z_j),

where p_i is the probability that the relation belongs to class i and K is the number of relations. We update the parameters of the network by minimizing the regularized cross-entropy loss of the predictions:

L = − Σ_{i=1}^{K} y_i log p_i,

where y is the one-hot ground truth.
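In plain code, the classification head described above amounts to a softmax over the relation logits trained with cross-entropy; a minimal numpy sketch (omitting the regularization term) is:

```python
import numpy as np

def softmax(z):
    """Convert relation logits into a probability distribution."""
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(logits, label):
    """Loss for one training pair: `logits` are the K relation scores,
    `label` is the index of the ground-truth relation."""
    return -np.log(softmax(logits)[label])
```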

While the network uses two clips at training time, at test time we can obtain 3D CNN representations of a single clip using just one stack, because the parameters are shared across stacks.

Fig. 4: Illustration of examples from high jump. In (1) all videos are background; in (2) all are foreground. ERUV is able to learn even when all videos come from the same category, whether they are background or foreground.

III-D Discussion

ERUV contributes to the domain of video self-supervised learning in two aspects. First, compared with previous methods [32, 18], we propose novel and effective relations on videos. The cooccurrence relations encourage the models not only to understand internal spatial features but also to learn discriminative features by predicting spatial relations, which requires rich spatial semantic features.

Second, ERUV targets learning features from untrimmed videos, which contain many background clips. As shown in Fig. 4, even if all videos come from the same category, the cooccurrence relations can still be classified from their moving objects or scenes. Similarly, the pattern relations remain helpful when the videos are background, because their purpose is to build the capacity to extract temporal features rather than to learn specific action patterns.

IV Experiment

In this section, we demonstrate the effectiveness of ERUV. First, we elaborate the experimental settings. Second, we conduct ablation studies to quantify the contributions. Third, we compare the representations of the network with other approaches and visualize them for clarity. Finally, we use our method as a self-supervised approach to initialize models for action recognition and video retrieval and compare it with state-of-the-art methods.

IV-A Experimental Settings

IV-A1 Datasets

We pre-train our method on one large-scale dataset, Thumos14 [10]. This dataset is suitable for our method because it provides the original untrimmed videos. Thumos14 has 101 classes for action recognition and 20 classes for action detection, and is composed of four parts: training data, validation data, testing data, and background data. To verify the effectiveness of our method, we use the validation data (1010 videos) to train ERUV.

The pre-trained networks are fine-tuned on the UCF101 and HMDB51 datasets to evaluate their performance. UCF101 consists of 101 action categories with about 9.5k videos for training and 3.5k for testing; it exhibits challenging problems including intra-class variance of actions, complex camera motions, and cluttered backgrounds. HMDB51 consists of 51 action categories with about 3.4k videos for training and 1.4k for testing. Its videos are mainly collected from movies and websites including the Prelinger archive, YouTube, and Google Videos.

IV-A2 Network Architecture

For video representation extraction, we choose C3D, R3D, and R(2+1)D as backbones in ERUV. C3D stacks five 3D convolution blocks, each consisting of a classic 3D convolution with kernel size 3×3×3 followed by a batch normalization layer and a ReLU layer. An R3D block consists of two 3D convolution layers followed by batch normalization and ReLU layers, with the input and output connected by a residual unit. R(2+1)D decomposes the 3D kernel into a spatial 1×3×3 kernel and a temporal 3×1×1 kernel.

IV-A3 Implementation Details

The validation set of Thumos14 is used to train our self-supervised learning method, while UCF101 and HMDB51 are used to validate the effectiveness of ERUV.

For video editing, we follow the settings in [29] for shot change detection, and long-duration shots are broken down with a fixed length of 300 frames to generate the segment set. Segments shorter than 48 frames are discarded. The length of each clip is set to 16, corresponding to the input of most 3D CNNs. Each frame is resized and then randomly cropped to the input resolution of the backbone. We set the initial learning rate to 0.01 and momentum to 0.9. Pre-training stops after 300 epochs, and the model with the best validation accuracy is used for downstream tasks.

Method Overall(%) Shot(%) Video(%) Dataset(%) Rotation(%) Inverted(%) Disorder(%) Sped-up(%)
C3D 66.7 71.2 89.3 84.5 76.2 50.5 44.3 48.5
TABLE I: Accuracy of relation classification for each of the seven relations: shot, video, dataset and rotation cooccurrence, and the inverted, disorder and sped-up patterns.
Method UCF101 acc(%)
Random initialization 62.1
ERUV with 65.0
ERUV with 66.3
ERUV with 67.9
ERUV with 68.8
ERUV with 66.0
ERUV with 65.0
ERUV with 65.4
ERUV with 66.4
ERUV with 70.4
TABLE II: Ablation study of cooccurrence and pattern relations. ERUV is first pre-trained on Thumos14 and then fine-tuned for action recognition on UCF101. The figures are action recognition accuracy.

IV-B Ablation Study

In this section, we evaluate the effect of our designed relations on the first split of the UCF101. We first perform self-supervised pre-training using the validation data in Thumos14. The learned weights are then used as initialization for the supervised action recognition task.

Table I shows the results on Thumos14, trained and evaluated on the validation data. ERUV achieves 66.7% overall accuracy; for shot cooccurrence, video cooccurrence, dataset cooccurrence and rotation cooccurrence, ERUV achieves 71.2%, 89.3%, 84.5%, and 76.2% accuracy respectively. Considering that the accuracy of random guessing for this task is 14.3% (7 relations), the framework indeed learns to analyze the content of clips, and the designed relations are plausible.

As shown in Table II, to clearly show the effect of relations for representation learning, we conduct ablation experiments on ERUV with various relations for action recognition.

It can be seen that pre-training with individual cooccurrence relations improves action recognition accuracy over the baseline (random initialization) by 2.9%, 4.2% or 5.8%, and using all cooccurrence relations raises the performance further to 68.8%. Pre-training with pattern relations alone improves the performance by up to 3.9%, and using all pattern relations raises it to 66.4%. Combining the cooccurrence and pattern relations finally improves the performance to 70.4%, significantly outperforming the baseline by 8.3%. These experiments show that ERUV learns representative features and hence promotes the action recognition task.

Fig. 5: Feature embedding results with ERUV compared with the random method and SOTA method VCOP.
Method UCF101(UCF101)(%) UCF101(Thumos14)(%)
VCOP [32] 66.7 64.9
VCP [18] 69.7 68.7
ERUV 71.2 70.4
TABLE III: Performance comparison under different pre-training datasets. UCF101(Thumos14) denotes the model is pre-trained on Thumos14 and fine-tuned on UCF101.

IV-C Representation Learning

In this section, we further demonstrate the effectiveness of ERUV by comparing it with previous methods under different training strategies.

First of all, as shown in Table III, VCOP [32], VCP [18] and ERUV are trained on UCF101 and Thumos14 respectively, and the trained models are used to initialize the action recognition task on the first split of UCF101. In our implementation, VCOP and VCP are trained on the single-shot segments generated by shot editing (Section III-A), since they cannot learn from raw untrimmed videos directly. In addition, when training ERUV on UCF101, we sample v_b from the same video and remove the relations that depend on shot and video identities, because no such labels exist in that dataset.

When pre-training and fine-tuning on UCF101, ERUV outperforms VCOP by 4.5% and VCP by 1.5%, showing that the relations used in ERUV are more effective than those of previous methods. When pre-training on Thumos14 and fine-tuning on UCF101, ERUV also outperforms VCOP and VCP, by 5.5% and 1.7% respectively. Note that, because we fine-tune on UCF101, all methods perform better when pre-trained on UCF101 than on Thumos14.

To investigate why ERUV gains better performance on action recognition, we visualize the features generated by the trained models. Specifically, we select 300 samples from 3 action classes of UCF101 and visualize their pool5 features as two-dimensional embeddings obtained by PCA.

In Fig. 5, we can see that the randomly initialized model cannot extract effective discriminative features. For VCOP [32], the inter-class distance increases, but the features are still not effectively discriminative. For ERUV, the intra-class distance decreases and the inter-class distance increases, implying that our model learns better discriminative features, which directly benefits the action recognition task.
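A two-dimensional PCA embedding of this kind can be computed directly from the feature matrix via SVD; the sketch below is a generic implementation, not the authors' visualization code.

```python
import numpy as np

def pca_2d(features):
    """Project a feature matrix (N samples x D dims, e.g. pool5 features)
    onto its top-2 principal components for 2D visualization."""
    x = features - features.mean(axis=0)        # center the data
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return x @ vt[:2].T                          # rows of vt are principal axes
```

The two columns of the result can then be scattered per class to compare intra-class and inter-class spread.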

IV-D Target Tasks

To further validate the effectiveness of ERUV, we use the pre-trained model to initialize action recognition backbones and directly apply the extracted features to video retrieval.

IV-D1 Action Recognition

Method UCF101(%) HMDB51(%)
Jigsaw [20] 51.5 22.5
OPN [16] 56.3 22.1
Büchler[1] 58.6 25.0
Mas[28] 58.8 32.6
3D ST-puzzle[13] 65.0 31.3
C3D(random) 61.8 24.7
C3D(VCOP [32]) 65.6 28.4
C3D (VCP) [18] 68.5 32.5
C3D(ERUV) 69.6 33.7
R3D(random) 54.5 23.4
R3D(VCOP [32]) 64.9 29.5
R3D(VCP [18]) 66.0 31.5
R3D(ERUV) 68.8 31.6
R(2+1)D(random) 55.8 22.0
R(2+1)D(VCOP [32]) 72.4 30.9
R(2+1)D(VCP [18]) 66.3 32.2
R(2+1)D(ERUV) 68.4 31.9
TABLE IV: Comparison of action recognition accuracy on UCF101 and HMDB51.

After training with untrimmed videos from Thumos14, we fine-tune the model using labeled videos, implementing the fine-tuning procedure following the settings of [32]. Fine-tuning stops after 150 epochs. At test time, we uniformly sample 10 clips from each video to obtain the action prediction.

Table IV shows the results on UCF101 and HMDB51; we report the average accuracy over 3 splits. With the C3D backbone, ERUV obtains 69.6% accuracy compared with 61.8% for random initialization on UCF101, and 33.7% versus 24.7% on HMDB51; it also outperforms the state-of-the-art VCP approach [18] by 1.1% and 1.2% respectively. ERUV likewise achieves better accuracy with the R3D and R(2+1)D backbones: with R3D and R(2+1)D, ERUV has 14.3% and 12.6% performance gains over random initialization on UCF101, and 8.2% and 9.9% gains on HMDB51, respectively.

Since our training samples from Thumos14 contain a large portion of background videos, the improvements gained on the 3 splits of the UCF101 and HMDB51 datasets show the effectiveness and generalization of ERUV.

IV-D2 Video Retrieval

To directly test the features extracted by ERUV, we validate the pre-trained model with nearest-neighbor video retrieval. Following the protocol in [32], we extract 10 clips per video with the ERUV pre-trained model. The clips extracted from the test set are used to query the k nearest clips from the training set; if a video of the same category is matched, the retrieval is counted as correct.
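The retrieval protocol described above can be sketched as follows, assuming one feature vector per clip; the function name is illustrative, and the correctness criterion (any of the top-k neighbors sharing the query's class) follows the description in the text.

```python
import numpy as np

def topk_retrieval(query_feats, train_feats, train_labels, query_labels, k=5):
    """Nearest-neighbour retrieval accuracy: a query counts as correct if any
    of its k nearest training clips (by cosine similarity) shares its class."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    t = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    sim = q @ t.T                          # (n_query, n_train) cosine matrix
    nn = np.argsort(-sim, axis=1)[:, :k]   # indices of the top-k training clips
    hits = [query_labels[i] in train_labels[nn[i]] for i in range(len(q))]
    return float(np.mean(hits))
```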

The video retrieval results on UCF101 and HMDB51 are shown in Tables V and VI, which further indicate the effectiveness of ERUV-trained models. Note that we outperform the state-of-the-art methods dramatically on top-1 accuracy with different backbones, for which feature extraction ability is critical.

Methods top1(%) top5(%) top10(%) top20(%) top50(%)
Jigsaw[20] 19.7 28.5 33.5 40.0 49.4
OPN[16] 19.9 28.7 34.0 40.6 51.6
Büchler[1] 25.7 36.2 42.2 49.2 59.5
C3D(random) 16.7 27.5 33.7 41.4 53.0
C3D(VCOP[32]) 12.5 29.0 39.0 50.6 66.9
C3D(VCP[18]) 17.3 31.5 42.0 52.6 67.7
C3D(ERUV) 25.2 40.5 48.3 57.6 70.4
R3D(random) 9.9 18.9 26.0 35.5 51.9
R3D(VCOP[32]) 14.1 30.3 40.4 51.1 66.5
R3D(VCP[18]) 18.6 33.6 42.5 53.5 68.1
R3D(ERUV) 21.4 35.2 43.8 53.1 68.3
R(2+1)D(random) 10.6 20.7 27.4 37.4 53.1
R(2+1)D(VCOP[32]) 10.7 25.9 35.4 47.3 63.9
R(2+1)D(VCP[18]) 19.9 33.7 42.0 50.5 64.4
R(2+1)D(ERUV) 22.0 35.1 42.6 51.5 64.9
TABLE V: Video retrieval performance on UCF101.
Methods top1(%) top5(%) top10(%) top20(%) top50(%)
C3D(random) 7.4 20.5 31.9 44.5 66.3
C3D(VCOP[32]) 7.4 22.6 34.4 48.5 70.1
C3D(VCP[18]) 7.8 23.8 35.3 49.3 71.6
C3D(ERUV) 8.6 25.3 37.0 53.7 75.7
R3D(random) 6.7 18.3 28.3 43.1 67.9
R3D(VCOP[32]) 7.6 22.9 34.4 48.8 68.9
R3D(VCP[18]) 7.6 24.4 36.3 53.6 76.4
R3D(ERUV) 8.6 24.3 35.5 53.0 75.9
R(2+1)D(random) 4.5 14.8 23.4 38.9 63.0
R(2+1)D(VCOP[32]) 5.7 19.5 30.7 45.8 67.0
R(2+1)D(VCP[18]) 6.7 21.3 32.7 49.2 73.3
R(2+1)D(ERUV) 9.3 26.4 38.5 51.8 73.9
TABLE VI: Video retrieval performance on HMDB51.

IV-D3 Visualization

In order to better understand what ERUV learns, we visualize the feature attention maps [33] to indicate where the spatio-temporal representation focuses. As shown in Fig. 6, we compute heat maps over sampled frames and compare them across backbones. The learned features tend to focus on the dominant moving objects in the video.

Fig. 6: Visualization of attention maps. From left to right: a frame from a video clip, and the attention maps generated by C3D, R3D, and R(2+1)D [33].
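The attention maps above follow the activation-based scheme of [33], which collapses the channel dimension of a feature map into a single spatial map. A minimal sketch, assuming NumPy arrays and that the temporal axis of the 3D-CNN feature tensor has already been averaged out; the function name and the exponent `p` are illustrative choices:

```python
import numpy as np

def attention_map(feats, p=2):
    """Activation-based attention (Zagoruyko & Komodakis, 2016):
    sum |activation|^p over the channel axis of a (C, H, W) feature
    tensor, then min-max normalize to [0, 1] for heat-map display."""
    amap = np.sum(np.abs(feats) ** p, axis=0)   # (H, W)
    amap = amap - amap.min()
    denom = amap.max()
    return amap / denom if denom > 0 else amap
```

The resulting (H, W) map can be upsampled to the frame resolution and overlaid on the input frame, as in Fig. 6.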

V Conclusion

In this paper, we propose ERUV, a novel and truly self-supervised method that learns rich spatio-temporal features without human annotations. In ERUV, we train CNN models to predict the relations between video clips. Experimental results show the effectiveness of ERUV on downstream tasks such as action recognition and video retrieval. Our work informs video understanding in two aspects: self-supervised learning can be implemented with untrimmed videos, and action relations are beneficial for video understanding.


  • [1] U. Buchler, B. Brattoli, and B. Ommer (2018) Improving spatiotemporal self-supervision by deep reinforcement learning. In European conference on computer vision, pp. 770–786. Cited by: §I, TABLE IV, TABLE V.
  • [2] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros (2016) Context encoders: feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2536–2544. Cited by: §II-B1.
  • [3] C. Doersch, A. Gupta, and A. A. Efros (2015) Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1422–1430. Cited by: §I, §II-B1.
  • [4] C. Feichtenhofer, H. Fan, J. Malik, and K. He (2019) Slowfast networks for video recognition. In International Conference on Computer Vision, pp. 6202–6211. Cited by: §I, §II-A.
  • [5] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan (2009) Object detection with discriminatively trained part-based models. IEEE transactions on pattern analysis and machine intelligence 32 (9), pp. 1627–1645. Cited by: §I, §III.
  • [6] B. Fernando, H. Bilen, E. Gavves, and S. Gould (2017) Self-supervised video representation learning with odd-one-out networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3636–3645. Cited by: §I, §II-B2.
  • [7] C. Gan, B. Gong, K. Liu, H. Su, and L. J. Guibas (2018) Geometry guided convolutional neural networks for self-supervised video representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5589–5597. Cited by: §I, §II-B2.
  • [8] H. Hu, J. Gu, Z. Zhang, J. Dai, and Y. Wei (2018) Relation networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3588–3597. Cited by: §I, §I.
  • [9] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre (2011) HMDB: a large video database for human motion recognition. In International Conference on Computer Vision, Vol. 4, pp. 6. Cited by: §I.
  • [10] Y. Jiang, J. Liu, A. R. Zamir, G. Toderici, I. Laptev, M. Shah, and R. Sukthankar (2014) THUMOS challenge: action recognition with a large number of classes. Cited by: §IV-A1.
  • [11] L. Jing, X. Yang, J. Liu, and Y. Tian (2018) Self-supervised spatiotemporal feature learning via video rotation prediction. arXiv preprint arXiv:1811.11387. Cited by: §II-B2.
  • [12] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. (2017) The kinetics human action video dataset. arXiv preprint arXiv:1705.06950. Cited by: §I.
  • [13] D. Kim, D. Cho, and I. S. Kweon (2019) Self-supervised video representation learning with space-time cubic puzzles. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 8545–8552. Cited by: §I, §II-B2, TABLE IV.
  • [14] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §I.
  • [15] G. Larsson, M. Maire, and G. Shakhnarovich (2017) Colorization as a proxy task for visual understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6874–6883. Cited by: §I, §II-B1.
  • [16] H. Lee, J. Huang, M. Singh, and M. Yang (2017) Unsupervised representation learning by sorting sequences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 667–676. Cited by: §I, §II-B2, TABLE IV, TABLE V.
  • [17] W. Lee, J. Na, and G. Kim (2019) Multi-task self-supervised object detection via recycling of bounding box annotations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4984–4993. Cited by: §II-B1.
  • [18] D. Luo, C. Liu, Y. Zhou, D. Yang, C. Ma, Q. Ye, and W. Wang (2020) Video cloze procedure for self-supervised spatio-temporal learning. arXiv preprint arXiv:2001.00294. Cited by: §I, §II-B2, §III-D, §IV-C, §IV-D1, TABLE III, TABLE IV, TABLE V, TABLE VI.
  • [19] I. Misra, C. L. Zitnick, and M. Hebert (2016) Shuffle and learn: unsupervised learning using temporal order verification. In European conference on computer vision, pp. 527–544. Cited by: §I, §II-B2.
  • [20] M. Noroozi and P. Favaro (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In European conference on computer vision, pp. 69–84. Cited by: §I, §II-B1, TABLE IV, TABLE V.
  • [21] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros (2016) Context encoders: feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2536–2544. Cited by: §II-B1.
  • [22] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) Imagenet large scale visual recognition challenge. International journal of computer vision 115 (3), pp. 211–252. Cited by: §I.
  • [23] K. Simonyan and A. Zisserman (2014) Two-stream convolutional networks for action recognition in videos. In Advances in neural information processing systems, pp. 568–576. Cited by: §II-A.
  • [24] K. Soomro, A. R. Zamir, and M. Shah (2012) UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402. Cited by: §I, §II-B2.
  • [25] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri (2015) Learning spatiotemporal features with 3d convolutional networks. In International Conference on Computer Vision, pp. 4489–4497. Cited by: §I, §II-A, §III-C.
  • [26] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri (2018) A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6450–6459. Cited by: §I, §II-A, §III-C.
  • [27] C. Vondrick, H. Pirsiavash, and A. Torralba (2016) Generating videos with scene dynamics. In Advances in neural information processing systems, pp. 613–621. Cited by: §II-B2.
  • [28] J. Wang, J. Jiao, L. Bao, S. He, Y. Liu, and W. Liu (2019) Self-supervised spatio-temporal representation learning for videos by predicting motion and appearance statistics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4006–4015. Cited by: §I, §II-B2, TABLE IV.
  • [29] L. Wang, Y. Xiong, D. Lin, and L. Van Gool (2017) Untrimmednets for weakly supervised action recognition and detection. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 4325–4334. Cited by: §IV-A3.
  • [30] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool (2016) Temporal segment networks: towards good practices for deep action recognition. In European conference on computer vision, pp. 20–36. Cited by: §II-A.
  • [31] D. Wei, J. J. Lim, A. Zisserman, and W. T. Freeman (2018) Learning and using the arrow of time. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8052–8060. Cited by: §I, §II-B2.
  • [32] D. Xu, J. Xiao, Z. Zhao, J. Shao, D. Xie, and Y. Zhuang (2019) Self-supervised spatiotemporal learning via video clip order prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10334–10343. Cited by: §I, §II-B2, §III-D, §IV-C, §IV-C, §IV-D1, §IV-D2, TABLE III, TABLE IV, TABLE V, TABLE VI.
  • [33] S. Zagoruyko and N. Komodakis (2016) Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928. Cited by: Fig. 6, §IV-D3.
  • [34] Y. Zhao, B. Deng, C. Shen, Y. Liu, H. Lu, and X. Hua (2017) Spatio-temporal autoencoder for video anomaly detection. In Proceedings of the 25th ACM international conference on Multimedia, pp. 1933–1941. Cited by: §II-B2.
  • [35] B. Zhou, A. Andonian, A. Oliva, and A. Torralba (2018) Temporal relational reasoning in videos. In European conference on computer vision, pp. 803–818. Cited by: §II-A.