In the past few years, Convolutional Neural Networks (CNNs) have advanced the field of computer vision to an unprecedented degree. Generally, vision tasks are solved by training models on large-scale datasets with label annotations. Typically, CNNs pre-trained on ImageNet incorporate rich representation capability and have been widely used as initial models.
Nevertheless, annotating large-scale datasets is costly and labor-intensive, particularly for tasks involving complex data (e.g., videos) and concepts (e.g., action analysis and video retrieval) [8, 14].
To address this issue, self-supervised representation learning, which leverages the information in unlabelled data to train the desired models, has attracted increasing attention from the artificial intelligence community. For video data, existing approaches usually define an annotation-free proxy task, whose objective provides the supervision for model learning.
In early research [7, 27], the relative location of patches in images or the order of video frames was used as a supervisory signal. However, the learned features were computed merely on a frame-by-frame basis, which makes them ill-suited to video analysis tasks where spatio-temporal features prevail. More recently, learning representations by regressing motion and appearance statistics has been proposed. An odd-one-out network has also been proposed to identify the unrelated or odd video clip from a set of otherwise related clips. To find the odd video clip, the model has to learn spatio-temporal features that can discriminate between video clips with minor differences.
Despite their effectiveness, existing approaches are usually built upon domain knowledge and therefore cannot incorporate various spatial-temporal operations. This seriously restricts the representation capability of the learned models. Furthermore, the lack of a model assessment approach considerably limits the pertinence of self-supervised representation learning.
In this paper, we propose a new self-supervised method called Video Cloze Procedure (VCP). In VCP, we withhold a video clip from a video sequence and apply multiple spatio-temporal operations to it. We train a 3D-CNN model to identify the category of the operations, which drives the learning of rich feature representations. The motivation behind VCP is that applying richer operations to video clips facilitates exploring higher representation capability, Fig. 1.
VCP consists of three components: blank generation, option creation, and cloze completion. The first component generates blanks by withholding video clips from given clip sequences. The second component facilitates learning multiple spatial-temporal representations by applying spatial-temporal operations to the withheld clips. Finally, cloze completion fills the blanks with options and learns representations by predicting the category of the operations.
VCP can act as either a proxy task or a target task in self-supervised learning. As a proxy task, it converts rich self-supervised representations into video clip operations, which enhances the flexibility and reduces the complexity of representation learning. As a target task, it can assess learned representation models in an interpretable manner.
The contributions of this work are summarized as follows:
We propose Video Cloze Procedure (VCP), providing a simple-yet-effective framework for self-supervised spatio-temporal representation learning.
We propose a new model assessment approach by designing VCP as a special target task, which improves the interpretability of self-supervised representation learning.
VCP is applied to three kinds of 3D CNN models and two target tasks, action recognition and video retrieval, and improves on the state of the art by significant margins.
2 Related work
Self-supervised learning leverages the information in unlabelled data to train target models. Existing approaches usually define an annotation-free proxy task that requires a network to predict information latent in unannotated videos. The learned models can then be applied to target tasks (supervised or unsupervised) through fine-tuning.
2.1 Proxy Tasks
From the broad viewpoint of self-supervised learning, proxy tasks can be constructed over multiple kinds of sensory data such as ego-motion and sound [21, 5, 6, 13, 22]. From the narrower viewpoint of visual representation learning, proxy tasks can be categorized into (1) image property transform and (2) video content transform.
Image Property Transform
Spatial transforms applied to images can produce supervision signals for representation learning. As a representative line of research, CNN features have been learned by rotating images and predicting the rotation angles. [16, 7] proposed learning image representations by completing damaged Jigsaw puzzles. Context inpainting trains a CNN to generate the contents of a withheld image region conditioned on its surroundings. Unsupervised correspondence trains a representation model to match image patches with transformation invariance.
Video Content Transform
A large number of video clips with rich motion information provide various self-supervised signals. In early work, the order of video frames was used as a supervisory signal. In [19, 18], predicting the order of frames or video clips drives the learning of spatio-temporal representations. An odd-one-out network was proposed to identify the unrelated or odd video clip from a set of otherwise related clips. To find the odd video clip, the model has to learn spatio-temporal features that can discriminate between similar video clips. Unsupervised motion-based segmentation of videos has also been used to obtain segments, which serve as pseudo ground truth for training a CNN to segment objects.
Early works usually learned features with 2D CNNs and merely on a frame-by-frame basis, which makes them ill-suited to video analysis tasks where spatio-temporal features prevail. More recently, 3D representations have been learned by regressing motion and appearance statistics, by predicting the order of video clips, and by completing space-time cubic puzzles.
However, existing self-supervised learning methods are typically designed for specific target tasks, which restricts the capability of the learned models. In addition, few of the proxy tasks are capable of assessing feature representations, which considerably limits the pertinence of the learned models.
2.2 Target Tasks
In this work, the self-supervised representation models are applied to target tasks including video action recognition and video retrieval. Recent works [25, 24] investigated training 3D CNN models on a large-scale supervised video database. Nevertheless, models trained on specific self-supervised tasks lack general applicability, i.e., fine-tuning such models on various video tasks could produce sub-optimal results. To address these issues, we propose the novel VCP, which, by incorporating multiple self-supervised representations, improves the generality of the learned model.
3 Video Cloze Procedure
The Cloze Procedure was first introduced by Wilson Taylor in 1953 as a metric to evaluate the capability of human language learning. Specifically, it deletes words in a prose selection according to a word-count formula or various other criteria and evaluates the success a reader has in accurately supplying the deleted words. Motivated by the success of the Cloze Procedure in the field of language learning, we design the Video Cloze Procedure.
In this section, we first describe the details of VCP, which consists of three components: blank generation, option creation, and cloze completion. We then discuss the advantages of VCP over state-of-the-art methods in three aspects: complexity, flexibility, and interpretability.
3.1 Blank Generation
Considering the spatial similarity and the temporal ambiguity among video frames, we take video clips as the smallest unit in VCP. Considering that the semantic information of different videos is temporally non-uniform, we generate the blanks in VCP in an every-nth-word manner. Specifically, the blank generation component consists of two steps: clip sampling and clip deletion.
Clips of equal length are sampled from the raw video at equal intervals and without overlap. In this way, the relevance of low-level vision cues, such as texture and color, among clips is weakened compared to that in successive or overlapped clips. As a result, the learner is forced to focus on middle- and high-level spatio-temporal features.
A video sequence of successive clips is considered as a whole cloze item. We randomly delete one of the clips in the cloze item, each with the same probability, to generate a blank. The removed clip is then used to create options. For clarity of description, we give an example of VCP by sampling three clips and deleting the middle one, as shown in Fig. 2.
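The two steps of blank generation can be sketched as follows. This is a minimal illustration rather than the authors' implementation: clips are represented by lists of frame indices, the function names `sample_clips` and `make_blank` are hypothetical, the defaults (three 16-frame clips, interval of 8) are borrowed from the experimental settings, and the interval is assumed to mean the number of unused frames between consecutive clips.

```python
import random

def sample_clips(num_frames, clip_len=16, interval=8, num_clips=3, start=0):
    """Return frame-index lists for `num_clips` equal-length, non-overlapping clips.

    Assumption: `interval` unused frames separate consecutive clips.
    """
    clips = []
    pos = start
    for _ in range(num_clips):
        end = pos + clip_len
        if end > num_frames:
            raise ValueError("video too short for the requested cloze item")
        clips.append(list(range(pos, end)))
        pos = end + interval
    return clips

def make_blank(clips, rng=random):
    """Delete one clip uniformly at random.

    Returns the cloze item with None at the blank, the blank position,
    and the removed clip (later used to create options).
    """
    blank_pos = rng.randrange(len(clips))
    item = list(clips)
    removed = item[blank_pos]
    item[blank_pos] = None
    return item, blank_pos, removed

if __name__ == "__main__":
    cloze_item, position, withheld = make_blank(sample_clips(100))
    print(position, withheld[0], withheld[-1])
```

Passing a seeded `random.Random` instance as `rng` makes the choice of the blank reproducible.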
3.2 Option Creation
To train a model to distinguish the deleted clip from a heap of perplexing optional clips, we design spatial and temporal operations to create the optional clips (options). To learn richer representations, the operations should effectively confuse the learners while preserving the spatial-temporal relevance. Under this principle, we design four operations for VCP: spatial rotation, spatial permutation, temporal remote shuffling, and temporal adjacent shuffling.
To provide options that focus on spatial representation learning, we introduce spatial rotation and spatial permutation. With spatial rotation, a video clip is rotated by 90, 180, or 270 degrees so that the model is forced to learn orientation-related features. With spatial permutation, a video clip is divided into four tiles and any two of the tiles are swapped to produce a new option. There are six kinds of options in total, one for each pair of tiles. Swapping only two tiles produces options in which the spatial structure is partially preserved, which prevents models from relying on low-level statistics to distinguish spatial chaos.
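As an illustration of the two spatial operations, the sketch below works on a clip stored as a list of frames, each frame being a list of pixel rows. The helper names are hypothetical, the rotation direction (clockwise) and the 2 × 2 tile layout are assumptions, and frames are assumed to have even height and width so that the four tiles have equal size.

```python
from itertools import combinations

def rotate90(frame):
    """Rotate one H x W frame (a list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*frame[::-1])]

def spatial_rotation(clip, k):
    """Rotate every frame of the clip by k * 90 degrees (k in {1, 2, 3})."""
    out = clip
    for _ in range(k):
        out = [rotate90(frame) for frame in out]
    return out

def tile_boxes(h, w):
    """(row0, row1, col0, col1) of the four tiles of an assumed 2 x 2 grid."""
    hh, hw = h // 2, w // 2
    return [(0, hh, 0, hw), (0, hh, hw, w), (hh, h, 0, hw), (hh, h, hw, w)]

def spatial_permutation(clip, i, j):
    """Swap tiles i and j (indices 0..3) in every frame of the clip."""
    h, w = len(clip[0]), len(clip[0][0])
    boxes = tile_boxes(h, w)
    (r0, r1, c0, c1), (s0, _, d0, _) = boxes[i], boxes[j]
    out = []
    for frame in clip:
        new = [list(row) for row in frame]
        for dr in range(r1 - r0):
            for dc in range(c1 - c0):
                new[r0 + dr][c0 + dc] = frame[s0 + dr][d0 + dc]
                new[s0 + dr][d0 + dc] = frame[r0 + dr][c0 + dc]
        out.append(new)
    return out

# Every unordered pair of tiles gives one permutation option.
ALL_TILE_SWAPS = list(combinations(range(4), 2))
```

Enumerating `ALL_TILE_SWAPS` gives one candidate option per pair of tiles, while `spatial_rotation` with `k` in 1, 2, 3 gives the three rotated options.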
To provide options that focus on temporal features, we further introduce two kinds of temporal operations. One operation is temporal remote shuffling, where the deleted clip is substituted with a clip that lies at a large temporal distance forward or backward. As the backgrounds of frames within a reasonable temporal distance are probably similar, the discriminative difference lies in the foreground, so temporal remote shuffling drives the model to learn more temporal information related to the foreground. The other operation is temporal adjacent shuffling, where the original clip is divided into four sub-clips and two of them are randomly swapped once. Different from VCOP, we do not shuffle all the sub-clips, and we reduce the difficulty by asking the model to judge whether or not the clip is shuffled instead of predicting the exact order. In this way, rich temporal representations are easier to learn.
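The two temporal operations can be sketched on frame-index lists as follows. The function names are hypothetical, and the 16-frame minimum distance from the cloze item is taken from the experimental settings. This is a simplified illustration under those assumptions, not the authors' code.

```python
import random

def temporal_remote_shuffle(video_len, item_start, item_end, clip_len=16,
                            min_distance=16, rng=random):
    """Pick a replacement clip at least `min_distance` frames before or after
    the cloze item, which spans frames [item_start, item_end)."""
    starts = [s for s in range(video_len - clip_len + 1)
              if s + clip_len + min_distance <= item_start
              or s >= item_end + min_distance]
    if not starts:
        raise ValueError("video too short to provide a remote clip")
    s = rng.choice(starts)
    return list(range(s, s + clip_len))

def temporal_adjacent_shuffle(clip, rng=random):
    """Split the clip into four sub-clips and swap two of them once."""
    n = len(clip)
    if n % 4:
        raise ValueError("clip length must be divisible by 4")
    q = n // 4
    subs = [clip[k * q:(k + 1) * q] for k in range(4)]
    i, j = rng.sample(range(4), 2)  # two distinct sub-clip positions
    subs[i], subs[j] = subs[j], subs[i]
    return [frame for sub in subs for frame in sub]
```

Because only one pair of sub-clips is swapped, the classifier only has to decide whether a clip was shuffled, not recover the full order.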
3.3 Cloze Completion
To complete the cloze, we fill the blanks by randomly sampling the clip options together with their operation category labels. To predict the operation categories applied to the clips, we use three 3D CNNs as the backbones and concatenate their output features according to the order of the clips in the raw video, as illustrated in Fig. 2. The three CNNs share parameters so that a single strong model can be learned. The concatenated feature is fed to a fully connected (FC) layer, which predicts the corresponding operation category.
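The data flow of cloze completion (shared backbone, ordered concatenation, FC classifier) can be illustrated with a toy model. In this sketch the 3D CNN backbone is replaced by a single linear layer with ReLU on a flattened clip vector, purely as a stand-in; the class name, dimensions, and the choice of five output categories (the unmodified clip plus four operations) are assumptions made for illustration.

```python
import math
import random

def linear(x, weight, bias):
    """Dense layer; `weight` holds one row of coefficients per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]

class ClozeClassifier:
    """Toy stand-in: one shared encoder, concatenation, then an FC head."""

    def __init__(self, in_dim, feat_dim, num_clips=3, num_classes=5, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        # One set of encoder weights is reused for every clip (parameter sharing).
        self.enc_w, self.enc_b = init(feat_dim, in_dim), [0.0] * feat_dim
        self.fc_w = init(num_classes, num_clips * feat_dim)
        self.fc_b = [0.0] * num_classes

    def encode(self, clip_vec):
        # Placeholder for the 3D CNN backbone.
        return [max(0.0, v) for v in linear(clip_vec, self.enc_w, self.enc_b)]

    def forward(self, clips):
        # Clips arrive in raw-video order; features are concatenated in that order.
        feats = [f for clip in clips for f in self.encode(clip)]
        return softmax(linear(feats, self.fc_w, self.fc_b))

if __name__ == "__main__":
    model = ClozeClassifier(in_dim=8, feat_dim=4)
    print(model.forward([[0.1] * 8, [0.2] * 8, [0.3] * 8]))
```

Calling `encode` on each clip with the same weights is what parameter sharing amounts to here; a real implementation would substitute C3D, R3D, or R(2+1)D for the encoder.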
4 Self-supervised Representation Learning
We implement self-supervised representation learning and model assessment by treating VCP as a proxy task and a target task, respectively.
4.1 Representation Learning
As a proxy task, VCP can learn spatio-temporal representations using only the original labeled data for target tasks or using extra unlabeled data, Fig. 3.
For the target task, deep models learn to extract features in a direct manner, trying to minimize the training loss under the supervision of specific annotations, e.g., category labels. During this procedure, the task-specific representation capability of the models is reinforced while their general representation capacity is unfortunately ignored. With spatio-temporal operations applied to the clips, VCP learns rich and general representations by pre-training the models, which enhances the performance on target tasks without requiring extra labeling effort.
On the other hand, VCP can leverage massive unlabeled data to raise the upper limit of model representation capability. With VCP, we pre-train a representation model on an un-annotated dataset as a warm-up initialization and then fine-tune this model on the annotated target dataset. VCP has the potential to learn general representations, e.g., spatial-temporal integrity and continuity, in the spatio-temporal domain, which helps improve the representation capability of models in video-based vision tasks.
4.2 Model Assessment
Beyond acting as a proxy task, VCP can also act as a target task, which offers a uniform and interpretable way to evaluate self-supervised representation models. In VCP, the classification accuracy of operations reflects what the models learn and how good they are. By simply replacing the head of the classification network with a fully connected layer to be fine-tuned while the parameters of the backbone network are fixed, operation category classification is implemented as a target task, Fig. 3.
In this way, the feature representation capability obtained from the self-supervised proxy tasks is preserved. Meanwhile, the corresponding features are used to train a classifier, the performance of which can be regarded as a metric to assess the representation models. With the hints provided by VCP, we can not only assess in detail the models learned from different self-supervised proxy tasks but also figure out how to improve a self-supervised method. This casts new light on the significance of VCP.
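The assessment procedure, in which only the FC layer is trained on top of fixed backbone features, resembles a linear probe. The sketch below assumes the frozen backbone has already produced one feature vector per clip and fits a softmax layer with plain SGD; the function names, learning rate, and optimizer details are illustrative assumptions rather than the settings used in the paper.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]

def logits(x, w, b):
    return [sum(wk * xk for wk, xk in zip(row, x)) + bc for row, bc in zip(w, b)]

def train_probe(features, labels, num_classes, lr=0.1, epochs=30, seed=0):
    """Fit only an FC layer (w, b) on fixed features; the backbone is never updated."""
    rng = random.Random(seed)
    dim = len(features[0])
    w = [[0.0] * dim for _ in range(num_classes)]
    b = [0.0] * num_classes
    order = list(range(len(features)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = features[i], labels[i]
            p = softmax(logits(x, w, b))
            for c in range(num_classes):
                g = p[c] - (1.0 if c == y else 0.0)  # cross-entropy gradient w.r.t. logit c
                b[c] -= lr * g
                for k in range(dim):
                    w[c][k] -= lr * g * x[k]
    return w, b

def per_class_accuracy(features, labels, w, b, num_classes):
    """Accuracy for each operation category separately."""
    hit, tot = [0] * num_classes, [0] * num_classes
    for x, y in zip(features, labels):
        z = logits(x, w, b)
        pred = max(range(num_classes), key=z.__getitem__)
        tot[y] += 1
        hit[y] += int(pred == y)
    return [h / t if t else float("nan") for h, t in zip(hit, tot)]
```

Reporting the accuracy per operation category, rather than a single number, is what makes it possible to see which kind of spatial or temporal cue a given pre-trained model has or has not captured.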
To analyze the advantages of VCP over existing self-supervised methods, we contrast them from three aspects including complexity, flexibility, and interpretability.
Existing approaches that use spatio-temporal shuffling and order prediction [15, 29, 18] have a high computational complexity with respect to the number of video frame/clip units. The high complexity is caused by the requirement to predict the exact order, which might not be necessary for learning representations. In contrast, VCP simply chooses options to fill the blanks and predicts the operation category of the chosen option. It thus has a lower computational complexity.
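To make the contrast concrete, the short sketch below compares the size of the output space only; it is not a measurement of running time. Predicting the exact order of n units means separating all n! permutations, whereas the number of VCP categories is set by the chosen operations. The category list assumes the unmodified clip plus the four operations of Section 3.2, and the names are hypothetical.

```python
import math

def order_prediction_classes(n_units):
    """Number of distinct orderings an exact-order predictor must separate."""
    return math.factorial(n_units)

# Assumed category set: the unmodified clip plus the four operations of Sec. 3.2.
VCP_CATEGORIES = [
    "original",
    "spatial rotation",
    "spatial permutation",
    "temporal remote shuffling",
    "temporal adjacent shuffling",
]

if __name__ == "__main__":
    for n in (3, 4, 5, 6):
        print(n, order_prediction_classes(n), len(VCP_CATEGORIES))
```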
For various target tasks, VCP can be adapted by configuring the options (operations). For example, we can apply spatial permutation to enhance spatial representations and apply temporal adjacent shuffling to boost temporal representations. In this flexible manner, VCP can incorporate task-specific information through dedicated spatial and/or temporal operations for different target tasks.
In existing approaches, different proxy tasks learn different representation models, so an interpretable way to explore the relationship between representation models and target tasks is required. With well-designed options, VCP offers the opportunity to analyze models by testing their classification accuracy on a uniform set of options (operations), which has great potential for overcoming the weaknesses of models in a targeted way.
We conduct extensive experiments to evaluate VCP and its applications to target tasks. First, we describe the experimental settings for VCP. We then evaluate the representation learning of VCP with different option configurations and data strategies. We further conduct experiments on model assessment with VCP. Finally, we evaluate the performance of VCP applied to target tasks, i.e., action recognition and video retrieval, and compare it with state-of-the-art methods.
5.1 Experiment Setting
The experiments are conducted on the UCF101 and HMDB51 datasets. UCF101 contains 13320 videos over 101 action categories and exhibits challenging problems including intra-class variance of actions, complex camera motions, and cluttered backgrounds. HMDB51 contains 6849 videos over 51 action categories. The videos are mainly collected from movies and websites including the Prelinger archive, YouTube, and Google videos.
C3D, R3D, and R(2+1)D are employed as backbones in the VCP implementations. C3D extends 2D convolution kernels to 3D kernels so that it can model the temporal information of videos. The size of its convolution kernels is 3 × 3 × 3. R3D is an extension of ResNet with C3D. In R(2+1)D, 3D convolution kernels are decomposed into a spatial convolution, with kernels of size 1 × d × d, and a temporal convolution, with kernels of size t × 1 × 1.
In blank generation, to avoid trivial results, three successive 16-frame clips are sampled every 8 frames from the raw video as a whole cloze item. Each frame is resized and then randomly cropped. In option generation, we define clips sampled at least 16 frames away from the cloze item as remote clips. We set the initial learning rate to 0.01 and the momentum to 0.9, and stop training after 300 epochs.
5.2 Representation Learning
To validate what VCP learns, we first conduct ablation studies of VCP. We further conduct experiments with different data strategies to demonstrate the generality of the representations learned via VCP.
We first train a model to classify the categories of five options. Table 1 shows the results on UCF101, where the model is trained and evaluated on the first split. It can be seen that VCP achieves 78.42% overall accuracy. For spatial rotation, spatial permutation, and temporal adjacent shuffling, VCP achieves 95.04%, 97.53%, and 94.57% accuracy, respectively. The results show that the five designed operations are plausible.
To clearly show the effect of option creation for representation learning, we conduct ablation experiments on VCP with various options for action recognition, Table 2. The experiments are conducted using C3D as the backbone. We pre-train VCP and then fine-tune the action recognition model on UCF101. The recognition accuracy is evaluated on the first test split.
It can be seen that when pre-training with a single spatial rotation or spatial permutation operation, the action recognition accuracy exceeds the baseline (random) by 2.3% or 1.4%, respectively. When both spatial operations are used, the performance further increases to 66.0%. Pre-training with a single temporal remote shuffling or temporal adjacent shuffling operation improves the performance by 5.8% or 3.0%, respectively, and the performance further improves to 68.0% when both temporal operations are used. Combining the spatial and temporal operations finally improves the performance to 69.7%, significantly outperforming the baseline by 7.7%. The experiments show that the options can be used flexibly, either standalone or in combination with each other. VCP can learn more representative features by adding rich and complementary options.
Table 3 (columns: Data Strategy, VCOP (%), VCP (ours) (%)).
To further validate the generality of VCP, we conduct experiments under different data strategies, with C3D as the backbone. First, we pre-train VCP on UCF101 and on HMDB51, and then fine-tune each pre-trained model on UCF101 and on HMDB51 for action recognition, Table 3. Specifically, the supervised action recognition baseline is trained directly on the target datasets with random initialization.
It can be seen that when pre-training and fine-tuning on UCF101, VCP outperforms VCOP by 2.9%; when pre-training and fine-tuning on HMDB51, VCP slightly outperforms VCOP, showing that the strategy used in VCP is better than that in VCOP. Note that using VCP as a pre-trained model further improves the performance of supervised methods significantly, by 6.7% (68.5% vs. 61.8%) on UCF101 and 6.8% (31.5% vs. 24.7%) on HMDB51, which shows that VCP is complementary to supervised model learning. After pre-training on UCF101 and fine-tuning on HMDB51, VCP significantly outperforms VCOP by 4.1%. It is noteworthy that when pre-training on the smaller dataset HMDB51 but fine-tuning on the larger dataset UCF101, VCP still outperforms VCOP by 2.6%, which shows the generality of VCP.
5.3 Model Assessment
Regarding VCP as a target task, we only fine-tune the fully connected layer with the parameters of the self-supervised model fixed to get the operation classification accuracy curve, Fig. 4. We fine-tune the fully connected layer for 30 epochs and then output the test scores every 5 epochs.
The model trained with VCP clearly recognizes the spatial rotation, spatial permutation, and temporal adjacent shuffling operations with high accuracy (90%), Fig. 4(b)(c)(e). Nevertheless, it experiences difficulty when classifying the original clips and the remote shuffled clips, Fig. 4(a)(d). It can be seen that the accuracies on these two categories are negatively correlated, which indicates that the model confuses them. In contrast, the accuracy of VCOP and 3D Cubic Puzzle is divergent, which implies that they fail to classify the two categories.
For spatial operation classification, Fig. 4(b)(c), ST-Puzzle and S-Puzzle outperform T-Puzzle and VCOP, while for temporal operation classification, Fig. 4(d)(e), they underperform T-Puzzle and VCOP. This shows that spatial representation learning is not consistent with temporal representation learning. Consequently, VCP benefits from integrating existing and newly designed spatial and temporal operations.
5.4 Action Recognition
Once a 3D CNN is pre-trained by VCP, we use it to initialize and fine-tune models for action recognition. For action recognition, we feed the features extracted by the backbones to fully-connected layers for classification. During fine-tuning, we initialize the backbones from VCP while the fully-connected layers are randomly initialized. The hyper-parameters and data pre-processing are the same as in the VCP training process. The fine-tuning procedure is carried out for 150 epochs. During testing, we follow the existing protocol and sample 10 clips for each video. The predictions on the clips are averaged to obtain the video prediction.
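The test-time aggregation described above can be sketched as follows. Averaging the per-clip class probabilities follows the text; the even spacing of the 10 clips in `sample_clip_starts` and both function names are assumptions made for illustration.

```python
def sample_clip_starts(num_frames, clip_len=16, num_clips=10):
    """Start indices of `num_clips` clips spread evenly over the video (assumed spacing)."""
    last = num_frames - clip_len
    if last < 0:
        raise ValueError("video shorter than one clip")
    if num_clips == 1:
        return [last // 2]
    return [round(i * last / (num_clips - 1)) for i in range(num_clips)]

def video_prediction(clip_probs):
    """Average per-clip class probabilities; return (mean_probs, predicted_class)."""
    n = len(clip_probs)
    num_classes = len(clip_probs[0])
    mean = [sum(p[c] for p in clip_probs) / n for c in range(num_classes)]
    return mean, max(range(num_classes), key=mean.__getitem__)
```

In practice `clip_probs` would hold the softmax outputs of the fine-tuned network for each sampled clip of one video.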
The classification accuracies over the 3 splits are averaged to obtain the final accuracy. As shown in Table 4, with a C3D backbone, VCP (ours) outperforms the randomly initialized C3D (random) by 6.7% and 7.8% on UCF101 and HMDB51, respectively. It also outperforms the state-of-the-art VCOP approach by 2.9% and 4.1%. With an R3D backbone, VCP has 11.5% (66% vs. 54.5%) and 9.8% (32.5% vs. 27.4%) performance gains over the random initialization (random) approach. It also outperforms the state-of-the-art VCOP by significant margins. The good performance validates that VCP can learn richer and more discriminative features than other methods.
5.5 Video Retrieval
VCP is also validated on the target task of nearest-neighbor video retrieval. As this task does not require training data annotation, it relies largely on the pre-trained representation models. We conduct this experiment on the first split of UCF101, following the existing protocol. The model trained by VCP is used to extract convolutional (conv5) features for all samples (videos) in the training and test sets. Each video in the test set is used to query its nearest videos from the training set. If a video of the same category is matched, a correct retrieval is counted.
VCP significantly outperforms the compared approaches on all evaluation metrics, i.e., top-1, top-5, top-10, top-20, and top-50 accuracy. In Fig. 5, qualitative results also show the superiority of VCP.
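The retrieval protocol can be sketched as a top-k nearest-neighbor evaluation. In the sketch below, features are plain lists of floats standing in for the extracted conv5 features, and cosine similarity is an assumed choice of similarity measure; the function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def topk_accuracy(train_feats, train_labels, test_feats, test_labels,
                  ks=(1, 5, 10, 20, 50)):
    """Fraction of test queries with a same-category video among the k nearest training videos."""
    hits = {k: 0 for k in ks}
    for query, label in zip(test_feats, test_labels):
        ranked = sorted(range(len(train_feats)),
                        key=lambda i: cosine(query, train_feats[i]),
                        reverse=True)
        for k in ks:
            if any(train_labels[i] == label for i in ranked[:k]):
                hits[k] += 1
    n = len(test_feats)
    return {k: hits[k] / n for k in ks}
```

A query counts as correct at rank k as soon as one of its k nearest training videos shares its category, which is why accuracy can only grow as k increases.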
In this paper, we propose a novel self-supervised method, referred to as Video Cloze Procedure (VCP), to learn rich spatial-temporal representations. With VCP, we train spatial-temporal representation models (3D-CNNs) and apply these models to action recognition and video retrieval tasks. We also propose a model assessment approach by designing VCP as a special target task, which improves the pertinence of self-supervised representation learning. Experimental results validate that VCP enhances the representation capability and the interpretability of self-supervised models. The underlying fact is that VCP simulates the way humans learn language, which provides a fresh insight for self-supervised learning tasks.
This work is supported by the National Key R&D Program of China (2017YFB1002400) and the Strategic Priority Research Program of Chinese Academy of Sciences (XDC02000000).
[1] (1970) The cloze procedure: a conspectus. Journal of Reading Behavior 2 (3), pp. 232–249.
[2] Improving spatiotemporal self-supervision by deep reinforcement learning. In Proceedings of the European Conference on Computer Vision, pp. 770–786.
[3] Context encoders: feature learning by inpainting. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 2536–2544.
[4] (2017) Learning features by watching objects move. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 6024–6033.
[5] (2014) Auto-encoding variational bayes. In Proceedings of International Conference on Learning Representations.
[6] (2017) Learning image representations tied to egomotion from unlabeled video. International Journal of Computer Vision 125 (1-3), pp. 136–161.
[7] (2015) Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1422–1430.
[8] (2017) Self-supervised video representation learning with odd-one-out networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3636–3645.
[9] (2018) Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728.
[10] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
[11] (2011) A large video database for human motion recognition. In Proceedings of the IEEE International Conference on Computer Vision, Vol. 4, pp. 6.
[12] (2009) ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 248–255.
[13] (2016) Look, listen and learn - A multimodal LSTM for speaker identification. In Proceedings of the Thirtieth Conference on Artificial Intelligence, pp. 3581–3587.
[14] (2017) The kinetics human action video dataset. arXiv preprint arXiv:1705.06950.
[15] (2019) Self-supervised video representation learning with space-time cubic puzzles. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 8545–8552.
[16] (2018) Learning image representations by completing damaged jigsaw puzzles. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 793–802.
[17] (2017) Colorization as a proxy task for visual understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6874–6883.
[18] (2017) Unsupervised representation learning by sorting sequences. In Proceedings of the IEEE International Conference on Computer Vision, pp. 667–676.
[19] Shuffle and learn: unsupervised learning using temporal order verification. In European Conference on Computer Vision, pp. 527–544.
[20] (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pp. 69–84.
[21] (2015) Learning to see by moving. In Proceedings of the IEEE International Conference on Computer Vision, pp. 37–45.
[22] (2017) Look, listen and learn. In Proceedings of the IEEE International Conference on Computer Vision, pp. 609–617.
[23] (2012) UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.
[24] (2015) Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489–4497.
[25] (2018) A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6450–6459.
[26] (2019) Self-supervised spatio-temporal representation learning for videos by predicting motion and appearance statistics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4006–4015.
[27] (2015) Unsupervised learning of visual representations using videos. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 2794–2802.
[28] (2017) Transitive invariance for self-supervised visual representation learning. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 1338–1347.
[29] (2019) Self-supervised spatiotemporal learning via video clip order prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10334–10343.