Official implementation of ACMMM'20 paper 'Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework'
Recently, 3D convolutional networks (3D ConvNets) have yielded good performance in action recognition. However, an optical flow stream is still needed to ensure better performance, and its cost is very high. In this paper, we propose a fast but effective way to extract motion features from videos, utilizing residual frames as the input data of 3D ConvNets. By replacing traditional stacked RGB frames with residual ones, improvements of up to 35.6 points in top-1 accuracy can be obtained on the UCF101 and HMDB51 datasets when ResNet-18 models are trained from scratch, and we achieve state-of-the-art results in this training setting. Analysis shows that better motion features can be extracted using residual frames than with their RGB counterparts. By combining with a simple appearance path, our proposal can perform even better than some methods using optical flow streams.
For action recognition, motion representation is an important challenge: motion features must be extracted across multiple frames. Various methods have been designed to capture the movement. 2D ConvNet based methods use interactions along the temporal axis to include temporal information [1, 2, 3, 4, 5]. 3D ConvNet based methods improved recognition performance by extending 2D convolution kernels to 3D, where computation along the temporal axis in each convolutional layer is believed to handle the movements [6, 7, 8, 9, 10, 11]. State-of-the-art methods showed further improvements by increasing the number of input frames and the spatial size of the input data, as well as by using deeper backbone networks [12, 13].
In a typical implementation of 3D ConvNets, these methods use stacked RGB frames (also called video clips; we use both terms in the following descriptions) as the input data. However, this kind of input is considered insufficient for motion representation because the features captured from stacked RGB frames may pay more attention to appearance, including backgrounds and objects, rather than to the movement itself, as shown in the top example in Fig. 1. Thus, combining with an optical flow stream is necessary to further represent the movement and improve the performance, as in the two-stream models [14, 15, 16]. However, the processing of optical flow greatly increases computation. Besides, in two-stream methods, the activation of the optical flow stream can only be obtained after the optical flow data are extracted, which causes high latency.
In this paper, we propose an effective strategy based on 3D convolutional networks to pre-process RGB frames and replace the input data. Our method retains what we call residual frames, which contain more motion-specific features: still objects and background information are removed, leaving mainly the changes between frames. Through this, the movement can be extracted more clearly and recognition performance can be improved, as shown in the bottom example in Fig. 1. Our experiments reveal that our approach can yield significant improvements in top-1 accuracy when these ConvNets are trained from scratch on the UCF101 and HMDB51 datasets.
For some specific category pairs such as Playing Guitar and Playing Ukulele, the movements are highly similar while the instruments differ. In such cases, it is difficult to distinguish the actions from motion representation alone, without sufficient appearance features. Therefore, we propose a two-path solution, which combines the residual input path with a simple 2D ConvNet that extracts appearance features from a single frame. Experiments show that our proposed two-path method outperforms some two-stream models on the UCF101 and HMDB51 datasets when using the same input sizes and similar or even shallower network architectures.
Our contributions are summarized as follows:
We are the first to use residual frames with 3D ConvNets for action recognition, an approach that is simple, fast, and effective.
Analysis indicates that our proposal extracts better motion representations for actions than its RGB counterpart.
Our proposal achieves the state of the art when models are trained from scratch on two benchmarks. Our results can even surpass those of some methods using optical flow, at a lower computation cost.
When subtracting adjacent frames to get a residual frame, only the frame differences are kept. In a single residual frame, movement exists in the spatial axis. Using residual frames with 2D ConvNets has been attempted and proved somewhat effective [19, 2]. However, because actions and activities are complex, with much longer durations, stacked frames are still necessary. In stacked residual frames, the movement exists not only in the spatial axis but also in the temporal axis, which is more suitable for 3D ConvNets because 3D convolution kernels process data in both spatial and temporal axes. Using stacked residual frames helps the 3D convolution kernels concentrate on capturing motion features because the network does not need to consider the appearance of objects or backgrounds in videos.
Here we introduce the detailed calculation of the proposed residual-frame input for 3D ConvNets. We use F_i to represent the data of the i-th frame, and F_{i:i+t} denotes the stacked frames from the i-th frame to the (i+t)-th frame. The process to get residual frames can be formulated as follows:

Res_{i:i+t} = F_{i+1:i+t+1} - F_{i:i+t}
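As a minimal sketch of this computation (assuming frames already decoded into a NumPy array; the function and variable names here are ours, not from the official implementation):

```python
import numpy as np

def residual_clip(frames, i, t):
    """Res_{i:i+t} = F_{i+1:i+t+1} - F_{i:i+t}: subtract each frame from its
    successor so static pixels cancel and only changes between frames remain."""
    assert i + t < len(frames), "need t+1 frames starting at index i"
    cur = frames[i:i + t].astype(np.int16)          # F_{i:i+t}
    nxt = frames[i + 1:i + t + 1].astype(np.int16)  # F_{i+1:i+t+1}
    return nxt - cur

# Toy clip: a bright square moving one pixel per frame over a fixed background.
video = np.full((17, 112, 112, 3), 30, dtype=np.uint8)
for f in range(17):
    video[f, 40:60, f:f + 20] = 255

res = residual_clip(video, 0, 16)
print(res.shape)                  # (16, 112, 112, 3)
print(np.all(res[:, 0, 0] == 0))  # True: the static background cancels out
```

The cast to a signed dtype matters: subtracting uint8 frames directly would wrap around instead of producing signed differences.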
The computational cost of this subtraction is very low and can even be neglected compared with the network itself or with optical flow calculation. With this change, 3D ConvNets can extract motion features by focusing more on the movements in videos while ignoring unnecessary objects and backgrounds.
We also pay attention to cases where similar movements exist in different actions, and good motion representation alone is not enough. For example, for the actions Apply Eye Makeup and Apply Lipstick, the main difference lies in the location (around the eye or the mouth) of a similar movement. In this example, 3D ConvNets may be able to distinguish them to some extent, but the loss of appearance information does increase the difficulty. Therefore, we use a 2D ConvNet to recover the lost appearance information and combine it with a 3D ConvNet using residual frames as input to form a two-path network.
To distinguish our proposal from existing two-stream methods [14, 15, 16], we refer to it as ‘two-path’ because we do not use any pre-computed motion features such as optical flow. Our two-path network is formed by a motion path and an appearance path, as illustrated in Fig. 2.
Motion path. Because residual frames are used in this path, movements exist in both the spatial and the temporal axes. Each 3D convolutional layer uses a three-dimensional kernel and computes over all three dimensions simultaneously, which is why 3D convolutional layers are used in this path. Because many existing 3D convolution based architectures have proved effective on action recognition datasets, we do not focus on designing a new network architecture in this paper. To verify the robustness and versatility of our proposal, we conduct experiments on various models, such as ResNet-18-3D, R(2+1)D, I3D, and S3D.
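To illustrate how such a kernel mixes spatial and temporal information, here is a naive single-channel 3D convolution in NumPy; this is our own sketch for intuition, not the paper's implementation:

```python
import numpy as np

def conv3d_valid(clip, kernel):
    """Naive single-channel 3D convolution with 'valid' padding, showing how a
    3D kernel slides over the temporal axis as well as the two spatial axes."""
    T, H, W = clip.shape
    t, h, w = kernel.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(clip[i:i + t, j:j + h, k:k + w] * kernel)
    return out

clip = np.random.rand(16, 8, 8)                # 16 frames of 8x8 pixels
kernel = np.zeros((3, 3, 3))
kernel[0, 1, 1], kernel[2, 1, 1] = -1.0, 1.0   # a temporal-difference kernel
out = conv3d_valid(clip, kernel)
print(out.shape)  # (14, 6, 6): reduced along time and space alike
```

On a static (constant) clip this temporal-difference kernel outputs all zeros, mirroring how residual inputs suppress appearance and keep only change over time.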
Appearance path. By using residual frames with 3D ConvNets, motion features can be better extracted, while appearance features such as backgrounds and object appearances are lost. Here, we simply use a naive 2D ConvNet that treats action recognition as an image classification problem. This path is supplementary to the motion path.
For the combination of these two paths, we average the predictions for the same video sample.
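A minimal sketch of this late fusion, assuming each path outputs per-class logits for a video (the names here are illustrative, not from the paper's code):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fuse_two_path(motion_logits, appearance_logits):
    """Average the per-class scores of the two paths and pick the argmax."""
    probs = (softmax(motion_logits) + softmax(appearance_logits)) / 2.0
    return probs.argmax(axis=-1), probs

motion = np.array([[2.0, 1.0, 0.1]])      # motion path slightly favors class 0
appearance = np.array([[0.5, 3.0, 0.1]])  # appearance path strongly favors class 1
pred, probs = fuse_two_path(motion, appearance)
print(pred)  # [1]: the more confident appearance prediction wins after averaging
```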
We mainly focus on the following benchmarks: UCF101, HMDB51, and Kinetics400. UCF101 consists of 13,320 videos in 101 action categories. HMDB51 comprises 7,000 videos with a total of 51 action classes. Kinetics400 consists of 400 action classes and contains around 240k videos for training, 20k for validation, and 40k for testing. We mainly conduct our experiments on UCF101 and HMDB51; results on Kinetics400 are also reported to prove the effectiveness of our proposal.
Motion path. In this path, stacked residual frames are set as the network input and are used identically to traditional RGB frame clips. For 3D ConvNets in action recognition, ignoring the image channel dimension, T × H × W denotes the data shape, where T frames are stacked together with height H and width W; several choices for this input shape are common in the literature. For fair comparison, in all of our motion paths, frames are resized and consecutive frames are stacked to form one clip; then random spatial cropping is conducted to generate input data with height and width 112. Random horizontal flipping and jittering are also applied during training. We tried two variants of ResNet-18-3D. The first follows the architecture used for image classification tasks; however, because the height and width of our video clips are both 112, half of 224, the other variant deletes the pooling layer after the first convolutional layer. R(2+1)D, I3D, and S3D are also reimplemented to verify the robustness of our proposal. The batch size is set to 32. When models are trained from scratch, the initial learning rate is set to 0.1, and we train for 100 epochs on UCF101 and HMDB51. When fine-tuning on UCF101 and HMDB51 with Kinetics400 pre-trained models, the model weights are taken directly from the pre-trained models and the network architecture remains unchanged; the initial learning rate becomes 0.001, and 50 epochs are sufficient.
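The cropping and flipping described above can be sketched as follows; the resized frame size in the toy input is arbitrary, chosen only for illustration:

```python
import numpy as np

def augment_clip(clip, crop=112, rng=None):
    """Random spatial crop to crop x crop plus random horizontal flip, applied
    identically to every frame of the clip (a sketch of the augmentations)."""
    if rng is None:
        rng = np.random.default_rng()
    T, H, W, C = clip.shape
    y = int(rng.integers(0, H - crop + 1))
    x = int(rng.integers(0, W - crop + 1))
    out = clip[:, y:y + crop, x:x + crop]
    if rng.random() < 0.5:
        out = out[:, :, ::-1]  # flip along the width axis
    return out

clip = np.zeros((16, 128, 160, 3), dtype=np.uint8)  # illustrative resized clip
print(augment_clip(clip).shape)  # (16, 112, 112, 3)
```

Applying the same crop offsets and flip decision to all frames of a clip keeps the spatio-temporal content consistent, which matters for 3D convolution.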
Our appearance path is just a supplement to the motion path. Therefore, we keep it simple and treat action recognition as image classification; the goal of this path is to capture appearance features of backgrounds and objects. The process is standard image classification, which enables the use of ImageNet pre-trained models. ResNeXt-101 is used in this path.
Testing and Results Accumulation. For the motion path, 16 clips are uniformly sampled from each video regardless of the video length, and the predictions are averaged over all clips to generate the final result. For the appearance path, 16 frames are sampled to match the motion path, and their predictions are likewise averaged.
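One way to pick those uniformly spaced clip positions is sketched below; this is our reading of the evaluation protocol, and the names are ours:

```python
import numpy as np

def uniform_clip_starts(num_frames, clip_len=16, num_clips=16):
    """Uniformly spaced clip start indices covering the whole video,
    however long it is."""
    last_start = max(num_frames - clip_len, 0)  # last valid start index
    return np.linspace(0, last_start, num_clips).astype(int)

starts = uniform_clip_starts(165)  # e.g. a 6-second, 165-frame video
print(len(starts), starts[0], starts[-1])  # 16 0 149
```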
In this section, results on the motion path are mainly reported because the appearance path is only a supplement to the motion path. We try our best to make fair comparisons with previous methods; therefore, for all comparative methods using 3D convolution, results with the same input size are reported where available. We do not compare with some state-of-the-art methods such as I3D, which uses larger inputs together with an optical flow stream, due to the large difference in setting and the high computing cost. Using larger input data and deeper networks usually ensures better performance, while we focus more on motion representation in this paper.
| Model | Residual input | Top-1 (%) | Top-5 (%) |
|---|---|---|---|
| ResNet-18 (delete first pooling layer) | | 61.6 | 84.9 |
| ResNet-18 (delete first pooling layer) | ✓ | 78.0 | 94.0 |
Significant improvements can be obtained using our residual inputs when training from scratch, as shown in Table 1. In addition to ResNet-18, we also reimplement R(2+1)D, I3D, and S3D. By replacing RGB clips with residual clips, a gain of more than 10% is achieved for all models. Using our ResNet variant together with residual clips as inputs, the top-1 accuracy improves from 61.6% to 78.0%.
We also conduct case studies to see what kind of features the model has learned. In Fig. 3, we can find that for the category Jump Rope, movements in different samples are consistent while backgrounds vary from one sample to another; our residual-input model handles these cases easily. Precisely because the residual-input model represents the motion itself so well, it alone cannot distinguish different actions with similar movements, such as the categories Apply Lipstick and Apply Eye Makeup shown on the right of Fig. 3. Moreover, visualizations using Grad-CAM in Fig. 1 also indicate that the RGB-input model still pays more attention to the background, while our motion path focuses on the moving part.
| Method | UCF101 | HMDB51 | Kinetics400 |
|---|---|---|---|
| Scratch training | | | |
| ResNet-18 baseline | 42.4 | 17.1 | 54.2 |
| TSN (RGB only) | 48.7 | - | - |
| Motion path (ours) | 78.0 | 43.7 | - |
| Single path (fine-tuning) | | | |
| CoViAR (Residuals) | 79.9 | 44.6 | - |
| TSN (RGB difference) | 83.8 | - | - |
| ResNet-18 baseline | 84.4 | 56.4 | - |
| C3D (+SVM) | 82.3 | 51.6 | - |
| TSN (RGB, ImNet pretrain) | 85.7 | 51.0 | - |
| I3D (RGB, ImNet pretrain) | 84.5 | 49.8 | 71.1 |
| Motion path (ours) | 89.0 | 58.1 | 60.3 |
| Multiple networks | | | |
| Two-stream (+SVM) | 88.0 | 59.0 | - |
| CoViAR (3 nets) | 90.4 | 59.1 | - |
First, we compare our motion path with other methods in the first part of Table 2, where all methods are trained from scratch. NAS used network architecture search to find a better architecture for action recognition and achieves high performance when trained from scratch; our motion path improves over it by 23.8%. Our proposed method achieves the state of the art when trained from scratch on UCF101 and HMDB51.
Then we introduce the results of our motion path with fine-tuning, compared with other methods in the second part of Table 2. We are not proposing a new network architecture; therefore, pre-trained weights are directly used for fine-tuning our motion path to fit residual inputs. Residual frames / frame differences have been tried in 2D convolutional networks such as CoViAR and TSN, and we make an apples-to-apples comparison with them. Results show that, using residual frames with 2D CNNs on UCF101, top-1 accuracies of 83.8% and 79.9% are achieved by TSN (RGB difference) and CoViAR (residuals), while our motion path reaches 89.0%, which reveals that 3D CNNs are more capable of processing residual frames. I3D can achieve the state-of-the-art result (98.0% top-1 accuracy) on UCF101 because its input size is larger than ours and optical flow is added, together with both ImageNet and Kinetics400 knowledge. With only RGB input and knowledge from ImageNet, the result for I3D (RGB) is 84.5%, and our motion path is better. On HMDB51, the same trend can be found: the proposed motion path outperforms I3D (RGB) even though our input size is smaller. Considering the time cost of training from scratch on Kinetics400, we directly fine-tuned the pre-trained RGB weights to fit our residual inputs. Interestingly, this also works, and the top-1 accuracy of our motion path is 60.3%, higher than the 54.2% achieved by the ResNet-18-3D model with RGB input.
As shown in the third part of Table 2, by using an additional appearance path, the result on UCF101 can be further enhanced to 90.6%, which is higher than CoViAR even though the latter uses 3 networks. On Kinetics400, our result is 67.7%, 2.1% higher than the two-stream method. Moreover, two-stream methods need to extract optical flow features first: it takes about 48 seconds for a 6-second video (165 frames) using the TV-L1 optical flow algorithm and OpenCV on a CPU. Though this can be accelerated by parallel computing, it is still time-consuming compared with the inference time of our motion path (less than 0.19 second per video). Although our results are lower than those of I3D, this is acceptable considering the input size and the lower cost of avoiding optical flow. The descent observed on HMDB51 can be attributed to the appearance path being too naive: it does not utilize any temporal information. Nevertheless, it still boosts the single-motion-path result by 1.6% and 7.4% on UCF101 and Kinetics400, respectively.
In this paper, we mainly focused on extracting motion features without using optical flow. We improved the use of 3D ConvNets by taking stacked residual frames as the network input. Results of our proposal improved significantly when models were trained from scratch on the UCF101 and HMDB51 datasets. Analysis implied that residual frames offer a fast but effective way for a network to capture motion features, and that they are a good choice for avoiding the complex computation of optical flow.
This work was partially financially supported by the Grants-in-Aid for Scientific Research Numbers JP19K20289 and JP19K22863 from JSPS, Japan.
“Large-scale video classification with convolutional neural networks,” in , 2014, pp. 1725–1732.
Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2019, vol. 33, pp. 8674–8681.
“TV-L1 optical flow estimation,” Image Processing On Line, vol. 2013, pp. 137–150, 2013.