Motion Representation Using Residual Frames with 3D CNN

by Li Tao, et al.
The University of Tokyo

Recently, 3D convolutional networks (3D ConvNets) have yielded good performance in action recognition. However, an optical flow stream is still needed to ensure better performance, and its cost is very high. In this paper, we propose a fast but effective way to extract motion features from videos by using residual frames as the input data of 3D ConvNets. By replacing traditional stacked RGB frames with residual ones, top-1 accuracy improves by 35.6 and 26.6 percentage points on the UCF101 and HMDB51 datasets, respectively, when ResNet-18 models are trained from scratch, which is the state of the art in this training mode. Analysis shows that better motion features can be extracted using residual frames than with their RGB counterparts. Combined with a simple appearance path, our proposal can even outperform some methods that use optical flow streams.




1 Introduction

For action recognition, motion representation, that is, extracting motion features across multiple frames, is an important challenge. Various methods have been designed to capture movement. 2D ConvNet based methods use interactions along the temporal axis to include temporal information [1, 2, 3, 4, 5]. 3D ConvNet based methods improved recognition performance by extending the 2D convolution kernel to 3D, and the computations along the temporal axis in each convolutional layer are believed to handle the movements [6, 7, 8, 9, 10, 11]. State-of-the-art methods showed further improvements by increasing the number of input frames and the size of the input data, as well as by using deeper backbone networks [12, 13].

In a typical implementation, 3D ConvNets use stacked RGB frames (also called video clips; we use both terms below) as the input data. However, this kind of input is considered insufficient for motion representation, because the features captured from stacked RGB frames may pay more attention to appearance, including backgrounds and objects, than to the movement itself, as shown in the top example in Fig. 1. Thus, combining with an optical flow stream is necessary to further represent the movement and improve performance, as in the two-stream models [14, 15, 16]. However, computing optical flow greatly increases the computation. Besides, two-stream results can only be obtained after the optical flow data have been extracted, which causes high latency.

In this paper, we propose an effective strategy for 3D convolutional networks: RGB frames are pre-processed to generate residual frames, which replace the original input data. Residual frames contain more motion-specific features because still objects and background information are removed, leaving mainly the changes between frames. In this way, the movement can be extracted more clearly and recognition performance can be improved, as shown in the bottom example in Fig. 1. Our experiments reveal that our approach yields significant improvements in top-1 accuracy when ConvNets are trained from scratch on the UCF101 [17] and HMDB51 [18] datasets.

Figure 1: An example of our residual frames compared with normal RGB inputs. The residual-input model focuses on the moving part, while the RGB-input model pays more attention to the background, which leads to lower prediction accuracy.

For some specific category pairs such as Playing Guitar and Playing Ukelele, the movements are highly similar while the instruments differ. In this case, the actions are difficult to distinguish by motion representation alone, without enough appearance features. Therefore, we propose a two-path solution, which combines the residual-input path with a simple 2D ConvNet that extracts appearance features from a single frame. Experiments show that our proposed two-path method outperforms some two-stream models on the UCF101 and HMDB51 datasets when using the same input sizes and similar or even shallower network architectures.

Our contributions are summarized as follows:

  • We are the first to use residual frames with 3D ConvNets for action recognition, which is simple, fast, but effective.

  • Analysis indicates that our proposal can extract better motion representation for actions than RGB counterparts.

  • Our proposal can achieve the state of the art when models are trained from scratch on two benchmarks. Our results can even achieve better performance with less computation cost than some methods using optical flow.

2 Proposed method

2.1 Residual frames

When adjacent frames are subtracted to get a residual frame, only the frame differences are kept. In a single residual frame, movements exist along the spatial axes. Using residual frames with 2D ConvNets has been attempted and proved to be somewhat effective [19, 2]. However, because actions or activities are complex and last much longer, stacked frames are still necessary. In stacked residual frames, the movement exists not only along the spatial axes but also along the temporal axis, which suits 3D ConvNets because 3D convolution kernels process data along both spatial and temporal axes. Using stacked residual frames helps 3D convolutional kernels concentrate on capturing motion features, because the network does not need to consider the appearance of objects or backgrounds in videos.

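Here we introduce the detailed calculation of the proposed residual-frame input for 3D ConvNets. We use $Frame_i$ to represent the data of the $i$-th frame, and $Frame_{i \sim j}$ denotes the stacked frames from the $i$-th frame to the $j$-th frame. The process to get residual frames can be formulated as follows:

$$ResFrame_{i \sim j} = \left| Frame_{i \sim j} - Frame_{i+1 \sim j+1} \right|$$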

The computational cost is cheap and can even be ignored when compared with the network itself or optical flow calculation. With this change, 3D ConvNet can extract motion features by focusing more on the movements in videos while ignoring some unnecessary objects and backgrounds.
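The subtraction of adjacent frames described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `residual_frames`, the `(T + 1, H, W, C)` array layout, the use of the absolute difference, and the `float32` output type are all assumptions made for the example.

```python
import numpy as np

def residual_frames(frames: np.ndarray) -> np.ndarray:
    """Turn T + 1 stacked frames into T residual frames.

    frames: array of shape (T + 1, H, W, C), e.g. uint8 RGB frames.
    Returns a float32 array of shape (T, H, W, C) whose entry t is the
    absolute difference between frame t and frame t + 1.
    (Assumption: absolute difference; hypothetical helper name.)
    """
    if frames.shape[0] < 2:
        raise ValueError("need at least two frames to form a residual")
    # Cast before subtracting so that uint8 values cannot wrap around.
    as_float = frames.astype(np.float32)
    return np.abs(as_float[:-1] - as_float[1:])

# 17 consecutive frames give a 16-frame residual clip.
clip = np.zeros((17, 112, 112, 3), dtype=np.uint8)
print(residual_frames(clip).shape)  # (16, 112, 112, 3)
```

A static scene yields an all-zero residual clip, which is the sense in which still objects and backgrounds are removed. The cost is one element-wise subtraction per frame, which is small next to a forward pass of the network.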

We also consider cases where similar movements exist in different actions, so that a good motion representation alone is not enough. For example, for the actions Apply Eye Makeup and Apply Lipstick, the main difference lies in the location (around the eye or the mouth) of otherwise similar movements. In this example, 3D ConvNets may be able to distinguish them to some extent, but the loss of appearance information does increase the difficulty. Therefore, we use a 2D ConvNet to process the lost appearance information and combine it with a 3D ConvNet that takes residual frames as input to form a two-path network.

2.2 Two-path network

To distinguish our proposal from the existing two-stream methods [14, 15, 16], we refer to our method as ‘two-path’ because we do not use any pre-computed motion features such as optical flow. Our two-path network is formed by a motion path and an appearance path, as illustrated in Fig. 2.

Figure 2: Framework of our two-path network. The motion path and the appearance path are trained separately using cross-entropy loss. At inference time, the output probabilities of the two paths are averaged.

Motion path. Because residual frames are used in this path, movements exist along both the spatial axes and the temporal axis. Each 3D convolutional layer has a three-dimensional kernel and computes over all three dimensions simultaneously, so 3D convolutional layers are used in this path. Because many existing 3D-convolution-based architectures have proved effective on action recognition datasets, we do not focus on designing a new network architecture in this paper. To verify the robustness and versatility of our proposal, we conduct experiments on various models, such as ResNet-18-3D [10], R(2+1)D [11], I3D [9], and S3D [8].

Appearance path. By using residual frames with 3D ConvNets, motion features can be better extracted, while background features, which contain object appearances, are lost. Here, we simply use a naive 2D ConvNet, which treats action recognition as a simple image classification problem. This path is a supplement to the motion path.

For the combination of these two paths, we average the predictions for the same video sample.
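The averaging step can be sketched as follows. This is an illustrative sketch only: it assumes each path outputs raw class scores (logits) for the same video, and the helper names `softmax` and `fuse_two_paths` are hypothetical.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def fuse_two_paths(motion_logits, appearance_logits) -> np.ndarray:
    """Average the class probabilities of the two paths for one video."""
    motion_prob = softmax(np.asarray(motion_logits, dtype=np.float64))
    appearance_prob = softmax(np.asarray(appearance_logits, dtype=np.float64))
    return (motion_prob + appearance_prob) / 2.0

# Toy example with three classes: the two paths disagree,
# and the more confident path decides the fused prediction.
fused = fuse_two_paths([2.0, 0.0, 0.0], [0.0, 0.0, 3.0])
print(int(fused.argmax()))  # 2
```

Because the two paths are trained separately, this late fusion needs no joint training; it only requires that both paths score the same set of classes.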

3 Experiments

3.1 Datasets

We mainly focus on the following benchmarks: UCF101 [17], HMDB51 [18], and Kinetics400 [20]. UCF101 consists of 13,320 videos in 101 action categories. HMDB51 comprises about 7,000 videos in 51 action classes. Kinetics400 consists of 400 action classes and contains around 240k videos for training, 20k videos for validation, and 40k videos for testing. We mainly conduct our experiments on UCF101 and HMDB51. Results on Kinetics400 are also reported to show the effectiveness of our proposal.

3.2 Implementation details

Motion path. In this path, stacked residual frames are the network input. Residual frames are used in the same way as traditional RGB frame clips. For 3D ConvNets in action recognition, ignoring the number of image channels, T × H × W denotes the data shape, where T frames are stacked together with height H and width W. There are several choices for the input data shape. For fair comparison, in all of our motion-path experiments, following [6], frames are resized to 128 × 171 and 16 consecutive frames are stacked to form one clip. Then, random spatial cropping is conducted to generate input data of size 16 × 112 × 112. Random horizontal flipping and jittering are also applied during training. We tried two variants of ResNet-18-3D. In [10], the architectures are taken directly from image classification tasks. However, the height and width of our video clips are both 112, which is half of 224. Therefore, in the second variant we delete the pooling layer after the first convolution layer. R(2+1)D [11], I3D [9], and S3D [8] are also reimplemented to verify the robustness of our proposal. The batch size is set to 32. When models are trained from scratch, the initial learning rate is set to 0.1, and we train for 100 epochs on UCF101 and HMDB51. When fine-tuning on UCF101 and HMDB51 from Kinetics400 pre-trained models, the model weights are taken directly from [10] and the network architecture remains the same as in [10]. The initial learning rate is 0.001, and 50 epochs are sufficient.
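The random spatial cropping and horizontal flipping used during training can be sketched as below. This is a sketch under stated assumptions, not the authors' pipeline: the function name `random_crop_flip`, the `(T, H, W, C)` layout, and the choice to share one crop window and one flip decision across all frames of a clip are assumptions, and jittering is omitted.

```python
import numpy as np

def random_crop_flip(clip: np.ndarray, crop: int = 112, rng=None) -> np.ndarray:
    """Randomly crop and maybe flip a clip of shape (T, H, W, C).

    The same crop window and flip decision are applied to every frame
    (an assumption), so differences between adjacent frames still
    reflect motion rather than a shifting crop position.
    """
    rng = np.random.default_rng() if rng is None else rng
    _, height, width, _ = clip.shape
    if height < crop or width < crop:
        raise ValueError("frames are smaller than the crop size")
    top = rng.integers(0, height - crop + 1)   # upper bound is exclusive
    left = rng.integers(0, width - crop + 1)
    out = clip[:, top:top + crop, left:left + crop, :]
    if rng.random() < 0.5:
        out = out[:, :, ::-1, :]               # flip along the width axis
    return out

rng = np.random.default_rng(0)
frames = rng.integers(0, 256, size=(16, 128, 171, 3), dtype=np.uint8)
print(random_crop_flip(frames, rng=rng).shape)  # (16, 112, 112, 3)
```

Whether augmentation is applied before or after the residual computation is not stated here; with a shared crop and flip, both orders give the same result.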

Appearance path. Our appearance path is just a supplement to our motion path. Therefore, we keep it simple and treat action recognition as image classification. The goal of this path is to capture appearance features of backgrounds and objects. This process is standard in image classification, which enables the use of ImageNet pre-trained models. ResNeXt-101 [21] is used in this path.

Testing and Results Accumulation. For the motion path, 16 clips are uniformly sampled from each video regardless of the video length. The predictions are averaged over all clips to generate the final result. For the appearance path, 16 frames are sampled to match the motion path, and their predictions are averaged to generate the final result.
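The test-time procedure can be sketched as follows. The sketch assumes that "uniformly sampled" means evenly spaced clip start positions; the function names and the `clip_len` parameter are hypothetical, and clips of a very short video simply overlap.

```python
import numpy as np

def uniform_clip_starts(num_frames: int, clip_len: int = 16,
                        num_clips: int = 16) -> np.ndarray:
    """Start indices of num_clips clips spread evenly over a video."""
    last_start = max(num_frames - clip_len, 0)
    return np.linspace(0, last_start, num_clips).round().astype(int)

def video_prediction(clip_probs) -> np.ndarray:
    """Average per-clip class probabilities into one video-level result."""
    return np.asarray(clip_probs, dtype=np.float64).mean(axis=0)

starts = uniform_clip_starts(num_frames=165)
print(len(starts), starts[0], starts[-1])  # 16 0 149
```

The number of forward passes per video is fixed at `num_clips`, so inference cost does not grow with video length.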

4 Results and discussion

In this section, we mainly report results on the motion path, because the appearance path is only a supplement to it. We try our best to make fair comparisons with previous methods. Therefore, for all compared methods that use 3D convolution, results with the same input size as ours are reported if available. We do not compare with some state-of-the-art methods, such as I3D with larger inputs and an optical flow stream, because the settings differ greatly and the computing cost is too high. Larger input data and deeper networks usually ensure better performance, while we focus on motion representation in this paper.

4.1 Effectiveness of residual inputs

Model | residual | top-1 | top-5
ResNet-18 (baseline) | | 51.9 | 76.3
ResNet-18 (baseline) | ✓ | 66.4 | 88.0
ResNet-18 (delete first pooling layer) | | 61.6 | 84.9
ResNet-18 (delete first pooling layer) | ✓ | 78.0 | 94.0
R(2+1)D [11] | | 51.8 | 79.2
R(2+1)D [11] | ✓ | 66.7 | 88.3
I3D [9] | | 56.5 | 81.3
I3D [9] | ✓ | 66.6 | 87.0
S3D [8] | | 51.1 | 77.4
S3D [8] | ✓ | 64.8 | 86.9
Table 1: The effectiveness of our residual inputs for scratch training on UCF101 split 1. We reimplement R(2+1)D, I3D, and S3D, and keep the input in the same shape. In the residual column, a check mark means the model uses the proposed residual clips as input; otherwise it uses the original RGB clips.

As shown in Table 1, significant improvements can be obtained using our residual inputs when models are trained from scratch. In addition to ResNet-18, we also reimplement R(2+1)D, I3D, and S3D. By replacing the RGB clips with residual clips, a gain of more than 10 percentage points in top-1 accuracy is achieved for all models. By using our ResNet variant together with residual clips as inputs, the top-1 accuracy improves from 61.6% to 78.0%.

Figure 3: Examples for the case study comparing with RGB-input models. The residual-input model performs better than the RGB-input model for the category Jump Rope but worse for Apply Lipstick, because it is more robust to cluttered backgrounds but is easily confused by similar movements (Apply Eye Makeup).

We also conduct a case study to see what kind of features the model has learned. In Fig. 3, we can see that for the category Jump Rope, movements in different samples are consistent while backgrounds vary from one sample to another. Our residual-input model handles these cases easily. However, precisely because the residual-input model represents the motion itself well, it cannot on its own distinguish different actions with similar movements, such as the categories Apply Lipstick and Apply Eye Makeup shown on the right of Fig. 3. Moreover, visualizations using Grad-CAM [22] in Fig. 1 also indicate that the RGB-input model pays more attention to the background, while our motion path focuses on the moving part.

Method | UCF101 | HMDB51 | Kinetics400
Scratch training
ResNet-18 baseline [10] | 42.4 | 17.1 | 54.2
STC-ResNet-101 [23] | 45.6 | - | 64.1
NAS [24] | 58.6 | - | -
Slow fusion | 41.3 | - | -
TSN (RGB only) [2] | 48.7 | - | -
C3D [6] | 51.6 | 24.3 | 55.6
Motion path (ours) | 78.0 | 43.7 | -
Single path (fine-tuning)
CoViAR (Residuals) [19] | 79.9 | 44.6 | -
TSN (RGB difference) [2] | 83.8 | - | -
ResNet-18 baseline [10] | 84.4 | 56.4 | -
C3D (+SVM) [6] | 82.3 | 51.6 | -
TSN (RGB, ImNet pretrain) [2] | 85.7 | 51.0 | -
I3D (RGB, ImNet pretrain) [9] | 84.5 | 49.8 | 71.1
Motion path (ours) | 89.0 | 58.1 | 60.3
Multi-path (fine-tuning)
Two-stream [16] | 86.9 | 58.0 | 65.6
Two-stream (+SVM) [16] | 88.0 | 59.0 | -
I3D [9] | 98.0 | 80.7 | 74.2
CoViAR (3 nets) [19] | 90.4 | 59.1 | -
Two-path (ours) | 90.6 | 56.6 | 67.7
Table 2: Accuracy (%) on UCF101, HMDB51, and Kinetics400. Results on UCF101 and HMDB51 are averaged over 3 splits, except for the scratch training part, which reports split 1 only. Compared results use either the same input size as ours or a larger single-frame size. In the multi-path part, the two-stream methods [16] and I3D [9] use optical flow.

4.2 Comparisons with other methods

First, we compare our motion path with other methods in the first part of Table 2. All methods are trained from scratch. NAS [24] used neural architecture search to find a better network architecture for action recognition and achieved high performance in scratch training. Our motion path improves on it by 19.4 percentage points on UCF101 (78.0% vs. 58.6%). Our proposed method achieves the state of the art when trained from scratch on UCF101 and HMDB51.

Then we present the results of our motion path with fine-tuning, compared with other methods in the second part of Table 2. Since we are not proposing a new network architecture, the weights from [10] are used directly to fine-tune our motion path to fit residual inputs. Residual frames or frame differences have been used with 2D convolutional networks such as CoViAR [19] and TSN [2], so we also make an apples-to-apples comparison with them. With residual frames and 2D CNNs, TSN (RGB difference) and CoViAR (residuals) achieve 83.8% and 79.9% top-1 accuracy on UCF101, while our motion path reaches 89.0%, which suggests that 3D CNNs are more capable of processing residual frames. I3D achieves the state-of-the-art result (98.0% top-1 accuracy) on UCF101 because its input size is larger than ours and optical flow is added, together with both ImageNet and Kinetics400 knowledge. With only RGB input and knowledge from ImageNet, the result for I3D (RGB) is 84.5%, and our motion path is better. The same trend can be found on HMDB51, where the proposed motion path is better than I3D (RGB) even though our input size is smaller. Considering the time cost of scratch training on Kinetics400, we directly fine-tuned the pre-trained RGB weights to fit our residual inputs. Interestingly, this also works: the top-1 accuracy of our motion path is 60.3%, higher than the 54.2% achieved by the ResNet-18-3D model with RGB input in [10].

As shown in the third part of Table 2, by using an additional appearance path, results on UCF101 can be further improved to 90.6%, which is higher than CoViAR even though CoViAR used 3 networks. On Kinetics400, our result is 67.7%, 2.1 percentage points higher than the two-stream method [16]. Moreover, for two-stream methods, optical flow features need to be extracted first. This takes about 48 seconds for a 6-second video (165 frames) using the TV-L1 optical flow algorithm [25] and OpenCV on a CPU. Though it can be accelerated by parallel computing, it is still time-consuming compared with the inference time of our motion path (less than 0.19 seconds per video). Although our results are lower than those of I3D [9], this is acceptable considering the smaller input size and the lower cost of not calculating optical flow. The drop observed on HMDB51 may be because the current appearance path is too naive and does not use any temporal information. However, the appearance path still boosts the result of the single motion path by 1.6 and 7.4 percentage points on UCF101 and Kinetics400, respectively.

5 Conclusion

In this paper, we focused on extracting motion features without using optical flow. We improved the use of 3D convolution by taking stacked residual frames as the network input. With our proposal, results improve significantly when models are trained from scratch on the UCF101 and HMDB51 datasets. Analysis indicates that residual frames are a fast but effective way for a network to capture motion features, and a good choice for avoiding the complex computation of optical flow.


This work was partially financially supported by the Grants-in-Aid for Scientific Research Numbers JP19K20289 and JP19K22863 from JSPS, Japan.


  • [1] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei, “Large-scale video classification with convolutional neural networks,” in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1725–1732.
  • [2] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool, “Temporal segment networks: Towards good practices for deep action recognition,” in European conference on computer vision (ECCV). Springer, 2016, pp. 20–36.
  • [3] Yanghao Li, Sijie Song, Yuqi Li, and Jiaying Liu, “Temporal bilinear networks for video action recognition,” in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2019, vol. 33, pp. 8674–8681.
  • [4] Ji Lin, Chuang Gan, and Song Han, “Temporal shift module for efficient video understanding,” arXiv preprint arXiv:1811.08383, 2018.
  • [5] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He, “Non-local neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7794–7803.
  • [6] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in Proceedings of the IEEE international conference on computer vision (ICCV), 2015, pp. 4489–4497.
  • [7] Zhaofan Qiu, Ting Yao, and Tao Mei, “Learning spatio-temporal representation with pseudo-3d residual networks,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 5533–5541.
  • [8] Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy, “Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 305–321.
  • [9] Joao Carreira and Andrew Zisserman, “Quo vadis, action recognition? a new model and the kinetics dataset,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 6299–6308.
  • [10] Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh, “Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 18–22.
  • [11] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri, “A closer look at spatiotemporal convolutions for action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 6450–6459.
  • [12] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He, “Slowfast networks for video recognition,” arXiv preprint arXiv:1812.03982, 2018.
  • [13] Du Tran, Heng Wang, Lorenzo Torresani, and Matt Feiszli, “Video classification with channel-separated convolutional networks,” arXiv preprint arXiv:1904.02811, 2019.
  • [14] Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman, “Convolutional two-stream network fusion for video action recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), 2016, pp. 1933–1941.
  • [15] Christoph Feichtenhofer, Axel Pinz, and Richard Wildes, “Spatiotemporal residual networks for video action recognition,” in Advances in neural information processing systems (NeurIPS), 2016, pp. 3468–3476.
  • [16] Karen Simonyan and Andrew Zisserman, “Two-stream convolutional networks for action recognition in videos,” in Advances in neural information processing systems (NeurIPS), 2014, pp. 568–576.
  • [17] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah, “Ucf101: A dataset of 101 human actions classes from videos in the wild,” arXiv preprint arXiv:1212.0402, 2012.
  • [18] Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre, “Hmdb: a large video database for human motion recognition,” in International Conference on Computer Vision (ICCV). IEEE, 2011, pp. 2556–2563.
  • [19] Chao-Yuan Wu, Manzil Zaheer, Hexiang Hu, R Manmatha, Alexander J Smola, and Philipp Krähenbühl, “Compressed video action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 6026–6035.
  • [20] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al., “The kinetics human action video dataset,” arXiv preprint arXiv:1705.06950, 2017.
  • [21] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He, “Aggregated residual transformations for deep neural networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), 2017, pp. 1492–1500.
  • [22] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra, “Grad-cam: Visual explanations from deep networks via gradient-based localization,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 618–626.
  • [23] Ali Diba, Mohsen Fayyaz, Vivek Sharma, M Mahdi Arzani, Rahman Yousefzadeh, Juergen Gall, and Luc Van Gool, “Spatio-temporal channel correlation networks for action classification,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 284–299.
  • [24] Wei Peng, Xiaopeng Hong, and Guoying Zhao, “Video action recognition via neural architecture searching,” in 2019 IEEE International Conference on Image Processing (ICIP). IEEE, 2019, pp. 11–15.
  • [25] Javier Sánchez Pérez, Enric Meinhardt-Llopis, and Gabriele Facciolo, “Tv-l1 optical flow estimation,” Image Processing On Line, vol. 2013, pp. 137–150, 2013.