Residual Frames with Efficient Pseudo-3D CNN for Human Action Recognition

by Jiawei Chen, et al.

Human action recognition is regarded as a cornerstone in domains such as surveillance and video understanding. Despite recent progress in the development of end-to-end solutions for video-based action recognition, achieving state-of-the-art performance still requires auxiliary hand-crafted motion representations, e.g., optical flow, which are usually computationally demanding. In this work, we propose to use residual frames (i.e., differences between adjacent RGB frames) as an alternative "lightweight" motion representation, which carries salient motion information and is computationally efficient. In addition, we develop a new pseudo-3D convolution module which decouples 3D convolution into 2D and 1D convolution. The proposed module exploits residual information in the feature space to better structure motions, and is equipped with a self-attention mechanism that helps recalibrate the appearance and motion features. Empirical results confirm the efficiency and effectiveness of residual frames as well as the proposed pseudo-3D convolution module.




1 Introduction

The resurgence of convolutional neural networks (CNNs) and large-scale labeled datasets has led to unprecedented advances in image classification using end-to-end trainable networks. However, video-based human action recognition has not yet achieved similar success based on pure CNN features. One fundamental challenge is how to effectively model temporal information, i.e., recognizing correlation and causation through time. There is a classical branch of research focusing on modeling motion through hand-crafted optical flow, including histograms of flow [laptev2008learning], motion boundary histograms [dalal2006human], and trajectories [wang2013action]. In the context of deep learning, the two-stream method [chen2017semi, feichtenhofer2016convolutional, simonyan2014two, zhu2018hidden], which exploits the optical flow and RGB modalities in separate streams, is one of the most successful frameworks. However, it is methodologically unsatisfactory given that optical flow is computationally expensive, and two-stream methods are often not learned end-to-end jointly with the flow. Recent studies attempt to model appearance and motion features within a single model from the RGB modality alone [feichtenhofer2019slowfast, hara2018can, tran2018closer, varol2017long]. Nevertheless, it has been shown that combining optical flow and RGB frames as input to these models can still improve performance [varol2017long].

In this paper, we propose to use residual frames, i.e., the differences between adjacent RGB frames, as an additional input modality to RGB data for action recognition. The reason is two-fold: (a) adjacent RGB frames largely share the still objects and background information, thus residual frames usually retain only motion-specific features (see Fig. 1); (b) the computational cost of residual frames is negligible compared to other motion representations. In our experiments, we verify that using residual frames can yield significant improvement for action recognition.

Following the recent trend of developing efficient 3D convolution models for video classification [lin2019tsm, qiu2017learning, tran2018closer, xie2018rethinking], we also propose a new efficient pseudo-3D convolution module wherein the standard 3D convolution is decoupled into 2D and 1D convolution. To further enhance motion features, we utilize residual information in the feature space, i.e., the differences between temporally adjacent CNN features. A novel self-attention mechanism is also proposed to recalibrate the appearance and motion features based on their significance to the end task.

Our contributions are summarized as follows:

  1. We propose a multi-modal approach for action recognition that utilizes both RGB and residual frames. We empirically verify that using residual frames is a simple yet effective approach to improve performance.

  2. We develop a novel and efficient pseudo-3D convolution module that involves residual features and a self-attention mechanism. We provide ablation studies to confirm the effectiveness of individual components in our proposed module.

2 Methodology

In this section, we first define residual frames and describe their use as an auxiliary modality for action recognition. Then, we introduce our efficient pseudo-3D convolution module.

2.1 Residual Frames

Given a video clip $x \in \mathbb{R}^{T \times H \times W \times C}$, where $T$, $H$, $W$, and $C$ denote the clip length, the height and width of each frame, and the number of channels, respectively, a residual frame can be formed by subtracting the reference frame $x_{t_i}$ from the desired frame $x_{t_j}$, where the step size between timestamps $t_i$ and $t_j$ is denoted as $s$. More formally, we can define a residual frame as:

$$r_{t_i \to t_j} = x_{t_j} - x_{t_i}, \qquad s = t_j - t_i.$$

Because nearby video frames share most of their static information, a residual frame normally has little background and object appearance information but retains salient motion-specific information. Thus, residual frames are a good source for extracting motion features. Moreover, compared to other motion representations like optical flow, the computational cost of residual frames is notably lower.

In reality, actions and activities are complex and can involve different motion speeds and durations. In order to cope with such uncertainty, we can stack consecutive residual frames to form a residual clip, which tends to capture fast motion in the spatial axis and slow/long-duration motion in the temporal axis. Thus, a residual clip is naturally suitable for 3D convolution wherein short- and long-duration motion cues can be extracted simultaneously.
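The construction of a residual clip can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `residual_clip`, the `(T, H, W, C)` array layout, the signed (not absolute) difference, and the cast to `float32` are all assumptions made for the example.

```python
import numpy as np

def residual_clip(clip: np.ndarray, step: int = 1) -> np.ndarray:
    """Stack residual frames clip[t + step] - clip[t] for a clip shaped (T, H, W, C)."""
    if step < 1 or step >= clip.shape[0]:
        raise ValueError("step must satisfy 1 <= step < T")
    # Cast to a signed float type first: subtracting uint8 arrays would wrap around.
    frames = clip.astype(np.float32)
    return frames[step:] - frames[:-step]

# Toy clip: 8 frames of 4x4 RGB with a static background and one moving bright pixel.
clip = np.full((8, 4, 4, 3), 100, dtype=np.uint8)
for t in range(8):
    clip[t, 0, t % 4] = 255

res = residual_clip(clip, step=1)
print(res.shape)  # (7, 4, 4, 3)
```

In the toy clip, every row except the top one is static, so the corresponding residual values are all zero; only the moving pixel leaves a trace.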

However, residual frames alone may be insufficient for human action recognition because the object appearance and background scene can also provide important cues to discriminate actions, e.g., Apply Eye Makeup and Apply Lipstick are similar in motion but differ in the location of the movement. Therefore, it is necessary to utilize both RGB and residual frames for action recognition. To this end, we develop a pseudo-3D CNN that can operate on both modalities. We present the details in Sec. 2.2.

2.2 Pseudo-3D Convolution Module

We propose a new pseudo-3D convolution module in which standard 3D filters are decoupled into parallel 2D spatial and 1D temporal filters. The reason we use decoupled 3D convolution is two-fold. First, replacing 3D convolution with separable 2D and 1D convolution greatly reduces model size and computational cost, which is in line with the recent trend in the development of efficient 3D networks. Second, placing 2D and 1D convolution in separate pathways allows modeling appearance and motion features differently. In particular, we extend the idea of residual frames from pixel-level to feature-level when modeling motions. Given an output feature $F_t$ from the 1D temporal convolution, we first shift it along the temporal dimension by a stride of 1 and then generate a residual feature $F_r$ by subtracting the shifted feature from the original version:

$$F_r = F_t - \mathrm{shift}(F_t).$$

As a result, three features $F_s$, $F_t$, and $F_r$ are created after the pseudo-3D convolution, where $F_s$ is the output of the 2D convolution, which maintains the appearance information, and $F_t$ and $F_r$ preserve distinctive motion structures.
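The shift-and-subtract step on a temporal feature map might look like the following NumPy sketch. The function name `temporal_residual`, the `(C, T, H, W)` layout, and the boundary handling are assumptions: the text does not say what fills the first time step after shifting, so here it is paired with itself and its residual is zero.

```python
import numpy as np

def temporal_residual(feat: np.ndarray) -> np.ndarray:
    """Residual feature for a feature map shaped (C, T, H, W).

    The feature is shifted by one step along the time axis and the shifted
    copy is subtracted from the original. The first time step is padded with
    itself (an assumption), so its residual is zero.
    """
    shifted = np.concatenate([feat[:, :1], feat[:, :-1]], axis=1)
    return feat - shifted

# Toy feature whose values grow by 1 at every time step.
feat = np.arange(2 * 4, dtype=np.float32).reshape(2, 4, 1, 1)
res_feat = temporal_residual(feat)
print(res_feat[:, :, 0, 0])
```

For this toy input the residual is 0 at the first time step and 1 everywhere else, reflecting the constant rate of change along time.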

To facilitate effective fusion of appearance and motion features, we propose to apply a channel self-attention mechanism to recalibrate features. Specifically, we first concatenate the output features (the 2D-convolution output $F_s$, the 1D-convolution output $F_t$, and the residual feature $F_r$) in the channel dimension:

$$F = F_s \oplus F_t \oplus F_r,$$

where $\oplus$ represents concatenation. Then, we produce a channel attention mask $M$ as:

$$M = \sigma\left(W \cdot \mathrm{pool}(F) + b\right),$$

where $W$ represents a weight matrix parameterized by a one-layer neural network, $b$ is a bias term, $\mathrm{pool}(\cdot)$ is a global pooling operation that averages $F$ across space and time, and $\sigma$ indicates the sigmoid function. Our goal is to introduce dynamics conditioned on the input feature and reweight channels based on their significance to the end task. Thus, we conduct channel-wise multiplication between the input $F$ and the attention mask $M$. To further promote robustness, we adopt a residual connection in our module.
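The recalibration step described above (concatenate, pool globally, apply one linear layer and a sigmoid, then rescale channels) can be sketched as follows. This is an illustrative NumPy version under stated assumptions: the names `recalibrate`, `f_s`, `f_t`, `f_r`, `weight`, and `bias` are hypothetical, the weights are random rather than learned, and the module's residual connection and surrounding convolutions are left out.

```python
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def recalibrate(f_s, f_t, f_r, weight, bias):
    """Concatenate three (C, T, H, W) features and reweight their channels."""
    fused = np.concatenate([f_s, f_t, f_r], axis=0)   # (3C, T, H, W)
    pooled = fused.mean(axis=(1, 2, 3))               # global average over time and space
    mask = sigmoid(weight @ pooled + bias)            # one value in (0, 1) per channel
    return fused * mask[:, None, None, None], mask

rng = np.random.default_rng(0)
C, T, H, W = 4, 8, 7, 7
f_s, f_t, f_r = (rng.standard_normal((C, T, H, W)) for _ in range(3))
weight = rng.standard_normal((3 * C, 3 * C)) * 0.1   # stand-in for a learned layer
bias = np.zeros(3 * C)

out, mask = recalibrate(f_s, f_t, f_r, weight, bias)
print(out.shape, mask.shape)  # (12, 8, 7, 7) (12,)
```

Because the mask depends on the pooled input, the scaling applied to appearance and motion channels changes from clip to clip, which is the input-conditioned behavior the paragraph describes.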

Figure 2 presents the detailed design of the proposed pseudo-3D convolution module. The top and bottom convolutions are applied for reducing and restoring dimensions.

Figure 2: Proposed pseudo-3D convolution module. $\oplus$ represents concatenation.

The proposed module can be integrated into any standard CNN architecture, e.g., ResNet [he2016deep]. In our experiments, we develop variants of ResNet-50 by replacing all the bottleneck blocks with our pseudo-3D convolution module. In order to operate on both RGB and residual frames concurrently, we modify the original data layer (the first convolutional layer) into two streams with parallel building blocks, one for each modality. The resulting features from the two streams are concatenated and passed to the succeeding layer. Please refer to Table 1 for the detailed network architecture.

3 Experimental Evaluation

We evaluate the performance of the proposed approach on the UCF101 [soomro2012ucf101] dataset, which consists of 13,320 videos in 101 action categories. We preprocess each video by fixing the frame rate to 15 fps and resizing frames so that the short side is 256 pixels. During training, random scaling and corner cropping are utilized for data augmentation, and the cropped region is resized to 112×112 for each frame. During testing, we uniformly sample 10 clips from each video and obtain the final prediction by averaging clip scores. We report video-level top-1 and top-5 accuracy on the validation set of split-1 for all experiments. It is worth mentioning that state-of-the-art performance on UCF101 is achieved by using deep models that have been pretrained on large-scale video datasets [carreira2017quo, lin2019tsm]. However, we train all the models from scratch, since pushing the performance limit is outside our current scope.
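The test-time protocol (uniformly sampled clips, averaged clip scores) can be sketched as below. The helper names `uniform_clip_starts` and `video_prediction` are hypothetical, and the clip length of 16 frames and the 150-frame video are assumptions for the example; the text does not state the clip length.

```python
import numpy as np

def uniform_clip_starts(num_frames: int, clip_len: int, num_clips: int = 10) -> np.ndarray:
    """Start indices of num_clips clips spread evenly over the video."""
    last_start = max(num_frames - clip_len, 0)
    return np.linspace(0, last_start, num_clips).round().astype(int)

def video_prediction(clip_scores: np.ndarray) -> int:
    """Average per-clip class scores shaped (num_clips, num_classes), then take the argmax."""
    return int(clip_scores.mean(axis=0).argmax())

# A 10-second video at 15 fps has 150 frames; clip_len=16 is an assumed value.
starts = uniform_clip_starts(num_frames=150, clip_len=16)
print(starts[0], starts[-1])  # 0 134

# Three clips, two classes: the averaged scores favor class 1.
scores = np.array([[0.1, 0.9], [0.6, 0.4], [0.2, 0.8]])
print(video_prediction(scores))  # 1
```

Averaging before the argmax lets a confident clip outweigh an ambiguous one, which is why a single misclassified clip (the second row here) does not change the video-level label.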

Stage Filters Output size
raw clip
conv1 RGB: [, 64], [, 64] RGB:
Residual: [, 64], [, 64] Residual:
, 64
[,64], , 64 ]
, 64
, 128
[,128], , 128 ]
, 128
, 256
[,256], , 256]
, 256
, 512
[,512], , 512 ]
, 512
fc1 , 2048
fc2 , classes
Table 1: The proposed network architecture. Filter dimensions are given as temporal size, spatial size, and number of channels. Please note that we omit the attention layer for simplicity.

Specifically, we first evaluate the effectiveness of different data modalities by training action classifiers using solely RGB frames, solely residual frames, and a combined input, respectively. We also study the impact of the step size of residual frames on action recognition. Finally, we perform ablation studies to investigate the effectiveness of individual components in the proposed pseudo-3D convolution module.

Performance comparisons for different data modalities. Table 2 shows the action recognition performance of various combinations of input modality and network architecture on UCF101. Note that for experiments using solely RGB or residual frames, we keep only one stream in the data layer (conv1) but double the number of channels for a fair comparison. We first observe that using solely residual frames outperforms using solely RGB frames in both top-1 and top-5 accuracy. This indicates that residual frames indeed contain salient motion information, which is important for action recognition. When leveraging both RGB and residual frames, the top-1 accuracy increases further, which suggests that the two data modalities carry complementary information. Remarkably, we also observe that our pseudo-3D convolution module significantly reduces computation (from 163 to 30 GFLOPs) compared with standard 3D convolution, while still providing better performance.


Method | Input modality | Val top-1 | Val top-5 | GFLOPs
P.3D ResNet50 | RGB | | | 28
P.3D ResNet50 | Residual | | | 28
3D ResNet50 | RGB + Residual | | | 163
P.3D ResNet50 | RGB + Residual | | | 30


Table 2: Performance comparisons for different input modalities.
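To give a rough sense of where such savings come from, the weight count of one full 3D filter bank can be compared with that of parallel 2D spatial and 1D temporal filter banks. This is a back-of-the-envelope sketch: the 3×3×3 kernel size, the 64 channels, and the function names are assumptions, and the per-layer ratio it prints is not meant to reproduce the measured GFLOPs above, which depend on the full architecture.

```python
def conv3d_weights(c_in: int, c_out: int, k_t: int = 3, k_s: int = 3) -> int:
    """Weights in a full k_t x k_s x k_s 3D convolution (bias ignored)."""
    return k_t * k_s * k_s * c_in * c_out

def pseudo3d_weights(c_in: int, c_out: int, k_t: int = 3, k_s: int = 3) -> int:
    """Weights in a parallel 1 x k_s x k_s spatial and k_t x 1 x 1 temporal pair."""
    return (k_s * k_s + k_t) * c_in * c_out

full = conv3d_weights(64, 64)
decoupled = pseudo3d_weights(64, 64)
print(full, decoupled, full / decoupled)  # 110592 49152 2.25
```

For a fixed output size, multiply-accumulate counts scale with the same kernel factors, so under these assumptions the decoupled layer needs 12 kernel taps per input-output channel pair instead of 27.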

Impact of step size. When generating residual frames, we can change the step size to capture motion features at different time scales. However, it is unclear what the optimal step size is for the action recognition task. Therefore, we conducted a study on the impact of the step size and show the results in Table 3. We experimented with three step-size settings, in each of which the input data is solely residual frames. Interestingly, the classification accuracy decreases as the step size increases. We suspect this is because motion causes spatial displacement of the same objects between two frames, which may cause a mismatch in motion representations when a large step size is used.


Step size Top-1 accuracy Top-5 accuracy


Table 3: Performance comparisons for different residual frame step sizes.

Ablation studies. We perform ablation studies to verify the effectiveness of different components in our proposed pseudo-3D convolution module. Without loss of generality, the models are trained with a combined input of RGB and residual frames. As shown in Table 4, removing the self-attention mechanism leads to a drop in top-1 accuracy. The performance also drops when the residual information in the feature space is ignored. If the self-attention mechanism and residual features are removed at the same time, the top-1 accuracy drops further. These results confirm that the self-attention mechanism and residual features are both effective in improving action recognition performance.


Method Top-1 accuracy Top-5 accuracy
P.3D module w.o. self-attention and residual feature
P.3D module w.o. self-attention
P.3D module w.o. residual feature
Pseudo 3D module


Table 4: Performance comparisons for various pseudo-3D module settings.

4 Conclusion

In this paper, we propose a multi-modal framework that exploits both RGB and residual frames for human action recognition. We empirically confirm the benefit of using residual frames (an increase in top-1 accuracy), and our study shows that using a small step size for residual frames leads to better performance. One additional contribution of this paper is the development of a novel and efficient pseudo-3D convolution module that involves residual features and a self-attention mechanism. Quantitative results show that the proposed module can significantly reduce computational cost without compromising performance.