StNet: Local and Global Spatial-Temporal Modeling for Action Recognition

11/05/2018
by   Dongliang He, et al.
0

Despite the success of deep learning for static image understanding, it remains unclear what are the most effective network architectures for the spatial-temporal modeling in videos. In this paper, in contrast to the existing CNN+RNN or pure 3D convolution based approaches, we explore a novel spatial temporal network (StNet) architecture for both local and global spatial-temporal modeling in videos. Particularly, StNet stacks N successive video frames into a super-image which has 3N channels and applies 2D convolution on super-images to capture local spatial-temporal relationship. To model global spatial-temporal relationship, we apply temporal convolution on the local spatial-temporal feature maps. Specifically, a novel temporal Xception block is proposed in StNet. It employs a separate channel-wise and temporal-wise convolution over the feature sequence of video. Extensive experiments on the Kinetics dataset demonstrate that our framework outperforms several state-of-the-art approaches in action recognition and can strike a satisfying trade-off between recognition accuracy and model complexity. We further demonstrate the generalization performance of the leaned video representations on the UCF101 dataset.

READ FULL TEXT

page 1

page 3

page 7

research
03/18/2020

STH: Spatio-Temporal Hybrid Convolution for Efficient Action Recognition

Effective and Efficient spatio-temporal modeling is essential for action...
research
05/19/2018

DenseImage Network: Video Spatial-Temporal Evolution Encoding and Understanding

Many of the leading approaches for video understanding are data-hungry a...
research
12/17/2020

Weakly-Supervised Action Localization and Action Recognition using Global-Local Attention of 3D CNN

3D Convolutional Neural Network (3D CNN) captures spatial and temporal i...
research
02/27/2015

Describing Videos by Exploiting Temporal Structure

Recent progress in using recurrent neural networks (RNNs) for image desc...
research
08/29/2019

DWnet: Deep-Wide Network for 3D Action Recognition

We propose in this paper a deep-wide network (DWnet) which combines the ...
research
06/09/2022

Spatial-temporal Concept based Explanation of 3D ConvNets

Recent studies have achieved outstanding success in explaining 2D image ...
research
06/10/2019

UniDual: A Unified Model for Image and Video Understanding

Although a video is effectively a sequence of images, visual perception ...

Please sign up or login with your details

Forgot password? Click here to reset