Spatiotemporal Residual Networks for Video Action Recognition

11/07/2016
by   Christoph Feichtenhofer, et al.
0

Two-stream Convolutional Networks (ConvNets) have shown strong performance for human action recognition in videos. Recently, Residual Networks (ResNets) have arisen as a new technique to train extremely deep architectures. In this paper, we introduce spatiotemporal ResNets as a combination of these two approaches. Our novel architecture generalizes ResNets for the spatiotemporal domain by introducing residual connections in two ways. First, we inject residual connections between the appearance and motion pathways of a two-stream architecture to allow spatiotemporal interaction between the two streams. Second, we transform pretrained image ConvNets into spatiotemporal networks by equipping these with learnable convolutional filters that are initialized as temporal residual connections and operate on adjacent feature maps in time. This approach slowly increases the spatiotemporal receptive field as the depth of the model increases and naturally integrates image ConvNet design principles. The whole model is trained end-to-end to allow hierarchical learning of complex spatiotemporal features. We evaluate our novel spatiotemporal ResNet using two widely used action recognition benchmarks where it exceeds the previous state-of-the-art.

READ FULL TEXT
research
08/07/2019

STM: SpatioTemporal and Motion Encoding for Action Recognition

Spatiotemporal and motion features are two complementary and crucial inf...
research
06/27/2017

Recurrent Residual Learning for Action Recognition

Action recognition is a fundamental problem in computer vision with a lo...
research
11/13/2018

Two-stream convolutional networks for end-to-end learning of self-driving cars

We propose a methodology to extend the concept of Two-Stream Convolution...
research
06/06/2022

A Deeper Dive Into What Deep Spatiotemporal Networks Encode: Quantifying Static vs. Dynamic Information

Deep spatiotemporal models are used in a variety of computer vision task...
research
11/24/2017

Appearance-and-Relation Networks for Video Classification

Spatiotemporal feature learning in videos is a fundamental and difficult...
research
03/30/2017

TS-LSTM and Temporal-Inception: Exploiting Spatiotemporal Dynamics for Activity Recognition

Recent two-stream deep Convolutional Neural Networks (ConvNets) have mad...
research
07/11/2019

Two-stream Spatiotemporal Feature for Video QA Task

Understanding the content of videos is one of the core techniques for de...

Please sign up or login with your details

Forgot password? Click here to reset