Learning Spatio-Temporal Features with 3D Residual Networks for Action Recognition

08/25/2017
by   Kensho Hara, et al.
0

Convolutional neural networks with spatio-temporal 3D kernels (3D CNNs) have an ability to directly extract spatio-temporal features from videos for action recognition. Although the 3D kernels tend to overfit because of a large number of their parameters, the 3D CNNs are greatly improved by using recent huge video databases. However, the architecture of 3D CNNs is relatively shallow against to the success of very deep neural networks in 2D-based CNNs, such as residual networks (ResNets). In this paper, we propose a 3D CNNs based on ResNets toward a better action representation. We describe the training procedure of our 3D ResNets in details. We experimentally evaluate the 3D ResNets on the ActivityNet and Kinetics datasets. The 3D ResNets trained on the Kinetics did not suffer from overfitting despite the large number of parameters of the model, and achieved better performance than relatively shallow networks, such as C3D. Our code and pretrained models (e.g. Kinetics and ActivityNet) are publicly available at https://github.com/kenshohara/3D-ResNets.

READ FULL TEXT
research
12/05/2021

STSM: Spatio-Temporal Shift Module for Efficient Action Recognition

The modeling, computational cost, and accuracy of traditional Spatio-tem...
research
11/27/2017

Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?

The purpose of this study is to determine whether current video datasets...
research
02/18/2020

V4D:4D Convolutional Neural Networks for Video-level Representation Learning

Most existing 3D CNNs for video representation learning are clip-based m...
research
01/15/2020

Epileptic Seizure Classification with Symmetric and Hybrid Bilinear Models

Epilepsy affects nearly 1 be treated by anti-epileptic drugs and a much ...
research
07/24/2022

MAR: Masked Autoencoders for Efficient Action Recognition

Standard approaches for video recognition usually operate on the full in...
research
10/24/2019

Spatiotemporal Tile-based Attention-guided LSTMs for Traffic Video Prediction

This extended abstract describes our solution for the Traffic4Cast Chall...
research
03/16/2022

Gate-Shift-Fuse for Video Action Recognition

Convolutional Neural Networks are the de facto models for image recognit...

Please sign up or login with your details

Forgot password? Click here to reset