Log In Sign Up

Right on Time: Multi-Temporal Convolutions for Human Action Recognition in Videos

by   Alexandros Stergiou, et al.

The variations in the temporal performance of human actions observed in videos present challenges for their extraction using fixed-sized convolution kernels in CNNs. We present an approach that is more flexible in terms of processing the input at multiple timescales. We introduce Multi-Temporal networks that model spatio-temporal patterns of different temporal durations at each layer. To this end, they employ novel 3D convolution (MTConv) blocks that consist of a short stream for local space-time features and a long stream for features spanning across longer times. By aligning features of each stream with respect to the global motion patterns using recurrent cells, we can discover temporally coherent spatio-temporal features with varying durations. We further introduce sub-streams within each of the block pathways to reduce the computation requirements. The proposed MTNet architectures outperform state-of-the-art 3D-CNNs on five action recognition benchmark datasets. Notably, we achieve at 87.22 Kinectics-700. We further demonstrate the favorable computational requirements. Using sub-streams, we can further achieve a drastic reduction in parameters ( 60 generalization capabilities of the multi-temporal features


page 1

page 2

page 3


Spatio-Temporal FAST 3D Convolutions for Human Action Recognition

Effective processing of video input is essential for the recognition of ...

Efficient Modelling Across Time of Human Actions and Interactions

This thesis focuses on video understanding for human action and interact...

Focus of Attention Improves Information Transfer in Visual Features

Unsupervised learning from continuous visual streams is a challenging pr...

Depthwise Spatio-Temporal STFT Convolutional Neural Networks for Human Action Recognition

Conventional 3D convolutional neural networks (CNNs) are computationally...

Representing Videos as Discriminative Sub-graphs for Action Recognition

Human actions are typically of combinatorial structures or patterns, i.e...

GTM: Gray Temporal Model for Video Recognition

Data input modality plays an important role in video action recognition....

Deep Spatio-temporal Manifold Network for Action Recognition

Visual data such as videos are often sampled from complex manifold. We p...

Code Repositories


Implementation of Squeeze and Recursion Temporal Gates blocks for action recognition

view repo