Right on Time: Multi-Temporal Convolutions for Human Action Recognition in Videos

11/08/2020
by Alexandros Stergiou, et al.

The variations in the temporal performance of human actions observed in videos present challenges for their extraction using fixed-sized convolution kernels in CNNs. We present an approach that is more flexible in terms of processing the input at multiple timescales. We introduce Multi-Temporal networks that model spatio-temporal patterns of different temporal durations at each layer. To this end, they employ novel 3D convolution (MTConv) blocks that consist of a short stream for local space-time features and a long stream for features spanning longer durations. By aligning the features of each stream with respect to the global motion patterns using recurrent cells, we can discover temporally coherent spatio-temporal features with varying durations. We further introduce sub-streams within each of the block pathways to reduce the computational requirements. The proposed MTNet architectures outperform state-of-the-art 3D-CNNs on five action recognition benchmark datasets. Notably, we achieve 87.22 [...] Kinetics-700. We further demonstrate the favorable computational requirements. Using sub-streams, we can further achieve a drastic reduction in parameters (60 [...] generalization capabilities of the multi-temporal features.
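
The short-stream / long-stream idea can be illustrated with a toy example. The sketch below is not the authors' MTConv block: it works on a one-dimensional sequence of per-frame values instead of a 3D video tensor, uses fixed averaging kernels instead of learned weights, and fuses the two streams by simple summation, whereas the abstract describes alignment with recurrent cells. The names `temporal_conv` and `multi_temporal_block` are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def temporal_conv(signal: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Zero-padded, same-length 1D correlation along the time axis."""
    if len(kernel) % 2 == 0:
        raise ValueError("kernel length must be odd for symmetric padding")
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(w * padded[t + i] for i, w in enumerate(kernel))
        for t in range(len(signal))
    ]


def multi_temporal_block(
    signal: Sequence[float],
    short_kernel: Sequence[float],
    long_kernel: Sequence[float],
) -> Tuple[List[float], List[float], List[float]]:
    """Run a short and a long temporal kernel over the same input.

    Assumption: the two responses are fused by element-wise summation.
    This stands in for the recurrent alignment mentioned in the abstract.
    """
    short_out = temporal_conv(signal, short_kernel)
    long_out = temporal_conv(signal, long_kernel)
    fused = [s + l for s, l in zip(short_out, long_out)]
    return short_out, long_out, fused


# A single "event" at frame 2 of an 8-frame sequence.
frames = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
short_kernel = [1.0 / 3.0] * 3  # covers 3 frames
long_kernel = [1.0 / 7.0] * 7   # covers 7 frames

short_out, long_out, fused = multi_temporal_block(frames, short_kernel, long_kernel)
```

In this toy setting the short stream responds only in the immediate neighbourhood of the event, while the long stream still responds several frames away, which is the intuition behind processing the same input at more than one timescale.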

research
09/30/2019

Spatio-Temporal FAST 3D Convolutions for Human Action Recognition

Effective processing of video input is essential for the recognition of ...
research
10/05/2021

Efficient Modelling Across Time of Human Actions and Interactions

This thesis focuses on video understanding for human action and interact...
research
12/14/2018

TAN: Temporal Aggregation Network for Dense Multi-label Action Recognition

We present Temporal Aggregation Network (TAN) which decomposes 3D convol...
research
07/22/2020

Depthwise Spatio-Temporal STFT Convolutional Neural Networks for Human Action Recognition

Conventional 3D convolutional neural networks (CNNs) are computationally...
research
01/11/2022

Representing Videos as Discriminative Sub-graphs for Action Recognition

Human actions are typically of combinatorial structures or patterns, i.e...
research
06/16/2020

Focus of Attention Improves Information Transfer in Visual Features

Unsupervised learning from continuous visual streams is a challenging pr...
research
10/20/2021

GTM: Gray Temporal Model for Video Recognition

Data input modality plays an important role in video action recognition....
