Gate-Shift Networks for Video Action Recognition

12/01/2019
by Swathikiran Sudhakaran, et al.

Deep 3D CNNs for video action recognition are designed to learn powerful representations in the joint spatio-temporal feature space. In practice, however, because of the large number of parameters and computations involved, they may underperform when sufficiently large datasets are not available for training them at scale. In this paper we introduce spatial gating in the spatio-temporal decomposition of 3D kernels. We implement this concept with the Gate-Shift Module (GSM). GSM is lightweight and turns a 2D-CNN into a highly efficient spatio-temporal feature extractor. With GSM plugged in, a 2D-CNN learns to adaptively route features through time and combine them, with almost no additional parameters or computational overhead. We perform an extensive evaluation of the proposed module to study its effectiveness in video action recognition, achieving state-of-the-art results on the Something Something-V1 and Diving48 datasets, and obtaining competitive results on EPIC-Kitchens with far less model complexity. With GSM plugged into TSN, on Something Something-V1 we obtain an absolute +32% [...] with less than 1% [...] trained at different temporal scales, we reach beyond 55% [...]
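The gate-then-shift idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes features are stored as a plain list `x[t][c]` (time steps by channels, no spatial dimensions), and it replaces the learned spatial gate with a hypothetical per-channel scalar gate `tanh(w * x + b)`. The function name `gate_shift` and the parameters `gate_w` and `gate_b` are made up for illustration.

```python
import math


def gate_shift(x, gate_w, gate_b=0.0):
    """Toy gate-shift over features x[t][c] (T time steps, C channels).

    Assumption: the gate is a per-channel scalar tanh(w * x + b). The module
    described in the abstract uses a learned spatial gate inside a 2D-CNN.
    """
    T, C = len(x), len(x[0])
    half = C // 2

    # 1) Gate: split every feature into a gated part y and a residual r = x - y.
    y = [[math.tanh(gate_w[c] * x[t][c] + gate_b) * x[t][c] for c in range(C)]
         for t in range(T)]
    r = [[x[t][c] - y[t][c] for c in range(C)] for t in range(T)]

    # 2) Shift: the first half of the channels moves one step forward in time,
    #    the second half one step backward; out-of-range positions are zero.
    # 3) Fuse: add the shifted gated part back to the unshifted residual.
    z = [[0.0] * C for _ in range(T)]
    for t in range(T):
        for c in range(C):
            src = t - 1 if c < half else t + 1
            shifted = y[src][c] if 0 <= src < T else 0.0
            z[t][c] = shifted + r[t][c]
    return z


if __name__ == "__main__":
    feats = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    # Closed gate (tanh(0) = 0): nothing is routed through time.
    print(gate_shift(feats, gate_w=[0.0, 0.0]))
    # Saturated gate (tanh(50) rounds to 1.0): every feature is shifted.
    print(gate_shift(feats, gate_w=[0.0, 0.0], gate_b=50.0))
```

In this sketch a closed gate leaves the features untouched, so the network behaves like the plain 2D-CNN, while a saturated gate turns the operation into a pure temporal shift. Intermediate gate values route only part of each feature through time, which is one way to read "adaptively route features through time and combine them".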
