Learning Efficient Video Representation with Video Shuffle Networks

11/26/2019
by   Pingchuan Ma, et al.
25

3D CNN shows its strong ability in learning spatiotemporal representation in recent video recognition tasks. However, inflating 2D convolution to 3D inevitably introduces additional computational costs, making it cumbersome in practical deployment. We consider whether there is a way to equip the conventional 2D convolution with temporal vision no requiring expanding its kernel. To this end, we propose the video shuffle, a parameter-free plug-in component that efficiently reallocates the inputs of 2D convolution so that its receptive field can be extended to the temporal dimension. In practical, video shuffle firstly divides each frame feature into multiple groups and then aggregate the grouped features via temporal shuffle operation. This allows the following 2D convolution aggregate the global spatiotemporal features. The proposed video shuffle can be flexibly inserted into popular 2D CNNs, forming the Video Shuffle Networks (VSN). With a simple yet efficient implementation, VSN performs surprisingly well on temporal modeling benchmarks. In experiments, VSN not only gains non-trivial improvements on Kinetics and Moments in Time, but also achieves state-of-the-art performance on Something-Something-V1, Something-Something-V2 datasets.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
12/01/2020

Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification

Video classification researches that have recently attracted attention a...
research
09/30/2020

Dissected 3D CNNs: Temporal Skip Connections for Efficient Online Video Processing

Convolutional Neural Networks with 3D kernels (3D CNNs) currently achiev...
research
03/11/2022

TFCNet: Temporal Fully Connected Networks for Static Unbiased Temporal Reasoning

Temporal Reasoning is one important functionality for vision intelligenc...
research
11/17/2018

Recurrence to the Rescue: Towards Causal Spatiotemporal Representations

Recently, three dimensional (3D) convolutional neural networks (CNNs) ha...
research
01/19/2020

MixTConv: Mixed Temporal Convolutional Kernels for Efficient Action Recogntion

To efficiently extract spatiotemporal features of video for action recog...
research
12/02/2014

Learning Spatiotemporal Features with 3D Convolutional Networks

We propose a simple, yet effective approach for spatiotemporal feature l...
research
11/23/2021

Modeling Temporal Concept Receptive Field Dynamically for Untrimmed Video Analysis

Event analysis in untrimmed videos has attracted increasing attention du...

Please sign up or login with your details

Forgot password? Click here to reset