MixTConv: Mixed Temporal Convolutional Kernels for Efficient Action Recogntion

01/19/2020
by   Kaiyu Shan, et al.
0

To efficiently extract spatiotemporal features of video for action recognition, most state-of-the-art methods integrate 1D temporal convolution into a conventional 2D CNN backbone. However, they all exploit 1D temporal convolution of fixed kernel size (i.e., 3) in the network building block, thus have suboptimal temporal modeling capability to handle both long-term and short-term actions. To address this problem, we first investigate the impacts of different kernel sizes for the 1D temporal convolutional filters. Then, we propose a simple yet efficient operation called Mixed Temporal Convolution (MixTConv), which consists of multiple depthwise 1D convolutional filters with different kernel sizes. By plugging MixTConv into the conventional 2D CNN backbone ResNet-50, we further propose an efficient and effective network architecture named MSTNet for action recognition, and achieve state-of-the-art results on multiple benchmarks.

READ FULL TEXT
research
08/03/2020

Late Temporal Modeling in 3D CNN Architectures with BERT for Action Recognition

In this work, we combine 3D convolution with late temporal modeling for ...
research
11/22/2017

Temporal 3D ConvNets: New Architecture and Transfer Learning for Video Classification

The work in this paper is driven by the question how to exploit the temp...
research
02/08/2020

CTM: Collaborative Temporal Modeling for Action Recognition

With the rapid development of digital multimedia, video understanding ha...
research
09/30/2020

Dissected 3D CNNs: Temporal Skip Connections for Efficient Online Video Processing

Convolutional Neural Networks with 3D kernels (3D CNNs) currently achiev...
research
08/09/2021

Development of Human Motion Prediction Strategy using Inception Residual Block

Human Motion Prediction is a crucial task in computer vision and robotic...
research
12/09/2019

Temporal Factorization of 3D Convolutional Kernels

3D convolutional neural networks are difficult to train because they are...
research
11/26/2019

Learning Efficient Video Representation with Video Shuffle Networks

3D CNN shows its strong ability in learning spatiotemporal representatio...

Please sign up or login with your details

Forgot password? Click here to reset