Temporal Pyramid Pooling Based Convolutional Neural Networks for Action Recognition

03/04/2015
by   Peng Wang, et al.

Encouraged by the success of Convolutional Neural Networks (CNNs) in image classification, much recent effort has gone into applying CNNs to video-based action recognition. One challenge is that a video contains a varying number of frames, which is incompatible with the standard input format of CNNs. Existing methods either sample a fixed number of frames directly or bypass the issue by introducing a 3D convolutional layer that performs convolution in the spatio-temporal domain. To solve this issue, we propose a novel network structure that accepts an arbitrary number of frames as input. The key to our solution is a module consisting of an encoding layer and a temporal pyramid pooling layer. The encoding layer maps the activations from previous layers to a feature vector suitable for pooling, while the temporal pyramid pooling layer converts multiple frame-level activations into a fixed-length video-level representation. In addition, we adopt a feature concatenation layer that combines appearance information and motion information. Compared with the frame sampling strategy, our method avoids the risk of missing important frames. Compared with the 3D convolutional method, which requires a huge video dataset for network training, our model can be learned on a small target dataset because we can leverage an off-the-shelf image-level CNN for parameter initialization. Experiments on two challenging datasets, Hollywood2 and HMDB51, demonstrate that our method achieves superior performance over state-of-the-art methods while requiring much less training data.
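To make the fixed-length claim concrete, the following is a minimal NumPy sketch of temporal pyramid pooling over frame-level features. It is not the authors' implementation: the function name `temporal_pyramid_pool`, the pyramid levels `(1, 2, 4)`, the use of max pooling, and the way segment boundaries are rounded are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np


def temporal_pyramid_pool(frame_feats, levels=(1, 2, 4)):
    """Pool (num_frames, dim) features into one vector of length sum(levels) * dim.

    Assumptions (not from the abstract): max pooling, levels (1, 2, 4),
    and segments that may overlap by one frame at their boundaries.
    """
    frame_feats = np.asarray(frame_feats, dtype=float)
    if frame_feats.ndim != 2 or frame_feats.shape[0] == 0:
        raise ValueError("expected a non-empty (num_frames, dim) array")

    num_frames = frame_feats.shape[0]
    pooled = []
    for n_seg in levels:
        # Split the time axis into n_seg roughly equal segments.
        edges = np.linspace(0, num_frames, n_seg + 1)
        for i in range(n_seg):
            start = min(int(np.floor(edges[i])), num_frames - 1)
            # Guarantee at least one frame per segment, even for very short videos.
            end = max(int(np.ceil(edges[i + 1])), start + 1)
            pooled.append(frame_feats[start:end].max(axis=0))
    return np.concatenate(pooled)


# Two videos of different lengths map to vectors of the same size.
rng = np.random.default_rng(0)
short_video = rng.normal(size=(5, 8))    # 5 frames, 8-dim encoded features
long_video = rng.normal(size=(37, 8))    # 37 frames, same feature dimension
print(temporal_pyramid_pool(short_video).shape)
print(temporal_pyramid_pool(long_video).shape)
```

With these assumed levels, the output length is (1 + 2 + 4) times the feature dimension regardless of how many frames the video has, which is the property that lets a fixed-size classifier sit on top of a variable-length input.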


