Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification

04/07/2015
by   Zuxuan Wu, et al.
0

Classifying videos according to content semantics is an important problem with a wide range of applications. In this paper, we propose a hybrid deep learning framework for video classification, which is able to model static spatial information, short-term motion, as well as long-term temporal clues in the videos. Specifically, the spatial and the short-term motion features are extracted separately by two Convolutional Neural Networks (CNN). These two types of CNN-based features are then combined in a regularized feature fusion network for classification, which is able to learn and utilize feature relationships for improved performance. In addition, Long Short Term Memory (LSTM) networks are applied on top of the two features to further model longer-term temporal clues. The main contribution of this work is the hybrid learning framework that can model several important aspects of the video data. We also show that (1) combining the spatial and the short-term motion features in the regularized fusion network is better than direct classification and fusion using the CNN with a softmax layer, and (2) the sequence-based LSTM is highly complementary to the traditional classification strategy without considering the temporal frame orders. Extensive experiments are conducted on two popular and challenging benchmarks, the UCF-101 Human Actions and the Columbia Consumer Videos (CCV). On both benchmarks, our framework achieves to-date the best reported performance: 91.3% on the UCF-101 and 83.5% on the CCV.

READ FULL TEXT

page 2

page 9

research
06/14/2017

Modeling Multimodal Clues in a Hybrid Deep Learning Framework for Video Classification

Videos are inherently multimodal. This paper studies the problem of how ...
research
12/20/2017

Combining Static and Dynamic Features for Multivariate Sequence Classification

Model precision in a classification task is highly dependent on the feat...
research
08/02/2016

Modeling Spatial and Temporal Cues for Multi-label Facial Action Unit Detection

Facial action units (AUs) are essential to decode human facial expressio...
research
05/02/2020

Video2Shop: Exact Matching Clothes in Videos to Online Shopping Images

In recent years, both online retail and video hosting service have been ...
research
04/14/2018

Video2Shop: Exactly Matching Clothes in Videos to Online Shopping Images

In recent years, both online retail and video hosting service are expone...
research
01/11/2021

A deep learning modeling framework to capture mixing patterns in reactive-transport systems

Prediction and control of chemical mixing are vital for many scientific ...
research
04/08/2015

Evaluating Two-Stream CNN for Video Classification

Videos contain very rich semantic information. Traditional hand-crafted ...

Please sign up or login with your details

Forgot password? Click here to reset