Appearance-and-Relation Networks for Video Classification

11/24/2017
by Limin Wang, et al.

Spatiotemporal feature learning in videos is a fundamental and difficult problem in computer vision. This paper presents a new architecture, termed the Appearance-and-Relation Network (ARTNet), which learns video representations in an end-to-end manner. ARTNets are constructed by stacking multiple generic building blocks, called SMART blocks, whose goal is to simultaneously model appearance and relation from RGB input in a separate and explicit manner. Specifically, a SMART block decouples the spatiotemporal learning module into an appearance branch for spatial modeling and a relation branch for temporal modeling. The appearance branch is implemented as a linear combination of pixels or filter responses within each frame, while the relation branch is built on multiplicative interactions between pixels or filter responses across multiple frames. We perform experiments on three action recognition benchmarks: Kinetics, UCF101, and HMDB51, demonstrating that SMART blocks yield a clear improvement over 3D convolutions for spatiotemporal feature learning. Under the same training setting, ARTNets outperform existing state-of-the-art methods on all three datasets.
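To make the two-branch idea concrete, here is a minimal NumPy sketch of the decoupling the abstract describes: the appearance branch as a linear combination of pixels within each frame (a plain per-frame 2-D convolution), and the relation branch as multiplicative interactions between corresponding pixels of consecutive frames. This is an illustrative toy, not the paper's actual implementation (ARTNet realizes these branches with learned convolutional filters and cross-channel pooling inside a deep network); all function names and shapes below are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def appearance_branch(clip, w_spatial):
    """Spatial modeling: a linear combination of pixel values within
    each frame, i.e. a valid 2-D convolution applied frame by frame.

    clip: (T, H, W) grayscale video, w_spatial: (k, k) filter.
    """
    T, H, W = clip.shape
    k = w_spatial.shape[0]
    out = np.zeros((T, H - k + 1, W - k + 1))
    for t in range(T):
        for i in range(H - k + 1):
            for j in range(W - k + 1):
                out[t, i, j] = np.sum(clip[t, i:i + k, j:j + k] * w_spatial)
    return out

def relation_branch(clip):
    """Temporal modeling: multiplicative interactions between
    corresponding pixels of consecutive frames (T-1 output frames)."""
    return clip[1:] * clip[:-1]

clip = rng.standard_normal((4, 8, 8))   # toy 4-frame, 8x8 clip
w = rng.standard_normal((3, 3))         # toy 3x3 spatial filter

app = appearance_branch(clip, w)        # shape (4, 6, 6)
rel = relation_branch(clip)             # shape (3, 8, 8)
print(app.shape, rel.shape)
```

In the full SMART block these two feature streams are computed with learned filters, passed through nonlinearities, and concatenated, so the network sees appearance and frame-to-frame relation information explicitly rather than entangled in a single 3D convolution.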



Related research

11/25/2018
Temporal Bilinear Networks for Video Action Recognition
Temporal modeling in videos is a fundamental yet challenging problem in ...

11/19/2018
High Order Neural Networks for Video Classification
Capturing spatiotemporal correlations is an essential topic in video cla...

04/04/2019
Spatiotemporal CNN for Video Object Segmentation
In this paper, we present a unified, end-to-end trainable spatiotemporal...

11/07/2016
Spatiotemporal Residual Networks for Video Action Recognition
Two-stream Convolutional Networks (ConvNets) have shown strong performan...

08/07/2019
STM: SpatioTemporal and Motion Encoding for Action Recognition
Spatiotemporal and motion features are two complementary and crucial inf...

10/08/2019
Graph-based Spatial-temporal Feature Learning for Neuromorphic Vision Sensing
Neuromorphic vision sensing (NVS) allows for significantly higher event ...

12/28/2020
DeepSurfels: Learning Online Appearance Fusion
We present DeepSurfels, a novel hybrid scene representation for geometry...
