Two-stream Collaborative Learning with Spatial-Temporal Attention for Video Classification

11/09/2017
by   Yuxin Peng, et al.
0

Video classification is highly important with wide applications, such as video search and intelligent surveillance. Video naturally consists of static and motion information, which can be represented by frame and optical flow. Recently, researchers generally adopt the deep networks to capture the static and motion information separately, which mainly has two limitations: (1) Ignoring the coexistence relationship between spatial and temporal attention, while they should be jointly modelled as the spatial and temporal evolutions of video, thus discriminative video features can be extracted.(2) Ignoring the strong complementarity between static and motion information coexisted in video, while they should be collaboratively learned to boost each other. For addressing the above two limitations, this paper proposes the approach of two-stream collaborative learning with spatial-temporal attention (TCLSTA), which consists of two models: (1) Spatial-temporal attention model: The spatial-level attention emphasizes the salient regions in frame, and the temporal-level attention exploits the discriminative frames in video. They are jointly learned and mutually boosted to learn the discriminative static and motion features for better classification performance. (2) Static-motion collaborative model: It not only achieves mutual guidance on static and motion information to boost the feature learning, but also adaptively learns the fusion weights of static and motion streams, so as to exploit the strong complementarity between static and motion information to promote video classification. Experiments on 4 widely-used datasets show that our TCLSTA approach achieves the best performance compared with more than 10 state-of-the-art methods.

READ FULL TEXT

page 2

page 5

page 9

page 14

research
08/22/2019

Multi-Stream Single Shot Spatial-Temporal Action Detection

We present a 3D Convolutional Neural Networks (CNNs) based single shot d...
research
11/19/2018

Multi-scale 3D Convolution Network for Video Based Person Re-Identification

This paper proposes a two-stream convolution network to extract spatial ...
research
12/03/2018

Spatial-temporal Fusion Convolutional Neural Network for Simulated Driving Behavior Recognition

Abnormal driving behaviour is one of the leading cause of terrible traff...
research
06/24/2019

LMVP: Video Predictor with Leaked Motion Information

We propose a Leaked Motion Video Predictor (LMVP) to predict future fram...
research
03/23/2017

Saliency-guided video classification via adaptively weighted learning

Video classification is productive in many practical applications, and t...
research
04/12/2017

Predictive-Corrective Networks for Action Detection

While deep feature learning has revolutionized techniques for static-ima...
research
05/14/2021

Confidence-guided Adaptive Gate and Dual Differential Enhancement for Video Salient Object Detection

Video salient object detection (VSOD) aims to locate and segment the mos...

Please sign up or login with your details

Forgot password? Click here to reset