Weakly-Supervised Action Localization and Action Recognition using Global-Local Attention of 3D CNN

12/17/2020
by   Novanto Yudistira, et al.

A 3D Convolutional Neural Network (3D CNN) captures spatial and temporal information from 3D data such as video sequences. However, the convolution and pooling mechanisms make some information loss seemingly unavoidable. To improve visual explanations and classification in 3D CNNs, we propose two approaches: (i) aggregating layer-wise global-to-local (global-local) discrete gradients using a trained 3DResNext network, and (ii) implementing an attention gating network to improve the accuracy of action recognition. The proposed approach aims to show the usefulness of every layer, termed global-local attention, in a 3D CNN via visual attribution, weakly-supervised action localization, and action recognition. First, the 3DResNext is trained and applied for action classification, and backpropagation is performed with respect to the maximum predicted class. The gradients and activations of every layer are then up-sampled. Next, aggregation is used to produce more nuanced attention, which highlights the parts of the input video most critical to the predicted class. We apply contour thresholding to the final attention to obtain the final localization. We evaluate spatial and temporal action localization in trimmed videos using fine-grained visual explanation via 3DCam. Experimental results show that the proposed approach produces informative visual explanations and discriminative attention. Furthermore, action recognition via attention gating on each layer produces better classification results than the baseline model.
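
The pipeline in the abstract (per-layer gradients and activations, up-sampling, aggregation, thresholding) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the Grad-CAM-style channel weighting, the nearest-neighbour up-sampling, the plain averaging across layers, and the bounding box around thresholded pixels (a simplified stand-in for contour thresholding) are all assumptions made for illustration, and every function name is hypothetical. Activations and gradients are assumed to be already extracted from the network as arrays of shape (channels, frames, height, width).

```python
import numpy as np

def layer_attention(activations, gradients):
    """Grad-CAM-style map for one layer (assumed weighting scheme).

    Both inputs have shape (C, T, H, W); the result has shape (T, H, W).
    """
    # One weight per channel: the mean gradient over time and space.
    weights = gradients.mean(axis=(1, 2, 3), keepdims=True)
    # Weighted sum over channels, then ReLU to keep positive evidence only.
    return np.maximum((weights * activations).sum(axis=0), 0.0)

def upsample_nearest(volume, out_shape):
    """Nearest-neighbour resize of a (T, H, W) volume to out_shape."""
    idx = [np.arange(o) * s // o for o, s in zip(out_shape, volume.shape)]
    return volume[np.ix_(*idx)]

def normalize(volume):
    """Rescale to [0, 1]; a constant volume maps to all zeros."""
    span = volume.max() - volume.min()
    if span == 0:
        return np.zeros_like(volume, dtype=float)
    return (volume - volume.min()) / span

def global_local_attention(layers, out_shape):
    """Average the up-sampled, normalized maps of all layers (assumed aggregation).

    `layers` is a list of (activations, gradients) pairs, one per layer.
    """
    maps = [
        normalize(upsample_nearest(layer_attention(a, g), out_shape))
        for a, g in layers
    ]
    return normalize(np.mean(maps, axis=0))

def localize(attention, threshold=0.5):
    """Per-frame box (x0, y0, x1, y1) around pixels >= threshold, or None."""
    boxes = []
    for frame in attention:
        ys, xs = np.nonzero(frame >= threshold)
        if xs.size == 0:
            boxes.append(None)
        else:
            boxes.append((int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())))
    return boxes
```

In practice, the activations and gradients would be collected from the trained network during a backward pass on the top predicted class score; here they are simply passed in as arrays, so the sketch only covers the aggregation and localization steps.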


