Top-down Attention Recurrent VLAD Encoding for Action Recognition in Videos

08/29/2018
by Swathikiran Sudhakaran, et al.

Most recent approaches to action recognition in video leverage deep architectures that encode a video clip into a fixed-length representation vector, which is then used for classification. For this to be successful, the network must be capable of suppressing the irrelevant scene background and extracting the representation from the most discriminative part of the video. Our contribution builds on the observation that the spatio-temporal patterns characterizing actions in videos are highly correlated with objects and their locations in the video. We propose Top-down Attention Action VLAD (TA-VLAD), a deep recurrent architecture with built-in spatial attention that performs temporally aggregated VLAD encoding for action recognition in videos. We adopt a top-down approach to attention: class-specific activation maps obtained from a deep CNN pre-trained for image classification are used to weight appearance features before encoding them into a fixed-length video descriptor with Gated Recurrent Units. Our method achieves state-of-the-art recognition accuracy on the HMDB51 and UCF101 benchmarks.
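The pipeline described above combines three ingredients: class activation maps (CAMs) from a pre-trained classification CNN act as top-down spatial attention over per-frame appearance features, the attended features are encoded per frame in VLAD fashion, and a GRU aggregates the per-frame encodings over time into a single fixed-length video descriptor. The PyTorch sketch below illustrates how these pieces fit together; the module names, layer sizes, and the NetVLAD-style soft assignment are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TAVLADSketch(nn.Module):
    """Hypothetical sketch: CAM-based top-down attention over appearance
    features, NetVLAD-style soft assignment, and GRU temporal aggregation."""

    def __init__(self, feat_dim=512, num_clusters=32, hidden=512, num_classes=51):
        super().__init__()
        # 1x1 conv produces soft assignments of spatial features to clusters.
        self.assign = nn.Conv2d(feat_dim, num_clusters, kernel_size=1)
        self.centroids = nn.Parameter(torch.randn(num_clusters, feat_dim))
        # GRU aggregates per-frame VLAD encodings into one fixed-length vector.
        self.gru = nn.GRU(num_clusters * feat_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, feats, cams):
        # feats: (B, T, C, H, W) appearance features from a pre-trained CNN
        # cams:  (B, T, H, W) class activation maps used as top-down attention
        B, T, C, H, W = feats.shape
        attn = F.softmax(cams.view(B, T, -1), dim=-1).view(B, T, 1, H, W)
        feats = feats * attn  # attention-weighted appearance features

        frame_codes = []
        for t in range(T):
            x = feats[:, t]                        # (B, C, H, W)
            a = F.softmax(self.assign(x), dim=1)   # (B, K, H, W)
            x, a = x.flatten(2), a.flatten(2)      # (B, C, HW), (B, K, HW)
            # VLAD residuals: sum_i a_k(i) * (x_i - c_k) for each cluster k.
            vlad = a @ x.transpose(1, 2) - a.sum(-1, keepdim=True) * self.centroids
            frame_codes.append(F.normalize(vlad.flatten(1), dim=-1))

        _, h = self.gru(torch.stack(frame_codes, dim=1))  # temporal aggregation
        return self.classifier(h[-1])                     # class scores


# Example with random tensors standing in for backbone features and CAMs.
model = TAVLADSketch()
feats = torch.randn(2, 8, 512, 7, 7)   # batch of 2 clips, 8 frames each
cams = torch.randn(2, 8, 7, 7)
print(model(feats, cams).shape)        # torch.Size([2, 51])
```

In practice, `feats` would come from the last convolutional layer of the backbone and `cams` from its class-specific activations, as in standard CAM computation; the random tensors here only demonstrate the expected shapes.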

