Knowing What, Where and When to Look: Efficient Video Action Modeling with Attention

04/02/2020
by Juan-Manuel Pérez-Rúa, et al.

Attentive video modeling is essential for action recognition in unconstrained videos due to their rich yet redundant information over space and time. However, introducing attention in a deep neural network for action recognition is challenging for two reasons. First, an effective attention module needs to learn what (objects and their local motion patterns), where (spatially), and when (temporally) to focus on. Second, a video attention module must be efficient because existing action recognition models already suffer from high computational cost. To address both challenges, a novel What-Where-When (W3) video attention module is proposed. Departing from existing alternatives, our W3 module models all three facets of video attention jointly. Crucially, it is extremely efficient because it factorizes the high-dimensional video feature data into low-dimensional, meaningful spaces (a 1D channel vector for 'what' and 2D spatial tensors for 'where'), followed by lightweight temporal attention reasoning. Extensive experiments show that our attention model brings significant improvements to existing action recognition models, achieving new state-of-the-art performance on a number of benchmarks.
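The factorization idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' W3 module: the function name `factorized_attention`, the projection `w_channel`, the sigmoid gates, and the softmax-over-frames step used as a stand-in for "temporal attention reasoning" are all assumptions chosen for illustration. It only shows how a (T, C, H, W) feature tensor can be summarized by a 1D channel vector per frame ('what') and a 2D spatial map per frame ('where') before any temporal weighting ('when').

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def factorized_attention(feats, w_channel):
    """Hypothetical what/where/when gating (illustrative only).

    feats:     video features of shape (T, C, H, W)
    w_channel: assumed (C, C) projection standing in for learned weights
    """
    # 'what': average over space, leaving one C-dim vector per frame.
    what_desc = feats.mean(axis=(2, 3))               # (T, C)
    what_gate = sigmoid(what_desc @ w_channel)        # (T, C)
    x = feats * what_gate[:, :, None, None]

    # 'where': average over channels, leaving one H x W map per frame.
    where_desc = x.mean(axis=1)                       # (T, H, W)
    where_gate = sigmoid(where_desc)                  # (T, H, W)
    x = x * where_gate[:, None, :, :]

    # 'when': one scalar score per frame, normalized over time.
    # A softmax is an assumed placeholder for temporal reasoning.
    when_score = x.mean(axis=(1, 2, 3))               # (T,)
    when_weight = softmax(when_score, axis=0)         # (T,)
    out = x * when_weight[:, None, None, None]
    return out, what_gate, where_gate, when_weight

rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 8, 5, 5))             # T=4, C=8, H=W=5
out, what_gate, where_gate, when_weight = factorized_attention(feats, np.eye(8))
```

With these toy sizes the two descriptors hold 4 × 8 + 4 × 5 × 5 = 132 values, compared with 4 × 8 × 5 × 5 = 800 in the full feature tensor, which is the kind of saving the low-dimensional factorization is meant to provide.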

research
11/21/2019

TEINet: Towards an Efficient Architecture for Video Recognition

Efficiency is an important issue in designing video architectures for ac...
research
11/04/2017

Attentional Pooling for Action Recognition

We introduce a simple yet surprisingly powerful model to incorporate att...
research
10/01/2018

Where and When to Look? Spatio-temporal Attention for Action Recognition in Videos

Inspired by the observation that humans are able to process videos effic...
research
06/06/2021

Transformed ROIs for Capturing Visual Transformations in Videos

Modeling the visual changes that an action brings to a scene is critical...
research
11/18/2021

M2A: Motion Aware Attention for Accurate Video Action Recognition

Advancements in attention mechanisms have led to significant performance...
research
12/14/2021

Temporal Transformer Networks with Self-Supervision for Action Recognition

In recent years, 2D Convolutional Networks-based video action recognitio...
research
09/18/2023

Selective Volume Mixup for Video Action Recognition

The recent advances in Convolutional Neural Networks (CNNs) and Vision T...
