Attention Filtering for Multi-person Spatiotemporal Action Detection on Deep Two-Stream CNN Architectures

07/21/2019
by João Antunes, et al.

Action detection and recognition have received much attention in the computer vision community because of their many applications, including security, robotics, and recommendation systems. Recently, datasets such as AVA have introduced multi-person, multi-label, spatiotemporal action detection and recognition challenges. Two-stream CNN approaches cannot discern which portions of the input to use for classification, which becomes a limitation when the vision task involves several people with several labels. We address this limitation and improve the state-of-the-art performance of two-stream CNNs. In this paper we present four contributions: fovea attention filtering, which highlights targets for classification without discarding background; a generalized binary loss function designed for the AVA dataset; miniAVA, a partition of AVA that maintains temporal continuity and class distribution with only one tenth of the dataset size; and ablation studies on alternative attention filters. Our method, using fovea attention filtering and our generalized binary loss, achieves a relative video mAP improvement of 20% on AVA and is competitive with the state of the art on UCF101-24. We also show a relative video mAP improvement of 12.6% for our generalized binary loss over the standard sum-of-sigmoids.
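
For readers unfamiliar with the baseline named in the abstract, the sketch below shows one common reading of a "sum-of-sigmoids" multi-label loss: an independent sigmoid per action class, with the binary cross-entropy terms summed over classes. This is an assumption about the baseline, not the paper's generalized binary loss, which the abstract does not specify. The function names and the example numbers are hypothetical and chosen only for illustration. A small helper also shows how a relative improvement is computed from two mAP values.

```python
import math


def sum_of_sigmoids_loss(logits, targets):
    """Sum over classes of sigmoid binary cross-entropy (assumed baseline).

    logits  : raw per-class scores for one person box
    targets : 0/1 labels, one per class (several may be 1: multi-label)
    """
    if len(logits) != len(targets):
        raise ValueError("logits and targets must have the same length")
    total = 0.0
    for z, y in zip(logits, targets):
        # Numerically stable form of -[y*log(s(z)) + (1-y)*log(1-s(z))]
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total


def relative_improvement(baseline_map, new_map):
    """Relative improvement in percent: 100 * (new - baseline) / baseline."""
    return 100.0 * (new_map - baseline_map) / baseline_map


# Hypothetical values, not taken from the paper.
example_loss = sum_of_sigmoids_loss([0.0, 0.0], [1, 0])
example_gain = relative_improvement(10.0, 12.0)
```

Because each class has its own sigmoid, one person can carry several active labels at once, which matches the multi-label setting the abstract describes for AVA.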


