Pose-conditioned Spatio-Temporal Attention for Human Action Recognition

03/29/2017
by Fabien Baradel, et al.

We address human action recognition from multi-modal video data involving articulated pose and RGB frames, and propose a two-stream approach. The pose stream is processed with a convolutional model that takes as input a 3D tensor holding data from a sub-sequence. A specific joint ordering, which respects the topology of the human body, ensures that different convolutional layers correspond to meaningful levels of abstraction. The raw RGB stream is handled by a spatio-temporal soft-attention mechanism conditioned on features from the pose network. An LSTM network receives input from a set of image locations at each instant. A trainable glimpse sensor extracts features at a set of predefined locations specified by the pose stream, namely the four hands of the two people involved in the activity. Appearance features give important cues on hand motion and on objects held in each hand. We show that shifting the attention to different hands at different time steps, depending on the activity itself, is beneficial. Finally, a temporal attention mechanism learns how to fuse LSTM features over time. We evaluate the method on three datasets. State-of-the-art results are achieved on the largest dataset for human activity recognition, namely NTU-RGB+D, as well as on the SBU Kinect Interaction dataset. Performance close to the state of the art is achieved on the smaller MSR Daily Activity 3D dataset.
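The two attention steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the linear scoring functions, the names `W_att` and `v_time`, the feature sizes, and the random inputs are all assumptions made for illustration, and the LSTM is omitted, so the per-step attended features stand in for LSTM outputs. The sketch only shows how pose features could produce soft weights over the four hand glimpses at each time step, and how a second softmax could fuse features over time.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(weights, vectors):
    """Convex combination of equally sized feature vectors."""
    dim = len(vectors[0])
    return [sum(w * vec[i] for w, vec in zip(weights, vectors)) for i in range(dim)]


def spatial_attention(hand_features, pose_feature, W_att):
    """Soft attention over hand glimpses, conditioned on pose features.

    Assumption: one score per hand, computed as a linear map of the pose
    feature (row h of W_att scores hand h).
    """
    scores = [dot(row, pose_feature) for row in W_att]
    weights = softmax(scores)
    return weighted_sum(weights, hand_features), weights


def temporal_attention(step_features, v_time):
    """Fuse per-step features with soft weights over time.

    Assumption: each step is scored by a dot product with v_time.
    """
    scores = [dot(v_time, h) for h in step_features]
    weights = softmax(scores)
    return weighted_sum(weights, step_features), weights


# Hypothetical sizes: T time steps, H = 4 hands, D-dim glimpse features,
# P-dim pose features. Random values replace real network outputs.
random.seed(0)
T, H, D, P = 5, 4, 3, 6
rand_vec = lambda n: [random.uniform(-1.0, 1.0) for _ in range(n)]

W_att = [rand_vec(P) for _ in range(H)]
v_time = rand_vec(D)

attended_steps = []
hand_weights_per_step = []
for _ in range(T):
    hands = [rand_vec(D) for _ in range(H)]   # glimpse features of 4 hands
    pose = rand_vec(P)                        # pose-stream feature at this step
    attended, hand_w = spatial_attention(hands, pose, W_att)
    attended_steps.append(attended)           # would feed the LSTM in the paper
    hand_weights_per_step.append(hand_w)

fused, time_weights = temporal_attention(attended_steps, v_time)
```

Because both weightings are softmax outputs, each set of weights is non-negative and sums to one, so the attended and fused features stay convex combinations of their inputs. In a trained model, `W_att` and `v_time` would be learned parameters rather than random values.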


