Follow the Attention: Combining Partial Pose and Object Motion for Fine-Grained Action Detection

by Mohammad Mahdi Kazemi Moghaddam, et al.

Activity recognition in shopping environments is an important and challenging computer vision task. We introduce a framework that integrates human body pose and object motion to both temporally detect and classify fine-grained activities, that is, activities that are very short and closely resemble one another. We achieve this with a multi-stream recurrent convolutional neural network architecture guided by a spatiotemporal attention mechanism, which serves both activity recognition and detection. Because accurate pose supervision is unavailable, we use generative adversarial networks (GANs) to generate candidate body joints. Additionally, based on the intuition that even humans need more than one source of information to identify complex actions precisely, we integrate a second stream of object motion into our network. This stream acts as prior knowledge, and we show quantitatively that it improves the results. We demonstrate the capabilities of our approach empirically by achieving state-of-the-art results on the MERL Shopping dataset. Finally, we investigate the effectiveness of this approach on a new shopping dataset that we collected to address existing shortcomings in this area, including but not limited to the lack of training data.




Recognizing Fine-Grained and Composite Activities using Hand-Centric Features and Script Data

Activity recognition has shown impressive progress in recent years. Howe...

LSTA: Long Short-Term Attention for Egocentric Action Recognition

Egocentric activity recognition is one of the most challenging tasks in ...

Attention-Driven Body Pose Encoding for Human Activity Recognition

This article proposes a novel attention-based body pose encoding for hum...

Fine-grained Activity Recognition with Holistic and Pose based Features

Holistic methods based on dense trajectories are currently the de facto ...

Adding Knowledge to Unsupervised Algorithms for the Recognition of Intent

Computer vision algorithms performance are near or superior to humans in...

Regional Attention Network (RAN) for Head Pose and Fine-grained Gesture Recognition

Affect is often expressed via non-verbal body language such as actions/g...

Activity Recognition with Moving Cameras and Few Training Examples: Applications for Detection of Autism-Related Headbanging

Activity recognition computer vision algorithms can be used to detect th...