Part-level Action Parsing via a Pose-guided Coarse-to-Fine Framework

by   Xiaodong Chen, et al.

Action recognition from videos, i.e., classifying a video into one of the pre-defined action types, has been a popular topic in the communities of artificial intelligence, multimedia, and signal processing. However, existing methods usually consider an input video as a whole and learn models, e.g., Convolutional Neural Networks (CNNs), with coarse video-level class labels. These methods can only output an action class for the video, but cannot provide fine-grained and explainable cues to answer why the video shows a specific action. Therefore, researchers start to focus on a new task, Part-level Action Parsing (PAP), which aims to not only predict the video-level action but also recognize the frame-level fine-grained actions or interactions of body parts for each person in the video. To this end, we propose a coarse-to-fine framework for this challenging task. In particular, our framework first predicts the video-level class of the input video, then localizes the body parts and predicts the part-level action. Moreover, to balance the accuracy and computation in part-level action parsing, we propose to recognize the part-level actions by segment-level features. Furthermore, to overcome the ambiguity of body parts, we propose a pose-guided positional embedding method to accurately localize body parts. Through comprehensive experiments on a large-scale dataset, i.e., Kinetics-TPS, our framework achieves state-of-the-art performance and outperforms existing methods over a 31.10 score.


page 1

page 3


Technical Report: Disentangled Action Parsing Networks for Accurate Part-level Action Parsing

Part-level Action Parsing aims at part state parsing for boosting action...

DAP3D-Net: Where, What and How Actions Occur in Videos?

Action parsing in videos with complex scenes is an interesting but chall...

Spatial Transformer Network with Transfer Learning for Small-scale Fine-grained Skeleton-based Tai Chi Action Recognition

Human action recognition is a quite hugely investigated area where most ...

Hierarchical Modeling for Task Recognition and Action Segmentation in Weakly-Labeled Instructional Videos

This paper focuses on task recognition and action segmentation in weakly...

Deep Sequence Learning for Video Anticipation: From Discrete and Deterministic to Continuous and Stochastic

Video anticipation is the task of predicting one/multiple future represe...

Focusing on Persons: Colorizing Old Images Learning from Modern Historical Movies

In industry, there exist plenty of scenarios where old gray photos need ...

An Invariant Model of the Significance of Different Body Parts in Recognizing Different Actions

In this paper, we show that different body parts do not play equally imp...

Please sign up or login with your details

Forgot password? Click here to reset