Action recognition, which can be treated as a high-level video classification problem, is a hot topic in video understanding and computer vision. With the development of rich representations based on neural networks [8, 15, 22], significant progress has been made on this task. Although these existing action recognition methods [26, 17, 2, 16] can predict the high-level human action of a whole video, they neglect the detailed, middle-level understanding of human actions.
To fill this gap, researchers recently introduced the Part-level Action Parsing (PAP) task, which aims to recognize the frame-level actions of all body parts and of the whole body from a video in the wild. Concretely, given a human action video, a system needs to predict the human location, body part locations, and part states/actions in each frame, and then integrate these results to predict the video-level human action. For instance, for the fitness-center video shown in Fig. 1, we not only need to predict its video-level label as “clean_and_jerk”, but also need to detect each body part, such as “right_arm” and “right_hand”, in each frame and predict their part-level action labels, such as “carry” and “hold”.
By decomposing an action into a human part graph, the PAP task advances the area of human action understanding with a shift from the traditional action recognition task to the deeper understanding task of part-level action parsing. Moreover, it has many potential applications, such as intelligent manufacturing, sports analysis, and fitness instruction. Therefore, this paper concentrates on the part-level action parsing task, which is valuable yet overlooked by the multimedia and computer vision communities.
However, part-level action parsing is a non-trivial task that faces several challenges, as shown in Fig. 1. First of all, accurately predicting the spatial positions of the human body and body parts is difficult. Different from traditional object detection, body part detection needs to overcome the ambiguity of body parts and capture the structural prior of the human body. Moreover, even if there are only two or three people in each frame of the video, the number of part-level actions to predict is very large due to the fine-grained division of body parts, and representing the relationships between these parts is even more strenuous. Furthermore, the trade-off between computational cost and prediction accuracy must also be considered, since frame-level and part-level actions are densely predicted.
Action recognition has been studied for several years. From data-driven representations learned by deep Convolutional Neural Networks (CNNs) [22, 24, 11] to Transformer-based networks with large numbers of parameters, the accuracy of action recognition has been significantly improved. However, traditional action recognition methods often consider the whole video or clip as the smallest unit. Despite their excellent performance on video-level action recognition, these methods do not work well for frame-level part actions. In addition, although some researchers have studied frame-level human action localization, they only focus on whole-body actions and ignore the fine-grained part-level actions. Due to the small size of body parts, the traditional techniques used in frame-level human action localization, e.g., RoIAlign, bring little performance improvement on the PAP task.
To this end, we propose a Pose-guided Coarse-to-Fine framework, named PCF, for part-level action parsing. We first adopt an existing action recognition method, e.g., CSN, to predict the coarse action of the whole video, since it is the State-of-The-Art (SoTA) CNN-based model for action recognition. After that, we predict the fine-grained segment-level body part action instead of the frame-level action, based on the persistence of human actions, which greatly improves computational efficiency with little loss of precision. Moreover, due to the ambiguity of body parts, e.g., the similar appearance of the left leg and the right leg, existing object detectors are often unable to predict body parts effectively. To solve this problem, we propose a pose-guided positional embedding method which guides the detector to predict part locations with human pose keypoints. By encoding each human keypoint as a differently colored dot on the original image, the feature representations of different parts are more easily distinguished by the detector, which effectively reduces body part ambiguity.
In summary, the contributions of this paper include: 1) we make one of the first attempts at part-level action parsing, a valuable yet unexplored task; 2) we design a PCF framework that exploits the potential of existing object detectors with pose-guided positional embedding and predicts both the coarse video-level action and the fine-grained body part actions; 3) our method achieves SoTA results on the Kinetics-TPS dataset, which shows its effectiveness.
II The Proposed Framework
II-A Overall Framework
Figure 2 shows the overall structure of the Pose-guided Coarse-to-Fine (PCF) framework for the PAP task. It includes three stages, i.e., instance and part detection, video-level action recognition, and part action parsing. In the first stage, as shown in the upper part of Figure 2, we adopt YOLOF as the backbone of the person detector and part detector to locate each person and their body parts. To overcome the ambiguity of body parts, we insert a pose estimator and a positional embedding module between the person detector and the part detector to improve the accuracy of part localization. The second stage is shown in the lower part of Figure 2. Based on the short-term persistence of human actions, we exploit segment-level action prediction to approximate the frame-level action state, balancing accuracy and computation cost. In particular, we divide the original video into multiple segments of roughly three seconds each. We then tag each segment with six segment-level pseudo action labels derived from the original frame-level part action labels, which significantly reduces the computation cost of frame-level action parsing. After that, we train models for segment-level action and video-level action respectively. In the final stage, we integrate the outputs of all previous stages to obtain the final output for the PAP task. Next, we introduce each stage of our framework in detail.
II-B Pose-guided Part Detection
To our knowledge, body part detection is a task unprecedented in traditional object detection. Different from general object detection, body part detection needs to overcome the ambiguity of body parts and model the structural prior of the human body. For example, a person normally has only one left foot and one right foot, but the local features of the two are often very similar. Fortunately, this structural prior is very common in human pose estimation and has been widely explored. To maximize the usage of this structural prior, we propose a pose-guided part detection method, which is shown in the top half of Figure 2.
In detail, the person detector D_person first extracts the bounding box b of a person from the input frame I by

b = D_person(I).

Then the pose estimator E_pose takes the cropped person image I_b as input and outputs the keypoints of the person, which is formulated by

K = {(x_i, y_i)}_{i=1}^{N} = E_pose(I_b),

where (x_i, y_i) is the coordinates of keypoint i and N is the number of keypoints. After that, the keypoints K are integrated by the positional embedding module PE, which draws dots of different colors and radii on the original I_b to generate an augmented person image I'_b. This process can be formulated by

I'_b = PE(I_b, K).

By this means, we can increase the appearance difference between different body parts and facilitate the learning of the body part detector D_part. Finally, the part detector is applied to localize each body part by

B_part = D_part(I'_b).
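The positional embedding step above can be sketched as follows. This is a minimal illustration with plain Python lists standing in for an image; the palette, function name, and dot-drawing details are assumptions, not the paper's actual implementation.

```python
# One distinct RGB color per keypoint index, so that, e.g., a left-ankle dot
# never shares a color with a right-ankle dot (illustrative palette).
PALETTE = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 0)]

def embed_keypoints(image, keypoints, radius=1):
    """Return a copy of `image` (an H x W grid of RGB tuples) with a colored
    dot of the given radius drawn at each keypoint (x, y)."""
    out = [row[:] for row in image]
    h, w = len(out), len(out[0])
    for i, (kx, ky) in enumerate(keypoints):
        color = PALETTE[i % len(PALETTE)]
        # Paint every pixel within `radius` of the keypoint, clipped to bounds.
        for y in range(max(0, ky - radius), min(h, ky + radius + 1)):
            for x in range(max(0, kx - radius), min(w, kx + radius + 1)):
                if (x - kx) ** 2 + (y - ky) ** 2 <= radius ** 2:
                    out[y][x] = color
    return out
```

The augmented image, not the raw crop, is then fed to the part detector, so left/right parts that look alike locally carry distinguishable color cues.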
In addition, we also fine-tune the detected person box with the results of the pose estimator. In a nutshell, the pose estimator can predict human keypoints that fall outside the detected person box, so we expand the box until all predicted keypoints are included.
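This box fine-tuning step can be sketched as below; the function name and the (x1, y1, x2, y2) box format are assumptions for illustration.

```python
def refine_person_box(box, keypoints):
    """Expand a person box (x1, y1, x2, y2) until every predicted
    keypoint (x, y) falls inside it."""
    x1, y1, x2, y2 = box
    for x, y in keypoints:
        x1, y1 = min(x1, x), min(y1, y)
        x2, y2 = max(x2, x), max(y2, y)
    return (x1, y1, x2, y2)
```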
II-C Part State Parsing and Action Recognition
Fine-grained frame-level part state parsing requires more computation and hardware cost than coarse video-level action recognition. However, we find this frame-level action parsing problem can be transformed into a simpler segment-level action recognition task due to the overwhelming “Long Tail Effect” caused by the short-term persistence of actions in video segments. For example, in a “hurling sport” segment lasting about three seconds, we just need to predict “None” for the head in every frame to easily achieve 97.7% frame-level part state accuracy. To take advantage of this significant “Long Tail Effect”, as shown in the bottom half of Figure 2, we tag each video segment with six part-level pseudo labels based on the original frame-level action state labels. Each fine-grained part-level label consists of three parts: the coarse video-level action, the body part, and the most frequent frame-level action of that body part. As the example in Figure 2 shows, “(hurling_sport) head: none” means the video-level action of this video is “hurling_sport”, and the most frequent frame-level action of the “head” in this segment is “none”. Through this transformation, we can directly apply individual action recognition networks without sharing parameters, such as CSN. By this means, we can predict each fine-grained segment-level label and the coarse video-level label respectively, without any additional models for computation-intensive frame-level action prediction.
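The pseudo-label construction described above can be sketched as follows; the function name and label string format are modeled on the “(hurling_sport) head: none” example, but the exact implementation is an assumption.

```python
from collections import Counter

def segment_pseudo_label(video_action, part, frame_actions):
    """Tag one body part in one video segment with its most frequent
    frame-level action, yielding a label such as "(hurling_sport) head: none".

    `frame_actions` is the list of per-frame action states for that part
    within the segment.
    """
    most_frequent = Counter(frame_actions).most_common(1)[0][0]
    return f"({video_action}) {part}: {most_frequent}"
```

One such label is produced per body part per segment, so a segment with six annotated parts yields six segment-level recognition targets instead of per-frame ones.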
III-A Experimental Setting
TABLE I: Results on the Kinetics-TPS dataset.

| Methods | Input | Backbone | Video Acc (%) | ROC Score (%) |
| baseline | RGB | TSN-Res50 | - | 29.79 |
| PCF (TSN_RGB) | RGB | TSN-Res50 | 74.03 | 49.23 (+19.44) |
| PCF (TSN_Flow) | Flow | TSN-Res50 | 83.48 | 54.33 (+5.10) |
| PCF (Ours) | RGB | ip-CSN-152 | 96.46 | 60.89 (+6.56) |
TABLE II: Effect of the pose estimator on detection (Kinetics-TPS validation set).

| Detector | Pose | AP (%) | AP@50 (%) |
| Person | ✓ | 74.80 (+0.20) | 93.80 (+0.40) |
| Part | ✓ | 57.10 (+20.7) | 79.70 (+26.6) |
Dataset. The experiments are performed on the Kinetics-TPS dataset, which provides 7.9 M annotations of 10 body parts, 7.9 M part states (i.e., how a body part moves) covering 74 part actions, and 0.5 M interactive objects covering 75 object classes, in the video frames of 24 human action classes. Kinetics-TPS contains 3,809 training videos (4.96 GB in size) and 932 test videos (1.26 GB in size). It is worth noting that the source videos of Kinetics-TPS come from Kinetics-700; hence, all Kinetics-pretrained models are forbidden in the PAP task.
We adopt the official evaluation metric, i.e., ROC score, of the Kinetics-TPS dataset. ROC scores are calculated based on the Part State Correctness (PSC) and the action recognition conditioned on PSC. The PSC calculates the accuracy of the whole human detection results and body part action parsing in each frame. The action recognition conditioned on PSC draws the ROC curve and calculates the ROC score according to the top-1 video-level action recognition accuracy and PSC accuracy. Please refer to  for more details.
Details of Detectors. The YOLOF detection models are pre-trained on the COCO dataset and then fine-tuned on Kinetics-TPS. The final models obtain 93.8 AP@50 on the person category and 79.7 AP@50 on the 10 body part categories on the Kinetics-TPS validation set. For the pose estimator, we directly adopt HRNet-w48 pre-trained on COCO to extract the keypoints of each person, without any fine-tuning.
Details of Action Parsing and Action Recognition Networks. We use the CSN network as the backbone of our action recognition and action parsing framework, specifically the ip-CSN-152 implementation pre-trained on the IG-65M dataset with input sampling. In particular, we freeze the Batch Normalization (BN) layers in the backbone during fine-tuning on Kinetics-TPS.
Details of Training.
The detection model and action recognition models are trained separately, each in an end-to-end manner. In detail, we train the YOLOF detector using SGD with a mini-batch size of six on four V100 GPUs for 24 epochs with a base learning rate of 0.01, which is decreased by a factor of 10 at epochs 16 and 22; we perform linear warm-up during the first 1800 iterations. For the CSN model, we train using SGD with a mini-batch size of four on four V100 GPUs for 58 epochs with a base learning rate of 8e-5, which is decreased by a factor of 10 at epochs 32 and 48; we perform linear warm-up during the first 16 iterations. By default, we use a weight decay of 1e-4 and Nesterov momentum of 0.9 for all models.
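The step schedule with linear warm-up described above can be sketched as below. This is a hedged illustration: the defaults follow the YOLOF settings (the CSN model uses the same shape with its own hyper-parameters), and `iters_per_epoch` is a hypothetical parameter that depends on dataset size and batch size.

```python
def lr_at(iteration, iters_per_epoch, base_lr=0.01,
          warmup_iters=1800, decay_epochs=(16, 22), gamma=0.1):
    """Learning rate at a given iteration: linear warm-up over the first
    `warmup_iters` iterations, then a 10x drop at each decay epoch."""
    if iteration < warmup_iters:
        # Linear warm-up from ~0 up to base_lr.
        return base_lr * (iteration + 1) / warmup_iters
    epoch = iteration // iters_per_epoch
    # Multiply by gamma once for every decay epoch already passed.
    drops = sum(1 for e in decay_epochs if epoch >= e)
    return base_lr * gamma ** drops
```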
Details of Inference. Following the official guideline, we extract the top-10 results from the person detector and the top-1 result for each body part from the part detector during testing. For the action recognition and action state parsing tasks, we set the number of sampled clips to seven for each video segment at test time and scale the shorter side of the input frames to 256 pixels.
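For reference, the shorter-side scaling at test time works out as follows; this is a sketch, and the rounding choice for the longer side is an assumption.

```python
def resize_shorter_side(width, height, target=256):
    """Output size when scaling so the shorter side equals `target`
    while preserving the aspect ratio."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)
```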
III-B Main Results
To demonstrate the effectiveness of our method, we compare our PCF framework with the official baseline. Meanwhile, to ensure a fair comparison, we also report a variant in which our ip-CSN-152 backbone is replaced with the TSN-Res50 used by the official baseline.
We present our results on Kinetics-TPS in Table I. “Input” in the second column refers to the video input form; optical flow is computed with the TV-L1 algorithm. “Video Acc” in the fourth column refers to the top-1 video-level action recognition accuracy, while “ROC Score” in the fifth column refers to the final ROC score of each method.
From the results, we first find that directly applying our PCF framework with the TSN-Res50 backbone and RGB input achieves a significant enhancement of +19.44 ROC score over the baseline. Beyond our expectation, simply changing the input from RGB to optical flow gives a further boost of +5.10 ROC score. This may indicate that body part actions encoded by optical flow carry more effective information than RGB input when using a 2D-CNN based network on the PAP task. Furthermore, with the strong CNN-based model ip-CSN-152 pretrained on IG-65M, our PCF framework achieves a 60.89% ROC score on the Kinetics-TPS dataset.
TABLE III: Frame-level vs. segment-level prediction.

| Method | Duration | Action prediction accuracy | TFLOPs |
III-C Ablation Experiments
Effect of Adding the Pose Estimator. We investigate the effect of the pose estimator on detection mAP. For the person detector and part detector, we train the lightweight CNN-based YOLOF model on human locations and body part locations, respectively. As shown in Table II, adding the pose estimator brings consistent AP and AP@50 increases for both models. More specifically, equipped with the pose estimator, our model achieves a significant AP@50 enhancement on the Kinetics-TPS dataset.
Frame-level Prediction vs. Segment-level Prediction. In this subsection, we quantitatively compare the action prediction accuracy and the computation cost of frame-level action parsing and segment-level action parsing in Table III. The action prediction accuracy and the TFLOPs are calculated with the ip-CSN-152 model on the Kinetics-TPS dataset. From the results, we can see that the computation in TFLOPs decreases greatly as the segment duration increases, while the loss of accuracy is at most 0.4%. In particular, when the segment duration is less than three seconds, the accuracy of frame-level and segment-level prediction is almost the same (a decrease of less than 0.1%), while the computation decreases by about 68.16%.
This paper presents a pose-guided coarse-to-fine framework for the part-level action parsing task. In our PCF framework, the pose-guided part detector is one of the first attempts at body part detection and brings considerable improvement in AP@50 (+26.60%). Meanwhile, we convert the frame-level part state parsing problem into segment-level action recognition based on the persistence of human actions, which greatly improves computational efficiency with little loss of precision. Finally, our method achieves SoTA results on the Kinetics-TPS dataset, which shows the effectiveness of our PCF framework. With these three contributions, we provide one of the first attempts at the part-level action parsing task.
This work was supported by the National Key R&D Program of China under Grant No. 2020AAA0103800.
- (2021) You only look one-level feature. In CVPR, pp. 13039–13048.
- (2019) SlowFast networks for video recognition. In ICCV, pp. 6201–6210.
- (2019) Large-scale weakly-supervised pre-training for video action recognition. In CVPR, pp. 12046–12055.
- (2017) Accurate, large minibatch SGD: training ImageNet in 1 hour. CoRR abs/1706.02677.
- (2017) The “something something” video database for learning and evaluating visual common sense. In ICCV, pp. 5843–5851.
- (2018) AVA: a video dataset of spatio-temporally localized atomic visual actions. In CVPR, pp. 6047–6056.
- (2017) Mask R-CNN. In ICCV, pp. 2980–2988.
- (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778.
- (2014) A single-chip 600-fps real-time action recognition system employing a hardware friendly algorithm. In ISCAS, pp. 762–765.
- (2017) The Kinetics human action video dataset. CoRR abs/1705.06950.
- (2019) CASA: a convolution accelerator using skip algorithm for deep neural network. In ISCAS, pp. 1–5.
- (2021) Kinetics-TPS baseline. https://deeperaction.github.io/kineticstps/.
- (2021) Kinetics-TPS dataset. https://github.com/Hypnosx/Kinetics-TPS/.
- (2021) Kinetics-TPS evaluation. https://github.com/xiadingZ/Kinetics-TPS-evaluation/.
- (2017) ImageNet classification with deep convolutional neural networks. Commun. ACM 60 (6), pp. 84–90.
- (2019) Unified spatio-temporal attention networks for action recognition in videos. IEEE Trans. Multim. 21 (2), pp. 416–428.
- (2019) TSM: temporal shift module for efficient video understanding. In ICCV, pp. 7082–7092.
- (2014) Microsoft COCO: common objects in context. In ECCV, pp. 740–755.
- (2021) Swin Transformer: hierarchical vision transformer using shifted windows. In ICCV.
- (2014) A bag-of-importance model with locality-constrained coding based feature learning for video summarization. IEEE Trans. Multim. 16 (6), pp. 1497–1509.
- (2013) TV-L1 optical flow estimation. Image Process. Line 3, pp. 137–150.
- (2017) Learning spatio-temporal representation with pseudo-3D residual networks. In ICCV, pp. 5534–5542.
- (2019) Deep high-resolution representation learning for human pose estimation. In CVPR, pp. 5693–5703.
- (2015) Learning spatiotemporal features with 3D convolutional networks. In ICCV, pp. 4489–4497.
- (2019) Video classification with channel-separated convolutional networks. In ICCV, pp. 5551–5560.
- (2016) Temporal segment networks: towards good practices for deep action recognition. In ECCV, pp. 20–36.
- (2021) SDAN: stacked diverse attention network for video action recognition. In ISCAS, pp. 1–5.