Recent Advances in Monocular 2D and 3D Human Pose Estimation: A Deep Learning Perspective

04/23/2021 · by Wu Liu, et al. · JD.com, Inc. and Harbin Institute of Technology

Estimation of the human pose from a monocular camera has been an emerging research topic in the computer vision community with many applications. Recently, benefiting from deep learning technologies, a significant number of research efforts have greatly advanced monocular human pose estimation in both the 2D and 3D areas. Although there have been some works summarizing the different approaches, it still remains challenging for researchers to obtain an in-depth view of how these approaches work. In this paper, we provide a comprehensive and holistic 2D-to-3D perspective to tackle this problem. We categorize the mainstream and milestone approaches since the year 2014 under unified frameworks. By systematically summarizing the differences and connections between these approaches, we further analyze the solutions for challenging cases, such as the lack of data, the inherent ambiguity between 2D and 3D, and complex multi-person scenarios. We also summarize the pose representation styles, benchmarks, evaluation metrics, and the quantitative performance of popular approaches. Finally, we discuss the challenges and offer in-depth thoughts on promising directions for future research. We believe this survey will provide readers with a deep and insightful understanding of monocular human pose estimation.


I Introduction

I-A Motivation

Monocular human pose estimation (MHPE) is a fundamental and challenging task in the computer vision community. It aims to predict human pose information, such as the spatial locations of body joints and/or the body shape parameters, from a monocular image or video. MHPE has been widely exploited for many computer vision tasks, such as person re-identification [176, 222], human parsing [142, 221], human action recognition [30, 39], and human-computer interaction [44, 107]. As MHPE does not require complex multi-camera systems or wearable markers, it has become a significant part of many real-world applications, such as virtual reality, 3D movie making/editing, self-driving, motion and activity analysis, and human-robot interaction.

Fig. 1: The number of published papers in mainstream computer vision, multimedia, and computer graphics conferences (CVPR, ICCV, ECCV, etc.) and journals (TPAMI, TIP, TOG, etc.) from 2014 to 2020.

According to the spatial dimension of the output results, the mainstream MHPE tasks can be divided into two categories: 2D pose estimation and 3D pose estimation. Monocular 2D human pose estimation, also known as 2D keypoint detection, aims to locate the 2D coordinates of human anatomical keypoints (body joints) from images. Considering the number of people in a given image, the 2D human pose estimation task can be further classified into single person and multi-person pose estimation. Furthermore, given a video sequence, 2D pose estimation can exploit temporal information to boost keypoint prediction. Different from solely predicting the 2D locations of body joints, 3D pose estimation further predicts the depth information for a more accurate spatial representation. In this process, 2D pose estimation can be exploited as an intermediate representation for 3D pose estimation. In recent years, the demand for understanding detailed human pose information has driven 3D pose estimation towards predicting not only the 3D joint locations but also the detailed 3D shape and body texture.

Limited by data and computational resources, early research mainly focused on designing handcrafted features or fitting deformable human body models with optimization algorithms. Recently, with the emergence of large-scale 2D/3D pose datasets (e.g., COCO [111], MPII [5], Human3.6M [71], and 3DPW [196]), deep learning technologies have significantly boosted the performance of human pose estimation in both accuracy and efficiency. As shown in Fig. 1, from 2014 to 2020, the number of published papers in the mainstream conferences (CVPR, ICCV, ECCV, etc.) and journals (TPAMI, TIP, TOG, etc.) in the areas of computer vision, multimedia, and computer graphics has rapidly increased. Recent works mainly focus on network design and optimization [192, 135, 156, 206, 27, 179, 150, 83, 181, 91], multi-task interaction [59, 120, 56, 146, 89], body model exploration [223, 117, 232, 81], etc.

Fig. 2: Milestones, idea or dataset breakthroughs, and the state-of-the-art methods for 2D (top) and 3D (bottom) pose estimation from the year 2014 to 2021.

Although great successes have been achieved in performance and practice, few works have comprehensively reviewed the representative algorithms or given insightful analyses of 2D-to-3D pose estimation. On one hand, some previous surveys [62, 116, 55, 86] reviewed traditional methods, such as body models or handcrafted features, without recent deep-learning-based approaches. On the other hand, recent surveys have mainly focused on one aspect of either 2D pose estimation [133] or 3D pose estimation [170], without a comprehensive perspective that explores the intrinsic connections between 2D and 3D. The survey [26] describes representative 2D pose estimation methods and a few 3D pose estimation methods up to the year 2019. However, it does not thoroughly summarize the related 3D pose and shape estimation methods, and it neglects the 2D-to-3D perspective. Therefore, a more comprehensive survey covering the recent advances in pose estimation is greatly needed in this community.

In this paper, we provide a comprehensive review of the deep learning-based MHPE approaches from 2D to 3D in recent years. We believe that most representative MHPE methods have intrinsic similarities and connections. Moreover, with the rapid development of 3D pose and shape estimation, it is necessary to have a deeper survey on human pose estimation from 2D to 3D. Therefore, compared with the paper [26], our survey has the following differences and advantages. 1) We summarize the prevailing networks for both 2D and 3D pose estimation under unified frameworks that capture the representative paradigms. 2) We provide insightful analyses of human 3D representations, 3D datasets, and 3D shape recovery methods, as well as the challenges and future work for 3D pose estimation. 3) We release a detailed code toolbox (https://github.com/Arthur151/SOTA-on-monocular-3D-pose-and-shape-estimation) for 3D pose data processing, which will be timely and useful for 3D pose research. We summarize a timeline in Fig. 2, which shows the milestones, idea and dataset breakthroughs, and the state-of-the-art methods for 2D and 3D pose estimation from 2014 to 2021. We can see that new approaches and new datasets promote each other. 2D pose estimation has achieved explosive development since 2016, with breakthroughs in both ideas and datasets. Meanwhile, 3D pose estimation has also developed rapidly in recent years.

I-B Overview of Deep Learning Framework for MHPE

The human body is non-rigid and highly flexible, allowing poses with many degrees of freedom. Therefore, predicting the human pose from a monocular camera faces many challenges, such as complex or unusual postures, person-object/person-person interaction or occlusion, and crowded scenes. Different camera views and complex scenes also introduce the problems of truncation, image blur, low resolution, and small target persons.

Fig. 3: Typical framework for single person pose estimation.
Fig. 4: Typical frameworks for multi-person pose estimation.

To address these problems, existing methods exploit the powerful representations of deep learning to mine more clues for pose estimation. Although they differ in global design or detailed optimization, the network architectures of milestone methods have internal similarities. As shown in Fig. 3, most of the prevailing single person pose estimation networks [135, 206, 27, 179, 123, 183, 83] can be regarded as consisting of a pose encoder (also called a feature extractor) followed by a pose decoder. The former aims at extracting high-level features through a high-to-low resolution process. The latter estimates the target output, a 2D/3D keypoint location or a 3D mesh, in a detection-based or regression-based manner. For the pose decoder, detection-based methods yield feature maps or heatmaps, while regression-based methods directly output the target parameters. Following these unified frameworks, we describe the details of network design for 2D and 3D pose estimation in Sections III and IV, respectively.

For multi-person scenes, to estimate the 2D or 3D pose of each person, existing works exploit the top-down paradigm or the bottom-up paradigm. The top-down framework first detects the person areas and then extracts bounding box-level features from them. The features are used to estimate the pose of every single person. In contrast, the bottom-up paradigm first detects all target outputs and then assigns them to different people by grouping [18, 134, 215] or sampling [182]. As shown in Fig. 4, the representative multi-person methods of the two paradigms also rely on the pose-encoder-and-decoder architecture, with the network input being either the detected bounding box or the whole image.

Therefore, how to design an effective pose encoder and pose decoder architecture is a common and popular topic in pose estimation. Different from classification, detection, and semantic segmentation, human pose estimation needs to deal with the subtle differences between body parts, especially in the unavoidable truncation, crowded, and occluded cases. To achieve this, body structural models [191, 33, 185], multi-scale feature fusion [135, 179], multi-stage pipelines [202, 18], refinement in a coarse-to-fine manner [150, 138], multi-task learning [59, 89, 146], etc., have been explored and designed. We will introduce them in detail in Sections III and IV.

Moreover, when estimating 3D poses from monocular images, another challenge is insufficient in-the-wild 3D training data. Because of equipment constraints, common 3D pose datasets are often captured in constrained experimental environments. For example, the most widely used 3D pose dataset, Human3.6M [71], contains only 15 indoor activities performed by seven persons. Therefore, the diversity of human poses, shapes, and scenes is extremely limited. Models trained solely on these datasets are prone to fail on in-the-wild images. To address this problem, many methods take the 2D pose as an intermediate representation or extra supervision and learn from in-the-wild 2D pose information. Nevertheless, there are inherent ambiguities in this process, i.e., a single 2D pose may correspond to multiple 3D poses and vice versa. To resolve these ambiguities, we must consider how to fully exploit the structural prior of the human body, motion continuity, and multi-view consistency.

In conclusion, considering the main challenges of the task and the unified frameworks of the representative paradigms, in this paper we systematically analyze the deep learning-based 2D and 3D MHPE approaches proposed since the year 2014. The rest of the paper is organized as follows. We first introduce the pose estimation background in Section II, which is fundamental for understanding the MHPE task. Then in Section III, we introduce the representative approaches for 2D pose estimation, including single person pose estimation, multi-person pose estimation, pose estimation in videos, and the related tasks of 2D pose estimation. Then in Section IV, for 3D pose estimation, we detail the approaches according to their motivations and challenges. In addition, we introduce widely used 2D and 3D pose benchmarks in Section V and compare their state-of-the-art methods. Finally, in Section VI, we conclude the paper and give some insights into future research.

II Background

II-A Representations for the Human Body

Fig. 5: Widely used human body representations: (a) 2D keypoints [18]; (b) 2D heatmap (top) [18] and volumetric heatmap (bottom) [120]; (c) orientation map PAF [18]; (d) hierarchical bone representation [102]; (e) cylinder model (blue) and EllipBody (pink); and (f) skeleton-driven skinned multi-person linear model (SMPL) [117].

Various representations of the human body have been developed to describe the complex human body pose in different aspects. They have shown various characteristics to handle different challenges of pose estimation. Existing representations can be divided into two categories: 1) keypoint-based representation; and 2) model-based representation.

II-A1 Keypoint-based Representation

2D or 3D coordinates of body keypoints are simple and intuitive representations of the body skeleton, and they take several forms.

2D/3D keypoint coordinates. Body keypoints can be explicitly described by their 2D/3D coordinates. As shown in Fig. 5 (a), the keypoints are connected following the inherent body structure. The orientations of body parts can be derived from these connected limbs.

2D/3D heatmaps. To make the coordinates more suitable for regression by a convolutional neural network, many methods represent keypoint coordinates as heatmaps. As shown in Fig. 5 (b), the Gaussian heatmap of each keypoint has a high response value at the corresponding 2D/3D coordinate and low response values at other positions.
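As a minimal illustrative sketch (the sigma value is an arbitrary choice, not taken from any specific method), a single keypoint can be encoded as a 2D Gaussian heatmap as follows:

```python
import numpy as np

def keypoint_to_heatmap(x, y, height, width, sigma=2.0):
    """Return an (height, width) map peaking at the keypoint location (x, y)."""
    xs, ys = np.meshgrid(np.arange(width), np.arange(height))
    dist_sq = (xs - x) ** 2 + (ys - y) ** 2
    # High response at (x, y), decaying to low responses elsewhere.
    return np.exp(-dist_sq / (2.0 * sigma ** 2))

heatmap = keypoint_to_heatmap(x=30, y=40, height=64, width=64)  # one joint
```

A network then predicts one such map per joint, and the 3D (volumetric) variant simply adds a depth axis.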

Orientation maps. Some methods [18, 118] take the orientation map of body keypoints as an auxiliary representation of heatmaps. OpenPose [18] develops the well-known part affinity fields (PAFs) to represent the 2D orientation between limbs. As shown in Fig. 5 (c), a PAF is a 2D vector field that associates the two keypoints of a limb; each pixel in the field contains a 2D vector pointing from one end of the limb to the other. OriNet [118] further extends this idea to a 3D orientation map, which explicitly models limb orientations.

Hierarchical bone vectors. The 2D version of the hierarchical bone representation was proposed in the compositional human pose (CHP) [180], which combines joints and bone vectors. Xu et al. [209] and Li et al. [102] further extended it to 3D. As shown in Fig. 5 (d), the 3D human skeleton is represented by a set of bone vectors. Each bone vector points from the parent keypoint to the child keypoint, following a kinematic tree. Each parent keypoint is associated with a local spherical coordinate system, and the bone vector can be represented by spherical coordinates in this system.
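The joint-to-bone conversion can be sketched as below; the parent indices describe a hypothetical 17-joint kinematic tree (not a specific dataset's definition), and the spherical-coordinate parameterization is omitted for brevity:

```python
import numpy as np

# Hypothetical kinematic tree: PARENTS[i] is the parent joint of joint i (-1 = root).
PARENTS = [-1, 0, 1, 2, 0, 4, 5, 0, 7, 8, 9, 8, 11, 12, 8, 14, 15]

def joints_to_bones(joints_3d, parents=PARENTS):
    """joints_3d: (J, 3). Returns bone vectors pointing from parent to child."""
    bones = np.zeros_like(joints_3d)
    for child, parent in enumerate(parents):
        if parent >= 0:
            bones[child] = joints_3d[child] - joints_3d[parent]
    return bones

def bones_to_joints(bones, root, parents=PARENTS):
    """Inverse mapping: accumulate bone vectors outward from the root position."""
    joints = np.zeros_like(bones)
    joints[0] = root
    for child, parent in enumerate(parents):
        if parent >= 0:
            joints[child] = joints[parent] + bones[child]
    return joints
```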

II-A2 Model-based Representation

Model-based representation is developed according to the inherent structural characteristics of the human body. It provides richer body information than the keypoint-based description. The model-based representation can be divided into the part-based volumetric model and the statistical 3D human body model.

Part-based volumetric model. Part-based volumetric models are developed to address real-world challenges such as occlusion. For example, in [29], a cylinder model was developed to generate labels for occluded parts. As shown in the blue model of Fig. 5 (e), each limb is represented as a cylinder, which is located by aligning its top and bottom surface centers with the 3D keypoints of the limb. Similarly, as shown in the pink model of Fig. 5 (e), the EllipBody model takes the ellipsoid as the basic unit of body parts [200], which is more flexible than a cylinder.

Detailed statistical 3D human body model. Compared with the part-based volumetric model, the statistical 3D human body mesh describes more detailed information, including the body pose and shape. We introduce the most widely used skinned multi-person linear model (SMPL) [117], which is a skeleton-driven human body model. SMPL disentangles the shape and pose of a human body and encodes the 3D mesh into low-dimensional parameters. It establishes an efficient mapping $M(\beta, \theta; \Phi)$ from shape and pose to a triangulated mesh with 6,890 vertices, where $\Phi$ represents the statistical prior of the human body learned from data. The shape parameter $\beta$ contains the linear combination weights of 10 basis shapes. The pose parameter $\theta$ represents the global rotation and the relative 3D rotations of 23 joints in the axis-angle representation. A linear regressor $W$ is then used to derive preselected body joints from the 6,890 vertices of the human body mesh via $J = W\,M(\beta, \theta)$. The linear combination performed by this regressor guarantees that the joint locations are differentiable with respect to the shape and pose parameters.
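The joint regression step reduces to a matrix product; the sketch below assumes the common 6,890-vertex, 24-joint SMPL convention, and the random regressor is only a placeholder for the pretrained one shipped with the model:

```python
import numpy as np

def regress_joints(vertices: np.ndarray, joint_regressor: np.ndarray) -> np.ndarray:
    """vertices: (6890, 3), joint_regressor: (24, 6890) -> joints: (24, 3).

    Each joint is a fixed linear combination of mesh vertices, so the result is
    differentiable w.r.t. the shape/pose parameters that produced the vertices.
    """
    return joint_regressor @ vertices

# Hypothetical usage with random placeholders standing in for real SMPL outputs.
vertices = np.random.randn(6890, 3)
joint_regressor = np.abs(np.random.randn(24, 6890))
joint_regressor /= joint_regressor.sum(axis=1, keepdims=True)  # rows sum to 1
joints = regress_joints(vertices, joint_regressor)             # (24, 3)
```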

II-B 3D-to-2D Projection

3D-to-2D projection connects the 3D space to the 2D image plane, so it is important to introduce this tool to better understand the methods that use it. Based on a camera model, 3D-to-2D projection is used to generate 3D-2D pose pairs [46, 102], to supervise 3D poses with 2D pose annotations [83, 183, 14], or to refine 2D poses via 3D pose projection [209]. The perspective camera model and the weak-perspective camera model are the two most widely used camera models.

Perspective camera model. The perspective camera model is usually used to project points in 3D space onto 2D pixel coordinates on the image plane. Generally, it consists of two steps. First, the 3D points $X$ are transformed into the camera coordinate system using the extrinsic parameters $[R|t]$, which describe the camera rotation and translation. Second, the intrinsic matrix $K$, which encodes the focal lengths and principal point, maps camera coordinates to pixel coordinates. Therefore, the 2D projection $x$ of a 3D keypoint can be described as $x \simeq K[R|t]X$ in homogeneous coordinates.

Weak-perspective camera model. In most situations, the input 2D images are un-calibrated and the complete perspective camera parameters can hardly be retrieved. Therefore, the weak-perspective camera model is more widely used in existing methods, which calculates the 2D projection of a 3D keypoint $X$ as $x = s\,\Pi(RX) + t$, where $R$ is the global rotation, $\Pi$ is an orthographic projection operation, and $t$ and $s$ represent the translation and scale on the image plane, respectively.
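Both camera models reduce to a few lines of linear algebra; the sketch below is illustrative only and assumes points given as row vectors:

```python
import numpy as np

def perspective_project(X, K, R, t):
    """X: (N, 3) 3D points, K: (3, 3) intrinsics, R: (3, 3) rotation, t: (3,) translation."""
    X_cam = X @ R.T + t                   # world -> camera coordinates (extrinsics)
    x_hom = X_cam @ K.T                   # camera -> homogeneous pixel coordinates (intrinsics)
    return x_hom[:, :2] / x_hom[:, 2:3]   # perspective division by depth

def weak_perspective_project(X, R, s, t):
    """X: (N, 3) 3D points, R: (3, 3) global rotation, s: scalar scale, t: (2,) translation."""
    X_rot = X @ R.T
    return s * X_rot[:, :2] + t           # orthographic projection, then scale and translate
```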

Image / Single person
- Structural Body Model: Spatial relationships of adjacent joints [191, 24, 43]; Bi-directional tree-structured model [33]; Chain model [53]; GAN-based pose discriminator [25]; Human body compositional model [185]; Structured representation by GNN [218]; Occlusion relational graphical model [49].
- Multi-stage Pipeline: Stacked hourglass [135] and its variants [212, 34, 32]; CPM [202] with intermediate input and supervision.
- Pose Refinement: Multi-model fusion [145] and Hybrid-Pose [82]; Iterative update model [19, 11]; Voting scheme [109]; Coarse-to-fine hierarchical network [67] and HCRN [138]; Data-driven augmentation [48, 131].
- Multi-task Learning: Joint 2D and 3D pose estimation [120]; Human parsing guided [139]; Jointly trained augmentation and pose estimation [153].
- Efficiency Improvement: Multi-resolution and low computational cost [159]; Binarized neural network [16]; Hierarchical multi-scale residual architecture [16]; Hourglass using MobileNet [36]; Pose distillation [216].

Image / Multi-person
- Top-down: Single stage model [147, 206]; Multi-task (whole-body pose ZoomNet [77], Mask-RCNN [59], pose and parsing together [168], [204], and [108]); Multi-stage/branch fusion (CPN [177], MSPN [105], RSN [17], HRNet [179], Graph-PCNN [199]); Complex cases (RMPE [45], CrowdPose [100], OASNet [228], ASDA [13]).
- Bottom-up: Integer linear program for joint grouping (DeepCut [156], DeeperCut [70]); Part Affinity Fields for joint grouping (OpenPose [18], PifPaf [93], whole-body OpenPose [61], and [99]); Associative embedding for joint grouping [134] and HigherHRNet [28]; Pose Partition Network [141, 143]; Multi-task (MultiPoseNet [89] and PersonLab [146]).

Video / Single person
- Temporal clues: Insert multiple frames into channel layer [155]; Along with action recognition ([136] and [72]); Optical flow-based model (Thin-Slicing [175], [154], and [20]); Sequence model (Chained Model [53], LSTM Pose Machine [119], UniPose-LSTM [8]); Dynamic Kernel Distillation [140].

Video / Multi-person
- Top-down: Clip-based spatio-temporal model (Detect-and-Track [52] and [201]); Optical flow-based FlowTrack [206] and PoseFlow [207]; Transformer-based keypoint tracker KeyTrack [174]; Recovering missing detections (PGPT [9] and [194]); Learnable similarity metric (POINet [167] and [226]).
- Bottom-up: Graph partitioning-based model [73, 69]; Temporal Flow Fields-based model [37, 42, 68, 158]; Spatio-temporal associative embedding model KE-SIE [76].

TABLE I: Representative deep learning-based methods for monocular 2D pose estimation.

III Monocular 2D Pose Estimation

Monocular 2D pose estimation predicts the 2D locations of body keypoints in images or videos. According to the input/output, the task can be divided into single person and multi-person pose estimation at the image level or video level. Owing to the flexibility of the human body, 2D pose estimation has to deal with various postures, self-occlusion, and the interaction between body and scene. In multi-person scenes especially, crowding and occlusion further challenge the algorithms. In this section, we introduce the representative approaches according to the above categories and summarize them in Table I. Additionally, we give a brief introduction to related tasks that use 2D pose estimation, such as person re-identification, action recognition, human-object interaction, and human parsing.

III-A Single Person Pose Estimation

As shown in Fig. 3, the framework of typical single person pose estimation methods can be formulated as a pose encoder followed by a pose decoder. The pose encoder is a backbone that extracts high-level features, while the pose decoder yields the 2D locations of keypoints in a regression-based or detection-based manner.

Most pose encoders are based on image classification networks, such as ResNet [60], with models pre-trained on a large-scale dataset such as ImageNet. In contrast, few works design task-specific pose encoders. For example, the stacked hourglass network [135] exploits skip connection layers to connect mirrored features with the same resolution. Furthermore, PoseNAS [10] exploits Neural Architecture Search [113] to find task-driven searchable feature extractor blocks. It directly searches a data-oriented pose encoder with stacked searchable cells, which provides an optimal feature extractor for the pose-specific task.

Most recent works focus on the design of the pose decoder, paying increasing attention to exploring context information and the inherent characteristics of the body structure. Toshev et al. [192] propose DeepPose, one of the first human pose estimation methods based on deep convolutional neural networks (DCNNs). With a cascade of DCNN-based pose predictors, DeepPose formulates keypoint estimation as a regression problem, which differs from previous traditional methods such as manually designed graphical models [169, 213] and part detectors [160, 6, 96]. The Iterative Error Feedback (IEF) network [19] exploits a self-correcting regression model, a kind of top-down feedback that progressively refines the initial keypoint predictions. Sun et al. [180] introduce compositional pose regression, which is body structure-aware. The method in [120] solves regression-based keypoint prediction together with human action recognition in a multi-task manner.

Since regression-based methods directly map the image to the coordinates of body joints, they solve a highly non-linear problem and may fail for complex poses. Instead, the detection-based pose decoder generates heatmaps of keypoints rather than regressing coordinates directly [191]. As detection-based pose decoders are widely used in many existing methods, we introduce them according to their design categories as follows.
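To make the two decoder styles concrete, the sketch below shows how keypoint coordinates can be read out from predicted heatmaps, either with a hard argmax or with a differentiable soft-argmax (integral) operation; it is an illustrative example rather than any particular method's implementation:

```python
import torch

def hard_argmax(heatmaps):
    """heatmaps: (B, J, H, W) -> (B, J, 2) pixel coordinates (x, y)."""
    b, j, h, w = heatmaps.shape
    flat_idx = heatmaps.view(b, j, -1).argmax(dim=-1)
    return torch.stack((flat_idx % w, flat_idx // w), dim=-1).float()

def soft_argmax(heatmaps):
    """Differentiable expectation over a softmax-normalized heatmap."""
    b, j, h, w = heatmaps.shape
    probs = torch.softmax(heatmaps.view(b, j, -1), dim=-1).view(b, j, h, w)
    xs = torch.arange(w, dtype=torch.float32, device=heatmaps.device)
    ys = torch.arange(h, dtype=torch.float32, device=heatmaps.device)
    x = (probs.sum(dim=2) * xs).sum(dim=-1)   # marginalize rows, take expectation over x
    y = (probs.sum(dim=3) * ys).sum(dim=-1)   # marginalize columns, take expectation over y
    return torch.stack((x, y), dim=-1)        # (B, J, 2)
```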

Structural Body Model. Along with the DCNN-based feature representation of the whole body, graphical models are explored to describe the structural and local parts with spatial relationships, as illustrated in Fig. 6 (a). Tompson et al. [191] propose a convolutional network Part-Detector via a hybrid DCNN architecture. They formulate the distribution of spatial locations of body parts as a Markov Random Field-like model, which helps to remove anatomically incorrect pose predictions. Similarly, Chen et al. [24] use DCNNs to learn conditional probabilities for the presence of body parts and their spatial relationships within image patches. Different from works that learn pair-wise relationships from the predicted score maps, Chu et al. [33] first investigate the relationships among parts at the feature level. Their end-to-end learning framework captures structural information among body joints via learnable geometrical transform kernels and a bi-directional tree-structured model. Rather than relying on assumptions about the conditional distributions of joints, Gkioxari et al. [53] propose a chained sequence-to-sequence model that sequentially predicts each body part based on all previously predicted body parts. Besides, to avoid biologically implausible pose predictions, the work in [25] proposes a structure-aware network to implicitly exploit geometric constraint priors of the human body; it designs discriminators to distinguish real poses from fake ones via conditional Generative Adversarial Networks (GANs). To further learn the compositionality of the human body, Tang et al. [185] propose the deeply learned compositional model (DLCM), which has bottom-up/top-down inference stages across multiple semantic levels. In the bottom-up stage, higher-level parts are recursively estimated from their children, while in the top-down stage, lower-level parts are recursively refined by their parents. Different from previous approaches that use fully shared features for all body parts, Tang et al. [184] propose to learn specific features for related parts; moreover, instead of a manually defined body structure relation, they propose a data-driven approach to group related parts based on the amount of information they share. Additionally, to deal with occlusion, ORGM [49] proposes an occlusion relational graphical model to represent self-occlusion and object-person occlusion simultaneously, which discriminatively encodes the interactions between human body parts and objects.

Fig. 6: Illustration of six widely used paradigms for 2D single person pose estimation.

Multi-stage Pipeline. It has been shown that multi-stage pipelines and multi-level feature fusion (illustrated in Fig. 6 (c)) are useful for capturing the details of the human body. One representative work is the stacked hourglass network [135], shown in Fig. 6 (b). Each hourglass module consists of a symmetric distribution between bottom-up processing (from high resolutions to low resolutions) and top-down processing (from low resolutions to high resolutions), using a single pipeline with skip layers to preserve spatial information at each resolution. In conjunction with intermediate supervision, the whole network consecutively stacks multiple hourglass modules together. It has been a solid baseline for its variants [212, 34, 32] with various network design optimizations. Among them, Yang et al. [212] insert designed pyramid residual modules into the hourglass network, which can handle scale changes among human body parts. The work in [34] designs Hourglass Residual Units (HRUs) to increase the receptive field of the stacked hourglass network, and exploits a multi-context attention mechanism to enable representations of different granularities, from local regions to globally semantically consistent spaces. To exploit structural information and multi-resolution features, the method in [85] applies multi-scale supervision, multi-scale regression, and a structure-aware loss to the stacked hourglass framework. Besides the stacked hourglass, another well-known multi-stage network, the Convolutional Pose Machine (CPM) [202], uses intermediate input and supervision to learn implicit spatial models without an explicit graphical model. Its sequential multi-stage convolutional architecture increasingly refines the prediction of keypoint locations.
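A minimal sketch of a recursive hourglass module, written in the spirit of [135] but not taken from the original implementation, is given below:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Residual(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))
    def forward(self, x):
        return F.relu(x + self.block(x))

class Hourglass(nn.Module):
    def __init__(self, depth, ch):
        super().__init__()
        self.skip = Residual(ch)                                  # same-resolution skip branch
        self.down = nn.Sequential(nn.MaxPool2d(2), Residual(ch))  # high-to-low (bottom-up) path
        self.inner = Hourglass(depth - 1, ch) if depth > 1 else Residual(ch)
        self.up = Residual(ch)                                    # low-to-high (top-down) path
    def forward(self, x):
        low = self.up(self.inner(self.down(x)))
        low = F.interpolate(low, scale_factor=2, mode="nearest")
        return self.skip(x) + low                                 # merge skip and upsampled paths

# Stacking several such modules with intermediate supervision yields the full network.
hg = Hourglass(depth=4, ch=256)
head = nn.Conv2d(256, 17, kernel_size=1)                          # one heatmap per joint
heatmaps = head(hg(torch.randn(1, 256, 64, 64)))                  # -> (1, 17, 64, 64)
```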

Pose Refinement. Refining the network outputs can improve the final pose estimation performance. Fig. 6 (d) shows the framework of the common coarse-to-fine refinement pipeline. Ouyang et al. [145] build a multi-source deep model to extract non-linear representations from different information sources, including the visual appearance score, appearance mixture type, and deformation; the representations of all information sources are fused for pose estimation, which can be viewed as post-processing of the pose estimation results. The work in [19] uses an iterative update module to progressively make incremental improvements to the pose estimation. Belagiannis et al. [11] introduce a recurrent convolutional neural network to iteratively improve the performance. Lifshitz et al. [109] propose a voting scheme for the optimal pose configuration, where each pixel in the image votes for the optimal position of each keypoint. Besides, some methods use multi-branch networks for pose refinement. Huang et al. [67] present a coarse-to-fine hierarchical network consisting of multiple branches; with multi-level supervision of the multi-resolution feature maps, the branches are unified to predict the final keypoints. HCRN [138] is a hierarchical contextual refinement network in which keypoints of different complexities are processed at different layers; it works in a single-stage pipeline by exploiting a contextual refinement unit to transfer informative context from easy joints to difficult ones. Hybrid-Pose [82] adopts a two-branch stacked hourglass network with a Refinement Network (RNet) for pose refinement and a Correction Network (CNet) for pose correction; RNet refines the keypoint locations in each hourglass stage horizontally, while CNet guides the refinement and fuses the heatmaps in a hybrid manner.

Different from appending an extra network to a preceding coarse network for end-to-end training, the works in [48] and [131] apply a similar refinement strategy that takes both the RGB image and the coarse predicted keypoints as input. The refinement network then directly predicts a refined pose by jointly reasoning over the input-output space. This kind of separate refinement network employs data-driven augmentation for training and can be applied to any existing method.

Multi-task Learning. As shown in Fig. 6 (e), by exploiting complementary information from related tasks, multi-task learning can provide extra cues for pose estimation. For example, Luvizon et al. [120] propose a multi-task framework for joint 2D/3D pose estimation and human action recognition from video sequences. The method in [139] uses a human part parsing learner to exploit part segmentation information and provide complementary features that assist pose estimation. Adversarial data augmentation is exploited in [153] to address the limitations of random data augmentation during network training; it also designs a reward/penalty strategy for jointly training the augmentation network and the target (pose estimation) network.

Improving Efficiency. Along with the development of model performance, how to improve the speed of a model has also attracted much attention. Fig. 6 (f) shows the commonly used framework for improving model efficiency, including light-weight operators, network binarization, and model distillation. Rafi et al. [159] propose a multi-resolution light-weight network with low computational requirements. Binarized neural networks are first exploited in [16] to design a light-weight network under limited computational resources; specifically, based on an exhaustive evaluation of various design choices, a hierarchical, parallel, and multi-scale residual architecture is proposed. The method in [36] investigates the combination of MobileNets and the hourglass network to design a light-weight architecture. In addition, the work in [216] presents a fast pose distillation (FPD) model that trains a high-speed pose network based on the idea of knowledge distillation.
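As an illustrative sketch of the distillation idea (not the exact formulation of [216]), a lightweight student can be supervised jointly by the ground-truth heatmaps and the heatmaps of a frozen, heavier teacher; the weighting factor below is an arbitrary choice:

```python
import torch.nn.functional as F

def distillation_loss(student_heatmaps, teacher_heatmaps, gt_heatmaps, alpha=0.5):
    """All tensors: (B, num_joints, H, W). alpha balances the two targets."""
    loss_gt = F.mse_loss(student_heatmaps, gt_heatmaps)        # supervised term
    loss_kd = F.mse_loss(student_heatmaps, teacher_heatmaps)   # teacher-mimicry term
    return alpha * loss_gt + (1.0 - alpha) * loss_kd

# Hypothetical usage in a training step (teacher outputs are detached/frozen):
# loss = distillation_loss(student(x), teacher(x).detach(), gt)
```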

Fig. 7: Three representative top-down 2D multi-person pose estimation networks: (a) Simple Baseline [206]; (b) CPN [27]; and (c) HRNet [179].

III-B Multi-person Pose Estimation

Multi-person pose estimation needs to detect and locate the keypoints of all persons in an image, where the number of persons is unknown in advance. According to the processing paradigm, the representative methods can be sorted into two categories, i.e., top-down methods and bottom-up methods. The former is a two-stage pipeline that firstly detects all persons in an input image, then detects keypoints of each person in the detected bounding box. Differently, the bottom-up pipeline predicts all keypoints at once, then assigns these keypoints to different persons. We will introduce the representative CNN-based methods of these two categories.

III-B1 Top-down Methods

These methods first detect and crop each person in the image. Then, given a cropped image patch that contains only a single person, they use single-person pose estimation models followed by post-processing, such as pose Non-Maximum Suppression (NMS) [147], to predict the final keypoints of each person. Theoretically, the single person methods introduced in Section III-A can be applied after cropping the image patch. However, compared with the single person case, multi-person scenes have to deal with truncation, environmental occlusion, person-person occlusion, and small targets. Therefore, the representative top-down methods not only focus on designing networks that dig the potential of CNNs and explore rich context information fusion or exchange, but also pay attention to complex scenes.

Two Stage Pipeline. Papandreou et al. [147] propose one of the first deep learning-based two-stage top-down pipelines, named G-RMI, which achieves state-of-the-art results on the challenging COCO 2016 keypoints task. They use the Faster RCNN detector to detect each person, then exploit a fully convolutional ResNet [60] to jointly predict dense keypoint heatmaps and offsets. They also introduce a keypoint-based NMS instead of box-level NMS to improve the keypoint confidence. Furthermore, as in Fig. 7 (a), Xiao et al. [206] provide a simple and effective model that consists of a ResNet backbone and three deconvolution layers to increase the spatial resolution. It shows that a well-designed simple top-down model can be surprisingly effective.
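A minimal sketch of this ResNet-plus-deconvolution design, assuming a recent torchvision and illustrative layer sizes (not the authors' code), is given below:

```python
import torch
import torch.nn as nn
import torchvision

class SimpleBaselineSketch(nn.Module):
    def __init__(self, num_joints=17):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool/fc
        layers, in_ch = [], 2048
        for _ in range(3):  # three deconv layers upsample the stride-32 features by 8x
            layers += [nn.ConvTranspose2d(in_ch, 256, 4, stride=2, padding=1),
                       nn.BatchNorm2d(256), nn.ReLU(inplace=True)]
            in_ch = 256
        self.deconv = nn.Sequential(*layers)
        self.head = nn.Conv2d(256, num_joints, kernel_size=1)  # one heatmap per joint

    def forward(self, x):
        return self.head(self.deconv(self.backbone(x)))

model = SimpleBaselineSketch()
heatmaps = model(torch.randn(1, 3, 256, 192))  # cropped person patch -> (1, 17, 64, 48)
```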

Multi-task Learning. By sharing features between related tasks, multi-task learning can provide better feature representations for pose estimation. For example, Mask-RCNN [59] detects person bounding boxes, then crops the feature map of the corresponding proposal to predict human keypoints. Since human keypoints and human semantic parts are related and complementary, many works [168, 204, 108] design multi-task networks to jointly predict the keypoints and segment the semantic parts. Besides, ZoomNet [77] unifies the human body pose estimator, hand/face detectors, and hand/face pose estimators into a single network. The network first localizes the body keypoints, then zooms into the hand/face regions to predict those keypoints at higher resolutions, which handles the scale variance among different human parts. Moreover, to deal with the lack of whole-body data, the COCO-WholeBody dataset is proposed by extending the COCO dataset with whole-body annotations.

Multi-stage or Multi-branch Fusion. Multi-stage or multi-branch fusion strategies are developed to break the bottleneck of a single model. The work in [27] proposes the Cascaded Pyramid Network (CPN), as shown in Fig. 7 (b), which consists of a global network and a refining network to progressively refine the keypoint prediction. It also proposes an online hard keypoints mining (OHKM) loss to deal with hard keypoints. CPN achieved 1st place in the COCO 2017 keypoint challenge. The work in [177] improves CPN by introducing a channel shuffle module and a spatial, channel-wise attention residual bottleneck to boost the original model. MSPN [105], the winner of the COCO 2018 keypoint challenge, extends CPN into a multi-stage pipeline. It uses the global network of CPN as each single-stage module, fuses features from different stages by cross-stage feature aggregation, and supervises the whole network via coarse-to-fine loss functions. HRNet [179], shown in Fig. 7 (c), points out that high-resolution representations are important for hard keypoint detection. HRNet maintains high-resolution representations through the whole network and gradually adds high-to-low resolution sub-networks to form multi-resolution features. It has been a solid and superior model for pose estimation and many other computer vision tasks. Furthermore, to consider the relationships between keypoints and refine the rough predictions, Graph-PCNN [199] proposes a graph pose refinement module via a model-agnostic two-stage framework. The work in [17], winner of the COCO 2019 keypoint challenge, utilizes a multi-stage pipeline with Residual Steps Network (RSN) modules to aggregate intra-level features. With the delicate local representations obtained from RSN, a Pose Refine Machine (PRM) module is proposed to further balance the local/global representations and refine the output keypoints. The resulting architecture establishes a new state of the art on the COCO and MPII datasets.

Dealing with Complex Scenes. In real-world applications, crowded, occluded, and truncated scenes are unavoidable. To remove the effect of inaccurate person detection, RMPE [45] designs a symmetric spatial transformer network to detect every person, a parametric pose NMS to filter out redundant poses, and a pose-guided human proposal generator to enhance the network capacity for multi-person pose estimation. To tackle crowded scenes, Li et al. [100] first obtain joint candidates in each cropped bounding box, then solve the joint association problem with a graph model. They also collect a crowded human pose estimation dataset named CrowdPose, and define the Crowd Index to measure the crowding level of an image. The work in [54] investigates pose estimation in crowded and occluded surveillance scenes and proposes to add an extra network branch to detect occluded keypoints. Besides, OASNet [228] exploits a Siamese network with an attention mechanism to remove occlusion-aware ambiguities and reconstruct occlusion-free features. To enlarge the training set for challenging cases, Bin et al. [13] propose to augment images by combining segmented body parts to simulate challenging examples, where a generative network dynamically adjusts the augmentation parameters and produces the most confusing training samples.

Fig. 8: Two representative bottom-up 2D multi-person pose estimation methods: (a) OpenPose [18]; and (b) Associative Embedding [134].

III-B2 Bottom-up Methods

Different from top-down methods that rely on a human detector, bottom-up methods directly predict all keypoints in the image, then group keypoint candidates into persons. Besides the network design for more accurate keypoint detection, how to encode the connection information between keypoints is the core of grouping keypoints to different people. In the early stage, DeepCut [156] predicts all keypoints by Fast-RCNN [161] and formulates the keypoint assignment problem as an integer linear program (ILP). DeeperCut [70] improves DeepCut by introducing a stronger part detector and proposes an image-conditioned pair-wise term that explores keypoint geometric and appearance constraints. However, the ILP is still a time-consuming NP-hard problem. Differently, recent works exploit learnable association strategies for keypoint grouping, such as Part Affinity Fields [18], Associative Embedding [134], and their variants. We introduce them in detail below.

Part Affinity Fields. The most popular bottom-up pose estimation method, OpenPose [18], proposes to jointly learn keypoint locations and their association via Part Affinity Fields (PAFs). As shown in Fig. 5 (c), a PAF encodes the location and orientation of a limb by a 2D vector field whose direction points from one part of the limb to the other. Multi-person association then performs bipartite matching to associate keypoint candidates using PAF scores. Via a two-branch, multi-stage architecture, as shown in Fig. 8 (a), OpenPose achieves real-time performance independent of the number of people in the image. The idea of PAFs is further explored by [93, 61, 99]. In PifPaf [93], the PAF is used to associate body parts, and a Part Intensity Field (PIF) is additionally designed to localize them; PIF and PAF are jointly produced along with a Laplace loss to deal with low-resolution and occluded scenes. Moreover, Hidalgo et al. [61] propose the first single-network approach for whole-body multi-person pose estimation, which simultaneously locates body, face, hand, and foot keypoints in an image. The work in [99] designs a body part-aware PAF to encode the connection between keypoints, and improves the stacked hourglass network with attention mechanisms and a focal loss.
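The core of PAF-based association is scoring a candidate limb by integrating the field along the segment between two keypoint candidates; the sketch below is a simplified illustration of that idea, not the exact OpenPose procedure:

```python
import numpy as np

def paf_score(paf, p1, p2, num_samples=10):
    """paf: (H, W, 2) vector field for one limb; p1/p2: (x, y) keypoint candidates."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    direction = p2 - p1
    direction /= np.linalg.norm(direction) + 1e-8
    scores = []
    for t in np.linspace(0.0, 1.0, num_samples):
        x, y = (p1 + t * (p2 - p1)).round().astype(int)
        scores.append(paf[y, x] @ direction)   # alignment of the field with the limb direction
    return float(np.mean(scores))

# Candidate endpoint pairs are then matched (e.g., by bipartite matching),
# keeping the pairs with the highest scores.
```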

Associative Embedding. Associative embedding is a detection-and-grouping method that detects keypoints and groups them into persons using embedding features or tags. Newell et al. [134] propose to generate keypoint heatmaps and their embedding tags for multi-person pose estimation. As shown in Fig. 8 (b), for each body joint, the network produces detection heatmaps and predicts associative embedding tags at the same time. They take the top detections for each joint and match them to other detections that share similar embedding tags, producing a final set of individual pose predictions. Associative embedding is also used by HigherHRNet [28], which learns scale-aware representations from high-resolution feature pyramids. Exploiting the aggregated features from HRNet [179] and higher-resolution features up-sampled through a transposed convolution, HigherHRNet handles scale variation well and achieves a new state of the art for bottom-up pose estimation. For other embedding-based keypoint assignment strategies, Nie et al. [141] propose a Pose Partition Network (PPN) that uses centroid embeddings for all keypoint candidates. The work in [143] introduces the Structured Pose Representation (SPR), which exploits root joints to indicate different people and encodes keypoint positions as displacements w.r.t. the corresponding root.
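A simplified sketch of tag-based grouping is given below; the scalar tags, threshold, and greedy assignment are illustrative choices, not the exact procedure of [134]:

```python
def group_by_tags(detections, tag_threshold=1.0):
    """detections: list of (joint_id, x, y, tag).
    Returns a list of persons, each a dict mapping joint_id -> (x, y)."""
    persons, person_tags = [], []
    for joint_id, x, y, tag in sorted(detections, key=lambda d: d[0]):
        # Find the existing person whose mean tag is closest to this detection's tag.
        best, best_dist = None, tag_threshold
        for i, mean_tag in enumerate(person_tags):
            dist = abs(tag - mean_tag)
            if dist < best_dist and joint_id not in persons[i]:
                best, best_dist = i, dist
        if best is None:                       # no close person: start a new one
            persons.append({joint_id: (x, y)})
            person_tags.append(tag)
        else:                                  # attach and update the running mean tag
            persons[best][joint_id] = (x, y)
            n = len(persons[best])
            person_tags[best] = (person_tags[best] * (n - 1) + tag) / n
    return persons

# Example: two people, each with joints 0 and 1, separated by their tags.
dets = [(0, 10, 12, 0.1), (0, 50, 40, 2.0), (1, 12, 30, 0.2), (1, 52, 60, 1.9)]
people = group_by_tags(dets)
```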

Multi-task Learning. Besides pose estimation alone, MultiPoseNet [89] proposes a multi-task model that jointly handles person detection, keypoint detection, and person segmentation. It proposes a Pose Residual Network (PRN) to assign the predicted keypoints to the detected person bounding boxes by measuring the similarity of their locations. PersonLab [146] is also a multi-task network that jointly predicts keypoint heatmaps and person segmentation maps. In PersonLab, short-range offsets and mid-range pairwise offsets are used to group human keypoints, while long-range offsets and the human pose detections are used to distinguish person segmentation masks.

III-C 2D Pose Estimation in Videos

With the development of image-based pose estimation methods, 2D pose estimation in videos has also been boosted by DCNNs. Different from image-based pose estimation, video pose estimation must consider the temporal relations across frames to handle motion blur and geometric inconsistency. Therefore, directly applying existing image-based pose estimation methods to videos may produce sub-optimal results. In this part, we classify single/multi-person pose estimation methods in videos according to how they exploit spatio-temporal information.

III-C1 Single Person Pose Estimation in Videos

For single person pose estimation in videos, most works explore propagating temporal clues across frames to refine the single-frame pose results. In one of the first works on deep learning-based pose estimation in videos [155], Pfister et al. exploit temporal information by stacking multiple frames into the input colour channels, replacing the three-channel RGB input of an image-based network. The following works further explore temporal propagation in four main categories, i.e., mutual learning with action recognition, temporal propagation by optical flow-based models, sequence models, and distillation models.

Pose Estimation with Action Recognition. Nie et al. [136] propose a spatial-temporal And-Or Graph (ST-AOG) model to combine video pose estimation with action recognition. By adding additional activity recognition branches based on optical flow and appearance features, the two tasks mutually benefit from each other. Furthermore, instead of an additional action recognition branch, the work in [72] shows that the action prediction can be achieved by incorporating activity priors using an action conditioned pictorial structure model.

Optical Flow-based Feature Propagation. Pfister et al. [154] propose SpatialNet, which temporally warps the neighboring frame heatmaps to the current frame through optical flow. After that, they exploit a parametric pooling layer to combine the aligned heatmaps into a pooled confidence heatmap. The work in [20] proposes a personalized ConvNet pose estimator, which propagates high-quality automatic pose annotations throughout the video by spatial image matching and optical flow propagation; the propagated new annotations are used to fine-tune the generic ConvNet pose estimator. In Thin-Slicing [175], Song et al. propose to propagate keypoints by computing dense optical flow between neighboring frames. They use a flow-based warping layer to align previous heatmaps to the current frame, followed by a spatio-temporal inference layer that performs iterative message passing on a pose configuration graph with both spatial and temporal edges.

Sequence Model-based Feature Propagation. In the Chained Model [53], Gkioxari et al. adopt a sequence-to-sequence recurrent model for structured pose prediction in videos, where the prediction of each body keypoint relies on all previously predicted keypoints. In LSTM Pose Machine [119], Luo et al. capture temporal dependencies in videos with a memory-augmented LSTM framework. Given a frame, the Encoder-RNN-Decoder pipelines [53, 119] first learn high-level image representations with an encoder, then propagate temporal information and produce hidden states with RNN units, and finally predict the keypoints of the current frame with a decoder that takes the hidden states as input. A similar concept is adopted by UniPose-LSTM [8], which exploits an LSTM module to propagate previous heatmaps generated by the multi-resolution Atrous Spatial Pooling architecture in the UniPose network.

Distillation Model. Different from optical flow or sequential RNN models that require large networks, Nie et al. [140] introduce a Dynamic Kernel Distillation (DKD) model to transfer pose knowledge in a one-shot feed-forward manner. Specifically, the small DKD network exploits a temporally adversarial training strategy via a discriminator to generate temporally coherent pose kernels and predict keypoints in videos.

III-C2 Multi-Person Pose Estimation and Tracking in Videos

Multi-person pose estimation in videos has to identify and locate keypoints under multi-person occlusion, motion blur, and pose and appearance variations. To exploit temporal clues, multi-person pose estimation is usually associated with articulated tracking. The PoseTrack dataset [73, 4] is the first large-scale, in-the-wild multi-person dataset for pose estimation and tracking, containing hundreds of videos with more than 150K poses. With the emergence of such datasets, the mainstream solutions for this task can be sorted into top-down methods and bottom-up methods.

Top-down Methods. The top-down methods follow the tracking-by-detection paradigm. They first detect persons and keypoints in each frame, then propagate the bounding boxes or keypoints across frames. For example, 3D Mask R-CNN is proposed in Detect-and-Track [52] to employ fully 3D convolutional networks to detect keypoints of each person in a video clip. Then a keypoint tracker is used to link the predictions by comparing the distances of the detected bounding boxes. Moreover, the clip-based tracker is further exploited in [201], which extends HRNet via carefully designed 3D convolutional layers to learn the temporal correspondence between keypoints. Next, a spatio-temporal merging procedure is designed to estimate the optimal keypoint outputs by spatial and temporal smoothing. By combining the clip tracking network and the merging procedure, the model in [201] can well handle failure detections in complex scenes with severe occlusion and highly entangled people.

Different from clip-based detection, the work in [206] builds on independent single-frame detection and exploits optical flow-based temporal pose similarity to associate keypoints across frames. To cope with missing detections in a single frame, PGPT [9] combines an image-based detector with an online person location predictor to compensate for the missing bounding boxes. Meanwhile, PGPT introduces a hierarchical pose-guided graph convolutional network that exploits human structural relations to boost the person representation and data association. For efficient data association, POINet [167] explores a pose-guided ovonic insight network to learn feature extraction, similarity metrics, and identity assignment in a unified end-to-end network. In another work on learnable similarity metrics [226], temporal keypoint matching and keypoint refinement are designed for keypoint association and pose correction, respectively; both are learnable and incorporated into a pose estimation network. The work in [194] proposes self-supervised keypoint correspondences that can not only recover missing pose detections, but also associate detected and recovered poses across frames. Different from using high-dimensional image representations to track people, KeyTrack [174] presents a transformer-based tracker that relies only on keypoints; the transformer-based network exploits a binary classification to predict whether one pose temporally follows another.

Bottom-up Methods. These methods commonly use single-frame pose estimation to predict all keypoints in each frame, then assign keypoints across frames in a spatio-temporal optimization manner. For example, the graph partitioning-based methods in [73, 69, 4] first extend the image-level bottom-up multi-person pose estimation methods [156, 18]. They then build a spatio-temporal graph that connects keypoint candidates both spatially and temporally, which can be formulated as a linear programming problem. However, spatial and temporal graph partitioning usually leads to heavy computation and non-online solutions. Differently, PoseFlow [207] exploits a pose flow that measures the pose distance across frames to track the same person. Inspired by the spatial Part Affinity Fields in OpenPose [18], the works in [37, 42, 68, 158] exploit Temporal Flow Fields to indicate the propagation direction of keypoints across frames. Associative Embedding [134], originally an image-based bottom-up keypoint association strategy, is also extended in [76] to build a spatio-temporal embedding that associates keypoints with embedding features for temporal consistency.

In summary, the development of 2D pose estimation has been greatly boosted by the design of CNN-based networks with the aid of body part structural relationships, multi-stage pipelines, multi-level feature fusion, pose refinement, multi-task learning, and efficiency-aware design. In the multi-person case, an excellent top-down method relies on an accurate detection network and a solid single person pose estimation network, while for bottom-up methods, the most important part is how to group the detected keypoints into different people. In video-level pose estimation and tracking, representative research focuses on how to efficiently propagate spatio-temporal information to guarantee the consistency and smoothness of the predictions.

III-D 2D Pose Estimation for Other Tasks

Person Re-Identification: For the task of person re-identification (ReID), pose variations may bring significant changes to the appearance of a person. Detailed part-based information is important to discriminate individuals, especially for occluded person ReID and partial person ReID. For example, Su et al. [176] propose a pose-driven model to learn improved feature extraction and matching models, where the response maps of body joints help alleviate pose variations. Besides, Zhao et al. [222] exploit body structure information to align body region features, and different semantic regions can be merged with a competitive scheme. To correct the pose variations caused by camera views and person motions, Zhang et al. [225] employ pose estimation for body part discovery and affine projection. To deal with the misalignment and occlusion problems of ReID, Xu et al. [208] propose a confidence-map-based solution instead of the previous RoI-based solutions, which can precisely exclude background clutter and adjacent part features. Furthermore, Miao et al. [128] encode the position information of body parts, which helps to disentangle the useful information from the occlusion noise. Gao et al. [50] design a pose-guided visible part matching framework, which jointly learns discriminative pose-guided attention and part visibility in a self-supervised way.

Action Recognition: Recognition of human actions is an important task in the understanding of dynamic scenes. The detection and alignment of human poses have proved crucial for capturing discriminative information about human actions. For example, Cheron et al. [30] introduce pose estimation to extract appearance and flow information at characteristic positions, pointing out the importance of a representation derived from the human pose. Furthermore, Nie et al. [136] propose a framework to integrate action recognition and pose estimation from video, where a spatial-temporal And-Or Graph model represents actions and poses. Besides, Luvizon et al. [120] propose a multi-task framework that jointly handles 2D/3D pose estimation and action recognition from video sequences. Differently, Du et al. [39] propose a pose-attention mechanism to adaptively learn pose-related features at every time step of RNN-based action prediction. With the development of 2D pose estimation, recent works [40, 101, 172, 211] tend to decouple pose estimation from action recognition and focus directly on skeleton-based action recognition using the structured positions of human keypoints.

Human-Object Interactions: Recognizing Human-Object Interactions (HOI) is an important research problem with massive applications in image understanding and robotics. Pose information is important for inferring the interactions between the detected human and object. For example, Fang et al. [44] propose a pairwise body-part attention model to focus on crucial parts and their correlations for HOI recognition; in the model, a pose estimator is employed to detect human keypoints and then body parts. Li et al. [107] further combine visual appearance, spatial location, and human pose information for interactiveness discrimination, employing pose estimation to predict 17 keypoints that build the spatial-pose information. Wan et al. [197] train CPN as the pose estimator and utilize the estimated human pose to capture the global spatial configuration as guidance for extracting local features at the semantic part level. In the 3D HOI task, Li et al. [106] first obtain the 2D human pose and then estimate the 3D human body to construct a 3D spatial configuration volume.

Human Parsing: Human parsing aims at segmenting a human image into different fine-grained semantic parts, which serves as the basis for many high-level applications, e.g., human behavior analysis, person re-identification, and video surveillance. The human pose can offer structural information for body part segmentation and labeling. For example, Dong et al. [38] present a unified framework for simultaneous human parsing and pose estimation. They verify that the mutually complementary nature of the two tasks can boost the performance of each other. Similarly, Nie et al. [142] present a novel mutual learning solution for joint human parsing and pose estimation. Besides, Liang et al. [108] propose a joint human parsing and pose estimation network to explore efficient context modeling, which can simultaneously predict parsing and pose with high quality. Zhang et al. [221] argue that human body edges and poses are two beneficial factors for human parsing, and design an effective way of leveraging the pivotal contextual cues provided by edges and poses.

Model Type Input/Output Type Main Idea Methods
Skeleton-based Single person Heatmap-based Coarse-to-fine [150]; VNect [127]; Integral heatmap regression IHP [181].
Skeleton-based Single person Lifting 2D pose to 3D Simple-3D [123]; 2D-to-3D lifting via matching [21]; Cycle consistency between 2D/3D [189, 22].
Skeleton-based Single person Fusing image features with keypoints Global and local integration [137]; SemGCN [223] and Liu et al. [114].
Skeleton-based Single person Solving data lacking Depth ordering of keypoints [151, 227]; Multi-view consistency [163, 162, 90, 193, 129]; Learning from synthetics [164, 23, 195, 95].
Skeleton-based Single person Solving inherent ambiguity Body geometry prior [46]; Temporal smoothness [110, 63, 98, 152, 29]; Rationality/View supervision [171, 87]; Hierarchical bone representation [209, 102].
Skeleton-based Multi-person Top-down methods LCR-Net++ [165]; 3DMPPE [130].
Skeleton-based Multi-person Bottom-up methods Grouping keypoints to skeleton [215]; Occlusion-robust pose-maps [126]; Compressing volumetric heatmaps [41]; PandaNet [12]; Depth estimation [224, 198].
Mesh-based Single person Solving data lacking GAN-based rationality supervision [83, 88]; Temporal dynamics [84, 183]; Appearance consistency [210, 149]; SMPL optimization in the loop SPIN [91].
Mesh-based Single person Proper representations Graph representation GraphCMR [92]; Voting results HoloPose [56]; Skeleton-disentangled representation [183]; Multi-body-part 3D mesh [166, 205, 148].
Mesh-based Multi-person Multi-stage methods Scene constraints [214]; CRMH [75].
Mesh-based Multi-person Single-shot methods CenterHMR [182].
TABLE II: Representative deep learning-based methods for monocular 3D pose estimation.

IV Monocular 3D Pose Estimation

Monocular 3D pose estimation can be divided into skeleton-based and mesh-based 3D pose estimation according to the output representation. The former predicts the 3D locations of the body joints, while the latter outputs a 3D body mesh based on a human mesh topology or a statistical 3D body model. Compared with 2D pose estimation, estimating 3D pose from monocular 2D images is much more challenging. Besides all the challenges of the 2D case, monocular 3D pose estimation also suffers from the lack of in-the-wild 3D data and the inherent 2D-to-3D ambiguity.

The first big challenge is the lack of sufficient in-the-wild data with accurate 3D annotations. Most existing 3D pose datasets are not diverse enough: it is hard and expensive to precisely capture 3D pose annotations for 2D images, especially in outdoor conditions, so existing 3D pose datasets are often biased towards specific environments (e.g., indoor) with limited actions. For example, the well-known 3D pose dataset Human3.6M [71] only contains 11 actors performing 15 activities. In contrast, 2D pose data are easy to collect and contain much richer poses and environments. Therefore, 2D pose datasets [111, 5, 78] are often used to improve the generalization of 3D algorithms. For example, most existing 3D pose estimation methods adopt the 2D pose as an intermediate representation [183, 223] or even as the network input [123, 152] to reduce the difficulty. Besides, many methods propose unsupervised [162, 22] or weakly supervised [163, 129] frameworks to alleviate the dependence on fully annotated 3D data.

Furthermore, the loss of depth information causes an inherent 2D-to-3D ambiguity in 3D pose estimation. This is especially serious for two-stage methods, which first estimate the 2D pose from images and then lift it to 3D, because multiple 3D poses can map to the same 2D keypoints. Many methods attempt to tackle this problem by using various kinds of prior information, such as geometric prior knowledge [151, 46], statistical models [117, 83], and temporal smoothness [35, 88].

In this section, according to the way of pose representation, we classify the representative 3D pose estimation methods into the 1) skeleton-based paradigm and 2) mesh-based paradigm. Additionally, we also introduce ideas to solve the lack of in-the-wild 3D data, inherent 2D-to-3D ambiguity, and the case of the multi-person scene. We summarize all the representative methods in Table II.

Fig. 9: Representative frameworks of monocular 3D human pose estimation.

IV-A Skeleton-based 3D Pose Estimation

As shown in Fig. 9, according to the representation, three kinds of frameworks are generally adopted by the prevailing methods for single-person 3D pose estimation, i.e., methods based on 1) volumetric heatmaps, 2) lifting the 2D pose to 3D, and 3) fusing image features with the 2D pose. Furthermore, because of their significance, we also summarize methods that address two common challenging cases in 3D pose estimation, i.e., 4) the lack of 3D data and 5) the inherent 2D-to-3D ambiguity.

IV-A1 Heatmap-based Methods

Different from regression-based methods [103] that directly predict the 3D location of each keypoint, heatmap-based methods represent each 3D keypoint as a 3D Gaussian distribution in a volumetric heatmap. In post-processing, the 3D keypoint coordinates can then be easily parsed from the estimated volumetric heatmap by taking the local maximum. Therefore, as shown in Fig. 9 (a), heatmap-based methods are designed to directly estimate the volumetric heatmap from monocular images through an end-to-end framework. Following this pipeline, the Coarse-to-Fine (C2F) network [150] stacks hourglass networks [135] and progressively extends the volumetric (depth) resolution of the predicted heatmap for fine-grained results. Besides, VNect [127] is a real-time monocular 3D pose estimation method; it fits a coherent kinematic skeleton in post-processing to yield temporally stable pose results. To simplify the post-processing, Integral Human Pose (IHP) [181] proposes an integral operation that directly converts the heatmaps into keypoint coordinates in a differentiable manner during forward inference, building a bridge between the heatmap-based and regression-based methods.
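To make the post-processing concrete, the sketch below (assuming PyTorch; tensor shapes and names are illustrative) contrasts a hard argmax over a volumetric heatmap with a differentiable integral (soft-argmax) operation in the spirit of IHP [181].

```python
import torch

def heatmap_argmax_3d(heatmaps):
    """Hard argmax: pick the voxel with the highest response per joint.
    heatmaps: (J, D, H, W) volumetric heatmaps for J joints."""
    J, D, H, W = heatmaps.shape
    flat_idx = heatmaps.view(J, -1).argmax(dim=1)
    z = flat_idx // (H * W)
    y = (flat_idx % (H * W)) // W
    x = flat_idx % W
    return torch.stack([x, y, z], dim=1).float()    # (J, 3) voxel coordinates

def integral_argmax_3d(heatmaps):
    """Differentiable 'integral' (soft-argmax) over normalized heatmaps:
    the coordinate is the expectation under the softmax distribution."""
    J, D, H, W = heatmaps.shape
    probs = torch.softmax(heatmaps.view(J, -1), dim=1).view(J, D, H, W)
    zs = torch.arange(D, dtype=torch.float32).view(1, D, 1, 1)
    ys = torch.arange(H, dtype=torch.float32).view(1, 1, H, 1)
    xs = torch.arange(W, dtype=torch.float32).view(1, 1, 1, W)
    x = (probs * xs).sum(dim=(1, 2, 3))
    y = (probs * ys).sum(dim=(1, 2, 3))
    z = (probs * zs).sum(dim=(1, 2, 3))
    return torch.stack([x, y, z], dim=1)             # (J, 3) expected coordinates
```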

IV-A2 Lifting 2D Pose to 3D

Benefiting from robust 2D pose estimation methods, as shown in Fig. 9 (b), many methods focus on lifting the 2D pose to 3D via a simple regressor. The entire pipeline can be divided into two parts: 1) estimating the 2D pose from monocular images, and 2) lifting the estimated 2D pose to 3D. In this way, the 3D methods can be built on top of 2D pose estimation: the model complexity is much lower, while the generalization can be improved. For example, Martinez et al. [123] propose Simple-3D, a well-known simple baseline that estimates the depth of each keypoint from the 2D pose. It only contains two fully connected blocks while achieving good performance on the related benchmarks. Furthermore, Chen et al. [21] solve the 2D-to-3D keypoint lifting problem via pose matching. To enrich the matching library, they generate a large number of 2D-3D pose pairs by projecting 3D poses back to the 2D image plane with random cameras. Given a 2D image, they then only need to predict the 2D pose and search the library for the most similar 2D-3D pose pair, whose 3D pose is taken as the estimation result. Differently, Tome et al. [189] develop a multi-stage convolutional network to recurrently optimize the estimated 3D pose. They refine the intermediate 2D pose using the back projection of the estimated 3D pose, which progressively increases the accuracy of both 2D and 3D pose estimation. Besides, Chen et al. [22] take advantage of the cycle consistency between the 2D pose input and the 2D projection of the estimated 3D pose to learn the 2D-to-3D lifting function in an unsupervised manner. Without any 3D annotations, cycle consistency alone cannot help the model properly learn depth, and the model may converge to a local minimum with a constant depth. To tackle this, they exploit the geometric prior that 2D projections of the same 3D pose from different views should lift to the same 3D pose. Specifically, the lifted 3D pose is first projected back to the 2D image plane from a random view, and its depth-wise rationality is supervised in a generative adversarial manner using available 2D poses. Then they lift the randomly projected 2D pose back to 3D and penalize the difference between its 2D projection from the original view and the original 2D pose input. In this way, the model gets properly trained without any 3D annotations.
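As an illustration of how lightweight such a lifter can be, here is a sketch of a Simple-3D-style network built from two fully connected residual blocks; the layer sizes, dropout, and 17-joint assumption are illustrative rather than the exact configuration of [123].

```python
import torch
import torch.nn as nn

class ResidualFCBlock(nn.Module):
    """One fully connected residual block (Linear-BN-ReLU-Dropout, twice)."""
    def __init__(self, dim=1024, dropout=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim), nn.BatchNorm1d(dim), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(dim, dim), nn.BatchNorm1d(dim), nn.ReLU(), nn.Dropout(dropout),
        )

    def forward(self, x):
        return x + self.net(x)

class Lifter2Dto3D(nn.Module):
    """Lift J 2D keypoints (x, y) to J 3D keypoints (x, y, z)."""
    def __init__(self, num_joints=17, dim=1024):
        super().__init__()
        self.inp = nn.Linear(num_joints * 2, dim)
        self.blocks = nn.Sequential(ResidualFCBlock(dim), ResidualFCBlock(dim))
        self.out = nn.Linear(dim, num_joints * 3)

    def forward(self, pose_2d):                  # pose_2d: (B, J, 2)
        B = pose_2d.shape[0]
        h = self.blocks(self.inp(pose_2d.reshape(B, -1)))
        return self.out(h).reshape(B, -1, 3)     # (B, J, 3)
```

Such a lifter is typically trained with a simple per-joint regression loss against 3D ground truth expressed in the camera coordinate frame.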

IV-A3 Integrating Image Features with 2D Keypoints

As shown in Fig. 9 (c), some methods attempt to integrate the heatmap-based and 2D-to-3D lifting frameworks. On one hand, 2D keypoints provide limited information about the human body, which can make 2D-to-3D lifting ambiguous in challenging scenes. On the other hand, image features provide more contextual information, which is helpful for determining the accurate 3D pose. Following this idea, Nie et al. [137] propose to integrate the local image texture features of keypoints into the global skeleton of the 2D pose, and develop a two-level hierarchy of LSTM networks to model the global and local features progressively. Furthermore, SemGCN [223] and Liu et al. [114] both extract joint-level image features and integrate them with keypoint coordinates to form multiple graph nodes. Next, GCNs and LSTMs are used to model the relationships between these nodes with the help of image features.

IV-A4 Solving the Data Deficiency Problem

Most 3D pose datasets capture very limited activities of a few actors in indoor environments. Compared with 2D pose datasets, 3D pose datasets are poor in the diversity of poses and environments. To handle this problem, many approaches attempt to train the model in an unsupervised or weakly supervised manner. Some approaches propose to use weak but cheap labels for supervision. For example, Pavlakos et al. [151] take advantage of weak ordinal depth relations between keypoints for supervision. Experiments show that, compared with direct supervision using ground-truth 3D pose annotations, the ordinal supervision can achieve comparable performance. Similarly, HEMlets [227] encodes the explicit depth ordering of adjacent keypoints as ground truth via a heatmap triplet loss.
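One possible form of such ordinal supervision is sketched below (PyTorch assumed); the sign convention (+1 meaning joint i is annotated as closer to the camera than joint j) and the exact loss shape are illustrative, not the precise formulation of [151].

```python
import torch

def ordinal_depth_loss(pred_depth, pairs, relations):
    """Ranking-style loss on ordinal depth relations between keypoint pairs.

    pred_depth: (J,) predicted depth per joint.
    pairs:      (P, 2) indices (i, j) of compared joints.
    relations:  (P,) label per pair: +1 if joint i is closer than j,
                -1 if farther, 0 if roughly at the same depth.
    """
    di = pred_depth[pairs[:, 0]]
    dj = pred_depth[pairs[:, 1]]
    closer = relations != 0
    # For ordered pairs, penalize violations of the annotated ordering.
    rank_loss = torch.log1p(torch.exp(relations[closer].float() * (di - dj)[closer]))
    # For "same depth" pairs, pull the two depths together.
    tie_loss = (di - dj)[~closer] ** 2
    return rank_loss.sum() + tie_loss.sum()
```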

Besides the body structure prior, many approaches propose to use multi-view consistency for supervision. However, considering only multi-view consistency will lead to a degenerate solution [163]: the model may be trapped in a local minimum and produce similar zero poses for different inputs. To address this problem, Rhodin et al. [163] propose to use a small amount of labeled data to avoid the local minimum and correct the predictions. Differently, Rhodin et al. [162] propose to use sequential images to provide a temporal consistency prior for body representation learning. EpipolarPose [90] takes advantage of multi-view 2D poses to generate 3D pose annotations via epipolar geometry, so that the entire framework can be trained in a self-supervised manner. Umar et al. [193] tackle the degeneration trap via a novel alignment-based objective function without requiring extrinsic camera calibration. They train the model using unlabelled multi-view images and 2D pose datasets, where the 2D poses are only used to train the 2D pose backbone and the multi-view images are exploited for the corresponding 3D pose estimation. Multi-view 3D pose results are normalized and then aligned using Procrustes analysis for consistency; in this process, the 3D poses are transformed to the same scale for rigid alignment. Mitra et al. [129] propose the Multiview-Consistent Semi-Supervised Learning (MCSS) framework, which tries to learn a view-invariant pose embedding in a semi-supervised manner. On one hand, the model is trained to estimate view-invariant 3D poses by making the left pelvis bone parallel to the plane. On the other hand, temporal relation-based hard-negative mining is applied to obtain consistent pose embeddings across different views.

Additionally, some methods propose to learn the model from synthetic data. In general, there are two kinds of data synthesis pipelines: 1) a 2D image stitching pipeline, and 2) a 3D model projection pipeline. For the first pipeline, Rogez et al. [164] attempt to generate 2D images of 3D poses drawn from 3D motion capture (MoCap) datasets. They first select image patches whose 2D pose matches a part of the projected 3D pose; with kinematic constraints, the local image patches are then stitched together to form a complete 2D image of the 3D pose. Differently, Chen et al. [23] and Varol et al. [195] follow the 3D model projection pipeline. They project a textured statistical human body model, e.g., the Shape Completion and Animation of People (SCAPE) [7] model or SMPL [117], onto in-the-wild 2D background images for data generation. In this way, complete 3D annotations for a 2D in-the-wild person image are available, containing not only the 3D body pose, shape, and texture, but also the camera and light parameters. In summary, the 2D image stitching pipeline has the potential to generate more realistic person images, while the 3D model projection pipeline can obtain more comprehensive 3D annotations. Besides, PGP-human [95] constructs a self-supervised training pipeline using the 3D-to-2D projection. PGP-human takes advantage of image pairs sampled from in-the-wild videos for training, which contain the same person performing different actions against different backgrounds. The model is trained to disentangle the appearance and pose information by mixing their features extracted from image pairs for image re-synthesis. In this process, the Puppet model [232] is adopted to transform the back-projected 2D pose into part segmentation maps to promote appearance reconstruction.
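The 3D-model-projection pipeline ultimately rests on a simple pinhole projection to produce paired 2D annotations; a minimal NumPy sketch with an assumed synthetic camera is given below.

```python
import numpy as np

def project_3d_to_2d(joints_3d, K, R=np.eye(3), t=np.zeros(3)):
    """Project 3D joints (meters, world/model coordinates) to 2D pixels with a
    pinhole camera. K is the 3x3 intrinsic matrix, (R, t) the extrinsics."""
    cam = joints_3d @ R.T + t          # (J, 3) points in camera coordinates
    uvw = cam @ K.T                    # (J, 3) homogeneous image coordinates
    return uvw[:, :2] / uvw[:, 2:3]    # (J, 2) pixel coordinates

# Example: a synthetic camera looking down the z-axis at a person ~3 m away.
K = np.array([[1000., 0., 512.], [0., 1000., 512.], [0., 0., 1.]])
joints_3d = np.random.randn(17, 3) * 0.3 + np.array([0., 0., 3.])
joints_2d = project_3d_to_2d(joints_3d, K)      # paired 2D annotation
```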

IV-A5 Solving the Inherent Ambiguity Problem

Due to the inherent depth ambiguity, a single 2D pose may correspond to multiple 3D poses, especially in the 2D-to-3D lifting pipeline. Therefore, various prior constraints are employed to determine the specific pose. Many methods use temporal consistency and dynamics to resolve the ambiguity of a single 2D pose. For example, RSTV [188] directly regresses a 3D pose for the central frame from a spatio-temporal volume of bounding boxes. This approach achieves motion compensation by training two networks to predict large body shifts between consecutive frames and then refine them. Fang et al. [46] explicitly incorporate the body prior (including kinematics, symmetry, and driven joint coordination) into the model via a hierarchy of bi-directional RNNs. In this way, the 3D pose prediction is supervised to follow the body prior constraints and temporal dynamics. Besides, Lin et al. [110], TP-Net [63], and Lee et al. [98] develop sequence-to-sequence networks composed of LSTM units to estimate a sequence of 3D poses from 2D poses, so the specific pose of each frame can be better determined by exploring the motion dynamics of the sequence. Furthermore, to improve the computational efficiency, VideoPose3D [152] and OANet [29] adopt temporal convolutions on 2D pose sequences to guarantee temporal consistency; the fully convolutional architecture enables efficient parallel computation. Differently, OANet uses a cylinder human model to generate occlusion labels, which helps the model to learn the collision between body parts. Moreover, Sharma et al. [171] resolve the ambiguity in a generative adversarial manner. They train a conditional VAE network to assess the rationality of generated 3D pose samples conditioned on the 2D pose. In addition, ActiveMoCap [87] tries to estimate the uncertainty of different predictions, which is used to select the output with the lowest ambiguity; it helps the model to learn the best viewpoints for 3D pose estimation.
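A dilated temporal convolution over 2D pose sequences, in the spirit of VideoPose3D [152], can be sketched as follows (PyTorch assumed; channel sizes, dilation schedule, and the 17-joint layout are illustrative):

```python
import torch
import torch.nn as nn

class TemporalLifter(nn.Module):
    """Dilated temporal convolutions over a window of 2D poses; the 3D pose of
    the center frame is regressed from the whole window."""
    def __init__(self, num_joints=17, channels=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(num_joints * 2, channels, kernel_size=3, dilation=1), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, dilation=3), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, dilation=9), nn.ReLU(),
        )
        self.head = nn.Conv1d(channels, num_joints * 3, kernel_size=1)

    def forward(self, pose_2d_seq):                          # (B, T, J, 2), T >= 27
        B, T, J, _ = pose_2d_seq.shape
        x = pose_2d_seq.reshape(B, T, J * 2).transpose(1, 2)  # (B, J*2, T)
        y = self.head(self.net(x))                            # (B, J*3, T')
        center = y.shape[-1] // 2
        return y[:, :, center].reshape(B, J, 3)               # 3D pose of center frame
```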

Besides temporal consistency, some other methods attempt to resolve the ambiguity by developing a representation closer to the body structure. For example, Xu et al. [209] and Li et al. [102] adopt a hierarchical bone representation, as introduced in Sec. II-A. The geometric dependence of adjacent joints is explicitly modeled in this representation, which mainly supervises the bone lengths and joint directions. Benefiting from the hierarchical bone representation, the 3D body skeleton becomes separable and can easily be mixed up to synthesize new skeletons. Therefore, Li et al. [102] propose to enrich the pose space of the training data by mixing up skeletons from different images; training with the expanded data helps to improve model generalization.
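The bone representation itself is easy to illustrate: joints are converted to per-bone lengths and unit directions, which can be supervised (and re-mixed) separately. The NumPy sketch below uses an assumed 17-joint topology, not necessarily the one in [209, 102].

```python
import numpy as np

# Each entry is (child, parent); the exact joint order is an illustrative assumption.
BONES = [(1, 0), (2, 1), (3, 2), (4, 0), (5, 4), (6, 5), (7, 0), (8, 7),
         (9, 8), (10, 9), (11, 8), (12, 11), (13, 12), (14, 8), (15, 14), (16, 15)]

def joints_to_bones(joints_3d):
    """Convert joint positions (17, 3) into per-bone lengths and unit directions."""
    bones = np.array([joints_3d[c] - joints_3d[p] for c, p in BONES])   # (16, 3)
    lengths = np.linalg.norm(bones, axis=1, keepdims=True)              # (16, 1)
    directions = bones / np.clip(lengths, 1e-8, None)                   # (16, 3)
    return lengths, directions

def bones_to_joints(root, lengths, directions):
    """Recompose joints from the root position plus bone lengths/directions."""
    joints = np.zeros((len(BONES) + 1, 3))
    joints[0] = root
    for (child, parent), l, d in zip(BONES, lengths, directions):
        joints[child] = joints[parent] + l * d
    return joints
```

Mixing skeletons then amounts to recombining lengths from one sample with directions from another before calling bones_to_joints.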

IV-A6 Multi-person 3D Pose Estimation

In general, real-world scenes often contain multiple persons. Besides the challenges faced in multi-person 2D pose estimation, the 3D case must also predict the root depth and the root-relative keypoint depths of each person. According to the processing pipeline, existing methods for multi-person 3D pose estimation can be roughly classified into two categories: 1) the top-down paradigm and 2) the bottom-up paradigm. Top-down methods first detect each person and then estimate the 3D pose of each one separately, while bottom-up methods first detect all keypoints and then group them to form the 3D pose of each person.

Top-down Methods. As a typical top-down paradigm, LCR-Net++ [165] is built on the common two-stage anchor-based detection framework. It first collects pose candidates from the anchor proposals and then determines the final output by score ranking. Similarly, the work in [130] is also built on an anchor-based detection framework. It estimates the absolute 3D root location and the root-relative pose with separate network branches, based on the detected person area and its bounding box location.
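Schematically, such a top-down pipeline can be written as below; detector, root_net, and pose_net are hypothetical placeholder callables, and the root back-projection follows the generic pinhole model rather than any specific paper.

```python
import numpy as np

def topdown_multiperson_3d(image, detector, root_net, pose_net, K):
    """Detect people, estimate the absolute root depth and root-relative 3D pose
    per crop, then compose absolute 3D poses (schematic sketch only)."""
    poses_3d = []
    for box in detector(image):                       # (x1, y1, x2, y2) per person
        crop = image[int(box[1]):int(box[3]), int(box[0]):int(box[2])]
        root_depth = root_net(crop, box)              # scalar depth of the pelvis
        rel_pose = pose_net(crop)                     # (J, 3) root-relative pose
        # Back-project the box center to 3D using the camera intrinsics K.
        cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
        root_xyz = np.linalg.inv(K) @ np.array([cx, cy, 1.0]) * root_depth
        poses_3d.append(rel_pose + root_xyz)          # absolute 3D pose
    return poses_3d
```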

Bottom-up Methods. Following this paradigm, Zanfir et al. [215] propose a bottom-up multi-stage framework for monocular multi-person 3D pose estimation. They first estimate volumetric heatmaps from a single image to determine the 3D keypoint locations, then predict the confidence scores of all possible connections between detected keypoints to form the limbs, and finally perform skeleton grouping to assign the limbs to different persons. Moreover, to deal with the occlusion problem in multi-person scenes, Mehta et al. [126] develop Occlusion-Robust Pose-Maps (ORPM) that encode redundant occlusion information in the part affinity maps. Besides, they propose the first multi-person 3D pose dataset, MuCo-3DHP, which greatly promotes the development of this field. Furthermore, Fabbri et al. [41] propose to estimate volumetric heatmaps in an encoder-decoder manner and regress the multi-person 3D poses from them; the intermediate encoded features are supervised by the encoded features of the ground-truth keypoint heatmaps to compress the volumetric heatmaps. Differently, PandaNet [12] is an anchor-based single-shot model for multi-person 3D pose estimation that directly predicts the 2D/3D pose at each anchor position. In addition, SMAP [224] estimates multiple maps representing the body root depth and the part relative depth at each position; to group the keypoints on the heatmaps, it uses the estimated depth to determine the association. HMOR [198] models multi-person interaction relations for better performance by hierarchically estimating ordinal relations at the instance, part, and joint levels.

Fig. 10: The representative framework of mesh-based 3D pose estimation via the SMPL model.

IV-B Mesh-based 3D Pose Estimation

Mesh-based 3D pose estimation from a single image has attracted a lot of attention because it can provide extra body shape information beyond the locations of keypoints. Recent works in this area can be roughly classified into two categories. The first kind directly regresses the 3D human body shape from an input image, representing the human body as a 3D mesh with thousands of vertices. For example, based on a pre-defined mesh topology, GraphCMR [92] regresses the 3D mesh vertices using graph convolutional neural networks (GCNNs). Similarly, Pose2Mesh [31] proposes a cascaded model using GCNNs. Differently, I2L-MeshNet [132] proposes an image-to-lixel (line+pixel) prediction network, which predicts the per-lixel likelihood on 1D heatmaps to regress each mesh vertex coordinate.

Different from mesh vertex regression, the other kind of methods takes a statistical 3D human model, such as SMPL [117], as the representation, which brings a strong geometric prior. In this way, they formulate 3D pose estimation as estimating the SMPL pose and shape parameters. As shown in Fig. 10, the general framework directly estimates the camera and SMPL parameters from a single-person 2D RGB image. As introduced in Sec. II-A, the 3D human body mesh can be derived from the SMPL parameters, and the 3D keypoints can be regressed from the mesh. In this area, most works focus on how to 1) solve the 3D data shortage problem, 2) facilitate more proper representations for mesh-based 3D pose estimation, and 3) deal with multi-person cases in real application scenes.
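To make this formulation concrete, the sketch below uses the open-source smplx package (the SMPL model file must be downloaded separately; the path is an assumption) to map pose/shape parameters to a mesh and 3D joints, followed by a weak-perspective projection with arbitrary illustrative camera values.

```python
import torch
import smplx   # open-source SMPL implementation; model files downloaded separately

# Build the SMPL body model (the 'models/' path is an assumption).
body_model = smplx.create('models/', model_type='smpl', gender='neutral')

# A typical regressor predicts 72 pose parameters (24 joints x 3 axis-angle),
# 10 shape parameters, and a weak-perspective camera (s, tx, ty).
pose = torch.zeros(1, 72)
betas = torch.zeros(1, 10)

out = body_model(global_orient=pose[:, :3], body_pose=pose[:, 3:], betas=betas)
vertices = out.vertices      # (1, 6890, 3) body mesh
joints_3d = out.joints       # (1, J, 3) 3D joints regressed from the mesh

# Weak-perspective projection of the joints for a 2D re-projection loss.
s, tx, ty = 5.0, 0.0, 0.0
joints_2d = s * (joints_3d[..., :2] + torch.tensor([tx, ty]))
```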

IV-B1 Solving the Lack of Data

Due to the data shortage, it is desirable to use all available 2D/3D pose data for supervision, and many approaches develop various loss functions to supervise different aspects of the estimated body mesh. Human Mesh Recovery (HMR) [83] learns from unpaired 2D pose data and 3D motion capture (MoCap) data. Learning only from 2D poses without depth-wise supervision would result in unreasonable 3D poses and shapes; instead, HMR uses the MoCap data to supervise the rationality of the estimated parameters in a generative adversarial manner, where a discriminator is trained to determine whether the estimated SMPL pose and shape parameters are reasonable.
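The adversarial prior can be sketched as a small discriminator over the 82 SMPL pose and shape parameters trained with least-squares GAN losses (PyTorch assumed); HMR [83] actually factorizes this into per-joint and shape discriminators, so the single network below is a simplification.

```python
import torch
import torch.nn as nn

class PoseParamDiscriminator(nn.Module):
    """Scores whether an (unpaired) SMPL parameter vector looks like one drawn
    from real MoCap data."""
    def __init__(self, param_dim=82):   # 72 pose + 10 shape parameters
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(param_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, params):
        return self.net(params)         # (B, 1) realism score

def adversarial_prior_losses(disc, fake_params, real_params):
    """Least-squares GAN losses: the mesh regressor is pushed to produce
    parameters the discriminator cannot tell apart from MoCap samples."""
    d_loss = ((disc(real_params) - 1) ** 2).mean() + (disc(fake_params.detach()) ** 2).mean()
    g_loss = ((disc(fake_params) - 1) ** 2).mean()
    return d_loss, g_loss
```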

To guide the model to explicitly learn from the existing data, some approaches supervise with the inherent properties of 2D images. To learn from temporal dynamics, a 3D human dynamics model [84] is trained to estimate the 3D poses of the current, past, and future frames; to transfer static images to a motion sequence, a hallucinator is learned to estimate the features of past and future motion for synthesis. To learn from temporal smoothness, Kocabas et al. [88] develop a temporal network named VIBE. Following HMR, they employ a motion discriminator to supervise the rationality of the predicted motion sequences in a generative adversarial manner; specifically, via Gated Recurrent Units (GRUs), the SMPL parameters describing the motion sequence are mapped to a latent representation at each time step. In addition to exploiting temporal information, TexturePose [149] utilizes the appearance consistency of the same person among multiple viewpoints or adjacent video frames for supervision. Body textures are mapped from 2D images to UV maps, which semantically align the multi-view or sequential textures, and only the parts visible in multiple UV maps of each person are supervised with texture consistency.

Moreover, some methods develop more detailed supervision. For example, HoloPose [56] proposes a multi-task network that estimates DensePose [57], 2D and 3D keypoints, along with a part-based 3D reconstruction, and an iterative refinement method improves the alignment between the model-based 3D estimates of 2D/3D keypoints and DensePose. Besides, Human Mesh Deformation (HMD) [230] utilizes additional information, including body keypoints, silhouettes, and per-pixel shading, to refine the estimated 3D mesh. Via hierarchical mesh projection and deformation refinement, the body mesh is well aligned with the person in the input 2D image. SMPLify [14] proposes to estimate the 3D human mesh by fitting the SMPL model to the predicted 2D keypoints and minimizing the re-projection error. The SMPL oPtimization IN the loop (SPIN) method [91] tries to combine the advantages of regression-based and optimization-based methods by using SMPLify to refine the estimated results within the training loop, which provides additional 3D supervision.
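The optimization-based side of this combination boils down to a re-projection fitting loop. A minimal sketch follows (PyTorch assumed; body_model and camera_project are placeholder callables, and the pose priors and interpenetration terms used by SMPLify [14] are omitted).

```python
import torch

def fit_to_2d_keypoints(body_model, keypoints_2d, conf, camera_project, n_iters=100):
    """Optimize SMPL pose/shape so that the projected model joints match detected
    2D keypoints (confidence-weighted), with a simple shape regularizer."""
    pose = torch.zeros(1, 72, requires_grad=True)
    betas = torch.zeros(1, 10, requires_grad=True)
    opt = torch.optim.Adam([pose, betas], lr=0.01)

    for _ in range(n_iters):
        opt.zero_grad()
        out = body_model(global_orient=pose[:, :3], body_pose=pose[:, 3:], betas=betas)
        proj = camera_project(out.joints)                      # (1, J, 2)
        reproj = (conf * (proj - keypoints_2d) ** 2).sum()     # re-projection error
        shape_reg = (betas ** 2).sum()                         # keep shape plausible
        loss = reproj + 1e-2 * shape_reg
        loss.backward()
        opt.step()
    return pose.detach(), betas.detach()
```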

IV-B2 Model Representations

Considering that regressing all parameters from a global image feature is ambiguous for the complex body mesh, more and more researchers focus on exploring more proper representations for mesh-based 3D pose estimation. For example, GraphCMR [92] utilizes a graph-based representation of the 3D body mesh: via a graph convolution network (GCN), the 3D location of each mesh vertex is estimated at the corresponding node, and the SMPL parameters can then be estimated from these vertices. Based on DensePose, DenseRaC [210] uses the estimated IUV map as an intermediate representation to estimate the SMPL parameters; specifically, a differentiable renderer is employed to render the estimated body mesh back to an IUV map and compare it with the input for supervision. Sun et al. [183] develop a skeleton-disentangled representation using a bilinear transformation to tackle the feature coupling problem between the 2D pose and the other details. They also employ a transformer-based network to learn temporal smoothness, where an unsupervised adversarial training strategy learns the motion dynamics by ordering shuffled frames.

Different from recovering a single-body-part 3D mesh, some works extend the research to representations for multi-body-part 3D mesh recovery. For example, SMPL+H [166] integrates a 3D hand model into the SMPL body model to jointly recover the 3D mesh of the body and hands. Xiang et al. [205] propose the MTC method, which uses separate CNNs to estimate the body, hands, and face, and then jointly fits the Adam [81] model to the outputs of all body parts. SMPL-X [148] combines the FLAME head model [104] with SMPL+H, and learns pose-dependent blend shapes by fitting the model to 3D scan data. SMPLify-X [148] is proposed to recover the whole-body 3D mesh by iteratively fitting SMPL-X to the 2D keypoints of the face, hands, and body.

IV-B3 Multi-person 3D Mesh Recovery

Although great progress has been made on monocular 3D human pose and shape estimation for the single-person case, it is crucial to handle multi-person cases with truncation, environmental occlusion, and person-person occlusion. Existing multi-stage methods equip the single-person pipeline with a 2D person detector to handle multi-person scenes. Different from 2D/3D keypoint estimation, which only estimates dozens of body joints, recent works also attempt to explore the particularities of 3D mesh recovery. For example, Zanfir et al. [214] propose to use natural scene constraints in multi-person scenes. To get the initial 3D body meshes, they fit the SMPL model to the 3D poses and semantic segmentation estimated from the image. To exclude overlapping volume occupancy, they add a collision constraint to the objective function; meanwhile, the ground plane is estimated to model the interactions between the plane and all human subjects. Furthermore, Jiang et al. [75] propose the coherent reconstruction of multiple humans (CRMH) for multi-person 3D mesh recovery. They build their method on Faster-RCNN [161], where the RoI-aligned features are used to predict the SMPL parameters. Specifically, they develop a differentiable interpolation loss to avoid collisions between body meshes, and, to learn the correct depth ordering between multiple persons, they supervise the rendering of the multi-person body meshes with instance segmentation. Recently, Sun et al. [182] present a real-time Center-based Human Mesh Recovery network (CenterHMR), a novel bottom-up single-shot method. The model is trained to simultaneously predict two feature maps, which represent the location of each human body center and the corresponding 3D human mesh parameter vector at each center, respectively. The explicit center-based representation guarantees pixel-level feature encoding, and the 3D mesh of each person is estimated from features centered at the visible body parts, which improves robustness under occlusion. Besides, to deal with crowded cases with severe overlapping, an occlusion-aware center representation is proposed.
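Decoding such a center-based single-shot representation amounts to finding local maxima on the center heatmap and reading the parameter vector stored at each peak; a simplified sketch (PyTorch assumed; the threshold and pooling window are illustrative) is shown below.

```python
import torch
import torch.nn.functional as F

def decode_center_based_mesh(center_map, param_map, threshold=0.3):
    """center_map: (H, W) body-center heatmap; param_map: (C, H, W) map holding
    an SMPL+camera parameter vector at every pixel. Returns one vector per person."""
    heat = torch.sigmoid(center_map)
    pooled = F.max_pool2d(heat[None, None], kernel_size=3, stride=1, padding=1)[0, 0]
    peaks = (heat == pooled) & (heat > threshold)        # local maxima above threshold
    ys, xs = torch.nonzero(peaks, as_tuple=True)
    return [param_map[:, y, x] for y, x in zip(ys, xs)]  # one parameter vector per person
```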

In summary, both the skeleton-based and mesh-based methods for 3D pose estimation have been greatly improved. With the aid of 2D pose estimation and unsupervised or weakly supervised learning, the lack of 3D data can be alleviated to a certain extent. Meanwhile, with further exploration of 3D body representations, mesh-based 3D pose estimation is moving towards more accurate and efficient solutions.

Fig. 11: Some example annotated images selected from the monocular 2D (first row) and 3D pose benchmarks.

V Evaluation Metrics and Datasets

V-A Evaluation Metrics

V-A1 Evaluation Metrics of 2D Pose Estimation

The evaluation of 2D pose estimation aims to measure the accuracy of the predicted 2D locations. According to the characteristics of the datasets, the widely used evaluation metrics include the Percentage of Correct Parts (PCP) [47], the Percentage of Correct Keypoints (PCK) [213], and Average Precision (AP) [213, 111], which are introduced as follows.

Percentage of Correct Parts (PCP) [47] measures the accuracy of body part prediction. A body part prediction is considered correct if the estimated two endpoints of the corresponding limb are within a threshold (a fraction, typically 50%, of the limb length) of the ground-truth endpoints. Specifically, PCPm in [5] uses a fraction of the mean ground-truth segment length over the entire test set as the matching threshold. However, PCP has the drawback that foreshortening affects the measurement of body parts seen from different views and at different ranges.

Percentage of Correct Keypoints (PCK) [213] is a widely used metric to measure the accuracy of 2D keypoint prediction. In [213], the threshold on the distance between a predicted keypoint and the ground truth is defined as a fraction of the person bounding box size. Similarly, PDJ, the Percentage of Detected Joints, sets the threshold as a pixel radius normalized by the torso height of each test sample [169]. PCKh@0.5 is a slight modification of PCK: it adopts 50% of the head segment length of the testing person as the matching threshold. By using the head size as a reference, PCKh makes the measurement independent of articulation. By altering the threshold percentage, the Area Under the Curve (AUC) can be generated to further evaluate the power of different pose estimation algorithms.
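A minimal NumPy sketch of PCKh, assuming per-person head segment lengths and visibility masks are available:

```python
import numpy as np

def pckh(pred, gt, head_sizes, visible, alpha=0.5):
    """PCKh@alpha: a predicted keypoint counts as correct if its distance to the
    ground truth is below alpha times the head segment length of that person.

    pred, gt:   (N, J, 2) predicted / ground-truth 2D keypoints.
    head_sizes: (N,) head segment length per person.
    visible:    (N, J) boolean mask of annotated keypoints."""
    dist = np.linalg.norm(pred - gt, axis=-1)            # (N, J)
    correct = dist <= alpha * head_sizes[:, None]
    return (correct & visible).sum() / visible.sum()
```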

Average Precision (AP), first proposed as the Average Precision of Keypoints (APK) in [213], measures pose estimation in a realistic setting where no annotated bounding boxes are given at test time. A detected keypoint candidate is considered correct (a true positive) if it is within a threshold of the ground truth; each keypoint separately calculates its correspondence with the ground-truth poses, so AP correctly penalizes both missed detections and false positives. In [111], for multi-person pose estimation, AP is calculated by measuring the Object Keypoint Similarity (OKS). Similar to IoU in object detection, OKS measures the similarity between the predictions and the ground truths:

OKS = \frac{\sum_i \exp\left(-d_i^2 / (2 s^2 k_i^2)\right)\, \delta(v_i > 0)}{\sum_i \delta(v_i > 0)},   (1)

where d_i is the Euclidean distance between the i-th detected keypoint and the corresponding ground truth, v_i is the ground-truth visibility flag, s is the person scale, and k_i is a per-keypoint constant that controls falloff. For each keypoint, the similarity ranges between 0 and 1.

Given the OKS over all labeled keypoints, the average precision (AP) and average recall (AR) can be computed, and by varying the OKS threshold a precision-recall curve can be obtained. AP and AR at different OKS thresholds thoroughly reflect the performance of the tested algorithms. For the COCO dataset [111], 10 metrics are used to evaluate a keypoint detector: AP (the mean AP over 10 OKS thresholds, 0.50:0.05:0.95), AP at OKS = 0.50, AP at OKS = 0.75, AP for medium objects, AP for large objects, and the corresponding AR, AR at OKS = 0.50, AR at OKS = 0.75, AR for medium objects, and AR for large objects. Breaking the scores down by OKS threshold and object size is useful for understanding which cases are more difficult than others. These metrics have been widely used for evaluating multi-person pose estimation.
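A minimal NumPy sketch of OKS that mirrors the official COCO evaluation, where the falloff constant k_i is derived from per-keypoint sigmas and s^2 is the ground-truth segment area:

```python
import numpy as np

# Per-keypoint sigmas from the official COCO evaluation; the falloff constant
# k_i in Eq. (1) is set to 2 * sigma_i there.
COCO_SIGMAS = np.array([.026, .025, .025, .035, .035, .079, .079, .072, .072,
                        .062, .062, .107, .107, .087, .087, .089, .089])

def oks(pred, gt, visibility, area, sigmas=COCO_SIGMAS):
    """Object Keypoint Similarity between one predicted and one ground-truth pose.
    pred, gt: (17, 2) keypoints; visibility: (17,) flags; area: ground-truth
    segment area (s^2 in Eq. (1))."""
    d2 = ((pred - gt) ** 2).sum(axis=-1)                 # squared distances d_i^2
    k2 = (2 * sigmas) ** 2
    labeled = visibility > 0
    e = d2[labeled] / (2 * area * k2[labeled])
    return np.exp(-e).mean()                             # average over labeled keypoints
```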

V-A2 Evaluation Metrics of 3D Pose Estimation

The Mean Per Joint Position Error (MPJPE) is the most widely used evaluation metric for 3D pose estimation. It measures the average Euclidean distance, in millimeters, from the 3D pose predictions to the ground truth. The predicted and ground-truth keypoints are aligned at the pelvis keypoint before comparison.

Procrustes Aligned MPJPE (PA-MPJPE) is a modification of MPJPE obtained by rigidly aligning the predicted pose with the ground truth before computing the error in millimeters; it is also called the reconstruction error. Through Procrustes alignment, the effects of translation, rotation, and scale are eliminated, which makes PA-MPJPE focus on evaluating the accuracy of the reconstructed 3D skeleton.
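Both metrics are straightforward to compute; a NumPy sketch of MPJPE and its Procrustes-aligned variant (similarity alignment via SVD) follows.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error; pelvis alignment is assumed to be done already."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pa_mpjpe(pred, gt):
    """Procrustes-aligned MPJPE: remove scale, rotation, and translation via a
    similarity transform before computing the error. pred, gt: (J, 3)."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    p, g = pred - mu_p, gt - mu_g
    # Optimal rotation/scale from the SVD of the cross-covariance matrix.
    U, S, Vt = np.linalg.svd(p.T @ g)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:            # avoid reflections
        Vt[-1] *= -1
        S[-1] *= -1
        R = Vt.T @ U.T
    scale = S.sum() / (p ** 2).sum()
    aligned = scale * p @ R.T + mu_g
    return mpjpe(aligned, gt)
```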

3D PCK & AUC. 3D PCK is the 3D version of the PCK metric. The threshold of success prediction is usually set to 50mm or 150mm in different methods. Correspondingly, the AUC, which is the total area under the PCK-threshold curve, is calculated by computing PCKs by varying the threshold from 0 to 200 mm.

The Mean Per Joint Angle Error (MPJAE) measures the angle between the predicted keypoint orientation and the ground-truth orientation in degrees. The orientation difference is measured as the geodesic distance in SO(3); the detailed definition can be found in [157]. Only the angles of the four limbs and the root are used for evaluation.

Procrustes Aligned MPJAE (PA-MPJAE) measures the MPJAE normalized by the rotation matrix on all predicted orientations. The rotation matrix is obtained from the Procrustes alignment. Similarly, PA-MPJAE neglects the global mismatch.

V-B Datasets

The rapid development of related datasets has boosted deep learning-based pose estimation methods: publicly available datasets provide training sources and fair comparisons between different methods. Considering the dataset scale and the diversity of poses and scenes, in this section we introduce the representative datasets of recent years. Most of them are high-quality, large-scale datasets with good annotations in different shooting scenes.

V-B1 2D Pose Datasets

We introduce the 2D datasets according to the categories of 1) image-level or video-level, and 2) single person or multiple persons. The dataset summary is given in Table III, and some example images are shown in Fig. 11.

Dataset Type Year Image/Video Num of Kpts Train set Val set Test set Evaluation
LSP [78] Single 2010 Image 14 1K 0 1K PCK
FLIC [169] Single 2013 Image 10  5K 0  1K PCK
MPII Single Person [5] Single 2014 Image 16 29K 0 12K PCK
MPII Multi-Person [5] Multiple 2014 Image 16 3.8K 0  1.7K PCK
COCO 17 [111] Multiple 2016 Image 17 57K 5K 20K mAP
AI-Challenger [203] Multiple 2017 Image 14 210K 30K 60K mAP
CrowdPose [100] Multiple 2019 Image 14 10K 2K 8K mAP
HiEve [112] Multiple 2020 Both 14 33K 0 16K mAP
J-HMDB [74] Single 2013 Video 15 0.6K 0 0.3K PCK
Penn Action [220] Single 2013 Video 13  1K 0  1K PCK
PoseTrack [4] Multiple 2017 Video 15 0.29k 0.05k 0.2K mAP
TABLE III: 2D pose estimation datasets. Kpts is short for keypoints. For the train/val/test set, the number of images in image-based datasets and the number of videos in video-based datasets are listed.

Image-level 2D Single Person Dataset:

Leeds Sports Pose (LSP) Dataset [78] is collected from Flickr using the tags of eight sport activities (athletics, badminton, baseball, gymnastics, parkour, soccer, tennis, and volleyball). The dataset contains 2,000 images, of which 1,000 are used for training and the remaining 1,000 for testing. Each person is labeled with 14 full-body keypoints. Compared with newly released datasets, LSP is relatively small-scale and serves as an initial benchmark for single-person pose estimation methods.

Frames Labeled in Cinema (FLIC) Dataset [169] contains 5,003 images collected from Hollywood movies. The authors run the person detector Poselets [15] on every tenth frame of 30 movies. Originally, 20K candidates are selected to be labeled with 10 upper-body keypoints via the crowdsourcing marketplace Amazon Mechanical Turk. Images with occluded or severely non-frontal persons are filtered out, and 1,016 images are finally selected as the testing set.

Method Publication Backbone Input size Keywords of Network PCKh@0.5
Tompson et al. [191] NeurIPS’14 AlexNet Keypoint heatmap, multi-resolution, MRF spatial model 79.6
Carreira et al. [19] CVPR’16 GoogleNet Self-correcting model with iterative update 81.3
Tompson et al. [190] CVPR’15 AlexNet Position refinement model to predict joint offset location 82.0
Hu et al. [64] CVPR’16 VGG Hierarchical Rectified Gaussian model 82.4
Pishchulin et al. [156] CVPR’16 VGG Combine detection and pose estimation based on Fast R-CNN 82.4
Lifshitz et al. [109] ECCV’16 VGG Each pixel in the image votes for the positions of keypoints 85.0
Gkioxari et al. [53] ECCV’16 InceptionNet Chained model, each keypoint relies on its previous ones 86.1
Sun et al. [180] ICCV’17 ResNet-50 Add bone-based representation as constraints 86.4
Insafutdinov et al. [70] ECCV’16 ResNet-152 Keypoint geometric and appearance constraints 88.5
Wei et al. [202] CVPR’16 CPM Convolutional Pose Machines, intermediate supervision 88.5
Newell et al. [135] ECCV’16 Hourglass Hourglass model, intermediate supervision 90.9
Sun et al. [178] ICCV’17 Hourglass Two-stage normalization, multi-scale supervision and fusion 91.0
Tang et al. [186] ECCV’18 Hourglass Order-K dense connectivity, quantification to low bit 91.2
Luvizon et al. [121] CG’19 Hourglass Multi-stage, soft-argmax on heatmaps 91.2
Nie et al. [138] TIP’19 Hourglass Hierarchical contextual refinement network
Chu et al. [34] CVPR’17 Hourglass multi-resolution attention maps, Hourglass Residual Units 91.5
Chen et al. [25] ICCV’17 En/Decoder GAN, multi-task for poses and occlusion parts, structure-aware 91.9
Yang et al. [212] ICCV’17 Hourglass pyramid residual module to learn various scales 92.0
Ke et al. [85] ECCV’18 Hourglass Multi-scale supervision, structure-aware loss, keypoint masking 92.1
Tang et al. [185] ECCV’18 Hourglass Hierarchical compositional model, bone-based representation 92.3
Sun et al. [179] CVPR’19 HRNet Maintain high-resolution maps, multi-branch/scale fusion 92.3
Zhang et al. [218] ArXiv’19 Hourglass Cascade fusion, graph neural network for refinement 92.5
Tang et al. [184] CVPR’19 Hourglass data-driven keypoint grouping, part-based branching network 92.7
TABLE IV: Performance of the representative 2D single pose estimation methods on the MPII test set.

MPII Dataset [5] is a large-scale dataset containing rich activities and diverse capture environments, both indoor and outdoor. It is collected from 3,913 YouTube videos spanning 491 different activities, from which a total of 24,920 frames are extracted. The annotations are conducted by in-house workers on Amazon Mechanical Turk (AMT) and include the 2D locations of 16 keypoints, full 3D torso and head orientation, occlusion labels for keypoints, and activity labels. Adjacent video frames are also available for motion information. In total, 40,522 people are labeled, of which 28,821 are used for training and 11,701 for testing. The MPII dataset has been widely used for pose estimation and other pose-related tasks. Table. IV shows the state-of-the-art methods evaluated on the MPII test set. Since the poses are relatively simple, the accuracy of detected 2D keypoints is high and performance is close to saturation.

Method Publication Backbone Input size Keywords of Network mAP
Bottom-up: keypoint detection and grouping
OpenPose [18] CVPR’17 CMU-Net Multi-stage, part affinity fields.
Asso. Emb. [134] NeurIPS’17 Hourglass Hourglass for pose, associative embedding for grouping.
PersonLab [146] ECCV’18 ResNet Multi-task, short/mid/long-rang offsets for grouping.
MultiPoseNet [89] ECCV’18 ResNet Multi-task, Pose Residual Network.
PifPaf [93] CVPR’19 ResNet Part Intensity Field and Part Association Field
Li et al. [99] AAAI’20 IMHN Hourglass network with spatial/channel attention
HigherHRNet [28] CVPR’20 HRNet Feature pyramid, associative embedding.
Top-down: human detection and single-person keypoint detection
Mask-RCNN [59] ICCV’17 ResNet Faster-RCNN, multi-task, RoIAlign.
G-RMI [147] CVPR’17 ResNet Full convolutions, keypoint offsets, keypoint NMS.
Integral Regre. [181] ECCV’18 ResNet Integral loss and keypoint regression.
CPN [27] CVPR’18 ResNet GlobalNet and RefineNet, online hard keypoint mining.
RMPE [45] ICCV’17 PyraNet Spatial transformer, parametric NMS, proposals generator.
CFN [67] ICCV’17 Inception Multi-level supervision and fusion, coarse to fine.
SimpleBaseline [206] ECCV’18 ResNet Deconvolution pose head network.
CSM+SCASRB [177] CVPR’19 ResNet Channel shuffle module, spatial channel-wise attention.
HRNet-W [179] CVPR’19 HRNet Maintain high-resolution, multi-branch/scale fusion.
MSPN [105] arXiv’19 MSPN Multi-stage feature aggregation, coarse-to-fine loss.
DARK [217] CVPR’20 HRNet Distribution-aware coordinate representation of keypoint.
UDP [65] CVPR’20 HRNet Unbiased data processing.
PoseFix [131] CVPR’19 HR+ResNet Pose refinement, error statistics to generate synthetic poses.
Graph-PCNN [199] CVPR’20 HRNet Two-stage graph-based and model-agnostic framework.
RSN [17] CVPR’20 4-RSN Residual steps network, pose refine machine.
TABLE V: Performance of the representative 2D multi-person pose estimation methods on the COCO test-dev set. For bottom-up methods, the reported results use multi-scale testing.

Image-level 2D Multi-Person Dataset:

Microsoft Common Objects in COntext (MSCOCO) Dataset [111] contains annotations for object detection, panoptic segmentation, and keypoint detection. The images are collected from websites including Google, Bing, and Flickr, and the annotations are performed by workers on Amazon Mechanical Turk (AMT). The dataset contains over 200K images and 250K person instances. Along with the dataset, the COCO Keypoint Detection Challenge has been held every year since 2016. The dataset has two versions that differ in the split of the training and validation sets: in the latest 2017 version, the training/validation split is 118K/5K images instead of the previous 83K/41K. The test set contains 20K images, whose annotations are held out by the official testing server. Besides, 120K unlabeled images following the same class distribution as the labeled images are also released, which may be used for semi-supervised learning. For keypoint detection, 17 keypoints are labeled along with the visibility tag, bounding box, and body segmentation area. The COCO dataset has been a widely used evaluation benchmark and serves as auxiliary data for pose-related tasks such as action recognition and person ReID. Table. V shows the performance of the state-of-the-art methods on the COCO test set. RSN [17] achieves the highest mAP, showing the superiority of the top-down methods. With the improvement of network backbones and keypoint grouping methods, the bottom-up methods have developed rapidly, with HigherHRNet [28] obtaining the best bottom-up result. The bottom-up methods likely have the potential to achieve performance comparable with the top-down ones.

AI-Challenger Dataset [203], also referred to as the Human Skeletal System Keypoint Detection Dataset (HKD), contains 300K high-resolution images for keypoint detection and Chinese captioning, and 81,658 images for zero-shot recognition. This large-scale dataset contains multiple persons and various poses. Each person is labeled with a bounding box and 14 keypoints. The whole dataset is divided into a training set, validation set, test A set, and test B set with 210K, 30K, 30K, and 30K images, respectively. Due to its large scale, high resolution, and rich scenes, the AI-Challenger dataset has been widely used as an auxiliary dataset for 2D/3D pose estimation network training and pose-related tasks.

CrowdPose Dataset [100] is intended for better evaluating human pose estimation methods in crowded scenes. The images are collected from MSCOCO (person subset), MPII, and AI Challenger by measuring the Crowd Index, which is defined to evaluate the crowding level of an image. 30K images are analyzed via the Crowd Index and 20K high-quality images are finally selected. Next, 14 keypoints and full-body bounding boxes are annotated for about 80K persons. The training, validation, and testing subsets are split in the proportion 5:1:4. Since detecting either person bounding boxes or keypoints in crowded scenes is relatively hard, CrowdPose remains a challenging benchmark for multi-person pose estimation.

Video-level 2D Single Person Dataset:

J-HMDB Dataset [74], short for joint-annotated HMDB, is a subset of the HMDB51 database [94] that contains over 5,100 clips of 51 human actions. J-HMDB Dataset contains 928 clips with 21 action categories. Each action class contains 36-55 clips. Each clip includes 15-40 frames. 31,838 images are annotated via a 2D puppet model [231] on Amazon Mechanical Turk. Up to 15 visible body keypoints are labeled, along with the scale, viewpoint, segmentation, puppet mask, and puppet flow. The ratio for the number of training and testing images is roughly 7:3. J-HMDB dataset has been widely used in the task of pose estimation in videos and action recognition.

Penn Action Dataset [220] is another unconstrained video dataset that contains 2,326 video clips covering 15 actions. The training set and testing set both have 1,163 video clips. The dataset contains various intra-class actor appearances, action execution rate, viewpoint, spatio-temporal resolution, and complicated natural backdrops. Annotation is conducted via a semi-automated video annotation tool deployed on Amazon Mechanical Turk. Each person is annotated with 13 keypoints with 2D coordinates, visibility, and camera viewpoints.

Method Publication Keywords Total mAP Total MOTA
Bottom-up: keypoint detection and grouping
ArtTrack [69] CVPR’17 Spatio-temporal clustering and grouping for keypoints 59.4 48.1
PoseTrack [73] CVPR’17 Bottom-up pose estimation and spatial-temporal graph 59.4 48.4
JointFlow [37] BMVC’18 Temporal flow fields to propagate keypoints across frames 63.4 53.1
TML++ [68] IJCNN’19 Temporal flow maps for keypoints, multi-stride frame sampling 68.8 54.5
STAF [158] CVPR’19 Spatial-temporal affinity fields across a sequence 70.3 53.8
Top-down: person detection, keypoint detection, and data association
Detect-Track [52] CVPR’18 3D Mask R-CNN for each clip, bounding box IOU for similarity metric 59.6 51.8
PoseFlow [207] BMVC’18 Pose flow building in short clips, pose flow NMS post-processing 63.0 51.0
LightTrack [144] CVPRW’20 Combining single-person pose tracking and object tracking, Siamese GCN 66.7 58.0
POINet [167] ACM MM’19 Unifying feature extraction and data association by ovonic insight network 72.5 58.4
PGPT [9] TMM’20 Combining detection and single object tracking, GCN for data association 72.6 60.2
KeyTrack [174] CVPR’20 Transformer-based pose tracking, parameter-free refinement 74.0 61.2
DetTrack [201] CVPR’20 3D HRNet for each clip, spatio-temporal merging 74.1 64.1
Self-Sup. [194] ECCV’20 Self-supervised keypoint correspondences
FlowTrack [206] ECCV’18 Bounding boxes propagation and optical flow-based temporal similarity 74.6 57.8
TABLE VI: Performance of the representative multi-person pose estimation and tracking methods on PoseTrack 2017 test set.

Video-level 2D Multi-Person Dataset:

PoseTrack Dataset [73, 4] is the first large-scale multi-person pose estimation and tracking dataset. It is collected from the unlabelled videos of the MPII Multi-Person Pose dataset [5] and has two versions, i.e., PoseTrack 2017 and PoseTrack 2018. PoseTrack 2017 contains 550 videos split into 292, 50, and 208 videos for training, validation, and testing, respectively; in total, 23,000 frames are annotated with 153,615 pose labels. PoseTrack 2018 is its extended version, containing 593 training videos, 170 validation videos, and 375 testing videos. For each video in the training set, the middle 30 frames are annotated; for the validation and testing sets, the middle 30 frames as well as every fourth frame are annotated. The labels contain 15 2D keypoints, a unique person ID, and the head bounding box for each person. PoseTrack is challenging since the videos contain large variations in pose appearance and scale, along with body part occlusion and truncation. It has been a widely used benchmark to evaluate multi-person pose estimation and tracking algorithms.

Table. VI presents the performance of the representative methods on the PoseTrack 2017 test set. The multi-step top-down methods show superior performance over the bottom-up methods, while the latter ones are more efficient. The pose estimation task only relies on the accuracy of keypoint prediction, while the pose tracking also needs a solid and robust data association scheme. With the development of pose estimation and data association, pose tracking has the potential to achieve better performance with higher efficiency.

Human-in-Events (HiEve) Dataset [112] is a large-scale video-based dataset for realistic events, especially crowded and complex events. It contains annotations for 2D poses, actions, trajectory tracking, and pose tracking. The dataset is collected from 9 realistic scenes containing 49,820 frames with annotations for 1,302,481 bounding boxes, 2,687 track IDs, 56,643 actions (14 action categories), and 1,099,357 human 2D poses. The 2D pose labels contain 14 keypoints; persons with heavy occlusion or small bounding boxes (less than 500 pixels) are filtered out. HiEve is the largest-scale human-centric dataset to date and will be useful for many human behavior analysis tasks.

Datasets Total Frames Camera Viewpoints Annotation Type (3DP / 2DP / Mesh / ITW / Real) Code Link
Human3.6M [71] 3.6 M 4 http://vision.imar.ro/human3.6m/description.php
HumanEva-I [173] 0.037 M 7 http://humaneva.is.tue.mpg.de/
CMU Panoptic [79] 1.5 M 31 http://domedb.perception.cs.cmu.edu/
3DPW [196] 0.051 M 1 https://virtualhumans.mpi-inf.mpg.de/3DPW/
MPI-INF-3DHP [125] 1.3 M 14 http://gvv.mpi-inf.mpg.de/3dhp-dataset/
JTA [42] 0.46 M 1 https://github.com/fabbrimatteo/JTA-Dataset
Varol et al. [195] 6.0 M 1 https://www.di.ens.fr/willow/research/surreal/data/
TABLE VII: The details of the widely used 3D pose estimation datasets.

V-B2 3D Pose Datasets

We introduce the widely used 3D single person datasets, multi-person datasets, and the analysis of benchmarks in this part. The widely used 3D pose benchmarks are summarized in Table VII and the example images are shown in Fig. 11.

3D Single Person Dataset:

Human3.6M [71] is the most widely used multi-view single-person 3D human pose benchmark. The dataset is captured in a 4m x 3m indoor space using 4 RGB cameras, 1 time-of-flight sensor, and 10 motion cameras. It contains 3.6 million 3D human poses and the corresponding videos (50 FPS) covering 15 scenarios, such as discussion, sitting on a chair, taking a photo, etc. Both the 3D positions and angles of the keypoints are available. Currently, due to privacy concerns, only 7 subjects’ data are available. For evaluation, the videos are usually down-sampled to every 5th/64th frame to remove redundancy, and methods are commonly evaluated under two protocols. The first protocol trains on 5 subjects (S1, S5, S6, S7, S8) and tests on subjects S9 and S11; the second protocol shares the same train/test split but only evaluates the images captured from the frontal view.

HumanEva-I [173] is a single-person 3D pose dataset captured from 3 camera views at 60 Hz. It contains 4 subjects performing 6 actions. Related methods are usually evaluated on 3 actions (walking, jogging, and boxing) performed by 3 subjects (S1, S2, and S3).

MPI-INF-3DHP [125] is captured in a 14-camera studio using a commercial marker-less motion capture system to acquire the ground-truth 3D poses. It contains 8 actors performing 8 activities, with RGB videos recorded from a wide range of viewpoints; over 1.3M frames are captured from all 14 cameras. Besides the indoor videos of a single person, the authors also provide MATLAB code to generate a multi-person dataset, MuCo-3DHP, by mixing up segmented foreground human appearances. With the provided body part segmentation, researchers can also exchange the clothes and backgrounds using extra texture data.

MoVi [51] is a large-scale single-person video dataset with 3D MoCap annotations. Different from Human3.6M and MPI-INF-3DHP, it contains more subjects (60 female and 30 male). Each person performs a collection of 20 predefined actions and one self-chosen movement. The videos, synchronized with motion capture, are taken from two perspectives, front and side. Besides the 3D pose annotations and camera parameters, MoVi also provides the SMPL parameters obtained via MoSh++ [122].

Fig. 12: Pose space analysis for four 3D pose benchmarks: Human3.6M, 3DPW, MoVi, and MPI-INF-3DHP.
Method Publication Human3.6M (MPJPE / PA-MPJPE) HumanEva-I (PA-MPJPE) Code Link
Zhou et al. [229] CVPR’16 113.0 - - https://github.com/chuxiaoselena/SparsenessMeetsDeepness
Tome et al. [189] CVPR’17 88.4 - - https://github.com/DenisTome/Lifting-from-the-Deep-release
C2F [150] CVPR’17 71.9 51.9 25.5 https://github.com/geopavlakos/c2f-vol-demo
Lin et al. [110] CVPR’17 73.1 - 30.9 https://github.com/MudeLin/RPSM
Martinez et al. [123] ICCV’17 62.9 47.7 24.6 https://github.com/una-dinosauria/3d-pose-baseline
CHP [180] ICCV’17 92.4 59.1 - -
Fang et al. [46] AAAI’18 60.4 45.7 22.9 -
Pavlakos et al. [151] CVPR’18 56.2 41.8 18.3 https://github.com/geopavlakos/ordinal-pose3d
Luvizon et al. [120] CVPR’18 53.2 - - https://github.com/dluvizon/deephar
Rhodin et al. [163] CVPR’18 66.8 51.6 - -
IHP [181] ECCV’18 64.1 49.6 - https://github.com/JimmySuen/integral-human-pose
TP-Net [63] ECCV’18 58.2 44.1 22.0 https://github.com/rayat137/Pose_3D
Liu et al. [114] TPAMI’19 61.1 - - -
EpipolarPose [90] CVPR’19 51.8 45.0 - https://github.com/mkocabas/EpipolarPose
VideoPose3D [152] CVPR’19 46.8 36.5 19.7 https://github.com/facebookresearch/VideoPose3D
SemGCN [223] CVPR’19 57.6 - - https://github.com/garyzhao/SemGCN
OANet [29] ICCV’19 42.9 32.8 14.3 -
Xu et al. [209] CVPR’20 45.6 36.2 15.2 -
Liu et al. [115] CVPR’20 45.1 35.6 15.4 https://github.com/lrxjason/Attention3DHumanPose
3DMPPE [130] ICCV’19 54.4 - - https://github.com/mks0601/3DMPPE_ROOTNET_RELEASE
Fabbri et al. [41] CVPR’20 61.0 49.1 - https://github.com/fabbrimatteo/LoCO
TABLE VIII: Comparisons of 3D pose estimation methods on Human3.6M and HumanEva datasets.
Method Pub. Human3.6M (MPJPE / PA-MPJPE) HumanEva-I (MPJPE) 3DPW (PA-MPJPE) Code Link
SMPLify [14] ECCV’16 82.3 - 79.9 - http://smplify.is.tue.mpg.de/
UP [97] CVPR’17 80.7 - 74.5 - http://up.is.tuebingen.mpg.de/
HMR [83] CVPR’18 87.9 58.1 - - https://github.com/akanazawa/hmr
Human dynamics [84] CVPR’19 - 56.9 - 72.6 https://github.com/akanazawa/human_dynamics
GraphCMR [92] CVPR’19 71.9 50.1 - - https://github.com/nkolot/GraphCMR
HoloPose [56] CVPR’19 60.2 46.5 - - http://arielai.com/holopose
DenseRaC [210] ICCV’19 76.8 48.0 - - -
Texturepose [149] ICCV’19 - 49.7 - - https://github.com/geopavlakos/TexturePose
Sun et al. [183] ICCV’19 59.1 42.4 - 69.5 https://github.com/JDAI-CV/DSDSATN
SPIN [91] ICCV’19 - 41.1 - 59.2 https://github.com/nkolot/SPIN
VIBE [88] CVPR’20 65.9 41.5 - 56.5 https://github.com/mkocabas/VIBE
CenterHMR [182] arXiv’20 - - - 53.2 https://github.com/Arthur151/CenterHMR
TABLE IX: Comparisons of 3D mesh recovery methods on Human3.6M, HumanEva, and 3DPW datasets.

SURREAL Dataset [195] is a large-scale synthetic dataset created by rendering the textured SMPL model on background images. The SMPL model is driven by a large amount of 3D motion capture data. However, the body textures are limited and low-resolution, which makes the rendered 2D images unrealistic.

AMASS [122] is a large-scale motion capture (MoCap) dataset. It unifies 15 MoCap datasets by converting them to SMPL parameters via MoSh++ [122]. It contains more than 40 hours of motion data, spanning over 300 subjects and more than 110K motions. AMASS is widely used to build a prior over the human motion space, which can supervise the plausibility of estimated poses or motions.

3D Multi-person Dataset:

3DPW [196] is a single-view multi-person in-the-wild 3D human pose dataset that contains 60 video sequences (24 train, 24 test, and 12 validation) of rich activities, such as climbing, golfing, and relaxing on the beach. The videos are captured in various scenes, such as forests, streets, playgrounds, and shopping malls. IMUs are leveraged to obtain accurate 3D poses despite the complexity of the scenes. In particular, 3DPW provides abundant annotations, including 2D/3D poses, 3D body scans, and SMPL parameters. However, in some crowded scenes (e.g., on the street), 3DPW only labels the target persons and ignores the passing pedestrians. Generally, the entire dataset is used for evaluation, without any fine-tuning.

CMU Panoptic Dataset [80, 79] is a large-scale multi-view and multi-person 3D pose dataset. Currently, it contains 65 sequences and 1.5 million 3D skeletons. The data were captured in a dome built for 360-degree motion capture, which contains 480 VGA cameras (25 FPS), 31 HD cameras (30 FPS), 10 Kinect2 sensors (30 FPS), and 5 DLP projectors. Notably, it contains multi-person social scenarios. Multi-person 3D pose estimation methods usually extract a subset of the data for evaluation. Zanfir et al. [214, 215] and Jiang et al. [75] select 2 sub-sequences (9,600 frames from HD cameras 16 and 30) of 4 social activities (Haggling, Mafia, Ultimatum, and Pizza) for evaluation.

Joint Track Auto (JTA) Dataset [42] is a photo-realistic synthetic dataset for multi-person 3D pose evaluation. JTA is generated using the well-known video game Grand Theft Auto V. It contains 512 HD videos of pedestrians walking in urban scenarios. Each video is 30s long and recorded at 30 FPS.

For details of the widely used benchmarks, please refer to Table VII. Besides, we have released a detailed comparison and a code toolbox for 3D pose data processing on GitHub (https://github.com/Arthur151/SOTA-on-monocular-3D-pose-and-shape-estimation).

Analysis of Benchmarks:

Benchmark leaderboards for 3D pose estimation and 3D mesh recovery are presented in Table VIII and Table IX, respectively. As we can see, the 3D pose estimation methods show clear advantages in 3D pose accuracy on the indoor single-person benchmarks Human3.6M and HumanEva, while the 3D mesh recovery methods are more suitable for comprehensive 3D human analysis and visualization. Besides, on 3DPW [196], a newer in-the-wild multi-person 3D pose benchmark, 3D mesh recovery methods have shown promising generalization ability.

Pose Space Analysis.

One of the main challenges for monocular 3D pose estimation is the lack of sufficient training data, especially in terms of diversity. Therefore, we analyze the pose space of four 3D benchmarks and visualize the statistical results. In detail, we first align the 3D pose annotations of all datasets to the pelvis and then perform cluster analysis. Here, we use the K-means clustering algorithm to evenly divide the pose space. For visualization, UMAP [124] is employed to reduce the dimensionality. The statistical results are shown in Fig. 12.
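As a concrete illustration of this analysis pipeline, the following is a minimal sketch (not the exact code used for Fig. 12): it assumes the pooled pose annotations are given as an (N, J, 3) array `poses` with the pelvis as joint 0, and relies on scikit-learn and the umap-learn package.

```python
import numpy as np
from sklearn.cluster import KMeans
import umap  # provided by the umap-learn package


def analyze_pose_space(poses, n_clusters=20, pelvis_idx=0):
    """Cluster pelvis-aligned 3D poses and project them to 2D for plotting."""
    # Align every pose to its pelvis so that only body articulation matters.
    aligned = poses - poses[:, pelvis_idx:pelvis_idx + 1, :]
    feats = aligned.reshape(len(poses), -1)  # (N, J*3) feature vectors

    # K-means partitions the pose space into clusters.
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(feats)

    # UMAP reduces the dimensionality to 2D for a scatter-plot visualization.
    embedding = umap.UMAP(n_components=2).fit_transform(feats)
    return labels, embedding
```

The number of samples falling into each cluster (or each region of the 2D embedding) then indicates how evenly a benchmark covers the pose space.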

The distribution of different activities in pose space and of the different clusters is shown in the first panel of Fig. 12. The sample density in pose space is very uneven: most samples are gathered together. A similar conclusion can be drawn from the clustering results of the four benchmarks. We observe that the 3D poses of most samples are close to walking or standing postures. This biased distribution over the pose space limits the diversity of these datasets.

VI Conclusion and Future Directions

This paper provides a comprehensive survey of deep learning-based monocular human pose estimation for both 2D and 3D tasks. We summarize over 200 papers on 2D and 3D pose estimation, organized by single-person and multi-person scenarios and by image-based and video-based settings. Under several commonly used frameworks, human pose estimation methods have achieved significant progress through task-specific designs built on deep learning technologies. Despite this great success, there are still challenges and many emerging topics that deserve further investigation.

Pose Estimation for Complex Postures and Crowded Scenes. In daily life, the human body may take various complex or rare postures, which causes general-purpose models to fail to recognize the pose accurately. In particular, in real-world applications such as gymnastics, diving, and high-jump competitions, athletes may show extreme postures within a very short time. Such complex and fast-changing postures confuse existing models that are trained on general datasets. Therefore, on the one hand, datasets covering complex and rare postures are needed to improve the performance of existing models; on the other hand, models with stronger representations of complex behaviors and postures will be helpful. Additionally, in realistic scenarios such as shopping malls, traffic surveillance, and sports events, pose estimation for crowded people is very challenging. Although some works [100, 28] have attempted to address crowds and occlusions in 2D pose estimation, they still suffer from low performance in real scenes. Moreover, for 3D pose estimation, occlusion caused by crowded scenes will confuse the models and cause unreasonable reconstructions of human shape and pose. Since the context of a complex scene contains clues about person-person and person-object interactions, further work may exploit the relationship between the scene and the person to reason about invisible or occluded body parts.

Benchmark, Protocol, and Toolkit for 3D Mesh Recovery.

Although monocular 3D human mesh recovery is a promising direction, due to the absence of large-scale 3D mesh datasets, the evaluation of 3D mesh recovery is usually performed on skeleton-based 3D pose benchmarks, e.g., by reporting the MPJPE and PA-MPJPE joint position errors. Since a 3D mesh contains more information than 3D keypoints, e.g., appearance information, this kind of indirect evaluation is not sufficient. Therefore, we need large-scale 3D human mesh datasets and protocols for comprehensive evaluation. Additionally, as the technology matures, industrial applications need easy-to-operate toolkits, especially lightweight implementations for cloud servers and hand-held devices. Some companies and communities, such as Google, Microsoft, and TensorFlow, have developed toolkits [1, 2] or APIs (Application Programming Interfaces) [3] for 2D pose estimation. In the future, with increasing application demands, we believe more mature and general toolboxes for both 2D and 3D pose estimation will further promote the deployment of advanced algorithms.
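For reference, the two joint-error metrics mentioned above can be sketched as follows. This is a minimal illustration, not an official evaluation script: it assumes `pred` and `gt` are (J, 3) NumPy arrays of corresponding 3D joints in millimeters, with MPJPE typically computed after root (pelvis) alignment and PA-MPJPE after a rigid Procrustes alignment.

```python
import numpy as np


def mpjpe(pred, gt):
    """Mean Per-Joint Position Error: mean Euclidean distance over joints (mm).
    pred and gt are usually root-aligned (pelvis at the origin) beforehand."""
    return np.linalg.norm(pred - gt, axis=-1).mean()


def pa_mpjpe(pred, gt):
    """Procrustes-aligned MPJPE: remove global rotation, scale, and translation
    with a similarity transform before measuring the joint error."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    # Optimal rotation/scale via SVD of the 3x3 cross-covariance matrix.
    U, S, Vt = np.linalg.svd(p.T @ g)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:  # fix an improper rotation (reflection)
        Vt[-1] *= -1
        S[-1] *= -1
        R = Vt.T @ U.T
    scale = S.sum() / (p ** 2).sum()
    aligned = scale * p @ R.T + mu_g
    return mpjpe(aligned, gt)
```

Note that dataset-specific protocols (e.g., which joints are evaluated and which frames are sampled) still differ between benchmarks, so reported numbers should be reproduced with each benchmark's official tooling.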

Realistic Bodies with Expressive Faces, Hands, Hair, and Clothes. To better understand humans in a scene, we need richer clues, such as facial expression, emotional state, gesture, and clothing, to describe a person in the 3D paradigm. Considering that 3D pose recovery of a single body part (e.g., body, hand, or face) has been well developed in recent years, a natural next step is realistic 3D whole-body recovery with photorealistic details for hair, clothes, and expressions. This is a relatively new research field, and only a few works [148, 205] have attempted to combine the body, hands, and face in a unified representation framework. The main challenges of this task lie in the lack of paired 3D pose datasets containing all the detailed information, and in the scale differences between body parts. Although many separate 2D/3D body/face/hand/clothes/hair datasets are available, it is difficult to capture realistic whole-body motion simultaneously. Therefore, further work may develop weakly-/un-supervised methods that take advantage of all the single-part data for effective learning. Moreover, with the development of computer graphics, photorealistic synthesized data may further advance this research.

Multi-person 3D Pose Estimation. Although 2D pose estimation in multi-person scenes has been widely studied in recent years, the study of the 3D case has just begun. Multi-person 3D pose estimation is a very promising direction and is close to real application scenarios. Most existing well-performing methods are multi-stage frameworks that heavily rely on 2D human detection, whereas single-shot methods have the potential to achieve more attractive efficiency. However, reasoning about the poses, and especially the shapes, of multiple people in 3D is more complicated than 2D keypoint estimation. For example, person-person and person-scene interactions are important clues for resolving ambiguous poses, yet they are usually ignored by most existing methods. Therefore, exploiting more comprehensive interaction information from the 2D images would be important for estimating more plausible 3D poses.

Interaction with the 3D World and Other Agents. We live in a dynamic 3D world where people and objects interact with the environment. It would be interesting and promising to build an interaction-aware system that can capture and understand these embodied agents in the 3D world from monocular images. Although some works focus on 3D hand-object interaction [187, 66] and on the interaction of the 3D body with specific objects [58, 219], how to tackle holistic 3D scene understanding in uncontrolled, in-the-wild scenes remains challenging. On the one hand, the 3D recovery of general objects from a monocular image has not been well solved, and more detailed statistical models for general objects would bring a significant boost. On the other hand, instead of modeling each component independently, effectively representing the relationships among people, objects, and the scene will greatly affect the reasoning results. Moreover, with the development of 3D scene capture technologies, large-scale real-world 3D datasets would greatly boost the development of algorithms.

Virtual Digital Human Generation with Emotion, Speech, and Communication. A virtual digital human refers to a virtual character with a digital appearance. It has specific features such as appearance, gender, and personality, as well as the ability to express itself and communicate through language, facial expressions, and body movements. The technology has attracted much attention in many industries, such as film production, virtual hosting, intelligent customer service, and virtual teaching. Modeling and generation of the 2D/3D human character is the core of virtual digital human products. Most existing products rely on computer graphics technologies and marker-based motion capture equipment that is expensive and complex to operate. With the sharp increase in market demand, virtual digital humans are moving towards greater intelligence, convenience, and more diversified product forms. In the future, with the development of monocular 3D human recovery technology, marker-less motion capture without professional sensing equipment is expected to become simple, easy to use, and inexpensive. Besides, although current digital humans can intelligently synthesize facial expressions and mouth movements, the movements of other body parts still only support recorded playback. More integrated and automated technology is expected to realize photorealistic 2D/3D whole-body models covering body movements, facial expressions, finger gestures, voice, etc. Additionally, the development of multimodal human-machine interaction will promote natural communication and interaction with digital humans.

To summarize, monocular human pose estimation is a challenging and practical task, and the development of deep learning for pose estimation is promising and exciting. In the future, both the research and the application of human pose estimation hold many opportunities as well as challenges. The future of monocular human pose estimation will largely depend on practical focus and on progress in algorithms, data, and application scenarios.

References

  • [1] External Links: Link Cited by: §VI.
  • [2] External Links: Link Cited by: §VI.
  • [3] External Links: Link Cited by: §VI.
  • [4] M. Andriluka, U. Iqbal, E. Insafutdinov, L. Pishchulin, A. Milan, J. Gall, and B. Schiele (2018) Posetrack: a benchmark for human pose estimation and tracking. In CVPR, Cited by: §III-C2, §III-C2, §V-B1, TABLE III.
  • [5] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele (2014) 2d human pose estimation: new benchmark and state of the art analysis. In CVPR, Cited by: §I-A, §IV, §V-A1, §V-B1, §V-B1, TABLE III.
  • [6] M. Andriluka, S. Roth, and B. Schiele (2009) Pictorial structures revisited: people detection and articulated pose estimation. In CVPR, Cited by: §III-A.
  • [7] D. Anguelov, P. Srinivasan, D. Koller, S. Thrun, J. Rodgers, and J. Davis (2005) SCAPE: shape completion and animation of people. In ACM Transactions on Graphics, Cited by: §IV-A4.
  • [8] B. Artacho and A. Savakis (2020) UniPose: unified human pose estimation in single images and videos. In CVPR, Cited by: TABLE I, §III-C1.
  • [9] Q. Bao, W. Liu, Y.-H. Cheng, B. Zhou, and T. Mei (2020) Pose-guided tracking-by-detection: robust multi-person pose tracking. IEEE Transactions on Multimedia. Cited by: TABLE I, §III-C2, TABLE VI.
  • [10] Q. Bao, W. Liu, J. Hong, L.Y. Duan, and T. Mei (2020) Pose-native network architecture search for multi-person human pose estimation. In ACM MM, Cited by: §III-A.
  • [11] V. Belagiannis and A. Zisserman (2017) Recurrent human pose estimation. In FG, Cited by: TABLE I, §III-A.
  • [12] A. Benzine, F. Chabot, B. Luvison, Q.-C. Pham, and C. Achard (2020) PandaNet: anchor-based single-shot multi-person 3d pose estimation. In CVPR, Cited by: TABLE II, §IV-A6.
  • [13] Y. Bin, X. Cao, X. Chen, Y. Ge, Y. Tai, C. Wang, J. Li, F. Huang, C. Gao, and N. Sang (2020) Adversarial semantic data augmentation for human pose estimation. In ECCV, Cited by: TABLE I, §III-B1.
  • [14] F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, and M.-J. Black (2016) Keep it smpl: automatic estimation of 3D human pose and shape from a single image. In ECCV, Cited by: §II-B, §IV-B1, TABLE IX.
  • [15] L. Bourdev and J. Malik (2009) Poselets: body part detectors trained using 3d human pose annotations. In ICCV, Cited by: §V-B1.
  • [16] A. Bulat and G. Tzimiropoulos (2017) Binarized convolutional landmark localizers for human pose estimation and face alignment with limited resources. In ICCV, Cited by: TABLE I, §III-A.
  • [17] Y. Cai, Z. Wang, Z. Luo, B. Yin, A. Du, H. Wang, X. Zhang, X. Zhou, E. Zhou, and J. Sun (2020) Learning delicate local representations for multi-person pose estimation. In ECCV, Cited by: TABLE I, §III-B1, §V-B1, TABLE V.
  • [18] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh (2017) Realtime multi-person 2d pose estimation using part affinity fields. In CVPR, Cited by: §I-B, §I-B, Fig. 5, §II-A1, TABLE I, Fig. 8, §III-B2, §III-B2, §III-C2, TABLE V.
  • [19] J. Carreira, P. Agrawal, K. Fragkiadaki, and J. Malik (2016) Human pose estimation with iterative error feedback. In CVPR, Cited by: TABLE I, §III-A, §III-A, TABLE IV.
  • [20] J. Charles, T. Pfister, D.-R. Magee, D.-C. Hogg, and A. Zisserman (2016) Personalizing human video pose estimation. In CVPR, Cited by: TABLE I, §III-C1.
  • [21] C.-H. Chen and D. Ramanan (2017) 3D human pose estimation = 2d pose estimation+ matching. In CVPR, Cited by: TABLE II, §IV-A2.
  • [22] C. Chen, A. Tyagi, A. Agrawal, D. Drover, S. Stojanov, and J. M. Rehg (2019) Unsupervised 3d pose estimation with geometric self-supervision. In CVPR, Cited by: TABLE II, §IV-A2, §IV.
  • [23] W. Chen, H. Wang, Y. Li, H. Su, Z-H. Wang, C.-H. Tu, D. Lischinski, D. Cohen-Or, and B.Q. Chen (2016) Synthesizing training images for boosting human 3d pose estimation. In 3DV, Cited by: TABLE II, §IV-A4.
  • [24] X.-J. Chen and A.-L. Yuille (2014) Articulated pose estimation by a graphical model with image dependent pairwise relations. In NeurIPS, Cited by: TABLE I, §III-A.
  • [25] Y. Chen, C. Shen, X.-S. Wei, L. Liu, and J. Yang (2017) Adversarial posenet: A structure-aware convolutional network for human pose estimation. In ICCV, Cited by: TABLE I, §III-A, TABLE IV.
  • [26] Y.-C. Chen, Y.-L. Tian, and M.-Y. He (2020) Monocular human pose estimation: a survey of deep learning-based methods. Computer Vision and Image Understanding 192, pp. 102897. Cited by: §I-A, §I-A.
  • [27] Y.-L. Chen, Z.-C. Wang, Y.-X. Peng, Z.-Q. Zhang, G. Yu, and J. Sun (2018) Cascaded pyramid network for multi-person pose estimation. In CVPR, Cited by: §I-A, §I-B, Fig. 7, §III-B1, TABLE V.
  • [28] B.-W. Cheng, B. Xiao, J.-D. Wang, H.-H. Shi, T.-S. Huang, and L. Zhang (2020) HigherHRNet: scale-aware representation learning for bottom-up human pose estimation. In CVPR, Cited by: TABLE I, §III-B2, §V-B1, TABLE V, §VI.
  • [29] Y. Cheng, B. Yang, B. Wang, W.-D. Yan, and R.-T. Tan (2019) Occlusion-aware networks for 3d human pose estimation in video. In ICCV, Cited by: §II-A2, TABLE II, §IV-A5, TABLE VIII.
  • [30] G. Chéron, I. Laptev, and C. Schmid (2015) P-cnn: pose-based cnn features for action recognition. In ICCV, Cited by: §I-A, §III-D.
  • [31] H. Choi, G. Moon, and K. M. Lee (2020) Pose2Mesh: graph convolutional network for 3d human pose and mesh recovery from a 2d human pose. In ECCV, Cited by: §IV-B.
  • [32] C.-J. Chou, J.-T. Chien, and H.-T. Chen (2018) Self adversarial training for human pose estimation. In APSIPA, Cited by: TABLE I, §III-A.
  • [33] X. Chu, W.-L. Ouyang, H.-S. Li, and X.-G. Wang (2016) Structured feature learning for pose estimation. In CVPR, Cited by: §I-B, TABLE I, §III-A.
  • [34] X. Chu, W. Yang, W. Ouyang, C. Ma, A.-L. Yuille, and X.-G. Wang (2017) Multi-context attention for human pose estimation. In CVPR, Cited by: TABLE I, §III-A, TABLE IV.
  • [35] R. Dabral, A. Mundhada, U. Kusupati, S. Afaque, A. Sharma, and A. Jain (2018) Learning 3d human pose from structure and motion. In ECCV, Cited by: §IV.
  • [36] B. Debnath, M. OrBrien, M. Yamaguchi, and A. Behera (2018) Adapting mobilenets for mobile based upper body pose estimation. In AVSS, Cited by: TABLE I, §III-A.
  • [37] A. Doering, U. Iqbal, J. Gall, and D. Bonn (2018) Jointflow: temporal flow fields for multi person pose tracking. In BMVC, Cited by: TABLE I, §III-C2, TABLE VI.
  • [38] J. Dong, Q. Chen, X.-H. Shen, J.-C. Yang, and S.-C. Yan (2014) Towards unified human parsing and pose estimation. In CVPR, Cited by: §III-D.
  • [39] W. Du, Y. Wang, and Y. Qiao (2017) RPAN: an end-to-end recurrent pose-attention network for action recognition in videos. In ICCV, Cited by: §I-A, §III-D.
  • [40] Y. Du, W. Wang, and L. Wang (2015) Hierarchical recurrent neural network for skeleton based action recognition. In CVPR, Cited by: §III-D.
  • [41] M. Fabbri, F. Lanzi, S. Calderara, S. Alletto, and R. Cucchiara (2020) Compressed volumetric heatmaps for multi-person 3d pose estimation. In CVPR, Cited by: TABLE II, §IV-A6, TABLE VIII.
  • [42] M. Fabbri, R. Lanzi, S. Calderara, A. Palazzi, R. Vezzani, and R. Cucchiara (2018) Learning to detect and track visible and occluded body joints in a virtual world. In ECCV, Cited by: TABLE I, §III-C2, §V-B2, TABLE VII.
  • [43] X. Fan, K. Zheng, Y. Lin, and S. Wang (2015) Combining local appearance and holistic view: dual-source deep neural networks for human pose estimation. In CVPR, Cited by: TABLE I.
  • [44] H.-S. Fang, J. Cao, Y.-W. Tai, and C. Lu (2018) Pairwise body-part attention for recognizing human-object interactions. In ECCV, Cited by: §I-A, §III-D.
  • [45] H.-S. Fang, S.-Q. Xie, Y.-W. Tai, and C.-W. Lu (2017) RMPE: regional multi-person pose estimation. In ICCV, Cited by: TABLE I, §III-B1, TABLE V.
  • [46] H.-S. Fang, Y.-L. Xu, W.-G. Wang, X.-B. Liu, and S.-C. Zhu (2018) Learning pose grammar to encode human body configuration for 3d pose estimation. In AAAI, Cited by: §II-B, TABLE II, §IV-A5, §IV, TABLE VIII.
  • [47] V. Ferrari, M. Marin-Jimenez, and A. Zisserman (2008) Progressive search space reduction for human pose estimation. In CVPR, Cited by: §V-A1, §V-A1.
  • [48] M. Fieraru, A. Khoreva, L. Pishchulin, and B. Schiele (2018) Learning to refine human pose estimation. In CVPR WorkShops, Cited by: TABLE I, §III-A.
  • [49] L. Fu, J. Zhang, and K. Huang (2017) ORGM: occlusion relational graphical model for human pose estimation. IEEE Transactions on Image Processing 26 (2), pp. 927–941. Cited by: TABLE I, §III-A.
  • [50] S. Gao, J. Wang, H. Lu, and Z. Liu (2020) Pose-guided visible part matching for occluded person reid. In CVPR, Cited by: §III-D.
  • [51] S. Ghorbani, K. Mahdaviani, A. Thaler, K. Kording, D. J. Cook, G. Blohm, and N.-F. Troje (2020) MoVi: a large multipurpose motion and video dataset. arXiv preprint arXiv:2003.01888. Cited by: §V-B2.
  • [52] R. Girdhar, G. Gkioxari, L. Torresani, M. Paluri, and D. Tran (2018) Detect-and-track: efficient pose estimation in videos. In CVPR, Cited by: TABLE I, §III-C2, TABLE VI.
  • [53] G. Gkioxari, A. Toshev, and N. Jaitly (2016) Chained predictions using convolutional neural networks. In ECCV, Cited by: TABLE I, §III-A, §III-C1, TABLE IV.
  • [54] T. Golda, T. Kalb, A. Schumann, and J. Beyerer (2019) Human pose estimation for real-world crowded scenarios. In AVSS, Cited by: §III-B1.
  • [55] W. Gong, X. Zhang, J. Gonzàlez, A. Sobral, T. Bouwmans, C. Tu, and E. Zahzah (2016) Human pose estimation from monocular images: a comprehensive survey. Sensors 16 (12), pp. 1966. Cited by: §I-A.
  • [56] R.-A. Guler and I. Kokkinos (2019) HoloPose: holistic 3d human reconstruction in-the-wild. In CVPR, Cited by: §I-A, TABLE II, §IV-B1, TABLE IX.
  • [57] R. Guler, N. Neverova, and I. Kokkinos (2018) DensePose: dense human pose estimation in the wild. In CVPR, Cited by: §IV-B1.
  • [58] M. Hassan, V. Choutas, D. Tzionas, and M. Black (2019) Resolving 3d human pose ambiguities with 3d scene constraints. In ICCV, Cited by: §VI.
  • [59] K.-M. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In ICCV, Cited by: §I-A, §I-B, TABLE I, §III-B1, TABLE V.
  • [60] K.-M. He, X.-Y. Zhang, S.-Q. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §III-A, §III-B1.
  • [61] G. Hidalgo, Y. Raaj, H. Idrees, D.-L. Xiang, H. Joo, T. Simon, and Y. Sheikh (2019) Single-network whole-body pose estimation. In ICCV, Cited by: TABLE I, §III-B2.
  • [62] M.-B. Holte, G. Tran, M.-M. Trivedi, and T.-B. Moeslund (2012) Human pose estimation and activity recognition from multi-view videos: comparative explorations of recent developments. IEEE Journal of Selected Topics in Signal Processing 6 (5), pp. 538–552. Cited by: §I-A.
  • [63] M. Hossain and J.-J. Little (2018) Exploiting temporal information for 3d human pose estimation. In ECCV, Cited by: TABLE II, §IV-A5, TABLE VIII.
  • [64] P.-Y. Hu and D. Ramanan (2016) Bottom-up and top-down reasoning with hierarchical rectified gaussians. In CVPR, Cited by: TABLE IV.
  • [65] J. Huang, Z. Zhu, F. Guo, and G. Huang (2020) The devil is in the details: delving into unbiased data processing for human pose estimation. In CVPR, Cited by: TABLE V.
  • [66] L. Huang, J. Tan, J. Meng, J. Liu, and J. Yuan (2020) HOT-net: non-autoregressive transformer for 3d hand-object pose estimation. In ACM MM, Cited by: §VI.
  • [67] S.-L. Huang, M. Gong, and D. Tao (2017) A coarse-fine network for keypoint localization. In ICCV, Cited by: TABLE I, §III-A, TABLE V.
  • [68] J. Hwang, J. Lee, S. Park, and N. Kwak (2019) Pose estimator and tracker using temporal flow maps for limbs. In IJCNN, Cited by: TABLE I, §III-C2, TABLE VI.
  • [69] E. Insafutdinov, M. Andriluka, L. Pishchulin, S.-Y. Tang, E. Levinkov, B. Andres, and B. Schiele (2017) ArtTrack: articulated multi-person tracking in the wild. In CVPR, Cited by: TABLE I, §III-C2, TABLE VI.
  • [70] E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele (2016) DeeperCut: A deeper, stronger, and faster multi-person pose estimation model. In ECCV, Cited by: TABLE I, §III-B2, TABLE IV.
  • [71] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu (2014) Human3.6M: large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (7), pp. 1325–1339. Cited by: §I-A, §I-B, §IV, §V-B2, TABLE VII.
  • [72] U. Iqbal, M. Garbade, and J. Gall (2017) Pose for action - action for pose. In FG, Cited by: TABLE I, §III-C1.
  • [73] U. Iqbal, A. Milan, and J. Gall (2017) PoseTrack: joint multi-person pose estimation and tracking. In CVPR, Cited by: TABLE I, §III-C2, §III-C2, §V-B1, TABLE VI.
  • [74] H. Jhuang, J. G. S. Zuffi, C. Schmid, and M.-J. Black (2013) Towards understanding action recognition. In ICCV, Cited by: §V-B1, TABLE III.
  • [75] W. Jiang, N. Kolotouros, G. Pavlakos, X.-W. Zhou, and K. Daniilidis (2020) Coherent reconstruction of multiple humans from a single image. In CVPR, Cited by: TABLE II, §IV-B3, §V-B2.
  • [76] S. Jin, W. Liu, W.-L. Ouyang, and C. Qian (2019) Multi-person articulated tracking with spatial and temporal embeddings. In CVPR, Cited by: TABLE I, §III-C2.
  • [77] S. Jin, L. Xu, J. Xu, C. Wang, and P. Luo (2020) Whole-body human pose estimation in the wild. In ECCV, Cited by: TABLE I, §III-B1.
  • [78] S. Johnson and M. Everingham (2010) Clustered pose and nonlinear appearance models for human pose estimation.. In BMVC, Cited by: §IV, §V-B1, TABLE III.
  • [79] H. Joo, H. Liu, L. Tan, L. Gui, B. Nabbe, I. Matthews, T. Kanade, S. Nobuhara, and Y. Sheikh (2015) Panoptic studio: a massively multiview system for social motion capture. In ICCV, Cited by: §V-B2, TABLE VII.
  • [80] H. Joo, T. Simon, X.-L. Li, H. Liu, L. Tan, L. Gui, S. Banerjee, T.-S. Godisart, B. Nabbe, I. Matthews, T. Kanade, S. Nobuhara, and Y. Sheikh (2017) Panoptic studio: a massively multiview system for social interaction capture. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §V-B2.
  • [81] H. Joo, T. Simon, and Y. Sheikh (2018) Total capture: a 3D deformation model for tracking faces, hands, and bodies. In CVPR, Cited by: §I-A, §IV-B2.
  • [82] A. Kamel, B. Sheng, P. Li, J. Kim, and D.-D. Feng (2020) Hybrid refinement-correction heatmaps for human pose estimation. IEEE Transactions on Multimedia (), pp. 1–1. External Links: Document Cited by: TABLE I, §III-A.
  • [83] A. Kanazawa, M.-J. Black, D.-W. Jacobs, and J. Malik (2018) End-to-end recovery of human shape and pose. In CVPR, Cited by: §I-A, §I-B, §II-B, TABLE II, §IV-B1, §IV, TABLE IX.
  • [84] A. Kanazawa, J. Zhang, P. Felsen, and J. Malik (2019) Learning 3d human dynamics from video. CVPR. Cited by: TABLE II, §IV-B1, TABLE IX.
  • [85] L.-P. Ke, M.-C. Chang, H.-G. Qi, and S.-W. Lyu (2018) Multi-scale structure-aware network for human pose estimation. In ECCV, Cited by: §III-A, TABLE IV.
  • [86] NU. Khan and W. Wan (2018) A review of human pose estimation from single image. In ICALIP, Cited by: §I-A.
  • [87] S. Kiciroglu, H. Rhodin, S.-N. Sinha, M. Salzmann, and P. Fua (2020) ActiveMoCap: optimized viewpoint selection for active human motion capture. In CVPR, Cited by: TABLE II, §IV-A5.
  • [88] M. Kocabas, N. Athanasiou, and M.-J. Black (2020) VIBE: video inference for human body pose and shape estimation. In CVPR, Cited by: TABLE II, §IV-B1, §IV, TABLE IX.
  • [89] M. Kocabas, S. Karagoz, and E. Akbas (2018) Multiposenet: fast multi-person pose estimation using pose residual network. In ECCV, Cited by: §I-A, §I-B, TABLE I, §III-B2, TABLE V.
  • [90] M. Kocabas, S. Karagoz, and E. Akbas (2019) Self-supervised learning of 3d human pose using multi-view geometry. In CVPR, Cited by: TABLE II, §IV-A4, TABLE VIII.
  • [91] N. Kolotouros, G. Pavlakos, M.-J. Black, and K. Daniilidis (2019) Learning to reconstruct 3d human pose and shape via model-fitting in the loop. In ICCV, Cited by: §I-A, TABLE II, §IV-B1, TABLE IX.
  • [92] N. Kolotouros, G. Pavlakos, and K. Daniilidis (2019) Convolutional mesh regression for single-image human shape reconstruction. In CVPR, Cited by: TABLE II, §IV-B2, §IV-B, TABLE IX.
  • [93] S. Kreiss, L. Bertoni, and A. Alahi (2019) Pifpaf: composite fields for human pose estimation. In CVPR, Cited by: TABLE I, §III-B2, TABLE V.
  • [94] H. Kuehne, H.-H. Jhuang, E. Garrote, T. Poggio, and T. Serre (2011) HMDB: a large video database for human motion recognition. In ICCV, Cited by: §V-B1.
  • [95] J.-N. Kundu, S. Seth, V. Jampani, R. Mugalodi, R.-V. Babu, and A. Chakraborty (2020) Self-supervised 3d human pose estimation via part guided novel image synthesis. In CVPR, Cited by: TABLE II, §IV-A4.
  • [96] L. Ladicky, P.-H.-S. Torr, and A. Zisserman (2013) Human pose estimation using a joint pixel-wise and part-wise formulation. In CVPR, Cited by: §III-A.
  • [97] C. Lassner, J. Romero, M. Kiefel, F. Bogo, M.-J. Black, and P.-V. Gehler (2017) Unite the people: closing the loop between 3d and 2d human representations. In CVPR, Cited by: TABLE IX.
  • [98] K. Lee, I. Lee, and S.-H. Lee (2018) Propagating lstm: 3d pose estimation based on joint interdependency. In ECCV, Cited by: TABLE II, §IV-A5.
  • [99] J. Li, W. Su, and Z.-F. Wang (2020) Simple pose: rethinking and improving a bottom-up approach for multi-person pose estimation. In AAAI, Cited by: TABLE I, §III-B2, TABLE V.
  • [100] J.-F. Li, C. Wang, H. Zhu, Y.-H. Mao, H.-S. Fang, and C.-W. Lu (2019) Crowdpose: efficient crowded scenes pose estimation and a new benchmark. In CVPR, Cited by: TABLE I, §III-B1, §V-B1, TABLE III, §VI.
  • [101] M. Li, S. Chen, X. Chen, Y. Zhang, Y. Wang, and Q. Tian (2019) Actional-structural graph convolutional networks for skeleton-based action recognition. In CVPR, Cited by: §III-D.
  • [102] S.-C. Li, L. Ke, K. Pratama, Y.-W. Tai, C.-K. Tang, and K.-T. Cheng (2020) Cascaded deep monocular 3d human pose estimation with evolutionary training data. In CVPR, Cited by: Fig. 5, §II-A1, §II-B, TABLE II, §IV-A5.
  • [103] S. Li and A. B. Chan (2014) 3d human pose estimation from monocular images with deep convolutional neural network. In ACCV, Cited by: §IV-A1.
  • [104] T.-Y. Li, T. Bolkart, M.-J. Black, H. Li, and J. Romero (2017) Learning a model of facial shape and expression from 4D scans. ACM Transactions on Graphics 36 (6), pp. 194–1. Cited by: §IV-B2.
  • [105] W.-B. Li, Z.-C. Wang, B.-Y. Yin, Q.-X. Peng, Y. Du, T.-Z. Xiao, G. Yu, H.-T. Lu, Y.-C. Wei, and J. Sun (2019) Rethinking on multi-stage networks for human pose estimation. arXiv preprint arXiv:1901.00148. Cited by: TABLE I, §III-B1, TABLE V.
  • [106] Y.-L. Li, X.-P. Liu, H. Lu, S.-Y. Wang, J.-Q. Liu, J.-F. Li, and C.-W. Lu (2020) Detailed 2d-3d joint representation for human-object interaction. In CVPR, Cited by: §III-D.
  • [107] Y.-L. Li, S.-Y. Zhou, X.-J. Huang, L. Xu, Z. Ma, H.-S. Fang, Y.-F. Wang, and C.-W. Lu (2019) Transferable interactiveness knowledge for human-object interaction detection. In CVPR, Cited by: §I-A, §III-D.
  • [108] X. Liang, K. Gong, X. Shen, and L. Lin (2019) Look into person: joint body parsing pose estimation network and a new benchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (4), pp. 871–885. Cited by: TABLE I, §III-B1, §III-D.
  • [109] I. Lifshitz, E. Fetaya, and S. Ullman (2016) Human pose estimation using deep consensus voting. In ECCV, Cited by: TABLE I, §III-A, TABLE IV.
  • [110] M.-D. Lin, L. Lin, X.-D. Liang, K. Wang, and H. Cheng (2017) Recurrent 3d pose sequence machines. In CVPR, Cited by: TABLE II, §IV-A5, TABLE VIII.
  • [111] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C.-L. Zitnick (2014) Microsoft coco: common objects in context. In ECCV, Cited by: §I-A, §IV, §V-A1, §V-A1, §V-A1, §V-B1, TABLE III.
  • [112] W.-Y. Lin, H.-B. Liu, S.-Z. Liu, Y. Y. Li, G.-J. Qi, R. Qian, T. Wang, N. Sebe, N. Xu, and H.-K. Xiong (2020) Human in events: a large-scale benchmark for human-centric video analysis in complex events. arXiv preprint arXiv:2005.04490. Cited by: §V-B1, TABLE III.
  • [113] H.X. Liu, K. Simonyan, and Y.M. Yang (2019) DARTS: differentiable architecture search. In ICLR, Cited by: §III-A.
  • [114] J. Liu, H.-H. Ding, A. Shahroudy, L.-Y. Duan, X.-D. Jiang, G. Wang, and A.-K. Chichung (2019) Feature boosting network for 3d pose estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence 42 (2), pp. 494–501. Cited by: TABLE II, §IV-A3, TABLE VIII.
  • [115] R.-X. Liu, J. Shen, H. Wang, C. Chen, S.-C. Cheung, and V. Asari (2020) Attention mechanism exploits temporal contexts: real-time 3d human pose reconstruction. In CVPR, Cited by: TABLE VIII.
  • [116] Z. Liu, J. Zhu, J. Bu, and C. Chen (2015) A survey of human pose estimation: the body parts parsing based methods. Journal of Visual Communication and Image Representation 32, pp. 10–19. Cited by: §I-A.
  • [117] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M.-J. Black (2015) SMPL: a skinned multi-person linear model. ACM Transactions on Graphics. Cited by: §I-A, Fig. 5, §II-A2, §IV-A4, §IV-B, §IV.
  • [118] C.-X. Luo, X. Chu, and A. Yuille (2018) OriNet: a fully convolutional network for 3d human pose estimation. In BMVC, Cited by: §II-A1.
  • [119] Y. Luo, J.-S.-J. Ren, Z. Wang, W. Sun, J. Pan, J. Liu, J. Pang, and L. Lin (2018) LSTM pose machines. In CVPR, Cited by: TABLE I, §III-C1.
  • [120] D.-C. Luvizon, D. Picard, and H. Tabia (2018) 2D/3d pose estimation and action recognition using multitask deep learning. In CVPR, Cited by: §I-A, Fig. 5, TABLE I, §III-A, §III-A, §III-D, TABLE VIII.
  • [121] D.-C. Luvizon, H. Tabia, and D. Picard (2019) Human pose regression by combining indirect part detection and contextual information. Computers & Graphics 85, pp. 15–22. Cited by: TABLE IV.
  • [122] N. Mahmood, N. Ghorbani, N.-F. Troje, G. Pons-Moll, and M.-J. Black (2019) AMASS: archive of motion capture as surface shapes. In ICCV, Cited by: §V-B2, §V-B2.
  • [123] J. Martinez, R. Hossain, J. Romero, and J.-J. Little (2017) A simple yet effective baseline for 3d human pose estimation. In CVPR, Cited by: §I-B, TABLE II, §IV-A2, §IV, TABLE VIII.
  • [124] L. McInnes, J. Healy, N. Saul, and L. Grossberger (2018) UMAP: uniform manifold approximation and projection. The Journal of Open Source Software 3 (29), pp. 861. Cited by: §V-B2.
  • [125] D. Mehta, H. Rhodin, D. Casas, P. Fua, O. Sotnychenko, W.-P. Xu, and C. Theobalt (2017) Monocular 3d human pose estimation in the wild using improved cnn supervision. In 3DV, Cited by: §V-B2, TABLE VII.
  • [126] D. Mehta, O. Sotnychenko, F. Mueller, W.-P. Xu, S. Sridhar, G. Pons-Moll, and C. Theobalt (2018) Single-shot multi-person 3d pose estimation from monocular rgb. In 3DV, Cited by: TABLE II, §IV-A6.
  • [127] D. Mehta, S. Sridhar, O. Sotnychenko, H. Rhodin, M. Shafiei, H.-P. Seidel, W.-P. Xu, D. Casas, and C. Theobalt (2017) Vnect: real-time 3d human pose estimation with a single rgb camera. ACM Transactions on Graphics 36 (4), pp. 1–14. Cited by: TABLE II, §IV-A1.
  • [128] J.-X. Miao, Y. Wu, P. Liu, Y. Ding, and Y. Yang (2019) Pose-guided feature alignment for occluded person re-identification. In ICCV, Cited by: §III-D.
  • [129] R. Mitra, N.-B. Gundavarapu, A. Sharma, and A. Jain (2020) Multiview-consistent semi-supervised learning for 3d human pose estimation. In CVPR, Cited by: TABLE II, §IV-A4, §IV.
  • [130] G. Moon, J.-Y. Chang, and K.-M. Lee (2019) Camera distance-aware top-down approach for 3d multi-person pose estimation from a single rgb image. In ICCV, Cited by: TABLE II, §IV-A6, TABLE VIII.
  • [131] G. Moon, J.-Y. Chang, and K.-M. Lee (2019) PoseFix: model-agnostic general human pose refinement network. In CVPR, Cited by: TABLE I, §III-A, TABLE V.
  • [132] G. Moon and K. Lee (2020) I2L-meshnet: image-to-lixel prediction network for accurate 3d human pose and mesh estimation from a single RGB image. In ECCV, Cited by: §IV-B.
  • [133] T.-L. Munea, Y.-Z. Jembre, H. Weldegebriel, L.-B. Chen, C.-X. Huang, and C.-H. Yang (2020) The progress of human pose estimation: a survey and taxonomy of models applied in 2d human pose estimation. IEEE Access 8, pp. 133330–133348. Cited by: §I-A.
  • [134] A. Newell, Z. Huang, and J. Deng (2017) Associative embedding: end-to-end learning for joint detection and grouping. In NeurIPS, Cited by: §I-B, TABLE I, Fig. 8, §III-B2, §III-B2, §III-C2, TABLE V.
  • [135] A. Newell, K.-Y. Yang, and J. Deng (2016) Stacked hourglass networks for human pose estimation. In ECCV, Cited by: §I-A, §I-B, §I-B, TABLE I, §III-A, §III-A, §IV-A1, TABLE IV.
  • [136] B. Nie, C. Xiong, and S.-C. Zhu (2015) Joint action recognition and pose estimation from video. In CVPR, Cited by: TABLE I, §III-C1, §III-D.
  • [137] B.-X. Nie, P. Wei, and S.-C. Zhu (2017) Monocular 3D human pose estimation by predicting depth on joints. In ICCV, Cited by: TABLE II, §IV-A3.
  • [138] X. Nie, J. Feng, J. Xing, S. Xiao, and S. Yan (2019) Hierarchical contextual refinement networks for human pose estimation. IEEE Transactions on Image Processing 28 (2), pp. 924–936. External Links: Document Cited by: §I-B, TABLE I, §III-A, TABLE IV.
  • [139] X. Nie, J.-S. Feng, Y.-M. Zuo, and S.-C. Yan (2018) Human pose estimation with parsing induced learner. In CVPR, Cited by: TABLE I, §III-A.
  • [140] X. Nie, Y.-C. Li, L. Luo, N. Zhang, and J. Feng (2019) Dynamic kernel distillation for efficient pose estimation in videos. In ICCV, Cited by: TABLE I, §III-C1.
  • [141] X.-C. Nie, J.-S. Feng, J.-L. Xing, and S.-C. Yan (2018) Pose partition networks for multi-person pose estimation. In ECCV, Cited by: TABLE I, §III-B2.
  • [142] X.-C. Nie, J.-S. Feng, and S.-C. Yan (2018) Mutual learning to adapt for joint human parsing and pose estimation. In ECCV, Cited by: §I-A, §III-D.
  • [143] X.-C. Nie, J.-S. Feng, J.-F. Zhang, and S.-C. Yan (2019) Single-stage multi-person pose machines. In CVPR, Cited by: TABLE I, §III-B2.
  • [144] G.H. Ning, J. Pei, and H. Huang (2020) Lighttrack: a generic framework for online top-down human pose tracking. In CVPR Workshops, Cited by: TABLE VI.
  • [145] W.-L. Ouyang, X. Chu, and X.-G. Wang (2014) Multi-source deep learning for human pose estimation. In CVPR, Cited by: TABLE I, §III-A.
  • [146] G. Papandreou, T. Zhu, L.-C. Chen, S. Gidaris, J. Tompson, and K. Murphy (2018) Personlab: person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model. In ECCV, Cited by: §I-A, §I-B, TABLE I, §III-B2, TABLE V.
  • [147] G. Papandreou, T. Zhu, N. Kanazawa, A. Toshev, J. Tompson, C. Bregler, and K. Murphy (2017) Towards accurate multi-person pose estimation in the wild. In CVPR, Cited by: TABLE I, §III-B1, §III-B1, TABLE V.
  • [148] G. Pavlakos, V. Choutas, N. Ghorbani, T. Bolkart, A.-A. Osman, D. Tzionas, and M.-J. Black (2019) Expressive body capture: 3d hands, face, and body from a single image. In CVPR, Cited by: TABLE II, §IV-B2, §VI.
  • [149] G. Pavlakos, N. Kolotouros, and K. Daniilidis (2019) TexturePose: supervising human mesh estimation with texture consistency. In ICCV, Cited by: TABLE II, §IV-B1, TABLE IX.
  • [150] G. Pavlakos, X.-W. Zhou, K.-G. Derpanis, and K. Daniilidis (2017) Coarse-to-fine volumetric prediction for single-image 3d human pose. In CVPR, Cited by: §I-A, §I-B, TABLE II, §IV-A1, TABLE VIII.
  • [151] G. Pavlakos, X.W. Zhou, and K. Daniilidis (2018) Ordinal depth supervision for 3d human pose estimation. In CVPR, Cited by: TABLE II, §IV-A4, §IV, TABLE VIII.
  • [152] D. Pavllo, C. Feichtenhofer, D. Grangier, and M. Auli (2018) 3D human pose estimation in video with temporal convolutions and semi-supervised training. arXiv preprint arXiv:1811.11742. Cited by: TABLE II, §IV-A5, §IV, TABLE VIII.
  • [153] X. Peng, Z. Tang, F. Yang, R. Feris, and D.-N. Metaxas (2018) Jointly optimize data augmentation and network training: adversarial data augmentation in human pose estimation. In CVPR, Cited by: TABLE I, §III-A.
  • [154] T. Pfister, J. Charles, and A. Zisserman (2015) Flowing convnets for human pose estimation in videos. In ICCV, Cited by: TABLE I, §III-C1.
  • [155] T. Pfister, K.Simonyan, J. Charles, and A. Zisserman (2014) Deep convolutional neural networks for efficient pose estimation in gesture videos. In ACCV, Cited by: TABLE I, §III-C1.
  • [156] L. Pishchulin, E. Insafutdinov, S.-Y. Tang, B. Andres, M. Andriluka, P.-V. Gehler, and B. Schiele (2016) Deepcut: joint subset partition and labeling for multi person pose estimation. In CVPR, Cited by: §I-A, TABLE I, §III-B2, §III-C2, TABLE IV.
  • [157] G. Pons-Moll (2014) Human pose estimation from video and inertial sensors. Ph.D. Thesis, Leibniz Universität Hannover Hannover. Cited by: §V-A2.
  • [158] Y. Raaj, H. Idrees, G. Hidalgo, and Y. Sheikh (2019) Efficient online multi-person 2d pose tracking with recurrent spatio-temporal affinity fields. In CVPR, Cited by: TABLE I, §III-C2, TABLE VI.
  • [159] U. Rafi, B. Leibe, J. Gall, and I. Kostrikov (2016) An efficient convolutional network for human pose estimation. In BMVC, Cited by: TABLE I, §III-A.
  • [160] D. Ramanan (2006) Learning to parse images of articulated bodies. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §III-A.
  • [161] S.-Q. Ren, K.-M. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In NeurIPS, Cited by: §III-B2, §IV-B3.
  • [162] H. Rhodin, M. Salzmann, and P. Fua (2018) Unsupervised geometry-aware representation for 3d human pose estimation. In ECCV, Cited by: TABLE II, §IV-A4, §IV.
  • [163] H. Rhodin, J. Spörri, I. Katircioglu, V. V. Constantin, F. Meyer, E. Müller, M. Salzmann, and P. Fua (2018) Learning monocular 3d human pose estimation from multi-view images. In CVPR, Cited by: TABLE II, §IV-A4, §IV, TABLE VIII.
  • [164] G. Rogez and C. Schmid (2016) Mocap-guided data augmentation for 3d pose estimation in the wild. In NeurIPS, Cited by: TABLE II, §IV-A4.
  • [165] G. Rogez, P. Weinzaepfel, and C. Schmid (2019) LCR-net++: multi-person 2d and 3d pose detection in natural images. IEEE Transactions on Pattern Analysis and Machine Intelligence 42 (5), pp. 1146–1161. Cited by: TABLE II, §IV-A6.
  • [166] J. Romero, D. Tzionas, and M.-J. Black (2017) Embodied hands: modeling and capturing hands and bodies together. ACM Transactions on Graphics 36 (6), pp. 245. Cited by: TABLE II, §IV-B2.
  • [167] W. Ruan, W. Liu, Q. Bao, J. Chen, Y.-H. Cheng, and T. Mei (2019) POINet: pose-guided ovonic insight network for multi-person pose tracking. In ACM MM, Cited by: TABLE I, §III-C2, TABLE VI.
  • [168] D. Sánchez, M. Oliu, M. Madadi, X. Baró, and S. Escalera (2019) Multi-task human analysis in still images: 2d/3d pose, depth map, and multi-part segmentation. In FG, Cited by: TABLE I, §III-B1.
  • [169] B. Sapp and B. Taskar (2013) Modec: multimodal decomposable models for human pose estimation. In CVPR, Cited by: §III-A, §V-A1, §V-B1, TABLE III.
  • [170] N. Sarafianos, B. Boteanu, B. Ionescu, and I. A. Kakadiaris (2016) 3d human pose estimation: a review of the literature and analysis of covariates. Computer Vision and Image Understanding 152, pp. 1–20. Cited by: §I-A.
  • [171] S. Sharma, P.T. Varigonda, P. Bindal, A. Sharma, and A. Jain (2019) Monocular 3d human pose estimation by generation and ordinal ranking. In ICCV, Cited by: TABLE II, §IV-A5.
  • [172] L. Shi, Y. Zhang, J. Cheng, and H. Lu (2020) Skeleton-based action recognition with multi-stream adaptive graph convolutional networks. IEEE Transactions on Image Processing 29, pp. 9532–9545. Cited by: §III-D.
  • [173] L. Sigal, A.-O. Balan, and M.-J. Black (2010) HumanEva: synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. International Journal of Computer Vision 87 (1-2), pp. 4. Cited by: §V-B2, TABLE VII.
  • [174] M. Snower, A. Kadav, F. Lai, and H.-P. Graf (2020) 15 keypoints is all you need. In CVPR, Cited by: TABLE I, §III-C2, TABLE VI.
  • [175] J. Song, L. Wang, L.-V. Gool, and O. Hilliges (2017) Thin-slicing network: A deep structured model for pose estimation in videos. In CVPR, Cited by: TABLE I, §III-C1.
  • [176] C. Su, J. Li, S. Zhang, J. Xing, W. Gao, and Q. Tian (2017) Pose-driven deep convolutional model for person re-identification. In ICCV, Cited by: §I-A, §III-D.
  • [177] K. Su, D.-D. Yu, Z.Q. Xu, X. Geng, and C.-H. Wang (2019) Multi-person pose estimation with enhanced channel-wise and spatial information. In CVPR, Cited by: TABLE I, §III-B1, TABLE V.
  • [178] K. Sun, C.-L. Lan, J.-L. Xing, W.-J. Zeng, D. Liu, and J.-D. Wang (2017) Human pose estimation using global and local normalization. In ICCV, Cited by: TABLE IV.
  • [179] K. Sun, B. Xiao, D. Liu, and J.-D. Wang (2019) Deep high-resolution representation learning for human pose estimation. In CVPR, Cited by: §I-A, §I-B, §I-B, TABLE I, Fig. 7, §III-B1, §III-B2, TABLE IV, TABLE V.
  • [180] X. Sun, J.-X. Shang, S. Liang, and Y.-C. Wei (2017) Compositional human pose regression. In ICCV, Cited by: §II-A1, §III-A, TABLE IV, TABLE VIII.
  • [181] X. Sun, B. Xiao, F.-Y. Wei, S. Liang, and Y.-C. Wei (2018) Integral human pose regression. In ECCV, Cited by: §I-A, TABLE II, §IV-A1, TABLE V, TABLE VIII.
  • [182] Y. Sun, Q. Bao, W. Liu, Y. Fu, and T. Mei (2020) CenterHMR: a bottom-up single-shot method for multi-person 3d mesh recovery from a single image. arXiv preprint. Cited by: §I-B, TABLE II, §IV-B3, TABLE IX.
  • [183] Y. Sun, Y. Ye, W. Liu, W.-P. Gao, Y.-L. Fu, and T. Mei (2019) Human mesh recovery from monocular images via a skeleton-disentangled representation. In ICCV, Cited by: §I-B, §II-B, TABLE II, §IV-B2, §IV, TABLE IX.
  • [184] W. Tang and Y. Wu (2019) Does learning specific features for related parts help human pose estimation?. In CVPR, Cited by: §III-A, TABLE IV.
  • [185] W. Tang, P. Yu, and Y. Wu (2018) Deeply learned compositional models for human pose estimation. In ECCV, Cited by: §I-B, TABLE I, §III-A, TABLE IV.
  • [186] Z.-Q. Tang, X. Peng, S. Geng, L.-F. Wu, S.-T. Zhang, and D.-N. Metaxas (2018) Quantized densely connected u-nets for efficient landmark localization. In ECCV, Cited by: TABLE IV.
  • [187] B. Tekin, F. Bogo, and M. Pollefeys (2019) H+o: unified egocentric recognition of 3d hand-object poses and interactions. In CVPR, Cited by: §VI.
  • [188] B. Tekin, A. Rozantsev, V. Lepetit, and P. Fua (2016) Direct prediction of 3d body poses from motion compensated sequences. In CVPR, Cited by: §IV-A5.
  • [189] D. Tome, C. Russell, and L. Agapito (2017) Lifting from the deep: convolutional 3d pose estimation from a single image. In CVPR, Cited by: TABLE II, §IV-A2, TABLE VIII.
  • [190] J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler (2015) Efficient object localization using convolutional networks. In CVPR, Cited by: TABLE IV.
  • [191] J. Tompson, A. Jain, Y. LeCun, and C. Bregler (2014) Joint training of a convolutional network and a graphical model for human pose estimation. In NeurIPS, Cited by: §I-B, TABLE I, §III-A, §III-A, TABLE IV.
  • [192] A. Toshev and C. Szegedy (2014) DeepPose: human pose estimation via deep neural networks. In CVPR, Cited by: §I-A, §III-A.
  • [193] I. Umar, M. Pavlo, and K. Jan (2020) Weakly-supervised 3d human pose learning via multi-view images in the wild. In CVPR, Cited by: TABLE II, §IV-A4.
  • [194] R. Umer, A. Doering, B. Leibe, and J. Gall (2020) Self-supervised keypoint correspondences for multi-person pose estimation and tracking in videos. In ECCV, Cited by: TABLE I, §III-C2, TABLE VI.
  • [195] G. Varol, J. Romero, X. Martin, N. Mahmood, M.-J. Black, I. Laptev, and C. Schmid (2017) Learning from synthetic humans. In CVPR, Cited by: TABLE II, §IV-A4, §V-B2, TABLE VII.
  • [196] T. von Marcard, R. Henschel, M.-J. Black, B. Rosenhahn, and G. Pons-Moll (2018) Recovering accurate 3d human pose in the wild using imus and a moving camera. In ECCV, Cited by: §I-A, §V-B2, §V-B2, TABLE VII.
  • [197] B. Wan, D.-S. Zhou, Y.-F. Liu, R.-J. Li, and X.-M. He (2019) Pose-aware multi-level feature network for human object interaction detection. In ICCV, Cited by: §III-D.
  • [198] C. Wang, J. Li, W. Liu, C. Qian, and C. Lu (2020) HMOR: hierarchical multi-person ordinal relations for monocular multi-person 3d pose estimation. In ECCV, Cited by: TABLE II, §IV-A6.
  • [199] J. Wang, X. Long, Y. Gao, E. Ding, and S. Wen (2020) Graph-pcnn: two stage human pose estimation with graph pose refinement. In ECCV, Cited by: TABLE I, §III-B1, TABLE V.
  • [200] M. Wang, F. Qiu, W.-T. Liu, C. Qian, X.W. Zhou, and L.Z. Ma (2020) EllipBody: a light-weight and part-based representation for human pose and shape recovery. arXiv preprint arXiv:2003.10873. Cited by: §II-A2.
  • [201] M.-C. Wang, J. Tighe, and D. Modolo (2020) Combining detection and tracking for human pose estimation in videos. In CVPR, Cited by: TABLE I, §III-C2, TABLE VI.
  • [202] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh (2016) Convolutional pose machines. In CVPR, Cited by: §I-B, TABLE I, §III-A, TABLE IV.
  • [203] J. Wu, H. Zheng, B. Zhao, Y.-X. Li, B. Yan, R. Liang, W.-J. Wang, S. Zhou, G. Lin, and Y. Fu (2017) AI Challenger: a large-scale dataset for going deeper in image understanding. arXiv preprint arXiv:1711.06475. Cited by: §V-B1, TABLE III.
  • [204] F.-T. Xia, P. Wang, X.-J. Chen, and A.-L. Yuille (2017) Joint multi-person pose estimation and semantic part segmentation. In CVPR, Cited by: TABLE I, §III-B1.
  • [205] D.-L. Xiang, H. Joo, and Y. Sheikh (2019) Monocular total capture: posing face, body, and hands in the wild. In CVPR, Cited by: TABLE II, §IV-B2, §VI.
  • [206] B. Xiao, H.-P. Wu, and Y.-C. Wei (2018) Simple baselines for human pose estimation and tracking. In ECCV, Cited by: §I-A, §I-B, TABLE I, Fig. 7, §III-B1, §III-C2, TABLE V, TABLE VI.
  • [207] Y.-L. Xiu, J.-F. Li, H.-Y. Wang, Y.-H. Fang, and C.-W. Lu (2018) Pose flow: efficient online pose tracking. In BMVC, Cited by: TABLE I, §III-C2, TABLE VI.
  • [208] J. Xu, R. Zhao, F. Zhu, H. Wang, and W. Ouyang (2018) Attention-aware compositional network for person re-identification. In CVPR, Cited by: §III-D.
  • [209] J.-W. Xu, Z.-B. Yu, B.-B. Ni, J.-C. Yang, X.-K. Yang, and W.-J. Zhang (2020) Deep kinematics analysis for monocular 3d human pose estimation. In CVPR, Cited by: §II-A1, §II-B, TABLE II, §IV-A5, TABLE VIII.
  • [210] Y.-L. Xu, S.-C. Zhu, and T. Tung (2019) DenseRaC: joint 3d pose and shape estimation by dense render-and-compare. In ICCV, Cited by: TABLE II, §IV-B2, TABLE IX.
  • [211] S. Yan, Y. Xiong, and D. Lin (2018) Spatial temporal graph convolutional networks for skeleton-based action recognition. In AAAI, Cited by: §III-D.
  • [212] W. Yang, S. Li, W. Ouyang, H.-S. Li, and X.-G. Wang (2017) Learning feature pyramids for human pose estimation. In ICCV, Cited by: TABLE I, §III-A, TABLE IV.
  • [213] Y. Yang and D. Ramanan (2012) Articulated human detection with flexible mixtures of parts. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (12), pp. 2878–2890. Cited by: §III-A, §V-A1, §V-A1, §V-A1.
  • [214] A. Zanfir, E. Marinoiu, and C. Sminchisescu (2018) Monocular 3d pose and shape estimation of multiple people in natural scenes-the importance of multiple scene constraints. In CVPR, Cited by: TABLE II, §IV-B3, §V-B2.
  • [215] A. Zanfir, E. Marinoiu, M. Zanfir, A.-I. Popa, and C. Sminchisescu (2018) Deep network for the integrated 3d sensing of multiple people in natural images. In NeurIPS, Cited by: §I-B, TABLE II, §IV-A6, §V-B2.
  • [216] F. Zhang, X.-T. Zhu, and M. Ye (2019) Fast human pose estimation. In CVPR, Cited by: TABLE I, §III-A.
  • [217] F. Zhang, X. Zhu, H. Dai, M. Ye, and C. Zhu (2020) Distribution-aware coordinate representation for human pose estimation. In CVPR, Cited by: TABLE V.
  • [218] H. Zhang, H. Ouyang, S. Liu, X.-J. Qi, X. Shen, R. Yang, and J. Jia (2019) Human pose estimation with spatial contextual information. CoRR abs/1901.01760. Cited by: TABLE I, TABLE IV.
  • [219] J. Zhang, S. Pepose, H. Joo, D. Ramanan, J. Malik, and A. Kanazawa (2020) Perceiving 3d human-object spatial arrangements from a single image in the wild. In ECCV, Cited by: §VI.
  • [220] W.-Y. Zhang, M.-L. Zhu, and K.-G. Derpanis (2013) From actemes to action: a strongly-supervised representation for detailed action understanding. In ICCV, Cited by: §V-B1, TABLE III.
  • [221] Z.W. Zhang, C. Su, L. Zheng, and X.D. Xie (2020) Correlating edge, pose with parsing. In CVPR, Cited by: §I-A, §III-D.
  • [222] H. Zhao, M. Tian, S. Sun, J. Shao, J. Yan, S. Yi, X. Wang, and X. Tang (2017) Spindle net: person re-identification with human body region guided feature decomposition and fusion. In CVPR, Cited by: §I-A, §III-D.
  • [223] L. Zhao, X. Peng, Y. Tian, M. Kapadia, and D. Metaxas (2019) Semantic graph convolutional networks for 3d human pose regression. In CVPR, Cited by: §I-A, TABLE II, §IV-A3, §IV, TABLE VIII.
  • [224] J. Zhen, Q. Fang, J. Sun, W. Liu, W. Jiang, H. Bao, and X. Zhou (2020) SMAP: single-shot multi-person absolute 3d pose estimation. In ECCV, Cited by: TABLE II, §IV-A6.
  • [225] L. Zheng, Y. Huang, H. Lu, and Y. Yang (2019) Pose-invariant embedding for deep person re-identification. IEEE Transactions on Image Processing 28 (9), pp. 4500–4509. Cited by: §III-D.
  • [226] C.H. Zhou, Z. Ren, and G. Hua (2020) Temporal keypoint matching and refinement network for pose estimation and tracking. In ECCV, Cited by: TABLE I, §III-C2.
  • [227] K. Zhou, X.-G. Han, N.-J. Jiang, K. Jia, and J.-B. Lu (2019) HEMlets pose: learning part-centric heatmap triplets for accurate 3d human pose estimation. In ICCV, Cited by: TABLE II, §IV-A4.
  • [228] L. Zhou, Y. Chen, Y. Gao, J. Wang, and H. Lu (2020) Occlusion-aware siamese network for human pose estimation. In ECCV, Cited by: TABLE I, §III-B1.
  • [229] X.-W. Zhou, M.-L. Zhu, S. Leonardos, K.-G. Derpanis, and K. Daniilidis (2016) Sparseness meets deepness: 3d human pose estimation from monocular video. In CVPR, Cited by: TABLE VIII.
  • [230] H. Zhu, X. Zuo, S. Wang, X. Cao, and R. Yang (2019) Detailed human shape estimation from a single image by hierarchical mesh deformation. In CVPR, Cited by: §IV-B1.
  • [231] S. Zuffi and M.-J. Black (2013) Puppet flow. International Journal of Computer Vision 101 (3), pp. 437–458. Cited by: §V-B1.
  • [232] S. Zuffi and M.-J. Black (2015) The stitched puppet: a graphical model of 3d human shape and pose. In CVPR, Cited by: §I-A, §IV-A4.