## 1 Introduction

In this paper, we consider a core component in obtaining a detailed understanding of people in images and videos: human 2D pose estimation, i.e., the problem of localizing anatomical keypoints or "parts". Human pose estimation has largely focused on finding the body parts of *individuals*. Inferring the pose of multiple people in images presents a unique set of challenges. First, each image may contain an unknown number of people that can appear at any position or scale. Second, interactions between people induce complex spatial interference, due to contact, occlusion, and limb articulations, making the association of parts difficult. Third, runtime complexity tends to grow with the number of people in the image, making realtime performance a challenge.

A common approach is to employ a person detector and perform single-person pose estimation for each detection. These top-down approaches directly leverage existing techniques for single-person pose estimation, but suffer from early commitment: if the person detector fails, as it is prone to do when people are in close proximity, there is no recourse to recovery. Furthermore, their runtime is proportional to the number of people in the image: for each person detection, a single-person pose estimator is run. In contrast, bottom-up approaches are attractive because they offer robustness to early commitment and have the potential to decouple runtime complexity from the number of people in the image. Yet, bottom-up approaches do not directly use global contextual cues from other body parts and other people. Initial bottom-up methods [1, 2] did not retain their gains in efficiency, as the final parse required costly global inference, taking several minutes per image.

In this paper, we present an efficient method for multi-person pose estimation with competitive performance on multiple public benchmarks. We present the first bottom-up representation of association scores via Part Affinity Fields (PAFs), a set of 2D vector fields that encode the location and orientation of limbs over the image domain. We demonstrate that simultaneously inferring these bottom-up representations of detection and association encodes sufficient global context for a greedy parse to achieve high-quality results, at a fraction of the computational cost.

An earlier version of this manuscript appeared in [3]. This version makes several new contributions. First, we prove that PAF refinement is crucial for maximizing accuracy, while body part confidence map refinement is much less important. We increase the network depth but remove the body part refinement stages (Sections 3.1 and 3.2). This refined network increases both speed and accuracy by approximately 45% and 7%, respectively (detailed analysis in Sections 5.2 and 5.3). Second, we present an annotated foot dataset (dataset webpage: https://cmu-perceptual-computing-lab.github.io/foot_keypoint_dataset/) with 15K human foot instances that has been publicly released (Section 4.2), and we show that a combined model with body and foot keypoints can be trained while preserving the speed of the body-only model and maintaining its accuracy (Section 5.4). Third, we demonstrate the generality of our method by applying it to the task of vehicle keypoint estimation (Section 5.5). Finally, this work documents the release of OpenPose [4]. This open-source library is the first available realtime system for multi-person 2D pose detection, including body, foot, hand, and facial keypoints (described in Section 4). We also include a runtime comparison to Mask R-CNN [5] and Alpha-Pose [6], showing the computational advantage of our bottom-up approach (Section 5.3).

## 2 Related Work

**Single-Person Pose Estimation** The traditional approach to articulated human pose estimation is to perform inference over a combination of local observations on body parts and the spatial dependencies between them. The spatial model for articulated pose is either based on tree-structured graphical models [7, 8, 9, 10, 11, 12, 13], which parametrically encode the spatial relationship between adjacent parts following a kinematic chain, or non-tree models [14, 15, 16, 17, 18] that augment the tree structure with additional edges to capture occlusion, symmetry, and long-range relationships. To obtain reliable local observations of body parts, Convolutional Neural Networks (CNNs) have been widely used and have significantly boosted the accuracy of body pose estimation [19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32]. Tompson et al. [23] used a deep architecture with a graphical model whose parameters are learned jointly with the network. Pfister et al. [33] further used CNNs to implicitly capture global spatial dependencies by designing networks with large receptive fields. The convolutional pose machines architecture proposed by Wei et al. [20] used a multi-stage architecture based on a sequential prediction framework [34], iteratively incorporating global context to refine part confidence maps while preserving multimodal uncertainty from previous iterations. Intermediate supervision is enforced at the end of each stage to address the problem of vanishing gradients [35, 36, 37] during training. Newell et al. [19] also showed that intermediate supervision is beneficial in a stacked hourglass architecture. However, all of these methods assume a single person, where the location and scale of the person of interest are given.

**Multi-Person Pose Estimation** For multi-person pose estimation, most approaches [38, 39, 40, 41, 42, 6, 5, 43, 44] have used a top-down strategy that first detects people and then estimates the pose of each person independently on each detected region. Although this strategy makes the techniques developed for the single-person case directly applicable, it not only suffers from early commitment on person detection, but also fails to capture the spatial dependencies across different people that require global inference. Some approaches have started to consider inter-person dependencies. Eichner et al. [45] extended pictorial structures to take a set of interacting people and depth ordering into account, but still required a person detector to initialize detection hypotheses. Pishchulin et al. [1] proposed a bottom-up approach that jointly labels part detection candidates and associates them to individual people, with pairwise scores regressed from spatial offsets of detected parts. This approach does not rely on person detections; however, solving the proposed integer linear program over a fully connected graph is an NP-hard problem, and the average processing time for a single image is on the order of hours. Insafutdinov et al. [2] built on [1] with stronger part detectors based on ResNet [46] and image-dependent pairwise scores, and vastly improved the runtime with an incremental optimization approach, but the method still takes several minutes per image, with a limit of at most 150 part proposals. The pairwise representations used in [2], which are offset vectors between every pair of body parts, are difficult to regress precisely, so a separate logistic regression is required to convert the pairwise features into a probability score.

In earlier work [3], we presented part affinity fields (PAFs), a representation consisting of a set of flow fields that encodes unstructured pairwise relationships between the body parts of a variable number of people. In contrast to [1] and [2], we can efficiently obtain pairwise scores from PAFs without an additional training step. These scores are sufficient for a greedy parse to obtain high-quality results with realtime performance for multi-person estimation. Concurrent to this work, Insafutdinov et al. [47] further simplified their body-part relationship graph for faster inference in the single-frame model and formulated articulated human tracking as spatio-temporal grouping of part proposals. Recently, Newell et al. [48] proposed associative embeddings, which can be thought of as tags representing each keypoint's group. They group keypoints with similar tags into individual people. Papandreou et al. [49] proposed to detect individual keypoints and predict their relative displacements, allowing a greedy decoding process to group keypoints into person instances. Kocabas et al. [50] proposed a Pose Residual Network, which receives keypoint and person detections and then assigns keypoints to detected person bounding boxes. Nie et al. [51] proposed to partition all keypoint detections using dense regressions from keypoint candidates to centroids of persons in the image.

In this work, we make several extensions to our earlier work [3]. We prove that PAF refinement is critical and sufficient for high accuracy, removing the body part confidence map refinement while increasing the network depth. This leads to a computationally faster and more accurate model. We also present the first combined body and foot keypoint detector, created from an annotated foot dataset that has been publicly released. We prove that combining both detection approaches not only reduces the inference time compared to running them independently, but also maintains their individual accuracy. Finally, we present OpenPose, the first open-source library for realtime body, foot, hand, and facial keypoint detection.

## 3 Method

Fig. 2 illustrates the overall pipeline of our method. The system takes, as input, a color image of size $w \times h$ (Fig. 2a) and produces the 2D locations of anatomical keypoints for each person in the image (Fig. 2e). First, a feedforward network predicts a set $\mathbf{S}$ of 2D confidence maps of body part locations (Fig. 2b) and a set $\mathbf{L}$ of 2D vector fields of part affinities, which encode the degree of association between parts (Fig. 2c). The set $\mathbf{S} = (\mathbf{S}_1, \mathbf{S}_2, \ldots, \mathbf{S}_J)$ has $J$ confidence maps, one per part, where $\mathbf{S}_j \in \mathbb{R}^{w \times h}$, $j \in \{1 \ldots J\}$. The set $\mathbf{L} = (\mathbf{L}_1, \mathbf{L}_2, \ldots, \mathbf{L}_C)$ has $C$ vector fields, one per limb (we refer to part pairs as limbs for clarity, despite the fact that some pairs are not human limbs, e.g., the face), where $\mathbf{L}_c \in \mathbb{R}^{w \times h \times 2}$, $c \in \{1 \ldots C\}$. Each image location in $\mathbf{L}_c$ encodes a 2D vector (as shown in Fig. 1). Finally, the confidence maps and the affinity fields are parsed by greedy inference (Fig. 2d) to output the 2D keypoints for all people in the image.
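As a concrete illustration of these output shapes, consider the following minimal sketch. The values of $J$, $C$, $w$, and $h$ here are hypothetical placeholders for illustration only; the actual numbers depend on the model configuration.

```python
import numpy as np

# Hypothetical sizes for illustration (not the released model's configuration):
# J body parts, C limbs, a w x h input image.
J, C, w, h = 19, 18, 368, 368

# S: one 2D confidence map per body part, S_j in R^{w x h}.
S = np.zeros((J, h, w), dtype=np.float32)

# L: one 2D vector field per limb, L_c in R^{w x h x 2};
# each image location stores a 2D vector along the limb direction.
L = np.zeros((C, h, w, 2), dtype=np.float32)

print(S.shape)  # (19, 368, 368)
print(L.shape)  # (18, 368, 368, 2)
```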

### 3.1 Network Architecture

Our architecture, shown in Fig. 3, iteratively predicts affinity fields that encode part-to-part association, shown in blue, and detection confidence maps, shown in beige. The iterative prediction architecture, following [20], refines the predictions over successive stages, $t \in \{1, \ldots, T\}$, with intermediate supervision at each stage.

The network depth is increased with respect to [3]. In the original approach, the network architecture included several 7×7 convolutional layers. In our current model, the receptive field is preserved while the computation is reduced by replacing each 7×7 convolutional kernel with 3 consecutive 3×3 kernels. While the number of operations for the former is $2 \times 7^2 - 1 = 97$, it is only $51$ for the latter. Additionally, the output of each of the 3 convolutional kernels is concatenated, following an approach similar to DenseNet [52]. The number of non-linearity layers is tripled, and the network can keep both lower-level and higher-level features. Sections 5.2 and 5.3 analyze the accuracy and runtime speed improvements, respectively.
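The receptive-field and operation-count arithmetic above can be checked with a short sketch. The helper names below are ours, introduced only for illustration:

```python
def stacked_receptive_field(kernel_sizes):
    """Receptive field of a stack of stride-1 convolutions:
    each k x k layer adds (k - 1) to the field."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

def ops_per_output(kernel_sizes):
    """Operations per output location per channel, counting a k x k kernel
    as k^2 multiplies plus k^2 - 1 additions (2*k^2 - 1), as in the text."""
    return sum(2 * k * k - 1 for k in kernel_sizes)

# Three stacked 3x3 kernels preserve the 7x7 receptive field...
assert stacked_receptive_field([7]) == stacked_receptive_field([3, 3, 3]) == 7
# ...while reducing the operation count from 97 to 51.
assert ops_per_output([7]) == 97
assert ops_per_output([3, 3, 3]) == 51
```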

### 3.2 Simultaneous Detection and Association

The image is analyzed by a convolutional network (initialized by the first 10 layers of VGG-19 [53] and fine-tuned), generating a set of feature maps $\mathbf{F}$ that is input to the first stage. At this stage, the network produces a set of part affinity fields (PAFs) $\mathbf{L}^1 = \phi^1(\mathbf{F})$, where $\phi^1$ refers to the CNN for inference at Stage 1. In each subsequent stage, the predictions from the previous stage and the original image features $\mathbf{F}$ are concatenated and used to produce refined predictions,

$$\mathbf{L}^t = \phi^t(\mathbf{F}, \mathbf{L}^{t-1}), \ \forall 2 \le t \le T_P, \tag{1}$$

where $\phi^t$ refers to the CNN for inference at Stage $t$, and $T_P$ to the number of total PAF stages. After $T_P$ iterations, the process is repeated for the confidence map detection, starting from the most updated PAF prediction,

$$\mathbf{S}^{T_P} = \rho^t(\mathbf{F}, \mathbf{L}^{T_P}), \ \forall t = T_P, \tag{2}$$

$$\mathbf{S}^t = \rho^t(\mathbf{F}, \mathbf{L}^{T_P}, \mathbf{S}^{t-1}), \ \forall T_P < t \le T_P + T_C, \tag{3}$$

where $\rho^t$ refers to the CNN for inference at Stage $t$, and $T_C$ to the number of total confidence map stages.
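The wiring of Eqs. (1)–(3) can be sketched as follows. The stage CNNs $\phi^t$ and $\rho^t$ are replaced here by toy stand-ins (any function of the concatenated inputs suffices to illustrate the data flow; this is an assumption for illustration, not the real network):

```python
import numpy as np

# Toy stand-ins for the stage CNNs phi^t and rho^t (illustration only).
def phi(*inputs):
    return np.tanh(sum(inputs))

def rho(*inputs):
    return np.tanh(sum(inputs))

def multistage(F, T_P=4, T_C=2):
    """Multi-stage inference: T_P PAF stages, then confidence map stages."""
    L = phi(F)                      # Stage 1: L^1 = phi^1(F)
    for t in range(2, T_P + 1):
        L = phi(F, L)               # Eq. (1): L^t = phi^t(F, L^{t-1})
    S = rho(F, L)                   # Eq. (2): S^{T_P} = rho(F, L^{T_P})
    for t in range(T_P + 1, T_P + T_C + 1):
        S = rho(F, L, S)            # Eq. (3): S^t = rho^t(F, L^{T_P}, S^{t-1})
    return L, S

F = np.ones((3, 3))                 # toy "feature maps"
L, S = multistage(F)
assert L.shape == F.shape and S.shape == F.shape
```

Note that only the final PAF prediction $\mathbf{L}^{T_P}$ is fed to every confidence map stage, which is what removes the confidence-map refinement branch present in [3].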

This approach differs from [3], where both the affinity field and confidence map branches were refined at each stage; hence, the amount of computation per stage is reduced by half. We empirically observe in Section 5.2 that refined affinity field predictions improve the confidence map results, while the opposite does not hold. Intuitively, if we look at the PAF channel output, the body part locations can be guessed. However, given only a set of detected body parts with no other information, we cannot parse them into different people.

Fig. 4 shows the refinement of the affinity fields across stages. The confidence map results are predicted on top of the latest and most refined PAF predictions, resulting in a barely noticeable difference across confidence map stages. To guide the network to iteratively predict PAFs of body parts in the first branch and confidence maps in the second branch, we apply a loss function at the end of each stage. We use an $L_2$ loss between the estimated predictions and the groundtruth maps and fields. Here, we weight the loss functions spatially to address a practical issue that some datasets do not completely label all people. Specifically, the loss function of the PAF branch at stage $t_i$ and the loss function of the confidence map branch at stage $t_k$ are:

$$f_{\mathbf{L}}^{t_i} = \sum_{c=1}^{C} \sum_{\mathbf{p}} \mathbf{W}(\mathbf{p}) \cdot \left\| \mathbf{L}_c^{t_i}(\mathbf{p}) - \mathbf{L}_c^{*}(\mathbf{p}) \right\|_2^2, \tag{4}$$

$$f_{\mathbf{S}}^{t_k} = \sum_{j=1}^{J} \sum_{\mathbf{p}} \mathbf{W}(\mathbf{p}) \cdot \left\| \mathbf{S}_j^{t_k}(\mathbf{p}) - \mathbf{S}_j^{*}(\mathbf{p}) \right\|_2^2, \tag{5}$$

where $\mathbf{L}_c^{*}$ is the groundtruth part affinity vector field, $\mathbf{S}_j^{*}$ is the groundtruth part confidence map, and $\mathbf{W}$ is a binary mask with $\mathbf{W}(\mathbf{p}) = 0$ when the annotation is missing at an image location $\mathbf{p}$. The mask is used to avoid penalizing the true positive predictions during training. The intermediate supervision at each stage addresses the vanishing gradient problem by replenishing the gradient periodically [20]. The overall objective is

$$f = \sum_{t=1}^{T_P} f_{\mathbf{L}}^{t} + \sum_{t=T_P+1}^{T_P+T_C} f_{\mathbf{S}}^{t}. \tag{6}$$
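The spatially weighted loss can be sketched in a few lines. This is a minimal numpy illustration of the masking idea, not the training implementation; the function name and toy shapes are ours:

```python
import numpy as np

def masked_l2_loss(pred, gt, W):
    """Spatially weighted L2 loss: W(p) = 0 where the annotation is
    missing, so unlabeled people do not penalize true positives."""
    # pred, gt: (channels, h, w) maps or (channels, h, w, 2) vector fields.
    # W: (h, w) binary mask, broadcast over channels.
    if pred.ndim == 4:
        diff2 = np.sum((pred - gt) ** 2, axis=-1)  # squared vector norm
    else:
        diff2 = (pred - gt) ** 2
    return float(np.sum(W * diff2))

h, w = 4, 4
W = np.ones((h, w))
W[0, 0] = 0.0                        # annotation missing at pixel (0, 0)
pred = np.zeros((2, h, w))
gt = np.zeros((2, h, w))
gt[0, 0, 0] = 1.0                    # mismatch only at the masked pixel
assert masked_l2_loss(pred, gt, W) == 0.0   # masked pixel contributes nothing
```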