We propose a new bottom-up method for multi-person 2D human pose estimation that is particularly well suited for urban mobility such as self-driving cars and delivery robots. The new method, PifPaf, uses a Part Intensity Field (PIF) to localize body parts and a Part Association Field (PAF) to associate body parts with each other to form full human poses. Our method outperforms previous methods at low resolution and in crowded, cluttered and occluded scenes thanks to (i) our new composite field PAF encoding fine-grained information and (ii) the choice of Laplace loss for regressions which incorporates a notion of uncertainty. Our architecture is based on a fully convolutional, single-shot, box-free design. We perform on par with the existing state-of-the-art bottom-up method on the standard COCO keypoint task and produce state-of-the-art results on a modified COCO keypoint task for the transportation domain.
Tremendous progress has been made in estimating human poses “in the wild” driven by popular data collection campaigns [1, 27]. Yet, when it comes to the “transportation domain” such as for self-driving cars or social robots, we are still far from matching an acceptable level of accuracy. While a pose estimate is not the final goal, it is an effective low dimensional and interpretable representation of humans to detect critical actions early enough for autonomous navigation systems (e.g., detecting pedestrians who intend to cross the street). Consequently, the further away a human pose can be detected, the safer an autonomous system will be. This directly relates to pushing the limits on the minimum resolution needed to perceive human poses.
In this work, we tackle the well established multi-person 2D human pose estimation problem given a single input image. We specifically address challenges that arise in autonomous navigation settings as illustrated in Figure 1: (i) wide viewing angle with limited resolution on humans, i.e., a height of 30-90 pixels, and (ii) high density crowds where pedestrians occlude each other. Naturally, we aim for high recall and precision.
Although pose estimation has been studied before the deep learning era, a significant cornerstone is the work of OpenPose, followed by Mask R-CNN. The former is a bottom-up approach (detecting joints without a person detector), and the latter is a top-down one (using a person detector first and outputting joints within the detected bounding boxes). While the performance of these methods is stunning on high enough resolution images, they perform poorly in the limited resolution regime, as well as in dense crowds where humans partially occlude each other.
In this paper, we propose to extend the notion of fields in pose estimation to go beyond scalar and vector fields to composite fields. We introduce a new neural network architecture with two head networks. For each body part or joint, one head network predicts the confidence score, the precise location and the size of this joint, which we call a Part Intensity Field (PIF) and which is similar to the fused part confidence map of prior work. The other head network predicts associations between parts, called the Part Association Field (PAF), which has a new composite structure. Our encoding scheme has the capacity to store fine-grained information on low resolution activation maps. The precise regression to joint locations is critical, and we use a Laplace-based loss instead of the vanilla $L_1$ loss. Our experiments show that we outperform both bottom-up and established top-down methods on low resolution images while performing on par on higher resolutions. The software is open source and available online at https://github.com/vita-epfl/openpifpaf.
Over the past years, state-of-the-art methods for pose estimation have been based on Convolutional Neural Networks [18, 3, 31, 34]. They outperform traditional methods based on pictorial structures [12, 8, 9] and deformable part models. The deep learning tsunami started with DeepPose, which uses a cascade of convolutional networks for full-body pose estimation. Then, instead of predicting absolute human joint locations, some works refine pose estimates by predicting error feedback (i.e., corrections) at each iteration [4, 17] or by using a human pose refinement network to exploit dependencies between input and output spaces. There is now an arms race towards proposing alternative neural network architectures: from convolutional pose machines and stacked hourglass networks [32, 28] to recurrent networks and voting schemes.
All these approaches for human pose estimation can be grouped into bottom-up and top-down methods. The former estimate each body joint first and then group the joints to form a unique pose. The latter run a person detector first and estimate body joints within the detected bounding boxes.
Examples of top-down methods are PoseNet, RMPE, CFN, Mask R-CNN [18, 15] and more recently CPN and MSRA. These methods profit from advances in person detectors and vast amounts of labeled bounding boxes for people. The ability to leverage that data turns the requirement of a person detector into an advantage. Notably, Mask R-CNN treats keypoint detection as an instance segmentation task. During training, for every independent keypoint, the target is transformed to a binary mask containing a single foreground pixel. In general, top-down methods are effective but struggle when person bounding boxes overlap.
Early bottom-up methods solve the part association with an integer linear program, which results in processing times for a single image on the order of hours. Later works accelerate the prediction time and broaden the applications to tracking animal behavior. Other methods drastically reduce prediction time by using greedy decoders in combination with additional tools, as in Part Affinity Fields, Associative Embedding and PersonLab. Recently, MultiPoseNet developed a multi-task learning architecture combining detection, segmentation and pose estimation for people.
Figure 2: Model architecture. An operation with stride two is indicated by “//2”. The decoder is a program that converts PIF and PAF fields into pose estimates containing 17 joints each. Each joint is represented by an $x$ and a $y$ coordinate and a confidence score.
The goal of our method is to estimate human poses in crowded images. We address challenges related to low-resolution and partially occluded pedestrians. Top-down methods particularly struggle when pedestrians are occluded by other pedestrians where bounding boxes clash. Previous bottom-up methods are bounding box free but still contain a coarse feature map for localization. Our method is free of any grid-based constraint on the spatial localization of the joints and has the capacity to estimate multiple poses occluding each other.
Figure 2 presents our overall model. It is a shared ResNet  base network with two head networks: one head network predicts a confidence, precise location and size of a joint, which we call a Part Intensity Field (PIF), and the other head network predicts associations between parts, called the Part Association Field (PAF). We refer to our method as PifPaf.
Before describing each head network in detail, we briefly define our field notation.
Fields are a useful tool to reason about structure on top of images. The notion of composite fields directly motivates our proposed Part Association Fields.
We will use $i, j$ to enumerate spatially the output locations of the neural network and $x, y$ for real-valued coordinates. A field is denoted with $\mathbf{f}^{ij}$ over the domain $(i, j) \in \mathbb{Z}_+^2$ and can have as codomain (the values of the field) scalars, vectors or composites. For example, the composite of a scalar field $s^{ij}$ and a vector field $\mathbf{v}^{ij}$ can be represented as $\{s, v_x, v_y\}$, which is equivalent to “overlaying” a confidence map with a vector field.
The Part Intensity Fields (PIF) detect and precisely localize body parts. The fusion of a confidence map with a regression for keypoint detection was introduced in prior work. Here, we recap this technique in the language of composite fields and add a scale as a new component to form our PIF field.
PIFs have a composite structure. They are composed of a scalar component for confidence, a vector component that points to the closest body part of the particular type and another scalar component for the size of the joint. More formally, at every output location $(i, j)$, a PIF predicts a confidence $p_c^{ij}$, a vector $(p_x^{ij}, p_y^{ij})$ with spread $p_b^{ij}$ (details in Section 3.4) and a scale $p_\sigma^{ij}$, and can be written as $\mathbf{p}^{ij} = \{p_c^{ij}, p_x^{ij}, p_y^{ij}, p_b^{ij}, p_\sigma^{ij}\}$.
The confidence map of a PIF is very coarse. Figure 3(a) shows a confidence map for the left shoulders for an example image. To improve the localization of this confidence map, we fuse it with the vectorial part of the PIF shown in Figure 3(b) into a high resolution confidence map.
We create this high resolution part confidence map $f(x, y)$ with a convolution of an unnormalized Gaussian kernel $\mathcal{N}$ with width $p_\sigma$ over the regressed targets from the Part Intensity Field weighted by its confidence $p_c$:
$$f(x, y) = \sum_{ij} p_c^{ij} \, \mathcal{N}(x, y \mid p_x^{ij}, p_y^{ij}, p_\sigma^{ij}) \qquad (1)$$
This equation emphasizes the grid-free nature of the localization. The spatial extent of a joint is learned as part of the field. An example is shown in Figure 3(c). The resulting map of highly localized joints is used to seed the pose generation and to score the location of newly proposed joints.
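The accumulation of confidence-weighted Gaussians described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the reference implementation: the function name `high_res_confidence`, the tuple layout `(c, x, y, sigma)` and the example values are all hypothetical.

```python
import math

def high_res_confidence(pif_cells, width, height):
    """Accumulate PIF predictions into a dense confidence map.

    pif_cells: iterable of (c, x, y, sigma) tuples, one per output cell,
    where (x, y) is the regressed joint location in map coordinates
    (hypothetical layout chosen for this sketch).
    """
    f = [[0.0] * width for _ in range(height)]
    for c, x, y, sigma in pif_cells:
        for v in range(height):
            for u in range(width):
                d2 = (u - x) ** 2 + (v - y) ** 2
                # Unnormalized Gaussian: its peak value equals the confidence c.
                f[v][u] += c * math.exp(-0.5 * d2 / sigma ** 2)
    return f

# Two cells vote for nearly the same shoulder location, one weak cell votes elsewhere.
cells = [(0.9, 3.2, 4.1, 1.0), (0.8, 3.0, 4.0, 1.0), (0.1, 7.0, 1.0, 1.0)]
fmap = high_res_confidence(cells, width=10, height=8)
# Highest value and its (u, v) pixel position.
peak = max((val, (u, v)) for v, row in enumerate(fmap) for u, val in enumerate(row))
```

Because every cell contributes at its regressed real-valued location rather than at its own grid position, agreeing votes reinforce each other and the peak can be sharper than the coarse confidence grid.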
Associating joints into multiple poses is challenging in crowded scenes where people partially occlude each other. Especially two step processes – top-down methods – struggle in this situation: first they detect person bounding boxes and then they attempt to find one joint-type for each bounding box. Bottom-up methods are bounding box free and therefore do not suffer from the clashing bounding box problem.
We propose bottom-up Part Association Fields (PAF) to connect joint locations together into poses. An illustration of the PAF scheme is shown in Figure 4.
At every output location, PAFs predict a confidence, two vectors to the two parts this association is connecting and two widths $b$ (details in Section 3.4) for the spatial precisions of the regressions. PAFs are represented with $\mathbf{a}^{ij} = \{a_c^{ij}, a_{x1}^{ij}, a_{y1}^{ij}, a_{b1}^{ij}, a_{x2}^{ij}, a_{y2}^{ij}, a_{b2}^{ij}\}$. Visualizations of the associations between left shoulders and left hips are shown in Figure 5.
Both endpoints are localized with regressions that do not suffer from discretizations as they occur in grid-based methods. This helps to resolve joint locations of close-by persons precisely and to resolve them into distinct annotations.
There are 19 connections for the person class in the COCO dataset each connecting two types of joints; e.g., there is a right-knee-to-right-ankle association. The algorithm to construct the PAF components at a particular feature map location consists of two steps. First, find the closest joint of either of the two types which determines one of the vector components. Second, the ground truth pose determines the other vector component to represent the association. The second joint is not necessarily the closest one and can be far away.
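The two-step target construction can be illustrated with a short sketch. It assumes that poses are given as dictionaries from joint name to `(x, y)`; the function name `paf_target` and the data layout are hypothetical and chosen only for this example.

```python
import math

def paf_target(cell, poses, joint_a, joint_b):
    """Return (vector to joint_a, vector to joint_b) for one feature map cell.

    cell: (x, y) location of the cell.
    poses: list of dicts mapping joint name -> (x, y) (assumed layout).
    """
    best_dist, best_pose = None, None
    for pose in poses:
        if joint_a not in pose or joint_b not in pose:
            continue
        # Step 1: distance to the closest joint of either of the two types.
        d = min(math.dist(cell, pose[joint_a]), math.dist(cell, pose[joint_b]))
        if best_dist is None or d < best_dist:
            best_dist, best_pose = d, pose
    if best_pose is None:
        return None
    # Step 2: the ground truth pose of that joint fixes the other endpoint,
    # even if a joint of the same type from another person is closer.
    a, b = best_pose[joint_a], best_pose[joint_b]
    return (a[0] - cell[0], a[1] - cell[1]), (b[0] - cell[0], b[1] - cell[1])

poses = [
    {"right_knee": (10, 20), "right_ankle": (10, 30)},
    {"right_knee": (12, 21), "right_ankle": (30, 40)},
]
target = paf_target((12, 22), poses, "right_knee", "right_ankle")
```

In this example the cell is closest to the second person's knee, so the second vector points to that person's ankle although the first person's ankle is nearer.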
During training, the components of the field have to point to the parts that should be associated. Similar to how an $x$ component of a vector field always has to point to the same target as the $y$ component, the components of the PAF field have to point to the same association of parts.
Human pose estimation algorithms tend to struggle with the diversity of scales that a human pose can have in an image. While a localization error for the joint of a large person can be minor, that same absolute error might be a major mistake for a small person. We use an $L_1$-type loss to train regressive outputs. We improve the localization ability of the network by injecting a scale dependence into that regression loss with the SmoothL1 or Laplace loss.
The SmoothL1 loss allows tuning the radius $r_{smooth}$ around the origin where it produces softer gradients. For a person instance bounding box area of $A_i$ and keypoint size of $\sigma_k$, $r_{smooth}^{i,k}$ can be set proportionally to $\sqrt{A_i}\,\sigma_k$, which we study in Table 3.
The Laplace loss is another $L_1$-type loss that is attenuated via the predicted spread $b$:
$$L = |x - \mu| / b + \log(2b) \qquad (2)$$
It is independent of any estimates of $A_i$ and $\sigma_k$ and we use it for all vectorial components.
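As a small numeric illustration of the two regression losses, the sketch below writes the Laplace loss as the negative log-likelihood of a Laplace distribution, $|x - \mu|/b + \log(2b)$, and uses the common piecewise definition of SmoothL1 with radius $r$. The function names and the choice of this particular SmoothL1 form are assumptions for illustration.

```python
import math

def laplace_loss(x, mu, b):
    """Negative log-likelihood of a Laplace distribution with spread b."""
    return abs(x - mu) / b + math.log(2.0 * b)

def smooth_l1_loss(x, mu, r):
    """Common SmoothL1 form: quadratic within radius r, linear outside."""
    d = abs(x - mu)
    if d < r:
        return 0.5 * d * d / r
    return d - 0.5 * r

# A large predicted spread down-weights the residual but pays a log(2b) penalty.
tight = laplace_loss(3.0, 1.0, 0.5)
loose = laplace_loss(3.0, 1.0, 2.0)
```

For a fixed residual, the Laplace loss is minimized when the predicted spread equals the absolute error, which is what lets the network express its own uncertainty without needing the bounding box area or keypoint size.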
Decoding is the process of converting the output feature maps of a neural network into sets of 17 coordinates that make human pose estimates. Our process is similar to the fast greedy decoding used in prior work.
A new pose is seeded by PIF vectors with the highest values in the high resolution confidence map defined in equation 1. Starting from a seed, connections to other joints are added with the help of PAF fields. The algorithm is fast and greedy. Once a connection to a new joint has been made, this decision is final.
Multiple PAF associations can form connections between the current and the next joint. Given the location of a starting joint $\vec{x}$, the scores $s$ of PAF associations $\mathbf{a}$ are calculated with
$$s(\mathbf{a}, \vec{x}) = a_c \, \exp\left(-\frac{\|\vec{x} - \vec{a}_1\|_2}{b_1}\right) f_2(a_{x2}, a_{y2}) \qquad (3)$$
which takes into account the confidence in this connection $a_c$, the distance to the first vector’s location calibrated with the two-tailed Laplace distribution probability and the high resolution part confidence at the second vector’s target location. To confirm the proposed position of the new joint, we run reverse matching. This process is repeated until a full pose is obtained. We apply non-maximum suppression at the keypoint level. The suppression radius is dynamic and based on the predicted scale component of the PIF field. We do not refine any fields, neither during training nor at test time.
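The association score is a product of three factors: the association confidence, a Laplace-shaped distance term for the first endpoint, and the high resolution part confidence at the second endpoint. A minimal sketch of this scoring and of the greedy choice among candidates follows; the function name `paf_score` and the dictionary keys are hypothetical, and the confidence lookup is replaced by a precomputed number.

```python
import math

def paf_score(a_c, a1, b1, f2_at_target, x):
    """Score one PAF association for a starting joint at location x.

    a_c: association confidence; a1: (x, y) of the first endpoint;
    b1: predicted spread of the first endpoint; f2_at_target: high
    resolution confidence of the second joint type at the second endpoint.
    """
    dist = math.dist(x, a1)
    return a_c * math.exp(-dist / b1) * f2_at_target

start = (10.0, 10.0)
candidates = [
    {"a_c": 0.9, "a1": (10.0, 12.0), "b1": 1.0, "f2": 0.8, "target": (10.0, 30.0)},
    {"a_c": 0.7, "a1": (10.0, 10.5), "b1": 1.0, "f2": 0.9, "target": (11.0, 29.0)},
]
# Greedy decoding keeps only the best-scoring association.
best = max(
    candidates,
    key=lambda a: paf_score(a["a_c"], a["a1"], a["b1"], a["f2"], start),
)
```

Here the second candidate wins despite its lower confidence because its first endpoint lies much closer to the starting joint.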
|Method||AP||AP^0.50||AP^0.75||AP^M||AP^L||AR||AR^0.50||AR^0.75||AR^M||AR^L|
|Mask R-CNN||41.6||68.1||42.5||28.2||59.8||49.0||76.0||50.0||35.6||67.5|
Cameras in self-driving cars have a wide field of view and have to resolve small instances of pedestrians within that field of view. We want to emulate that small pixel-height distribution of pedestrians with a publicly available dataset and evaluation protocol for human pose estimation.
In addition, and to demonstrate the broad applicability of our method, we also investigate pose estimation in the context of the person re-identification task (Re-Id) – that is, given an image of a person, identify that person in other images. Some prior work has used part-based or region-based models [45, 7, 43] that would profit from quality pose estimates.
We quantitatively evaluate our proposed method, PifPaf, on the COCO keypoint task for people in low resolution images. Starting from the original COCO dataset, we constrain the maximum image side length to 321 pixels to emulate a crop of a 4k camera. We obtain person bounding boxes that are px high. The COCO metrics contain a breakdown for medium-sized humans under AP$^M$ and AR$^M$ that have bounding box area in the original image between $32^2$ and $96^2$ pixels. After resizing for low resolution, this corresponds to bounding boxes of height px.
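To make the effect of the resizing concrete, the sketch below rescales an image so that its longer side is at most 321 pixels and applies the same factor to a bounding box side. The helper name `resize_to_max_side` and the 640x480 example image are assumptions for illustration, not values from the evaluation.

```python
def resize_to_max_side(width, height, max_side=321):
    """Return the resized (width, height) and the scale factor.

    Images that already fit are left unchanged (no upscaling).
    """
    scale = min(1.0, max_side / max(width, height))
    return round(width * scale), round(height * scale), scale

new_w, new_h, scale = resize_to_max_side(640, 480)
# Side lengths of the COCO "medium" area thresholds after rescaling.
medium_low, medium_high = 32 * scale, 96 * scale
```

For this example image the scale factor is about 0.5, so a box side of 96 px in the original shrinks to roughly 48 px.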
We qualitatively study the performance of our method on images captured by self-driving cars as well as random crowded scenarios. We use the recently released nuScenes dataset . Since labels and evaluation protocols are not yet available we qualitatively study the results.
In the context of Re-Id, we investigate the popular and publicly available Market-1501 dataset. It consists of $64 \times 128$ pixel crops of pedestrians. We apply the same model that we trained on COCO data. Figure 8 qualitatively compares extracted poses from Mask R-CNN with our proposed method. The comparison shows a clear improvement of the poses extracted with our PifPaf method.
Performance on higher resolution images is not the focus of this paper. However, other methods are optimized for full resolution COCO images, and therefore we also show our results and comparisons for high resolution COCO poses.
The COCO keypoint detection task is evaluated like an object detection task, with the core metrics being variants of average precision (AP) and average recall (AR) thresholded at an object keypoint similarity (OKS). COCO assumes a fixed ratio of keypoint size to bounding box area per keypoint type to define OKS. For each image, pose estimators have to provide the 17 keypoint locations per pose and a score for each pose. Only the top 20 scoring poses are considered for evaluation.
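For readers unfamiliar with OKS, the following sketch computes it for a single pose using the standard COCO definition, in which each labeled keypoint contributes $\exp(-d^2 / (2 s^2 k^2))$ with $s^2$ the object area and $k = 2\sigma$ a per-keypoint constant. The function name `oks` and the example sigma values are illustrative assumptions.

```python
import math

def oks(pred, gt, visible, area, sigmas):
    """Object keypoint similarity between a predicted and a ground truth pose.

    pred, gt: lists of (x, y); visible: visibility flags (> 0 means labeled);
    area: ground truth object area; sigmas: per-keypoint constants.
    """
    total, count = 0.0, 0
    for (px, py), (gx, gy), v, sigma in zip(pred, gt, visible, sigmas):
        if v <= 0:
            continue  # unlabeled keypoints are ignored
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        k = 2.0 * sigma
        total += math.exp(-d2 / (2.0 * area * k * k))
        count += 1
    return total / count if count else 0.0

gt = [(10.0, 10.0), (20.0, 20.0), (0.0, 0.0)]
pred = [(10.0, 10.0), (23.0, 24.0), (50.0, 50.0)]
score = oks(pred, gt, visible=[2, 2, 0], area=2500.0, sigmas=[0.05, 0.05, 0.05])
```

Because the squared distance is divided by the object area, the same pixel error is penalized far more on a small person than on a large one, which is why low resolution poses are hard under this metric.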
All our models are based on ImageNet pretrained base networks followed by custom, multiple head sub-networks. Specifically, we use the 64115 images in the 2017 COCO training set that have a person annotation for training. Our validation is done on the 2017 COCO validation set of 5000 images. The base networks are modified ResNet50/101/152 networks. The head networks are single-layer 1x1 sub-pixel convolutions that double the spatial resolution. The confidence component of a field is normalized with a sigmoid non-linearity.
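The resolution doubling of a sub-pixel convolution comes from a rearrangement step that follows the 1x1 convolution: groups of $r^2$ channels are interleaved into an $r$ times larger spatial grid. The sketch below shows only that rearrangement on nested lists, using the channel ordering of common deep learning libraries; the function name `pixel_shuffle` and the toy input are assumptions.

```python
def pixel_shuffle(x, r=2):
    """Rearrange a [C*r*r][H][W] nested list into [C][H*r][W*r]."""
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    c_out = c_in // (r * r)
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):
            for j in range(r):
                # Each of the r*r input channels fills one sub-pixel offset (i, j).
                src = x[c * r * r + i * r + j]
                for row in range(h):
                    for col in range(w):
                        out[c][row * r + i][col * r + j] = src[row][col]
    return out

# Four 1x1 channels become one 2x2 channel.
shuffled = pixel_shuffle([[[1]], [[2]], [[3]], [[4]]], r=2)
```

A head with upscaling factor 2 therefore has to produce four times as many channels before the rearrangement as it outputs afterwards.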
The base network has various modification options. The strides of the input convolution and the input max-pooling operation can be changed. It is also possible to remove the max-pooling operation in the input block and the entire last block. The default modification used here is to remove the max-pool layer from the input block.
We apply only a few weak data augmentations. To create uniform batches, we crop images to squares where the side of the square is between 95% and 100% of the short edge of the image and the location is chosen randomly. These are large crops to keep as much of the training data as possible. Half of the time the entire image is used un-cropped and bars are added to make it square. The subsequent resizing uses bicubic interpolation. Training images and annotations are randomly horizontally flipped.
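The square crop sampling can be sketched as follows. This is a minimal illustration assuming uniform sampling of both the side length and the crop position; the function name `sample_square_crop` is hypothetical, and the padding, resizing and flipping steps are omitted.

```python
import random

def sample_square_crop(width, height, rng=random):
    """Sample (left, top, side) of a square crop covering 95-100% of the short edge."""
    short = min(width, height)
    side = int(round(rng.uniform(0.95, 1.0) * short))
    # randint is inclusive on both ends, so the crop always fits in the image.
    left = rng.randint(0, width - side)
    top = rng.randint(0, height - side)
    return left, top, side

rng = random.Random(0)
crops = [sample_square_crop(640, 480, rng) for _ in range(100)]
```

Keypoint annotations would need to be shifted by `(left, top)` to stay aligned with the cropped image.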
The components of the fields that form confidence maps are trained with independent binary cross entropy losses. We use $L_1$ losses for the scale components of the PIF fields and use Laplace losses for all vectorial components.
During training, we fix the running statistics of the Batch Normalization operations to their pretrained values. We use the SGD optimizer with a learning rate of , momentum of 0.95, batch size of 8 and no weight decay. We employ model averaging to extract stable models for validation. At each optimization step, we update an exponentially weighted version of the model parameters. Our decay constant is . The training time for 75 epochs of ResNet101 on two GTX1080Ti is approximately 95 hours.
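The model averaging step can be sketched as an exponential moving average over parameters. The update rule below, in which the averaged value moves towards the current parameters by a fraction `decay`, is one common convention and is an assumption here; the function name `ema_update` and the value 0.1 are placeholders for illustration only.

```python
def ema_update(avg_params, params, decay):
    """Move each averaged parameter a fraction `decay` towards its current value."""
    return [(1.0 - decay) * a + decay * p for a, p in zip(avg_params, params)]

avg = [0.0, 1.0]
for step_params in ([1.0, 1.0], [1.0, 3.0], [1.0, 1.0]):
    # Called once per optimization step with the freshly updated parameters.
    avg = ema_update(avg, step_params, decay=0.1)
```

The averaged copy is used only for validation and does not feed back into the optimizer, so it smooths out the step-to-step noise of SGD.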
We compare our proposed PifPaf method against the reproducible state-of-the-art bottom-up OpenPose  and top-down Mask R-CNN  methods. While our goal is to outperform bottom-up approaches, we still report results of a top-down approach to evaluate the strength of our method. Since this is an emulation of small humans within a much larger image, we modified existing methods to prevent upscaling of small images.
Table 1 presents our quantitative results on the COCO dataset. We outperform the bottom-up OpenPose and even the top-down Mask R-CNN approach on all metrics. These numbers are overall lower than their higher resolution counterparts. The two conceptually very different baseline methods show similar performance while our method is clearly ahead by over 18% in AP.
Our quantitative results emulate the person distribution in urban street scenes using a public, annotated dataset. Figure 6 shows qualitative results of the kind of street scenes we want to address. Not only do we have fewer false positives, we also detect pedestrians who partially occlude each other. It is interesting to see that a critical gesture such as “waving” towards a car is only detected with our method. Neither Mask R-CNN nor OpenPose accurately estimates the arm gesture in the first row of Figure 6. Such a difference can be fundamental in developing safe self-driving cars.
We further show qualitative results on more crowded images in Figure 7. For perspectives like the one in the second row, we observe that bounding boxes of close-by pedestrians occlude pedestrians further away. This is a difficult scenario for top-down methods. Bottom-up methods perform better here, which we can also observe for our PifPaf method.
To quantify the performance on the Market-1501 dataset, we created a simplified accuracy metric. The accuracy is 43% for Mask R-CNN and 96% for PifPaf. The evaluation is based on the number of images with a correct pose out of 202 random images from the train set. A correct pose has up to three joints misplaced.
Other methods are optimized for higher resolution images. For a fair comparison, we show a quantitative comparison on the high resolution COCO 2017 test-dev set in Table 2. We perform on par with the best existing bottom-up method.
|Method||AP||AP^M||AP^L|
|Mask R-CNN||63.1||58.0||70.4|
|PersonLab – single-scale||66.5||62.4||72.3|
|PifPaf – single-scale (ours)||66.7||62.4||72.9|
We studied the effects of various design decisions that are summarized in Table 3.
|Loss||AP||AP^M||AP^L|
|Laplace (using $b$ in decoder)||45.5||31.4||64.9|
We found that we can tune the performance towards smaller or larger objects by modifying the overall scale of $r_{smooth}$ and so we studied its impact. However, the real improvement is obtained with the Laplace-based loss. Adding the scale component to the PIF field improved the AP of our ResNet101 model from 64.5% to 65.7%.
Metrics for varying ResNet backbones are in Table 4. For the same backbone, we outperform PersonLab by 9.5% in AP with a simultaneous 32% speed up.
|Backbone||AP (PersonLab)||t [ms] (PersonLab)||t_dec [ms]|
|ResNet101||65.7 (60.0)||240 (355)||175|
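Assuming the values in parentheses in the ResNet101 row are the PersonLab AP and prediction time in milliseconds, both reported comparisons can be reproduced as relative changes with respect to PersonLab:

```latex
\frac{65.7 - 60.0}{60.0} = 0.095 \quad (9.5\%\ \text{higher AP}),
\qquad
\frac{355 - 240}{355} \approx 0.324 \quad (32\%\ \text{less time per image})
```

Under this reading, the AP gain is relative rather than absolute: in percentage points the difference is 5.7.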
We have developed a new bottom-up method for multi-person 2D human pose estimation that addresses failure modes that are particularly prevalent in the transportation domain, i.e., in self-driving cars and social robots. We demonstrated that our method outperforms previous state-of-the-art methods in the low resolution regime and performs on par at high resolution.
The proposed PAF fields can be applied to other tasks as well. Within the image domain, predicting structured image concepts is an exciting next step.
We would like to thank EPFL SCITAS for their support with compute infrastructure.
Person re-identification by multi-channel parts-based CNN with improved triplet loss function. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1874–1883, 2016.