FAIR's research platform for object detection research, implementing popular algorithms like Mask R-CNN and RetinaNet.
In this work, we establish dense correspondences between an RGB image and a surface-based representation of the human body, a task we refer to as dense human pose estimation. We first gather dense correspondences for 50K persons appearing in the COCO dataset by introducing an efficient annotation pipeline. We then use our dataset to train CNN-based systems that deliver dense correspondence 'in the wild', namely in the presence of background, occlusions and scale variations. We improve our training set's effectiveness by training an 'inpainting' network that can fill in missing ground-truth values, and report clear improvements with respect to the best results that would have been achievable in the past. We experiment with fully-convolutional networks and region-based models and observe a superiority of the latter; we further improve accuracy through cascading, obtaining a system that delivers highly-accurate results in real time. Supplementary materials and videos are provided on the project page http://densepose.org.
This work aims at pushing further the envelope of human understanding in images by establishing dense correspondences from a 2D image to a 3D, surface-based representation of the human body. We can understand this task as involving several other problems, such as object detection, pose estimation, part and instance segmentation either as special cases or prerequisites. Addressing this task has applications in problems that require going beyond plain landmark localization, such as graphics, augmented reality, or human-computer interaction, and could also be a stepping stone towards general 3D-based object understanding.
The task of establishing dense correspondences from an image to a surface-based model has been addressed mostly in the setting where a depth sensor is available, as in the Vitruvian manifold, metric regression forests, or the more recent work on dense point cloud correspondence. By contrast, in our case we consider a single RGB image as input, based on which we establish a correspondence between surface points and image pixels.
Several other works have recently aimed at recovering dense correspondences between pairs or sets of RGB images [48, 10] in an unsupervised setting. More recently, the equivariance principle has been used to align sets of images to a common coordinate system, while following the general idea of groupwise image alignment, e.g. [23, 21].
While these works are aiming at general categories, our work is focused on arguably the most important visual category, humans. For humans one can simplify the task by exploiting parametric deformable surface models, such as the Skinned Multi-Person Linear (SMPL) model, or the more recent Adam model obtained through carefully controlled 3D surface acquisition. Turning to the task of image-to-surface mapping, a two-stage method has been proposed that first detects human landmarks through a CNN and then fits a parametric deformable surface model to the image through iterative minimization. In parallel to our work, this method has been developed to operate in an end-to-end fashion, incorporating the iterative reprojection error minimization as a module of a deep network that recovers 3D camera pose and the low-dimensional body parametrization.
Our methodology differs from all these works in that we take a full-blown supervised learning approach and gather ground-truth correspondences between images and a detailed, accurate parametric surface model of the human body: rather than using the SMPL model at test time we only use it as a means of defining our problem during training. Our approach can be understood as the next step in the line of works on extending the standard for humans in [26, 1, 19, 7, 40, 18, 28]. Human part segmentation masks have been provided in the Fashionista, PASCAL-Parts, and Look-Into-People (LIP) datasets; these can be understood as providing a coarsened version of image-to-surface correspondence, where rather than continuous coordinates one predicts discretized part labels. Surface-level supervision was only recently introduced for synthetic images, while the ‘Unite the People’ (UP) dataset annotates 8515 images with keypoints and semi-automated fits of 3D models to images. In this work, instead of compromising the extent and realism of our training set, we introduce a novel annotation pipeline that allows us to gather ground-truth correspondences for 50K images of the COCO dataset, yielding our new DensePose-COCO dataset.
Our work is closest in spirit to the recent DenseReg framework, where CNNs were trained to successfully establish dense correspondences between a 3D model and images ‘in the wild’. That work focused mainly on faces, and evaluated its results on datasets with moderate pose variability. Here, however, we are facing new challenges, due to the higher complexity and flexibility of the human body, as well as the larger variation in poses. We address these challenges by designing appropriate architectures, as described in Sec. 3, which yield substantial improvements over a DenseReg-type fully convolutional architecture. By combining our approach with the recent Mask-RCNN system, we show that a discriminatively trained model can recover highly-accurate correspondence fields for complex scenes involving tens of persons with real-time speed: on a GTX 1080 GPU our system operates at 20-26 frames per second for a 320x240 image or 4-5 frames per second for an 800x1100 image.
Our contributions can be summarized in three points. Firstly, as described in Sec. 2, we introduce the first manually-collected ground truth dataset for the task, by gathering dense correspondences between the SMPL model  and persons appearing in the COCO dataset. This is accomplished through a novel annotation pipeline that exploits 3D surface information during annotation.
Secondly, as described in Sec. 3, we use the resulting dataset to train CNN-based systems that deliver dense correspondence ‘in the wild’, by regressing body surface coordinates at any image pixel. We experiment with both fully-convolutional architectures, relying on Deeplab , and also with region-based systems, relying on Mask-RCNN , observing a superiority of region-based models over fully-convolutional networks. We also consider cascading variants of our approach, yielding further improvements over existing architectures.
Thirdly, we explore different ways of exploiting our constructed ground truth information. Our supervision signal is defined over a randomly chosen subset of image pixels per training sample. We use these sparse correspondences to train a ‘teacher’ network that can ‘inpaint’ the supervision signal in the rest of the image domain. Using this inpainted signal results in clearly better performance when compared to either sparse points, or any other existing dataset, as shown experimentally in Sec. 4.
Our experiments indicate that dense human pose estimation is to a large extent feasible, but still has space for improvement. We conclude our paper with some qualitative results and directions that show the potential of the method. We will make code and data publicly available from our project’s webpage, http://densepose.org.
Gathering rich, high-quality training sets has been a catalyst for progress in the classification, detection and segmentation [8, 26] tasks. There currently exists no manually collected ground-truth for dense human pose estimation for real images. Existing semi-automated and synthetic datasets can be used as surrogates, but as we show in Sec. 4 they provide worse supervision.
In this section we introduce our COCO-DensePose dataset, alongside evaluation measures that allow us to quantify progress in the task in Sec. 4. We have gathered annotations for 50K humans, collecting more than 5 million manually annotated correspondences.
We start with a presentation of our annotation pipeline, since this required several design choices that may be more generally useful for 3D annotation. We then turn to an analysis of the accuracy of the gathered ground-truth, alongside the resulting performance measures used to assess the different methods.
In this work, we involve human annotators to establish dense correspondences from 2D images to surface-based representations of the human body. If done naively, this would require ‘hunting vertices’ for every 2D image point, by manipulating a surface through rotations - which can be frustratingly inefficient. Instead, we construct an annotation pipeline through which we can efficiently gather annotations for image-to-surface correspondence.
As shown in Fig. 2, in the first stage we ask annotators to delineate regions corresponding to visible, semantically defined body parts. These include Head, Torso, Lower/Upper Arms, Lower/Upper Legs, Hands and Feet. In order to simplify the UV parametrization we design the parts to be isomorphic to a plane, partitioning the limbs and torso into lower-upper and frontal-back parts.
For head, hands and feet, we use the manually obtained UV fields provided in the SMPL model . For the rest of the parts we obtain the unwrapping via multi-dimensional scaling applied to pairwise geodesic distances. The UV fields for the resulting 24 parts are visualized in Fig. 1 (right).
We instruct the annotators to estimate the body part behind the clothes, so that for instance wearing a large skirt would not complicate the subsequent annotation of correspondences. In the second stage we sample every part region with a set of roughly equidistant points obtained via k-means and request the annotators to bring these points in correspondence with the surface. The number of sampled points varies based on the size of the part, and the maximum number of sampled points per part is 14. In order to simplify this task we ‘unfold’ the part surface by providing six pre-rendered views of the same body part and allow the user to place landmarks on any of them (Fig. 3). This allows the annotator to choose the most convenient point of view by selecting one among six options instead of manually rotating the surface.
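The sampling step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the annotation tool's actual code: the function name `sample_part_points`, the `area_per_point` knob that ties the number of points to the region size, and the fixed iteration count are all assumptions made for the example. Only the use of k-means and the cap of 14 points per part come from the text.

```python
import random


def sample_part_points(pixels, max_points=14, area_per_point=50, iters=20, seed=0):
    """Pick roughly equidistant points inside a part region with plain k-means.

    pixels: list of (x, y) coordinates that lie inside the part mask.
    area_per_point: hypothetical knob, one sampled point per this many pixels.
    """
    # The number of points grows with the part size but is capped at max_points.
    k = max(1, min(max_points, len(pixels) // area_per_point))
    rng = random.Random(seed)
    centers = rng.sample(pixels, k)
    for _ in range(iters):
        # Assignment step: attach every pixel to its nearest center.
        clusters = [[] for _ in range(k)]
        for x, y in pixels:
            j = min(range(k),
                    key=lambda c: (x - centers[c][0]) ** 2 + (y - centers[c][1]) ** 2)
            clusters[j].append((x, y))
        # Update step: move each center to the mean of its cluster.
        centers = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
    # Snap each center to an actual pixel of the region.
    return [min(pixels, key=lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2)
            for cx, cy in centers]


region = [(x, y) for x in range(40) for y in range(30)]  # toy rectangular part
points = sample_part_points(region)
```

Because k-means centers spread out to cover the region, the snapped points end up roughly evenly spaced, which is the property the annotation stage relies on.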
As the user indicates a point on any of the rendered part views, its surface coordinates are used to simultaneously show its position on the remaining views – this gives a global overview of the correspondence. The image points are presented to the annotator in a horizontal/vertical succession, which makes it easier to deliver geometrically consistent annotations by avoiding self-crossings of the surface. This two-stage annotation process has allowed us to very efficiently gather highly accurate correspondences. If we quantify the complexity of the annotation task in terms of the time it takes to complete it, we have seen that the part segmentation and correspondence annotation tasks take approximately the same time, which is surprising given the more challenging nature of the latter task. Visualizations of the collected annotations are provided in Fig. 4, where the partitioning of the surface and U, V coordinates are shown in Fig. 1.
We assess human annotator accuracy with respect to a gold-standard measure of performance. Typically in pose estimation one asks multiple annotators to label the same landmark, which is then used to assess the variance in position, e.g. [26, 36]. In our case, we can render images where we have access to the true mesh coordinates used to render a pixel. We thereby directly compare the true position used during rendering and the one estimated by annotators, rather than first estimating a ‘consensus’ landmark location among multiple human annotators.
In particular, we provide annotators with synthetic images generated through the exact same surface model as the one we use in our ground-truth annotation, exploiting an existing rendering system and textures. We then ask annotators to bring the synthesized images into correspondence with the surface using our annotation tool, and for every image $k$ estimate the geodesic distance $d_{i,k}$ between the correct surface point $i$ and the point $\hat{i}_k$ estimated by human annotators:
$$d_{i,k} = g(i, \hat{i}_k),$$
where $g(\cdot, \cdot)$ measures the geodesic distance between two surface points.
For any image $k$, we annotate and estimate the error $d_{i,k}$ only on a randomly sampled set of surface points $S_k$ and interpolate the errors on the remainder of the surface. Finally, we average the errors across all $K$ examples used to assess annotator performance.
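To make the error measure concrete, the following sketch approximates the geodesic distance between two mesh vertices by the shortest path along mesh edges (Dijkstra's algorithm) and averages the error over annotated points. The edge-path approximation, the function names, and the toy mesh are assumptions for illustration; the text does not specify how geodesic distances are computed, and the interpolation of errors over the rest of the surface is omitted here.

```python
import heapq
import math


def geodesic_distances(vertices, edges, source):
    """Approximate geodesic distances from `source` to every vertex by
    shortest paths along mesh edges (an assumption, not the paper's method)."""
    adjacency = {i: [] for i in range(len(vertices))}
    for a, b in edges:
        w = math.dist(vertices[a], vertices[b])  # Euclidean edge length
        adjacency[a].append((b, w))
        adjacency[b].append((a, w))
    dist = [math.inf] * len(vertices)
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale queue entry
        for v, w in adjacency[u]:
            nd = d + w
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def mean_annotation_error(vertices, edges, pairs):
    """Average geodesic error over (true_vertex, annotated_vertex) pairs."""
    total = 0.0
    for true_v, annotated_v in pairs:
        total += geodesic_distances(vertices, edges, true_v)[annotated_v]
    return total / len(pairs)


# Toy "mesh": the perimeter of a unit square.
square = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
error = mean_annotation_error(square, ring, [(0, 2), (1, 1)])
```

On a real body mesh the edge-path length slightly overestimates the true surface geodesic, so this should be read as a rough stand-in for $g(\cdot, \cdot)$.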
As shown in Fig. 5 the annotation errors are substantially smaller on small surface parts with distinctive features that could help localization (face, hands, feet), while on larger uniform areas that are typically covered by clothes (torso, back, hips) the annotator errors can get larger.
We consider two different ways of summarizing correspondence accuracy over the whole human body, including pointwise and per-instance evaluation.
Pointwise evaluation. This approach evaluates correspondence accuracy over the whole image domain through the Ratio of Correct Point (RCP) correspondences, where a correspondence is declared correct if the geodesic distance is below a certain threshold. As the threshold $t$ varies, we obtain a curve $f(t)$, whose area provides us with a scalar summary of the correspondence accuracy. For any given image we have a varying set of points coming with ground-truth signals. We summarize performance on the ensemble of such points, gathered across images. We evaluate the area under the curve (AUC), $\mathrm{AUC}_a = \frac{1}{a}\int_0^a f(t)\,dt$, for two different values of $a$, 10 cm and 30 cm, yielding $\mathrm{AUC}_{10}$ and $\mathrm{AUC}_{30}$ respectively, where $\mathrm{AUC}_{10}$ is understood as being an accuracy measure for more refined correspondence. This performance measure is easily applicable to both single- and multi-person scenarios and can deliver directly comparable values. In Fig. 6, we provide the per-part pointwise evaluation of the human annotator performance on synthetic data, which can be seen as an upper bound for the performance of our systems.
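As a worked illustration of the pointwise measure, the sketch below computes the RCP at a threshold and the normalized area under the RCP curve from a list of per-point geodesic errors. It assumes distances in meters and uses upper limits of 0.10 m and 0.30 m as example values; the function names are hypothetical. Since each point contributes a step function to the curve, the integral has a closed form: a point with error $d$ contributes $\max(0, a - d)$ to the area.

```python
def rcp(distances, t):
    """Ratio of Correct Points: fraction of errors strictly below threshold t."""
    return sum(d < t for d in distances) / len(distances)


def auc(distances, a):
    """Normalized area under the RCP curve on [0, a].

    The integral of the indicator [d < t] over t in [0, a] is max(0, a - d),
    so no numerical integration is needed.
    """
    return sum(max(0.0, a - d) for d in distances) / (a * len(distances))


errors_m = [0.0, 0.05, 0.2, 0.5]  # toy geodesic errors in meters
auc_10 = auc(errors_m, 0.10)
auc_30 = auc(errors_m, 0.30)
```

The tighter limit rewards only small errors, which is why the value for the smaller limit reads as a measure of more refined correspondence.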
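Per-instance evaluation. Inspired by the object keypoint similarity (OKS) measure used for pose evaluation on the COCO dataset, we introduce geodesic point similarity (GPS) as a correspondence matching score:
$$\mathrm{GPS}_j = \frac{1}{|P_j|} \sum_{p \in P_j} \exp\left(\frac{-g(i_p, \hat{i}_p)^2}{2\kappa^2}\right),$$
where $P_j$ is the set of ground truth points annotated on person instance $j$, $i_p$ is the vertex estimated by a model at point $p$, $\hat{i}_p$ is the ground truth vertex and $\kappa$ is a normalizing parameter. We set $\kappa = 0.255$ so that a single point has a GPS value of 0.5 if its geodesic distance from the ground truth equals the average half-size of a body segment, corresponding to approximately 30 cm. Intuitively, this means that a score of $\mathrm{GPS} \approx 0.5$ can be achieved by a perfect part segmentation model, while going above that also requires a more precise localization of a point on the surface.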
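The per-instance score can be checked numerically with a short sketch. It assumes a Gaussian-type similarity in the style of COCO's object keypoint similarity, geodesic errors in meters, and a normalizing parameter of 0.255; these values are assumptions chosen so that an error of 0.30 m maps to a score of about 0.5, and the function name `gps` is hypothetical.

```python
import math


def gps(geodesic_errors, kappa=0.255):
    """Geodesic point similarity for one person instance.

    geodesic_errors: per-point geodesic distances (meters) between the
    predicted and the ground-truth surface positions.
    kappa: normalizing parameter (assumed value, in meters).
    """
    return sum(math.exp(-(g * g) / (2.0 * kappa * kappa))
               for g in geodesic_errors) / len(geodesic_errors)


perfect = gps([0.0, 0.0])  # every point exactly right
half_segment = gps([0.30])  # one point off by about half a body segment
```

A perfect prediction scores 1, and the score decays smoothly as points drift along the surface, so a single threshold on GPS can decide whether a detection matches a ground-truth person.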
Once the matching is performed, we follow the COCO challenge protocol [26, 37] and evaluate Average Precision (AP) and Average Recall (AR) at a number of GPS thresholds ranging from 0.5 to 0.95, which corresponds to the range of geodesic distances between 0 and 30 cm. We use the same range of distances to perform both per-instance and per-point evaluation.
We now turn to the task of training a deep network that predicts dense correspondences between image pixels and surface points. Such a task was recently addressed in the Dense Regression (DenseReg) system through a fully-convolutional network architecture. In this work, we introduce improved architectures by combining the DenseReg approach with the Mask-RCNN architecture, yielding our ‘DensePose-RCNN’ system. We develop cascaded extensions of DensePose-RCNN that further improve accuracy and describe a training-based interpolation method that allows us to turn a sparse supervision signal into a denser and more effective variant.
The simplest architecture choice consists in using a fully convolutional network (FCN) that combines a classification and a regression task, similar to DenseReg. In a first step, we classify a pixel as belonging to either background, or one among several region parts which provide a coarse estimate of surface coordinates. This amounts to a labelling task that is trained using a standard cross-entropy loss. In a second step, a regression system indicates the exact coordinates of the pixel within the part. Since the human body has a complicated structure, we break it into multiple independent pieces and parameterize each piece using a local two-dimensional coordinate system, that identifies the position of any node on this surface part.
Intuitively, we can say that we first use appearance to make a coarse estimate of where the pixel belongs and then align it to the exact position through some small-scale correction. Concretely, coordinate regression at an image position $i$ can be formulated as follows:
$$c^* = \arg\max_c P(c \mid i), \qquad [U, V] = R^{c^*}(i),$$
where in the first stage we assign position $i$ to the body part $c^*$ that has highest posterior probability, as calculated by the classification branch, and in the second stage we use the regressor $R^{c^*}$ that places the point $i$ in the continuous $U, V$ coordinates parametrization of part $c^*$. In our case, $c$ can take 25 values (one is background), meaning that $P$ is a 25-way classification unit, and we train 24 regression functions $R^c$, each of which provides 2D coordinates within its respective part $c$. While training, we use a cross-entropy loss for the part classification and a smooth $L_1$ loss for training each regressor. The regression loss is only taken into account for a part if the pixel is within the specific part.
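The two-stage prediction and its training loss can be illustrated for a single pixel. The sketch below is an assumption-laden toy, not the network code: index 0 is taken to be the background class, the regressors' outputs are passed in as a plain list of 24 (U, V) pairs, and the smooth L1 transition point `beta` is set to 1, a common default that the text does not specify.

```python
import math


def predict_uv(part_scores, uv_per_part):
    """Stage 1: pick the most probable class. Stage 2: read that part's (U, V).

    part_scores: 25 classification scores, index 0 assumed to be background.
    uv_per_part: 24 (U, V) pairs, one per body-part regressor.
    """
    c_star = max(range(len(part_scores)), key=part_scores.__getitem__)
    if c_star == 0:
        return 0, None  # background pixels carry no surface coordinates
    return c_star, uv_per_part[c_star - 1]


def smooth_l1(x, beta=1.0):
    """Quadratic near zero, linear for large residuals."""
    ax = abs(x)
    return 0.5 * ax * ax / beta if ax < beta else ax - 0.5 * beta


def pixel_loss(part_scores, uv_per_part, gt_part, gt_uv):
    """Cross-entropy on the part label plus smooth L1 on the ground-truth
    part's regressor only; other regressors receive no loss at this pixel."""
    m = max(part_scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in part_scores))
    loss = log_z - part_scores[gt_part]  # softmax cross-entropy
    if gt_part != 0:
        u, v = uv_per_part[gt_part - 1]
        loss += smooth_l1(u - gt_uv[0]) + smooth_l1(v - gt_uv[1])
    return loss


scores = [0.1] * 25
scores[3] = 5.0  # part 3 wins the classification
uvs = [(k / 24.0, 1.0 - k / 24.0) for k in range(24)]
part, uv = predict_uv(scores, uvs)
```

Restricting the regression loss to the ground-truth part keeps each regressor specialized to its own chart, so it never has to produce meaningful coordinates for pixels of other parts.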
Using an FCN makes the system particularly easy to train, but loads the same deep network with too many tasks, including part segmentation and pixel localization, while at the same time requiring scale-invariance which becomes challenging for humans in COCO. Here we adopt the region-based approach of [34, 15], which consists in a cascade of proposing regions-of-interest (ROI), extracting region-adapted features through ROI pooling [16, 15] and feeding the resulting features into a region-specific branch. Such architectures decompose the complexity of the task into controllable modules and implement a scale-selection mechanism through ROI-pooling. At the same time, they can also be trained jointly in an end-to-end manner .
We adopt the Mask-RCNN settings, involving the construction of Feature Pyramid Network features and ROI-Align pooling, which have been shown to be important for tasks that require spatial accuracy. We adapt this architecture to our task, so as to obtain dense part labels and coordinates within each of the selected regions.
As shown in Fig. 7, we introduce a fully-convolutional network on top of ROI-pooling that is entirely devoted to these two tasks, generating a classification and a regression head that provide the part assignment and part coordinate predictions, as in DenseReg. For simplicity, we use the exact same architecture used in the keypoint branch of Mask-RCNN, consisting of a stack of 8 alternating 3×3 fully convolutional and ReLU layers with 512 channels. At the top of this branch we have the same classification and regression losses as in the FCN baseline, but we now use a supervision signal that is cropped within the proposed region.
During inference, our system operates at 25 fps on 320x240 images and 4-5 fps on 800x1100 images using a GTX 1080 graphics card.
Inspired by the success of recent pose estimation models based on iterative refinement [45, 30] we experiment with cascaded architectures. Cascading can improve performance both by providing context to the following stages, and also through the benefits of deep supervision .
As shown in Fig. 8, we do not confine ourselves to cascading within a single task, but also exploit information from related tasks, such as keypoint estimation and instance segmentation, which have successfully been addressed by the Mask-RCNN architecture . This allows us to exploit task synergies and the complementary merits of different sources of supervision.
Even though we aim at dense pose estimation at test time, in every training sample we annotate only a sparse subset of the pixels, approximately 100-150 per human. This does not necessarily pose a problem during training, since we can make our classification/regression losses oblivious to points where the ground-truth correspondence was not collected, simply by not including them in the summation over the per-pixel losses . However, we have observed that we obtain substantially better results by “inpainting” the values of the supervision signal on positions that were not originally annotated. For this we adopt a learning-based approach where we firstly train a “teacher” network (depicted in Fig. 9) to reconstruct the ground-truth values wherever these are observed, and then deploy it on the full image domain, yielding a dense supervision signal. In particular, we only keep the network’s predictions on areas that are labelled as foreground, as indicated by the part masks collected by humans, in order to ignore network errors on background regions.
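Both ideas in this paragraph, skipping unannotated pixels in the loss and building a dense target from teacher predictions restricted to the foreground, fit in a short sketch. Pixels are flattened into lists for brevity and the function names are hypothetical. Keeping the manual annotation wherever one exists is an assumption of this example; the text only states that teacher predictions are kept on foreground areas.

```python
def masked_mean_loss(per_pixel_losses, annotated):
    """Average the loss over annotated pixels only; the rest are ignored."""
    picked = [loss for loss, has_gt in zip(per_pixel_losses, annotated) if has_gt]
    return sum(picked) / len(picked) if picked else 0.0


def dense_supervision(teacher_pred, foreground, sparse_gt):
    """Build a dense target from a sparse one.

    teacher_pred: teacher network output per pixel.
    foreground: True where the human-drawn part masks mark the person.
    sparse_gt: manual annotation per pixel, or None where not annotated.
    Returns one target per pixel, with None meaning "no supervision here".
    """
    target = []
    for pred, is_fg, gt in zip(teacher_pred, foreground, sparse_gt):
        if gt is not None:
            target.append(gt)    # assumption: manual labels take precedence
        elif is_fg:
            target.append(pred)  # inpainted value from the teacher
        else:
            target.append(None)  # background: teacher output is discarded
    return target


sparse_loss = masked_mean_loss([1.0, 2.0, 3.0], [True, False, True])
dense = dense_supervision([0.1, 0.2, 0.3, 0.4],
                          [True, True, False, False],
                          [None, 0.9, None, None])
```

The student network then trains on `dense` with the same masking rule, which gives it far more supervised pixels per person than the 100-150 manual points.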
In all of the following experiments, we assess the methods on a test set of 1.5k images containing 2.3k humans, using a training set of 48K humans. Our test set coincides with the COCO keypoints-minival partition and the training set with the COCO-train partition. We are currently collecting annotations for the remainder of the COCO dataset, which will soon allow us to also have a competition mode evaluation.
Before assessing dense pose estimation ‘in-the-wild’ in Sec. 4.3, we start in Sec. 4.1 with the more restricted ‘Single-Person’ setting where we use as inputs images cropped around ground-truth boxes. This factors out the effects of detection performance and provides us with a controlled setting to assess the usefulness of the COCO-DensePose dataset.
We start in Sec. 4.1.1 by comparing the COCO-DensePose dataset to other sources of supervision for dense pose estimation and then in Sec. 4.1.2 compare the performance of the model-based SMPLify system with our discriminatively-trained system. Clearly the model-based system was not trained with the same amount of data as our model; this comparison therefore serves primarily to show the merit of our large-scale dataset for discriminative training.
We start by assessing whether COCO-DensePose improves the accuracy of dense pose estimation with respect to the prior semi-automated, or synthetic supervision signals described below.
A semi-automated method is used for the ‘Unite the People’ (UP) dataset, where human annotators verified the results of fitting the SMPL 3D deformable model to 2D images. However, model fitting often fails in the presence of occlusions or extreme poses, and is never guaranteed to be entirely successful; for instance, even after rejecting a large fraction of the fitting results, the feet are still often misaligned in UP. This both decimates the training set and obfuscates evaluation, since the ground-truth itself may have systematic errors.
Synthetic ground-truth can be established by rendering images using surface-based models [32, 31, 35, 11, 5, 29]. This has recently been applied to human pose in the SURREAL dataset, where the SMPL model was rendered with the CMU Mocap dataset poses. However, covariate shift can emerge because of the different statistics of rendered and natural images.
Since both of these methods use the same SMPL surface model as the one we use in our work, we can directly compare results, and also combine datasets. We render our dense coordinates and our dense part labels on the SMPL model for all 8514 images of the UP dataset and 60k SURREAL models for comparison.
In Fig. 12 we assess the test performance of ResNet-101 FCNs of stride 8 trained with different datasets, using a Deeplab-type architecture. During training we augment samples from all of the datasets with scaling, cropping and rotation. We observe that the surrogate datasets lead to weaker performance, while their combination yields improved results. Still, their performance is substantially lower than the one obtained by training on our DensePose dataset, while combining the DensePose with SURREAL results in a moderate drop in network performance. Based on these results we rely exclusively on the DensePose dataset for training in the remaining experiments, even though domain adaptation could be used in the future to exploit synthetic sources of supervision.
The last line in the table of Fig. 12 (’DensePose’) indicates an additional performance boost that we get by using the COCO human segmentation masks in order to replace background intensities with an average intensity during both training and testing and also by evaluating the network at multiple scales and averaging the results. Clearly, the results with other methods are not directly comparable, since we are using additional information to remove background structures. Still, the resulting predictions are substantially closer to human performance – we therefore use this as the ‘teacher network’ to obtain dense supervision for the experiments in Sec. 4.2.
In Fig. 12 we compare our method to the SMPLify pipeline, which fits the 3D SMPL model to an image based on a pre-computed set of landmark points. We use previously released code with both the DeeperCut pose estimation landmark detector, for 14-landmark results, and a 91-landmark alternative. Note that these landmark detectors were trained on the MPII dataset. Since the whole body is visible in the MPII dataset, for a fair comparison we separately evaluate on images where 16/17 or 17/17 landmarks are visible and on the whole test set. We observe that while being orders of magnitude faster (0.04-0.25 s vs 60-200 s) our bottom-up, feedforward method largely outperforms the iterative, model fitting result. As mentioned above, this difference in accuracy indicates the merit of having at our disposal DensePose-COCO for discriminative training.
Having established the merit of the DensePose-COCO dataset, we now turn to examining the impact of network architecture on dense pose estimation in-the-wild. In Fig. 12 we summarize our experimental findings using the same RCP measure used in Fig. 12.
We observe firstly that the FCN-based performance in-the-wild (curve ‘DensePose-FCN’) is now dramatically lower than that of the DensePose curve in Fig. 12. Even though we apply a multi-scale testing strategy that fuses probabilities from multiple runs using input images of different scale , the FCN is not sufficiently robust to deal with the variability in object scale.
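The multi-scale testing strategy mentioned above can be pictured for a single pixel as follows. This sketch assumes that the per-class probability maps from each run have already been resampled to a common resolution and that fusion is a plain average; the text does not state the exact fusion rule, and the function name is hypothetical.

```python
def fuse_probabilities(runs):
    """Average per-class probabilities from several runs at one pixel.

    runs: list of probability vectors, one per input scale, all aligned
    to the same pixel. Returns the fused vector and the winning class.
    """
    n = len(runs)
    fused = [sum(column) / n for column in zip(*runs)]
    best = max(range(len(fused)), key=fused.__getitem__)
    return fused, best


# Three scales disagree; averaging lets the consistent class win.
fused, best = fuse_probabilities([
    [0.6, 0.4, 0.0],
    [0.2, 0.7, 0.1],
    [0.3, 0.5, 0.2],
])
```

Averaging smooths out scale-specific mistakes, but every run still sees the whole image, which is consistent with the observation that the FCN remains sensitive to object scale.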
We then observe in curve ‘DensePose-RCNN’ a big boost in performance thanks to switching to a region-based system. The networks up to here have been trained using the sparse set of points that have been manually annotated. In curve ‘DensePose-RCNN-Distillation’ we see that using the dense supervision signal delivered by our DensePose system on the training set yields a substantial improvement. Finally, in ‘DensePose-RCNN-Cascade’ we show the performance achieved thanks to the introduction of cascading (Sec. 3.3), which almost matches the ‘DensePose’ curve of Fig. 12.
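| Method | AP | AP50 | AP75 | APM | APL | AR | AR50 | AR75 | ARM | ARL |
|---|---|---|---|---|---|---|---|---|---|---|
| Multi-task learning | | | | | | | | | | |
| DensePose + masks | 51.9 | 85.5 | 54.7 | 39.4 | 53.9 | 61.1 | 89.7 | 65.5 | 42.0 | 62.4 |
| DensePose + keypoints | 52.8 | 85.6 | 56.2 | 42.2 | 54.7 | 62.6 | 89.8 | 67.7 | 45.4 | 63.7 |
| Multi-task learning with cascading | | | | | | | | | | |
| DensePose + masks | 52.8 | 85.5 | 56.1 | 40.3 | 54.6 | 62.0 | 89.7 | 67.0 | 42.4 | 63.3 |
| DensePose + keypoints | 55.8 | 87.5 | 61.2 | 48.4 | 57.1 | 63.9 | 91.0 | 69.7 | 50.3 | 64.8 |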
This is a remarkably positive result: as described in Sec. 4.1, the ‘DensePose’ curve corresponds to a very privileged evaluation, involving (a) cropping objects around their ground-truth boxes and fixing their scale (b) removing background variation from both training and testing, by using ground-truth object masks and (c) ensembling over scales. It can therefore be understood as an upper bound of what we could expect to obtain when operating in-the-wild. We see that our best system is marginally below that level of performance, which clearly reveals the power of the three modifications we introduce, namely region-based processing, inpainting the supervision signal, and cascading.
In Table 1 we report the AP and AR metrics described in Sec. 2 as we change different choices in our architecture. We have conducted experiments using both ResNet-50 and ResNet-101 backbones and observed only an insignificant boost in performance with the larger model (first two rows in Table 1). The rest of our experiments are therefore based on the ResNet-50-FPN version of DensePose-RCNN. The following two experiments shown in the middle section of Table 1 indicate the impact of multi-task learning.
Augmenting the network with the mask or keypoint branches yields improvements with any of these two auxiliary tasks. The last section of Table 1 reports improvements in dense pose estimation obtained through cascading using the network setup from Fig. 8. Incorporating additional guidance in particular from the keypoint branch significantly boosts performance.
In this section we provide additional qualitative results to further demonstrate the performance of our method. In Fig. 13 we show qualitative results generated by our method, where the correspondence is visualized in terms of ‘fishnets’, namely isocontours of estimated UV coordinates that are superimposed on humans. As these results indicate, our method is able to handle large amounts of occlusion, scale, and pose variation, while also successfully hallucinating the human body behind clothes such as dresses or skirts.
In this work we have tackled the task of dense human pose estimation using discriminatively trained models. We have introduced COCO-DensePose, a large-scale dataset of ground-truth image-surface correspondences, and developed novel architectures that allow us to recover highly-accurate dense correspondences between images and the body surface at multiple frames per second. We anticipate that this will pave the way for downstream tasks in augmented reality or graphics, and also help us tackle the general problem of associating images with semantic 3D object representations.
We thank the authors of  for sharing their code, Piotr Dollar for guidance and proposals related to our dataset’s quality, Tsung-Yi Lin for his help with COCO-related issues and H. Yiğit Güler for his help with backend development.