MEBOW: Monocular Estimation of Body Orientation In the Wild

11/27/2020
by Chenyan Wu, et al. (Penn State University and Amazon)

Body orientation estimation provides crucial visual cues in many applications, including robotics and autonomous driving. It is particularly desirable when 3-D pose estimation is difficult to infer due to poor image resolution, occlusion, or indistinguishable body parts. We present COCO-MEBOW (Monocular Estimation of Body Orientation in the Wild), a new large-scale dataset for orientation estimation from a single in-the-wild image. The body-orientation labels for around 130K human bodies within 55K images from the COCO dataset have been collected using an efficient and high-precision annotation pipeline. We also validated the benefits of the dataset. First, we show that our dataset can substantially improve the performance and the robustness of a human body orientation estimation model, the development of which was previously limited by the scale and diversity of the available training data. Additionally, we present a novel triple-source solution for 3-D human pose estimation, where 3-D pose labels, 2-D pose labels, and our body-orientation labels are all used in joint training. Our model significantly outperforms state-of-the-art dual-source solutions for monocular 3-D human pose estimation, where training only uses 3-D pose labels and 2-D pose labels. This substantiates an important advantage of MEBOW for 3-D human pose estimation, which is particularly appealing because the per-instance labeling cost for body orientations is far less than that for 3-D poses. This work demonstrates the high potential of MEBOW in addressing real-world challenges involving understanding human behaviors. Further information about this work is available at https://chenyanwu.github.io/MEBOW/.


1 Introduction

Figure 1: Overview of the MEBOW dataset. (a) Distribution of the body orientation labels in the dataset, with examples. (b) Comparison of the distribution of the captured human-body instance resolution for our dataset and for the TUD dataset [6]; the horizontal axis is a measure of bounding-box size computed from the width and height (in pixels) of the human body instance bounding box.

Human body orientation estimation (HBOE) aims at estimating the orientation of a person with respect to the camera point of view. It is important for a number of industrial applications, e.g., robots interacting with people and autonomous vehicles cruising through crowded urban areas. Given a predicted 3-D human pose, commonly in the form of a skeleton with dozens of joints, the body orientation can be inferred. Hence, one may argue that HBOE is a simpler task than 3-D human pose estimation and is directly solvable using pose estimation models. Nonetheless, HBOE deserves to be tackled as a standalone problem for three reasons. First, the 3-D pose may be difficult to infer due to poor image resolution, occlusion, or indistinguishable body parts, all of which are prevalent in in-the-wild images. Second, under certain scenarios, the orientation of the body is already a sufficient cue for downstream prediction or planning tasks. Third, the much lower computational cost of a body orientation model compared with a 3-D pose model makes it more appealing for on-device deployment. Moreover, body orientation estimation and 3-D pose estimation may be complementary in addressing real-world challenges involving understanding human behaviors.

HBOE has been studied in recent years [6, 8, 10, 14, 18, 19, 27, 33, 45, 53, 54]. A primary bottleneck, however, is the lack of a large-scale, high-precision, diverse-background dataset. Previously, the TUD dataset [6] has been the most widely used dataset for HBOE. But it is small in scale, and its orientation labels are of low precision because they are quantized into eight bins/classes. Hara et al. [18] relabeled the TUD dataset with continuous orientation labels, but the scale limitation remained unaddressed, and we verify experimentally that a model trained on it generalizes much worse to in-the-wild images than one trained on the much larger dataset we present here. Because body orientation can be inferred from a 3-D pose label (in the form of a list of 3-D coordinates for predefined joints), 3-D human pose datasets, e.g., Human3.6M, could be used to train body orientation estimation models after necessary preprocessing. However, those datasets are commonly recorded only indoors (due to the constraint of motion capture systems), with a clean background, few occlusions, and a limited number of human subjects. All of these limitations make it unlikely for body orientation models developed on existing 3-D pose datasets to generalize well to images captured in the wild, in which various occlusions, lighting conditions, and poses can arise. Given the enormous success of large-scale datasets in advancing vision research, such as ImageNet [13] for image classification, KITTI [15] for optical flow, and COCO [26] for object recognition and instance segmentation, among many others, we believe the creation of a large-scale, high-precision dataset is urgently needed for the development of HBOE models, particularly data-hungry deep learning-based ones.

In this paper, we present the COCO-MEBOW (Monocular Estimation of Body Orientation in the Wild) dataset, which consists of high-precision body orientation labels for around 130K human instances within 55K images from the COCO dataset [26]. Our dataset uses 72 bins to partition the 360° orientation range, with each bin covering only 5°, which is much more fine-grained than all previous datasets while remaining within the limits of human cognition. The distribution of the collected orientation labels and some example cropped images of human bodies in our dataset are shown in Fig. 1(a). Details and the creation process will be introduced in Sec. 3.2. For brevity, we will call our dataset MEBOW in the rest of this paper.
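As a concrete illustration of this binning scheme, the snippet below maps a continuous orientation angle to one of the 72 bins; it is a minimal sketch, and the exact bin boundaries and any angular offset used in the official annotation pipeline may differ.

```python
# Minimal sketch of the 72-bin (5-degree) quantization described above.
# The exact bin boundaries/offsets of the official MEBOW annotations may differ.
NUM_BINS = 72
BIN_WIDTH = 360.0 / NUM_BINS  # 5 degrees per bin

def angle_to_bin(theta_deg: float) -> int:
    """Map a continuous orientation angle (degrees) to a bin index in [0, 71]."""
    return int((theta_deg % 360.0) // BIN_WIDTH)

def bin_center(bin_idx: int) -> float:
    """Center angle (degrees) of a given bin."""
    return (bin_idx + 0.5) * BIN_WIDTH
```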

To demonstrate the value of our dataset, we conducted two sets of experiments. The first set of experiments focused on HBOE itself. We first present a strong but simple baseline model for HBOE which is able to outperform previous state-of-the-art models [53] on the TUD dataset (with continuous orientation labels). We then compare the performance of our baseline model under four settings: training on TUD and evaluating on MEBOW, training on MEBOW and evaluating on TUD, training on TUD and evaluating on TUD, training on MEBOW and evaluating on MEBOW. We observe that the model trained on MEBOW generalizes well to TUD but not vice versa.

The second set of experiments focused on demonstrating the feasibility of boosting 3-D human pose estimation performance by using our dataset as an additional, relatively low-cost source of supervision. Our model is based on existing work on weakly-supervised 3-D human pose estimation that uses both a 3-D pose dataset and a 2-D pose dataset as sources of supervision, and the core of our model is a novel orientation loss which enables us to leverage the body orientation dataset as an additional source of supervision. We demonstrate in Sec. 4.2 that our triple-source weakly-supervised learning approach brings significant performance gains over the baseline dual-source weakly-supervised learning approach. This shows that our dataset can be useful not only for HBOE but also for other vision tasks, among which the gain in 3-D pose estimation is demonstrated in this paper.

Our main contributions are summarized as follows.

  1. We present MEBOW, a large-scale high-precision human body orientation dataset.

  2. We established a simple baseline model for HBOE, which, when trained with MEBOW, is shown to significantly outperform state-of-the-art models trained on existing datasets.

  3. We developed the first triple-source solution for 3-D human pose estimation using our dataset as one of the three supervision sources, and it significantly outperforms a state-of-the-art dual-source solution for 3-D human pose estimation. This not only further demonstrates the usefulness of our dataset but also points out and validates a new direction for improving 3-D human pose estimation by using significantly lower-cost labels (i.e., body orientation).

2 Related Work

Human body orientation datasets. The TUD multi-view pedestrians dataset [6] is the most widely used dataset for benchmarking HBOE models. Most recent HBOE algorithms, e.g., [6, 18, 19, 53], use it for training and evaluation. This dataset consists of images captured outdoors, each containing one or more pedestrians, each of which is labeled with a bounding box and a body orientation. The body orientation labels only have eight bins, i.e., front, back, left, right, diagonal front, diagonal back, diagonal left, diagonal right. This labeling is rather coarse-grained, and many of the images are gray-scale. Later work [18] enhanced the TUD dataset by providing continuous orientation labels, each of which is averaged from the orientation labels collected from five different labelers. There are also some other, less used datasets for HBOE. Their limitations, however, make them only suitable for HBOE under highly constrained settings rather than for in-the-wild applications. For example, the 3DPes dataset [7] and the CASIA gait dataset [41] have been used in [53] and [42, 27], respectively, and their body orientation labels are also coarse-grained, bin-based annotations. Moreover, the human bodies in the images of these two datasets are all walking pedestrians captured from a downward viewpoint by one or a few fixed outdoor surveillance cameras. The MCG-RGBD dataset [28] has a wider diversity of poses and provides depth maps in addition to the RGB images, but all its images were captured indoors and from only a small number of subjects. Because human orientation can be computed given a full 3-D pose skeleton, a human 3-D pose dataset, e.g., the Human3.6M dataset [20], can be converted to a body orientation dataset for HBOE research. However, due to the constraints of motion capture systems, those 3-D pose datasets often cover only indoor scenes and contain sampled video frames of only a few subjects. These constraints make them not as rich as our MEBOW dataset, which is based on COCO [26], in both contextual information and the variety of backgrounds. The Human3.6M dataset [20] also involves far fewer distinct human subjects than MEBOW.

Human body orientation estimation algorithms. Limited by the relatively small size and the coarse-grained, bin-based orientation labels of the existing datasets discussed above, approaches based on feature engineering and traditional classifiers [6, 45, 14, 33, 10, 54, 8], e.g., SVMs, have been favored for HBOE. Deep learning-based methods [42, 12] also treat HBOE as a classification problem. For example, the method in [42] uses a shallow classification network to predict which of the eight bins represents the orientation of an input, and the method in [12] similarly uses a small neural network as the classifier. These methods all use simple network architectures due to the small size of the available training datasets, and the resulting models only work in highly constrained environments similar to those used for collecting the training images. Given the continuous orientation labels provided by [18] for the TUD dataset, some recent work [18, 19, 53] has attempted more fine-grained body orientation prediction. Most notably, Yu et al. [53] utilize the key points detected by a separate 2-D pose model as an additional cue for continuous orientation prediction. Still, deep learning-based methods are held back by the lack of a large-scale HBOE dataset. Direct prediction of body orientation from an image is worthwhile: not only is labeling a training dataset simpler, but better performance can also be achieved by directly addressing the orientation estimation problem. As supporting evidence, [16] shows that a CNN and Fisher encoding-based method operating on features extracted from 2-D images outperforms state-of-the-art methods based on 3-D information (e.g., 3-D CAD models or 3-D landmarks) on multiple object orientation estimation problems.

3-D pose estimation. The lack of large training data covering diverse settings is a major problem for robust 3-D pose estimation. Efforts [52, 30, 43, 56, 49, 48] have been made to address this by using additional sources of supervision, mainly 2-D pose datasets (e.g., MPII [5]). The general idea is to design a loss for data with weak labels (2-D poses) that penalizes incorrect 3-D pose predictions on those additional data, which contain much more diverse human subjects and backgrounds, so that the learned model generalizes better. Our work shows a new direction along this line of research: using our large-scale, high-precision, cost-effective body orientation dataset as a new source of weak supervision. Other ideas complementary to the above for improving 3-D pose estimation include: (1) enforcing extra prior knowledge such as a parameterized 3-D human mesh model [17, 24, 9, 23, 22, 35, 38], ordinal depth [36], and temporal information (such as adjacent-frame consistency) [25, 39]; and (2) leveraging images simultaneously captured from different views [40, 21], mainly for indoor datasets collected in highly constrained environments (e.g., Human3.6M).

3 The Method

3.1 Definition of Body Orientation

Figure 2: Definition of body orientation.

Previous datasets, including TUD, all assume that the human body orientation is self-explanatory from the image, which is adequate for a small dataset with a consistent camera point of view. For a large dataset of in-the-wild images containing all kinds of human poses and camera points of view, a formal definition of the human orientation is necessary for both annotation and modeling. As illustrated in Fig. 2, without loss of generality, we define the human orientation $\theta$ as the angle between the projection of the chest-facing direction $\vec{r}$ onto the $y$-$z$ plane and the direction of the $z$ axis, where the $x$, $y$, $z$ axes are defined by the image plane and the orientation of the camera. Given a 3-D human pose, the chest-facing direction can be computed by $\vec{r} = \vec{s} \times \vec{t}$, where $\vec{s}$ is the shoulder direction, defined by the vector from the right shoulder to the left one, and $\vec{t}$ is the torso direction, defined by the vector from the midpoint of the left- and right-shoulder joints to the midpoint of the left- and right-hip joints.
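For concreteness, the following is a minimal sketch of this computation, assuming joint coordinates given in the camera frame as (x, y, z), the projection taken onto the y-z plane, and the angle measured from the z axis as described above; the function and argument names are illustrative and not taken from the released MEBOW code.

```python
import numpy as np

# Sketch of the orientation definition above, assuming camera-frame joints given
# as (x, y, z), the projection taken onto the y-z plane, and the angle measured
# from the z axis (Fig. 2). Names are illustrative, not from the MEBOW code.

def body_orientation_deg(l_shoulder, r_shoulder, l_hip, r_hip):
    """Compute the body orientation, in degrees in [0, 360), from 3-D joints."""
    l_shoulder, r_shoulder = np.asarray(l_shoulder), np.asarray(r_shoulder)
    l_hip, r_hip = np.asarray(l_hip), np.asarray(r_hip)
    s = l_shoulder - r_shoulder                                  # shoulder direction
    t = (l_shoulder + r_shoulder) / 2.0 - (l_hip + r_hip) / 2.0  # torso direction
    r = np.cross(s, t)                                           # chest-facing direction
    theta = np.degrees(np.arctan2(r[1], r[2]))                   # angle in the y-z plane from the z axis
    return theta % 360.0
```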

3.2 MEBOW Dataset Creation

We choose the COCO dataset [26] as the source of images for orientation labeling for the following reasons. First, the COCO dataset has rich contextual information, and the diversity of human instances captured within it in terms of poses, lighting conditions, occlusion types, and backgrounds makes it suitable for developing and evaluating models for body orientation estimation in the wild. Second, the COCO dataset already has bounding box labels for human instances, making body orientation labeling easier. To make our dataset large scale, after excluding ambiguous human instances, we labeled all suitable human instances within the roughly 55K source images and split the labeled images and their associated human instances into a training set and a test set. To our knowledge, MEBOW is the largest HBOE dataset; the number of labeled human instances in our dataset is more than an order of magnitude larger than that of TUD. To make our dataset high precision, we choose a 72-bin annotation scheme, which not only is much more fine-grained than the coarse bin-based annotation used by other HBOE datasets, but also accounts for the cognitive limits of human labelers and the variance of labels between different labelers. Fig. 1(a) shows the distribution of our orientation labels, along with some example human instances. It can be seen that our dataset covers all possible body orientations, with a Gaussian-like peak around the front-facing orientation, which is natural because photos with humans tend to capture the main person from the front. Another advantage of our dataset is that the image resolution of the labeled human instances is much more diverse than in all previous datasets, as shown in Fig. 1(b). This is especially helpful for training models for practical applications in which both high- and low-resolution human instances can be captured, because the distance between the camera and the subject and the weather conditions can both vary. We summarize the main advantages of MEBOW over previous HBOE datasets in Table 1.

Dataset      # subjects   # bins   Diversity   Occlusion
TUD [6]                   8
3DPes [7]
CASIA [41]
MEBOW        ~130K        72       ✓✓✓         ✓✓✓
Table 1: Comparison of previous HBOE datasets with MEBOW. Continuous body orientation labels of TUD are provided by [18].

Annotation tool. The annotation tool we used for labeling body orientation is illustrated in Fig. A1 of Appendix A. On the left side, one image from the dataset containing human body instance(s) is displayed at the top, and the associated cropped human instances are displayed at the bottom, from which the labeler selects which human instance to label with a mouse click. In the middle, the selected cropped human instance is displayed. On the right side, a slider is provided to adjust the orientation label over the full orientation range, together with a clock-like circle and a red arrow visualizing the currently labeled orientation. The labeler first adjusts the slider with the mouse for coarse-grained orientation selection and then clicks either the "clockwise ++" or "counter-clockwise ++" button (or uses the associated keyboard shortcuts) for fine-grained adjustments. The red arrow serves as a visual reference such that the labeler can compare it with the human body shown in the middle to ensure that the final orientation label is an accurate record of his/her judgment. To maximize label consistency, in the bottom-right corner, the labeler can refer to example human body instances already labeled with the same orientation the labeler currently selects.

Evaluation method. Given our high-precision 72-bin annotation, we propose to add Accuracy-5°, Accuracy-15°, and Accuracy-30° as new evaluation metrics, where Accuracy-X° is defined as the percentage of samples whose predicted orientation is within X° of the ground-truth orientation. As discussed in [18], the mean absolute error (MAE) of the angular distance can be strongly influenced by a few large errors. Accuracy-X°, in contrast, is less sensitive to outliers and hence deserves more attention as an evaluation criterion.
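A minimal sketch of these metrics is given below; the circular angular distance and the 5°/15°/30° thresholds follow the description above, while the function names are illustrative.

```python
import numpy as np

# Sketch of the HBOE evaluation metrics: circular MAE and Accuracy-X, where a
# prediction is counted as correct if its angular distance to the ground truth
# is at most X degrees. Function names are illustrative.

def angular_distance(pred_deg, gt_deg):
    """Smallest absolute angular difference, in [0, 180] degrees."""
    d = np.abs(np.asarray(pred_deg, dtype=float) - np.asarray(gt_deg, dtype=float)) % 360.0
    return np.minimum(d, 360.0 - d)

def evaluate(pred_deg, gt_deg, thresholds=(5.0, 15.0, 30.0)):
    d = angular_distance(pred_deg, gt_deg)
    mae = float(d.mean())
    accs = {f"Accuracy-{t:g}": float((d <= t).mean() * 100.0) for t in thresholds}
    return mae, accs
```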

3.3 Baseline HBOE Model

As in most previous work on HBOE, our baseline model assumes that the human instances have already been detected and the input is a cropped-out human instance. The cropping can be based on either the ground-truth or a predicted bounding box. For ease of experimentation, we used the ground-truth bounding boxes provided by the COCO dataset in all of our experiments.

Figure 3: Our baseline HBOE model. (a) Network architecture. We adopt HRNet and ResNet units as the backbone network and the head network, respectively; intermediate feature representations are combined and fed into the head network. (b) Illustration of the orientation bins (black ticks) and our orientation loss for regressing the predicted probabilities to the "circular" Gaussian target probability function.

The overall network architecture of our baseline model is shown in Fig. 3(a); it can be trained end-to-end. The cropped image of a subject is first processed by a backbone network serving as the feature extractor. The extracted features are then concatenated and processed by a few more residual layers, followed by a fully connected layer and a softmax layer. The output is a vector of 72 neurons, $p = (p_1, \ldots, p_{72})$, where $p_i$ represents the probability that the $i$-th orientation bin in Fig. 3(b), i.e., the corresponding 5° range, best represents the body orientation of the input image. As for the objective function, our approach differs from previous approaches that either directly regress the orientation parameter (Approaches 1 and 2 of [19]) or treat orientation estimation as a pure classification problem (Approach 3 of [19], and [18]), where each bin is a different class. Instead, we take inspiration from the heat map regression idea, which has been extremely successful in key-point estimation [34, 46], and let the loss function for $p$ be

$$\mathcal{L}_{hboe} = \sum_{i=1}^{72} \left( p_i - g(i) \right)^2, \qquad (1)$$

where $g(i)$ is the "circular" Gaussian target probability, as illustrated in Fig. 3(b) (red curve):

$$g(i) = \exp\!\left( -\frac{\min\left(|i - i^*|,\ 72 - |i - i^*|\right)^2}{2\sigma^2} \right), \qquad (2)$$

and $i^*$ is the ground-truth orientation bin. Basically, we regress a Gaussian function centered at the ground-truth orientation bin, the intuition being that the closer an orientation bin is to the ground-truth bin $i^*$, the higher the probability the model should assign to it. We found this approach significantly eased the learning process of the neural network. Of note, we also attempted to use a standard classification loss, e.g., the cross-entropy loss between $p$ and a one-hot encoding of the ground truth, but the loss failed to converge.
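The following is a minimal PyTorch sketch of this objective (Eqs. 1-2): it builds the circular Gaussian target for a batch of ground-truth bins and computes the mean-squared error against the predicted 72-way probabilities. The default σ and other details are illustrative and may differ from the released implementation.

```python
import torch

# Sketch of the baseline training objective (Eqs. 1-2): regress the 72-way
# prediction to a "circular" Gaussian target centered at the ground-truth bin.
# The default sigma and normalization details are illustrative.

NUM_BINS = 72

def circular_gaussian_target(gt_bin: torch.Tensor, sigma: float = 4.0) -> torch.Tensor:
    """Target distribution g for a batch of ground-truth bin indices (Eq. 2)."""
    bins = torch.arange(NUM_BINS, device=gt_bin.device, dtype=torch.float32)   # (72,)
    diff = torch.abs(bins.unsqueeze(0) - gt_bin.float().unsqueeze(1))          # (B, 72)
    circ = torch.minimum(diff, NUM_BINS - diff)                                # circular bin distance
    return torch.exp(-circ ** 2 / (2.0 * sigma ** 2))

def hboe_loss(pred_prob: torch.Tensor, gt_bin: torch.Tensor, sigma: float = 4.0) -> torch.Tensor:
    """Mean squared error between predicted probabilities and the target (Eq. 1)."""
    return torch.mean((pred_prob - circular_gaussian_target(gt_bin, sigma)) ** 2)
```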

Choice of network architecture. We also considered ResNet-50 and ResNet-101 (initialized from the weights of a model trained for the ImageNet classification task) as the architecture of our network. We observe that HRNet+Head provides much better performance in our experiments. This can be explained by the fact that HRNet and its pretrained model are also trained on COCO images and designed for a closely related task, 2-D pose estimation.

3.4 Enhancing 3-D Pose Estimation

It is extremely difficult to obtain 3-D joint labels for in-the-wild images using existing technologies, hence models trained on indoor 3-D pose datasets generalize poorly to in-the-wild images, such as COCO images. There have been attempts [47, 48] to leverage 2-D pose datasets, such as MPII and COCO, as a second source of supervision to enhance both the performance and robustness of 3-D pose models. We believe the orientation labels in our COCO-based dataset are complementary to the 2-D pose labels and provide additional supervision. To that end, we developed a triple-source weakly-supervised solution for 3-D pose estimation, the core of which is a body orientation loss for utilizing the orientation labels.

We choose [48] as the base on which to build our model. Following their notation, we denote $\mathbf{p} = (x, y, z)$ to be the coordinates of any location in the heat map volume, and $\tilde{H}_k$ (of size $D \times H \times W$) to be the normalized 3-D heat map for joint $k$ output by the backbone network. Then, the predicted location for joint $k$ is

$$\mathbf{J}_k = \sum_{\mathbf{p}} \mathbf{p} \cdot \tilde{H}_k(\mathbf{p}). \qquad (3)$$

An $L_1$ loss $\mathcal{L}_{3D}$ between $\mathbf{J}_k$ and the ground truth can then be used to supervise the network for images with 3-D pose labels. For images with only 2-D pose labels, the $x$ heat vector and $y$ heat vector are obtained by marginalizing $\tilde{H}_k$ over the remaining dimensions:

$$\tilde{H}^x_k(x) = \sum_{y}\sum_{z} \tilde{H}_k(x, y, z), \qquad (4)$$
$$\tilde{H}^y_k(y) = \sum_{x}\sum_{z} \tilde{H}_k(x, y, z), \qquad (5)$$

and an $L_1$ loss $\mathcal{L}_{2D}$ on the $x$ and $y$ coordinates computed from these heat vectors can be used to supervise the network for images with 2-D pose labels.
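A minimal PyTorch sketch of this integral (soft-argmax) step, following the formulation of [48], is given below; the tensor layout (batch, joints, depth, height, width) is an assumption for illustration.

```python
import torch

# Sketch of integral (soft-argmax) joint regression (Eqs. 3-5), following [48].
# `heatmaps` is a softmax-normalized 3-D heat map of shape (B, K, D, H, W):
# B images, K joints, D depth bins, and an H x W spatial grid (assumed layout).

def integral_joints(heatmaps: torch.Tensor) -> torch.Tensor:
    B, K, D, H, W = heatmaps.shape
    dev, dt = heatmaps.device, heatmaps.dtype
    # Marginalize to one-dimensional heat vectors per coordinate (cf. Eqs. 4-5).
    hx = heatmaps.sum(dim=(2, 3))   # (B, K, W)
    hy = heatmaps.sum(dim=(2, 4))   # (B, K, H)
    hz = heatmaps.sum(dim=(3, 4))   # (B, K, D)
    # Expected coordinate along each axis (Eq. 3).
    x = (hx * torch.arange(W, device=dev, dtype=dt)).sum(dim=2)
    y = (hy * torch.arange(H, device=dev, dtype=dt)).sum(dim=2)
    z = (hz * torch.arange(D, device=dev, dtype=dt)).sum(dim=2)
    return torch.stack([x, y, z], dim=-1)  # (B, K, 3) predicted joint locations
```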

We now define the loss function for images with orientation labels. For ease of notation, we use $\mathbf{J}_{ls}$, $\mathbf{J}_{rs}$, $\mathbf{J}_{lh}$, and $\mathbf{J}_{rh}$ to denote the coordinates of the left shoulder, right shoulder, left hip, and right hip, respectively, as predicted by Eq. 3. Then, following the definition in Sec. 3.1 and Fig. 2, the estimated shoulder vector and torso vector can be represented by

$$\vec{s} = \mathbf{J}_{ls} - \mathbf{J}_{rs}, \qquad (6)$$
$$\vec{t} = \frac{\mathbf{J}_{ls} + \mathbf{J}_{rs}}{2} - \frac{\mathbf{J}_{lh} + \mathbf{J}_{rh}}{2}, \qquad (7)$$

and the chest-facing direction can be computed by

$$\vec{r} = \frac{\vec{s} \times \vec{t}}{\left\| \vec{s} \times \vec{t} \right\|}, \qquad (8)$$

where $\|\cdot\|$ is the Euclidean norm. Since the estimated orientation angle $\hat{\theta}$ defined in Fig. 2 can be computed by projecting $\vec{r}$ onto the $y$-$z$ plane, the following equations hold:

$$\cos\hat{\theta} = \frac{r_z}{\sqrt{r_y^2 + r_z^2}}, \qquad (9)$$
$$\sin\hat{\theta} = \frac{r_y}{\sqrt{r_y^2 + r_z^2}}. \qquad (10)$$

We define the orientation loss to be

$$\mathcal{L}_{ori} = \left| \cos\hat{\theta} - \cos\theta \right| + \left| \sin\hat{\theta} - \sin\theta \right|, \qquad (11)$$

where $\theta$ is the ground-truth orientation label. Finally, $\mathcal{L}_{3D}$, $\mathcal{L}_{2D}$, and $\mathcal{L}_{ori}$ can be used jointly with proper weighting such that the three sources of supervision, i.e., 3-D pose labels, 2-D pose labels, and orientation labels, are all used for training a robust 3-D pose estimation model.
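A minimal PyTorch sketch of the orientation loss (Eqs. 6-11) is given below; the joint indices, the (x, y, z) coordinate layout, and the L1 form of the penalty are illustrative assumptions rather than the exact released implementation.

```python
import torch

# Sketch of the orientation loss (Eqs. 6-11). `joints` is a (B, K, 3) tensor of
# predicted (x, y, z) coordinates, e.g., from integral_joints(); the joint
# indices below and the L1 penalty on sin/cos are illustrative assumptions.

LS, RS, LH, RH = 11, 14, 4, 1   # left/right shoulder, left/right hip (placeholder indices)

def orientation_loss(joints: torch.Tensor, theta_gt_deg: torch.Tensor) -> torch.Tensor:
    s = joints[:, LS] - joints[:, RS]                                                  # shoulder vector (Eq. 6)
    t = 0.5 * (joints[:, LS] + joints[:, RS]) - 0.5 * (joints[:, LH] + joints[:, RH])  # torso vector (Eq. 7)
    r = torch.cross(s, t, dim=1)
    r = r / (r.norm(dim=1, keepdim=True) + 1e-8)                                       # chest direction (Eq. 8)
    norm_yz = torch.sqrt(r[:, 1] ** 2 + r[:, 2] ** 2 + 1e-8)
    cos_hat = r[:, 2] / norm_yz                                                        # Eq. 9
    sin_hat = r[:, 1] / norm_yz                                                        # Eq. 10
    theta = torch.deg2rad(theta_gt_deg)
    return (torch.abs(cos_hat - torch.cos(theta)) +
            torch.abs(sin_hat - torch.sin(theta))).mean()                              # Eq. 11
```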

4 Experimental Results

The proposed MEBOW dataset has been tested in two sets of experiments to demonstrate its usefulness. In Sec. 4.1, we show how MEBOW can help advance HBOE using the baseline model proposed in Sec. 3.3. In Sec. 4.2, we show how MEBOW can help improve 3-D body pose estimation using the triple-source weakly-supervised solution described in Sec. 3.4.

Implementation. All the code used in the experiments was implemented with PyTorch [3]. For the HBOE experiments in Sec. 4.1, the ResNet backbone is based on the public code [4] and is initialized from an ImageNet-pretrained model; the HRNet backbone is based on the public code [1] and is initialized from a model pretrained for COCO 2-D pose estimation. The same input image preprocessing steps are applied to the MEBOW and TUD datasets, including normalizing the input images to a fixed resolution, as well as flipping and scaling augmentation. We use the Adam optimizer with a fixed learning rate to train the network. For the 3-D pose estimation experiments described in Sec. 4.2, our code is based on the public code [2]. The network is initialized from an ImageNet-pretrained model, and input images are normalized to a fixed resolution. Rotation, flipping, and scaling are used to augment Human3.6M and MPII; to avoid corrupting the orientation labels, we do not apply rotation augmentation to the images in MEBOW. The network is trained for 300 epochs with the Adam optimizer at a constant learning rate.

4.1 Body Orientation Estimation

First, we validate the baseline model proposed in Sec. 3.3. Specifically, we train it on the TUD dataset and compare its performance with other state-of-the-art models reported in the literature. The results are shown in Table 2. Our model significantly outperforms all of the other models in terms of MAE, Accuracy-22.5°, and Accuracy-45°, which are standard metrics on the TUD dataset. This can be attributed to both our novel loss function for regressing the target "circular" Gaussian probability and the power of HRNet [46] and its pretrained model.

Method   MAE   Acc.-22.5°   Acc.-45°
AKRF-VW [18] 34.7 68.6 78
DCNN [19] 26.6 70.6 86.1
CPOEHK [53] 15.3 75.7 96.8
ours 8.4 95.1 99.7
Human [18] 0.93 90.7 99.3
Table 2: HBOE evaluation on the TUD dataset (with continuous orientation labels). Ours was trained on the TUD training set and evaluated on its test set. We converted the continuous orientation labels to the 72-bin orientation labels illustrated in Fig. 3.

To show the advantage of MEBOW over TUD in terms of diverse backgrounds and rich in-the-wild environments, we train our baseline model under four settings to compare the generalization capability of the same architecture (our proposed baseline model) trained on TUD and on MEBOW. The results are shown in Table 3. The performance drop of our baseline model trained on the TUD training set when evaluated on the MEBOW test set versus on the TUD test set is much larger than the drop of the same model trained on the MEBOW training set when evaluated on the TUD test set versus on the MEBOW test set. This suggests that the improved diversity and the inclusion of more challenging cases in MEBOW (compared with TUD) actually help improve the robustness of models. We even observe that one of the accuracy metrics for our model trained on MEBOW improves slightly when evaluated on TUD compared with on MEBOW. We also observe that the performance of our model trained only on MEBOW (Table 3) can exceed the previous state-of-the-art result on TUD (Table 2). Experiments of a similar fashion and motivation were conducted in [26] to demonstrate the advantage of the COCO dataset.

Training   Testing   MAE   Acc.-22.5°   Acc.-45°
TUD TUD     8.4 95.1 99.7
TUD MEBOW    
MEBOW MEBOW     8.4 93.9 98.2
MEBOW TUD    
Table 3: Comparison of the generalization capability of the same model trained on TUD and on MEBOW.

As for the choice of the network architecture and the parameter $\sigma$ in Eq. 2, we conducted ablation experiments for both, with the results summarized in Table 4. HRNet+Head (initialized with pretrained weights for the COCO 2-D pose estimation task) gives significantly better results than ResNet-50 or ResNet-101, and a properly chosen $\sigma$ leads to the best-performing model. Hence, we used the model with HRNet+Head and the best-performing $\sigma$ for the experiments associated with Table 2 and Table 3. Some qualitative prediction examples of this model are presented in Fig. 4.

Architecture   MAE   Acc.-5°   Acc.-15°   Acc.-30°
ResNet-50 10.465 66.9 88.3 94.6
ResNet-101 10.331 67.8 88.2 94.7
HRNet+Head 8.579 69.3 89.6 96.4
8.529 69.6 91.0 96.6
8.427 69.3 90.6 96.7
8.393 68.6 90.7 96.9
8.556 68.2 90.9 96.7
8.865 66.5 90.1 96.6
Table 4: Ablation study on the choice of network architecture and the effect of different $\sigma$ in Eq. 2. Evaluation is done on MEBOW. The rows below HRNet+Head correspond to HRNet+Head with different values of $\sigma$.
Figure 4: HBOE results generated by our baseline model (with HRNet as the backbone) on the MEBOW and TUD datasets, together with prediction results by [18] (cropped directly from the original paper) for comparison. Red arrow: ground truth; Blue arrow: prediction.
Method Dir. Dis. Eat. Gre. Phon. Pose Pur. Sit SitD. Smo. Phot. Wait Walk WalkD. WalkP. Average
Chen et al. [11] 89.9 97.6 90.0 107.9 107.3 139.2 93.6 136.1 133.1 240.1 106.7 106.2 87.0 114.1 90.6 114.2
Tome et al. [50] 65.0 73.5 76.8 86.4 86.3 110.7 68.9 74.8 110.2 172.9 85.0 85.8 86.3 71.4 73.1 88.4
Zhou et al. [55] 87.4 109.3 187.1 103.2 116.2 143.3 106.9 99.8 124.5 199.2 107.4 118.1 114.2 79.4 97.7 79.9
Mehta et al. [30] 59.7 69.7 60.6 68.8 76.4 85.4 59.1 75.0 96.2 122.9 70.8 68.5 54.4 82.0 59.8 74.1
Pavlakos et al. [37] 58.6 64.6 63.7 62.4 66.9 70.8 57.7 62.5 76.8 103.5 65.7 61.6 67.6 56.4 59.5 66.9
Moreno et al. [32] 69.5 80.2 78.2 87.0 100.8 102.7 76.0 69.7 104.7 113.9 89.7 98.5 82.4 79.2 77.2 87.3
Sun et al. [47] 52.8 54.8 54.2 54.3 61.8 53.1 53.6 71.7 86.7 61.5 67.2 53.4 47.1 61.6 53.4 59.1
Sharma et al. [44] 48.6 54.5 54.2 55.7 62.6 72.0 50.5 54.3 70.0 78.3 58.1 55.4 61.4 45.2 49.7 58.0
Moon et al. [31] 50.5 55.7 50.1 51.7 53.9 46.8 50.0 61.9 68.0 52.5 55.9 49.9 41.8 56.1 46.9 53.3
Sun et al. [48] 47.5 47.7 49.5 50.2 51.4 43.8 46.4 58.9 65.7 49.4 55.8 47.8 38.9 49.0 43.8 49.6
Baseline-1 44.4 47.4 49.0 67.7 50.0 41.8 45.6 59.9 92.9 48.8 57.1 65.4 38.7 50.5 42.2 53.4
Baseline-2 46.1 47.8 49.1 66.3 48.0 43.5 46.7 59.3 85.0 47.0 54.0 61.9 38.6 50.1 49.7
ours 44.6 47.1 46.0 60.5 47.7 41.8 46.0 57.8 82.3 47.2 56.0 56.7 38.0 49.5 41.8
Table 5: 3-D human pose estimation evaluation on the Human3.6M dataset using mean per joint position error (MPJPE). Baseline-1 is a re-implementation of Sun et al. [48], trained on Human3.6M + MPII, as in the original paper. Baseline-2 is a re-implementation of Sun et al. [48], trained on Human3.6M + MPII + COCO (2-D pose).
Method Hip Knee Ankle Torso Neck Head Nose Shoulder Elbow Wrist X Y Z (Depth)
Baseline-1 24.6 49.0 73.8 40.6 51.9 55.6 56.9 52.5 66.8 84.8 14.6 19.4 39.8
Baseline-2
ours
Table 6: 3-D human pose estimation per-joint evaluation on the Human3.6M dataset using mean per joint position error (MPJPE). Each error is the average of the corresponding left and right joints.
Method MAE Acc.-5° Acc.-15° Acc.-30°
Baseline-1 26.239 34.7 63.7 77.7
Baseline-2 13.888 31.9 74.5 86.8
ours 11.023 44.8 83.4 94.2
Table 7: 3-D human pose estimation evaluation on the test portion of MEBOW, measured by comparing the body orientation derived from the predicted 3-D pose against the ground-truth orientation labels.

4.2 Enhanced 3-D Body Pose Estimation

Data. We use the Human3.6M dataset (3-D pose), the MPII dataset (2-D pose), the COCO dataset (2-D pose), and our MEBOW orientation labels. We train our triple-source weakly-supervised model proposed in Sec. 3.4 and two dual-source weakly-supervised baseline models for comparison. Both baseline models are trained using a re-implementation of [48], which uses a combination of $\mathcal{L}_{3D}$ and $\mathcal{L}_{2D}$ (defined in Sec. 3.4). The difference is that Baseline-1 only uses the Human3.6M dataset (3-D pose) and the MPII dataset (2-D pose), while Baseline-2 additionally uses the COCO dataset (2-D pose). Our method leverages the orientation labels from our MEBOW dataset on top of the second baseline and uses a combination of $\mathcal{L}_{3D}$, $\mathcal{L}_{2D}$, and $\mathcal{L}_{ori}$. Following the practice of [48], within each batch during stochastic training, we sample the same number of images from the Human3.6M, MPII, and COCO datasets.
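A minimal sketch of this balanced multi-source training loop is given below; the loss-weight values and the way the three data loaders are combined are illustrative assumptions.

```python
from itertools import cycle

# Minimal sketch of balanced triple-source training: every optimization step
# consumes one batch from each supervision source (3-D pose, 2-D pose, and
# MEBOW orientation labels). Loss functions are passed in; the weights are
# illustrative placeholders, not the values used in the paper.

def train_epoch(model, optimizer, loader_3d, loader_2d, loader_ori,
                loss_3d, loss_2d, loss_ori, weights=(1.0, 1.0, 0.1)):
    w3, w2, wo = weights
    for batch_3d, batch_2d, batch_ori in zip(loader_3d, cycle(loader_2d), cycle(loader_ori)):
        loss = (w3 * loss_3d(model, batch_3d) +
                w2 * loss_2d(model, batch_2d) +
                wo * loss_ori(model, batch_ori))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```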

We evaluated and compared our model and the two baselines in multiple ways, both quantitatively and qualitatively. First, we followed Protocol II in [48] and used the mean per joint position error (MPJPE) as the metric to evaluate them on the test set of the Human3.6M dataset. The evaluation results are shown in Table 5, along with the results of other competitive models copied from their papers. We have tried our best to train Baseline-1 but still could not obtain a model with performance as good as that reported in [48]. This, however, does not prevent a fair comparison between Baseline-1, Baseline-2, and our model. From Table 5, we can see that by adding MEBOW as the third (weak) supervision source and using our proposed orientation loss $\mathcal{L}_{ori}$, we achieve a significantly better average MPJPE than both Baseline-1 and Baseline-2. If we break down the MPJPE metric into different motion categories, our approach also achieves the best MPJPE in most of the motion categories. We also break down the MPJPE metric in terms of different joints and the X-, Y-, and Z-components of the joint coordinates in Table 6. For nearly all joints, our method achieves significantly better results. Our method improves the Y- and Z-components of the joint coordinates but is neutral for the X-component, which is not surprising since our orientation loss only considers the y- and z-components of $\vec{r}$ after the projection onto the $y$-$z$ plane in Fig. 2. Some qualitative examples of 3-D pose estimation by our model, along with the ground truth and the predictions of the two baseline models, are displayed in Fig. 5. Second, we evaluate the 3-D pose predictions on the COCO test set. Since the ground-truth 3-D pose is unknown for the COCO dataset, we took a step back and conducted the quantitative evaluation by comparing the orientation computed from the predicted 3-D pose against the ground-truth orientation label provided by our MEBOW dataset. As shown in Table 7, our model significantly outperforms both Baseline-1 and Baseline-2, which suggests that our model for 3-D pose estimation generalizes much better to in-the-wild images. Fig. 6 shows a few qualitative results of 3-D pose prediction on the COCO test set.

Figure 5: Example 3-D pose estimation results on the Human3.6M dataset, comparing the input, the ground truth (G. T.), Baseline-1, Baseline-2, and our model. More example results can be viewed in Appendix E.
Figure 6: Example 3-D pose estimation results on the COCO dataset, comparing the input, Baseline-1, Baseline-2, and our model. More example results can be viewed in Appendix F.

5 Conclusions

We introduced a new COCO-based large-scale, high-precision dataset for human body orientation estimation in the wild. Through extensive experiments, we demonstrated that our dataset can be very useful for both body orientation estimation and 3-D pose estimation. In addition, we presented a simple yet effective model for human body orientation estimation, which can serve as a baseline for future HBOE model development using our dataset, and we proposed a new orientation loss for utilizing body orientation labels as a third supervision source. In the future, it would be interesting to explore how our dataset can be used for other vision tasks, e.g., person re-identification (ReID) and bodily expressed emotion recognition [29].

Acknowledgments

A portion of the computation used the Extreme Science and Engineering Discovery Environment (XSEDE), which is an infrastructure supported by National Science Foundation (NSF) grant number ACI-1548562 [51]. J.Z. Wang was supported by NSF grant no. 1921783.

References

  • [1] Note: https://github.com/leoxiaobin/deep-high-resolution-net.pytorch Cited by: §4.
  • [2] Note: https://github.com/JimmySuen/integral-human-pose Cited by: §4.
  • [3] Note: https://pytorch.org Cited by: §4.
  • [4] Note: https://github.com/pytorch/vision/blob/master/torchvision/models/resnet.py Cited by: §4.
  • [5] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele (2014-06) 2D human pose estimation: new benchmark and state of the art analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Cited by: §2.
  • [6] M. Andriluka, S. Roth, and B. Schiele (2010) Monocular 3D pose estimation and tracking by detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 623–630. Cited by: Figure 1, §1, §2, §2, Table 1.
  • [7] D. Baltieri, R. Vezzani, and R. Cucchiara (2011) 3DPeS: 3D people dataset for surveillance and forensics. In Proceedings of the Joint ACM Workshop on Human Gesture and Behavior Understanding, pp. 59–64. Cited by: §2, Table 1.
  • [8] D. Baltieri, R. Vezzani, and R. Cucchiara (2012) People orientation recognition by mixtures of wrapped distributions on random trees. In Proceedings of the European Conference on Computer Vision, pp. 270–283. Cited by: §1, §2.
  • [9] F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, and M. J. Black (2016) Keep it SMPL: automatic estimation of 3D human pose and shape from a single image. In Proceedings of the European Conference on Computer Vision, pp. 561–578. Cited by: §2.
  • [10] C. Chen, A. Heili, and J. Odobez (2011) Combined estimation of location and body pose in surveillance video. In Proceedings of the IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 5–10. Cited by: §1, §2.
  • [11] C. Chen and D. Ramanan (2017) 3D human pose estimation = 2D pose estimation + matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7035–7043. Cited by: Table A1, Table 5.
  • [12] J. Choi, B. Lee, and B. Zhang (2016) Human body orientation estimation using convolutional neural network. arXiv preprint arXiv:1609.01984. Cited by: §2.
  • [13] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. Cited by: §1.
  • [14] T. Gandhi and M. M. Trivedi (2008) Image based estimation of pedestrian orientation for improving path prediction. In Proceedings of the IEEE Intelligent Vehicles Symposium, pp. 506–511. Cited by: §1, §2.
  • [15] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun (2013) Vision meets robotics: the KITTI dataset. The International Journal of Robotics Research 32 (11), pp. 1231–1237. Cited by: §1.
  • [16] A. Ghodrati, M. Pedersoli, and T. Tuytelaars (2014) Is 2D information enough for viewpoint estimation?. In Proceedings of the British Machine Vision Conference, Vol. 101, pp. 102. Cited by: §2.
  • [17] P. Guan, A. Weiss, A. O. Balan, and M. J. Black (2009) Estimating human shape and pose from a single image. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1381–1388. Cited by: §2.
  • [18] K. Hara and R. Chellappa (2017) Growing regression tree forests by classification for continuous object pose estimation. International Journal of Computer Vision 122 (2), pp. 292–312. Cited by: §1, §2, §2, §3.2, §3.3, Table 1, Figure 4, Table 2.
  • [19] K. Hara, R. Vemulapalli, and R. Chellappa (2017) Designing deep convolutional neural networks for continuous object orientation estimation. arXiv preprint arXiv:1702.01499. Cited by: §1, §2, §2, §3.3, Table 2.
  • [20] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu (2013) Human3.6M: large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (7), pp. 1325–1339. Cited by: §2.
  • [21] K. Iskakov, E. Burkov, V. Lempitsky, and Y. Malkov (2019) Learnable triangulation of human pose. In Proceedings of the IEEE International Conference on Computer Vision, Cited by: §2.
  • [22] A. Kanazawa, M. J. Black, D. W. Jacobs, and J. Malik (2018) End-to-end recovery of human shape and pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7122–7131. Cited by: §2.
  • [23] N. Kolotouros, G. Pavlakos, M. J. Black, and K. Daniilidis (2019) Learning to reconstruct 3D human pose and shape via model-fitting in the loop. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2252–2261. Cited by: §2.
  • [24] C. Lassner, J. Romero, M. Kiefel, F. Bogo, M. J. Black, and P. V. Gehler (2017) Unite the people: closing the loop between 3D and 2D human representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6050–6059. Cited by: §2.
  • [25] J. Lin and G. H. Lee (2019) Trajectory space factorization for deep video-based 3D human pose estimation. Cited by: §2.
  • [26] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In Proceedings of the European Conference on Computer Vision, pp. 740–755. Cited by: §1, §1, §2, §3.2, §4.1.
  • [27] P. Liu, W. Liu, and H. Ma (2017) Weighted sequence loss based spatial-temporal deep learning framework for human body orientation estimation. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), pp. 97–102. Cited by: §1, §2.
  • [28] W. Liu, Y. Zhang, S. Tang, J. Tang, R. Hong, and J. Li (2013) Accurate estimation of human body orientation from RGB-D sensors. IEEE Transactions on Cybernetics 43 (5), pp. 1442–1452. Cited by: §2.
  • [29] Y. Luo, J. Ye, R. B. Adams, J. Li, M. G. Newman, and J. Z. Wang (2019) ARBEE: towards automated recognition of bodily expression of emotion in the wild. International Journal of Computer Vision 128 (1), pp. 1–25. Cited by: §5.
  • [30] D. Mehta, H. Rhodin, D. Casas, O. Sotnychenko, W. Xu, and C. Theobalt (2016) Monocular 3D human pose estimation using transfer learning and improved CNN supervision. arXiv preprint arXiv:1611.09813 1 (3), pp. 5. Cited by: §2, Table 5.
  • [31] G. Moon, J. Y. Chang, and K. M. Lee (2019) Camera distance-aware top-down approach for 3D multi-person pose estimation from a single RGB image. In Proceedings of the IEEE International Conference on Computer Vision, pp. 10133–10142. Cited by: Table A1, Table 5.
  • [32] F. Moreno-Noguer (2017) 3D human pose estimation from a single image via distance matrix regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2823–2832. Cited by: Table A1, Table 5.
  • [33] C. Nakajima, M. Pontil, B. Heisele, and T. Poggio (2003) Full-body person recognition system. Pattern Recognition 36 (9), pp. 1997–2006. Cited by: §1, §2.
  • [34] A. Newell, K. Yang, and J. Deng (2016) Stacked hourglass networks for human pose estimation. In Proceedings of the European Conference on Computer Vision, pp. 483–499. Cited by: §3.3.
  • [35] M. Omran, C. Lassner, G. Pons-Moll, P. Gehler, and B. Schiele (2018) Neural body fitting: unifying deep learning and model based human pose and shape estimation. In Proceedings of the International Conference on 3D Vision (3DV), pp. 484–494. Cited by: §2.
  • [36] G. Pavlakos, X. Zhou, and K. Daniilidis (2018) Ordinal depth supervision for 3D human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7307–7316. Cited by: §2.
  • [37] G. Pavlakos, X. Zhou, K. G. Derpanis, and K. Daniilidis (2017) Coarse-to-fine volumetric prediction for single-image 3D human pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7025–7034. Cited by: Table 5.
  • [38] G. Pavlakos, L. Zhu, X. Zhou, and K. Daniilidis (2018) Learning to estimate 3D human pose and shape from a single color image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 459–468. Cited by: §2.
  • [39] D. Pavllo, C. Feichtenhofer, D. Grangier, and M. Auli (2019) 3D human pose estimation in video with temporal convolutions and semi-supervised training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7753–7762. Cited by: §2.
  • [40] H. Qiu, C. Wang, J. Wang, N. Wang, and W. Zeng (2019) Cross view fusion for 3D human pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4342–4351. Cited by: §2.
  • [41] R. Raman, P. K. Sa, B. Majhi, and S. Bakshi (2016) Direction estimation for pedestrian monitoring system in smart cities: an HMM-based approach. IEEE Access 4, pp. 5788–5808. Cited by: §2, Table 1.
  • [42] M. Raza, Z. Chen, S. Rehman, P. Wang, and P. Bao (2018) Appearance based pedestrians head pose and body orientation estimation using deep learning. Neurocomputing 272, pp. 647–659. Cited by: §2, §2.
  • [43] G. Rogez and C. Schmid (2016) Mocap-guided data augmentation for 3D pose estimation in the wild. In Advances in Neural Information Processing Systems, pp. 3108–3116. Cited by: §2.
  • [44] S. Sharma, P. T. Varigonda, P. Bindal, A. Sharma, and A. Jain (2019) Monocular 3D human pose estimation by generation and ordinal ranking. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2325–2334. Cited by: Table A1, Table 5.
  • [45] H. Shimizu and T. Poggio (2004) Direction estimation of pedestrian from multiple still images. In Proceedings of the IEEE Intelligent Vehicles Symposium, pp. 596–600. Cited by: §1, §2.
  • [46] K. Sun, B. Xiao, D. Liu, and J. Wang (2019) Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §3.3, §4.1.
  • [47] X. Sun, J. Shang, S. Liang, and Y. Wei (2017) Compositional human pose regression. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2602–2611. Cited by: Table A1, §3.4, Table 5.
  • [48] X. Sun, B. Xiao, F. Wei, S. Liang, and Y. Wei (2018) Integral human pose regression. In Proceedings of the European Conference on Computer Vision, pp. 529–545. Cited by: Table A1, Appendix C, §2, §3.4, §3.4, §4.2, §4.2, Table 5.
  • [49] B. Tekin, P. Márquez-Neila, M. Salzmann, and P. Fua (2017) Learning to fuse 2D and 3D image cues for monocular body pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3941–3950. Cited by: §2.
  • [50] D. Tome, C. Russell, and L. Agapito (2017) Lifting from the deep: convolutional 3D pose estimation from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2500–2509. Cited by: Table 5.
  • [51] J. Towns, T. Cockerill, M. Dahan, I. Foster, K. Gaither, A. Grimshaw, V. Hazlewood, S. Lathrop, D. Lifka, G. D. Peterson, et al. (2014) XSEDE: accelerating scientific discovery. Computing in Science & Engineering 16 (5), pp. 62–74. Cited by: Acknowledgments.
  • [52] H. Yasin, U. Iqbal, B. Kruger, A. Weber, and J. Gall (2016) A dual-source approach for 3D pose estimation from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4948–4956. Cited by: §2.
  • [53] D. Yu, H. Xiong, Q. Xu, J. Wang, and K. Li (2019) Continuous pedestrian orientation estimation using human keypoints. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1–5. Cited by: §1, §1, §2, §2, Table 2.
  • [54] G. Zhao, M. Takafumi, K. Shoji, and M. Kenji (2012) Video based estimation of pedestrian walking direction for pedestrian protection system. Journal of Electronics (China) 29 (1-2), pp. 72–81. Cited by: §1, §2.
  • [55] X. Zhou, M. Zhu, G. Pavlakos, S. Leonardos, K. G. Derpanis, and K. Daniilidis (2018) Monocap: monocular human motion capture using a CNN coupled with a geometric prior. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (4), pp. 901–914. Cited by: Table A1, Table 5.
  • [56] X. Zhou, Q. Huang, X. Sun, X. Xue, and Y. Wei (2017) Towards 3D human pose estimation in the wild: a weakly-supervised approach. In Proceedings of the IEEE International Conference on Computer Vision, pp. 398–407. Cited by: §2.

Appendix A Labeling Tool

The interface of our human body orientation labeling tool is illustrated in Fig. A1.

Figure A1: User interface of the labeling tool.

Appendix B More Details on the HBOE Experiments

Figure A2: Breakdown analysis of the performance of our HBOE baseline model.

Breakdown analysis of the errors. First, we show the cumulative percentage of correct HBOE predictions with respect to the threshold for a correct prediction in Fig. A2 (a) and (b). Specifically, we compare the performance of our baseline model trained on the MEBOW dataset and that trained on the TUD dataset, respectively, using 1) the test set of the MEBOW dataset (Fig. A2 (a)) and 2) the test set of the TUD dataset (Fig. A2 (b)). Based on the same set of experiments as Table 3, these two sub-figures present a more detailed comparison, and they also support that the model trained on our MEBOW dataset has much better generalizability than the one trained on the TUD dataset. Second, we show how our baseline HBOE model performs when the camera point of view is towards the Front, Back, Left, and Right of the person in Fig. A2 (d). The association of the ground-truth orientations with the Front, Back, Left, and Right breakdown categories is shown in Fig. A2 (c). It is not surprising that our model performs best when the camera point of view is towards the Front of the person, because a larger portion of the MEBOW dataset falls into this category, as shown in Fig. 1 (a) in the main paper.

Appendix C Additional 3-D Human Pose Estimation Evaluation on the Human3.6M Dataset

We also conducted 3-D human pose estimation experiments with Protocol I in [48]. The evaluation results are shown in Table A1.

Method PA-MPJPE
Chen et al. [11] 82.7
Moreno et al. [32] 76.5
Zhou et al. [55] 55.3
Sun et al. [47] 48.3
Sharma et al. [44] 40.9
Sun et al. [48] 40.6
Moon et al. [31] 34.0
Baseline-1 34.7
Baseline-2 34.3
ours 33.1
Table A1: 3-D human pose estimation evaluation on the Human3.6M dataset using Protocol I (PA-MPJPE). Baseline-1 is a re-implementation of Sun et al. [48], trained on Human3.6M + MPII, as in the original paper. Baseline-2 is a re-implementation of Sun et al. [48], trained on Human3.6M + MPII + COCO (2-D pose).

Appendix D More Qualitative Human Body Orientation Estimation Results

More qualitative human body orientation estimation examples are shown in Fig. A3 to supplement Fig. 4 in the main paper.

Figure A3: HBOE results generated by our baseline model (with HRNet as the backbone) on the MEBOW and TUD datasets. Red arrow: ground truth; Blue arrow: prediction.

Appendix E More Qualitative 3-D Pose Estimation Results on the Human3.6M Dataset

More example 3-D pose estimation results on the test set of the Human3.6M dataset are included in Fig. A4.

Figure A4: More example 3-D pose estimation results on the Human3.6M dataset, comparing the input, the ground truth, Baseline-1, Baseline-2, and our model.

Appendix F More Qualitative 3-D Pose Estimation Results on the COCO Dataset

More example 3-D pose estimation results on the test set of the COCO dataset are shown in Fig. A5.

Figure A5: More example 3-D pose estimation results on the COCO dataset, comparing the input, Baseline-1, Baseline-2, and our model.