Full-Body Awareness from Partial Observations

by Chris Rockwell et al.
University of Michigan

There has been great progress in human 3D mesh recovery and great interest in learning about the world from consumer video data. Unfortunately current methods for 3D human mesh recovery work rather poorly on consumer video data, since on the Internet, unusual camera viewpoints and aggressive truncations are the norm rather than a rarity. We study this problem and make a number of contributions to address it: (i) we propose a simple but highly effective self-training framework that adapts human 3D mesh recovery systems to consumer videos and demonstrate its application to two recent systems; (ii) we introduce evaluation protocols and keypoint annotations for 13K frames across four consumer video datasets for studying this task, including evaluations on out-of-image keypoints; and (iii) we show that our method substantially improves PCK and human-subject judgments compared to baselines, both on test videos from the dataset it was trained on, as well as on three other datasets without further adaptation. Project website: https://crockwell.github.io/partial_humans






1 Introduction

Consider the images in Fig. 1: what are these people doing? Are they standing or sitting? While a human can readily recognize what is going on in these images, a similar understanding remains a severe challenge for current human 3D pose estimation systems. Unfortunately, in the world of Internet video, frames like these are the rule rather than rarities: consumer videos are recorded not to provide clean demonstrations of people performing poses, but to show something interesting to viewers who already know how to parse 3D poses. Accordingly, while videos from consumer sharing sites may be a useful source of data for learning how the world works [2, 14, 59, 62], most consumer videos depict a confusing jumble of limbs and torsos flashing across the screen. The goal of this paper is to make sense of this jumble.

Figure 1: We present a simple but highly effective framework for adapting human pose estimation methods to highly truncated settings that requires no additional pose annotation. We evaluate the approach on HMR [24] and CMR [26] by annotating four Internet video test sets: VLOG [14] (top-left, top-middle), Cross-Task [62] (top-right, bottom-left), YouCookII [59] (bottom-middle), and Instructions [2] (bottom-right).

Current work in human pose estimation is usually not up to the challenge of this jumble of Internet footage. Recent methods [3, 9, 24, 34, 37] are typically trained and evaluated on 2D and 3D pose datasets [4, 19, 21, 30, 36] that show full human poses from level cameras, often in athletic settings (Fig. 2, left). Unfortunately, Internet footage tends to look like Fig. 2 (right): frequently only part of the body is visible, so as to best show off how to perform a task or highlight something of interest. For instance, on VLOG [14], all human joints are visible in only 4% of image frames. Meanwhile, all leg keypoints are missing 63% of the time, and head keypoints such as eyes are missing in about 45% of frames. Accordingly, when standard approaches are tested on this sort of data, they tend to fail catastrophically, as we show empirically.

We propose a simple but surprisingly effective approach in Section 3 that we apply to multiple forms of human mesh recovery. The key insight is to combine cropping with self-training on confident video frames: cropping exposes the model to truncation, while video supplies the context in which truncations occur. After pre-training on a cropped version of a standard dataset, we identify reliable predictions on a large unlabeled video dataset, promote these instances to the training set, and repeat. Unlike standard self-training, we add crops, which lets confident full-body predictions (identified via [5]) provide a training signal for challenging crops. This approach requires no extra annotations, and the additional training totals on the order of hours on a single RTX 2080 Ti GPU.

We demonstrate the effectiveness of our approach on two human 3D mesh recovery techniques – HMR [24] and CMR [26] – and evaluate on four consumer-video datasets – VLOG [14], Instructions [2], YouCookII [59], and Cross-Task [62]. To lay the groundwork for future work, we annotate keypoints on 13k frames across these datasets and provide a framework for evaluation in and out of images. In addition to keypoints, we evaluate using human-study experiments. Our experiments in Section 4 demonstrate the effectiveness of our method compared to off-the-shelf mesh recovery and training on crops from a standard image dataset (MPII). Our approach improves PCK both in-image and out-of-image across methods and datasets: e.g., after training on VLOG, our approach leads to a 20.7% improvement on YouCookII over off-the-shelf HMR and a 10.9% improvement over HMR trained on crops (with gains of 36.4% and 19.1% on out-of-image keypoints). Perceptual judgments by annotators show similar gains: e.g., on Cross-Task, our proposed method improves the chance of a CMR output being rated as correct by 25.6% compared to off-the-shelf performance.

Figure 2: Partially Visible Humans. Consumer video, seen in datasets like VLOG [14], Instructions [2], or YouCookII [59], is considerably different from canonical human pose datasets. Most critically, only part of a person is typically visible within an image, making pose estimation challenging. In fact, all keypoints are visible in only 4% of VLOG test set images, while all leg joints are not visible 61% of the time. Four of the most common configurations of visible body parts are listed above.

2 Related Work

Human Pose Estimation In the Wild: Human pose estimation has improved substantially in recent years due in part to improved methods for 2D [9, 17, 37, 50, 56] and 3D [1, 28, 34, 42, 44, 60] pose, which typically utilize deep networks as opposed to classic approaches such as deformable part models [7, 10, 12, 58]. Performance of such pose models also relies critically on datasets [4, 18, 19, 21, 30, 53, 36, 47]. By utilizing annotated people in-the-wild, methods have moved toward understanding realistic, challenging settings, and become more robust to occlusion, setting, challenging pose, and scale variation [3, 35, 36, 39, 40, 61].

However, these in-the-wild datasets still rarely encounter close, varied camera angles common in consumer Internet video, which can result in people being only partially within an image. Furthermore, images that do contain truncated people are sometimes filtered out [24]. As a result, the state-of-the-art on common benchmarks performs poorly in consumer videos. In this work, we utilize the unlabeled video dataset VLOG to improve in this setting.

3D Human Mesh Estimation: A 3D mesh is a rich representation of pose, which is employed for the method presented in this paper. Compared to keypoints, a mesh represents a clear understanding of a person’s body invariant to global orientation and scale. A number of recent methods [6, 24, 27, 41, 51, 55] build this mesh by learning to predict parametric human body models such as SMPL [31] or the closely-related Adam [23]. To increase training breadth, some of these methods train on 2D keypoints [24, 41] and utilize a shape prior.

The HMR [24] model trains an adversarial prior with a variety of 2D keypoint datasets to demonstrate good performance in-the-wild, making it a strong candidate to extend to more challenging viewpoints. CMR [26] also produces strong results using a similar in-the-wild training methodology. We therefore apply our method to both models, rapidly improving performance on Internet video.

Understanding Partially-Observed People: Much of the prior work studying global understanding of partially-observed people comes from ego-centric action recognition [11, 29, 33, 46, 48]. These methods often use observations of the same human body parts across images, typically hands [11, 29, 33], to classify global activity. In contrast, our goal is to predict pose from varied viewpoints.

Some recent work explores ego-centric pose estimation. Recent setups mount cameras in a variety of clever ways, such as on the chest [20, 45], a bike helmet [43], VR goggles [49], or a hat [57]. However, these methods rely on the camera always being in the same spot relative to the human to make predictions. Our method, on the other hand, attains a global understanding of the body by training on entire people, learning to reason about unseen joints as it encounters images in which less of the body is visible.

Prior work also focuses on pose estimation specifically in cases of occlusion [15, 16]. While this setting requires inference of non-visible joints, it does not face the scale variation occurring in consumer video, which can contain people much larger than the image. Some recent work directly addresses truncation: Vosoughi and Amer [54] predict truncated 3D keypoints on random crops of Human3.6M. Concurrent with our work, Exemplar Fine-Tuning [22] uses upper-body cropping to improve performance on Internet video [53]. Nevertheless, consumer Internet video (Fig. 2) exhibits more extreme truncation. We show that cropping alone is not sufficient for this setting; rather, cropping combined with self-training on confident video frames provides the best results.

3 Approach

Figure 3: Our method adapts human pose models to truncated settings by self-training on cropped images. After pre-training using an annotated pose dataset, the method applies small translations to an unlabeled video dataset and selects predictions with consistent pose predictions across translations as pseudo-ground-truth. Repeating the process increases the training set to include more truncated people.

Our goal is the ability to reconstruct a full human-body 3D mesh from an image of part or all of a person in consumer video data. We demonstrate how to do this using a simple but effective self-training approach that we apply to two 3D human mesh recovery models, HMR [24] and CMR [26]. Both systems regress SMPL [31] parameters from which a human mesh can be generated, but both work poorly on consumer video data.

Our method, shown in Fig. 3, adapts each method to this challenging setting of partial visibility by sequentially self-training on confident mesh and keypoint predictions. Starting with a model trained on crops from a labeled dataset, the system makes predictions on video data. We then identify confident predictions using the equivariance technique of Bahat and Shakhnarovich [5]. Finally, using the confident examples as pseudo-ground-truth, the model is trained to map crops of the confident images to the full-body inferences, and the process of identifying confident images and folding them into the training set is continued. Our only assumption is that we can identify frames containing a single person (needed for training HMR/CMR). In this paper, we annotate such frames for simplicity, but this step could be automated via an off-the-shelf detection system.

3.1 Base Models

Our base models [24, 26] use SMPL [31], a differentiable, generative model of human 3D meshes. SMPL maps parameters Θ to a triangulated mesh with 6890 vertices. Θ consists of: joint rotations θ, shape parameters β, global rotation R, global translation t, and global scale s. We abstract each base model as a function f mapping an image I to a SMPL parameter vector Θ = f(I). As described in [24], the SMPL parameters can be used to yield a set of 2D projected keypoints x.
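As a concrete illustration of the last step, HMR-style models use a weak-perspective camera, so a projected 2D keypoint is simply a scaled, translated copy of the 3D joint's first two coordinates. A minimal sketch (function name and toy values are ours, not the paper's):

```python
import numpy as np

def project_keypoints(joints_3d, scale, trans):
    """Weak-perspective projection of 3D joints to 2D image keypoints,
    as used by HMR-style models: x = s * X[:, :2] + t.

    joints_3d: (J, 3) 3D joint locations in the camera frame
    scale:     scalar camera scale s
    trans:     (2,) camera translation t
    """
    return scale * joints_3d[:, :2] + trans

# Toy example with two joints.
X = np.array([[0.0, 0.5, 2.0],
              [1.0, -0.5, 2.0]])
x = project_keypoints(X, scale=2.0, trans=np.array([0.1, -0.1]))
```

Note that keypoints projected this way can legitimately fall outside the image bounds, which is exactly what the out-of-image supervision below exploits.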

Our training process closely builds on the original methods: we minimize a sum of losses on projected 2D keypoints x, predicted vertices V, and SMPL parameters Θ. The most important distinctions are that we assume access to SMPL parameters for each image, and that we train on all annotated keypoints, even those outside the image. We describe salient differences between the models and the original training below.

HMR [24]: Kanazawa et al. use MoSh [32, 52] for their ground-truth SMPL loss. However, such data is not available for most images, and thus the model relies primarily on a keypoint loss. Instead, we train directly on predicted SMPL rotations θ, which are available for all images; we find an L1 loss works best. To encourage our network to adapt to the poses of new datasets, we do not use a discriminator loss. We also supervise shape (β), though experiments indicated this did not impact performance (less than a 1% difference on keypoint results). The loss for a single datapoint is:

L_HMR = ||x − x̂||_1 + ||θ − θ̂||_1

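A minimal sketch of this per-image loss, assuming mean-reduced L1 terms and illustrative weights (the weighting scheme is our assumption, not the paper's stated values):

```python
import numpy as np

def hmr_adaptation_loss(kp_pred, kp_gt, theta_pred, theta_gt,
                        w_kp=1.0, w_theta=1.0):
    """Sketch of the per-image HMR adaptation loss described above:
    L1 on all 2D keypoints (including out-of-image ones) plus L1 on
    SMPL joint rotations. Weights w_kp/w_theta are illustrative."""
    l_kp = np.abs(kp_pred - kp_gt).mean()       # keypoint L1, mean-reduced
    l_theta = np.abs(theta_pred - theta_gt).mean()  # rotation L1
    return w_kp * l_kp + w_theta * l_theta
```

Because ground truth here is a full-image prediction, out-of-image keypoints enter `kp_gt` just like visible ones; no masking is applied.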
CMR [26]: CMR additionally regresses a predicted mesh, and has intermediate losses after the Graph CNN. We do not change their loss, other than always using 3D supervision and out-of-image keypoints. It is distinct from our HMR loss in that it uses an L2 loss on keypoints and SMPL parameters, and converts rotations to rotation matrices for training [38], although the authors note this conversion does not change quantitative results. The loss for a single datapoint is:

L_CMR = ||x − x̂||_2^2 + ||V − V̂||_1 + ||Θ − Θ̂||_2^2

where each norm is reduced by its number of elements, and the keypoint and mesh losses are also applied after the Graph CNN. While Kolotouros et al. train the Graph CNN before the MLP, we find the pretrained model trains well with both losses simultaneously.

3.2 Iterative Adaptation to Partial Visibility

Our approach follows a standard self-training approach to semi-supervised learning. In self-training, one begins with an initial model f_0 as well as a collection of unlabeled data U. Here, the inputs are images, the outputs are SMPL parameters, and the model is either CMR or HMR. The key idea is to use the inferences of each round's model f_i to produce labeled data for training the next round's model f_{i+1}. More specifically, at each iteration i, the model f_i is applied to each element of U, and a confident prediction subset C_i ⊆ U is identified. Then, predictions of f_i on elements of C_i are treated as new ground truth for training the next-round model f_{i+1}. In standard self-training, the new training set is the original unlabeled inputs and model outputs, or {(u, f_i(u)) : u ∈ C_i}. In our case, this would never learn to handle more heavily cropped people, so the training set is instead augmented with transformations of the confident samples, or {(t(u), f_i(u)) : u ∈ C_i, t ∈ T} for some set of crops T. The new model is retrained and the process is repeated until convergence. We now describe more concretely what we mean by each bolded point.
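The iteration can be sketched as follows (all names are illustrative; `train_fn` stands in for gradient-based retraining, and `confidence_fn` for the equivariance-based selection discussed later):

```python
def self_train(model, unlabeled, crops, n_rounds, train_fn, confidence_fn):
    """Sketch of the iterative adaptation loop described above.

    model:         maps an image to SMPL parameters
    unlabeled:     the pool U of unlabeled images
    crops:         the set T of cropping transforms
    train_fn:      fits a new model on (input, target) pairs
    confidence_fn: returns the confident subset C_i of U
    """
    for _ in range(n_rounds):
        confident = confidence_fn(model, unlabeled)      # C_i subset of U
        pseudo = [(u, model(u)) for u in confident]      # full-image targets
        # Key difference from standard self-training: supervise *crops* of
        # confident images with the confident full-body prediction.
        dataset = [(t(u), y) for (u, y) in pseudo for t in crops]
        model = train_fn(model, dataset)
    return model
```

Because the target is always the full-image inference, a crop that removes the legs is still supervised with a full-body mesh, which is what teaches the model to reason about unseen joints.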

Initial Model: We begin by training the pretrained HMR and CMR models on MPII (Fig. 3, left), applying cropping transformations to images and keypoints. SMPL predictions from full images are used for supervision, and are typically very accurate given that the models were previously trained on this set. This training scheme is the same as that used for self-training (Fig. 3, right), except that we use MPII ground-truth keypoints instead of pseudo-ground-truth.

Identifying Confident Predictions: To apply self-training, we need to be able to find confident outputs of each of our SMPL-regressing models. Unfortunately, it is difficult to extract confidence from regression models: there is no natural, automatically produced confidence measure, unlike in classification, where measures like entropy provide a starting point.

We therefore turn to an empirical result of Bahat and Shakhnarovich [5]: invariance to image transformations is often indicative of confidence in neural networks. Put simply, confident predictions of networks tend to be more invariant to small transformations (e.g., a shift) than non-confident predictions. We apply this technique in our setting by examining changes in parameters after applying small translational jitter: we apply the model to copies of the image with the center jittered 10 and 20 pixels and look at the joint rotation parameters θ. We compute the variance of each joint rotation parameter across the jittered samples, then average the variances across joints. For HMR, we define confident samples as ones with a variance below a threshold chosen empirically. For CMR, for simplicity, we ensure that we have the same acceptance rate as HMR of 12%; this results in a similar variance threshold.
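This equivariance test might look like the following sketch, in the spirit of Bahat and Shakhnarovich [5] (the threshold value, offsets beyond 10 and 20 pixels, and all names are illustrative, not the paper's exact settings):

```python
import numpy as np

def is_confident(model, image, jitter_fn, offsets=(-20, -10, 10, 20),
                 threshold=0.01):
    """Accept a prediction if joint-rotation outputs are stable under
    small translations of the input.

    model:     maps an image to a vector of joint rotation parameters
    jitter_fn: jitter_fn(image, d) returns a copy re-centered by d pixels
    """
    thetas = np.stack([model(jitter_fn(image, d)) for d in offsets])
    # Variance of each rotation parameter across jitters, averaged over joints.
    mean_var = thetas.var(axis=0).mean()
    return bool(mean_var < threshold)
```

A model whose rotations barely move under jitter is accepted into the pseudo-ground-truth pool; an unstable one is rejected.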

Applying Transformations: The set of inputs and confident pseudo-label outputs is not, by itself, enough for self-training. We therefore apply a family of crops that mimics the empirical frequencies found in consumer video data. Specifically, crops consist of: 23% most of the body visible, 29% legs not visible, 10% head not visible, and 22% only hands or arms visible. Examples of these categories are shown in Fig. 2. Although the proportions were chosen empirically from VLOG, the other consumer Internet video datasets we consider [2, 59, 62] exhibit similar visibility patterns, and we empirically show that our results generalize.
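Sampling crop categories with these empirical proportions can be sketched as below; the category names and the residual "other" bucket (the remaining 16% of the mass) are our shorthand, not labels from the paper:

```python
import random

# Empirical crop-type frequencies from VLOG (Sec. 3.2); "other" covers the
# remaining partial-visibility configurations (our grouping).
CROP_FREQS = {
    "most_visible": 0.23,
    "legs_hidden": 0.29,
    "head_hidden": 0.10,
    "hands_arms_only": 0.22,
    "other": 0.16,
}

def sample_crop_type(rng=random):
    """Draw a crop category with the empirical probabilities above."""
    kinds = list(CROP_FREQS)
    weights = [CROP_FREQS[k] for k in kinds]
    return rng.choices(kinds, weights=weights, k=1)[0]
```

Each sampled category would then be mapped to a concrete crop window over the image and its keypoints.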

Retraining: Finally, given the set of samples of crops of confident images and corresponding full bodies, we retrain each model.

3.3 Implementation Details

Our cropping procedure and architecture are detailed in the supplemental for both HMR [24] and CMR [26]; we initialize both with weights pretrained on in-the-wild data. On MPII, we continue using the same learning rate and optimizer used by each model (1e-5 for HMR, 3e-4 for CMR; both use Adam [25]) until validation loss converges. Training converges within 20k iterations in both cases.

Next, we identify confident predictions on VLOG as detailed above. We use the subset of the hand-contact state dataset containing single humans, which consists of 132k frames in the train + validation set. We note that we could have used a simple classifier to filter for visible people in a totally unlabeled setting. Our resulting confident train + validation set contains 15k images. We perform the same cropping transformations as on MPII and continue training with the same parameters. Validation loss converges within 10k iterations. We repeat this semi-supervised step one additional time, after which the train + validation set contains approximately 40k images. Training again takes less than 10k iterations.

Figure 4: Randomly sampled positive and negative predictions by number of keypoints visible, as identified by workers. Images with fewer keypoints visible are typically more difficult, and our method improves most significantly in these cases (Table 2).

4 Experiments

We now describe a set of experiments that investigate the following questions: (1) how well do current 3D human mesh recovery systems work on consumer Internet video? (2) can we improve the performance of these systems, both on an absolute basis and in comparison to alternate simple models that do not self-train? We answer these questions by evaluating the performance of a variety of adaptation methods applied to both HMR [24] and CMR [26] on four independent datasets of Internet videos. After introducing the datasets (Sec. 4.1) and experimental setup (Sec. 4.2), we describe experiments on VLOG (Sec. 4.3), which we use for our self-training video adaptation. To test the generality of our conclusions, we then repeat the same experiments on three other consumer datasets without any retraining (Sec. 4.4). We validate our choice of confidence measure against two other methods in Sec. 4.5.

4.1 Datasets and Annotations

                  ---------- Independent Joint Statistics ----------    --- Joint Configuration Statistics ---
                                                       Neck + Head     Fully    Upper    Only    All But
                  Ankle  Knee   Hip   Wrist  Elbow  Should.  + Face    Visible  Torso    Arms    Head
Average             7.0  12.9  31.9   71.9   50.8    53.9     51.1       2.8    26.2     31.8    18.8
VLOG [14]          10.5  20.0  34.0   71.5   54.5    61.0     53.3       4.0    32.0     24.0    14.0
Instructions [2]   14.0  24.5  32.5   73.5   52.0    44.7     43.3       5.0    14.0     32.0    18.0
YouCook II [59]     0.0   1.0  30.0   71.0   45.5    53.7     52.7       0.0    28.0     39.0    24.0
Cross-Task [62]     3.5   6.0  31.0   71.5   51.0    56.3     55.0       2.0    31.0     32.0    19.0
Table 1: Joint visibility statistics (%) across the four consumer video datasets that we use. Across multiple consumer video datasets, fully visible people are exceptionally rare (2.8% on average), in contrast to configurations like an upper torso, only pieces of someone's arms, or much of a body but no head. Surprisingly, the joint most likely to be visible is actually the wrist: more than 2x more likely than hips and over 5x more likely than knees.

We rely on four datasets of consumer video from the Internet to evaluate our method: VLOG [14], Instructions [2], YouCookII [59], and Cross-Task [62]. Evaluation on VLOG takes place on a random 5k-image subset of the test set detailed in Sec. 3.3. For evaluation on Instructions, YouCookII, and Cross-Task, we randomly sample test-set frames (for Instructions we sample from the entire dataset, which is used for cross-validation), filter them via crowd workers by whether they contain a single person, and then randomly subsample a 5k subset.

Finally, to enable automatic metrics like PCK, we obtain joint annotations on all four datasets. We annotate keypoints for the 19 joints reprojected from HMR: the 17 COCO keypoints along with the neck and head top from MPII. Annotations are crowd-gathered by workers who must pass a qualification test and are monitored by sentinels, as detailed in the supplemental. We show statistics of these joints in Table 1, which quantify the lack of visible keypoints. In stark contrast to canonical pose datasets, the head is often not visible. Instead, the most frequently visible joints are the wrists.

4.2 Experimental Setup

We evaluate our approaches as well as a set of baselines that test concrete hypotheses using two styles of metrics: 2D keypoint metrics, specifically PCK measured on both in-image joints as well as via out-of-image joints (via evaluation on crops); and 3D Mesh Human Judgments, where crowd workers evaluate outputs of the systems on an absolute or relative basis.

2D Keypoint Metrics: Our first four metrics compare predicted keypoints with annotated ones. Our base metric is PCK @ 0.5 [4], the percentage of keypoints within a threshold of 0.5 times the head segment length, the most commonly reported threshold on MPII. Our first metric, Uncropped PCK, is performance on images where the head is visible, which is needed to define PCK. We choose PCK because head segment length is typically undistorted in our data, as opposed to alternatives for which identifying a stable threshold is difficult: PCP [13] is affected by our high variance in 3D body orientation, and PCPm [4] by high inter-image scale variation.

PCK is defined only on images where the head is visible (a shortcoming we address with human judgment experiments). In addition to being a subset, these frames are not representative of typical visibility patterns in consumer video (as shown in Fig. 2 and Table 1), so we evaluate on crops. We sample crops to closely match the joint visibility statistics of each entire annotated test set (detailed in supplemental). We can then evaluate In-Image PCK, or PCK on joints in the cropped image. Because the original image contains precise annotations of joints not visible in the crop, we can also evaluate Out-of-Image PCK, or PCK on joints outside the crop. Total PCK is PCK on both. We calculate PCK on each image and then average over images. Not doing this gives significantly more weight to images with many keypoints in them, and ignores images with few.
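The per-image computation described above might be sketched as follows (function and argument names are ours; `visible` marks the joints being evaluated, whether in-image, out-of-image, or their union):

```python
import numpy as np

def pck_per_image(preds, gts, visible, head_len, alpha=0.5):
    """PCK@alpha for one image: fraction of evaluated keypoints whose
    prediction lies within alpha * head-segment length of ground truth.

    preds, gts: (J, 2) predicted / annotated 2D keypoints
    visible:    (J,) bool mask of joints evaluated for this image
    """
    dists = np.linalg.norm(preds - gts, axis=1)
    ok = dists[visible] <= alpha * head_len
    return ok.mean() if ok.size else np.nan

def dataset_pck(per_image_scores):
    """Average PCK over images rather than over pooled keypoints, so
    images with many visible joints do not dominate the metric."""
    scores = [s for s in per_image_scores if not np.isnan(s)]
    return float(np.mean(scores))
```

Averaging the per-image scores, rather than pooling all keypoints, is exactly the weighting choice the text argues for.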

3D Mesh Human Judgments: While useful, keypoint metrics like PCK suffer from a number of shortcomings. They can only be evaluated on a subset of images: this ranges from 37% of images from Instructions to 50% of images in Cross-Task. Moreover, these subsets are not necessarily representative, as argued before. Finally, in the case of out-of-image keypoints, PCK does not distinguish between plausible predictions that happen to be incorrect according to the fairly exacting PCK metric, and implausible guesses. We therefore turn to human judgments, measuring results in absolute and comparative terms.

Mesh Score/Absolute Judgment: We show workers an image and single mesh, and ask them to classify it as largely correct or not (precise definition in supplemental), from which we can calculate Percentage of Good Meshes: the proportion of predicted meshes workers consider good. Predictions from all methods are aggregated and randomly ordered, so are evaluated by the same pool of workers.

Relative Judgment: As a verification, we also perform A/B testing on HMR predictions. We follow a standard A/B paradigm and show human workers an image and two meshes in random order and ask which matches the image better with the option of a tie; when workers cannot agree, we report this as a tie.

Baselines: We compare our proposed model with two baselines to answer a few scientific questions.

Base Method: We compare with the base method being used, either HMR [24] or CMR [26], without any further training past their original pose dataset training sets. This both quantifies how well 3D pose estimation methods work on consumer footage and identifies when our approach improves over this model.

Crops: We also compare with a model trained on MPII Crops (including losses on out-of-image keypoints). This tests whether simply training the model on crops is sufficient compared to also self-training on Internet video.

4.3 Results on VLOG

Our first experiments are on VLOG [14], the dataset that we train on. We begin by showing qualitative results, comparing our method with a number of baselines in Fig. 5. While effective on full-body cases, the initial methods perform poorly on truncated people. Training on MPII Crops prepares the model to better identify truncated people, but self-training on Internet video provides the model with context clues it can associate with people largely outside of images: some of the largest improvements occur when key indicators such as sinks and tables (Fig. 5) are present. In Fig. 6, the model identifies distinct leg and head poses outside of images given minute differences in visible pose and appearance.

Figure 5: Selected comparison of results on VLOG [14]. We demonstrate sequential improvement between ablations on HMR (left) and CMR (right). Training on MPII Crops prepares the model for truncation, while self-training provides context clues it can associate with full-body pose, leading to better predictions, particularly outside images.
Figure 6: Shots focused on hands occur often in consumer video. While the visible body may look similar across instances, full-body pose can vary widely, meaning keypoint detection is not sufficient for full-body reasoning. After self-training, our method learns to differentiate activity such as standing and sitting given similar visible body.

Human 3D Mesh Judgments: We now consider human 3D mesh judgments, which quantitatively confirm the trends observed in the qualitative results. We report the frequency with which each method's predictions were rated as largely correct on the test set, broken down by the number of visible joints, in Table 2. Our approach always outperforms the base method, and is outperformed by Crops only at full or near-full keypoint visibility. The performance gains are particularly strong in the less-visible cases compared to both the base method and Crops. For instance, using our technique, HMR's performance in highly truncated configurations (1-3 keypoints visible) improves by 23.7 points over the base and 11.0 points over Crops.

HMR [24] CMR [26]
By # of Visible Joints By # of Visible Joints
1-3 4-6 7-9 10-12 13-15 16-19 All 1-3 4-6 7-9 10-12 13-15 16-19 All
Base 19.2 52.7 70.1 80.1 85.2 82.1 60.6 13.6 37.9 53.4 68.0 79.0 74.8 51.1
Crops 31.9 68.7 76.9 86.8 91.0 85.9 69.4 33.7 65.8 75.7 82.5 88.4 80.9 67.5
Full 42.9 72.1 82.8 89.6 92.3 83.1 73.9 40.9 71.2 80.2 86.0 89.2 79.5 71.2
Table 2: Percentage of Good Meshes on VLOG, as judged by human workers. We report results on All images and examine results by number of visible keypoints.
Method HMR [24] CMR [26]
Cropped Uncr. Cropped Uncr.
Total In Out Total Total In Out Total
Base 48.6 65.2 14.7 68.5 36.1 50.2 13.2 49.5
Crops 51.6 65.3 24.2 68.8 47.3 58.1 26.2 59.5
Ours 55.9 61.6 38.9 68.7 50.9 60.3 34.6 58.1
Table 3: PCK @ 0.5 on VLOG. We compute PCK on the 1.8k image VLOG test set, in which the head is fully visible, as Uncr. Total. These images are then Cropped to emulate the keypoint visibility statistics of the entire dataset, on which we can calculate PCK In and Out of cropped images, and their union Total.

2D Keypoints: We next evaluate keypoints, reporting results for all four variants in Table 3. On cropped evaluations that match the actual distribution of consumer video, our approach produces substantial improvement, increasing performance overall for both HMR and CMR. On the uncropped images where the head of the person is visible (which is closer to distributions seen on e.g., MPII), our approach remains approximately the same for HMR and actually improves by 8.6% for CMR. We note our method underperforms within cropped images on HMR. There are two reasons for this: first, supervising on out-of-image keypoints encourages predictions outside of images, sacrificing marginal in-image performance gains. Second, the cost of supervising on self-generated keypoints is reduced precision in familiar settings. Nevertheless, CMR improves enough using semi-supervision to still increase on in-image-cropped keypoints.

4.4 Generalization Evaluations

We now test generalization to other datasets. Specifically, we take the approaches evaluated in the previous section and apply them directly to Instructions [2], YouCookII [59], and Cross-Task [62] with no further training. This tests whether the additional learning is simply overfitting to VLOG. We show qualitative results of our system applied to these datasets in Fig. 7. Although the base models also work poorly on these consumer videos, simply training on VLOG is sufficient to produce more reasonable outputs.

Figure 7: Results on External Datasets. While our method trains on Internet Vlogs, performance generalizes to other Internet video consisting of a variety of activities and styles; specifically instructional videos and cooking videos.
         Instructions [2]        YouCook II [59]         Cross-Task [62]
         HMR [24]    CMR [26]    HMR [24]    CMR [26]    HMR [24]    CMR [26]
         1-6   All   1-6   All   1-6   All   1-6   All   1-6   All   1-6   All
Base     10.7  42.2   7.4  30.9  15.9  54.6   8.5  41.8  13.6  52.6   7.7  37.9
Crops    25.5  53.8  28.8  52.5  24.9  60.8  24.3  60.2  22.3  59.0  21.5  57.7
Full     37.3  60.5  35.2  57.0  43.4  71.5  39.9  68.5  37.0  68.1  31.0  63.5
Table 4: Percentage of Good Meshes on External Datasets, as judged by human workers. We report results on All images and on images with few visible keypoints (1-6).

3D Mesh Judgments: This is substantiated quantitatively: as shown in Table 4, HMR and CMR perform poorly out of the box across the full datasets. Our approach, however, systematically improves their performance without any additional pose annotations: gains over the best baseline range from 4.5 percentage points (CMR tested on Instructions) to 10.7 percentage points (HMR tested on YouCookII). Our outputs are also systematically preferred by humans in A/B tests (Table 5): our approach is several times more likely to be picked as preferable to the base system than the reverse, and similarly more likely to be picked as preferable to crops than the reverse.

2D Keypoints: Finally, we evaluate PCK. Our approach produces strong performance gains on two of the three datasets (YouCookII and Cross-Task), while its performance is more mixed on Instructions relative to MPII Crops. We hypothesize that the comparatively strong performance of MPII Crops here stems from roughly 40% of this dataset consisting of car-repair videos. These videos frequently feature people bending down, for instance while replacing tires, and poses from similar activities, such as swimming, are more common in MPII than in VLOG. The corresponding array of outdoor scenes also provides less context for accurately inferring out-of-image body parts. Nevertheless, strong human-judgment results (Tables 4, 5) indicate that training on VLOG improves coarse prediction quality even in this setting.

4.5 Additional Comparisons

To validate our choice of confidence measure, we consider two alternative criteria for selecting confident images: agreement between HMR and CMR SMPL parameters, and agreement between HMR and OpenPose [8] keypoints. For fair comparison, the implementations closely match our confidence method; full details and tables are in the supplemental. Compared to both, our system performs about the same or better across datasets, and does not require running two systems. Agreement with CMR yields cropped keypoint accuracy 1.5%-2.7% lower, and uncropped accuracy ranging from 0.6% higher to 0.6% lower. Agreement with OpenPose is stronger on uncropped images (0.3%-2.4% higher) but weaker on cropped images (1.3%-3.5% lower).
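As a sketch of the keypoint-agreement alternative (function name and the 5% tolerance are illustrative, not the paper's values; exact thresholds are in the supplemental), an image is kept when the two systems' keypoints agree over the joints both predict in-image:

```python
import numpy as np

def keypoint_agreement(kp_a, kp_b, valid_a, valid_b, image_size, tol=0.05):
    """Keep an image if the mean distance between two systems' keypoints,
    computed over joints both predict in-image, is below tol * image_size.
    kp_a, kp_b: (J, 2) pixel coordinates; valid_a, valid_b: (J,) bool masks.
    The tolerance of 5% of image size is illustrative."""
    both = valid_a & valid_b
    if not both.any():
        return False  # no shared joints: cannot establish agreement
    dist = np.linalg.norm(kp_a[both] - kp_b[both], axis=1)
    return bool(dist.mean() < tol * image_size)
```

Restricting to jointly-predicted, in-image joints mirrors the constraint that OpenPose cannot predict keypoints outside the frame.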

           VLOG [14]             Instructions [2]      YouCookII [59]        Cross-Task [62]
Method     Base       Crops      Base       Crops      Base       Crops      Base       Crops
Crops      53/28/19   -          56/28/16   -          49/35/16   -          46/39/15   -
Full       63/23/15   45/43/12   65/21/14   40/43/17   62/32/7    47/47/6    57/36/7    41/53/7
Table 5: A/B testing on All Datasets, using HMR. For each entry we report how frequently (%) the row wins/ties/loses to the column. For example, row 2, column 6 shows that our full method is preferred 47% of the time over a method trained on MPII Crops, and MPII Crops is preferred over the full method just 6% of the time.
           Instructions [2]          YouCookII [59]            Cross-Task [62]
           HMR [24]      CMR [26]    HMR [24]      CMR [26]    HMR [24]      CMR [26]
Method     Total  Out    Total  Out  Total  Out    Total  Out  Total  Out    Total  Out
Base       42.0   19.6   32.8   17.1 56.0   27.7   44.0   26.9 56.1   20.3   44.1   19.8
MPII Crops 50.6   33.7   47.9   33.9 65.8   45.0   65.0   48.6 62.9   32.5   61.9   38.2
Ours       48.7   36.4   44.8   33.7 76.7   64.1   70.7   58.5 74.5   57.2   66.9   47.9
Table 6: PCK @ 0.5 on External Datasets. We compute PCK on test-set images in which the head is fully visible. These images are then cropped to emulate the keypoint visibility statistics of the entire dataset, which allows us to evaluate PCK on predictions outside the image.

We additionally compare the performance of our final model to the model after only the first iteration of VLOG training, via A/B testing (full table in the supplemental). On all four datasets, the final method is x – x more likely to be picked as preferable to the model after only one round than the reverse.

5 Discussion

We presented a simple but effective approach for adapting 3D mesh recovery models to the challenging world of Internet videos. In the process, we showed that current methods work poorly on Internet videos, presenting a new opportunity. Interestingly, while CMR outperforms HMR on Human3.6M, the opposite is true on this new data, suggesting that performance gains on standard pose estimation datasets do not always translate into performance gains on Internet videos. The new keypoint annotations across the four video datasets allow us to quantify this gap. These keypoint metrics are validated as a measure of prediction quality by their general agreement with human-judgment metrics in extensive testing. We see getting systems to work on consumer videos, including both the visible and out-of-image parts, as an interesting and impactful challenge and believe our simple method provides a strong baseline for work in this area.

Acknowledgments: This work was supported by the DARPA Machine Common Sense Program. We thank Dimitri Zhukov, Jean-Baptiste Alayrac, and Luowei Zhou for allowing us to share frames from their datasets, and Angjoo Kanazawa and Nikos Kolotouros for their polished and easily extended code. Thanks to the members of the Fouhey AI Lab and Karan Desai for the great suggestions!


  • [1] I. Akhter and M. J. Black (2015) Pose-conditioned joint angle limits for 3D human pose reconstruction. In CVPR, Cited by: §2.
  • [2] J. Alayrac, P. Bojanowski, N. Agrawal, I. Laptev, J. Sivic, and S. Lacoste-Julien (2016) Unsupervised learning from narrated instruction videos. In CVPR, Cited by: Figure 1, §1, §1, §3.2, §4.1, §4.4, Table 1, Table 4, Table 5, Table 6.
  • [3] R. Alp Güler, N. Neverova, and I. Kokkinos (2018) Densepose: dense human pose estimation in the wild. In CVPR, Cited by: §1, §2.
  • [4] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele (2014) 2D human pose estimation: new benchmark and state of the art analysis. In CVPR, Cited by: §1, §2, §4.2.
  • [5] Y. Bahat and G. Shakhnarovich (2018) Confidence from invariance to image transformations. arXiv preprint arXiv:1804.00657. Cited by: §1, §3.2, §3.
  • [6] F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, and M. J. Black (2016) Keep it SMPL: automatic estimation of 3D human pose and shape from a single image. In ECCV, Cited by: §2.
  • [7] L. Bourdev and J. Malik (2009) Poselets: body part detectors trained using 3D human pose annotations. In ICCV, Cited by: §2.
  • [8] Z. Cao, G. Hidalgo, T. Simon, S. Wei, and Y. Sheikh (2018) OpenPose: realtime multi-person 2d pose estimation using part affinity fields. arXiv preprint arXiv:1812.08008. Cited by: §4.5.
  • [9] Z. Cao, T. Simon, S. Wei, and Y. Sheikh (2017) Realtime multi-person 2D pose estimation using part affinity fields. In CVPR, Cited by: §1, §2.
  • [10] C. Desai and D. Ramanan (2012) Detecting actions, poses, and objects with relational phraselets. In ECCV, Cited by: §2.
  • [11] A. Fathi, A. Farhadi, and J. M. Rehg (2011) Understanding egocentric activities. In ICCV, pp. 407–414. Cited by: §2.
  • [12] P. Felzenszwalb, D. McAllester, and D. Ramanan (2008) A discriminatively trained, multiscale, deformable part model. In CVPR, Cited by: §2.
  • [13] V. Ferrari, M. Marin-Jimenez, and A. Zisserman (2008) Progressive search space reduction for human pose estimation. In CVPR, Cited by: §4.2.
  • [14] D. F. Fouhey, W. Kuo, A. A. Efros, and J. Malik (2018) From lifestyle VLOGs to everyday interactions. In CVPR, Cited by: Figure 1, Figure 2, §1, §1, §1, Figure 5, §4.1, §4.3, Table 1, Table 5.
  • [15] G. Ghiasi, Y. Yang, D. Ramanan, and C. C. Fowlkes (2014) Parsing occluded people. In CVPR, Cited by: §2.
  • [16] A. Haque, B. Peng, Z. Luo, A. Alahi, S. Yeung, and L. Fei-Fei (2016) Towards viewpoint invariant 3D human pose estimation. In ECCV, Cited by: §2.
  • [17] E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele (2016) Deepercut: a deeper, stronger, and faster multi-person pose estimation model. In ECCV, Cited by: §2.
  • [18] C. Ionescu, F. Li, and C. Sminchisescu (2011) Latent structured models for human pose estimation. In ICCV, Cited by: §2.
  • [19] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu (2013) Human3.6m: large scale datasets and predictive methods for 3D human sensing in natural environments. TPAMI. Cited by: §1, §2.
  • [20] H. Jiang and K. Grauman (2017) Seeing invisible poses: estimating 3D body pose from egocentric video. In CVPR, Cited by: §2.
  • [21] S. Johnson and M. Everingham (2010) Clustered pose and nonlinear appearance models for human pose estimation.. In BVMC, Cited by: §1, §2.
  • [22] H. Joo, N. Neverova, and A. Vedaldi (2020) Exemplar fine-tuning for 3d human pose fitting towards in-the-wild 3d human pose estimation. arXiv preprint arXiv:2004.03686. Cited by: §2.
  • [23] H. Joo, T. Simon, and Y. Sheikh (2018) Total capture: a 3D deformation model for tracking faces, hands, and bodies. In CVPR, Cited by: §2.
  • [24] A. Kanazawa, M. J. Black, D. W. Jacobs, and J. Malik (2018) End-to-end recovery of human shape and pose. In CVPR, Cited by: Figure 1, §1, §1, §2, §2, §2, §3.1, §3.1, §3.3, §3, §4.2, Table 2, Table 3, Table 6, §4.
  • [25] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.3.
  • [26] N. Kolotouros, G. Pavlakos, and K. Daniilidis (2019) Convolutional mesh regression for single-image human shape reconstruction. In CVPR, Cited by: Figure 1, §1, §2, §3.1, §3.1, §3.3, §3, §4.2, Table 2, Table 3, Table 6, §4.
  • [27] C. Lassner, J. Romero, M. Kiefel, F. Bogo, M. J. Black, and P. V. Gehler (2017) Unite the people: closing the loop between 3D and 2D human representations. In CVPR, Cited by: §2.
  • [28] H. Lee and Z. Chen (1985) Determination of 3D human body postures from a single view. Computer Vision, Graphics, and Image Processing 30 (2), pp. 148–168. Cited by: §2.
  • [29] Y. Li, Z. Ye, and J. M. Rehg (2015) Delving into egocentric actions. In CVPR, Cited by: §2.
  • [30] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In ECCV, Cited by: §1, §2.
  • [31] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black (2015) SMPL: a skinned multi-person linear model. ACM Trans. Graphics (Proc. SIGGRAPH Asia) 34 (6). Cited by: §2, §3.1, §3.
  • [32] M. Loper, N. Mahmood, and M. J. Black (2014) MoSh: motion and shape capture from sparse markers. ACM Transactions on Graphics (TOG) 33 (6), pp. 220. Cited by: §3.1.
  • [33] M. Ma, H. Fan, and K. M. Kitani (2016) Going deeper into first-person activity recognition. In CVPR, Cited by: §2.
  • [34] J. Martinez, R. Hossain, J. Romero, and J. J. Little (2017) A simple yet effective baseline for 3d human pose estimation. In CVPR, Cited by: §1, §2.
  • [35] D. Mehta, H. Rhodin, D. Casas, P. Fua, O. Sotnychenko, W. Xu, and C. Theobalt (2017) Monocular 3d human pose estimation in the wild using improved cnn supervision. In 2017 International Conference on 3D Vision (3DV), pp. 506–516. Cited by: §2.
  • [36] D. Mehta, S. Sridhar, O. Sotnychenko, H. Rhodin, M. Shafiei, H. Seidel, W. Xu, D. Casas, and C. Theobalt (2017) Vnect: real-time 3D human pose estimation with a single RGB camera. ACM Transactions on Graphics (TOG) 36 (4), pp. 44. Cited by: §1, §2.
  • [37] A. Newell, K. Yang, and J. Deng (2016) Stacked hourglass networks for human pose estimation. In ECCV, Cited by: §1, §2.
  • [38] M. Omran, C. Lassner, G. Pons-Moll, P. Gehler, and B. Schiele (2018) Neural body fitting: unifying deep learning and model based human pose and shape estimation. In 2018 International Conference on 3D Vision (3DV), pp. 484–494. Cited by: §3.1.
  • [39] G. Papandreou, T. Zhu, N. Kanazawa, A. Toshev, J. Tompson, C. Bregler, and K. Murphy (2017) Towards accurate multi-person pose estimation in the wild. In CVPR, Cited by: §2.
  • [40] G. Pavlakos, X. Zhou, K. G. Derpanis, and K. Daniilidis (2017) Coarse-to-fine volumetric prediction for single-image 3D human pose. In CVPR, Cited by: §2.
  • [41] G. Pavlakos, L. Zhu, X. Zhou, and K. Daniilidis (2018) Learning to estimate 3D human pose and shape from a single color image. In CVPR, Cited by: §2.
  • [42] V. Ramakrishna, T. Kanade, and Y. Sheikh (2012) Reconstructing 3D human pose from 2D image landmarks. In ECCV, Cited by: §2.
  • [43] H. Rhodin, C. Richardt, D. Casas, E. Insafutdinov, M. Shafiei, H. Seidel, B. Schiele, and C. Theobalt (2016) Egocap: egocentric marker-less motion capture with two fisheye cameras. ACM Transactions on Graphics (TOG) 35 (6), pp. 1–11. Cited by: §2.
  • [44] G. Rogez and C. Schmid (2016) Mocap-guided data augmentation for 3d pose estimation in the wild. In Advances in neural information processing systems, Cited by: §2.
  • [45] G. Rogez, J. S. Supancic, and D. Ramanan (2015) First-person pose recognition using egocentric workspaces. In CVPR, Cited by: §2.
  • [46] M. S. Ryoo and L. Matthies (2013) First-person activity recognition: what are they doing to me?. In CVPR, Cited by: §2.
  • [47] L. Sigal, A. O. Balan, and M. J. Black (2010) Humaneva: synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. International Journal of Computer Vision 87 (1-2), pp. 4. Cited by: §2.
  • [48] G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, and A. Gupta (2016) Hollywood in homes: crowdsourcing data collection for activity understanding. In ECCV, Cited by: §2.
  • [49] D. Tome, P. Peluse, L. Agapito, and H. Badino (2019) xR-EgoPose: egocentric 3D human pose from an HMD camera. In ICCV, Cited by: §2.
  • [50] A. Toshev and C. Szegedy (2014) Deeppose: human pose estimation via deep neural networks. In CVPR, Cited by: §2.
  • [51] H. Tung, H. Tung, E. Yumer, and K. Fragkiadaki (2017) Self-supervised learning of motion capture. In NeurIPS, Cited by: §2.
  • [52] G. Varol, J. Romero, X. Martin, N. Mahmood, M. J. Black, I. Laptev, and C. Schmid (2017) Learning from synthetic humans. In CVPR, Cited by: §3.1.
  • [53] T. von Marcard, R. Henschel, M. J. Black, B. Rosenhahn, and G. Pons-Moll (2018) Recovering accurate 3d human pose in the wild using imus and a moving camera. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 601–617. Cited by: §2, §2.
  • [54] S. Vosoughi and M. A. Amer (2018) Deep 3D human pose estimation under partial body presence. In ICIP, Cited by: §2.
  • [55] D. Xiang, H. Joo, and Y. Sheikh (2019) Monocular total capture: posing face, body, and hands in the wild. In CVPR, Cited by: §2.
  • [56] B. Xiao, H. Wu, and Y. Wei (2018) Simple baselines for human pose estimation and tracking. In ECCV, Cited by: §2.
  • [57] W. Xu, A. Chatterjee, M. Zollhoefer, H. Rhodin, P. Fua, H. Seidel, and C. Theobalt (2019) Mo 2 cap 2: real-time mobile 3d motion capture with a cap-mounted fisheye camera. IEEE transactions on visualization and computer graphics 25 (5), pp. 2093–2101. Cited by: §2.
  • [58] Y. Yang and D. Ramanan (2011) Articulated pose estimation using flexible mixtures of parts. In CVPR, Cited by: §2.
  • [59] L. Zhou, C. Xu, and J. J. Corso (2018) Towards automatic learning of procedures from web instructional videos. In AAAI, Cited by: Figure 1, Figure 2, §1, §1, §3.2, §4.1, §4.4, Table 1, Table 4, Table 5, Table 6.
  • [60] X. Zhou, M. Zhu, S. Leonardos, K. G. Derpanis, and K. Daniilidis (2016) Sparseness meets deepness: 3D human pose estimation from monocular video. In CVPR, Cited by: §2.
  • [61] X. Zhou, Q. Huang, X. Sun, X. Xue, and Y. Wei (2017) Towards 3D human pose estimation in the wild: a weakly-supervised approach. In ICCV, Cited by: §2.
  • [62] D. Zhukov, J. Alayrac, R. G. Cinbis, D. Fouhey, I. Laptev, and J. Sivic (2019) Cross-task weakly supervised learning from instructional videos. In CVPR, Cited by: Figure 1, Figure 2, §1, §1, §3.2, §4.1, §4.4, Table 1, Table 4, Table 5, Table 6.

Appendix 0.A Method

0.a.1 HMR Network Architecture

Our HMR network architecture is the same as the HMR model of Kanazawa et al. [24]. It consists of a ResNet-50 feature extractor, followed by a 3D regression refinement network consisting of 3 fully-connected layers mapping to SMPL, global rotation, and camera parameters. The fully-connected layers concatenate image features, mean SMPL parameters, and default global rotation and camera parameters. The FC layers also use residual links and dropout. More details can be found in the original paper. Also like HMR, we use same-padding for image inputs, although for illustrative purposes images in the paper are shown with white or black padding.

0.a.2 CMR Network Architecture

Our CMR network architecture is the same as the CMR model of Kolotouros et al. [26]. It consists first of a ResNet-50 encoder with the final fully-connected layer removed. This outputs a 2048-D feature vector, which is attached to the 3D coordinates of template mesh vertices. A series of graph convolutions then maps to a single set of 3D mesh vertices and to camera parameters. Finally, a multi-layer perceptron maps these vertices to SMPL parameters and global rotations. Final predictions use the camera parameters from the graph convolutions, and the SMPL parameters and global rotations from the MLP. More details can be found in the original paper.

0.a.3 Confident Predictions

As detailed in the paper, predictions are considered confident if the average variance of joint rotation parameters across jittered images is less than 0.005 for HMR, a threshold chosen empirically. For simplicity, the threshold for CMR is chosen so that approximately the same number of confident images is selected, resulting in a threshold of 0.004. The five images over which predictions are averaged are:

  1. the original image

  2. the image translated 10 pixels to the top left, padded on the bottom right

  3. the image translated 20 pixels to the top left, padded on the bottom right

  4. the image translated 10 pixels to the bottom right, padded on the top left

  5. the image translated 20 pixels to the bottom right, padded on the top left
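The jitter test above can be sketched as follows (array layout and the direction of the shift are illustrative; the thresholds are the paper's):

```python
import numpy as np

def shift(image, dx, dy):
    """Translate an H x W x C image by (dx, dy) pixels, padding with edge
    values (same-padding, matching the HMR input convention)."""
    h, w = image.shape[:2]
    padded = np.pad(image, ((20, 20), (20, 20), (0, 0)), mode="edge")
    return padded[20 + dy:20 + dy + h, 20 + dx:20 + dx + w]

def is_confident(predict_pose, image, threshold=0.005):
    """Accept a pseudo-label if joint-rotation predictions are stable under
    small translations: the variance across the five jittered inputs,
    averaged over parameters, must fall below the threshold (0.005 for
    HMR, 0.004 for CMR). predict_pose(image) -> 1D parameter array."""
    offsets = [(0, 0), (-10, -10), (-20, -20), (10, 10), (20, 20)]
    poses = np.stack([predict_pose(shift(image, dx, dy))
                      for dx, dy in offsets])
    return float(poses.var(axis=0).mean()) < threshold
```

A model whose output is invariant to these translations passes with zero variance, while one that flips between interpretations across jitters is rejected.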

0.a.4 Cropping

During training, the proposed method crops training and validation images into five categories: above hip, above shoulders, from knee to shoulder, around only an arm, and around only a hand. We show examples of cropping in Fig. 8. These crops correspond to common crops occurring in consumer video, displayed in Fig. 2 of the paper. Above hip corresponds to “Legs Not Visible”, knee to shoulder corresponds to “Head Not Visible”, and above shoulders corresponds to “Torso Not Visible”. For brevity, in Fig. 2, we condensed only an arm or only a hand into one image: “Arms / Hands Visible”.

During training, we sample crops with approximately the same frequency as they occur in the VLOG validation set. The proportions are: above hip in 29% of images, knee to shoulder in 10%, above shoulders in 16%, around one arm only in 9%, around one hand in 13%, and 23% of images left uncropped. On both MPII and VLOG, for both models, we produce the target crops using keypoints: ground truth keypoints on MPII, and reprojected keypoints from confident models on VLOG. Above hip crops are made from the lower of the hip keypoints. Knee to shoulder crops span from the higher of the knee keypoints to the bottom of the shoulder keypoints. Above shoulder crops use the lower of shoulder and neck as the bottom of the crop. Elbow and wrist keypoints are used to approximate one-arm and one-hand crops. If keypoints used for cropping are outside the image, for simplicity we presume the image is already cropped and do not crop further. If a prospective crop would be smaller than 30 pixels, we also do not crop, to prevent training on very low-resolution examples.
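For illustration, the "above hip" rule can be sketched as below (the function and joint names are ours; the other crop categories follow analogously):

```python
def above_hip_crop(img_h, keypoints, min_size=30):
    """Compute the vertical extent of an 'above hip' crop: the bottom edge
    is the lower of the two hip keypoints. If the hips fall outside the
    image we presume it is already cropped; if the crop would be under
    30 px tall we leave the image uncropped (both as described in the
    text). keypoints: dict joint name -> (x, y). Returns (top, bottom)."""
    ys = [keypoints[j][1] for j in ("r_hip", "l_hip") if j in keypoints]
    if not ys:
        return (0, img_h)  # no hip annotation: leave uncropped
    bottom = max(ys)       # lower hip keypoint = larger row index
    if not (min_size <= bottom <= img_h):
        return (0, img_h)  # off-image or too-small crop: leave uncropped
    return (0, int(bottom))
```

The same keypoint-driven logic applies whether the keypoints are MPII ground truth or reprojections from confident model predictions on VLOG.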

Figure 8: Generating Cropped Sets. Training and cropped testing use keypoints to crop images to target visibility specifications. Examples of each crop specification we use are pictured. Some images are left uncropped, and sometimes predefined crops do not further crop images (right).

Appendix 0.B Datasets

0.b.1 Dataset Annotation

As detailed in the paper, for each of VLOG, Instructions, YouCookII, and Cross-Task, we subsample 5k random frames containing exactly one person. Next, we use human annotators to label human keypoints on all of these frames. The full test sets consist of images in which at least one keypoint is annotated and on which workers agree (details in Keypoint Annotation). They are of size 4.1k, 2.2k, 3.3k, and 3.9k for VLOG, Instructions, YouCookII, and Cross-Task, respectively. Instructions is notably the smallest because it has a higher proportion of images containing no human keypoints. The most common such case is when part of a hand is visible (which carries no keypoint) but the wrist is not (which would).

All human annotations were gathered using thehive.ai, a website similar to Amazon Mechanical Turk. Annotators were given instructions and examples of each category. Then, they were given a test to verify they understood the task. Workers that passed the test were allowed to annotate images. However, as they annotated, they were shown, at random, gold standard sentinel examples (i.e., examples we labeled ourselves) that tested their accuracy. The platform automates the entire process. Because some workers spoke only Spanish, we provided directions in both English and Spanish. English annotation instructions are provided in Subsections 0.E.1, 0.E.2, and 0.E.3.

Keypoint Annotation

Although disagreement on ground-truth labels is traditionally handled by thehive.ai, the company does not currently support labeling keypoints in instances where an ambiguous number of keypoints is visible, which occurs here. For instance, consider the case where a person's elbow is at the edge of an image: some annotators may label the elbow while others do not. To deal with annotations containing different numbers of visible joints, we combine workers' annotations for a given image ourselves, using the median number of joints annotated across workers and averaging the locations.

Specifically, we have each image labeled three separate times. If two or three of the instances have no joints labeled, we consider the image as having no joints visible. Otherwise, we take the median number of joints visible across the three instances, and average joint locations across instances when possible. If any given joint has predictions differing by a large margin (10% of image size), we do not annotate it. If all three workers disagree on the number of joints visible, we consider the image ambiguous and do not add it to our test set.
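This merging procedure can be sketched roughly as follows (a simplification: the paper takes the median joint count across the three workers, which we approximate here by keeping joints labeled by at least two of them):

```python
import numpy as np

def merge_annotations(workers, image_size, tol=0.10):
    """Combine three workers' keypoint labels for one image.
    workers: three dicts, joint name -> (x, y); an unlabeled joint is
    simply absent. Returns (status, merged), where merged maps joint
    names to averaged locations."""
    counts = [len(w) for w in workers]
    if sum(c == 0 for c in counts) >= 2:
        return "no_joints", {}   # two or three workers saw no joints
    if len(set(counts)) == 3:
        return "ambiguous", {}   # all three disagree on joint count
    merged = {}
    for joint in set().union(*workers):
        pts = np.array([w[joint] for w in workers if joint in w], float)
        if len(pts) < 2:
            continue             # approximation of the median-count rule
        if np.ptp(pts, axis=0).max() > tol * image_size:
            continue             # labels differ by >10% of image size
        merged[joint] = tuple(pts.mean(axis=0))
    return "ok", merged
```

An image merged with status "ambiguous" or "no_joints" is excluded from, or marked empty in, the test set, matching the rules above.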

3D Mesh Annotation

Absolute Judgment: Workers annotated whether the mesh was largely correct or not. The percentage of times they annotated correct gives the Percentage of Good Meshes. For mesh scoring, all ablations were scored at the same time to avoid possible bias between models. Additionally, model outputs and images were put in random order so no worker saw many outputs from the same model in a row.

Relative Judgment: For A/B testing, workers were presented with an image and two outputs. They selected the output that best matched the image. Order in which predictions were seen in relation to the image was randomized, and which outputs were compared was also presented in random order.

0.b.2 Dataset Details and Statistics

As explained in the paper, we evaluate keypoint accuracy on images in which the head is visible in order to calculate PCK. This results in test subsets of size 1.8k, 0.8k, 1.5k, and 1.9k for VLOG, Instructions, YouCookII, and Cross-Task, respectively. These sets are not representative of the full test-set visibility statistics, and do not allow for out-of-image keypoint evaluation. Therefore, we crop body parts to closely match aggregate test-set statistics. We use the same canonical crops as during training, displayed in Fig. 8, but explicitly choose crop proportions to closely match the full test sets.

Uncropped keypoint test sets are biased since their images always contain head keypoints, which are needed for computing PCK. Therefore, we must crop aggressively to match full test-set statistics. Furthermore, above hip and above shoulder crops are not useful to this end, as they include head keypoints. Knee-to-shoulder crops are also not optimal, as they exclude leg keypoints too often while still sometimes including shoulder and neck keypoints. Instead, to match full test-set statistics, we frequently use crops around hands and arms, while leaving some images uncropped. Statistics on full test sets, uncropped keypoint test sets, and cropped keypoint test sets are detailed in Table 7.

VLOG Instructions YouCookII Cross-Task
Uncr. Cr. Full Uncr. Cr. Full Uncr. Cr. Full Uncr. Cr. Full
R Ank 0.16 0.07 0.11 0.22 0.11 0.13 0.00 0.00 0.00 0.04 0.02 0.03
R Kne 0.29 0.16 0.20 0.32 0.23 0.24 0.01 0.00 0.01 0.07 0.04 0.06
R Hip 0.48 0.30 0.34 0.64 0.40 0.32 0.51 0.33 0.30 0.46 0.30 0.31
L Hip 0.48 0.31 0.34 0.66 0.43 0.33 0.51 0.35 0.30 0.46 0.30 0.31
L Kne 0.29 0.16 0.20 0.33 0.23 0.25 0.01 0.00 0.01 0.07 0.04 0.06
L Ank 0.15 0.07 0.10 0.23 0.11 0.15 0.00 0.00 0.00 0.04 0.02 0.04
R Wri 0.78 0.65 0.73 0.83 0.69 0.76 0.78 0.61 0.72 0.76 0.59 0.72
R Elb 0.75 0.48 0.55 0.87 0.57 0.52 0.79 0.50 0.45 0.78 0.50 0.51
R Sho 0.95 0.63 0.61 0.95 0.43 0.45 0.99 0.60 0.53 0.97 0.60 0.56
L Sho 0.95 0.62 0.61 0.94 0.43 0.44 0.99 0.58 0.54 0.97 0.60 0.56
L Elb 0.76 0.49 0.54 0.88 0.56 0.52 0.78 0.51 0.46 0.78 0.51 0.51
L Wri 0.76 0.66 0.70 0.83 0.70 0.71 0.76 0.61 0.70 0.75 0.61 0.71
Neck 1.00 0.67 0.61 1.00 0.46 0.45 1.00 0.62 0.54 1.00 0.63 0.57
Head Top 1.00 0.55 0.47 1.00 0.28 0.38 1.00 0.36 0.48 1.00 0.47 0.51
Nose 0.95 0.61 0.57 0.98 0.42 0.47 1.00 0.52 0.54 0.99 0.56 0.57
L Eye 0.95 0.58 0.55 0.98 0.38 0.45 1.00 0.47 0.54 0.99 0.54 0.56
R Eye 0.95 0.58 0.55 0.98 0.36 0.45 1.00 0.48 0.54 0.99 0.53 0.56
L Ear 0.93 0.56 0.53 0.96 0.35 0.43 0.99 0.47 0.53 0.99 0.53 0.55
R Ear 0.93 0.56 0.53 0.94 0.33 0.42 0.99 0.47 0.53 0.98 0.53 0.55
Keypoints 13.5 8.7 8.8 14.5 7.5 7.9 13.1 7.5 7.7 13.1 7.9 8.2
Table 7: Proportion of Visible Joints in Test Sets. Proportion of dataset images containing a particular joint for each of: Uncropped Keypoint Test Set (Uncr.), Cropped Keypoint Test Set (Cr.), and Full Test Set (Full). The final row (Keypoints) gives the mean number of annotated keypoints per image.

Appendix 0.C Additional Qualitative Results

0.c.1 Additional Results on VLOG

0.c.2 Additional Results on Instructions

0.c.3 Additional Results on YoucookII

0.c.4 Additional Results on Cross-Task

Appendix 0.D Additional Results

Comparison to One Round: The paper presents a comprehensive comparison between our full method, the method trained only on MPII crops, and the original baselines. As our method performs two iterations of training on VLOG, we additionally compare final performance to the method after only a single round of VLOG training in Table 8. As reported in the paper, workers prefer the model after a second iteration of training.

Method One Round
VLOG Instructions YouCookII Cross-Task
Full 18/74/8 15/76/9 18/75/7 15/80/5
Table 8: A/B testing on All Datasets, using HMR. For each entry we report how frequently (%) our full method wins/ties/loses to the method trained with only one round of VLOG training. For example, column 1 shows Full is preferred to One Round 18% of the time, and One Round is preferred just 8% of the time.

Comparison to other confidence methods: As reported in the paper, our confidence selection also performs similarly to, or slightly better than, agreement between CMR and HMR SMPL joints, and agreement between HMR and OpenPose keypoints. Full results are available in Table 9. CMR/HMR agreement is computed in the same manner as our method, and thresholds are set to select approximately the same number of images. Training takes approximately the same time, and two rounds of self-training are used. The same holds for the comparison with OpenPose, with the distinction that OpenPose can only predict keypoints inside an image, so we only consider joints both networks predict as in-image. Additionally, OpenPose filters out unconfident keypoints, so we only compare joints predicted by both networks. We observe OpenPose struggles especially when the face is truncated, so agreement occurs mostly in highly-visible settings. This leads to better uncropped keypoint accuracy, but worse cropped accuracy.

                    VLOG                  Instructions          YouCookII             Cross-Task
                    Cropped       Uncr.   Cropped       Uncr.   Cropped       Uncr.   Cropped       Uncr.
Method              Total  Out    Total   Total  Out    Total   Total  Out    Total   Total  Out    Total
CMR agreement       54.3   35.9   68.1    47.2   33.9   78.5    74.0   59.5   94.9    72.2   51.8   90.7
OpenPose agreement  54.6   34.6   71.1    46.1   31.8   79.8    73.2   58.8   95.7    71.3   50.0   92.2
Ours                55.9   38.9   68.7    48.7   36.4   77.9    76.7   64.1   95.4    74.5   57.2   91.1
Table 9: PCK @ 0.5 on All Datasets, using HMR. We compute PCK in test set images in which the head is fully visible. These images are then cropped to emulate the keypoint visibility statistics of the entire dataset, on which we can calculate PCK on predictions outside the image. We also compute PCK on the uncropped images.

Appendix 0.E Annotation Instructions

0.e.1 Keypoint Annotation Instructions

0.e.2 Mesh Scoring Instructions

0.e.3 A/B Testing Instructions