3D Pose Detection in Videos: Focusing on Occlusion

06/24/2020 ∙ by Justin Wang, et al. ∙ Stanford University

In this work, we build upon existing methods for occlusion-aware 3D pose detection in videos. We implement a two-stage architecture in which a stacked hourglass network produces 2D pose predictions, which are then fed into a temporal convolutional network to produce 3D pose predictions. To facilitate prediction on poses with occluded joints, we introduce an intuitive generalization of the cylinder man model used to generate occlusion labels. We find that the occlusion-aware network achieves a mean-per-joint-position error 5 mm lower than our linear baseline model on the Human3.6M dataset. Compared to our temporal convolutional network baseline, we achieve a comparable mean-per-joint-position error, 0.1 mm lower, at reduced computational cost.




1 Introduction

Human pose detection has been an active area of research in the deep learning community since 2014, beginning with Toshev et al.’s DeepPose [7], a work focused on 2D pose estimation. The problem involves detecting joint positions and bone lengths of humans in both images and videos. In 3D pose estimation, additional challenges arise with regard to projecting from 2D image data onto 3D keypoint coordinates and vice versa. However, in applications requiring extensive, realistic tracking of human motion, 3D pose estimation emerges as the only viable option. Practical examples include sports and medicine, where athletes can analyze their footwork and body form in sharp detail, and medical professionals can study a patient’s gait closely before entering the operating room.

We focus on improving 3D pose estimation in video for occluded cases. Occluded joints are body joints that cannot be seen by the camera because they are blocked by other joints, other body parts, or external objects. Researchers have shown that occlusion is a significant source of error in state-of-the-art models for both 2D and 3D pose estimation of single images [4, 8]. Recent work by Cheng et al. [1] has focused on 3D pose estimation in video, specifically tackling the problem of occlusion by making use of the temporal information that videos provide and single frames lack.

We build a temporal convolutional network (TCN) for 3D pose estimation that is designed to work well specifically in occluded cases. We plan to work with two well-known 3D pose datasets, HumanEva and Human3.6M, but we will primarily focus on HumanEva due to its simpler structure and lower resource and computational power requirements.

Since previous works have trained and fine-tuned 2D pose estimation models using Stacked Hourglass architectures on the frames of our datasets, we are able to acquire predicted 2D poses, provided by [5] in the form of joint keypoints. From these 2D joint coordinates, we also generate occlusion predictions based on predefined heuristics, where we label the joint $1$ if it is occluded and $0$ otherwise. The inputs to our model are the predicted 2D joint coordinates and occlusion vector. Our labels are the 3D ground truth poses from the dataset in the form of keypoints. We also include “ground truth” occlusion labels ($0$ or $1$) generated from ground truth 3D keypoints using our own baseline heuristic. The outputs of our model are predicted 3D poses, represented as 3-dimensional coordinates in the camera’s frame.

2 Related Work

A landmark improvement in 2D pose estimation came via Newell et al.’s Stacked Hourglass architecture [4]. Motivated by the understanding that human poses are best captured at different levels of detail (e.g. the location of faces and hands as opposed to the person’s overall orientation), this architecture consists of pooling and upsampling layers whose arrangement resembles an hourglass, hence the name.

Building on the success of the Stacked Hourglass on 2D pose estimation, Martinez et al. [3] constructed a simple baseline that uses the 2D pose estimations produced by the Stacked Hourglass model as inputs to a linear model to produce 3D pose estimations.

Pavllo et al. [5] focus on 3D pose estimation in video. Specifically, their approach uses a temporal convolutional network (TCN) instead of a linear model on top of the 2D pose estimations produced by the Stacked Hourglass model. Earlier methods incorporated recurrent neural networks (RNN) to capture the temporal relationship between frames in a video. The temporal convolutional architecture also exploits this relationship, but allows multiple frames to be processed in parallel, something not possible with recurrent architectures.

However, [5] did not specifically focus on occlusion and therefore had many problems predicting occluded joints. Recent work by Cheng et al. [1] has focused on 3D pose estimation in video, specifically tackling the problem of occlusion. We seek to build off their implementation and results. To do so, we investigate why their model succeeds. Videos provide a sequence of frames whose temporal information can better inform a model’s estimations in occluded settings, since neighboring frames give context for where an occluded joint should be located. Cheng et al. [1] use an occlusion-aware convolutional neural network to mitigate the effects of occlusion on 3D pose estimation in videos. Their “Cylinder Man Model” is a heuristic that maps 3D ground truth joint keypoints to 2D ground truth pose heatmaps and takes occlusion into account. With the 2D ground truth heatmaps, they are able to derive occlusion labels for each of the 2D keypoints. Cheng et al. take predicted 2D poses from a Stacked Hourglass model, filter out occluded joints, and train a 2D temporal convolutional network to smooth the predicted 2D keypoints. Lastly, they input the smoothed 2D keypoints and occlusion predictions into a 3D temporal convolutional network and generate 3D pose predictions, supervised by 3D ground truth poses and occlusion labels.

Cheng et al. primarily use 2D “ground truth” heatmaps to train a Stacked Hourglass model to output 2D predicted heatmaps with more occlusions. Although they also input occlusion labels to their TCN, we seek to actually train our TCN to recognize the ground truth occlusions by explicitly adding an occlusion term to the loss function. We plan to iterate upon their cylinder man model heuristic to deal with occlusion.

Figure 1: Visualization generated by Martinez et al. [3]. The left column corresponds to 2D joint coordinates, the middle to ground truth 3D joint coordinates, and the right to predicted 3D joint coordinates from 2D heatmaps.

3 Datasets

3.1 Human 3.6M

The Human 3.6M dataset [2], a 3D Pose dataset, consists of 3.6 million images from actors performing a few daily-life activities. There are a total of 4 cameras and 7 annotated subjects. The preprocessed Human3.6M dataset consists of 3D ground truth joint keypoints ordered by camera used during recording, camera parameters, actions conducted by subject, and subject names. As with the temporal convolutional and linear baseline models, we use Subjects 1, 5, 6, 7, 8 for training, and reserve Subjects 9 and 11 for evaluation.

We have access to the pre-processed dataset and have contacted the curators to gain access to the original video frames. However, we have yet to receive a response.

3.2 HumanEva-I

We have gained full access to the HumanEva-I dataset [6], which contains 3D ground truth keypoints per frame, representing 15 joints (pelvis, thorax, shoulders, elbows, wrists, hips, knees, ankles, head), along with their original video frames. There are a total of seven video sequences (four grayscale and three color) of four annotated subjects performing five different actions (Walking, Jogging, Throwing/Catching, Gesturing, and Boxing).

We generate our training, validation, and test data using a modified version of the pre-processing step used in [5]. Specifically, this pre-processing step involves calculating the projection from 3D ground truth joint keypoints to 2D joint keypoints using the provided motion capture and camera calibration data. Additionally, since the original video frames occasionally contain chunks of invalid joint keypoint measurements, we simply discard these.
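The projection step can be illustrated with an ideal pinhole camera. This is only a minimal sketch: the function name `project_to_2d`, the focal length, and the principal point are hypothetical values chosen for illustration, and lens distortion, which real calibration data usually includes, is ignored.

```python
import numpy as np

def project_to_2d(points_cam, focal, center):
    """Project 3D camera-frame points (J, 3) to 2D pixel coordinates (J, 2).

    Assumes an ideal pinhole camera; lens distortion is ignored.
    """
    points_cam = np.asarray(points_cam, dtype=float)
    # Perspective divide: x/z and y/z for every joint.
    xy = points_cam[:, :2] / points_cam[:, 2:3]
    # Scale by the focal length and shift by the principal point.
    return xy * np.asarray(focal, dtype=float) + np.asarray(center, dtype=float)

# Two hypothetical joints, 2 and 4 units in front of the camera.
joints_3d = np.array([[0.0, 0.0, 2.0], [0.5, -0.25, 4.0]])
joints_2d = project_to_2d(joints_3d, focal=(1000.0, 1000.0), center=(500.0, 500.0))
```

A joint on the optical axis lands on the principal point, and joints farther from the camera move less in the image for the same lateral offset.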

For the remaining video frames, we only consider the three color cameras, the first three subjects, and the first video sequence take of each action. This leaves us with a total of 28,731 data entries, partitioned into a roughly 50/50 training and validation split.

4 Method

4.1 Clustered Ground Truth Occlusions

In order to generate “ground truth” occlusion labels for 2D poses, we start with a simple heuristic called “Clustered Occlusions”, similar to the method of [1]. For every frame, we have 17 joint keypoints in the form of 3D camera coordinates for the Human3.6M dataset, or 15 joint keypoints for the HumanEva-I dataset. Because most occlusions in these frames are caused by other body parts or joints, our intuition for finding which joints are occluded is to find joints that are clustered together in the $xy$-plane, then mark the joint closest to the camera (smallest depth) as not occluded, and mark the other joints in the cluster as occluded. In other words, for each joint coordinate $p_i = (x_i, y_i, z_i)$, we find the set $S_i$ of keypoints $p_j$, $j \neq i$, whose $xy$-coordinates lie within $\epsilon$ of $(x_i, y_i)$, where $\epsilon$ is the tunable tolerance parameter. We currently use a fixed value of $\epsilon$. Then, we add $p_i$ and all keypoints in $S_i$ to form a cluster $C_i$, and our non-occluded keypoint index from this set is

$$k^* = \arg\min_{k} z_k$$

where $p_k \in C_i$. We mark all other points in $C_i$ as occluded. We hope that this heuristic can generally fetch all joints that are observable by the camera, as we believe joints in close proximity occlude one another. This gives us a vector of occluded joints for each frame, where $1$ means occluded and $0$ means not occluded.
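The clustering heuristic described above can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the function name `clustered_occlusions` is hypothetical, closeness in the image plane is assumed to be Euclidean distance, the tolerance value is made up, and the output is assumed to use 1 for occluded and 0 for visible.

```python
import numpy as np

def clustered_occlusions(joints_cam, eps):
    """Return a (J,) integer vector: 1 = occluded, 0 = not occluded.

    joints_cam: (J, 3) joint positions in camera coordinates (x, y, depth).
    eps: tolerance for treating two joints as clustered in the xy-plane.
    """
    joints_cam = np.asarray(joints_cam, dtype=float)
    xy, depth = joints_cam[:, :2], joints_cam[:, 2]
    occluded = np.zeros(len(joints_cam), dtype=int)
    for i in range(len(joints_cam)):
        # Cluster: joint i plus every joint within eps of it in the xy-plane.
        # Assumption: "within eps" means Euclidean distance.
        dist = np.linalg.norm(xy - xy[i], axis=1)
        cluster = np.flatnonzero(dist <= eps)
        # The joint with the smallest depth is treated as visible.
        nearest = cluster[np.argmin(depth[cluster])]
        # Every other member of the cluster is marked occluded.
        occluded[cluster[cluster != nearest]] = 1
    return occluded

# Joints 0 and 1 overlap in the image; joint 1 is closer to the camera.
frame = np.array([[0.00, 0.0, 3.0],
                  [0.01, 0.0, 2.0],
                  [1.00, 1.0, 5.0]])
labels = clustered_occlusions(frame, eps=0.05)
```

Because a joint is never unmarked once it has been flagged, a joint that is hidden in any cluster stays occluded even if it is the nearest joint in another cluster.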

After applying the heuristic to get occluded joints, we generate ground truth 2D heatmaps for each existing 2D pose by placing a white circle with Gaussian smoothness at the image coordinates for joints that are not occluded, and doing nothing for joints that are occluded. Then, we take a center crop of the heatmap to crop the subject and resize the width and height to 128. Our ground truth and predicted 2D heatmaps can be visualized in Figures 2 and 3.
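A minimal sketch of this heatmap rendering is shown below. It assumes the 2D joints have already been cropped and rescaled into a 128 by 128 frame, that all joints share one single-channel heatmap, and that the Gaussian width `sigma` is a free choice; the function name `render_heatmap` is hypothetical.

```python
import numpy as np

def render_heatmap(joints_2d, occluded, size=128, sigma=2.0):
    """Draw one Gaussian blob per visible joint on a (size, size) heatmap."""
    ys, xs = np.mgrid[0:size, 0:size]
    heatmap = np.zeros((size, size))
    for (x, y), occ in zip(joints_2d, occluded):
        if occ:
            # Occluded joints contribute nothing to the heatmap.
            continue
        blob = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
        # Keep the brightest value where blobs overlap.
        heatmap = np.maximum(heatmap, blob)
    return heatmap

# One visible joint at (x=10, y=20) and one occluded joint at (x=100, y=90).
heatmap = render_heatmap([(10, 20), (100, 90)], occluded=[0, 1])
```

The resulting heatmap peaks at the visible joint and stays empty around the occluded one, which is why the ground truth heatmaps show fewer peaks than the predicted ones.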

Figure 2: Ground truth 2D heatmaps (top) and predicted 2D heatmaps (bottom) for a sequence of poses. The ground truth heatmaps have fewer peaks because occluded keypoints are omitted.

This is only a simple heuristic, and we hope to test how adding the ground truth occlusion labels affects our error.

Figure 3: Up close example of the occluded joints omitted in ground truth heatmap (top) compared to predicted heatmap (bottom).

4.2 Boxed Man Model

To improve our heuristic, we take inspiration from the cylinder man model delineated in [1]. The cylinder man model generates occlusion labels for 2D poses using 3D poses (Figure 4). Specifically, it models the head, body, arms, and legs as cylindrical segments. In general, if a certain joint is located within another joint’s cylindrical segment in 3D space, it is deemed occluded in the 2D space. More specifically, this technique calculates a visibility variable to determine the degree of occlusion for each joint.

We take this idea and adapt it to our computing needs, and introduce an occlusion technique that requires less computational power and memory than the original cylinder man model. We propose a boxed man model that generates occlusion labels for 2D poses using the original 2D poses. We visualize the original cylinder man, with equivalent proportions, squashed into 2D space.

For example, in the case of the Human3.6M dataset, keypoints 9 and 10 represent the top and bottom of the subject’s head. Define these keypoints to be $p_1 = (x_1, y_1)$ and $p_2 = (x_2, y_2)$, as seen in Figure 5. We use these two keypoints and project them to four points $b_1, b_2, b_3, b_4$, which determine the bounds of our boxed approximation of the head. We determine the four points by first calculating the slope of the line $\overline{p_1 p_2}$, which we define as $m$, and then finding the perpendicular slope $m_\perp = -1/m$. We then define $b_1$ and $b_2$ as the two points on the line through $p_1$ with slope $m_\perp$, one on either side of $p_1$, at an offset set by $w$, using $p_1$’s coordinates $x_1$ and $y_1$.

$w$ is a hyperparameter that determines the width of the boxes.

$b_3$ and $b_4$ are defined similarly and instead use $p_2$’s coordinates $x_2$ and $y_2$. The box that encloses the chest and torso area is defined by the four points that are provided in the keypoints, specifically 1, 4, 11, 14 in the Human3.6M dataset.

If a joint in the boxed man model is located within another joint’s boxed segment in 2D space, we deem it occluded. This simpler heuristic encourages our temporal convolutional network to learn poses based on joints that are definitively not occluded. We believe that the temporal convolutional network will be able to learn poses for occluded joints from other camera positions in the dataset, and should learn as much as possible from the original data rather than be forced to learn from certain joints by a heuristic.
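The box construction and the containment test can be sketched with vectors instead of slopes, which sidesteps the vertical-segment case where the slope is undefined. This is a sketch under stated assumptions, not the authors' code: the names `segment_box` and `inside_box` are hypothetical, and the corners are assumed to sit half the box width to either side of each keypoint.

```python
import numpy as np

def segment_box(p1, p2, width):
    """Return the four corners of a box of the given width around segment p1-p2."""
    p1, p2 = np.asarray(p1, dtype=float), np.asarray(p2, dtype=float)
    axis = p2 - p1
    # Unit vector perpendicular to the segment.
    normal = np.array([-axis[1], axis[0]]) / np.linalg.norm(axis)
    half = 0.5 * width * normal  # assumption: corners offset by half the width
    return np.array([p1 + half, p1 - half, p2 - half, p2 + half])

def inside_box(point, p1, p2, width):
    """Check whether a 2D point lies inside the box around segment p1-p2."""
    p1, p2, point = (np.asarray(v, dtype=float) for v in (p1, p2, point))
    axis = p2 - p1
    length = np.linalg.norm(axis)
    unit = axis / length
    rel = point - p1
    along = rel @ unit                              # position along the segment
    across = rel @ np.array([-unit[1], unit[0]])    # signed distance from it
    return bool(0.0 <= along <= length and abs(across) <= 0.5 * width)

# A vertical "head" segment from (0, 0) to (0, 2) with box width 1.
corners = segment_box((0.0, 0.0), (0.0, 2.0), width=1.0)
hidden = inside_box((0.2, 1.0), (0.0, 0.0), (0.0, 2.0), width=1.0)
clear = inside_box((1.0, 1.0), (0.0, 0.0), (0.0, 2.0), width=1.0)
```

In practice a joint would only be tested against boxes of segments it does not belong to, since the endpoints of a segment always lie inside their own box.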

Figure 4: A visualization of the cylinder man model [1]. Each arm and leg are scaled to a diameter of 5 cm, and the head is scaled to a diameter of 10 cm.
Figure 5: The setup for the boxed man model. Note that $\hat{i}$ and $\hat{j}$ represent unit vectors for the $x$ and $y$ axes.

4.3 Temporal Convolutional Network Model

Our main model that produces 3D poses is the 3D temporal convolutional network (TCN) adapted from Pavllo et al. [5]. Taking a consecutive sequence of 2D joint keypoints, it uses temporal convolutions and residual connections to predict a frame’s 3D joint coordinates. The input layer applies a temporal convolution over the 2D keypoints with kernel size $W$, expanding the number of channels from two times the number of joints, $2J$ (one channel for each $x$ and $y$ coordinate), to $C$. Then, the model goes through $B$ ResNet-type blocks, which are connected through skip layers. Each block has a 1D convolution with kernel size $W$ and $C$ channels, followed by another convolution with kernel size 1. All convolutional layers are followed by batch normalization, ReLU activation, and dropout layers. Finally, the last layer shrinks the number of channels and outputs the predicted 3D pose keypoints.
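To make the channel bookkeeping of the input layer concrete, the following NumPy sketch applies a single temporal convolution to a keypoint sequence. It is illustrative only: the function name `temporal_conv`, the kernel size of 3, the 8 output channels, and the 9-frame window are hypothetical values, and batch normalization, ReLU, dropout, and the residual blocks are left out.

```python
import numpy as np

def temporal_conv(x, weight, bias):
    """Unpadded 1D convolution over the time axis.

    x:      (T, C_in)  one row of flattened 2D keypoints per frame
    weight: (K, C_in, C_out) with K the temporal kernel size
    bias:   (C_out,)
    returns (T - K + 1, C_out)
    """
    kernel_size = weight.shape[0]
    steps = x.shape[0] - kernel_size + 1
    out = np.stack([
        # Contract the time window and input channels against the kernel.
        np.tensordot(x[t:t + kernel_size], weight, axes=([0, 1], [0, 1]))
        for t in range(steps)
    ])
    return out + bias

rng = np.random.default_rng(0)
num_joints, kernel_size, channels, frames = 17, 3, 8, 9
keypoints = rng.normal(size=(frames, 2 * num_joints))   # x and y per joint
features = temporal_conv(
    keypoints,
    rng.normal(size=(kernel_size, 2 * num_joints, channels)),
    np.zeros(channels),
)
# features has shape (7, 8): 9 - 3 + 1 time steps, 8 channels
```

Each unpadded convolution shortens the sequence by the kernel size minus one, which is how stacking such layers eventually reduces a window of frames to a single predicted pose.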

Because we are focusing on occlusion, we also input a sequence of predicted occluded joints into our TCN, where every joint is either $0$ for not occluded or $1$ for occluded. First, we apply a temporal convolution to the occluded vectors. Then, we apply a sigmoid activation and zero out keypoints whose values in the occluded vectors are above some threshold. By doing this, we are effectively trying to learn the ground truth occluded vectors over the temporal convolution, in the hope of zeroing out the actually occluded keypoints before applying convolutions over the 2D keypoint sequence.
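The gating step, sigmoid followed by zeroing keypoints above a threshold, can be sketched as below. The name `gate_keypoints` and the threshold of 0.5 are assumptions made for illustration, and the logits stand in for the output of the temporal convolution over the occluded vectors.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gate_keypoints(keypoints_2d, occlusion_logits, threshold=0.5):
    """Zero out keypoints whose occlusion score exceeds the threshold.

    keypoints_2d:     (T, J, 2) 2D joint coordinates per frame
    occlusion_logits: (T, J)    raw outputs of the occlusion branch
    """
    scores = sigmoid(occlusion_logits)
    visible = scores <= threshold
    # Broadcast the visibility mask over the x and y coordinates.
    return keypoints_2d * visible[..., None], scores

keypoints = np.ones((1, 3, 2))
logits = np.array([[-4.0, -1.0, 4.0]])   # only the last joint looks occluded
gated, scores = gate_keypoints(keypoints, logits)
```

Returning the scores alongside the gated keypoints lets the same branch be supervised directly against the ground truth occluded vectors.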

We explore two variants of our model with this input. Our first variant uses down-convolutions to drop from a temporal range of vectors to one occluded vector for the frame of the 3D pose we are predicting. This way, the outputted occluded vector can be directly compared to the frame’s ground truth occluded vector in the loss. Instead of focusing on learning only one occluded vector, our second variant does not have any down-convolutions and tries to learn the whole sequence of ground truth occluded vectors.

4.4 Loss Function

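With the ground truth occlusion labels, we can now train our TCN to notice the occluded joints. To do so, we add an occlusion term to our existing loss function $L_{\text{MPJPE}}$, which is the mean per joint positional error (MPJPE) between estimated and ground truth 3D poses:

$$L_{\text{MPJPE}} = \frac{1}{NJ} \sum_{n=1}^{N} \sum_{j=1}^{J} \lVert \hat{y}_{n,j} - y_{n,j} \rVert_2$$

where $N$ represents the number of examples, $J$ is the number of joints, and $\hat{y}_{n,j}$ and $y_{n,j}$ are the estimated and ground truth 3D positions of joint $j$ in example $n$. Now, let the ground truth occlusion vector for a frame be $o$ and the predicted occlusions be $\hat{o}$. Our modified loss function is:

$$L = \lambda_1 L_{\text{MPJPE}} + \lambda_2 L_{\text{occ}}(\hat{o}, o)$$

where $L_{\text{occ}}$ penalizes the mismatch between predicted and ground truth occlusions, and $\lambda_1$ and $\lambda_2$ are tunable weights.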
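A weighted combination of a pose term and an occlusion term might look like the sketch below. The form of the occlusion term is not recoverable from the text, so binary cross-entropy is used here purely as an assumption; the function names and the weight values `w_pose` and `w_occ` are likewise hypothetical.

```python
import numpy as np

def mpjpe(pred, target):
    """Mean Euclidean distance per joint; pred and target are (N, J, 3)."""
    return np.linalg.norm(pred - target, axis=-1).mean()

def occlusion_bce(pred_occ, true_occ, tiny=1e-7):
    """Binary cross-entropy between predicted scores and 0/1 labels.

    Assumption: the original occlusion term may have a different form.
    """
    p = np.clip(pred_occ, tiny, 1.0 - tiny)
    return -(true_occ * np.log(p) + (1.0 - true_occ) * np.log(1.0 - p)).mean()

def total_loss(pred, target, pred_occ, true_occ, w_pose=1.0, w_occ=0.1):
    """Weighted sum of the pose error and the occlusion error."""
    return w_pose * mpjpe(pred, target) + w_occ * occlusion_bce(pred_occ, true_occ)

pred = np.zeros((2, 3, 3))
target = np.tile([3.0, 4.0, 0.0], (2, 3, 1))     # every joint is 5 units off
pred_occ = np.full((2, 3), 0.5)                  # maximally uncertain scores
true_occ = np.array([[1.0, 0.0, 1.0], [0.0, 0.0, 1.0]])
loss = total_loss(pred, target, pred_occ, true_occ)
```

The two weights trade off pose accuracy against occlusion accuracy; setting the occlusion weight to zero recovers the plain MPJPE objective.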

5 Experiments

5.1 Baselines

Pavllo et al. [5] implemented 3D pose estimation in videos using temporal convolutions and semi-supervised training. We use this model as a baseline for our 3D pose estimation because it does not actively seek to solve the problem of occlusion.

Another baseline whose work we seek to build upon and compare against is [1]. Because their network is occlusion-aware, we hope to achieve similar results or possibly improve on the problems that they face.

5.2 Results

We evaluate our models using the original mean per joint positional error (MPJPE) between estimated and ground truth 3D poses in millimeters (mm). The mean is calculated over all $J$ joints used in each image frame; in this case, $J = 17$ for the Human3.6M dataset, and $J = 15$ for the HumanEva-I dataset. We first align the pelvis as the root joint before comparing differences in Euclidean distance between the poses. The joints are also normalized with respect to the root joint.
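The root alignment used in this evaluation can be sketched as follows. The function name `root_aligned_mpjpe` is hypothetical, and the pelvis is assumed to be stored at joint index 0, which may differ from the actual joint ordering.

```python
import numpy as np

def root_aligned_mpjpe(pred, target, root=0):
    """MPJPE after expressing both poses relative to the root joint.

    pred, target: (N, J, 3) poses; root: index of the pelvis (assumed 0).
    """
    pred = pred - pred[:, root:root + 1]
    target = target - target[:, root:root + 1]
    return np.linalg.norm(pred - target, axis=-1).mean()

target = np.arange(4 * 17 * 3, dtype=float).reshape(4, 17, 3)

# A pure translation of the whole skeleton costs nothing after alignment.
shifted = target + np.array([10.0, -5.0, 2.0])
error_shifted = root_aligned_mpjpe(shifted, target)

# Moving every non-root joint by 5 units gives an error of 5 * 16 / 17.
bent = target.copy()
bent[:, 1:] += np.array([3.0, 4.0, 0.0])
error_bent = root_aligned_mpjpe(bent, target)
```

Aligning on the root means the metric measures the shape of the pose rather than where the subject stands in the camera frame.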

5.3 Results on HumanEva-I

We trained and evaluated our model on HumanEva, looking at Subjects 1, 2, and 3 and the actions Walk, Jog, and Box, as they are the subjects and actions focused on in [5]. We compare mostly against [5] because we use the same network, but we also compare our results to [1]. Results are shown in Tables 1 through 5. Tables 1 through 3 show the results of the different methods for each action, Table 4 shows Cheng et al.’s results, and Table 5 shows the averages over all actions and subjects for each method.

| Walk | Subject 1 | Subject 2 | Subject 3 | Average |
|---|---|---|---|---|
| 1 | 13.9 | 10.2 | 46.6 | 23.6 |
| 2 | 14.4 | 10.2 | 46.8 | 23.8 |
| 3 | 14.1 | **10.0** | 46.7 | 23.6 |
| 4 | **13.8** | 10.1 | **46.5** | **23.5** |

Table 1: MPJPE of the action Walking for HumanEva (mm). (1: Pavllo et al. [5]; 2: One vector, Clustered; 3: Many vectors, Clustered; 4: Many vectors, Boxed Man). Our multiple occlusion vector method coupled with the boxed man model keypoints achieves the best average MPJPE across subjects. Bolded numbers are the best among the methods.

| Jog | Subject 1 | Subject 2 | Subject 3 | Average |
|---|---|---|---|---|
| 1 | 20.9 | 13.1 | 13.8 | 15.9 |
| 2 | **20.7** | **13.0** | **13.7** | **15.8** |
| 3 | 21.0 | 13.1 | 13.8 | 16.0 |
| 4 | 21.1 | **13.0** | **13.7** | 15.9 |

Table 2: MPJPE of the action Jogging for HumanEva (mm). (1: Pavllo et al. [5]; 2: One vector, Clustered; 3: Many vectors, Clustered; 4: Many vectors, Boxed Man). Our one vector method coupled with the simple clustered heuristic achieves the best average MPJPE across subjects. Bolded numbers are the best among the methods.

| Box | Subject 1 | Subject 2 | Subject 3 | Average |
|---|---|---|---|---|
| 1 | 23.8 | 33.7 | 32.0 | 29.8 |
| 2 | **23.7** | **33.0** | 32.0 | **29.6** |
| 3 | 23.9 | 33.2 | 31.7 | **29.6** |
| 4 | 23.9 | 33.4 | **31.6** | **29.6** |

Table 3: MPJPE of the action Boxing for HumanEva (mm). (1: Pavllo et al. [5]; 2: One vector, Clustered; 3: Many vectors, Clustered; 4: Many vectors, Boxed Man). Our one vector method coupled with the simple clustered heuristic ties with both many vector variants for the best average MPJPE across subjects. Bolded numbers are the best among the methods.

| Action | Subject 1 | Subject 2 | Subject 3 | Average |
|---|---|---|---|---|
| Walk | 11.7 | 10.1 | 22.8 | 14.9 |
| Jog | 18.7 | 11.4 | 11.0 | 13.7 |

Table 4: MPJPE for HumanEva by Cheng et al. [1] (mm). They achieve strong performance across the board, most likely because they extend their method end-to-end and use heatmaps to be occlusion-aware. We do not have the compute power that they do to add heatmaps to our model.

| Method | Average |
|---|---|
| Pavllo et al. [5] | 23.11 |
| One vector, Clustered | 23.06 |
| Many vectors, Clustered | 23.06 |
| Many vectors, Boxed Man | 23.01 |

Table 5: Average MPJPE for HumanEva over all subjects and actions considered (mm). Our many vector variant with the boxed man model occlusion keypoints seems to work the best, beating the baseline and our other methods.

From these results, our variant of the temporal convolutional model that compares a sequence of occluded vectors to the sequence of ground truth vectors, coupled with our Boxed Man Model, works the best, achieving an average MPJPE of 23.01 over all subjects and actions. Our other variant that only uses the single frame occluded vector in the loss also performed better than our initial baseline, showing that the Clustered occlusion heuristic also worked well to prevent occluded joints from being wildly predicted.
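As a quick arithmetic check, the overall averages in Table 5 can be reproduced by averaging the nine per-subject values that each method reports in Tables 1 through 3. The dictionary layout below is just one convenient way to arrange those numbers.

```python
# Per-subject MPJPE (mm) from Tables 1-3, ordered Walk, Jog, Box,
# with Subjects 1, 2, 3 within each action.
per_subject = {
    "Pavllo et al. [5]":       [13.9, 10.2, 46.6, 20.9, 13.1, 13.8, 23.8, 33.7, 32.0],
    "One vector, Clustered":   [14.4, 10.2, 46.8, 20.7, 13.0, 13.7, 23.7, 33.0, 32.0],
    "Many vectors, Clustered": [14.1, 10.0, 46.7, 21.0, 13.1, 13.8, 23.9, 33.2, 31.7],
    "Many vectors, Boxed Man": [13.8, 10.1, 46.5, 21.1, 13.0, 13.7, 23.9, 33.4, 31.6],
}

# Average over all nine subject-action pairs, rounded as in Table 5.
averages = {
    method: round(sum(values) / len(values), 2)
    for method, values in per_subject.items()
}

for method, value in averages.items():
    print(f"{method}: {value:.2f} mm")
```

The gap between the best variant and the baseline is 0.10 mm, so the ranking rests on small differences that are dominated by the large Subject 3 error on Walking.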

5.4 Baseline Results on Human3.6M

After one epoch and at convergence, respectively, the linear baseline achieved the following results on 3 of the 15 tasks, and on average:

| Training stage | Directions | Photo | SittingDown | Average |
|---|---|---|---|---|
| After one epoch | 60.86 | 95.53 | 117.99 | 77.24 |
| At convergence | 39.5 | 56 | 69.4 | 47.7 |

Table 6: MPJPE of the linear baseline (mm).

After convergence, as reported in [1], the occlusion-aware network achieved the following results:

| Directions | Photo | SittingDown | Average |
|---|---|---|---|
| 38.8 | 51.9 | 58.4 | 42.9 |

Table 7: MPJPE of the occlusion-aware model (mm).

5.5 Results on Human3.6M

We initially tried inputting 32 by 32 predicted and ground truth heatmaps into the TCN to combat occlusion, similar to [1]. However, because of our lack of computational power, we were only able to train on a subset of the data for a few epochs. We ended up with a test error of 174.66 mm. We also tried changing from occluded heatmaps to occluded vectors, and we ended up with a test error of 50.36 mm. Our results on Human3.6M are limited by our low computational power and the sheer size of the dataset.

5.6 Discussion

5.6.1 Baseline

For our baseline results, we selected the three tasks Directions, Photo, and SittingDown, because we believe they best exhibit the differences in the models’ performance. The difference in performance on the Photo action is around the mean of the differences in performance with respect to all actions. While the difference in MPJPE between the linear baseline and the occlusion-aware network is small for the Directions task, it is significantly larger for the SittingDown action. We believe this is due to the occluded nature of the action, as well as the significance of temporal information. Sitting down involves the knee joints occluding the hip joint at the end of the action from a ventral camera orientation. Given that the linear model is trained over single images, it is unable to learn the hip joint’s trajectory over the course of multiple frames. On the other hand, the occlusion-aware network is able to use both its heuristic of whether or not the hip joint is occluded and the hip joint’s trajectory across a sequence of video frames to accurately predict where the hip joint should be located.

5.6.2 Boxed Man Model

Naturally, due to the boxed man model’s anatomically derived occlusion method, it performed better than the baseline Gaussian model. We believe that another source of increased performance for the boxed man model is its stronger tendency, compared to the baseline, to mark a given joint as occluded. Occlusion serves as a form of dropout or regularization of the model. Specifically, we believe that feeding the network information about whether or not a joint is occluded eventually teaches the model to rely less on joints that are marked as occluded. Given that different actions cause inherently different occlusion patterns, the model will be less inclined to focus on a few joints during training. The boxed man model’s lax requirements on occlusion allow for a wider variation of non-occluded keypoint permutations.

6 Conclusion

Most of our work focuses on adapting a temporal convolutional network to predict occluded human 3D poses from 2D ground truth heatmaps, and indeed, the average mean-per-joint-position error across our three network variants was comparable to, but still lower than, that of the baseline architecture in [5]. However, since this only constitutes the second half of the pipeline from video to estimated 3D human pose, work remains to be done in improving occluded heatmap generation in the first place.

For example, training a stacked hourglass network using our clustered ground truth occlusions and boxed man models would make our task more consistent end-to-end. We initially began experimenting with such methodologies and re-configured an existing stacked hourglass implementation [4] (originally configured to work with the MPII human pose dataset) to work with the HumanEva-I dataset. However, due to limited computing resources, we decided to focus on training the temporal convolutional network.

Another similar idea involves adding data augmentation by manually blocking out non-occluded joints. This could simply take the form of dropping out random joints during training of the stacked hourglass network, or since significant pre-processing is already being performed on the videos, it could perhaps even involve editing the frames themselves.


  • [1] Y. Cheng, B. Yang, B. Wang, Y. Wending, and R. Tan. Occlusion-aware networks for 3d human pose estimation in video. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 723–732, 2019.
  • [2] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu. Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1325–1339, Jul. 2014.
  • [3] J. Martinez, R. Hossain, J. Romero, and J. J. Little. A simple yet effective baseline for 3d human pose estimation. In ICCV, 2017.
  • [4] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In B. Leibe, J. Matas, M. Welling, and N. Sebe, editors, Computer Vision - 14th European Conference, ECCV 2016, Proceedings, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pages 483–499, Germany, Jan. 2016. Springer Verlag.
  • [5] D. Pavllo, C. Feichtenhofer, D. Grangier, and M. Auli. 3d human pose estimation in video with temporal convolutions and semi-supervised training. In Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [6] L. Sigal, A. Balan, and M. J. Black. HumanEva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. International Journal of Computer Vision, 87(1):4–27, Mar. 2010.
  • [7] A. Toshev and C. Szegedy. Deeppose: Human pose estimation via deep neural networks. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 1653–1660, 2014.
  • [8] J. Walker, K. Marino, A. Gupta, and M. Hebert. The pose knows: Video forecasting by generating pose futures. CoRR, abs/1705.00053, 2017.