Explicit Pose Deformation Learning for Tracking Human Poses

11/17/2018 ∙ by Xiao Sun, et al. ∙ Microsoft

We present a method for human pose tracking that learns explicitly about the dynamic effects of human motion on joint appearance. In contrast to previous techniques which employ generic tools such as dense optical flow or spatio-temporal smoothness constraints to pass pose inference cues between frames, our system instead learns to predict joint displacements from the previous frame to the current frame based on the possibly changing appearance of relevant pixels surrounding the corresponding joints in the previous frame. This explicit learning of pose deformations is formulated by incorporating concepts from human pose estimation into an optical flow-like framework. With this approach, state-of-the-art performance is achieved on standard benchmarks for various pose tracking tasks including 3D body pose tracking in RGB video, 3D hand pose tracking in depth sequences, and 3D hand gesture tracking in RGB video.




1 Introduction

Human pose tracking is an essential component of many video understanding tasks such as visual surveillance, action recognition, and human-computer interaction. Compared to human pose estimation in single images, pose tracking in videos has proven to be a more challenging problem, due in part to issues such as motion blur and uncommon poses with few or no examples in training sets. To help alleviate these problems, temporal information can be utilized as an additional cue for locating body joints.

A variety of approaches have been presented for exploiting temporal information in tracking human pose. Among them are techniques that use dense optical flow to propagate joint estimates from previous frames [45, 60, 65], model fitting with the predicted pose of the preceding frame as initialization [20, 11, 51, 61, 64, 73, 55, 56, 9, 5, 7, 75], and spatio-temporal smoothness priors as a pose estimation constraint [89, 53].

While these methods take advantage of pose similarity from frame to frame, we observe that they do not learn explicitly about the effect of human motions on the appearance and displacement of joints. Rather, the motions are predicted through generic machinery like optical flow, local optimization of an initialized model, and smoothness constraints. These methods may additionally utilize single-image pose estimation techniques trained on static human images, but this domain-specific training does not extend along the temporal dimension.

Figure 1: (Best viewed in color) Comparison between optical flow based pose deformation (FlowPD) and our explicitly learned pose deformation (ExplicitPD). From frame t-d to frame t, the appearance of the Right Wrist joint is significantly changed due to different shading/shadows and some motion blur. Optical flow gets misled and predicts the Right Wrist joint at the wrong place (close to Right Elbow) while ExplicitPD accurately tracks this joint.

In this paper, we propose an approach to human pose tracking that explicitly learns about human pose deformations from frame to frame. Our system design reconfigures the optical flow framework to introduce elements from human pose estimation. Instead of densely predicting the motion vectors of general scene points as done for optical flow, our network is trained specifically to infer the displacements only of human joints, while also accounting for possible appearance changes due to motion. Figure 1 displays a qualitative comparison of the two approaches. Inspired by heat maps in pose estimation, we leverage the other pixels to support the inference of each joint displacement. With training on labeled pose tracking datasets, our neural network learns to make use of information from both joint and non-joint pixels as well as appearance change caused by joint motion to predict joint displacements from one frame to the next. The joint estimates from this temporal inference are complementary to static pose estimates on individual frames. Using them together, our human pose tracking system learns explicitly about human poses in terms of both static and dynamic appearance.

The presented technique comes with the following practical benefits:

  • Generality: This approach can be employed for various pose tracking applications (body/hand, 2D/3D, with/without an object) and with different input modalities (RGB/depth). In our experiments, the system is demonstrated on 3D human body pose tracking in RGB video, 3D hand gesture tracking from depth sequences, and 3D hand pose tracking in RGB video.

  • Flexibility: The temporal tracking is complementary to single-frame pose estimation, and it can be utilized together with approaches such as local model fitting. Its speed-accuracy tradeoff is also controllable.

  • High performance: State-of-the-art results are obtained on standard benchmarks for the aforementioned pose tracking applications shown in our experiments.

The code and model will be released online upon publication of this work.

Figure 2: (Best viewed in color) Overview of the explicit pose deformation learning framework.

2 Related Work

In this section, we briefly review related approaches for human pose tracking.

Optical Flow Based Pose Deformation

A family of deep learning methods aids prediction in the current frame with information from its neighbors using dense optical flow [45, 60, 65]. Jain et al. [45] concatenate dense optical flow with RGB features to enhance the performance of pose estimation. Pfister et al. [60] use dense optical flow to warp the heat maps of previous frames to the current frame. The aligned heat maps are then combined by a learned convolutional layer to produce a pooled heat map. To better exploit the warped heat maps, Song et al. [65] present a spatio-temporal inference layer which performs message passing along spatial and temporal graph edges to reduce joint position uncertainty.

For these methods, an issue exists with their use of dense optical flow, which can be unreliable for non-rigid body regions that may undergo appearance change during motion. We propose a method trained specifically to predict the displacement of human joints directly, with the support of information from relevant local areas. This task-specific approach provides greater reliability than generic flow, while also not needing an additional post-processing step for heatmap warping or a sub-network to learn how to fuse multiple results. In addition, this approach is directly applicable to both 2D and 3D pose, while optical flow provides only 2D cues. An empirical comparison to an optical flow based technique is presented in Section 5.2.

Recurrent Neural Networks

Recurrent Neural Networks (RNNs), specifically with Long Short-Term Memory (LSTM) [36], have proven effective at exploiting temporal correlations in many tasks including language modeling [72], machine translation [8, 18] and video description [22]. Lin et al. [47] first introduce LSTM to 3D pose tracking in video. They propose a multi-stage sequential refinement framework to predict 3D pose sequences using previously predicted 3D poses. Following this paradigm, Coskun et al. [19] use Kalman filters for better pose regularization. More recently, Hossain et al. [37] exploit recent progress in sequence-to-sequence networks [72] composed of layer-normalized [6] LSTM units with recurrent dropout [85]. Luo et al. [48] embed LSTM units into the multi-stage Convolutional Pose Machine (CPM) [81] framework, yielding LSTM Pose Machines that predict a sequence of poses from a sequence of images.

Alternatively, some works fully connect the input and output pose sequences directly using a fully connected [20, 24] or convolutional [32] network, which draws global dependencies between the input and output. Recently, Vaswani et al. [77] show that a simple self-attention mechanism eschewing recurrence can also model global input-output dependencies with better efficacy in machine translation models. The attention mechanism is efficient because it focuses on information relevant to what is currently being processed. Inspired by this, we propose a pixel weighting mechanism with which the pose deformation target depends only on relevant input pixels. But rather than learning to find the attention areas as in [77], our approach directly identifies them based on distance from the previously detected joints, as explained in Section 3.2. In the experiments of Section 5.2, we show that this approach compares favorably to RNN-based techniques.

Model Fitting A classic pose tracking paradigm is to locally optimize a model-based pose representation initialized from the previous frame [20, 11, 51, 61, 64, 73, 56, 9, 7, 75]. Elaborate model parameterization and objective function design are crucial for good performance. In these works, local optimization is usually slow due to the complexity of models and objective functions [11, 55, 56, 9, 73]. Moreover, the optimization is purely local and cannot recover from tracking failure arising from the well-known drifting problem. While re-initialization strategies [5, 7, 61, 64] or stochastic optimization methods [61, 55, 56] have been studied to alleviate drifting, the problem cannot fundamentally be resolved within the local optimization framework.

By contrast, we propose a completely data-driven approach that provides a globally optimal and model-free estimate of pose deformation from frame to frame for tracking. It is also complementary to these local model fitting techniques, which could potentially be used for refinement.

Tracking by Detection Some works follow the tracking by detection paradigm, which performs detection in each frame, estimates the pose of each detection result with an off-the-shelf technique, then either enforces spatio-temporal smoothness or computes joint associations for multi-person tracking.

Spatio-Temporal Smoothness. Zhou et al. [89] propose a synthesis of a deep-learning based 2D part detector, a sparsity-driven 3D reconstruction approach and a 3D temporal smoothness prior for 3D human pose estimation from RGB image sequences. Mueller et al. [53] add temporal smoothness to their spatial-temporal optimization to penalize deviations from constant pose velocity. Instead of employing generic smoothness priors, we seek a more targeted approach by specifically learning about human pose change.

Joint Association for Multi-Person Tracking. Multi-person pose tracking in videos aims to estimate the pose of all persons appearing in the video and assign a unique identity to each person [42, 31, 82, 83, 26, 39, 21]. These approaches differ from one another in their choice of metrics and features used for measuring similarity between a pair of poses and in their algorithms for joint identity assignment. Recent methods utilize metrics based on bounding box [31] and keypoint [82] similarity between a pair of poses, as well as on similarity between bounding-box image features [31] and optical flow information [39, 42, 83]. Recently, Doering et al. [21] learn a model that predicts Temporal Flow Fields which indicate the direction in which each body joint is going to move from two consecutive frames. Similar to our approach, they propose a task-specific representation instead of task-agnostic similarity metrics, but for the problem of joint association.

Note that in these multi-person techniques, the concept of ‘tracking’ is to associate joints of the same person together over time, using joints localized independently in each frame. By contrast, our work aims to improve joint localization by utilizing information from other frames.

method Baseline JointOne POI POIClose RandOne RandHalf Full None
D-MPJPE(mm) 19.0 14.8 17.2 14.6 24.2 18.2 28.2 39.2
Table 1: Comparison over different pixel sampling strategies.

3 Explicit Pose Deformation Learning

We define pose deformation as the displacement of a person’s body or hand joints from one time instance to the next. Let the pose be denoted as $J$, a vector of dimension $3K$ for the 3D pose of $K$ joints. Given two successive frames $I_{t-1}$ and $I_t$ from a video sequence, our goal is to estimate the pose deformation $\Delta_t$ of a person from time $t-1$ to $t$:

$$\Delta_t = J_t - J_{t-1} \qquad (1)$$


A possible way to predict the pose deformation is by using a CNN with a fully connected layer. The two frames are passed through convolutional layers to generate convolutional features. Following common practice [14, 88, 69, 87, 71], the features undergo spatial downsampling by average pooling. They are then given to a fully connected layer which outputs the joint displacement vector, with one value per joint coordinate. We refer to this holistic fully connected approach as the Baseline.

In this work, we instead propose a novel per-pixel pose deformation that is inspired by the two most relevant tasks, pose and optical flow estimation. In the following, we describe our representation for pose deformation and how it is used for explicit learning of human pose changes.

3.1 Pose Deformation Representation

Solving for pose deformation is related to estimating pose and optical flow, which are both commonly formulated as per-pixel prediction problems. The predominant approach for pose estimation involves classifying each pixel into a joint class and generating a likelihood heat map for each joint [3]. For optical flow, the motion of each pixel is predicted to form a dense flow map [23, 38, 68].

In tailoring our method to pose deformation, we alter the conventional optical flow scheme to incorporate concepts from pose estimation. This is done by following two main principles. The first is that rather than computing the motion of each pixel, the focus should be on predicting the motion of pixels with specific semantics, namely joints. Similar to pose estimation, we use multiple pixels to support the estimation of each joint motion. This will be represented by a Joint Displacement Map. The second principle is to consider the varying relevance of pixels in estimating the deformation of a joint, much like the use of heat maps for estimating joint positions or the relative emphasis given to different image regions in attention-based vision techniques. The inclusion of non-relevant pixels and the over-emphasis of weakly relevant pixels would not only be unhelpful but could moreover be detrimental due to the noise they would introduce to the estimation. The relative relevance of pixels for estimating a joint’s motion is represented by a Pixel Weighting Map.

Joint Displacement Map. The body movement that leads to the displacement of a joint generally impacts many pixels in a video frame. Correspondingly, many pixels can provide information about the displacement of a joint. Therefore, our method generates a separate displacement map for each joint, where every pixel in a map predicts the displacement of the joint, as illustrated in Figure 2. For 3D pose, a joint displacement map is a 3D vector field for the x, y, z dimensions. For compactness, the joint index will henceforth be omitted, as all joints are processed separately in the same way.

Pixel Weighting Map. However, not every pixel in a joint displacement map is useful for estimating the motion of the joint. For example, pixels that lie far from the joint’s vicinity or within smooth and textureless areas are generally less informative for this prediction. To represent this information, a pixel weighting map is generated for each joint displacement map, as shown in Figure 2. Each pixel in the pixel weighting map contains a weight value that indicates the relevance of the pixel for predicting the displacement of this joint.

3.2 Pixel Weighting Mechanism

Various methods exist for determining the pixel weights for a given joint. We experimentally examined several possibilities, some of which are based on the intuitive notion that pixels closer to the previous joint prediction or that have strong image features like corners, edges and ridges are generally more informative for estimating joint deformation.

Table 1 presents a comparison of different pixel sampling strategies. JointOne takes only the pixel at the ground truth joint location of the previous frame; RandOne randomly samples one pixel; RandHalf randomly samples half of the pixels, and Full uses all pixels. POI denotes sampling only points of interest. For this, we use a standard implementation of speeded up robust features (SURF) [10] from OpenCV [12]. POIClose takes only the POI samples that lie close to the target joint (distance of less than 1/32 of the input image size). The evaluation metric is the mean per joint position error in millimeters, where lower is better (see Section 5.1 for details). None means that no pixel is sampled but the ground truth pose of the previous frame is directly used as the current frame’s pose estimate. Each of these strategies is implemented in a fully convolutional network with the same backbone network as Baseline but with a deconvolutional head network (see Section 3.3 for details) instead of a fully connected layer. Connections from the input image are provided only according to the sampling strategy.

Two conclusions can be drawn from Table 1:

  • Sparse connections are superior to holistic mapping. It can be seen from the lower performance of Baseline and Full compared to JointOne and POI that processing all the pixels is not as effective as attending to some subset of the pixels that have relevance to the pose deformation.

  • Pixel selection matters. Randomly choosing the sparse connections (RandOne, RandHalf) yields lower performance than more purposeful selection (JointOne, POI).

These observations motivate us to propose a general pixel weighting formulation (Eq. 2), in which a decay function determines the drop in weight for pixels farther away from the most informative ones. The formulation takes two distance maps as input: the joint location distance map and the points of interest distance map. Each pixel value in a distance map represents the distance from this pixel to the joint location or to the closest point of interest, respectively. A balance parameter weighs the two terms against each other, and a decay rate parameter controls how quickly the weight falls off. An exponential decay function is one example of a decay function that can be used in Eq. 2.


Several decay functions are empirically investigated in Section 5.

We have observed that using POIClose is more effective than JointOne. However, since the improvement is not significant and generating POI requires extra computation, we henceforth weight pixels based only on their joint distance, omitting the points-of-interest term, in the remaining experiments.
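
To make the joint-distance weighting concrete, the following is a minimal sketch of how a pixel weighting map could be built from the distance to the previous joint location. The exact decay function definitions are given in the supplementary materials and are not reproduced here, so the four forms below (`binary`, `linear`, `gaussian`, `exponential`), the function names, and the `sigma` parameter are illustrative assumptions rather than the authors' definitions.

```python
import math


def decay_weight(distance, sigma, kind="exponential"):
    """Map a pixel-to-joint distance to a weight in [0, 1].

    All four forms are assumed for illustration; sigma is a hypothetical
    decay rate parameter in the same units as the distance (pixels).
    """
    if kind == "binary":
        return 1.0 if distance <= sigma else 0.0
    if kind == "linear":
        return max(0.0, 1.0 - distance / sigma)
    if kind == "gaussian":
        return math.exp(-(distance ** 2) / (2.0 * sigma ** 2))
    if kind == "exponential":
        return math.exp(-distance / sigma)
    raise ValueError("unknown decay function: " + kind)


def pixel_weight_map(height, width, joint_xy, sigma, kind="exponential"):
    """Build a height x width weight map indexed as weights[y][x].

    joint_xy is the (x, y) joint location in the previous frame, expressed
    in the coordinates of the joint displacement map.
    """
    jx, jy = joint_xy
    return [
        [decay_weight(math.hypot(x - jx, y - jy), sigma, kind)
         for x in range(width)]
        for y in range(height)
    ]
```

Under these assumed definitions, a binary decay with a very small `sigma` keeps only the joint pixel itself, which corresponds to the JointOne strategy, while larger `sigma` values spread the weight over a wider neighborhood.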

3.3 Network Architecture

Our network architecture for pose deformation learning is based mainly on FlowNet [23, 38]. As illustrated in Figure 2, we simply stack two consecutive input images together and feed them into a fully convolutional network, as done in the ’FlowNetSimple’ architecture [23]. Then, a head network consisting of consecutive deconvolution layers is used to upsample the feature map to the targeted resolution as done in [71]. At the end, a 1x1 convolutional layer is used to produce the final joint displacement map.

Training Loss We directly minimize the absolute difference (L1 norm) between the predicted and ground truth joint displacement maps, weighted by the pixel weighting map and accumulated over the domain of the joint displacement map.
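
The weighted L1 loss described above can be sketched for a single joint as follows. This is an illustration under two assumptions that the text does not state explicitly: every pixel of the ground truth map carries the joint's true displacement, and the weighted errors are summed without normalization. The function and argument names are hypothetical.

```python
def weighted_l1_loss(pred_map, gt_displacement, weight_map):
    """Weighted L1 loss for one joint displacement map.

    pred_map[y][x]   : predicted (dx, dy, dz) displacement at pixel (x, y)
    gt_displacement  : true (dx, dy, dz) displacement of the joint, assumed
                       to be the target at every pixel
    weight_map[y][x] : relevance weight of pixel (x, y)
    """
    total = 0.0
    for pred_row, weight_row in zip(pred_map, weight_map):
        for pred, weight in zip(pred_row, weight_row):
            l1 = sum(abs(p - g) for p, g in zip(pred, gt_displacement))
            total += weight * l1
    return total
```

Pixels with zero weight contribute nothing to the loss, so the network is free to output arbitrary values there.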

Inference In the testing phase, the final prediction is the average of all predictions in a joint displacement map, weighted by the pixel weighting map.
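
The weighted average used at test time can be sketched as below for one joint. The data layout (nested lists indexed as `map[y][x]`) and the function name are assumptions made for illustration.

```python
def infer_displacement(pred_map, weight_map):
    """Weighted average of per-pixel displacement predictions for one joint.

    pred_map[y][x] is a displacement tuple and weight_map[y][x] its weight.
    """
    dims = len(pred_map[0][0])
    accum = [0.0] * dims
    weight_sum = 0.0
    for pred_row, weight_row in zip(pred_map, weight_map):
        for pred, weight in zip(pred_row, weight_row):
            weight_sum += weight
            for i in range(dims):
                accum[i] += weight * pred[i]
    if weight_sum == 0.0:
        raise ValueError("pixel weighting map has no non-zero weight")
    return tuple(value / weight_sum for value in accum)
```

For example, two pixels predicting 2.0 and 4.0 along x with weights 1.0 and 3.0 give a final x displacement of 3.5.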


Discussion. Several architectural variants of FlowNet are presented in the literature, including those that combine deep and shallow feature maps [23], stack input images at a later stage using a correlation layer [23], employ a multi-stage stacked architecture [38], compute small displacement refinement [38], and use a pyramid, warping and cost volume [68]. We use a simple architecture design to better highlight the effects of concatenated consecutive image input and the proposed pose deformation representation. These elements could be incorporated into other FlowNet variants as well, which we leave for future work.

4 Pose Deformation for Tracking

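Given a single-frame pose prediction for every frame and a pose deformation prediction between each pair of successive frames, the pose tracking task can be formulated as a simple linear optimization problem. Over all frames of a video sequence, we minimize an error function (Eq. 7) to obtain the final tracking result. The error function sums two terms: the deviation of the tracked poses from the single-frame predictions, and the deviation of their frame-to-frame change from the predicted pose deformations. A parameter balances the single-frame and pose deformation terms.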

In Eq. 7, we can see that the final pose prediction is determined from two orthogonal observations. The first component of this detection model attempts to answer “Does this local image area look like the correct type of joint from its appearance?” On the other hand, the second component attempts to answer “Is this how the joint location should change by looking at the transition from the previous to the current frame?” The final pose estimation is acquired through an optimization from both perspectives.

This approach can be applied not only to two consecutive frames, but also to frames separated by different durations. Accounting for additional frames can potentially enhance the pose tracking. For this purpose, we define a set of durations, and the error function of Eq. 7 is extended so that it includes one pose deformation term for each duration in the set (Eq. 8).

Minimizing this error function with respect to the tracked poses is a linear least squares problem whose constraints are sufficient to solve for the unknowns. The duration set can be arbitrary. To validate the effectiveness of Eq. 8, we test four variants:

  • One-step forward tracking, the most common case, which uses the pose deformation from the previous frame to the current frame.

  • One-step forward and backward tracking, which additionally considers reverse pose deformation from the subsequent frame to the current frame.

  • Multi-step forward tracking, which accounts for pose deformation from multiple past frames.

  • Multi-step forward and backward tracking, which utilizes pose deformation from multiple past and future frames.
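
The least squares problem behind this tracking step can be sketched for a single pose coordinate, since a squared-error objective separates across coordinates. The sketch assumes the objective is the sum of squared deviations from the single-frame estimates plus `lam` times the sum of squared deviations between tracked frame-to-frame changes and predicted deformations. The placement of the balance parameter, the dictionary layout of `deform`, and all names are assumptions made for illustration.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def track_coordinate(single, deform, lam=1.0):
    """Fuse single-frame estimates with predicted deformations.

    single : list of single-frame estimates, one per frame
    deform : dict mapping (t, d) to the predicted displacement from frame
             t - d to frame t; a negative d denotes a backward deformation
    lam    : assumed weight on the pose deformation terms
    """
    n = len(single)
    A = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    # Single-frame terms: (x_t - single_t)^2
    for t, value in enumerate(single):
        A[t][t] += 1.0
        b[t] += value
    # Deformation terms: lam * (x_t - x_u - delta)^2 with u = t - d
    for (t, d), delta in deform.items():
        u = t - d
        if d == 0 or not (0 <= t < n and 0 <= u < n):
            continue  # skip pairs that fall outside the sequence
        A[t][t] += lam
        A[u][u] += lam
        A[t][u] -= lam
        A[u][t] -= lam
        b[t] += lam * delta
        b[u] -= lam * delta
    return solve_linear(A, b)
```

With this layout, one-step forward tracking uses keys of the form `(t, 1)`, and the backward or multi-step variants only add more keys, such as `(t, -1)` or `(t, 2)`, without changing the solver.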

5 Experiments

Training Details ResNet [35] (ResNet-50 by default) is adopted as the backbone network as in [71]. The head network for the joint displacement map is fully convolutional. It first uses deconvolution layers with stride 2 to upsample the feature map to the required resolution. The number of output channels is fixed to 256 as in [34, 71]. Then, a 1x1 convolutional layer is used to produce the joint displacement maps. PyTorch [58] is used for implementation. Adam is employed for optimization. The input image is normalized to a fixed size by default. Data augmentation includes random translation, scale, rotation and flip. In all experiments, the base learning rate is 1e-5. It drops to 1e-6 when the loss on the validation set saturates. Four GPUs are utilized. The mini-batch size is 128, and batch normalization [40] is used. Other training details are provided with the individual experiments.

5.1 Datasets and Evaluation Metrics

Our approach can be applied to various pose tracking scenarios, including human body or hand (w/ or w/o an object), RGB or depth input, and 2D or 3D output. We evaluate our approach extensively on four challenging datasets that cover all of these factors to show its generality. Specifically, 3D human pose dataset Human3.6M [41] of RGB images, 3D hand pose dataset MSRA Hand 2015 [70] of depth images, and 3D hand pose datasets Dexter+Object [67] and Stereo Hand Pose [86] of RGB-D images are used.

Human3.6M [41] (HM36) is the largest 3D human pose benchmark of RGB images. Accurate 3D human joint locations are obtained from motion capture devices. It consists of 3.6 million video frames. 11 subjects are captured from 4 camera viewpoints, performing 15 activities.

For this benchmark, we employ the most widely used evaluation protocol in the literature [15, 76, 52, 90, 44, 50, 59, 84, 62, 11, 89, 74, 88, 71, 63]. Five subjects (S1, S5, S6, S7, S8) are used in training and two subjects (S9, S11) are used in evaluation. For evaluation, most previous works use the mean per joint position error (MPJPE) in millimeters (mm). Since a pose estimation metric such as this is unsuitable for directly evaluating pose deformation performance, we instead introduce a new metric called D-MPJPE, which adds the estimated pose deformation to the ground truth pose of the previous frame to obtain the current frame pose estimate and then evaluates its MPJPE in mm.
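
The D-MPJPE computation described above can be sketched as follows. Poses are represented here as lists of `(x, y, z)` tuples in millimeters; this layout and the function names are assumptions made for illustration.

```python
import math


def mpjpe(pred_pose, gt_pose):
    """Mean per joint position error: average Euclidean distance per joint."""
    errors = [
        math.sqrt(sum((p - g) ** 2 for p, g in zip(pred_joint, gt_joint)))
        for pred_joint, gt_joint in zip(pred_pose, gt_pose)
    ]
    return sum(errors) / len(errors)


def d_mpjpe(prev_gt_pose, pred_deformation, cur_gt_pose):
    """Add the predicted deformation to the previous ground truth pose,
    then score the resulting current-frame estimate with MPJPE."""
    estimate = [
        tuple(coord + delta for coord, delta in zip(joint, joint_delta))
        for joint, joint_delta in zip(prev_gt_pose, pred_deformation)
    ]
    return mpjpe(estimate, cur_gt_pose)
```

Passing an all-zero deformation reuses the previous ground truth pose as the current estimate, which is the setting reported as None in Table 1.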

MSRA Hand 2015 [70] (MSRA15) is a standard benchmark for 3D hand pose estimation from depth. It consists of 76.5k video frames. In total, the right hands of nine subjects are captured using Intel’s Creative Interactive Camera.

For evaluation, we use two common accuracy metrics in the literature [70, 80, 16, 29, 33, 54, 78, 79]. The first one is mean 3D joint error. The second one is the percentage of correct frames. A frame is considered correct if its maximum joint error is less than a small threshold.
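
Both hand pose metrics can be sketched in a few lines. The data layout (a list of frames, each a list of `(x, y, z)` joint positions in millimeters) and the function names are assumptions, and the threshold is left as a parameter because the text does not fix a single value.

```python
import math


def joint_errors(pred_pose, gt_pose):
    """Euclidean error of every joint in one frame."""
    return [
        math.sqrt(sum((p - g) ** 2 for p, g in zip(pred_joint, gt_joint)))
        for pred_joint, gt_joint in zip(pred_pose, gt_pose)
    ]


def mean_joint_error(pred_frames, gt_frames):
    """Mean 3D joint error over all joints of all frames."""
    errors = []
    for pred_pose, gt_pose in zip(pred_frames, gt_frames):
        errors.extend(joint_errors(pred_pose, gt_pose))
    return sum(errors) / len(errors)


def fraction_correct_frames(pred_frames, gt_frames, threshold):
    """Fraction of frames whose maximum joint error is below the threshold."""
    correct = sum(
        1
        for pred_pose, gt_pose in zip(pred_frames, gt_frames)
        if max(joint_errors(pred_pose, gt_pose)) < threshold
    )
    return correct / len(pred_frames)
```

The second metric is stricter than the first: a single badly localized joint makes the whole frame count as incorrect, even when the mean error of that frame is small.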

Dexter+Object [67] (D+O) and Stereo Hand Pose [86] (SHP) are two standard 3D hand pose datasets composed of RGB-D images. D+O provides six test video sequences recorded using a static camera with a single hand interacting with an object. SHP provides 3D annotations of 21 keypoints for 18k stereo frame pairs, recording a person performing various gestures in front of six different backgrounds. Note that only the RGB images are used in our experiments.

We follow the evaluation protocol and implementation provided by [91] that uses 15k samples from the SHP dataset for training (SHP-train) and the remaining 3k samples for evaluation (SHP-eval). The model trained using SHP-train is also evaluated on D+O. For evaluation metrics, we use average End-Point-Error (EPE) and the Area Under Curve (AUC) on the Percentage of Correct Keypoints (PCK) as done in [43, 53, 91, 4].

2.5FPS JointOne POIClose Binary Gaussian Linear Exponential Easy(62.5%) Middle(17.8%) Hard(19.7%) None
FlowPD 26.8 27.4 30.2 27.1 28.9 27.6 9.52 29.7 79.0 39.2
ExplicitPD 39.2
8FPS JointOne POIClose Binary Gaussian Linear Exponential Easy(88.9%) Middle(8.49%) Hard(2.65%) None
FlowPD 9.08 9.19 10.1 9.11 9.70 9.29 6.19 25.4 53.8 12.7
ExplicitPD 12.7
25FPS JointOne POIClose Binary Gaussian Linear Exponential Easy(99.1%) Middle(0.85%) Hard(0.02%) None
FlowPD 3.44 3.46 3.68 3.42 3.58 3.47 3.25 24.5 48.3 4.32
ExplicitPD 4.32
Table 2: Comparison to optical flow for different frame rates and pixel sampling/weighting strategies. Easy, Middle and Hard denote joints with small, intermediate and large displacement, respectively.

5.2 Experiments on Human3.6M

For single-frame pose estimation, previous works typically downsample the video sequences for efficiency. For pose tracking in videos, we use three downsampling rates, namely no downsampling (25FPS), three-step downsampling (8FPS) and ten-step downsampling (2.5FPS). The lower the FPS, the larger the pose deformation between two consecutive frames.

Effect of decay function We investigate four forms of decay function in Eq. 2. They are Binary, Gaussian, Linear and Exponential. (See supplementary materials for detailed definitions and performance evaluation.) We observe that JointOne is not the best, and better performance is obtained when the decay rate parameter is carefully selected for each decay function form. The performance differences between different decay functions are minor, but an ensemble of these functions yields considerable improvement. We thus use this Ensemble model as our final pose deformation estimator for tracking.

Comparison to optical flow Table 2 presents a comprehensive comparison between our explicitly learned pose deformation (ExplicitPD) and optical flow based pose deformation (FlowPD). We use the state-of-the-art FlowNet2.0 [38] and its Caffe implementation [1] to compute optical flow between two consecutive frames, and then use the same pixel sampling/weighting scheme as in our method to produce FlowPD. Since pose deformation in the depth dimension cannot be determined from optical flow, we directly use the corresponding deformation in depth from ExplicitPD for FlowPD. The D-MPJPE metric is used for evaluation. Two conclusions can be drawn from the results. First, ExplicitPD clearly outperforms FlowPD under all frame rates using any sampling/weighting strategy, where the relative performance gain is shown as subscripts. Second, ExplicitPD is especially superior to FlowPD under large pose deformations. This can be seen from the larger relative improvement under lower frame rates (ExplicitPD outperforms FlowPD by 44.8% at 2.5FPS, but by 24.7% at 25FPS using JointOne sampling) and more challenging joints with large displacement (ExplicitPD outperforms FlowPD by 51.9% in Hard level, but by 31.4% in Easy level using JointOne sampling at 2.5FPS).

D-MPJPE(mm) map128 map64 map32 map16 map8
image256 16.5 14.8 13.9 14.3 16.2
image128 16.8 16.3 15.3 15.8 17.3
Table 3: Effect of image and joint displacement map resolution.

Effect of resolution Table 3 compares results for two image sizes and five output joint displacement map sizes. The D-MPJPE metric, 2.5FPS frame rate and JointOne sampling strategy are used. Not surprisingly, a larger image size leads to better accuracy in all cases. However, interestingly, neither a very large nor a very small joint displacement map size gives the best performance. This illustrates that the connections between input and output should be neither too sparse nor too dense.

baselines [71] single
HM36S1(8FPS) 69.1 66.5 65.9 64.3
HM36S2(8FPS) 63.8 61.7 61.1 59.6
HM36+MPII(8FPS) 49.5 48.8 48.5 47.8
HM36S1(25FPS) 69.2 68.1 67.7 66.5
HM36S2(25FPS) 63.9 62.7 62.3 61.1
HM36+MPII(25FPS) 49.5 49.3 49.1 48.7
Table 4: Effect of ExplicitPD for tracking. (See supplementary materials for detailed results on all joints and dimensions.)
Method Tome [76] Moreno [52] Zhou [90] Jahangiri [44] Mehta [50] Martinez [49] Kanazawa [46] Fang [27] Sun [69] Sárándi [63] Sun [71] Dabral [20] Hossain [37] Ours
MPJPE 88.4 87.3 79.9 77.6 72.9 62.9 88.0 60.4 59.1 54.2 49.6 52.1 51.9 47.7
Table 5: Comparisons to previous work on Human3.6M. All methods used extra 2D training data. Our single-frame baseline uses extra MPII data, and ExplicitPD uses only HM36 data in training. Methods marked with * exploit temporal information. Ours gives the lowest error.
Method Zhou [89] Tekin [74] Xingyi [88] Sun [69] Pavlakos [59] Sun [71] Lin [47] Coskun [19] Ours
MPJPE 113.0 125.0 107.3 92.4 71.9 64.1 73.1 71.0 59.3
Table 6: Comparison to previous work on Human3.6M. No extra training data is used. Methods marked with * exploit temporal information. Ours yields the lowest error.

Effect of ExplicitPD for tracking We now apply the explicitly learned pose deformation to the tracking framework introduced in Section 4 for better joint localization in videos. For the single-frame pose estimator, we use the PyTorch implementation [2] of Integral Regression [71]. Specifically, three single-frame baselines are used. The first two use only HM36 data for training and use a one-stage (HM36S1) or two-stage (HM36S2) network architecture, respectively. The third baseline mixes HM36 and MPII data for training and uses a one-stage network architecture (HM36+MPII). Note that these three baselines are the state of the art and thus produce strong baseline results.

Table 4 shows how joint localization accuracy improves using our explicitly learned pose deformation. The MPJPE metric and both 8FPS and 25FPS frame rates are used for evaluation. We can draw two conclusions. First, ExplicitPD effectively improves all single-frame baselines. Larger relative improvement can be obtained on the low-performance single-frame baseline. Second, multi-step tracking is effective. With more elements in the duration set, better performance is obtained. Namely, the more observations we get from different durations in Eq. 8, the better the pose tracking result.

Comparison with the state of the art. Previous works are commonly divided into two categories. The first uses extra 2D data for training. The second uses only HM36 data for training. They are compared to our method in Table 5 and Table 6, respectively. Methods marked with * are tracking-based methods that exploit temporal information. Note that in Table 5, although Sárándi et al. [63] do not use extra 2D pose ground truth, they use extra 2D images for synthetic occlusion data augmentation, where the occluders are realistic object segments from the Pascal VOC 2012 dataset [25]. Their data augmentation technique is also complementary to our method.

Our method sets a new state of the art on the Human3.6M benchmark under both categories. Specifically, it improves the state-of-the-art by 1.9mm (relative 3.8%) in Table 5, and 4.8mm (relative 7.5%) in Table 6.
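The reported gains can be checked directly from the MPJPE values in Tables 5 and 6. The short sketch below does this; the helper name `relative_improvement` is ours, and the inputs are the best previous result and our result as listed in each table.

```python
def relative_improvement(previous_error, new_error):
    """Return the absolute gain and the gain as a percentage of previous_error."""
    absolute = previous_error - new_error
    return absolute, 100.0 * absolute / previous_error

abs5, rel5 = relative_improvement(49.6, 47.7)  # Table 5: best previous result vs. ours
abs6, rel6 = relative_improvement(64.1, 59.3)  # Table 6: best previous result vs. ours
print(f"Table 5: {abs5:.1f} mm ({rel5:.1f}%)")  # Table 5: 1.9 mm (3.8%)
print(f"Table 6: {abs6:.1f} mm ({rel6:.1f}%)")  # Table 6: 4.8 mm (7.5%)
```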

5.3 Experiments on MSRA15

For depth-image-based hand pose tracking on the MSRA15 dataset, we use the original frame rate with no downsampling. ResNet-18 is adopted as the backbone network, and the input images are normalized to . Data augmentation includes random rotation ( degrees) and scale (%). No flip or translation augmentation is used. For single-frame pose estimation, many recent works provide their single-frame predictions [79, 17, 30, 28]. We use all of them as our single-frame baselines. In addition, we re-implement a new single-frame baseline using Integral Regression [71].

baselines single
PointNet [28] 8.50 7.82 7.63 7.30 7.19
SHPR-Net [17] 7.76 7.37 7.26 7.08 7.02
Point2Point [30] 7.71 7.19 7.04 6.85 6.80
DenseReg [79] 7.23 7.04 6.98 6.96 6.96
Integral [71] 8.42 7.10 6.78 6.42 6.35
Table 7: Effect of ExplicitPD on tracking.

Effect of ExplicitPD on tracking. We apply ExplicitPD to all of these single-frame baselines for tracking and list the results in Table 7. The mean 3D joint error is used as the evaluation metric. Our observations are similar to those for the HM36 experiments. First, ExplicitPD significantly improves all single-frame baselines. The relative performance gain over the single-frame baseline is shown in the subscript. Second, more elements in the duration set lead to better tracking results. Specifically, our method improves the Integral Regression baseline by 2.07mm (relative 24.6%), establishing a new record of 6.35mm mean 3D joint error on this benchmark.

Figure 3: Comparisons to the state of the art on MSRA15.

Comparisons to the state of the art. We compare our method to previous works in Table 7 and Figure 3 under the metrics of mean 3D joint error and percentage of correct frames. All of the previous works are single-frame pose estimators, which are complementary to our tracking framework. Consistent improvements are obtained under both evaluation metrics by applying our ExplicitPD model directly to the highest-performing methods. Our method achieves a 6.35mm mean 3D joint error and outperforms the previous state-of-the-art result of [79] by a large margin (11.1% relative improvement).
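For reference, the two metrics used above can be sketched as follows. The code assumes the common convention on hand pose benchmarks that a frame is correct when its worst joint error is within a distance threshold; the function names and the (frames, joints, 3) array layout are our own choices for illustration.

```python
import numpy as np

def per_joint_errors(pred, gt):
    """Euclidean distance per joint; pred and gt have shape (frames, joints, 3)."""
    diff = np.asarray(pred, dtype=float) - np.asarray(gt, dtype=float)
    return np.linalg.norm(diff, axis=-1)

def mean_joint_error(pred, gt):
    """Mean 3D joint error over all frames and joints."""
    return float(per_joint_errors(pred, gt).mean())

def fraction_of_correct_frames(pred, gt, threshold_mm):
    """Fraction of frames whose worst joint error is within threshold_mm."""
    worst = per_joint_errors(pred, gt).max(axis=1)
    return float((worst <= threshold_mm).mean())
```

Sweeping `threshold_mm` over a range of values yields a curve of the kind plotted in Figure 3.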

5.4 Experiments on SHP and D+O

Method AUC EPE mean
Panteleris [57] 0.941 -
Z&B [91] 0.948 8.68
Mueller [53] 0.965 -
Spurr [66] 0.983 8.56
Iqbal [43] 0.994 -
Cai [13] 0.994 -
Ours single 0.994 7.82
Ours track 0.995 7.62

Figure 4: Comparisons to state-of-the-art on SHP.

Method AUC EPE mean EPE median
Mueller [53] 0.56 - -
Iqbal [43] 0.71 31.9 25.4
Ours single 0.72 28.9 24.7
Ours track 0.73 27.5 24.3

Figure 5: Comparisons to state-of-the-art on D+O.

For the experiments on these datasets, the training details including frame rate, backbone network, input image size and augmentation strategies are the same as in Section 5.3. For single-frame pose estimation, we re-implement a new baseline using Integral Regression [71].

Comparison to the state of the art. We compare our method to the state of the art on the SHP and D+O datasets in Figure 4 and Figure 5, respectively. AUC and EPE are used as evaluation metrics. On both benchmarks, our re-implemented single-frame pose estimator already achieves state-of-the-art performance, and our tracking method brings consistent further improvements over it, setting a new state of the art.
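As a reminder of what these two metrics measure, the sketch below computes the mean and median end-point error (EPE) and the area under the PCK curve (AUC). The threshold range passed to `pck_auc` is left to the caller because this section does not state it; the function names and array layout are assumptions made for illustration.

```python
import numpy as np

def endpoint_errors(pred, gt):
    """Flattened per-joint Euclidean errors for arrays of shape (frames, joints, 3)."""
    diff = np.asarray(pred, dtype=float) - np.asarray(gt, dtype=float)
    return np.linalg.norm(diff, axis=-1).ravel()

def epe_mean_median(pred, gt):
    """Mean and median end-point error over all joints and frames."""
    errors = endpoint_errors(pred, gt)
    return float(errors.mean()), float(np.median(errors))

def pck_auc(pred, gt, thresholds):
    """Area under the PCK curve, normalized so that a perfect result scores 1."""
    thresholds = np.asarray(thresholds, dtype=float)
    errors = endpoint_errors(pred, gt)
    # PCK at each threshold: share of joints whose error is within that threshold.
    pck = np.array([(errors <= t).mean() for t in thresholds])
    # Trapezoidal integration, written out to avoid version-specific NumPy helpers.
    area = np.sum((pck[1:] + pck[:-1]) / 2.0 * np.diff(thresholds))
    return float(area / (thresholds[-1] - thresholds[0]))
```

Because AUC saturates near 1 once most joints fall inside the threshold range, the EPE columns are the more sensitive indicator of differences between strong methods.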


1 Additional Results

Due to page limitations, some experimental results were not included in the main paper, and are reported in this supplement instead.

1.1 Experiments on Human3.6M

function formulation
Binary
Gaussian
Linear
Exponential

Figure 1: Illustration of different decay functions.
Figure 2: Different definitions of the decay function for Eq. 2 in the main paper.

method D-MPJPE Beta
JointOne -
Binary 5.0
Gaussian 0.4
Linear 0.2
Exponential 1.0
Ensemble -

Figure 3: Performance for the Binary decay function with respect to the decay rate.
Figure 4: Best performance and corresponding decay rate of different decay functions.

Effect of decay function

We investigate four forms of the decay function for Eq. 2 in the main paper. They are defined in Figure 2 and illustrated in Figure 1. As an example, Figure 3 shows the pose deformation estimation performance using the Binary decay function under different decay rates. The D-MPJPE metric and the 2.5FPS frame rate are used. At the two extremes of the decay rate, the weighting degenerates to JointOne and Full, respectively. JointOne (i.e., using only the joint pixels) does not yield the best results, and we obtain better performance when the decay rate is slightly larger than at the JointOne extreme (see the Binary case in Figure 3). Similar observations are obtained using the other decay functions, with the best decay rate for each listed in Figure 4. The performance differences between decay functions are minor, but an ensemble of these functions yields considerable improvement. We thus use this Ensemble model as our final pose deformation estimator for tracking.

baselines single
HM36S1(8FPS) 69.1 66.5 65.9 64.3
HM36S2(8FPS) 63.8 61.7 61.1 59.6
HM36+MPII(8FPS) 49.5 48.8 48.5 47.8
HM36S1(25FPS) 69.2 68.1 67.7 66.5
HM36S2(25FPS) 63.9 62.7 62.3 61.1
HM36+MPII(25FPS) 49.5 49.3 49.1 48.7
baselines single
HM36S1(25FPS) 69.2 66.8 65.8 64.9
HM36S2(25FPS) 63.9 61.3 60.6 59.8
HM36+MPII(25FPS) 49.5 48.8 48.3 48.2
Table 1: Effect of ExplicitPD for tracking. Larger duration time is used for 25FPS. Extension of Table 4 in the main paper.

Effect of ExplicitPD for tracking

In Table 4 of the main paper, it is seen that the tracking accuracy improvement at the higher frame rate (25FPS) is not as significant as at the lower frame rate (8FPS). This is because, under the same duration set, the duration time at 8FPS is longer than at 25FPS. To obtain comparably good results at 25FPS, we increase the duration time at 25FPS by a factor of 3 (i.e., by taking frame 3 as the frame after frame 0, frame 4 as the frame after frame 1, etc.). The tracking results are reported in Table 1, bottom row. It can be seen that the tracking accuracy in this setting is comparable to the results at 8FPS.
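The frame pairing described above can be sketched as follows. The stride of 3 is read off the example in the text (frame 3 follows frame 0, frame 4 follows frame 1); the function name `frame_pairs` and its arguments are hypothetical.

```python
def frame_pairs(num_frames, duration, stride=1):
    """Return (source, target) frame index pairs separated by duration * stride frames."""
    gap = duration * stride
    return [(t, t + gap) for t in range(num_frames - gap)]

# With stride 3, one duration step at 25FPS spans roughly the same time as at 8FPS.
print(frame_pairs(6, duration=1, stride=3))  # [(0, 3), (1, 4), (2, 5)]
```

With `stride=1` the same function gives the default pairing of consecutive frames used for the 8FPS setting.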

FPS 2.5 8 25
Method FlowPD ExplicitPD FlowPD ExplicitPD FlowPD ExplicitPD
Average 26.8 9.08 3.44
Ankle 41.0 14.0 5.23
Knee 26.0 9.14 3.42
Hip 9.67 3.37 1.17
Torso 12.8 4.91 1.81
Neck 16.9 6.24 2.47
Head 21.1 7.75 3.39
Wrist 54.7 16.8 6.17
Elbow 40.1 13.2 4.93
Shoulder 21.0 7.57 3.00
x 16.0 5.00 1.81
y 12.3 3.92 1.41
z - 10.3 - 4.25 - 1.80
Table 2: Comparison to optical flow for different frame rates on all joints and dimensions. The JointOne sampling strategy is used. The relative performance gain is shown in the subscript. Extension of Table 2 (main paper), first column.
Method HM36S1 HM36S2 HM36+MPII
Average 69.1 63.8 49.5
Ankle 85.5 82.4 65.7
Knee 53.5 55.0 43.0
Hip 29.1 28.4 22.2
Torso 48.9 44.6 38.3
Neck 77.7 68.4 54.5
Head 80.4 69.8 48.9
Wrist 104.0 100.1 73.8
Elbow 92.7 80.4 61.8
Shoulder 76.5 68.9 55.4
x 20.3 19.8 13.8
y 17.9 17.4 14.3
z 56.4 50.5 40.2
Table 3: Detailed results on all joints and dimensions for single-frame baselines and methods. The relative performance gain is shown in the subscript. Extension of Table 4 (main paper), top row.

Detailed Results on All Joints and Dimensions

In Table 2 of the main paper, we show that ExplicitPD outperforms FlowPD under all frame rates using any sampling/weighting strategy, with especially large differences under large pose deformations. Table 2 further reports the performance improvements of ExplicitPD over FlowPD on all joints and dimensions. The performance gain is shown as the subscript. The JointOne sampling strategy is used. Since we directly use the corresponding deformation in depth from ExplicitPD for FlowPD, comparisons in the depth dimension are not reported. The conclusions are consistent with Table 2 in the main paper. First, ExplicitPD clearly outperforms FlowPD under all frame rates for all joints and dimensions. Second, ExplicitPD is especially superior to FlowPD under large pose deformations. This can be seen from the larger relative improvements at lower frame rates and for end joints with large displacements, e.g., Ankle, Head and Wrist. Additionally, the improvement in the x dimension is larger than in the y dimension. This is due to more frequent body movements in the x dimension than in the y dimension in the dataset.

In Table 4 of the main paper, we show that ExplicitPD effectively and consistently improves all three single-frame baselines, namely HM36S1, HM36S2 and HM36+MPII. Table 3 further reports the performance improvement over these single-frame baselines on all joints and dimensions. We can conclude that our method improves the accuracy for all joints and dimensions, especially for challenging ones such as the wrist, elbow and ankle joints and the z dimension.
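A per-joint and per-dimension breakdown of the kind reported in Tables 2 and 3 could be computed along the following lines. This is only a sketch: it assumes that per-joint errors are mean Euclidean distances and that per-dimension errors are mean absolute differences along each axis, which the text does not spell out, and the function name `error_breakdown` is ours.

```python
import numpy as np

def error_breakdown(pred, gt, joint_names):
    """Per-joint, per-axis and average errors for (frames, joints, 3) arrays."""
    diff = np.asarray(pred, dtype=float) - np.asarray(gt, dtype=float)
    # Per-joint error: Euclidean distance averaged over frames.
    per_joint = np.linalg.norm(diff, axis=-1).mean(axis=0)
    # Per-axis error: mean absolute difference over frames and joints (assumed definition).
    per_axis = np.abs(diff).mean(axis=(0, 1))
    report = {name: float(e) for name, e in zip(joint_names, per_joint)}
    report.update({axis: float(e) for axis, e in zip("xyz", per_axis)})
    report["Average"] = float(per_joint.mean())
    return report
```

Running this once for a single-frame baseline and once for the tracked result, then comparing entries, gives the per-row gains discussed above.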

Qualitative Results

Figures 6, 7, 8 and 9 show additional qualitative comparison results of ExplicitPD and FlowPD. Displacements of Right Wrist and Right Elbow joints are shown by green and red arrows, respectively. Figure 5 is the color map for optical flow and pose deformation visualization. See our supplementary demo video for more vivid and dynamic results.

Figure 5: (Best viewed in color) Color map for optical flow and pose deformation visualization.
Figure 6: (Best viewed in color) Qualitative comparison results of ExplicitPD and FlowPD. Samples from subject 9, action directions, camera view 1. Displacements of Right Wrist and Right Elbow joints are shown by green and red arrows, respectively. See our supplementary demo video for more vivid and dynamic results.
Figure 7: (Best viewed in color) Qualitative comparison results of ExplicitPD and FlowPD. Samples from subject 9, action eating, camera view 2. Displacements of Right Wrist and Right Elbow joints are shown by green and red arrows, respectively. See our supplementary demo video for more vivid and dynamic results.
Figure 8: (Best viewed in color) Qualitative comparison results of ExplicitPD and FlowPD. Samples from subject 11, action smoking, camera view 3. Displacements of Right Wrist and Right Elbow joints are shown by green and red arrows, respectively. See our supplementary demo video for more vivid and dynamic results.
Figure 9: (Best viewed in color) Qualitative comparison results of ExplicitPD and FlowPD. Samples from subject 11, action taking photo, camera view 4. Displacements of Right Wrist and Right Elbow joints are shown by green and red arrows, respectively. See our supplementary demo video for more vivid and dynamic results.