Human pose tracking is an essential component of many video understanding tasks such as visual surveillance, action recognition, and human-computer interaction. Compared to human pose estimation in single images, pose tracking in videos has proven to be a more challenging problem due in part to issues such as motion blur and uncommon poses with few or no examples in training sets. To help alleviate these problems, temporal information can be utilized as an additional cue for locating body joints.
A variety of approaches have been presented for exploiting temporal information in tracking human pose. Among them are techniques that use dense optical flow to propagate joint estimates from previous frames [45, 60, 65], model fitting with the predicted pose of the preceding frame as initialization [20, 11, 51, 61, 64, 73, 55, 56, 9, 5, 7, 75], and spatio-temporal smoothness priors as a pose estimation constraint [89, 53].
While these methods take advantage of pose similarity from frame to frame, we observe that they do not learn explicitly about the effect of human motions on the appearance and displacement of joints. Rather, the motions are predicted through generic machinery like optical flow, local optimization of an initialized model, and smoothness constraints. These methods may additionally utilize single-image pose estimation techniques trained on static human images, but this domain-specific training does not extend along the temporal dimension.
In this paper, we propose an approach to human pose tracking that explicitly
learns about human pose deformations from frame to frame. Our system design reconfigures the optical flow framework to introduce elements from human pose estimation. Instead of densely predicting the motion vectors of general scene points as done for optical flow, our network is trained specifically to infer the displacements only of human joints, while also accounting for possible appearance changes due to motion. Figure 1
displays a qualitative comparison of the two approaches. Inspired by heat maps in pose estimation, we leverage the other pixels to support the inference of each joint displacement. With training on labeled pose tracking datasets, our neural network learns to make use of information from both joint and non-joint pixels as well as appearance change caused by joint motion to predict joint displacements from one frame to the next. The joint estimates from this temporal inference are complementary to static pose estimates on individual frames. Using them together, our human pose tracking system learns explicitly about human poses in terms of both static and dynamic appearance.
The presented technique comes with the following practical benefits:
Generality: This approach can be employed for various pose tracking applications (body/hand, 2D/3D, with/without an object) and with different input modalities (RGB/depth). In our experiments, the system is demonstrated on 3D human body pose tracking in RGB video, 3D hand gesture tracking from depth sequences, and 3D hand pose tracking in RGB video.
Flexibility: The temporal tracking is complementary to single-frame pose estimation, and it can be utilized together with approaches such as local model fitting. Its speed-accuracy tradeoff is also controllable.
High performance: State-of-the-art results are obtained on standard benchmarks for all of the aforementioned pose tracking applications in our experiments.
The code and model will be released online upon publication of this work.
2 Related Work
In this section, we briefly review related approaches for human pose tracking.
Optical Flow Based Pose Deformation
A family of deep learning methods aids prediction in the current frame with information from its neighbors using dense optical flow [45, 60, 65]. Jain et al. concatenate dense optical flow with RGB features to enhance the performance of pose estimation. Pfister et al. use dense optical flow to warp the heat maps of previous frames to the current frame. The aligned heat maps are then combined by a learned convolutional layer to produce a pooled heat map. To better exploit the warped heat maps, Song et al. present a spatio-temporal inference layer which performs message passing along spatial and temporal graph edges to reduce joint position uncertainty.
For these methods, an issue exists with their use of dense optical flow, which can be unreliable for non-rigid body regions that may undergo appearance change during motion. We propose a method trained specifically to predict the displacement of human joints directly, with the support of information from relevant local areas. This task-specific approach provides greater reliability than generic flow, while also not needing an additional post-processing step for heatmap warping or a sub-network to learn how to fuse multiple results. In addition, this approach is directly applicable to both 2D and 3D pose, while optical flow provides only 2D cues. An empirical comparison to an optical flow based technique is presented in Section 5.2.
Recurrent Neural Networks Recurrent neural networks (RNNs), specifically with Long Short-Term Memory (LSTM), have proven effective at exploiting temporal correlations in many tasks including language modeling, machine translation [8, 18], and video description. Lin et al. first introduce LSTM to 3D pose tracking in video. They propose a multi-stage sequential refinement framework to predict 3D pose sequences using previously predicted 3D poses. Following this paradigm, Coskun et al.
use Kalman filters for better pose regularization. More recently, Hossain et al. exploit recent progress in sequence-to-sequence networks  composed of layer-normalized  LSTM units with recurrent dropout . Luo et al.  embed LSTM units into the multi-stage Convolutional Pose Machine (CPM)  framework, yielding LSTM Pose Machines that predict a sequence of poses from a sequence of images.
Alternatively, some works fully connect the input and output pose sequences directly using a fully connected [20, 24] or convolutional  network, which draws global dependencies between the input and output. Recently, Vaswani et al.  show that a simple self-attention mechanism eschewing recurrence can also model global input-output dependencies with better efficacy in machine translation models. The attention mechanism is efficient because it focuses on information relevant to what is currently being processed. Inspired by this, we propose a pixel weighting mechanism with which the pose deformation target depends only on relevant input pixels. But rather than learning to find the attention areas as in , our approach directly identifies them based on distance from the previously detected joints, as explained in Section 3.2. In the experiments of Section 5.2, we show that this approach compares favorably to RNN-based techniques.
Model Fitting A classic pose tracking paradigm is to locally optimize a model-based pose representation initialized from the previous frame [20, 11, 51, 61, 64, 73, 56, 9, 7, 75]. Elaborate model parameterization and objective function design are crucial for good performance. In these works, local optimization is usually slow due to the complexity of models and objective functions [11, 55, 56, 9, 73]. Moreover, the optimization is purely local and cannot recover from tracking failure arising from the well-known drifting problem. While re-initialization strategies [5, 7, 61, 64] or stochastic optimization methods [61, 55, 56] have been studied to alleviate drifting, the problem cannot fundamentally be resolved within the local optimization framework.
By contrast, we propose a completely data-driven approach that provides a globally optimal and model-free estimate of pose deformation from frame to frame for tracking. It is also complementary to these local model fitting techniques, which could potentially be used for refinement.
Tracking by Detection Some works follow the tracking by detection paradigm, which performs detection in each frame, estimates the pose of each detection result with an off-the-shelf technique, then either enforces spatio-temporal smoothness or computes joint associations for multi-person tracking.
Spatio-Temporal Smoothness. Zhou et al.  propose a synthesis of a deep-learning based 2D part detector, a sparsity-driven 3D reconstruction approach and a 3D temporal smoothness prior for 3D human pose estimation from RGB image sequences. Mueller et al.  add temporal smoothness to their spatial-temporal optimization to penalize deviations from constant pose velocity. Instead of employing generic smoothness priors, we seek a more targeted approach by specifically learning about human pose change.
Joint Association for Multi-Person Tracking. Multi-person pose tracking in videos aims to estimate the pose of all persons appearing in the video and assign a unique identity to each person [42, 31, 82, 83, 26, 39, 21]. These approaches differ from one another in their choice of metrics and features used for measuring similarity between a pair of poses and in their algorithms for joint identity assignment. Recent methods utilize metrics based on bounding box  and keypoint  similarity between a pair of poses, as well as on similarity between bounding-box image features  and optical flow information [39, 42, 83]. Recently, Doering et al.  learn a model that predicts Temporal Flow Fields which indicate the direction in which each body joint is going to move from two consecutive frames. Similar to our approach, they propose a task-specific representation instead of task-agnostic similarity metrics, but for the problem of joint association.
Note that in these multi-person techniques, the concept of ‘tracking’ is to associate joints of the same person together over time, using joints localized independently in each frame. By contrast, our work aims to improve joint localization by utilizing information from other frames.
3 Explicit Pose Deformation Learning
We define pose deformation as the displacement of a person’s body or hand joints from one time instance to the next. Let the pose be denoted as P, a vector of 3J dimensions for the 3D pose of J joints. Given two successive frames I_{t−1} and I_t from a video sequence, our goal is to estimate the pose deformation ΔP_t of a person from time t−1 to t:

ΔP_t = P_t − P_{t−1}.   (1)
A possible way to predict this deformation is by using a CNN with a fully connected layer. The two frames are passed through convolutional layers to generate convolutional features. Following common practice [14, 88, 69, 87, 71], the features undergo spatial downsampling by average pooling. They are then given to a fully connected layer which outputs a 3J-dimensional joint displacement vector. We refer to this holistic fully connected approach as the Baseline.
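As a concrete sketch, the holistic baseline might look as follows in PyTorch; the trunk depth, channel widths, and joint count here are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class HolisticBaseline(nn.Module):
    """Holistic baseline: two stacked frames -> conv features -> spatial
    average pooling -> fully connected 3J-dim displacement vector."""
    def __init__(self, num_joints=17):
        super().__init__()
        # A toy convolutional trunk standing in for the real backbone.
        self.trunk = nn.Sequential(
            nn.Conv2d(6, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)      # spatial average pooling
        self.fc = nn.Linear(64, 3 * num_joints)  # 3J-dim joint displacement

    def forward(self, frame_prev, frame_cur):
        x = torch.cat([frame_prev, frame_cur], dim=1)  # stack along channels
        x = self.pool(self.trunk(x)).flatten(1)
        return self.fc(x)

net = HolisticBaseline(num_joints=17)
out = net(torch.zeros(2, 3, 64, 64), torch.zeros(2, 3, 64, 64))
print(out.shape)  # torch.Size([2, 51])
```

Note that the fully connected layer ties every output displacement to every pooled feature, which is the holistic mapping that the per-pixel representation below is designed to avoid.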
In this work, we instead propose a novel per-pixel pose deformation that is inspired by the two most relevant tasks, pose and optical flow estimation. In the following, we describe our representation for pose deformation and how it is used for explicit learning of human pose changes.
3.1 Pose Deformation Representation
Solving for pose deformation is related to estimating pose and optical flow, which are both commonly formulated as per-pixel prediction problems. The predominant approach for pose estimation involves classifying each pixel into a joint class and generating a likelihood heat map for each joint. For optical flow, the motion of each pixel is predicted to form a dense flow map [23, 38, 68].
In tailoring our method to pose deformation, we alter the conventional optical flow scheme to incorporate concepts from pose estimation. This is done by following two main principles. The first is that rather than computing the motion of each pixel, the focus should be on predicting the motion of pixels with specific semantics, namely joints. Similar to pose estimation, we use multiple pixels to support the estimation of each joint motion. This will be represented by a Joint Displacement Map. The second principle is to consider the varying relevance of pixels in estimating the deformation of a joint, much like the use of heat maps for estimating joint positions or the relative emphasis given to different image regions in attention-based vision techniques. The inclusion of non-relevant pixels and the over-emphasis of weakly relevant pixels would not only be unhelpful but could moreover be detrimental due to the noise they would introduce to the estimation. The relative relevance of pixels for estimating a joint’s motion is represented by a Pixel Weighting Map.
Joint Displacement Map. The body movement that leads to the displacement of a joint generally impacts many pixels in a video frame. Correspondingly, many pixels can provide information about the displacement of a joint. Therefore, our method generates a separate displacement map for each joint, where every pixel in a map predicts the displacement of the joint, as illustrated in Figure 2. For 3D pose, a joint displacement map is a 3D vector field for the x, y, z dimensions. For compactness, the joint index will henceforth be omitted, as all joints are processed separately in the same way.
Pixel Weighting Map. However, not every pixel in a joint displacement map is useful for estimating the motion of the joint. For example, pixels that lie far from the joint’s vicinity or within smooth and textureless areas are generally less informative for this prediction. To represent this information, a pixel weighting map is generated for each joint displacement map, as shown in Figure 2. Each pixel in the map contains a weight value in [0, 1] that indicates the relevance of the pixel for predicting the displacement of this joint.
3.2 Pixel Weighting Mechanism
Various methods exist for determining the pixel weights for a given joint. We experimentally examined several possibilities, some of which are based on the intuitive notion that pixels closer to the previous joint prediction or that have strong image features like corners, edges and ridges are generally more informative for estimating joint deformation.
Table 1 presents a comparison of different pixel sampling strategies. JointOne takes only the pixel at the ground truth joint location of the previous frame; RandOne randomly samples one pixel; RandHalf randomly samples half of the pixels; and Full uses all pixels. POI denotes sampling only points of interest. For this, we use a standard implementation of speeded up robust features (SURF) from OpenCV. POIClose takes only the POI samples that lie close to the target joint (distance of less than 1/32 of the input image size). The evaluation metric is the mean per joint position error in millimeters; lower is better (see Section 5.1 for details). None means that no pixel is sampled, and the ground truth pose of the previous frame is directly used as the current frame’s pose estimate. Each of these strategies is implemented in a fully convolutional network with the same backbone network as Baseline but with a deconvolutional head network (see Section 3.3 for details) instead of a fully connected layer. Connections from the input image are provided only according to the sampling strategy.
A couple of conclusions can be drawn from Table 1:
Sparse connections are superior to holistic mapping. It can be seen from the lower performance of Baseline and Full compared to JointOne and POI that processing all the pixels is not as effective as attending to some subset of the pixels that have relevance to the pose deformation.
Pixel selection matters. Randomly choosing the sparse connections (RandOne, RandHalf) yields lower performance than more purposeful selection (JointOne, POI).
These observations motivate us to propose a general pixel weighting formulation:

W(p) = f(λ · D_J(p) + (1 − λ) · D_POI(p); α),   (2)

where f is a decay function that determines the drop in weight for pixels farther away from the most informative ones, D_J is the joint location distance map, and D_POI is the points of interest distance map. Each pixel value in a distance map represents the distance from that pixel to the joint location or to the closest point of interest. λ balances the two terms, and α is a decay rate parameter. For example, if we use the exponential decay function

f(d; α) = exp(−α · d),   (3)

then Eq. 2 becomes

W(p) = exp(−α · (λ · D_J(p) + (1 − λ) · D_POI(p))).   (4)
Several decay functions are empirically investigated in Section 5.
We have observed that using POIClose is more effective than JointOne. However, since the improvement is not significant and generating POI requires extra computation, we henceforth weight pixels based only on their joint distance, dropping the points of interest term, in the remaining experiments.
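A minimal sketch of this weighting, using only the joint distance term with an exponential decay (the decay rate value here is an arbitrary illustrative choice):

```python
import numpy as np

def joint_distance_map(h, w, joint_xy):
    """Per-pixel Euclidean distance to the previous joint location."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.hypot(xs - joint_xy[0], ys - joint_xy[1])

def pixel_weighting_map(h, w, joint_xy, alpha=0.1):
    """Exponential decay of the joint distance: w(p) = exp(-alpha * d(p)).
    Weights lie in (0, 1], peaking at the joint location itself."""
    return np.exp(-alpha * joint_distance_map(h, w, joint_xy))

W = pixel_weighting_map(8, 8, joint_xy=(3, 4), alpha=0.5)
print(W[4, 3])  # 1.0 at the joint pixel; weights fall off with distance
```

Swapping in a binary, Gaussian, or linear decay only changes the `pixel_weighting_map` body; the distance map stays the same.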
3.3 Network Architecture
Our network architecture for pose deformation learning is based mainly on FlowNet [23, 38]. As illustrated in Figure 2, we simply stack two consecutive input images together and feed them into a fully convolutional network, as done in the ’FlowNetSimple’ architecture . Then, a head network consisting of consecutive deconvolution layers is used to upsample the feature map to the targeted resolution as done in . At the end, a 1x1 convolutional layer is used to produce the final joint displacement map.
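A minimal PyTorch sketch of this design, with illustrative (not the paper's) layer counts and widths: the two frames are stacked channel-wise, encoded fully convolutionally, upsampled by a deconvolution head, and mapped by a 1x1 convolution to 3 channels per joint:

```python
import torch
import torch.nn as nn

class PoseDeformationNet(nn.Module):
    """FlowNetSimple-style sketch: stacked frame pair -> fully convolutional
    encoder -> deconvolution head -> 1x1 conv emitting per-joint displacement
    maps (3 channels per joint for x, y, z)."""
    def __init__(self, num_joints=17):
        super().__init__()
        self.encoder = nn.Sequential(                  # downsamples by 8x
            nn.Conv2d(6, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.head = nn.Sequential(                     # deconv upsampling head
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 64, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.out = nn.Conv2d(64, 3 * num_joints, 1)    # 1x1 conv -> maps

    def forward(self, frame_prev, frame_cur):
        x = torch.cat([frame_prev, frame_cur], dim=1)  # stack frame pair
        return self.out(self.head(self.encoder(x)))

net = PoseDeformationNet(num_joints=17)
maps = net(torch.zeros(1, 3, 64, 64), torch.zeros(1, 3, 64, 64))
print(maps.shape)  # 51 displacement channels at 1/2 the input resolution
```

The output resolution is controlled by the number of deconvolution layers in the head, which is the knob examined in the resolution experiments.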
Training Loss We directly minimize the absolute difference (L1 norm) between the predicted and ground truth joint displacement maps, weighted by the pixel weighting map. This loss is expressed as

L = Σ_{p ∈ Ω} W(p) · |D(p) − D*(p)|,   (5)

where Ω is the domain of the joint displacement map, D is the predicted map, and D* is the ground truth.
Inference In the testing phase, the final prediction is the average of all predictions in a joint displacement map, weighted by the pixel weighting map:

ΔP = Σ_{p ∈ Ω} W(p) · D(p) / Σ_{p ∈ Ω} W(p),   (6)

where the sums run over all pixels p in the map domain Ω.
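The training and inference steps can be sketched for a single joint as follows; the shapes and names are local to this example:

```python
import numpy as np

def weighted_l1_loss(pred_map, gt_disp, weight_map):
    """Pixel-weighted L1 loss for one joint: every pixel of the (3, H, W)
    displacement map predicts the joint's 3D displacement gt_disp (3,)."""
    err = np.abs(pred_map - gt_disp[:, None, None])   # (3, H, W)
    return float((weight_map * err.sum(axis=0)).sum())

def weighted_average_inference(pred_map, weight_map):
    """Final displacement estimate: weighted average of all per-pixel
    predictions, normalized by the total weight."""
    w = weight_map / weight_map.sum()
    return (pred_map * w).sum(axis=(1, 2))            # (3,)

pred = np.zeros((3, 4, 4))
pred[0] = 2.0                        # every pixel predicts (2, 0, 0)
W = np.ones((4, 4))                  # uniform pixel weighting
d = weighted_average_inference(pred, W)
loss = weighted_l1_loss(pred, np.array([2.0, 0.0, 0.0]), W)
print(d, loss)  # [2. 0. 0.] 0.0
```

With a non-uniform weighting map, pixels near the joint dominate both the loss and the averaged prediction, which is exactly the intended attention effect.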
Discussion. Several architectural variants of FlowNet are presented in the literature, including those that combine deep and shallow feature maps , stack input images at a later stage using a correlation layer , employ a multi-stage stacked architecture , compute small displacement refinement , and use a pyramid, warping and cost volume . We use a simple architecture design to better highlight the effects of concatenated consecutive image input and the proposed pose deformation representation. These elements could be incorporated into other FlowNet variants as well, which we leave for future work.
4 Pose Deformation for Tracking
Given a single-frame pose prediction and a pose deformation prediction between two successive frames, the pose tracking task can be formulated as a simple linear optimization problem. For a video sequence with T frames, we minimize the following error function to obtain the final tracking result {P_t}:

E = Σ_{t=1}^{T} ‖P_t − P̂_t‖² + μ · Σ_{t=2}^{T} ‖(P_t − P_{t−1}) − ΔP_t‖²,   (7)

where P̂_t is the single-frame pose estimate for frame t, ΔP_t is the predicted deformation from frame t−1 to frame t, and μ is a parameter to balance the single-frame and pose deformation terms.
In Eq. 7, we can see that the final pose prediction is determined from two orthogonal observations. The first component of this detection model attempts to answer “Does this local image area look like the correct type of joint from its appearance?” On the other hand, the second component attempts to answer “Is this how the joint location should change by looking at the transition from the previous to the current frame?” The final pose estimation is acquired through an optimization from both perspectives.
This approach can be applied not only to two consecutive frames, but also to frames separated by different durations. Accounting for additional frames can potentially enhance the pose tracking. For this purpose, we define a set of durations S. With the durations in S, the error function of Eq. 7 can be extended to the following:

E = Σ_{t=1}^{T} ‖P_t − P̂_t‖² + μ · Σ_{s ∈ S} Σ_t ‖(P_t − P_{t−s}) − ΔP_{t−s→t}‖²,   (8)

where ΔP_{t−s→t} denotes the predicted deformation from frame t−s to frame t. Minimizing E with respect to {P_t} is a linear least squares problem with more equations than unknowns, which is sufficient to solve for the unknowns. The duration set S can be arbitrary. To validate the effectiveness of Eq. 8, we test four variants:
S = {1}, which is the most common case of using the pose deformation from the previous frame to the current frame to provide one-step forward tracking.
S = {−1, 1}, which additionally considers reverse pose deformation from the subsequent frame to the current frame, yielding one-step forward and backward tracking.
S = {1, 2, …}, which accounts for pose deformation from multiple past frames, providing multi-step forward tracking.
S = {…, −2, −1, 1, 2, …}, which utilizes pose deformation from multiple past and future frames, yielding multi-step forward and backward tracking.
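The per-coordinate optimization can be sketched as an ordinary linear least-squares solve; the function name, the toy data, and the balance weight below are illustrative:

```python
import numpy as np

def track_least_squares(single_frame, deformations, mu=1.0):
    """Solve min_P  sum_t (P_t - Phat_t)^2
                  + mu * sum_(s,t) ((P_t - P_{t-s}) - dP_{t-s->t})^2
    as a linear least-squares problem. single_frame: (T,) per-frame
    estimates of one coordinate (coordinates decouple). deformations:
    dict mapping duration s -> array dP, where dP[t] is the predicted
    change from frame t-s to frame t."""
    T = len(single_frame)
    rows, rhs = [], []
    for t in range(T):                      # single-frame (detection) terms
        r = np.zeros(T); r[t] = 1.0
        rows.append(r); rhs.append(single_frame[t])
    w = np.sqrt(mu)                         # fold mu into the equations
    for s, dP in deformations.items():      # pose deformation terms
        for t in range(s, T):
            r = np.zeros(T); r[t] = w; r[t - s] = -w
            rows.append(r); rhs.append(w * dP[t])
    P, *_ = np.linalg.lstsq(np.asarray(rows), np.asarray(rhs), rcond=None)
    return P

# Noisy per-frame estimates of a linearly moving joint coordinate,
# combined with exact one-step deformations (dP = 1 per frame).
noisy = np.array([0.0, 1.3, 1.8, 3.2, 3.9])
P = track_least_squares(noisy, {1: np.ones(5)}, mu=10.0)
print(np.round(P, 2))   # close to [0, 1, 2, 3, 4]
```

Adding more durations to the dictionary simply appends more rows to the system, which mirrors how enlarging the duration set adds observations to Eq. 8.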
Training Details ResNet (ResNet-50 by default) is adopted as the backbone network. The head network for the joint displacement map is fully convolutional. It first uses deconvolution layers (stride 2) to upsample the feature map to the required resolution. The number of output channels is fixed to 256 as in [34, 71]. Then, a convolutional layer is used to produce the joint displacement maps. PyTorch is used for implementation. Adam is employed for optimization. The input image is normalized to a fixed resolution by default. Data augmentation includes random translation (as a fraction of the image size), scaling, rotation, and flipping. In all experiments, the base learning rate is 1e-5. It drops to 1e-6 when the loss on the validation set saturates. Four GPUs are utilized. The mini-batch size is 128, and batch normalization is used. Other training details are provided with the individual experiments.
5 Experiments

5.1 Datasets and Evaluation Metrics
Our approach can be applied to various pose tracking scenarios, including human body or hand (w/ or w/o an object), RGB or depth input, and 2D or 3D output. We evaluate our approach extensively on four challenging datasets that cover all of these factors to show its generality. Specifically, 3D human pose dataset Human3.6M  of RGB images, 3D hand pose dataset MSRA Hand 2015  of depth images, and 3D hand pose datasets Dexter+Object  and Stereo Hand Pose  of RGB-D images are used.
Human3.6M  (HM36) is the largest 3D human pose benchmark of RGB images. Accurate 3D human joint locations are obtained from motion capture devices. It consists of 3.6 million video frames. 11 subjects are captured from 4 camera viewpoints, performing 15 activities.
For this benchmark, we employ the most widely used evaluation protocol in the literature [15, 76, 52, 90, 44, 50, 59, 84, 62, 11, 89, 74, 88, 71, 63]. Five subjects (S1, S5, S6, S7, S8) are used in training and two subjects (S9, S11) are used in evaluation. For evaluation, most previous works use the mean per joint position error (MPJPE) in millimeters (mm). Since a pose estimation metric such as this is unsuitable for directly evaluating pose deformation performance, we instead introduce a new metric called D-MPJPE, which adds the estimated pose deformation to the ground truth pose of the previous frame to obtain the current frame pose estimation and then evaluates its MPJPE in mm.
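A sketch of the two metrics for a single frame pair (the function names are our own):

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per joint position error: mean Euclidean distance over all
    joints, in the units of the input (mm here)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def d_mpjpe(pred_deformation, gt_pose_prev, gt_pose_cur):
    """D-MPJPE: add the estimated deformation to the previous frame's
    ground-truth pose, then score the result against the current frame
    with ordinary MPJPE."""
    return mpjpe(gt_pose_prev + pred_deformation, gt_pose_cur)

prev = np.zeros((17, 3))                   # 17 joints, xyz in mm
cur = prev + np.array([10.0, 0.0, 0.0])    # every joint moved 10 mm in x
perfect = cur - prev                       # the ideal deformation estimate
print(d_mpjpe(perfect, prev, cur))         # 0.0
print(d_mpjpe(np.zeros_like(perfect), prev, cur))  # 10.0 (no motion predicted)
```

The second call corresponds to the "None" strategy of Table 1, which simply carries the previous pose forward.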
MSRA Hand 2015  (MSRA15) is a standard benchmark for 3D hand pose estimation from depth. It consists of 76.5k video frames. In total, the right hands of nine subjects are captured using Intel’s Creative Interactive Camera.
For evaluation, we use two common accuracy metrics in the literature [70, 80, 16, 29, 33, 54, 78, 79]. The first one is mean 3D joint error. The second one is the percentage of correct frames. A frame is considered correct if its maximum joint error is less than a small threshold.
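The second metric can be sketched as follows, under an assumed error threshold:

```python
import numpy as np

def percent_correct_frames(pred, gt, threshold_mm=20.0):
    """A frame is correct iff its maximum joint error is below the
    threshold; returns the percentage of such frames."""
    per_joint = np.linalg.norm(pred - gt, axis=-1)    # (frames, joints)
    correct = per_joint.max(axis=1) < threshold_mm
    return 100.0 * correct.mean()

gt = np.zeros((4, 21, 3))                  # 4 frames, 21 hand joints, mm
pred = gt.copy()
pred[0, 0, 0] = 30.0                       # one joint off by 30 mm in frame 0
print(percent_correct_frames(pred, gt, threshold_mm=20.0))  # 75.0
```

Because the worst joint decides each frame, this metric is considerably stricter than the mean 3D joint error.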
Dexter+Object  (D+O) and Stereo Hand Pose  (SHP) are two standard 3D hand pose datasets composed of RGB-D images. D+O provides six test video sequences recorded using a static camera with a single hand interacting with an object. SHP provides 3D annotations of 21 keypoints for 18k stereo frame pairs, recording a person performing various gestures in front of six different backgrounds. Note that only the RGB images are used in our experiments.
We follow the evaluation protocol and implementation provided by  that uses 15k samples from the SHP dataset for training (SHP-train) and the remaining 3k samples for evaluation (SHP-eval). The model trained using SHP-train is also evaluated on D+O. For evaluation metrics, we use average End-Point-Error (EPE) and the Area Under Curve (AUC) on the Percentage of Correct Keypoints (PCK) as done in [43, 53, 91, 4].
5.2 Experiments on Human3.6M
For single-frame pose estimation, previous works typically downsample the video sequences for efficiency. For pose tracking in videos, we use three downsampling rates, namely no downsampling (25FPS), three-step downsampling (8FPS) and ten-step downsampling (2.5FPS). The lower the FPS, the larger the pose deformation between two consecutive frames.
Effect of decay function We investigate four forms of decay function in Eq. 2. They are Binary, Gaussian, Linear and Exponential. (See supplementary materials for detailed definitions and performance evaluation.) We observe that JointOne is not the best, and better performance is obtained when the decay rate parameter is carefully selected for each decay function form. The performance differences between different decay functions are minor, but an ensemble of these functions yields considerable improvement. We thus use this Ensemble model as our final pose deformation estimator for tracking.
Comparison to optical flow Table 2 presents a comprehensive comparison between our explicitly learned pose deformation (ExplicitPD) and optical flow based pose deformation (FlowPD). We use the state-of-the-art FlowNet2.0 
and its Caffe implementation to compute optical flow between two consecutive frames, and then use the same pixel sampling/weighting scheme as in our method to produce FlowPD. Since pose deformation in the depth dimension cannot be determined from optical flow, we directly use the corresponding deformation in depth from ExplicitPD for FlowPD. The D-MPJPE metric is used for evaluation. Two conclusions can be drawn from the results. First, ExplicitPD clearly outperforms FlowPD under all frame rates using any sampling/weighting strategy, where the relative performance gain is shown as subscripts. Second, ExplicitPD is especially superior to FlowPD under large pose deformations. This can be seen from the larger relative improvement under lower frame rates (ExplicitPD outperforms FlowPD by 44.8% at 2.5FPS, but by 24.7% at 25FPS using JointOne sampling) and more challenging joints with large displacement (ExplicitPD outperforms FlowPD by 51.9% in Hard level, but by 31.4% in Easy level using JointOne sampling at 2.5FPS).
Effect of resolution Table 3 compares results for two image sizes and five output joint displacement map sizes. The D-MPJPE metric, 2.5FPS frame rate and JointOne sampling strategy are used. Not surprisingly, a larger image size leads to better accuracy, under all cases. However, interestingly, neither a very large nor small joint displacement map size gives the best performance. This illustrates that the connections between input and output should neither be too sparse nor too dense.
(Table: comparison of methods, including Zhou, Tekin, Xingyi, Sun, Pavlakos, Sun, Lin, Coskun, and ours.)
Effect of ExplicitPD for tracking Now we apply the explicitly learned pose deformation to the tracking framework introduced in Section 4 for better joint localization in videos. For the single-frame pose estimator, we use the PyTorch implementation of Integral Regression. Specifically, three single-frame baselines are used. The first two use only HM36 data for training and use a one-stage (HM36S1) or two-stage (HM36S2) network architecture, respectively. The third baseline mixes HM36 and MPII data for training and uses a one-stage network architecture (HM36+MPII). Note that these three baselines represent the state of the art and thus produce strong baseline results.
Table 4 shows how joint localization accuracy improves using our explicitly learned pose deformation. The MPJPE metric and both 8FPS and 25FPS frame rates are used for evaluation. We can draw two conclusions. First, ExplicitPD effectively improves all single-frame baselines. Larger relative improvement can be obtained on the lower-performance single-frame baselines. Second, multi-step tracking is effective. With more elements in the duration set, better performance is obtained. Namely, the more observations we get from different durations in Eq. 8, the better the pose tracking result.
Comparison with the state of the art Previous works are commonly divided into two categories. The first uses extra 2D data for training. The second only uses HM36 data for training. They are compared to our method in Table 5 and Table 6 respectively. Methods marked with * are tracking based methods that exploit temporal information. Note that in Table 5, although Sárándi et al.  do not use extra 2D pose ground truth, they use extra 2D images for synthetic occlusion data augmentation where the occluders are realistic object segments from the Pascal VOC 2012 dataset . Their data augmentation technique is also complementary to our method.
5.3 Experiments on MSRA15
For depth image based hand pose tracking on the MSRA15 dataset, we use the original frame rate with no downsampling. ResNet-18 is adopted as the backbone network, and the input images are normalized to a fixed resolution. Data augmentation includes random rotation and scaling; no flip or translation augmentation is used. For single-frame pose estimation, many recent works provide their single-frame predictions [79, 17, 30, 28]. We use all of them as our single-frame baselines. In addition, we re-implement a new single-frame baseline using Integral Regression.
Effect of ExplicitPD on tracking We apply our ExplicitPD on all of these single-frame baselines for tracking, and list the results in Table 7. The mean 3D joint error is used as the evaluation metric. Our observations are similar to those for the HM36 experiments. First, ExplicitPD significantly improves all single-frame baselines. The relative performance gain over the single-frame baseline is shown in the subscript. Second, more elements in the duration set lead to better tracking results. Specifically, our method improves the Integral Regression baseline by 2.07mm (a relative 24.6%), establishing a new record of 6.35mm mean 3D joint error on this benchmark.
Comparisons to the state of the art We compare our method to previous works in Table 7 and Figure 3 under the metric of mean 3D joint error and percentage of correct frames. All of the previous works are single-frame pose estimators, which are complementary to our tracking framework. Consistent improvements are obtained by applying our ExplicitPD model directly to the highest-performing methods under both evaluation metrics. Our method achieves a 6.35mm mean 3D joint error and outperforms the previous state-of-the-art result of  by a large margin (11.1% relative improvement).
5.4 Experiments on SHP and D+O
For the experiments on these datasets, the training details including frame rate, backbone network, input image size and augmentation strategies are the same as in Section 5.3. For single-frame pose estimation, we re-implement a new baseline using Integral Regression .
Comparison to the state of the art We compare our method to the state of the art on the SHP and D+O datasets in Table 5 and Table 5, respectively. AUC and EPE are used as evaluation metrics. On both benchmarks, our re-implemented single-frame pose estimator already achieves state-of-the-art performance, and our tracking method then brings consistent improvements over the single-frame baselines, setting a new state of the art.
-  FlowNet2.0 Caffe Github. https://github.com/lmb-freiburg/flownet2.
-  Integral Regression Pytorch Github. https://github.com/JimmySuen/integral-human-pose.
-  MPII Leader Board. http://human-pose.mpi-inf.mpg.de.
-  SHP Evaluation Toolkit Github. https://github.com/lmb-freiburg/hand3d.
-  V. Athitsos and S. Sclaroff. Estimating 3d hand pose from a cluttered image. In IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages II–432. IEEE, 2003.
-  J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
-  A. Baak, M. Müller, G. Bharaj, H.-P. Seidel, and C. Theobalt. A data-driven approach for real-time full body pose reconstruction from a depth camera. In Consumer Depth Cameras for Computer Vision, pages 71–98. Springer, 2013.
-  D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
-  L. Ballan, A. Taneja, J. Gall, L. Van Gool, and M. Pollefeys. Motion capture of hands in action using discriminative salient points. In European Conference on Computer Vision, pages 640–653. Springer, 2012.
-  H. Bay, T. Tuytelaars, and L. Van Gool. Surf: Speeded up robust features. In European conference on computer vision, pages 404–417. Springer, 2006.
-  F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, and M. J. Black. Keep it smpl: Automatic estimation of 3d human pose and shape from a single image. In European Conference on Computer Vision, pages 561–578. Springer, 2016.
-  G. Bradski and A. Kaehler. Opencv. Dr. Dobb’s journal of software tools, 3, 2000.
-  Y. Cai, L. Ge, J. Cai, and J. Yuan. Weakly-supervised 3d hand pose estimation from monocular rgb images. ECCV, Springer, 12, 2018.
-  J. Carreira, P. Agrawal, K. Fragkiadaki, and J. Malik. Human pose estimation with iterative error feedback. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4733–4742, 2016.
-  C.-H. Chen and D. Ramanan. 3d human pose estimation = 2d pose estimation + matching. arXiv preprint arXiv:1612.06524, 2016.
-  X. Chen, G. Wang, H. Guo, and C. Zhang. Pose guided structured region ensemble network for cascaded hand pose estimation. arXiv preprint arXiv:1708.03416, 2017.
-  X. Chen, G. Wang, C. Zhang, T.-K. Kim, and X. Ji. Shpr-net: Deep semantic hand pose regression from point clouds. IEEE Access, 6:43425–43439, 2018.
-  K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
-  H. Coskun, F. Achilles, R. S. DiPietro, N. Navab, and F. Tombari. Long short-term memory kalman filters: Recurrent neural estimators for pose regularization. In ICCV, pages 5525–5533, 2017.
-  R. Dabral, A. Mundhada, U. Kusupati, S. Afaque, A. Sharma, and A. Jain. Learning 3d human pose from structure and motion. In Proceedings of the European Conference on Computer Vision (ECCV), pages 668–683, 2018.
-  A. Doering, U. Iqbal, and J. Gall. Joint flow: Temporal flow fields for multi person tracking. arXiv preprint arXiv:1805.04596, 2018.
-  J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2625–2634, 2015.
-  A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. Van Der Smagt, D. Cremers, and T. Brox. Flownet: Learning optical flow with convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2758–2766, 2015.
-  M. Einfalt, D. Zecha, and R. Lienhart. Activity-conditioned continuous human pose estimation for performance analysis of athletes using the example of swimming. arXiv preprint arXiv:1802.00634, 2018.
-  M. Everingham and J. Winn. The pascal visual object classes challenge 2012 (voc2012) development kit. Pattern Analysis, Statistical Modelling and Computational Learning, Tech. Rep, 2011.
-  M. Fabbri, F. Lanzi, S. Calderara, A. Palazzi, R. Vezzani, and R. Cucchiara. Learning to detect and track visible and occluded body joints in a virtual world. arXiv preprint arXiv:1803.08319, 2018.
-  H.-S. Fang, Y. Xu, W. Wang, X. Liu, and S.-C. Zhu. Learning pose grammar to encode human body configuration for 3d pose estimation. In AAAI, 2018.
-  L. Ge, Y. Cai, J. Weng, and J. Yuan. Hand pointnet: 3d hand pose estimation using point sets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8417–8426, 2018.
-  L. Ge, H. Liang, J. Yuan, and D. Thalmann. 3d convolutional neural networks for efficient and robust hand pose estimation from single depth images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 1, page 5, 2017.
-  L. Ge, Z. Ren, and J. Yuan. Point-to-point regression pointnet for 3d hand pose estimation. ECCV, Springer, 1, 2018.
-  R. Girdhar, G. Gkioxari, L. Torresani, M. Paluri, and D. Tran. Detect-and-track: Efficient pose estimation in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 350–359, 2018.
-  G. Gkioxari, A. Toshev, and N. Jaitly. Chained predictions using convolutional neural networks. In European Conference on Computer Vision, pages 728–743. Springer, 2016.
-  H. Guo, G. Wang, X. Chen, C. Zhang, F. Qiao, and H. Yang. Region ensemble network: Improving convolutional network for hand pose estimation. In Image Processing (ICIP), 2017 IEEE International Conference on, pages 4512–4516. IEEE, 2017.
-  K. He, G. Gkioxari, P. Dollar, and R. Girshick. Mask r-cnn. In International Conference on Computer Vision, 2017.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
-  S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
-  M. R. I. Hossain and J. J. Little. Exploiting temporal information for 3d pose estimation. arXiv preprint arXiv:1711.08585, 2017.
-  E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. Flownet 2.0: Evolution of optical flow estimation with deep networks. In IEEE conference on computer vision and pattern recognition (CVPR), volume 2, page 6, 2017.
-  E. Insafutdinov, M. Andriluka, L. Pishchulin, S. Tang, E. Levinkov, B. Andres, and B. Schiele. Arttrack: Articulated multi-person tracking in the wild. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 4327. IEEE, 2017.
-  S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
-  C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu. Human3.6M: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE transactions on pattern analysis and machine intelligence, 36(7):1325–1339, 2014.
-  U. Iqbal, A. Milan, and J. Gall. Posetrack: Joint multi-person pose estimation and tracking. arXiv preprint arXiv:1611.07727, 2016.
-  U. Iqbal, P. Molchanov, T. Breuel, J. Gall, and J. Kautz. Hand pose estimation via latent 2.5d heatmap regression. arXiv preprint arXiv:1804.09534, 2018.
-  E. Jahangiri and A. L. Yuille. Generating multiple hypotheses for human 3d pose consistent with 2d joint detections. arXiv preprint arXiv:1702.02258, 2017.
-  A. Jain, J. Tompson, Y. LeCun, and C. Bregler. Modeep: A deep learning framework using motion features for human pose estimation. In Asian conference on computer vision, pages 302–315. Springer, 2014.
-  A. Kanazawa, M. J. Black, D. W. Jacobs, and J. Malik. End-to-end recovery of human shape and pose. arXiv preprint arXiv:1712.06584, 2017.
-  M. Lin, L. Lin, X. Liang, K. Wang, and H. Cheng. Recurrent 3d pose sequence machines. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5543–5552. IEEE, 2017.
-  Y. Luo, J. Ren, Z. Wang, W. Sun, J. Pan, J. Liu, J. Pang, and L. Lin. Lstm pose machines. arXiv preprint arXiv:1712.06316, 2017.
-  J. Martinez, R. Hossain, J. Romero, and J. J. Little. A simple yet effective baseline for 3d human pose estimation. arXiv preprint arXiv:1705.03098, 2017.
-  D. Mehta, H. Rhodin, D. Casas, O. Sotnychenko, W. Xu, and C. Theobalt. Monocular 3d human pose estimation in the wild using improved cnn supervision. arXiv preprint arXiv:1611.09813, 2016.
-  D. Mehta, S. Sridhar, O. Sotnychenko, H. Rhodin, M. Shafiei, H.-P. Seidel, W. Xu, D. Casas, and C. Theobalt. Vnect: Real-time 3d human pose estimation with a single rgb camera. ACM Transactions on Graphics (TOG), 36(4):44, 2017.
-  F. Moreno-Noguer. 3d human pose estimation from a single image via distance matrix regression. arXiv preprint arXiv:1611.09010, 2016.
-  F. Mueller, F. Bernard, O. Sotnychenko, D. Mehta, S. Sridhar, D. Casas, and C. Theobalt. Ganerated hands for real-time 3d hand tracking from monocular rgb. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 49–59, 2018.
-  M. Oberweger and V. Lepetit. Deepprior++: Improving fast and accurate 3d hand pose estimation. In ICCV workshop, volume 840, page 2, 2017.
-  I. Oikonomidis, N. Kyriazis, and A. A. Argyros. Markerless and efficient 26-dof hand pose recovery. In Asian Conference on Computer Vision, pages 744–757. Springer, 2010.
-  I. Oikonomidis, N. Kyriazis, and A. A. Argyros. Efficient model-based 3d tracking of hand articulations using kinect. In BmVC, volume 1, page 3, 2011.
-  P. Panteleris, I. Oikonomidis, and A. Argyros. Using a single rgb frame for real time 3d hand pose estimation in the wild. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 436–445. IEEE, 2018.
-  A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. 2017.
-  G. Pavlakos, X. Zhou, K. G. Derpanis, and K. Daniilidis. Coarse-to-fine volumetric prediction for single-image 3d human pose. arXiv preprint arXiv:1611.07828, 2016.
-  T. Pfister, J. Charles, and A. Zisserman. Flowing convnets for human pose estimation in videos. In Proceedings of the IEEE International Conference on Computer Vision, pages 1913–1921, 2015.
-  C. Qian, X. Sun, Y. Wei, X. Tang, and J. Sun. Realtime and robust hand tracking from depth. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1106–1113, 2014.
-  G. Rogez and C. Schmid. Mocap-guided data augmentation for 3d pose estimation in the wild. In Advances in Neural Information Processing Systems, pages 3108–3116, 2016.
-  I. Sárándi, T. Linder, K. O. Arras, and B. Leibe. Synthetic occlusion augmentation with volumetric heatmaps for the 2018 eccv posetrack challenge on 3d human pose estimation. arXiv preprint arXiv:1809.04987, 2018.
-  T. Sharp, C. Keskin, D. Robertson, J. Taylor, J. Shotton, D. Kim, C. Rhemann, I. Leichter, A. Vinnikov, Y. Wei, et al. Accurate, robust, and flexible real-time hand tracking. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pages 3633–3642. ACM, 2015.
-  J. Song, L. Wang, L. Van Gool, and O. Hilliges. Thin-slicing network: A deep structured model for pose estimation in videos. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, 2017.
-  A. Spurr, J. Song, S. Park, and O. Hilliges. Cross-modal deep variational hand pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 89–98, 2018.
-  S. Sridhar, F. Mueller, M. Zollhöfer, D. Casas, A. Oulasvirta, and C. Theobalt. Real-time joint tracking of a hand manipulating an object from rgb-d input. In European Conference on Computer Vision, pages 294–310. Springer, 2016.
-  D. Sun, X. Yang, M.-Y. Liu, and J. Kautz. Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8934–8943, 2018.
-  X. Sun, J. Shang, S. Liang, and Y. Wei. Compositional human pose regression. In International Conference on Computer Vision, 2017.
-  X. Sun, Y. Wei, S. Liang, X. Tang, and J. Sun. Cascaded hand pose regression. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 824–832, 2015.
-  X. Sun, B. Xiao, S. Liang, and Y. Wei. Integral human pose regression. arXiv preprint arXiv:1711.08229, 2017.
-  I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014.
-  A. Tagliasacchi, M. Schröder, A. Tkach, S. Bouaziz, M. Botsch, and M. Pauly. Robust articulated-icp for real-time hand tracking. In Computer Graphics Forum, volume 34, pages 101–114. Wiley Online Library, 2015.
-  B. Tekin, A. Rozantsev, V. Lepetit, and P. Fua. Direct prediction of 3d body poses from motion compensated sequences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 991–1000, 2016.
-  A. Tkach, M. Pauly, and A. Tagliasacchi. Sphere-meshes for real-time hand modeling and tracking. ACM Transactions on Graphics (TOG), 35(6):222, 2016.
-  D. Tome, C. Russell, and L. Agapito. Lifting from the deep: Convolutional 3d pose estimation from a single image. arXiv preprint arXiv:1701.00295, 2017.
-  A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
-  C. Wan, T. Probst, L. Van Gool, and A. Yao. Crossing nets: Combining gans and vaes with a shared latent space for hand pose estimation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017.
-  C. Wan, T. Probst, L. Van Gool, and A. Yao. Dense 3d regression for hand pose estimation. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 1–10, 2018.
-  C. Wan, A. Yao, and L. Van Gool. Hand pose estimation from local surface normals. In European conference on computer vision, pages 554–569. Springer, 2016.
-  S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4724–4732, 2016.
-  B. Xiao, H. Wu, and Y. Wei. Simple baselines for human pose estimation and tracking. arXiv preprint arXiv:1804.06208, 2018.
-  Y. Xiu, J. Li, H. Wang, Y. Fang, and C. Lu. Pose flow: Efficient online pose tracking. arXiv preprint arXiv:1802.00977, 2018.
-  H. Yasin, U. Iqbal, B. Kruger, A. Weber, and J. Gall. A dual-source approach for 3d pose estimation from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4948–4956, 2016.
-  W. Zaremba, I. Sutskever, and O. Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.
-  J. Zhang, J. Jiao, M. Chen, L. Qu, X. Xu, and Q. Yang. 3d hand pose tracking and estimation using stereo matching. arXiv preprint arXiv:1610.07214, 2016.
-  X. Zhou, Q. Huang, X. Sun, X. Xue, and Y. Wei. Towards 3d human pose estimation in the wild: a weakly-supervised approach. In International Conference on Computer Vision, 2017.
-  X. Zhou, X. Sun, W. Zhang, S. Liang, and Y. Wei. Deep kinematic pose regression. In Computer Vision–ECCV 2016 Workshops, pages 186–201. Springer, 2016.
-  X. Zhou, M. Zhu, S. Leonardos, K. G. Derpanis, and K. Daniilidis. Sparseness meets deepness: 3d human pose estimation from monocular video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4966–4975, 2016.
-  X. Zhou, M. Zhu, G. Pavlakos, S. Leonardos, K. G. Derpanis, and K. Daniilidis. Monocap: Monocular human motion capture using a cnn coupled with a geometric prior. arXiv preprint arXiv:1701.02354, 2017.
-  C. Zimmermann and T. Brox. Learning to estimate 3d hand pose from single rgb images. In International Conference on Computer Vision, volume 1, page 3, 2017.
1 Additional Results
Due to page limitations, some experimental results were not included in the main paper, and are reported in this supplement instead.
1.1 Experiments on Human3.6M
Effect of decay function
We investigate four forms of the decay function for Eq. 2 in the main paper. They are defined in Table 4 and illustrated in Figure 4. As an example, Figure 4 shows the pose deformation estimation performance of the Binary decay function under different decay rates. The D-MPJPE metric and a 2.5FPS frame rate are used. At its two extreme decay-rate settings, the decay function degenerates to JointOne and Full, respectively. It is seen that JointOne (i.e., using only the joint pixels) does not yield the best results; better performance is obtained when the decay rate is slightly larger than the JointOne extreme (in the Binary case). Similar observations hold for the other decay functions, with the best decay-rate selection for each listed in Table 4. The performance differences between the decay functions are minor, but an ensemble of them yields a considerable improvement. We thus use this Ensemble model as our final pose deformation estimator for tracking.
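The general idea of distance-based decay can be illustrated as follows. This is a hedged sketch, not the paper's Eq. 2: the function names and the weighted-average aggregation below are assumptions made for illustration. Each decay function maps a pixel's distance to a joint into a weight; with a vanishing decay radius the weighting reduces to the joint pixel alone (JointOne), and with an unbounded radius all pixels contribute equally (Full).

```python
import numpy as np

# Hypothetical decay functions: weight w(d) for a pixel at distance d
# from the joint, with decay radius r.
def binary_decay(d, r):
    return (d <= r).astype(float)       # hard cutoff at radius r

def linear_decay(d, r):
    return np.clip(1.0 - d / r, 0.0, 1.0)  # linear falloff to zero at r

def gaussian_decay(d, r):
    return np.exp(-0.5 * (d / r) ** 2)  # smooth Gaussian falloff

def aggregate_displacement(per_pixel_disp, dist_to_joint, decay, r):
    """Weighted average of per-pixel displacement predictions,
    weighted by the decay of each pixel's distance to the joint.
    per_pixel_disp: (H, W, 2); dist_to_joint: (H, W)."""
    w = decay(dist_to_joint, r)
    return (w[..., None] * per_pixel_disp).sum(axis=(0, 1)) / w.sum()
```
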
Effect of ExplicitPD for tracking
In Table 4 of the main paper, it is seen that the tracking accuracy improvement at a higher frame rate (25FPS) is not as significant as at a lower frame rate (8FPS). This is because, for the same set of duration steps, the duration time at 8FPS is longer than at 25FPS. To obtain comparably good results at 25FPS, we increase the duration time at 25FPS threefold (i.e., by taking frame 3 as the frame after frame 0, frame 4 as the frame after frame 1, etc.). The tracking results are reported in Table 1, bottom row. The tracking accuracy in this setting is comparable to the results at 8FPS.
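Pairing each frame with one several steps ahead, instead of the immediately following frame, can be sketched trivially; the function name and interface below are illustrative, not from the paper:

```python
def frame_pairs(num_frames, stride=1):
    """Index pairs (t, t + stride) fed to the tracker: stride=1 pairs
    consecutive frames; stride=3 triples the effective duration
    between the two frames of each pair (e.g., frame 0 with frame 3)."""
    return [(t, t + stride) for t in range(num_frames - stride)]
```
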
Detailed Results on All Joints and Dimensions
In Table 2 of the main paper, we show that ExplicitPD outperforms FlowPD under all frame rates with any sampling/weighting strategy, with especially large differences under large pose deformations. Table 2 here further reports the performance improvements of ExplicitPD over FlowPD for all joints and dimensions, with the performance gain shown as a subscript. The JointOne sampling strategy is used. Since FlowPD directly uses the corresponding depth deformation from ExplicitPD, comparisons in the depth dimension are not reported. The conclusions are consistent with Table 2 in the main paper. First, ExplicitPD clearly outperforms FlowPD under all frame rates for all joints and dimensions. Second, ExplicitPD is especially superior to FlowPD under large pose deformations, as seen from the larger relative improvements at lower frame rates and on end joints with large displacements, e.g., Ankle, Head, and Wrist. Additionally, the improvement in the x dimension is larger than in the y dimension, owing to more frequent body movement along x than along y in the dataset.
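A per-joint, per-dimension breakdown of this kind can be computed with a simple sketch like the following; the mean-absolute-error form and the percentage convention for the subscripts are assumptions about the tables, not the paper's stated definitions:

```python
import numpy as np

def per_joint_axis_error(pred, gt, joint_names=None):
    """Mean absolute error per joint and per dimension (x, y).
    pred, gt: (num_frames, num_joints, 2) displacement arrays."""
    err = np.abs(pred - gt).mean(axis=0)  # (num_joints, 2)
    names = joint_names or [f"joint{j}" for j in range(err.shape[0])]
    return {n: {"x": float(e[0]), "y": float(e[1])}
            for n, e in zip(names, err)}

def relative_improvement(baseline_err, ours_err):
    """Relative improvement in percent, of the kind shown as table subscripts."""
    return 100.0 * (baseline_err - ours_err) / baseline_err
```
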
In Table 4 of the main paper, we show that ExplicitPD effectively and consistently improves all three single-frame baselines, namely HM36S1, HM36S2, and HM36+MPII. Table 3 further reports the performance improvement over these single-frame baselines for all joints and dimensions. We can conclude that our method effectively improves the accuracy for all joints and dimensions, especially for challenging ones like the wrist, elbow, and ankle joints and the z dimension.
Figures 6, 7, 8, and 9 show additional qualitative comparisons of ExplicitPD and FlowPD. Displacements of the Right Wrist and Right Elbow joints are shown by green and red arrows, respectively. Figure 5 gives the color map used for optical flow and pose deformation visualization. See our supplementary demo video for dynamic results.