Multi-person Articulated Tracking with Spatial and Temporal Embeddings

03/21/2019 ∙ by Sheng Jin, et al. ∙ SenseTime Corporation, The University of Sydney

We propose a unified framework for multi-person pose estimation and tracking. Our framework consists of two main components, SpatialNet and TemporalNet. The SpatialNet accomplishes body part detection and part-level data association in a single frame, while the TemporalNet groups human instances in consecutive frames into trajectories. Specifically, besides body part detection heatmaps, SpatialNet also predicts the Keypoint Embedding (KE) and Spatial Instance Embedding (SIE) for body part association. We model the grouping procedure into a differentiable Pose-Guided Grouping (PGG) module to make the whole part detection and grouping pipeline fully end-to-end trainable. TemporalNet extends spatial grouping of keypoints to temporal grouping of human instances. Given human proposals from two consecutive frames, TemporalNet exploits both appearance features encoded in Human Embedding (HE) and temporally consistent geometric features embodied in Temporal Instance Embedding (TIE) for robust tracking. Extensive experiments demonstrate the effectiveness of our proposed model. Remarkably, we demonstrate substantial improvements over the state-of-the-art pose tracking method from 65.4% to 71.8% Multi-Object Tracking Accuracy (MOTA) on the ICCV'17 PoseTrack Dataset.




1 Introduction

Multi-person articulated tracking aims at predicting the body parts of each person and associating them across temporal periods. It has stimulated much research interest because of its importance in various applications such as video understanding and action recognition [5]. In recent years, significant progress has been made in single frame human pose estimation [3, 9, 12, 24]. However, multi-person articulated tracking in complex videos remains challenging. Videos may contain a varying number of interacting people with frequent body part occlusion, fast body motion, large pose changes, and scale variation. Camera movement and zooming further pose challenges to this problem.

Figure 1: (a) Pose estimation with KE or SIE. SIE may over-segment a single pose into several parts (column 2), while KE may erroneously group far-away body parts together (column 3). (b) Pose tracking with HE or TIE. Poses are color coded by predicted track ids and errors are highlighted by ellipses. TIE is not robust to camera zooming and movement (column 2), while HE is not robust to human pose changes (column 3). (c) Effect of PGG module. Comparing KE before/after PGG (column 3/4), PGG makes embeddings more compact and accurate, where pixels with similar color have higher confidence of belonging to the same person.

Pose tracking [14] can be viewed as a hierarchical detection and grouping problem. At the part level, body parts are detected and grouped spatially into human instances in each single frame. At the human level, the detected human instances are grouped temporally into trajectories.

Embedding can be viewed as a kind of permutation-invariant instance label to distinguish different instances. Previous works [20] perform keypoint grouping with Keypoint Embedding (KE). KE is a set of 1-D appearance embedding maps where joints of the same person have similar embedding values and those of different people have dissimilar ones. However, due to the over-flexibility of the embedding space, such representations are difficult to interpret and hard to learn [23]. Arguably, a more natural way for the human to assign ids to targets in an image is by counting in a specific order (from left to right and/or from top to bottom). This inspires us to enforce geometric ordering constraints on the embedding space to facilitate training. Specifically, we add six auxiliary ordinal-relation prediction tasks for faster convergence and better interpretation of KE by encoding the knowledge of geometric ordering. Recently, Spatial Instance Embedding (SIE) [22, 23] is introduced for body part grouping. SIE is a 2-D embedding map, where each pixel is encoded with the predicted human center location (x, y). Fig. 1(a) illustrates the typical error patterns of pose estimation with KE or SIE. SIE may over-segment a single pose into several parts (column 2), while KE sometimes erroneously groups far-away body parts together (column 3). KE better preserves intra-class consistency but has difficulty in separating instances for lack of geometric constraints. Since KE captures appearance features while SIE extracts geometric information, they are naturally complementary to each other. Therefore we combine them to achieve better grouping results.

In this paper, we propose to extend the idea of using appearance and geometric information in a single frame to the temporal grouping of human instances for pose tracking. Previous pose tracking algorithms mostly rely on task-agnostic similarity metrics such as the Object Keypoint Similarity (OKS) [33, 35] and Intersection over Union (IoU) [8]. However, such simple geometric cues are not robust to fast body motion, pose changes, camera movement and zoom. For robust pose tracking, we extend the idea of part-level spatial grouping to human-level temporal grouping. Specifically, we extend KE to Human Embedding (HE) for capturing holistic appearance features and extend SIE to Temporal Instance Embedding (TIE) for achieving temporal consistency. Intuitively, appearance features encoded by HE are more robust to fast motion, camera movement and zoom, while temporal information embodied in TIE is more robust to body pose changes and occlusion. We propose a novel TemporalNet to enjoy the best of both worlds. Fig. 1(b) demonstrates typical error patterns of pose tracking with HE or TIE. HE exploits scale-invariant appearance features which are robust to camera zooming and movement (column 1), and TIE preserves temporal consistency which is robust to human pose changes (column 4).

Bottom-up pose estimation methods follow a two-stage pipeline: they generate body part proposals in the first stage and group them into individuals in the second. Since grouping is mainly used as post-processing, i.e. graph based optimization [11, 12, 14, 16, 26] or heuristic parsing [3, 23], no error signals from the grouping results are back-propagated. We instead propose a fully differentiable Pose-Guided Grouping (PGG) module, making detection-grouping fully end-to-end trainable. We are able to directly supervise the grouping results, and the grouping loss is back-propagated to the low-level feature learning stages. This enables more effective feature learning by paying more attention to the mistakenly grouped body parts. Moreover, prior SIE-based methods require post-processing clustering [22] or extra refinement [23] to obtain accurate regression results, whereas our PGG helps to produce accurate embeddings directly (see Fig. 1(c)). To improve the pose tracking accuracy, we further extend PGG to temporal grouping of TIE.

In this work, we aim at unifying pose estimation and tracking in a single framework. SpatialNet detects body parts in a single frame and performs part-level spatial grouping to obtain body poses. TemporalNet accomplishes human-level temporal grouping in consecutive frames to track targets across time. These two modules share the feature extraction layers to make more efficient inference.

The main contributions are summarized as follows:

  • For pose tracking, we extend the KE and SIE in still images to Human Embedding (HE) and Temporal Instance Embeddings (TIE) in videos. HE captures human-level global appearance features to avoid drifting in camera motion, while TIE provides smoother geometric features to obtain temporal consistency.

  • A fully differentiable Pose-Guided Grouping (PGG) module for both pose estimation and tracking, which enables the detection and grouping to be fully end-to-end trainable. The introduction of PGG and its grouping loss significantly improves the spatial/temporal embedding prediction accuracy.

2 Related Work

2.1 Multi-person Pose Estimation in Images

Recent multi-person pose estimation approaches can be classified into top-down and bottom-up methods.

Top-down methods [7, 9, 33, 24] locate each person with a bounding box and then apply single-person pose estimation. They mainly differ in the choices of human detectors [28] and single-person pose estimators [21, 32]. They rely heavily on the object detector and may fail in cluttered scenes, under occlusion, during person-to-person interaction, or on rare poses. More importantly, top-down methods perform single-person pose estimation individually for each human candidate. Thus, their inference time is proportional to the number of people, making real-time performance hard to achieve. Additionally, the interface between human detection and pose estimation is non-differentiable, making it difficult to train in an end-to-end manner.

Bottom-up approaches [3, 12, 26] detect body part candidates and group them into individuals. Graph-cut based methods [12, 26] formulate grouping as a graph partitioning optimization problem, while [3, 23] utilize heuristic greedy parsing to speed up decoding. However, these bottom-up approaches only use grouping as post-processing, and no error signals from grouping results are back-propagated.

More recently, efforts have been devoted to end-to-end training or joint optimization. For top-down methods, Xie et al. [34] propose a reinforcement learning agent to bridge the object detector and the pose estimator. For bottom-up methods, Newell et al. [20] propose the keypoint embedding (KE) to tag instances and train it with pairwise losses. Our framework is a bottom-up method inspired by [20]. However, [20] supervises the grouping only indirectly: it trains keypoint embedding descriptors to ease the post-processing grouping, but provides no direct supervision on the grouping results. Even if the pairwise loss of KE is low, it is still possible to produce wrong grouping results, and [20] does not model such grouping loss. We instead propose a differentiable Pose-Guided Grouping (PGG) module to learn to group body parts, making the whole pipeline fully end-to-end trainable and yielding significant improvement in pose estimation and tracking.

Our work is also related to [22, 23], where spatial instance embeddings (SIE) are introduced to aid body part grouping. However, due to lack of grouping supervision, their embeddings are always noisy [22, 23] and additional clustering [22] or refinement [23] is required. We instead employ PGG and additional grouping losses to learn to group SIE, making it end-to-end trainable while resulting in much more compact embedding representation.

Figure 2: The overview of our framework for pose tracking.

2.2 Multi-person Pose Tracking

Recent works on multi-person pose tracking mostly follow the tracking-by-detection paradigm, in which human body parts are first detected in each frame, then data association is performed over time to form trajectories.

Offline pose tracking methods take future frames into consideration, allowing for more robust predictions but having high computational complexity. ProTracker [8] employs 3D Mask R-CNN to improve the estimation of body parts by leveraging temporal context encoded within a sliding temporal window. Graph partitioning based methods [11, 14, 16] formulate multi-person pose tracking into an integer linear programming (ILP) problem and solve spatial-temporal grouping. Such methods achieve competitive performance in complex videos by enforcing long-range temporal consistency.

Our approach is an online pose tracking approach, which is faster and better suited to practical applications. Online pose tracking methods [6, 25, 37, 33] mainly use bipartite graph matching to assign targets in the current frame to existing trajectories. However, they only consider part-level geometric information and ignore global appearance features. When faced with fast pose motion and camera movement, such geometric trackers are prone to tracking errors. We propose to extend SpatialNet to TemporalNet to capture both appearance features in HE and temporal coherence in TIE, resulting in much better tracking performance.

3 Method

As demonstrated in Figure 2, we unify pose estimation and tracking in a single framework. Our framework consists of two major components: SpatialNet and TemporalNet.

SpatialNet tackles multi-person pose estimation by body part detection and part-level spatial grouping. It processes a single frame at a time. Given a frame, SpatialNet produces heatmaps, KE, SIE and geometric-ordinal maps simultaneously. Heatmaps model the body part locations. KE encodes the part-level appearance features, while SIE captures the geometric information about human centers. The auxiliary geometric-ordinal maps enforce ordering constraints on the embedding space to facilitate training of KE. PGG is utilized to make both KE and SIE to be more compact and discriminative. We finally generate the body pose proposals by greedy decoding following [20].

TemporalNet extends SpatialNet to deal with online human-level temporal grouping. It consists of HE branch and TIE branch, and shares the same low-level feature extraction layers with SpatialNet. Given body pose proposals, HE branch extracts region-specific embedding (HE) for each human instance. TIE branch exploits the temporally coherent geometric embedding (TIE). Given HE and TIE as pairwise potentials, a simple bipartite graph matching problem is solved to generate pose trajectories.

3.1 SpatialNet: Part-level Spatial Grouping

Throughout the paper, we use the following notation. Let $p = (x, y)$ be the 2-D position in an image, and $p_{j,k}$ the location of body part $j$ for person $k$. We use $P_k = \{p_{j,k}\}_{j=1}^{J}$ to represent the body pose of the $k$-th person. We use 2D Gaussian confidence heatmaps to model the body part locations. Let $C_{j,k}$ be the confidence heatmap for the $j$-th body part of the $k$-th person, which is calculated by $C_{j,k}(p) = \exp(-\|p - p_{j,k}\|_2^2 / \sigma^2)$ for each position $p$ in the image, where $\sigma$ is set as 2 in the experiments. Following [3], we take the maximum of the confidence heatmaps to get the ground truth confidence heatmap, i.e. $C^*_j(p) = \max_k C_{j,k}(p)$.

The detection loss $L_{det}$ is calculated as the weighted distance between the predicted heatmaps and the ground truth confidence heatmaps.
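As a minimal sketch of the ground-truth heatmap construction described above, the following NumPy snippet places one Gaussian per person and body part and takes the per-pixel maximum over people. The Gaussian form $\exp(-\|p - p_{j,k}\|^2/\sigma^2)$ is assumed from the formulation of [3]; the function name `gaussian_heatmaps` and the toy keypoints are hypothetical.

```python
import numpy as np

def gaussian_heatmaps(keypoints, height, width, sigma=2.0):
    """keypoints: (K, J, 2) array of (x, y) per person k and body part j.

    Returns ground-truth heatmaps of shape (J, height, width), taking the
    per-pixel maximum over the K people for each body part.
    """
    keypoints = np.asarray(keypoints, dtype=np.float64)
    ys, xs = np.mgrid[0:height, 0:width]               # pixel grid
    grid = np.stack([xs, ys], axis=-1)                 # (H, W, 2) as (x, y)
    # Squared distance from every pixel to every keypoint: (K, J, H, W)
    diff = grid[None, None] - keypoints[:, :, None, None, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    per_person = np.exp(-sq_dist / sigma ** 2)         # assumed Gaussian form
    return per_person.max(axis=0)                      # max over people

# Two people with one body part each on a 16x16 image (toy values).
kps = [[[4, 5]], [[10, 12]]]
heat = gaussian_heatmaps(kps, height=16, width=16)
```

Taking the maximum rather than the sum keeps nearby peaks of different people distinct, so each keypoint location retains a response of exactly 1.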


3.1.1 Keypoint Embedding (KE) with auxiliary tasks

We follow [20] to produce the keypoint embedding for each type of body part. However, this kind of embedding representation has several drawbacks. First, the embedding is difficult to interpret [20, 23]. Second, it is hard to learn due to its over-flexibility with no direct supervision available. To overcome these drawbacks, we introduce several auxiliary tasks to facilitate training and improve interpretation. The idea of auxiliary learning [31] has been shown effective in both supervised learning [27] and reinforcement learning [15]. Here, we explore auxiliary training in the context of keypoint embedding representation learning.

By auxiliary training, we explicitly enforce the embedding maps to learn geometric ordinal relations. Specifically, we define six auxiliary tasks: to predict the ’left-to-right’ l2r, ’right-to-left’ r2l, ’top-to-bottom’ t2b, ’bottom-to-top’ b2t, ’far-to-near’ f2n and ’near-to-far’ n2f orders of human instances in a single image. For example, in the ‘left-to-right’ map, the person from left to right in the images should have low to high order (value). Fig. 4 (c)(d)(e) visualize some example predictions of the auxiliary tasks. We see human instances are clearly arranged in the corresponding geometric ordering. We also observe that KE (Fig. 4 (b)) and the geometric ordinal-relation maps (c)(d)(e) share some similar patterns, which suggests that KE acquires some knowledge of geometric ordering.

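Following [20], KE, denoted by $\mathcal{K}$, is trained with the pairwise grouping loss $L_{KE} = L_{pull} + L_{push}$. The pull loss (Eq. 2) is computed as the squared distance between the human reference embedding and the predicted embedding of each joint. The push loss (Eq. 3) is calculated between different reference embeddings, and drops exponentially to zero as the embedding difference increases. Formally, we define the reference embedding for the $k$-th person as $\bar{m}_k = \frac{1}{J}\sum_{j} \mathcal{K}_j(p_{j,k})$, where $\mathcal{K}_j(p_{j,k})$ is the embedding predicted at the location $p_{j,k}$ of body part $j$ of person $k$, and $K$ is the number of persons.

$$L_{pull} = \frac{1}{JK}\sum_{k}\sum_{j}\left(\mathcal{K}_j(p_{j,k}) - \bar{m}_k\right)^2 \qquad (2)$$

$$L_{push} = \frac{1}{K^2}\sum_{k}\sum_{k'}\exp\left(-\frac{1}{2}\left(\bar{m}_k - \bar{m}_{k'}\right)^2\right) \qquad (3)$$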
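The pull and push terms can be sketched in a few lines of NumPy. This is an illustration in the style of the associative embedding loss of [20], not the authors' code: the normalization constants, the unit kernel width, and the name `grouping_loss` are assumptions, and the input is taken to be the embedding values already sampled at the keypoint locations.

```python
import numpy as np

def grouping_loss(embeddings):
    """embeddings: (K, J) array, the 1-D embedding sampled at joint j of person k.

    Returns (pull, push). The push term averages over all (k, k') pairs,
    including k == k', which only adds a constant 1/K.
    """
    emb = np.asarray(embeddings, dtype=np.float64)
    ref = emb.mean(axis=1)                         # reference embedding per person
    pull = np.mean((emb - ref[:, None]) ** 2)      # squared distance to own reference
    diff = ref[:, None] - ref[None, :]             # pairwise reference differences
    push = np.mean(np.exp(-0.5 * diff ** 2))       # decays as references separate
    return pull, push

# Well-grouped embeddings versus embeddings mixed between two people (toy values).
pull_good, push_good = grouping_loss([[0.0, 0.1, -0.1], [5.0, 5.1, 4.9]])
pull_bad, push_bad = grouping_loss([[0.0, 5.0, 0.1], [5.1, -0.1, 4.9]])
```

Both terms are lower for the well-grouped case: joints sit close to their own reference, and the two references are far apart.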


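For auxiliary training, we replace the push loss with the ordinal loss but keep the pull loss (Eq. 2) the same.

$$L_{aux} = \frac{1}{K^2}\sum_{k}\sum_{k'}\log\left(1 + \exp\left(\mathrm{Ord}(k,k')\,(\bar{m}_k - \bar{m}_{k'})\right)\right)$$

where $\bar{m}_k$ is the reference embedding of the $k$-th person and $\mathrm{Ord}(k,k') \in \{1, -1\}$ indicates the ground-truth order for persons $k$ and $k'$. In l2r, r2l, t2b, and b2t, we sort human instances by their centroid locations. For example, in l2r, if the $k$-th person is on the left of the $k'$-th person, then $\mathrm{Ord}(k,k') = 1$, otherwise $\mathrm{Ord}(k,k') = -1$. In f2n and n2f, we sort them according to the head size.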
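To make the ordering supervision concrete, the sketch below builds a pairwise order matrix from person centroids and evaluates a logistic ranking loss on the reference embeddings. The logistic form, the sign convention (+1 when person k precedes k'), and the names `order_matrix` and `ordinal_loss` are assumptions made for illustration.

```python
import numpy as np

def order_matrix(values):
    """Ord[k, k'] = 1 if person k comes before k' in the ordering, else -1."""
    v = np.asarray(values, dtype=np.float64)
    return np.where(v[:, None] < v[None, :], 1.0, -1.0)

def ordinal_loss(ref_embeddings, order):
    """Assumed logistic ranking loss on per-person reference embeddings."""
    m = np.asarray(ref_embeddings, dtype=np.float64)
    diff = m[:, None] - m[None, :]                  # m_k - m_k'
    off_diag = ~np.eye(len(m), dtype=bool)          # skip k == k' pairs
    # log(1 + exp(Ord * diff)), computed stably with logaddexp
    return np.mean(np.np_logaddexp(0.0, order[off_diag] * diff[off_diag])
                   if hasattr(np, "np_logaddexp")
                   else np.logaddexp(0.0, order[off_diag] * diff[off_diag]))

# 'left-to-right' ordering from centroid x coordinates (toy values).
ord_l2r = order_matrix([30.0, 200.0, 110.0])
loss_correct = ordinal_loss([0.0, 10.0, 5.0], ord_l2r)    # values grow left to right
loss_reversed = ordinal_loss([10.0, 0.0, 5.0], ord_l2r)   # ordering violated
```

For the other orderings only the sort key changes: centroid y for t2b and b2t, head size for f2n and n2f, with the sign flipped for the reversed directions.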

3.1.2 Spatial Instance Embedding (SIE)

For lack of geometric information, KE has difficulty in separating instances and tends to erroneously group distant body parts together. To remedy this, we combine KE with SIE to embody instance-wise geometric cues. Concretely, we predict dense offset spatial vector fields (SVF), where each 2-D vector encodes the relative displacement from the human center to its absolute location $p$. Fig. 4(f)(g) visualize the spatial vector fields of the x-axis and y-axis, which distinguish the left/right sides and upper/lower sides relative to the body center. As shown in Fig. 3, by subtracting the SVF from the pixel coordinates, SVF can be decoded to SIE, in which each pixel is encoded with the human center location.

We denote the spatial vector fields (SVF) by $\hat{\mathcal{S}}$, and SIE by $\mathcal{S}$. We use the $\ell_1$ distance to train SVF, where the ground truth spatial vector is the displacement from the person center to each body part.

$$L_{SIE} = \frac{1}{JK}\sum_{k}\sum_{j}\left\|\hat{\mathcal{S}}(p_{j,k}) - (p_{j,k} - c_k)\right\|_1$$

where $p_{j,k}$ is the location of body part $j$ of person $k$, and $c_k$ is the center of person $k$.
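The decoding from SVF to SIE amounts to subtracting the predicted offset from each pixel's own coordinate. A minimal NumPy sketch follows, assuming the SVF is stored as a `(2, H, W)` array of `(dx, dy)` offsets pointing from the person center to the pixel; the name `decode_sie` and the toy center are hypothetical.

```python
import numpy as np

def decode_sie(svf):
    """svf: (2, H, W) offsets (dx, dy) from the person center to each pixel.

    Returns SIE of shape (2, H, W): the predicted center (x, y) at every pixel.
    """
    _, h, w = svf.shape
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([xs, ys]).astype(np.float64)   # absolute (x, y) per pixel
    return coords - svf                               # center = location - offset

# Ground-truth SVF for one person centered at (6, 3) on an 8x10 image.
center = np.array([6.0, 3.0])
ys, xs = np.mgrid[0:8, 0:10]
svf_gt = np.stack([xs - center[0], ys - center[1]])
sie = decode_sie(svf_gt)
```

With a perfect SVF every pixel of the person decodes to the same center, which is what makes SIE usable as an instance label for grouping.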

Figure 3: Spatial keypoint grouping with Pose-Guided Grouping (PGG). We obtain more compact and accurate Keypoint Embedding (KE) and Spatial Instance Embedding (SIE) with PGG.
Figure 4: (a) input image. (b) the average KE. (c)(d)(e) predicted ’left-to-right’, ’top-to-bottom’ and ’far-to-near’ geometric-relation maps. We use colors to indicate the predicted orders, where the brighter color means the higher ordinal value. (f)(g) are the spatial vector fields of x-axis and y-axis respectively. The bright color means positive offset relative to the human center, while dark color means negative.

3.2 Pose-Guided Grouping (PGG) Module

In prior bottom-up methods [3, 22, 23], detection and grouping are separated. We reformulate the grouping process into a differentiable Pose-Guided Grouping (PGG) module for end-to-end training. By directly supervising the grouping results, more accurate estimation is obtained.

Our PGG is based on the Gaussian Blurring Mean Shift (GBMS) [4] algorithm and inspired by [17], which was originally proposed for segmentation. However, directly applying GBMS to the challenging articulated tracking task is not desirable. First, the complexity of GBMS is $O(n^2)$, where $n$ is the number of feature vectors to group. Direct use of GBMS on the whole image will lead to huge memory consumption. Second, the predicted embeddings are always noisy, especially in background regions, where no supervision is available during training. As illustrated in the top row of Fig. 4, embedding noise exists in the background area (the ceiling or the floor). The noise in these irrelevant regions will affect the mean-shift grouping accuracy. We propose a novel Pose-Guided Grouping module to address the above drawbacks. Considering that body parts only occupy a small area in images, we propose to use the human pose mask to guide grouping, which rules out irrelevant areas and significantly reduces the memory cost. As shown in Fig. 3, we apply max along the channel of the predicted heatmaps $\hat{C}_j$ and generate the instance-agnostic pose mask $M$ by thresholding at $\tau$: $M(p)$ is 1 if $\max_j \hat{C}_j(p) > \tau$, otherwise 0.

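Algorithm 1 Pose-Guided Grouping
Input: KE $\mathcal{K}$, SIE $\mathcal{S}$, mask $M$, and iteration number $R$.
Concatenate $\mathcal{K}$ and $\mathcal{S}$, mask-selected by $M$, and reshape to $X^{(0)} \in \mathbb{R}^{d \times N}$.
for $r = 1, \dots, R$ do
     Gaussian affinity $W \in \mathbb{R}^{N \times N}$, computed between the columns of $X^{(r-1)}$.
     Normalization matrix $D = \mathrm{diag}(W \mathbf{1})$.
     Update $X^{(r)} = X^{(r-1)} W D^{-1}$.
end for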

Both spatial (KE and SIE) and temporal (TIE) embeddings can be grouped by PGG. Taking spatial grouping as an example, we refine KE and SIE with the PGG module to get more compact and discriminative embedding descriptors. The Pose-Guided Grouping algorithm is summarized in Alg. 1. KE and SIE are first concatenated to $d$-dimensional feature maps. Then embeddings are selected according to the binary pose mask $M$ and reshaped to $X^{(0)} \in \mathbb{R}^{d \times N}$ as initialization, where $N$ is the number of non-zero elements in $M$, which is much smaller than the number of pixels. Recurrent mean-shift grouping is then applied to $X^{(0)}$ for $R$ iterations. In each iteration, the Gaussian affinity $W \in \mathbb{R}^{N \times N}$ is first calculated with the isotropic multivariate normal kernel, where the kernel bandwidth $\delta$ is empirically chosen as 5 in the experiments. $W$ can be viewed as the weighted adjacency matrix. The diagonal matrix of affinity row sums, $D = \mathrm{diag}(W \mathbf{1})$, is used for normalization, where $\mathbf{1}$ denotes a vector with all entries one. We then update $X$ with the normalized Gaussian kernel weighted mean, $X^{(r)} = X^{(r-1)} W D^{-1}$. After several iterations of grouping refinement, the embeddings become distinct for heterogeneous pairs and similar for homogeneous ones. When training, we apply the pairwise pull/push losses (Eq. 2 and 3) over the grouping results $X^{(1)}, \dots, X^{(R)}$ of all iterations.
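The grouping procedure can be sketched in NumPy as masked Gaussian blurring mean shift. This is an illustrative forward pass only, under stated assumptions: the kernel is parameterized as $\exp(-\|x_i - x_j\|^2 / (2\delta^2))$, the mask threshold of 0.5 is a placeholder, and the function name and toy inputs are hypothetical. A trainable version would be written in an autodiff framework so that the grouping losses can be back-propagated.

```python
import numpy as np

def pose_guided_grouping(ke, sie, heatmaps, threshold=0.5, bandwidth=5.0, iters=1):
    """ke: (J, H, W), sie: (2, H, W), heatmaps: (J, H, W).

    Returns the pose mask and the list of refined embeddings, one (d, N)
    array per mean-shift iteration.
    """
    mask = heatmaps.max(axis=0) > threshold             # instance-agnostic pose mask
    feats = np.concatenate([ke, sie], axis=0)           # (d, H, W)
    x = feats[:, mask].astype(np.float64)               # (d, N) masked embeddings
    history = []
    for _ in range(iters):
        sq = np.sum((x[:, :, None] - x[:, None, :]) ** 2, axis=0)   # (N, N) distances
        w = np.exp(-sq / (2.0 * bandwidth ** 2))        # Gaussian affinity (assumed form)
        d_inv = 1.0 / w.sum(axis=0)                     # inverse of diag(W 1); W is symmetric
        x = (x @ w) * d_inv[None, :]                    # X <- X W D^{-1}
        history.append(x)
    return mask, history

# Toy scene: two people, each covering a 3x3 patch, with noisy embeddings.
rng = np.random.default_rng(0)
heat = np.zeros((1, 20, 20))
heat[0, 2:5, 2:5] = 1.0
heat[0, 14:17, 14:17] = 1.0
ke = np.zeros((1, 20, 20))
ke[0, 14:17, 14:17] = 1.0
ke += rng.normal(0.0, 0.3, ke.shape)
sie = np.zeros((2, 20, 20))
sie[:, 2:5, 2:5] = 3.0                                  # center (3, 3)
sie[:, 14:17, 14:17] = 15.0                             # center (15, 15)
sie += rng.normal(0.0, 1.0, sie.shape)
mask, history = pose_guided_grouping(ke, sie, heat)
```

Only the 18 masked pixels enter the affinity matrix, so it has 18 x 18 entries instead of 400 x 400 for the full 20 x 20 image, which illustrates where the memory saving comes from.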

3.3 TemporalNet: Human Temporal Grouping

TemporalNet extends SpatialNet to perform human-level temporal grouping in an online manner. Formally, we use the superscript $t$ to distinguish different frames. $I^t$ denotes the input frame at time-step $t$, which contains $K^t$ persons. SpatialNet is applied to $I^t$ to estimate a set of poses $\mathcal{P}^t = \{P^t_1, \dots, P^t_{K^t}\}$. TemporalNet aims at temporally grouping human pose proposals in the current frame with already tracked poses in the previous frame. TemporalNet exploits both human-level appearance features (HE) and temporally coherent geometric information (TIE) to calculate the total pose similarity. Finally, we generate the pose trajectories by solving a bipartite graph matching problem, using pose similarity as pairwise potentials.

3.3.1 Human Embedding (HE)

To obtain the human-level appearance embedding (HE), we introduce a region-specific HE branch based on [36]. Given predicted pose proposals, the HE branch first calculates human bounding boxes to cover the corresponding human keypoints. For each bounding box, ROI-Align pooling [9] is applied to the shared low-level feature maps to extract region-adapted ROI features. The ROI features are then mapped to the human embedding $H$. HE is trained with the triplet loss [30], pulling HE of the same instance closer, and pushing apart embeddings of different instances.

$$L_{HE} = \sum \max\left(0, \|H_a - H_p\|_2^2 - \|H_a - H_n\|_2^2 + \alpha\right)$$

where $H_a$ and $H_p$ are embeddings of the same instance, $H_n$ is the embedding of a different instance, the sum runs over such triplets, and the margin term $\alpha$ is set to 0.3 in the experiments.
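For reference, a triplet loss with margin 0.3 can be written as below. The sketch uses squared Euclidean distances as in the original triplet formulation of [30]; whether the authors use squared or plain distances, and how triplets are mined, is not stated here, so both are assumptions, as is the function name.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge on (squared distance to positive) - (squared distance to negative) + margin."""
    a, p, n = (np.asarray(v, dtype=np.float64) for v in (anchor, positive, negative))
    d_pos = np.sum((a - p) ** 2, axis=-1)    # same instance: should be small
    d_neg = np.sum((a - n) ** 2, axis=-1)    # different instance: should be large
    return np.maximum(0.0, d_pos - d_neg + margin)

# Toy 2-D embeddings: an easy triplet and a violated one.
easy = triplet_loss([0.0, 0.0], [0.0, 0.1], [1.0, 1.0])
hard = triplet_loss([0.0, 0.0], [1.0, 1.0], [0.0, 0.1])
```

The loss is zero once the negative is farther than the positive by at least the margin, so training effort concentrates on confusable pairs of people.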

3.3.2 Temporal Instance Embedding (TIE)

To exploit the temporal information for pose tracking, we naturally extend the Spatial Instance Embedding (SIE) to the Temporal Instance Embedding (TIE). TIE branch concatenates low-level features, body part detection heatmaps and SIE from two neighboring frames. The concatenated feature maps are then mapped to dense TIE.

TIE is a task-specific representation which measures the displacement between a keypoint in one frame and the human center in another frame. This design utilizes the mutual information between keypoints and humans in adjacent frames to handle occlusion and pose motion simultaneously. Specifically, we introduce bi-directional temporal vector fields (TVF), denoted as $\hat{\mathcal{T}}$ (forward) and $\hat{\mathcal{T}}'$ (backward). Forward TVF encodes the relative displacement from the human center in the $(t-1)$-th frame to body parts in the $t$-th frame; it temporally propagates the human centroid embeddings from the $(t-1)$-th to the $t$-th frame. In contrast, Backward TVF represents the offset from the body center in the current $t$-th frame to body parts in the previous $(t-1)$-th frame. At body part locations, the ground-truth vectors are

$$\hat{\mathcal{T}}(p^{t}_{j,k}) = p^{t}_{j,k} - c^{t-1}_{k}, \qquad \hat{\mathcal{T}}'(p^{t-1}_{j,k}) = p^{t-1}_{j,k} - c^{t}_{k}$$

where $p^{t}_{j,k}$ is the location of body part $j$ of person $k$ and $c^{t}_{k}$ is the center of person $k$ at time step $t$. Subtracting the TVF from the absolute locations, we get the corresponding Forward TIE $\mathcal{T}$ and Backward TIE $\mathcal{T}'$. Thereby, TIE encodes the temporally propagated human centroid. Likewise, we also extend the idea of spatial grouping to temporal grouping. TemporalNet outputs Forward TIE $\mathcal{T}$ and Backward TIE $\mathcal{T}'$, which are refined by PGG independently. Taking Forward TIE as an example, we generate the pose mask $M$ using body heatmaps from the $t$-th frame. We rule out irrelevant regions of $\mathcal{T}$ and reshape it to $X \in \mathbb{R}^{2 \times N}$, where $N$ is the number of non-zero elements in $M$. Subsequently, recurrent mean-shift grouping is applied. Again, additional grouping losses (Eq. 2, 3) are used to train TIE.
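The bi-directional targets can be illustrated with a short NumPy sketch. It assumes that the same people appear in both frames in the same order and that a person's center is the mean of its keypoints; the latter is an assumption, since the center definition is not given here, and `tvf_targets` is a hypothetical name.

```python
import numpy as np

def tvf_targets(kps_prev, kps_curr):
    """kps_*: (K, J, 2) keypoints of the same K people in frames t-1 and t.

    Returns (forward, backward) ground-truth temporal vectors, each (K, J, 2).
    """
    kps_prev = np.asarray(kps_prev, dtype=np.float64)
    kps_curr = np.asarray(kps_curr, dtype=np.float64)
    c_prev = kps_prev.mean(axis=1, keepdims=True)    # assumed center at t-1
    c_curr = kps_curr.mean(axis=1, keepdims=True)    # assumed center at t
    forward = kps_curr - c_prev                       # center(t-1) -> parts(t)
    backward = kps_prev - c_curr                      # center(t)   -> parts(t-1)
    return forward, backward

# One person with two joints, moving 5 pixels to the right between frames.
prev = np.array([[[0.0, 0.0], [2.0, 2.0]]])
curr = np.array([[[5.0, 0.0], [7.0, 2.0]]])
fwd, bwd = tvf_targets(prev, curr)
forward_tie = curr - fwd      # decodes to the center in frame t-1
backward_tie = prev - bwd     # decodes to the center in frame t
```

Every joint of the person in frame t decodes to the same previous-frame center, so Forward TIE can be compared directly with the centers encoded by SIE in the previous frame.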

3.3.3 Pose Tracking

The problem of temporal pose association is formulated as a bipartite graph based energy maximization problem. The estimated poses $\mathcal{P}^t$ are associated with the previous poses $\mathcal{P}^{t-1}$ by bipartite graph matching.

$$z^* = \arg\max_{z} \sum_{k,k'} \Psi_{k,k'}\, z_{k,k'} \quad \text{s.t.} \quad \sum_{k} z_{k,k'} \le 1 \;\; \forall k', \quad \sum_{k'} z_{k,k'} \le 1 \;\; \forall k \qquad (8)$$

$z_{k,k'} \in \{0, 1\}$ is a binary variable which indicates whether the pose hypotheses $P^t_k$ and $P^{t-1}_{k'}$ are associated. The pairwise potentials $\Psi$ represent the similarity between pose hypotheses: $\Psi = \lambda_{HE}\Psi_{HE} + \lambda_{TIE}\Psi_{TIE}$, with $\Psi_{HE}$ for human-level appearance similarity and $\Psi_{TIE}$ for temporal smoothness. $\lambda_{HE}$ and $\lambda_{TIE}$ are hyperparameters to balance them.

The human-level appearance similarity is calculated from the embedding distance between the HE of the two hypotheses. The temporal smoothness term is computed as the similarity between the human center locations encoded in SIE and the temporally propagated TIE.

The bipartite graph matching problem (Eq. 8) is solved using the Munkres algorithm to generate pose trajectories.
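The association step can be sketched with SciPy's `linear_sum_assignment`, which solves the same linear assignment problem that the Munkres algorithm addresses. The similarity matrices, the unit weights, and the `min_sim` gate below are hypothetical placeholders; how similarities are derived from HE and TIE distances is left out.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(sim_he, sim_tie, lambda_he=1.0, lambda_tie=1.0, min_sim=-np.inf):
    """Rows index poses in the current frame, columns index tracked poses.

    Returns a list of (current_index, previous_index) matches that maximize
    the total weighted similarity.
    """
    psi = lambda_he * np.asarray(sim_he) + lambda_tie * np.asarray(sim_tie)
    rows, cols = linear_sum_assignment(-psi)      # minimizing -psi maximizes psi
    return [(int(r), int(c)) for r, c in zip(rows, cols) if psi[r, c] > min_sim]

# Three poses in the current frame, two existing tracks (toy similarities).
sim_he = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.1]]
sim_tie = [[0.8, 0.3], [0.1, 0.9], [0.2, 0.2]]
matches = associate(sim_he, sim_tie)
```

Here the third current pose stays unmatched and would start a new trajectory in an online tracker; raising `min_sim` additionally rejects weak matches.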

3.4 Implementation Details

Following [20], SpatialNet uses the 4-stage stacked hourglass as its backbone. We first train SpatialNet without PGG. The total loss consists of the detection loss $L_{det}$, the KE grouping loss $L_{KE}$, the auxiliary ordinal loss $L_{aux}$ and the SIE loss $L_{SIE}$, with weights 1:1e-3:1e-4:1e-4. We set the initial learning rate to 2e-4 and reduce it to 1e-5 after 250K iterations. Then we fine-tune SpatialNet with PGG included. In practice, we have found that a small iteration number $R$ is sufficient, and more iterations do not lead to much gain.

TemporalNet uses a 1-stage hourglass model [21]. When training, we simply fix SpatialNet and train TemporalNet for another 40 epochs with a learning rate of 2e-4. We randomly select a pair of images $I^t$ and $I^{t'}$ from a range-5 temporal window ($|t - t'| \le 5$) in a video clip as input.

4 Experiments

4.1 Datasets and Evaluation

MS-COCO Dataset [19] contains over 66k images with 150k people and 1.7 million labeled keypoints, for pose estimation in images. For the MS-COCO results, we follow the same train/val split as [20], where a held-out set of 500 training images is used for evaluation.

ICCV’17 PoseTrack Challenge Dataset [13] is a large-scale benchmark for multi-person articulated tracking, which contains 250 video clips for training and 50 sequences of videos for validation.

Evaluation Metrics: We follow [13] to use AP to evaluate multi-person pose estimation and the multi-object tracking accuracy (MOTA) [2] to measure tracking performance.
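For readers unfamiliar with the tracking metric, MOTA [2] aggregates misses, false positives and identity switches relative to the number of ground-truth objects. The helper below is a sketch of that standard definition; the counts in the example are hypothetical and are not results from this paper.

```python
def mota(false_negatives, false_positives, id_switches, num_ground_truth):
    """Multi-object tracking accuracy: 1 - (FN + FP + IDSW) / GT."""
    if num_ground_truth <= 0:
        raise ValueError("num_ground_truth must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_ground_truth

# Hypothetical counts accumulated over a sequence.
score = mota(false_negatives=120, false_positives=60, id_switches=20,
             num_ground_truth=1000)
```

Because all three error types are summed, MOTA can become negative when a tracker makes more errors than there are ground-truth objects.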

4.2 Comparisons with the State-of-the-art Methods

We compare our framework with the state-of-the-art methods on both pose estimation and tracking on the ICCV’17 PoseTrack validation set. As is common practice [13], additional images from MPII-Pose [1] are used for training. Table 1 shows our single-frame pose estimation performance. Our model achieves state-of-the-art mAP without single-person pose model refinement. Table 2 evaluates the multi-person articulated tracking performance. Our model outperforms the state-of-the-art methods by a large margin. Compared with the winner of the ICCV’17 PoseTrack Challenge (ProTracker [8]), our method obtains an improvement of 16.6% in MOTA. Our model further improves over the current state-of-the-art pose tracker (FlowTrack [33]) by 6.4% in MOTA with comparable single-frame pose estimation accuracy, indicating the effectiveness of our TemporalNet.

Method Head Shou Elb Wri Hip Knee Ankl Total
ProTracker [8]
PoseFlow [35]
BUTDS [16]
ArtTrack [13]
ML_Lab [37]
FlowTrack [33]
Table 1: Comparisons with the state-of-the-art methods on single-frame pose estimation on ICCV’17 PoseTrack Challenge Dataset.
Head Shou Elb Wri Hip Knee Ankl Total
ArtTrack [13]
ProTracker [8]
BUTD2 [16]
PoseFlow [35]
JointFlow [6] - - - - - - -
FlowTrack [33]
Table 2: Comparisons with the state-of-the-art methods on multi-person pose tracking on ICCV’17 PoseTrack Challenge Dataset.
Figure 5: Learning curves of keypoint embedding (KE) with (orange) or without (cyan) auxiliary training.
Figure 6: (a) Histogram of the memory cost ratio between PGG and GBMS [4] on the PoseTrack val set. Using the instance-agnostic pose mask, PGG reduces the memory consumption relative to GBMS. (b) Runtime analysis. CNN processing time is measured on one GTX-1060 GPU, while PoseTrack [14] and our tracking algorithm are tested on a single core of a 2.4GHz CPU. The number of people in a frame is 5.97 on average for the PoseTrack val set.

4.3 Ablation Study

We extensively evaluate the effect of each component in our framework. Table 3 summarizes the single-frame pose estimation results, and Table 4 the pose tracking results.

For pose estimation we choose [20] as our baseline, which proposes KE for spatial grouping. We also compare with one alternative embedding approach [18] for design justification. In BBox [18], instance location information is encoded as the human bounding box (x, y, w, h) at each pixel. The predicted bounding boxes are then used to group keypoints into individuals. However, such a representation is hard to learn due to large variations of its embedding space, resulting in worse pose estimation accuracy compared to KE and SIE. KE provides part-level appearance cues, while SIE encodes the human centroid constraints. When the two are combined, a large gain is obtained over either one alone (Table 3). As shown in Fig. 5, adding auxiliary tasks (+aux) dramatically speeds up the training of KE by enforcing geometric constraints on the embedding space. It also facilitates representation learning and marginally enhances pose estimation. As shown in Table 3, employing PGG significantly improves the pose estimation accuracy for KE, for SIE, and for both combined. End-to-end model training and direct grouping supervision together account for the improvement. Additionally, using the instance-agnostic pose mask, the memory consumption is remarkably reduced, as shown in Fig. 6(a), demonstrating the efficiency of PGG. Combining both KE and SIE with PGG further boosts the pose estimation performance.

For pose tracking, we first build a baseline tracker based on KE and/or SIE, under the assumption that KE and SIE change smoothly in consecutive frames. Somewhat surprisingly, such a simple tracker already achieves competitive performance, thanks to the rich geometric information contained in KE and SIE. Employing TemporalNet for tracking significantly improves over the baseline tracker, because it combines the holistic appearance features of HE with the temporal smoothness of TIE. Finally, incorporating spatial-temporal PGG to refine KE, SIE and TIE further increases the tracking performance in MOTA (Table 4). We also compare with some widely used alternative tracking metrics, namely Object Keypoint Similarity (OKS), Intersection over Union (IoU) of persons and DeepMatching (DM) [29], for design justification. We find that TemporalNet significantly outperforms other trackers with task-agnostic tracking metrics. OKS only uses keypoints, which helps in handling occlusion, while IoU and DM only consider the whole human, which helps in handling fast motion. In comparison, TemporalNet exploits both.

MS-COCO Results. Our SpatialNet substantially improves over our baseline [20] on single frame pose estimation on the MS-COCO dataset. For fair comparisons, we use the same train/val split as [20] for evaluation. Table 5 reports both single-scale (sscale) and multi-scale (mscale) results. Four different scales are used for multi-scale inference. Our sscale SpatialNet already achieves competitive performance against mscale baseline. By multi-scale inference, we further gain a significant improvement of 3% AP. All reported results are obtained without model ensembling or pose refinement [3, 20].

Head Shou Elb Wri Hip Knee Ankl Total
BBox [18]
KE [20]
Table 3: Ablation study on single-frame pose estimation (AP) on ICCV’17 PoseTrack validation set. aux means auxiliary training with geometric ordinal prediction. Ours (KE+SIE+aux+PGG) combines KE+SIE+aux with PGG for accurate pose estimation.
Head Shou Elb Wri Hip Knee Ankl Total
DM [29]
Table 4: Ablation study on multi-person articulated tracking on ICCV’17 PoseTrack validation set. Ours (HE+TIE+PGG) combines HE+TIE with PGG grouping for robust tracking.
Assoc. Embed. [20] (sscale)
Assoc. Embed. [20] (mscale)
Ours (sscale)
Ours (mscale)
Table 5: Multi-human pose estimation performance on the subset of MS-COCO dataset. mscale means multi-scale testing.

4.4 Runtime Analysis

Fig. 6(b) analyzes the runtime performance of pose estimation and tracking. For pose estimation, we compare with both top-down and bottom-up [20] approaches. The top-down pose estimator uses Faster RCNN [28] and a ResNet-152 [10] based single person pose estimator (SPPE) [33]. Since it estimates pose for each person independently, the runtime grows proportionally to the number of people.

Compared with [20], our SpatialNet significantly improves the pose estimation accuracy with only a limited increase in computational complexity. For pose tracking, we compare with the graph-cut based tracker (PoseTrack [14]) and show the efficiency of TemporalNet.

5 Conclusion

We have presented a unified pose estimation and tracking framework, which is composed of SpatialNet and TemporalNet: SpatialNet tackles body part detection and part-level spatial grouping, while TemporalNet accomplishes the temporal grouping of human instances. We propose to extend KE and SIE in still images to HE appearance features and TIE temporally consistent geometric features in videos for robust online tracking. An effective and efficient Pose-Guided Grouping module is proposed to gain the benefits of full end-to-end learning of pose estimation and tracking.