Multi-person articulated tracking aims at predicting the body parts of each person and associating them across time. It has stimulated much research interest because of its importance in various applications such as video understanding and action recognition. In recent years, significant progress has been made in single-frame human pose estimation [3, 9, 12, 24]. However, multi-person articulated tracking in complex videos remains challenging. Videos may contain a varying number of interacting people with frequent body part occlusion, fast body motion, large pose changes, and scale variation. Camera movement and zooming pose further challenges.
Pose tracking can be viewed as a hierarchical detection and grouping problem. At the part level, body parts are detected and grouped spatially into human instances in each single frame. At the human level, the detected human instances are grouped temporally into trajectories.
Embedding can be viewed as a kind of permutation-invariant instance label to distinguish different instances. Previous works perform keypoint grouping with Keypoint Embedding (KE). KE is a set of 1-D appearance embedding maps where joints of the same person have similar embedding values and those of different people have dissimilar ones. However, due to the over-flexibility of the embedding space, such representations are difficult to interpret and hard to learn. Arguably, a more natural way for a human to assign IDs to targets in an image is by counting in a specific order (from left to right and/or from top to bottom). This inspires us to enforce geometric ordering constraints on the embedding space to facilitate training. Specifically, we add six auxiliary ordinal-relation prediction tasks for faster convergence and better interpretation of KE by encoding the knowledge of geometric ordering. Recently, Spatial Instance Embedding (SIE) [22, 23] was introduced for body part grouping. SIE is a 2-D embedding map, where each pixel is encoded with the predicted human center location (x, y). Fig. 1(a) illustrates the typical error patterns of pose estimation with KE or SIE. SIE may over-segment a single pose into several parts (column 2), while KE sometimes erroneously groups far-away body parts together (column 3). KE better preserves intra-class consistency but has difficulty in separating instances for lack of geometric constraints. Since KE captures appearance features while SIE extracts geometric information, they are naturally complementary to each other. Therefore we combine them to achieve better grouping results.
In this paper, we propose to extend the idea of using appearance and geometric information in a single frame to the temporal grouping of human instances for pose tracking. Previous pose tracking algorithms mostly rely on task-agnostic similarity metrics such as the Object Keypoint Similarity (OKS) [33, 35] and Intersection over Union (IoU). However, such simple geometric cues are not robust to fast body motion, pose changes, camera movement and zoom. For robust pose tracking, we extend the idea of part-level spatial grouping to human-level temporal grouping. Specifically, we extend KE to Human Embedding (HE) for capturing holistic appearance features and extend SIE to Temporal Instance Embedding (TIE) for achieving temporal consistency. Intuitively, appearance features encoded by HE are more robust to fast motion, camera movement and zoom, while temporal information embodied in TIE is more robust to body pose changes and occlusion. We propose a novel TemporalNet to enjoy the best of both worlds. Fig. 1(b) demonstrates typical error patterns of pose tracking with HE or TIE. HE exploits scale-invariant appearance features which are robust to camera zooming and movement (column 1), and TIE preserves temporal consistency which is robust to human pose changes (column 4).
Bottom-up pose estimation methods follow a two-stage pipeline: they generate body part proposals at the first stage and group them into individuals at the second stage. Since the grouping is mainly used as post-processing, i.e. graph based optimization [11, 12, 14, 16, 26] or heuristic parsing [3, 23], no error signals from the grouping results are back-propagated. We instead propose a fully differentiable Pose-Guided Grouping (PGG) module, making detection-grouping fully end-to-end trainable. We are able to directly supervise the grouping results, and the grouping loss is back-propagated to the low-level feature learning stages. This enables more effective feature learning by paying more attention to the mistakenly grouped body parts. Moreover, prior methods require post-processing clustering or extra refinement to obtain accurate regression results. Our PGG helps to produce accurate embeddings (see Fig. 1(c)). To improve the pose tracking accuracy, we further extend PGG to temporal grouping of TIE.
In this work, we aim at unifying pose estimation and tracking in a single framework. SpatialNet detects body parts in a single frame and performs part-level spatial grouping to obtain body poses. TemporalNet accomplishes human-level temporal grouping in consecutive frames to track targets across time. These two modules share the feature extraction layers to make more efficient inference.
The main contributions are summarized as follows:
For pose tracking, we extend the KE and SIE in still images to Human Embedding (HE) and Temporal Instance Embeddings (TIE) in videos. HE captures human-level global appearance features to avoid drifting in camera motion, while TIE provides smoother geometric features to obtain temporal consistency.
A fully differentiable Pose-Guided Grouping (PGG) module for both pose estimation and tracking, which enables the detection and grouping to be fully end-to-end trainable. The introduction of PGG and its grouping loss significantly improves the spatial/temporal embedding prediction accuracy.
2 Related Work
2.1 Multi-person Pose Estimation in Images
Recent multi-person pose estimation approaches can be classified into top-down and bottom-up methods. Top-down methods [7, 9, 33, 24] locate each person with a bounding box and then apply single-person pose estimation. They mainly differ in the choices of human detectors and single-person pose estimators [21, 32]. They rely heavily on the object detector and may fail in cluttered scenes, under occlusion, with person-to-person interaction, or on rare poses. More importantly, top-down methods perform single-person pose estimation individually for each human candidate. Thus, their inference time is proportional to the number of people, making it hard to achieve real-time performance. Additionally, the interface between human detection and pose estimation is non-differentiable, making it difficult to train in an end-to-end manner. Bottom-up approaches [3, 12, 26] detect body part candidates and group them into individuals. Graph-cut based methods [12, 26] formulate grouping as a graph partitioning based optimization problem, while [3, 23] utilize a heuristic greedy parsing algorithm to speed up decoding. However, these bottom-up approaches only use grouping as post-processing and no error signals from grouping results are back-propagated.
More recently, efforts have been devoted to end-to-end training or joint optimization. For top-down methods, Xie et al. propose a reinforcement learning agent to bridge the object detector and the pose estimator. For bottom-up methods, Newell et al. propose the keypoint embedding (KE) to tag instances and train it with pairwise losses. Our framework is a bottom-up method inspired by Newell et al., whose approach supervises the grouping only indirectly: it trains keypoint embedding descriptors to ease the post-processing grouping, but provides no direct supervision on the grouping results. Even if the pairwise loss of KE is low, it is still possible to produce wrong grouping results, and their approach does not model such a grouping loss. We instead propose a differentiable Pose-Guided Grouping (PGG) module to learn to group body parts, making the whole pipeline fully end-to-end trainable and yielding significant improvement in pose estimation and tracking.
Our work is also related to [22, 23], where spatial instance embeddings (SIE) are introduced to aid body part grouping. However, due to the lack of grouping supervision, their embeddings are always noisy [22, 23] and additional clustering or refinement is required. We instead employ PGG and additional grouping losses to learn to group SIE, making it end-to-end trainable while resulting in a much more compact embedding representation.
2.2 Multi-person Pose Tracking
Recent works on multi-person pose tracking mostly follow the tracking-by-detection paradigm, in which human body parts are first detected in each frame, then data association is performed over time to form trajectories.
Offline pose tracking methods take future frames into consideration, allowing for more robust predictions but having high computational complexity. ProTracker employs 3D Mask R-CNN to improve the estimation of body parts by leveraging temporal context encoded within a sliding temporal window. Graph partitioning based methods [11, 14, 16] formulate multi-person pose tracking as an integer linear programming (ILP) problem and solve spatial-temporal grouping. Such methods achieve competitive performance in complex videos by enforcing long-range temporal consistency.
Our approach is an online pose tracking approach, which is faster and better suited to practical applications. Online pose tracking methods [6, 25, 37, 33] mainly use bipartite graph matching to assign targets in the current frame to existing trajectories. However, they only consider part-level geometric information and ignore global appearance features. When faced with fast pose motion and camera movement, such geometric trackers are prone to tracking errors. We propose to extend SpatialNet to TemporalNet to capture both appearance features in HE and temporal coherence in TIE, resulting in much better tracking performance.
As demonstrated in Figure 2, we unify pose estimation and tracking in a single framework. Our framework consists of two major components: SpatialNet and TemporalNet.
SpatialNet tackles multi-person pose estimation by body part detection and part-level spatial grouping. It processes a single frame at a time. Given a frame, SpatialNet produces heatmaps, KE, SIE and geometric-ordinal maps simultaneously. Heatmaps model the body part locations. KE encodes the part-level appearance features, while SIE captures the geometric information about human centers. The auxiliary geometric-ordinal maps enforce ordering constraints on the embedding space to facilitate training of KE. PGG is utilized to make both KE and SIE more compact and discriminative. We finally generate the body pose proposals by greedy decoding following .
TemporalNet extends SpatialNet to deal with online human-level temporal grouping. It consists of an HE branch and a TIE branch, and shares the same low-level feature extraction layers with SpatialNet. Given body pose proposals, the HE branch extracts a region-specific embedding (HE) for each human instance. The TIE branch exploits the temporally coherent geometric embedding (TIE). Given HE and TIE as pairwise potentials, a simple bipartite graph matching problem is solved to generate pose trajectories.
3.1 SpatialNet: Part-level Spatial Grouping
Throughout the paper, we use the following notation. Let $p = (x, y)$ be the 2-D position in an image, and $p_{j,k}$ the location of body part $j$ for person $k$. We use $P_k = \{p_{j,k}\}_{j=1}^{J}$ to represent the body pose of the $k$-th person. We use 2D Gaussian confidence heatmaps to model the body part locations. Let $C_{j,k}$ be the confidence heatmap for the $j$-th body part of the $k$-th person, which is calculated by evaluating a Gaussian centered at $p_{j,k}$ at each position $p$ in the image, where the Gaussian width $\sigma$ is set to 2 in the experiments. Following , we take the maximum of the confidence heatmaps over persons to get the ground truth confidence heatmap, i.e. $C^*_j(p) = \max_k C_{j,k}(p)$.
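As an illustration of the heatmap construction above, the following Python sketch builds one Gaussian map per person and takes their pixel-wise maximum. The exact Gaussian normalization is an assumption here (the exponent is the squared distance divided by the squared width), and the function names and toy coordinates are hypothetical.

```python
import math

def part_heatmap(width, height, part_xy, sigma=2.0):
    """Gaussian confidence map for one body part of one person.

    Assumed form: exp(-||p - p_part||^2 / sigma^2).
    """
    px, py = part_xy
    return [[math.exp(-((x - px) ** 2 + (y - py) ** 2) / sigma ** 2)
             for x in range(width)]
            for y in range(height)]

def ground_truth_heatmap(width, height, parts_xy, sigma=2.0):
    """Pixel-wise maximum over the per-person maps of one part type."""
    maps = [part_heatmap(width, height, xy, sigma) for xy in parts_xy]
    return [[max(m[y][x] for m in maps) for x in range(width)]
            for y in range(height)]

# Two people whose (hypothetical) left wrists sit at (2, 3) and (7, 3).
gt = ground_truth_heatmap(10, 6, [(2, 3), (7, 3)])
```

Taking the maximum rather than the sum keeps the peak of each person at full confidence even when two people stand close together.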
The detection loss is calculated as a weighted distance with respect to the ground truth confidence heatmaps.
3.1.1 Keypoint Embedding (KE) with auxiliary tasks
We follow  to produce the keypoint embedding for each type of body part. However, this kind of embedding representation has several drawbacks. First, the embedding is difficult to interpret [20, 23]. Second, it is hard to learn due to its over-flexibility with no direct supervision available. To overcome these drawbacks, we introduce several auxiliary tasks to facilitate training and improve interpretation. The idea of auxiliary learning has been shown to be effective in both supervised learning and reinforcement learning. Here, we explore auxiliary training in the context of keypoint embedding representation learning.
By auxiliary training, we explicitly enforce the embedding maps to learn geometric ordinal relations. Specifically, we define six auxiliary tasks: to predict the 'left-to-right' (l2r), 'right-to-left' (r2l), 'top-to-bottom' (t2b), 'bottom-to-top' (b2t), 'far-to-near' (f2n) and 'near-to-far' (n2f) orders of human instances in a single image. For example, in the 'left-to-right' map, persons from left to right in the image should have low to high order values. Fig. 4 (c)(d)(e) visualize some example predictions of the auxiliary tasks. We see that human instances are clearly arranged in the corresponding geometric ordering. We also observe that KE (Fig. 4 (b)) and the geometric ordinal-relation maps (c)(d)(e) share some similar patterns, which suggests that KE acquires some knowledge of geometric ordering.
Following , KE is trained with a pairwise grouping loss. The pull loss (Eq. 2) is computed as the squared distance between the human reference embedding and the predicted embedding of each joint. The push loss (Eq. 3) is calculated between different reference embeddings and drops exponentially to zero as the embedding difference increases. Formally, each person $k$ is assigned a reference embedding.
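A minimal sketch of the pull and push terms described above, for a single 1-D embedding map. It assumes that the reference embedding of a person is the mean of the predicted embeddings at that person's joints and that the push term is a Gaussian of the difference between reference embeddings; both choices, the normalization constants, and all function names are assumptions made for illustration.

```python
import math

def reference_embeddings(tags):
    """tags[k] lists the predicted 1-D embedding values at person k's joints.

    Assumption: the reference embedding is the per-person mean.
    """
    return [sum(t) / len(t) for t in tags]

def pull_loss(tags):
    """Mean squared distance between each joint embedding and its reference."""
    refs = reference_embeddings(tags)
    total = sum((v - r) ** 2 for t, r in zip(tags, refs) for v in t)
    count = sum(len(t) for t in tags)
    return total / count

def push_loss(tags):
    """Penalty between reference embeddings of different people.

    It decays exponentially as the references move apart.
    """
    refs = reference_embeddings(tags)
    k = len(refs)
    total = sum(math.exp(-0.5 * (refs[a] - refs[b]) ** 2)
                for a in range(k) for b in range(k) if a != b)
    return total / (k * k)

# Two people with three joints each (toy values).
tags = [[1.0, 1.2, 0.8], [4.0, 4.1, 3.9]]
print(pull_loss(tags), push_loss(tags))
```

The pull term is small when each person's joints agree with one another, and the push term is small when different people are far apart in the embedding space.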
For auxiliary training, we replace the push loss with the ordinal loss but keep the pull loss (Eq. 2) the same.
Here, an indicator encodes the ground-truth order of persons $k$ and $k'$. In l2r, r2l, t2b, and b2t, we sort human instances by their centroid locations. For example, in l2r, the indicator takes one value if the $k$-th person is on the left of the $k'$-th person and the other value otherwise. In f2n and n2f, we sort them according to the head size.
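One way to realize such an ordinal loss is a logistic ranking loss over pairs of reference embeddings. The sketch below is an assumption rather than the paper's exact formula: it encodes the ground-truth order as +1 or -1, uses the 'left-to-right' task with hypothetical centroid x-coordinates, and normalizes by the squared number of persons.

```python
import math

def ordinal_loss(refs, centers_x):
    """Logistic ranking loss for the left-to-right (l2r) map.

    refs[k]      reference embedding (order value) of person k
    centers_x[k] x-coordinate of person k's centroid
    A person further left should receive a lower order value.
    """
    k = len(refs)
    if k < 2:
        return 0.0
    total = 0.0
    for a in range(k):
        for b in range(k):
            if a == b:
                continue
            # Assumed encoding: +1 if a is left of b, otherwise -1.
            order = 1.0 if centers_x[a] < centers_x[b] else -1.0
            total += math.log1p(math.exp(order * (refs[a] - refs[b])))
    return total / (k * k)

good = ordinal_loss([0.0, 5.0], [10, 50])   # values increase left to right
bad = ordinal_loss([5.0, 0.0], [10, 50])    # values decrease left to right
```

With this form the loss is small when the predicted order values agree with the geometric order and grows roughly linearly with the size of a violation.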
3.1.2 Spatial Instance Embedding (SIE)
For lack of geometric information, KE has difficulty in separating instances and tends to erroneously group distant body parts together. To remedy this, we combine KE with SIE to embody instance-wise geometric cues. Concretely, we predict the dense offset spatial vector fields (SVF), where each 2-D vector encodes the relative displacement from the human center to its absolute location. Fig. 4(f)(g) visualize the spatial vector fields along the x-axis and y-axis, which distinguish the left/right sides and upper/lower sides relative to the body center. As shown in Fig. 3, subtracting each SVF vector from its pixel coordinate decodes SVF into SIE, in which each pixel is encoded with the human center location.
We denote the spatial vector fields (SVF) by $\hat{S}$ and SIE by $S$. SVF is trained with a distance loss, where the ground truth spatial vector at body part $p_{j,k}$ is the displacement from the center of person $k$ to that body part.
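The relation between SVF and SIE can be sketched as follows. The person center is assumed here to be the mean of the joint locations, which is an illustrative choice, and the function names are hypothetical.

```python
def person_center(joints):
    """Centroid of a person's joint locations (assumed definition of the center)."""
    n = len(joints)
    return (sum(x for x, _ in joints) / n, sum(y for _, y in joints) / n)

def ground_truth_svf(joints):
    """Displacement from the person center to each joint: joint - center."""
    cx, cy = person_center(joints)
    return [(x - cx, y - cy) for x, y in joints]

def decode_sie(joints, svf):
    """Subtract each offset from its own coordinate to recover the center."""
    return [(x - dx, y - dy) for (x, y), (dx, dy) in zip(joints, svf)]

joints = [(10.0, 20.0), (14.0, 20.0), (12.0, 26.0)]
svf = ground_truth_svf(joints)
sie = decode_sie(joints, svf)
```

With exact offsets every joint decodes to the same center, so joints of one person collapse to a single point in SIE space while different people stay separated by the distance between their centers.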
3.2 Pose-Guided Grouping (PGG) Module
In prior bottom-up methods [3, 22, 23], detection and grouping are separated. We reformulate the grouping process into a differentiable Pose-Guided Grouping (PGG) module for end-to-end training. By directly supervising the grouping results, more accurate estimation is obtained.
Our PGG is based on the Gaussian Blurring Mean Shift (GBMS) algorithm and inspired by , which was originally proposed for segmentation. However, directly applying GBMS to the challenging articulated tracking task is not desirable. First, the complexity of GBMS is $O(n^2)$, where $n$ is the number of feature vectors to group. Direct use of GBMS on the whole image will lead to huge memory consumption. Second, the predicted embeddings are always noisy, especially in background regions, where no supervision is available during training. As illustrated in the top row of Fig. 4, embedding noise exists in the background area (the ceiling or the floor). The noise in these irrelevant regions will affect the mean-shift grouping accuracy. We propose a novel Pose-Guided Grouping module to address the above drawbacks. Considering the sparseness of the matrix (body parts only occupy a small area in images), we propose to use the human pose mask to guide grouping, which rules out irrelevant areas and significantly reduces the memory cost. As shown in Fig. 3, we apply max along the channel dimension of the heatmaps and generate the instance-agnostic pose mask $M$ by thresholding at a value $\tau$: $M(p)$ is 1 if the channel-wise maximum at position $p$ exceeds $\tau$, otherwise 0.
Both spatial (KE and SIE) and temporal (TIE) embeddings can be grouped by PGG. Taking spatial grouping as an example, we refine KE and SIE with the PGG module to get more compact and discriminative embedding descriptors. The Pose-Guided Grouping algorithm is summarized in Alg. 1. KE and SIE are first concatenated to $D \times W \times H$ dimensional feature maps. Then embeddings are selected according to the binary pose mask $M$ and reshaped to $X \in \mathbb{R}^{D \times N}$ as initialization, where $N$ is the number of non-zero elements in $M$ ($N \ll W \times H$). Recurrent mean-shift grouping is then applied to $X$ for a fixed number of iterations. In each iteration, the Gaussian affinity $W$ is first calculated with the isotropic multivariate normal kernel, where the kernel bandwidth $\delta$ is empirically chosen as 5 in the experiments. $W \in \mathbb{R}^{N \times N}$ can be viewed as the weighted adjacency matrix. The diagonal matrix of affinity row sums, $\mathrm{diag}(W \mathbf{1})$, is used for normalization, where $\mathbf{1}$ means a vector with all entries one. We then update $X$ with the normalized Gaussian kernel weighted mean, $X \leftarrow X W \, \mathrm{diag}(W \mathbf{1})^{-1}$. After several iterations of grouping refinement, the embeddings become distinct for heterogeneous pairs and similar for homogeneous ones. When training, we apply the pairwise pull/push losses (Eq. 2 and 3) to the grouping results of all iterations.
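The grouping steps above can be sketched in plain Python. This is a toy, non-differentiable illustration and not the authors' implementation: the kernel is assumed to be exp(-(delta^2 / 2) * squared distance), embeddings are stored as tuples rather than a matrix, the example uses a bandwidth of 1 on toy values instead of the value 5 reported for the experiments, and all function names are hypothetical.

```python
import math

def pose_mask(heatmaps, tau):
    """heatmaps[j][y][x] -> binary mask, 1 where any part channel exceeds tau."""
    height, width = len(heatmaps[0]), len(heatmaps[0][0])
    return [[1 if max(h[y][x] for h in heatmaps) > tau else 0
             for x in range(width)]
            for y in range(height)]

def select_embeddings(embedding_map, mask):
    """embedding_map[y][x] is a D-dim tuple; keep only the masked pixels."""
    return [embedding_map[y][x]
            for y in range(len(mask))
            for x in range(len(mask[0])) if mask[y][x]]

def mean_shift_step(points, delta):
    """One Gaussian blurring mean-shift update over all selected embeddings."""
    updated = []
    for xi in points:
        # Row of the affinity matrix for xi (assumed kernel form).
        weights = [math.exp(-0.5 * delta ** 2 *
                            sum((a - b) ** 2 for a, b in zip(xi, xj)))
                   for xj in points]
        norm = sum(weights)  # row sum used for normalization
        updated.append(tuple(
            sum(w * xj[d] for w, xj in zip(weights, points)) / norm
            for d in range(len(xi))))
    return updated

# Masking: only pixels with a confident part detection are kept.
mask = pose_mask([[[0.9, 0.05], [0.1, 0.6]]], tau=0.5)
kept = select_embeddings([[(1.0,), (9.0,)], [(9.0,), (1.1,)]], mask)

# Grouping: two loose clusters tighten after one update.
points = [(0.0,), (0.2,), (3.0,), (3.3,)]
refined = mean_shift_step(points, delta=1.0)
```

Because the pairwise affinity is only computed over masked pixels, its size depends on the number of selected embeddings rather than on the full image resolution, which is where the memory saving comes from.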
3.3 TemporalNet: Human Temporal Grouping
TemporalNet extends SpatialNet to perform human-level temporal grouping in an online manner. Formally, we use the superscript $t$ to distinguish different frames. $I^t$ denotes the input frame at time-step $t$, which contains $K^t$ persons. SpatialNet is applied to $I^t$ to estimate a set of poses $\{P^t_1, \dots, P^t_{K^t}\}$. TemporalNet aims at temporally grouping human pose proposals in the current frame with already tracked poses in the previous frame. TemporalNet exploits both human-level appearance features (HE) and temporally coherent geometric information (TIE) to calculate the total pose similarity. Finally, we generate the pose trajectories by solving the bipartite graph matching problems, using pose similarity as pairwise potentials.
3.3.1 Human Embedding (HE)
To obtain human-level appearance embedding (HE), we introduce a region-specific HE branch based on . Given predicted pose proposals, the HE branch first calculates human bounding boxes that cover the corresponding human keypoints. For each bounding box, ROI-Align pooling is applied to the shared low-level feature maps to extract region-adapted ROI features. The ROI features are then mapped to the human embedding. HE is trained with a triplet loss, pulling HE of the same instance closer and pushing apart embeddings of different instances.
where the margin term is set to 0.3 in the experiments.
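A minimal sketch of a margin-based triplet loss with the margin of 0.3 stated above. Whether the distance is squared and how triplets are sampled are assumptions for illustration, and the function names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin.

    anchor and positive are embeddings of the same person,
    negative is an embedding of a different person.
    """
    return max(0.0, squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin)

easy = triplet_loss((0.0, 0.0), (0.1, 0.0), (1.0, 0.0))  # already well separated
hard = triplet_loss((0.0, 0.0), (0.5, 0.0), (0.6, 0.0))  # negative too close
```

The loss is zero once the negative is further from the anchor than the positive by at least the margin, so training effort concentrates on the confusable pairs.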
3.3.2 Temporal Instance Embedding (TIE)
To exploit the temporal information for pose tracking, we naturally extend the Spatial Instance Embedding (SIE) to the Temporal Instance Embedding (TIE). TIE branch concatenates low-level features, body part detection heatmaps and SIE from two neighboring frames. The concatenated feature maps are then mapped to dense TIE.
TIE is a task-specific representation which measures the displacement between the keypoint of one frame and the human center of another frame. This design utilizes the mutual information between keypoint and human in adjacent frames to handle occlusion and pose motion simultaneously. Specifically, we introduce bi-directional temporal vector fields (TVF), referred to as Forward TVF and Backward TVF respectively. Forward TVF encodes the relative displacement from the human center in the $(t-1)$-th frame to body parts in the $t$-th frame; it temporally propagates the human centroid embeddings from the $(t-1)$-th to the $t$-th frame. In contrast, Backward TVF represents the offset from the body center in the current $t$-th frame to body parts in the previous frame.
Here, the displacements are measured from the center of person $k$ at the corresponding time step. Subtracting the TVF from the absolute locations gives the corresponding Forward TIE and Backward TIE. Thereby, TIE encodes the temporally propagated human centroid. Likewise, we also extend the idea of spatial grouping to temporal grouping. TemporalNet outputs Forward TIE and Backward TIE, which are refined by PGG independently. Taking Forward TIE as an example, we generate the pose mask using body heatmaps from the $t$-th frame. We rule out irrelevant regions of the Forward TIE and reshape the remaining embeddings into a matrix. Subsequently, recurrent mean-shift grouping is applied. Again, additional grouping losses (Eq. 2, 3) are used to train TIE.
3.3.3 Pose Tracking
The problem of temporal pose association is formulated as a bipartite graph based energy maximization problem. The estimated poses are then associated with the previous poses by bipartite graph matching.
Here, a binary variable indicates whether a pose hypothesis in the current frame and a tracked pose in the previous frame are associated. The pairwise potentials represent the similarity between pose hypotheses. Each potential combines two terms, one for human-level appearance similarity and one for temporal smoothness, and two hyperparameters balance them.
The human-level appearance similarity is calculated from the embedding distance between the two human embeddings, and the temporal smoothness term is computed as the similarity between the encoded human center locations in SIE and the temporally propagated TIE.
The bipartite graph matching problem (Eq. 8) is solved using the Munkres algorithm to generate pose trajectories.
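To make the association step concrete, the sketch below finds the assignment that maximizes the total pairwise similarity. It is not the Munkres algorithm: for clarity it substitutes an exhaustive search over assignments, which returns the same optimum but is only practical for a handful of people. The similarity values and the function name are hypothetical.

```python
from itertools import permutations

def match_poses(similarity):
    """Return (current_index, previous_index) pairs maximizing total similarity.

    similarity[i][j] scores current pose i against previously tracked pose j.
    Exhaustive search stands in for the Munkres algorithm here.
    """
    n_cur = len(similarity)
    n_prev = len(similarity[0]) if similarity else 0
    best_pairs, best_score = [], float("-inf")
    if n_cur <= n_prev:
        # Every current pose gets a distinct previous pose.
        candidates = (list(zip(range(n_cur), perm))
                      for perm in permutations(range(n_prev), n_cur))
    else:
        # Every previous pose gets a distinct current pose.
        candidates = (sorted(zip(perm, range(n_prev)))
                      for perm in permutations(range(n_cur), n_prev))
    for pairs in candidates:
        score = sum(similarity[i][j] for i, j in pairs)
        if score > best_score:
            best_pairs, best_score = pairs, score
    return best_pairs

# Two current poses, three tracked poses from the previous frame (toy scores).
pairs = match_poses([[0.9, 0.1, 0.3],
                     [0.2, 0.8, 0.4]])
```

Current poses left unmatched (the case with more current poses than tracked ones) would start new trajectories in an online tracker; how unmatched poses are handled is not specified in the text above.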
3.4 Implementation Details
Following , SpatialNet uses the 4-stage stacked-hourglass as its backbone. We first train SpatialNet without PGG. The total losses consist of and , with their weights 1:1e-3:1e-4:1e-4. We set the initial learning rate to 2e-4 and reduce it to 1e-5 after 250K iterations. Then we fine-tune SpatialNet with PGG included. In practice, we have found the iteration number is sufficient, and more iterations do not lead to much gain.
4.1 Datasets and Evaluation
MS-COCO Dataset contains over 66k images with 150k people and 1.7 million labeled keypoints, for pose estimation in images. For the MS-COCO results, we follow the same train/val split as , where a held-out set of 500 training images is used for evaluation.
ICCV’17 PoseTrack Challenge Dataset  is a large-scale benchmark for multi-person articulated tracking, which contains 250 video clips for training and 50 sequences of videos for validation.
4.2 Comparisons with the State-of-the-art Methods
We compare our framework with the state-of-the-art methods on both pose estimation and tracking on the ICCV'17 PoseTrack validation set. As is common practice, additional images from MPII-Pose are used for training. Table 1 demonstrates our single-frame pose estimation performance. We show that our model achieves the state-of-the-art mAP without single-person pose model refinement. Table 2 evaluates the multi-person articulated tracking performance. Our model outperforms the state-of-the-art methods by a large margin. Compared with the winner of the ICCV'17 PoseTrack Challenge (ProTracker), our method obtains an improvement of 16.6% in MOTA. Our model further improves over the current state-of-the-art pose tracker (FlowTrack) by 6.4% in MOTA with comparable single-frame pose estimation accuracy, indicating the effectiveness of our TemporalNet.
4.3 Ablation Study
For pose estimation we choose  as our baseline, which proposes KE for spatial grouping. We also compare with one alternative embedding approach  for design justification. In BBox , instance location information is encoded as the human bounding box (x, y, w, h) at each pixel. The predicted bounding boxes are then used to group keypoints into individuals. However, such representation is hard to learn due to large variations of its embedding space, resulting in worse pose estimation accuracy compared to KE and SIE. KE provides part-level appearance cues, while SIE encodes the human centroid constraints. When combined together, a large gain is obtained (% vs. %/%). As shown in Fig. 5, adding auxiliary tasks (+aux) dramatically speeds up the training of KE, by enforcing geometric constraints on the embedding space. It also facilitates representation learning and marginally enhances pose estimation. As shown in Table 3, employing PGG significantly improves the pose estimation accuracy ( for KE, for SIE, and for both combined). End-to-end model training and direct grouping supervision together account for the improvement. Additionally, using the instance-agnostic pose mask, the memory consumption is remarkably reduced to about , as shown in Fig. 6(a), demonstrating the efficiency of PGG. Combining both KE and SIE with PGG, further boosts the pose estimation performance to % mAP.
For pose tracking, we first build a baseline tracker based on KE and/or SIE. It is assumed that KE and SIE change smoothly in consecutive frames. Somewhat surprisingly, such a simple tracker already achieves competitive performance, thanks to the rich geometric information contained in KE and SIE. Employing TemporalNet for tracking significantly improves over the baseline tracker, because of the combination of the holistic appearance features of HE and the temporal smoothness of TIE. Finally, incorporating spatial-temporal PGG to refine KE, SIE and TIE further increases the tracking performance (% vs. % MOTA). We also compare with some widely used alternative tracking metrics, namely Object Keypoint Similarity (OKS), Intersection over Union (IoU) of persons and DeepMatching (DM), for design justification. We find that TemporalNet significantly outperforms other trackers with task-agnostic tracking metrics. OKS only uses keypoints, which helps in handling occlusion, while IoU and DM only consider human-level information, which helps in handling fast motion. In comparison, our approach addresses both.
MS-COCO Results. Our SpatialNet substantially improves over our baseline  on single frame pose estimation on the MS-COCO dataset. For fair comparisons, we use the same train/val split as  for evaluation. Table 5 reports both single-scale (sscale) and multi-scale (mscale) results. Four different scales are used for multi-scale inference. Our sscale SpatialNet already achieves competitive performance against mscale baseline. By multi-scale inference, we further gain a significant improvement of 3% AP. All reported results are obtained without model ensembling or pose refinement [3, 20].
4.4 Runtime Analysis
Fig. 6(b) analyzes the runtime performance of pose estimation and tracking. For pose estimation, we compare with both top-down and bottom-up  approaches. The top-down pose estimator uses Faster RCNN  and a ResNet-152  based single person pose estimator (SPPE) . Since it estimates pose for each person independently, the runtime grows proportionally to the number of people.
We have presented a unified pose estimation and tracking framework, which is composed of SpatialNet and TemporalNet: SpatialNet tackles body part detection and part-level spatial grouping, while TemporalNet accomplishes the temporal grouping of human instances. We propose to extend KE and SIE in still images to HE appearance features and TIE temporally consistent geometric features in videos for robust online tracking. An effective and efficient Pose-Guided Grouping module is proposed to gain the benefits of full end-to-end learning of pose estimation and tracking.
References
-  M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2d human pose estimation: New benchmark and state of the art analysis. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
-  K. Bernardin and R. Stiefelhagen. Evaluating multiple object tracking performance: the clear mot metrics. EURASIP Journal on Image and Video Processing, 2008.
-  Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
-  M. Á. Carreira-Perpiñán. Generalised blurring mean-shift algorithms for nonparametric clustering. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.
-  G. Cheron, I. Laptev, and C. Schmid. P-cnn: Pose-based cnn features for action recognition. In The IEEE International Conference on Computer Vision (ICCV), 2015.
-  A. Doering, U. Iqbal, and J. Gall. Joint flow: Temporal flow fields for multi person tracking. arXiv preprint arXiv:1805.04596, 2018.
-  H. Fang, S. Xie, and C. Lu. Rmpe: Regional multi-person pose estimation. arXiv preprint arXiv:1612.00137, 2016.
-  R. Girdhar, G. Gkioxari, L. Torresani, M. Paluri, and D. Tran. Detect-and-track: Efficient pose estimation in videos. arXiv preprint arXiv:1712.09184, 2017.
-  K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. arXiv preprint arXiv:1703.06870, 2017.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
-  E. Insafutdinov, M. Andriluka, L. Pishchulin, S. Tang, E. Levinkov, B. Andres, and B. Schiele. Arttrack: Articulated multi-person tracking in the wild. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
-  E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele. Deepercut: A deeper, stronger, and faster multi-person pose estimation model. In European Conference on Computer Vision (ECCV), 2016.
-  U. Iqbal, A. Milan, M. Andriluka, E. Insafutdinov, L. Pishchulin, J. Gall, and B. Schiele. PoseTrack: A benchmark for human pose estimation and tracking. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
-  U. Iqbal, A. Milan, and J. Gall. Pose-track: Joint multi-person pose estimation and tracking. arXiv preprint arXiv:1611.07727, 2016.
-  M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.
-  S. Jin, X. Ma, Z. Han, Y. Wu, W. Yang, W. Liu, C. Qian, and W. Ouyang. Towards multi-person pose tracking: Bottom-up and top-down methods. In ICCV PoseTrack Workshop, 2017.
-  S. Kong and C. Fowlkes. Recurrent pixel embedding for instance grouping. arXiv preprint arXiv:1712.08273, 2017.
-  X. Liang, Y. Wei, X. Shen, J. Yang, L. Lin, and S. Yan. Proposal-free network for instance-level object segmentation. arXiv preprint arXiv:1509.02636, 2015.
-  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision (ECCV), 2014.
-  A. Newell, Z. Huang, and J. Deng. Associative embedding: End-to-end learning for joint detection and grouping. In Advances in Neural Information Processing Systems (NIPS), 2017.
-  A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision (ECCV), 2016.
-  X. Nie, J. Feng, J. Xing, and S. Yan. Generative partition networks for multi-person pose estimation. arXiv preprint arXiv:1705.07422, 2017.
-  G. Papandreou, T. Zhu, L.-C. Chen, S. Gidaris, J. Tompson, and K. Murphy. Personlab: Person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model. arXiv preprint arXiv:1803.08225, 2018.
-  G. Papandreou, T. Zhu, N. Kanazawa, A. Toshev, J. Tompson, C. Bregler, and K. Murphy. Towards accurate multi-person pose estimation in the wild. arXiv preprint arXiv:1701.01779, 2017.
-  C. Payer, T. Neff, H. Bischof, M. Urschler, and D. Štern. Simultaneous multi-person detection and single-person pose estimation with a single heatmap regression network. In ICCV PoseTrack Workshop, 2017.
-  L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. V. Gehler, and B. Schiele. Deepcut: Joint subset partition and labeling for multi person pose estimation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
-  C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. J. Guibas. Volumetric and multi-view cnns for object classification on 3d data. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
-  S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), 2015.
-  J. Revaud, P. Weinzaepfel, Z. Harchaoui, and C. Schmid. Deepmatching: Hierarchical deformable dense matching. International Journal of Computer Vision (IJCV), 2015.
-  F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
-  S. C. Suddarth and Y. Kergosien. Rule-injection hints as a means of improving network performance and learning time. In Neural Networks. 1990.
-  S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
-  B. Xiao, H. Wu, and Y. Wei. Simple baselines for human pose estimation and tracking. In European Conference on Computer Vision (ECCV), 2018.
-  S. Xie, Z. Chen, C. Xu, and C. Lu. Environment upgrade reinforcement learning for non-differentiable multi-stage pipelines. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
-  Y. Xiu, J. Li, H. Wang, Y. Fang, and C. Lu. Pose flow: Efficient online pose tracking. arXiv preprint arXiv:1802.00977, 2018.
-  Q. Yu, X. Chang, Y.-Z. Song, T. Xiang, and T. M. Hospedales. The devil is in the middle: Exploiting mid-level representations for cross-domain instance matching. arXiv preprint arXiv:1711.08106, 2017.
-  X. Zhu, Y. Jiang, and Z. Luo. Multi-person pose estimation for posetrack with enhanced part affinity fields. In ICCV PoseTrack Workshop, 2017.