1 Introduction
Multi-person articulated tracking aims at predicting the body parts of each person and associating them across time. It has stimulated much research interest because of its importance in applications such as video understanding and action recognition [5]. In recent years, significant progress has been made in single-frame human pose estimation [3, 9, 12, 24]. However, multi-person articulated tracking in complex videos remains challenging. Videos may contain a varying number of interacting people with frequent body part occlusion, fast body motion, large pose changes, and scale variation. Camera movement and zooming pose further challenges.
Pose tracking [14] can be viewed as a hierarchical detection and grouping problem. At the part level, body parts are detected and grouped spatially into human instances in each single frame. At the human level, the detected human instances are grouped temporally into trajectories.
Embedding can be viewed as a kind of permutation-invariant instance label to distinguish different instances. Previous works [20] perform keypoint grouping with Keypoint Embedding (KE). KE is a set of 1D appearance embedding maps in which joints of the same person have similar embedding values and those of different people have dissimilar ones. However, due to the over-flexibility of the embedding space, such representations are difficult to interpret and hard to learn [23]. Arguably, a more natural way for humans to assign IDs to targets in an image is by counting in a specific order (from left to right and/or from top to bottom). This inspires us to enforce geometric ordering constraints on the embedding space to facilitate training. Specifically, we add six auxiliary ordinal-relation prediction tasks, which encode the knowledge of geometric ordering, for faster convergence and better interpretation of KE. Recently, Spatial Instance Embedding (SIE) [22, 23] was introduced for body part grouping. SIE is a 2D embedding map, where each pixel is encoded with the predicted human center location (x, y). Fig. 1(a) illustrates the typical error patterns of pose estimation with KE or SIE. SIE may over-segment a single pose into several parts (column 2), while KE sometimes erroneously groups far-away body parts together (column 3). KE better preserves intra-class consistency but has difficulty separating instances for lack of geometric constraints. Since KE captures appearance features while SIE extracts geometric information, they are naturally complementary. Therefore we combine them to achieve better grouping results.
In this paper, we propose to extend the idea of using appearance and geometric information in a single frame to the temporal grouping of human instances for pose tracking. Previous pose tracking algorithms mostly rely on task-agnostic similarity metrics such as Object Keypoint Similarity (OKS) [33, 35] and Intersection over Union (IoU) [8]. However, such simple geometric cues are not robust to fast body motion, pose changes, camera movement and zoom. For robust pose tracking, we extend the idea of part-level spatial grouping to human-level temporal grouping. Specifically, we extend KE to Human Embedding (HE) for capturing holistic appearance features, and extend SIE to Temporal Instance Embedding (TIE) for achieving temporal consistency. Intuitively, the appearance features encoded by HE are more robust to fast motion, camera movement and zoom, while the temporal information embodied in TIE is more robust to body pose changes and occlusion. We propose a novel TemporalNet to enjoy the best of both worlds. Fig. 1(b) demonstrates typical error patterns of pose tracking with HE or TIE. HE exploits scale-invariant appearance features which are robust to camera zooming and movement (column 1), while TIE preserves temporal consistency which is robust to human pose changes (column 4).
Bottom-up pose estimation methods follow a two-stage pipeline: body part proposals are generated at the first stage and grouped into individuals at the second stage. Since the grouping is mainly used as post-processing, i.e. graph-based optimization [11, 12, 14, 16, 26] or heuristic parsing [3, 23], no error signals from the grouping results are back-propagated. We instead propose a fully differentiable Pose-Guided Grouping (PGG) module, making detection and grouping fully end-to-end trainable. We are able to directly supervise the grouping results, and the grouping loss is back-propagated to the low-level feature learning stages. This enables more effective feature learning by paying more attention to mistakenly grouped body parts. Moreover, prior methods require post-processing clustering [22] or extra refinement [23] to obtain accurate regression results. Our PGG instead helps to produce accurate embeddings directly (see Fig. 1(c)). To improve the pose tracking accuracy, we further extend PGG to temporal grouping of TIE.

In this work, we aim at unifying pose estimation and tracking in a single framework. SpatialNet detects body parts in a single frame and performs part-level spatial grouping to obtain body poses. TemporalNet accomplishes human-level temporal grouping in consecutive frames to track targets across time. These two modules share the feature extraction layers for more efficient inference.
The main contributions are summarized as follows:
- For pose tracking, we extend KE and SIE in still images to Human Embedding (HE) and Temporal Instance Embedding (TIE) in videos. HE captures human-level global appearance features to avoid drifting under camera motion, while TIE provides smoother geometric features to obtain temporal consistency.

- A fully differentiable Pose-Guided Grouping (PGG) module for both pose estimation and tracking, which enables the detection and grouping to be fully end-to-end trainable. The introduction of PGG and its grouping loss significantly improves the spatial/temporal embedding prediction accuracy.
2 Related Work
2.1 Multi-person Pose Estimation in Images
Recent multi-person pose estimation approaches can be classified into top-down and bottom-up methods.
Top-down methods [7, 9, 33, 24] locate each person with a bounding box and then apply single-person pose estimation. They mainly differ in the choices of human detectors [28] and single-person pose estimators [21, 32]. They rely heavily on the object detector and may fail in cluttered scenes, occlusion, person-to-person interaction, or rare poses. More importantly, top-down methods perform single-person pose estimation individually for each human candidate, so the inference time is proportional to the number of people, making it hard to achieve real-time performance. Additionally, the interface between human detection and pose estimation is non-differentiable, making it difficult to train in an end-to-end manner. Bottom-up approaches [3, 12, 26] detect body part candidates and group them into individuals. Graph-cut based methods [12, 26] formulate grouping as a graph-partitioning optimization problem, while [3, 23] utilize heuristic greedy parsing algorithms to speed up decoding. However, these bottom-up approaches only use grouping as post-processing and no error signals from the grouping results are back-propagated.

More recently, efforts have been devoted to end-to-end training or joint optimization. For top-down methods, Xie et al. [34]
propose a reinforcement learning agent to bridge the object detector and the pose estimator. For bottom-up methods, Newell et al. [20] propose the keypoint embedding (KE) to tag instances, trained with pairwise losses. Our framework is a bottom-up method inspired by [20]. However, [20] supervises the grouping only indirectly: it trains keypoint embedding descriptors to ease the post-processing grouping, but no direct supervision on the grouping results is provided. Even if the pairwise loss of KE is low, it is still possible to produce wrong grouping results, and [20] does not model such grouping loss. We instead propose a differentiable Pose-Guided Grouping (PGG) module to learn to group body parts, making the whole pipeline fully end-to-end trainable and yielding significant improvements in pose estimation and tracking.

Our work is also related to [22, 23], where spatial instance embeddings (SIE) are introduced to aid body part grouping. However, for lack of grouping supervision, their embeddings are noisy [22, 23] and additional clustering [22] or refinement [23] is required. We instead employ PGG and additional grouping losses to learn to group SIE, making it end-to-end trainable while yielding a much more compact embedding representation.
2.2 Multi-person Pose Tracking
Recent works on multi-person pose tracking mostly follow the tracking-by-detection paradigm, in which human body parts are first detected in each frame and data association is then performed over time to form trajectories.
Offline pose tracking methods take future frames into consideration, allowing for more robust predictions but at high computational cost. ProTracker [8] employs 3D Mask R-CNN to improve the estimation of body parts by leveraging temporal context encoded within a sliding temporal window. Graph-partitioning based methods [11, 14, 16] formulate multi-person pose tracking as an integer linear programming (ILP) problem and solve spatial-temporal grouping jointly. Such methods achieve competitive performance in complex videos by enforcing long-range temporal consistency.
Our approach is an online pose tracking method, which is faster and better suited to practical applications. Online pose tracking methods [6, 25, 37, 33] mainly use bipartite graph matching to assign targets in the current frame to existing trajectories. However, they only consider part-level geometric information and ignore global appearance features. When faced with fast pose motion and camera movement, such geometric trackers are prone to tracking errors. We propose to extend SpatialNet to TemporalNet to capture both the appearance features in HE and the temporal coherence in TIE, resulting in much better tracking performance.
3 Method
As demonstrated in Figure 2, we unify pose estimation and tracking in a single framework. Our framework consists of two major components: SpatialNet and TemporalNet.
SpatialNet tackles multi-person pose estimation by body part detection and part-level spatial grouping. It processes a single frame at a time. Given a frame, SpatialNet produces heatmaps, KE, SIE and geometric-ordinal maps simultaneously. Heatmaps model the body part locations. KE encodes the part-level appearance features, while SIE captures the geometric information about human centers. The auxiliary geometric-ordinal maps enforce ordering constraints on the embedding space to facilitate the training of KE. PGG is utilized to make both KE and SIE more compact and discriminative. We finally generate the body pose proposals by greedy decoding, following [20].
TemporalNet extends SpatialNet to deal with online human-level temporal grouping. It consists of an HE branch and a TIE branch, and shares the same low-level feature extraction layers with SpatialNet. Given body pose proposals, the HE branch extracts a region-specific embedding (HE) for each human instance. The TIE branch exploits the temporally coherent geometric embedding (TIE). With HE and TIE as pairwise potentials, a simple bipartite graph matching problem is solved to generate pose trajectories.
3.1 SpatialNet: Part-level Spatial Grouping
Throughout the paper, we use the following notation. Let $p$ be a 2D position in the image, and $p_{j,k}$ the location of body part $j$ for person $k$. We use $P_k = \{p_{j,k}\}_{j=1}^{J}$ to represent the body pose of the $k$-th person. We use 2D Gaussian confidence heatmaps to model the body part locations. Let $C_{j,k}$ be the confidence heatmap for the $j$-th body part of the $k$-th person, calculated as $C_{j,k}(p) = \exp\left(-\|p - p_{j,k}\|_2^2 / \sigma^2\right)$ for each position $p$ in the image, where $\sigma$ is set as 2 in the experiments. Following [3], we take the pixel-wise maximum of the individual confidence heatmaps to get the ground-truth confidence heatmap, i.e. $C_j^*(p) = \max_k C_{j,k}(p)$.
The detection loss $L_{det}$ is calculated as the weighted $\ell_2$ distance with respect to the ground-truth confidence heatmaps:

$$L_{det} = \sum_{j} \sum_{p} W(p) \left\| C_j(p) - C_j^*(p) \right\|_2^2 \qquad (1)$$

where $C_j$ is the predicted heatmap for the $j$-th body part and $W(p)$ is a position-wise weight.
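As a concrete illustration, the ground-truth heatmap construction described above can be sketched in NumPy; the function name and array layout here are ours, not the paper's:

```python
import numpy as np

def gt_heatmap(part_locs, H, W, sigma=2.0):
    """Ground-truth confidence map for one body-part type.

    part_locs: list of (x, y) locations of this part, one per person.
    Each person contributes a 2D Gaussian; the final map takes the
    pixel-wise maximum over people, as in C_j^*(p) = max_k C_{j,k}(p).
    """
    ys, xs = np.mgrid[0:H, 0:W]
    heat = np.zeros((H, W))
    for (px, py) in part_locs:
        g = np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / sigma ** 2)
        heat = np.maximum(heat, g)  # max over people, not sum
    return heat
```

Taking the maximum (rather than summing) keeps the peak of each Gaussian at 1 even when two people's parts are close together.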
3.1.1 Keypoint Embedding (KE) with auxiliary tasks
We follow [20] to produce the keypoint embedding for each type of body part. However, this kind of embedding representation has several drawbacks. First, the embedding is difficult to interpret [20, 23]. Second, it is hard to learn due to its over-flexibility, with no direct supervision available. To overcome these drawbacks, we introduce several auxiliary tasks to facilitate training and improve interpretability. The idea of auxiliary learning [31] has been shown to be effective in both supervised learning [27] and reinforcement learning [15]. Here, we explore auxiliary training in the context of keypoint embedding representation learning.

By auxiliary training, we explicitly enforce the embedding maps to learn geometric ordinal relations. Specifically, we define six auxiliary tasks: to predict the 'left-to-right' (l2r), 'right-to-left' (r2l), 'top-to-bottom' (t2b), 'bottom-to-top' (b2t), 'far-to-near' (f2n) and 'near-to-far' (n2f) orders of human instances in a single image. For example, in the 'left-to-right' map, people ordered from left to right in the image should receive low to high order values. Fig. 4(c)(d)(e) visualizes some example predictions of the auxiliary tasks. We see that human instances are clearly arranged in the corresponding geometric ordering. We also observe that KE (Fig. 4(b)) and the geometric ordinal-relation maps (c)(d)(e) share similar patterns, which suggests that KE acquires some knowledge of geometric ordering.
Following [20], KE is trained with the pairwise grouping loss $L_{group} = L_{pull} + L_{push}$. The pull loss (Eq. 2) is computed as the squared distance between the human reference embedding and the predicted embedding of each joint. The push loss (Eq. 3) is calculated between different reference embeddings, and exponentially drops to zero as the embedding difference increases. Formally, we define the reference embedding for the $k$-th person as $\bar{h}_k = \frac{1}{J} \sum_{j} h_j(p_{j,k})$, where $h_j$ is the predicted embedding map for the $j$-th body part.
$$L_{pull} = \frac{1}{NJ} \sum_{k=1}^{N} \sum_{j=1}^{J} \left( h_j(p_{j,k}) - \bar{h}_k \right)^2 \qquad (2)$$

$$L_{push} = \frac{1}{N^2} \sum_{k=1}^{N} \sum_{k'=1}^{N} \exp\left\{ -\frac{1}{2\sigma_h^2} \left( \bar{h}_k - \bar{h}_{k'} \right)^2 \right\} \qquad (3)$$
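A minimal NumPy sketch of the pull/push computation on 1-D embeddings may clarify the structure; the per-person list layout and the exact normalization over pairs are our assumptions:

```python
import numpy as np

def grouping_loss(embeds, sigma=1.0):
    """Illustrative pull/push grouping losses on 1-D keypoint embeddings.

    embeds: list of per-person arrays, each of shape (J,), holding the
    embedding values predicted at that person's J joints.
    """
    refs = np.array([e.mean() for e in embeds])   # reference embedding per person
    # pull: squared distance of each joint embedding to its person's reference
    pull = np.mean([np.mean((e - r) ** 2) for e, r in zip(embeds, refs)])
    # push: penalize pairs of *different* people whose references are close
    n = len(refs)
    diff = refs[:, None] - refs[None, :]
    push = np.exp(-diff ** 2 / (2 * sigma ** 2))
    push = (push.sum() - n) / (n * (n - 1))       # average over k != k' pairs
    return pull, push
```

Perfectly consistent joints give zero pull loss, and well-separated people drive the push loss toward zero.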
For auxiliary training, we replace the push loss (Eq. 3) with the ordinal loss (Eq. 4) but keep the pull loss (Eq. 2) the same.
$$L_{ord} = \frac{1}{N^2} \sum_{k} \sum_{k'} \log\left( 1 + \exp\left( -r_{k,k'} \left( \bar{h}_{k'} - \bar{h}_k \right) \right) \right) \qquad (4)$$

where $r_{k,k'} \in \{-1, +1\}$ indicates the ground-truth order for persons $k$ and $k'$. In l2r, r2l, t2b, and b2t, we sort human instances by their centroid locations. For example, in l2r, if the $k$-th person is on the left of the $k'$-th person, then $r_{k,k'} = 1$, otherwise $r_{k,k'} = -1$. In f2n and n2f, we sort them according to the head size.
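Since the exact form of the ordinal loss is reconstructed here, the following sketch instantiates it as a logistic ranking penalty on the reference embeddings; the function and the sign-matrix encoding are hypothetical:

```python
import numpy as np

def ordinal_loss(refs, order_sign):
    """Hedged sketch of an l2r-style ordinal ranking loss.

    refs: (N,) reference embeddings, one per person.
    order_sign[k, k2] = +1 if person k should rank below person k2
    (e.g. k is left of k2 in the l2r task), -1 otherwise.
    """
    diff = refs[None, :] - refs[:, None]          # diff[k, k2] = refs[k2] - refs[k]
    loss = np.log1p(np.exp(-order_sign * diff))   # small when order is respected
    mask = ~np.eye(len(refs), dtype=bool)         # ignore k == k2 terms
    return loss[mask].mean()
```

Embeddings that respect the ground-truth ordering incur a small loss; violating the ordering makes the loss grow roughly linearly in the embedding gap.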
3.1.2 Spatial Instance Embedding (SIE)
For lack of geometric information, KE has difficulty separating instances and tends to erroneously group distant body parts together. To remedy this, we combine KE with SIE to embody instance-wise geometric cues. Concretely, we predict dense spatial vector fields (SVF), where each 2D vector $S(p)$ encodes the relative displacement from the human center $c_k$ to the absolute location $p$. Fig. 4(f)(g) visualize the x-axis and y-axis spatial vector fields, which distinguish the left/right and upper/lower sides relative to the body center. As shown in Fig. 3, subtracting SVF from the pixel coordinates decodes it to SIE, $E(p) = p - S(p)$, in which each pixel is encoded with its human center location. We use an $\ell_2$ distance loss to train SVF, where the ground-truth spatial vector is the displacement from the person center to each body part:
$$L_{SVF} = \sum_{k=1}^{N} \sum_{j=1}^{J} \left\| S(p_{j,k}) - \left( p_{j,k} - c_k \right) \right\|_2 \qquad (5)$$

where $c_k$ is the center of person $k$.
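The decoding of SVF into SIE is a simple coordinate subtraction, sketched below in NumPy; the channel-first array layout is our assumption:

```python
import numpy as np

def decode_sie(svf):
    """Decode spatial vector fields (SVF) to spatial instance embeddings (SIE).

    svf: (2, H, W) array; svf[:, y, x] is the predicted displacement from
    the human center to pixel (x, y). Subtracting it from the pixel's own
    coordinate recovers the center, so every pixel belonging to one person
    ideally stores the same (cx, cy) value.
    """
    _, H, W = svf.shape
    ys, xs = np.mgrid[0:H, 0:W]
    coords = np.stack([xs, ys]).astype(float)  # (2, H, W), (x, y) order
    return coords - svf
```

With a perfect SVF, the decoded SIE is piecewise constant per person, which is exactly what makes it easy to cluster.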
3.2 Pose-Guided Grouping (PGG) Module
In prior bottom-up methods [3, 22, 23], detection and grouping are separated. We reformulate the grouping process into a differentiable Pose-Guided Grouping (PGG) module for end-to-end training. By directly supervising the grouping results, more accurate estimation is obtained.
Our PGG is based on the Gaussian Blurring Mean Shift (GBMS) [4] algorithm and is inspired by [17], which was originally proposed for segmentation. However, directly applying GBMS to the challenging articulated tracking task is not desirable. First, the complexity of GBMS is $O(N^2)$, where $N$ is the number of feature vectors to group. Direct use of GBMS on the whole image would lead to huge memory consumption. Second, the predicted embeddings are always noisy, especially in background regions, where no supervision is available during training. As illustrated in the top row of Fig. 4, embedding noise exists in the background area (the ceiling or the floor). The noise in these irrelevant regions affects the mean-shift grouping accuracy. We propose a novel Pose-Guided Grouping module to address the above drawbacks. Since body parts only occupy a small area in images, we propose to use a human pose mask to guide grouping, which rules out irrelevant areas and significantly reduces the memory cost. As shown in Fig. 3, we take the maximum of the heatmaps along the channel dimension and generate the instance-agnostic pose mask $M$ by thresholding at $\tau$: $M(p) = 1$ if $\max_j C_j(p) > \tau$, and 0 otherwise.
Both spatial (KE and SIE) and temporal (TIE) embeddings can be grouped by PGG. Take spatial grouping for example: we refine KE and SIE with the PGG module to get more compact and discriminative embedding descriptors. The Pose-Guided Grouping algorithm is summarized in Alg. 1. KE and SIE are first concatenated into 3-dimensional feature maps. The embeddings are then selected according to the binary pose mask $M$ and reshaped to $X \in \mathbb{R}^{3 \times N}$ as initialization, where $N$ is the number of non-zero elements in $M$ ($N \ll H \times W$). Recurrent mean-shift grouping is then applied to $X$ for $T$ iterations. In each iteration, the Gaussian affinity $K$ is first calculated with the isotropic multivariate normal kernel, $K_{ij} = \exp\left(-\|x_i - x_j\|^2 / (2\delta^2)\right)$, where the kernel bandwidth $\delta$ is empirically chosen as 5 in the experiments. $K$ can be viewed as a weighted adjacency matrix. The diagonal matrix of affinity row sums, $D = \mathrm{diag}(K\mathbf{1})$, is used for normalization, where $\mathbf{1}$ denotes a vector with all entries one. We then update $X$ with the normalized Gaussian-kernel weighted mean, $X \leftarrow X K D^{-1}$. After several iterations of grouping refinement, the embeddings become distinct for heterogeneous pairs and similar for homogeneous ones. During training, we apply the pairwise pull/push losses (Eq. 2 and 3) over the grouping results of all iterations.
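The masked GBMS iterations can be sketched as follows; the array layouts and parameter defaults are illustrative, not the paper's implementation:

```python
import numpy as np

def pose_guided_grouping(emb, heatmaps, tau=0.1, bandwidth=5.0, iters=3):
    """Sketch of the PGG forward pass (pose-masked GBMS iterations).

    emb: (C, H, W) concatenated embedding maps (e.g. KE + SIE).
    heatmaps: (J, H, W) body-part confidence maps.
    Only pixels whose max-over-parts confidence exceeds tau are grouped,
    which keeps the O(N^2) affinity matrix small.
    """
    mask = heatmaps.max(axis=0) > tau                 # instance-agnostic pose mask
    X = emb[:, mask].astype(float)                    # (C, N) selected embeddings
    for _ in range(iters):
        d2 = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
        K = np.exp(-d2 / (2 * bandwidth ** 2))        # Gaussian affinity (N, N)
        D_inv = 1.0 / K.sum(axis=1)                   # inverse of row sums
        X = (X @ K) * D_inv                           # normalized weighted mean shift
    return X, mask
```

Each iteration pulls every selected embedding toward the affinity-weighted mean of its neighbors, so embeddings of the same instance collapse together while distinct instances stay apart.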
3.3 TemporalNet: Human-level Temporal Grouping
TemporalNet extends SpatialNet to perform human-level temporal grouping in an online manner. Formally, we use the superscript $t$ to distinguish different frames. $I^t$ denotes the input frame at timestep $t$, which contains $N^t$ persons. SpatialNet is applied to $I^t$ to estimate a set of poses $\{P_k^t\}_{k=1}^{N^t}$. TemporalNet aims at temporally grouping the human pose proposals in the current frame with the already-tracked poses in the previous frame. It exploits both human-level appearance features (HE) and temporally coherent geometric information (TIE) to calculate the total pose similarity. Finally, we generate the pose trajectories by solving a bipartite graph matching problem, using pose similarity as pairwise potentials.
3.3.1 Human Embedding (HE)
To obtain the human-level appearance embedding (HE), we introduce a region-specific HE branch based on [36]. Given predicted pose proposals, the HE branch first calculates human bounding boxes covering the corresponding human keypoints. For each bounding box, ROIAlign pooling [9] is applied to the shared low-level feature maps to extract region-adapted ROI features. The ROI features are then mapped to the human embedding $h$. HE is trained with the triplet loss [30], pulling HEs of the same instance closer and pushing apart embeddings of different instances.
$$L_{HE} = \max\left( 0, \; \left\| h_a - h_p \right\|_2^2 - \left\| h_a - h_n \right\|_2^2 + m \right) \qquad (6)$$

where $h_a$, $h_p$ and $h_n$ are the embeddings of the anchor, positive and negative samples, and the margin term $m$ is set to 0.3 in the experiments.
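A minimal sketch of the triplet loss with the stated margin, for a single (anchor, positive, negative) triple of HE vectors:

```python
import numpy as np

def triplet_loss(anchor, pos, neg, margin=0.3):
    """Triplet loss on human embeddings (margin 0.3 per the text).

    Penalizes the anchor being closer (in squared distance) to the
    negative than to the positive by less than the margin.
    """
    d_ap = np.sum((anchor - pos) ** 2)   # anchor-positive squared distance
    d_an = np.sum((anchor - neg) ** 2)   # anchor-negative squared distance
    return max(0.0, d_ap - d_an + margin)
```

The hinge means well-separated triples contribute zero gradient, so training focuses on hard cases where the negative is nearly as close as the positive.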
3.3.2 Temporal Instance Embedding (TIE)
To exploit temporal information for pose tracking, we naturally extend the Spatial Instance Embedding (SIE) to the Temporal Instance Embedding (TIE). The TIE branch concatenates low-level features, body part detection heatmaps and SIE from two neighboring frames. The concatenated feature maps are then mapped to dense TIE.
TIE is a task-specific representation which measures the displacement between a keypoint in one frame and the human center in another frame. This design utilizes the mutual information between keypoints and humans in adjacent frames to handle occlusion and pose motion simultaneously. Specifically, we introduce bidirectional temporal vector fields (TVF), denoted as $T_f$ and $T_b$ respectively. The Forward TVF $T_f$ encodes the relative displacement from the human center in the $(t-1)$-th frame to the body parts in the $t$-th frame; it temporally propagates the human centroid embeddings from the $(t-1)$-th to the $t$-th frame. In contrast, the Backward TVF $T_b$ represents the offset from the body center in the current $t$-th frame to the body parts in the previous frame.
$$T_f^{*}\left(p_{j,k}^{t}\right) = p_{j,k}^{t} - c_k^{t-1}, \qquad T_b^{*}\left(p_{j,k}^{t-1}\right) = p_{j,k}^{t-1} - c_k^{t} \qquad (7)$$

where $c_k^{t-1}$ is the center of person $k$ at timestep $t-1$. Simply subtracting TVF from the absolute locations, we get the corresponding Forward TIE $E_f$ and Backward TIE $E_b$. Thereby, TIE encodes the temporally propagated human centroid. Likewise, we also extend the idea of spatial grouping to temporal grouping. TemporalNet outputs Forward TIE and Backward TIE, which are refined by PGG independently. Take Forward TIE for example: we generate the pose mask using body heatmaps from the $t$-th frame, rule out irrelevant regions of $E_f$ and reshape it to $X \in \mathbb{R}^{2 \times N}$. Subsequently, recurrent mean-shift grouping is applied. Again, the additional grouping losses (Eq. 2, 3) are used to train TIE.
3.3.3 Pose Tracking
The problem of temporal pose association is formulated as a bipartite-graph-based energy maximization problem, in which the estimated poses $\{P_{k'}^{t}\}$ are associated with the previously tracked poses $\{P_k^{t-1}\}$ by bipartite graph matching.
$$\max_{z} \; \sum_{k=1}^{N^{t-1}} \sum_{k'=1}^{N^{t}} \psi_{k,k'} \, z_{k,k'} \quad \text{s.t.} \; \sum_{k} z_{k,k'} \le 1, \; \sum_{k'} z_{k,k'} \le 1, \; z_{k,k'} \in \{0, 1\} \qquad (8)$$

where $z_{k,k'}$ is a binary variable which indicates whether the pose hypotheses $P_k^{t-1}$ and $P_{k'}^{t}$ are associated. The pairwise potential $\psi_{k,k'}$ represents the similarity between pose hypotheses, $\psi_{k,k'} = \mu_1 \psi_{HE} + \mu_2 \psi_{TIE}$, where $\psi_{HE}$ measures human-level appearance similarity, $\psi_{TIE}$ measures temporal smoothness, and $\mu_1$, $\mu_2$ are hyper-parameters to balance them. The human-level appearance similarity is calculated from the embedding distance, $\psi_{HE} = \exp\left( -\left\| h_k^{t-1} - h_{k'}^{t} \right\|_2 \right)$. The temporal smoothness term is computed as the similarity between the human center locations encoded in SIE and the temporally propagated centers in Forward TIE:

$$\psi_{TIE} = \exp\left( -\left\| \bar{E}_k^{\,t-1} - \bar{E}_{f,k'}^{\,t} \right\|_2 \right) \qquad (9)$$

where $\bar{E}_k^{\,t-1}$ averages SIE over the body parts of $P_k^{t-1}$ and $\bar{E}_{f,k'}^{\,t}$ averages Forward TIE over the body parts of $P_{k'}^{t}$.
The bipartite graph matching problem (Eq. 8) is solved using the Munkres algorithm to generate pose trajectories.
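For the handful of people present per frame, the matching step can be illustrated with a brute-force stand-in for the Munkres algorithm (a real implementation would use the Hungarian method for efficiency); the similarity matrix here is hypothetical:

```python
import numpy as np
from itertools import permutations

def associate(sim):
    """Match tracked poses (rows) to current-frame poses (columns) by
    maximizing total pairwise similarity.

    sim[i, j]: similarity psi between tracked pose i and new pose j.
    Returns a list of (tracked_idx, new_idx) assignments.
    """
    n = sim.shape[0]
    best, best_perm = -np.inf, None
    for perm in permutations(range(sim.shape[1]), n):  # try every assignment
        score = sum(sim[i, j] for i, j in enumerate(perm))
        if score > best:
            best, best_perm = score, perm
    return [(i, j) for i, j in enumerate(best_perm)]
```

Each unmatched row would start a new trajectory in practice; thresholding low-similarity matches (not shown) guards against associating poses of different people.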
3.4 Implementation Details
Following [20], SpatialNet uses the 4-stage stacked hourglass network as its backbone. We first train SpatialNet without PGG. The total loss is a weighted sum of the detection, KE, SIE and auxiliary ordinal losses, with weights 1 : 1e-3 : 1e-4 : 1e-4. We set the initial learning rate to 2e-4 and reduce it to 1e-5 after 250K iterations. Then we fine-tune SpatialNet with PGG included. In practice, we have found a small number of grouping iterations to be sufficient; more iterations do not lead to much gain.
4 Experiments
4.1 Datasets and Evaluation
MSCOCO Dataset [19] contains over 66k images with 150k people and 1.7 million labeled keypoints for pose estimation in images. For the MSCOCO results, we follow the same train/val split as [20], where a held-out set of 500 training images is used for evaluation.
ICCV’17 PoseTrack Challenge Dataset [13] is a largescale benchmark for multiperson articulated tracking, which contains 250 video clips for training and 50 sequences of videos for validation.
Evaluation Metrics: We follow [13] to use AP to evaluate multiperson pose estimation and the multiobject tracking accuracy (MOTA) [2] to measure tracking performance.
4.2 Comparisons with the Stateoftheart Methods
We compare our framework with the state-of-the-art methods on both pose estimation and tracking on the ICCV'17 PoseTrack validation set. As is common practice [13], additional images from MPII-Pose [1] are used for training. Table 1 demonstrates our single-frame pose estimation performance. We show that our model achieves the state-of-the-art mAP without single-person pose model refinement. Table 2 evaluates the multi-person articulated tracking performance. Our model outperforms the state-of-the-art methods by a large margin. Compared with the winner of the ICCV'17 PoseTrack Challenge (ProTracker [8]), our method obtains an improvement of 16.6% in MOTA. Our model further improves over the current state-of-the-art pose tracker (FlowTrack [33]) by 6.4% in MOTA with comparable single-frame pose estimation accuracy, indicating the effectiveness of our TemporalNet.
Method  Head  Shou  Elb  Wri  Hip  Knee  Ankl  Total 

ProTracker [8]  
PoseFlow [35]  
BUTDS [16]  
ArtTrack [13]  
ML_Lab [37]  
FlowTrack [33]  
Ours 
Method  MOTA-Head  MOTA-Shou  MOTA-Elb  MOTA-Wri  MOTA-Hip  MOTA-Knee  MOTA-Ankl  MOTA-Total
ArtTrack [13]  
ProTracker [8]  
BUTD2 [16]  
PoseFlow [35]  
JointFlow [6]                
FlowTrack [33]  
Ours 
4.3 Ablation Study
We extensively evaluate the effect of each component in our framework. Table 3 summarizes the single-frame pose estimation results, and Table 4 the pose tracking results.
For pose estimation, we choose [20] as our baseline, which proposes KE for spatial grouping. We also compare with an alternative embedding approach [18] for design justification. In BBox [18], instance location information is encoded as the human bounding box (x, y, w, h) at each pixel. The predicted bounding boxes are then used to group keypoints into individuals. However, such a representation is hard to learn due to the large variation of its embedding space, resulting in worse pose estimation accuracy compared to KE and SIE. KE provides part-level appearance cues, while SIE encodes human centroid constraints; when combined, a large gain is obtained. As shown in Fig. 5, adding the auxiliary tasks (+aux) dramatically speeds up the training of KE by enforcing geometric constraints on the embedding space. It also facilitates representation learning and marginally enhances pose estimation. As shown in Table 3, employing PGG significantly improves the pose estimation accuracy for KE, for SIE, and for both combined. End-to-end model training and direct grouping supervision together account for the improvement. Additionally, using the instance-agnostic pose mask, the memory consumption is remarkably reduced, as shown in Fig. 6(a), demonstrating the efficiency of PGG. Combining both KE and SIE with PGG further boosts the pose estimation mAP.
For pose tracking, we first build a baseline tracker based on KE and/or SIE, assuming that KE and SIE change smoothly in consecutive frames. Somewhat surprisingly, such a simple tracker already achieves competitive performance, thanks to the rich geometric information contained in KE and SIE. Employing TemporalNet for tracking significantly improves over the baseline tracker, owing to the combination of the holistic appearance features of HE and the temporal smoothness of TIE. Finally, incorporating spatial-temporal PGG to refine KE, SIE and TIE further increases the tracking performance. We also compare with some widely used alternative tracking metrics, namely Object Keypoint Similarity (OKS), Intersection over Union (IoU) of persons, and DeepMatching (DM) [29], for design justification. We find that TemporalNet significantly outperforms trackers built on these task-agnostic metrics. OKS only uses keypoints when handling occlusion, while IoU and DM only consider the whole person when handling fast motion. In comparison, we kill two birds with one stone.
MSCOCO Results. Our SpatialNet substantially improves over our baseline [20] on single-frame pose estimation on the MSCOCO dataset. For fair comparison, we use the same train/val split as [20] for evaluation. Table 5 reports both single-scale (s-scale) and multi-scale (m-scale) results. Four different scales are used for multi-scale inference. Our s-scale SpatialNet already achieves competitive performance against the m-scale baseline. With multi-scale inference, we gain a further significant improvement of 3% AP. All reported results are obtained without model ensembling or pose refinement [3, 20].
Method  Head  Shou  Elb  Wri  Hip  Knee  Ankl  Total

BBox [18]  
KE [20]  
SIE  
KE+SIE  
KE+SIE+aux  
KE+PGG  
SIE+PGG  
Ours 
Method  MOTA-Head  MOTA-Shou  MOTA-Elb  MOTA-Wri  MOTA-Hip  MOTA-Knee  MOTA-Ankl  MOTA-Total
OKS  
IOU  
DM [29]  
KE  
KE+SIE  
HE  
TIE  
HE+TIE  
Ours 
4.4 Runtime Analysis
Fig. 6(b) analyzes the runtime performance of pose estimation and tracking. For pose estimation, we compare with both top-down and bottom-up [20] approaches. The top-down pose estimator uses Faster R-CNN [28] and a ResNet-152 [10] based single-person pose estimator (SPPE) [33]. Since it estimates the pose of each person independently, its runtime grows proportionally with the number of people.
5 Conclusion
We have presented a unified pose estimation and tracking framework composed of SpatialNet and TemporalNet: SpatialNet tackles body part detection and part-level spatial grouping, while TemporalNet accomplishes the temporal grouping of human instances. We propose to extend KE and SIE in still images to HE appearance features and TIE temporally consistent geometric features in videos for robust online tracking. An effective and efficient Pose-Guided Grouping module is proposed to gain the benefits of fully end-to-end learning of pose estimation and tracking.
References

[1] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2d human pose estimation: New benchmark and state of the art analysis. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
 [2] K. Bernardin and R. Stiefelhagen. Evaluating multiple object tracking performance: the clear mot metrics. EURASIP Journal on Image and Video Processing, 2008.
 [3] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Real-time multi-person 2d pose estimation using part affinity fields. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
 [4] M. A. Carreira-Perpinan. Generalised blurring mean-shift algorithms for nonparametric clustering. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.
 [5] G. Cheron, I. Laptev, and C. Schmid. Pcnn: Posebased cnn features for action recognition. In The IEEE International Conference on Computer Vision (ICCV), 2015.
 [6] A. Doering, U. Iqbal, and J. Gall. Joint flow: Temporal flow fields for multi-person tracking. arXiv preprint arXiv:1805.04596, 2018.
 [7] H. Fang, S. Xie, and C. Lu. Rmpe: Regional multi-person pose estimation. arXiv preprint arXiv:1612.00137, 2016.
 [8] R. Girdhar, G. Gkioxari, L. Torresani, M. Paluri, and D. Tran. Detect-and-track: Efficient pose estimation in videos. arXiv preprint arXiv:1712.09184, 2017.
 [9] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask rcnn. arXiv preprint arXiv:1703.06870, 2017.
 [10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
 [11] E. Insafutdinov, M. Andriluka, L. Pishchulin, S. Tang, E. Levinkov, B. Andres, and B. Schiele. Arttrack: Articulated multi-person tracking in the wild. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
 [12] E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele. Deepercut: A deeper, stronger, and faster multi-person pose estimation model. In European Conference on Computer Vision (ECCV), 2016.
 [13] U. Iqbal, A. Milan, M. Andriluka, E. Ensafutdinov, L. Pishchulin, J. Gall, and B. Schiele. PoseTrack: A benchmark for human pose estimation and tracking. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
 [14] U. Iqbal, A. Milan, and J. Gall. Posetrack: Joint multi-person pose estimation and tracking. arXiv preprint arXiv:1611.07727, 2016.
 [15] M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.
 [16] S. Jin, X. Ma, Z. Han, Y. Wu, W. Yang, W. Liu, C. Qian, and W. Ouyang. Towards multi-person pose tracking: Bottom-up and top-down methods. In ICCV PoseTrack Workshop, 2017.
 [17] S. Kong and C. Fowlkes. Recurrent pixel embedding for instance grouping. arXiv preprint arXiv:1712.08273, 2017.
 [18] X. Liang, Y. Wei, X. Shen, J. Yang, L. Lin, and S. Yan. Proposalfree network for instancelevel object segmentation. arXiv preprint arXiv:1509.02636, 2015.
 [19] T.Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision (ECCV), 2014.
 [20] A. Newell, Z. Huang, and J. Deng. Associative embedding: Endtoend learning for joint detection and grouping. In Advances in Neural Information Processing Systems (NIPS), 2017.
 [21] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision (ECCV), 2016.
 [22] X. Nie, J. Feng, J. Xing, and S. Yan. Generative partition networks for multi-person pose estimation. arXiv preprint arXiv:1705.07422, 2017.
 [23] G. Papandreou, T. Zhu, L.-C. Chen, S. Gidaris, J. Tompson, and K. Murphy. Personlab: Person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model. arXiv preprint arXiv:1803.08225, 2018.
 [24] G. Papandreou, T. Zhu, N. Kanazawa, A. Toshev, J. Tompson, C. Bregler, and K. Murphy. Towards accurate multi-person pose estimation in the wild. arXiv preprint arXiv:1701.01779, 2017.
 [25] C. Payer, T. Neff, H. Bischof, M. Urschler, and D. Štern. Simultaneous multi-person detection and single-person pose estimation with a single heatmap regression network. In ICCV PoseTrack Workshop, 2017.
 [26] L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. V. Gehler, and B. Schiele. Deepcut: Joint subset partition and labeling for multi-person pose estimation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
 [27] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. J. Guibas. Volumetric and multiview cnns for object classification on 3d data. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
 [28] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), 2015.
 [29] J. Revaud, P. Weinzaepfel, Z. Harchaoui, and C. Schmid. Deepmatching: Hierarchical deformable dense matching. International Journal of Computer Vision (IJCV), 2015.
 [30] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
 [31] S. C. Suddarth and Y. Kergosien. Rule-injection hints as a means of improving network performance and learning time. In Neural Networks, 1990.
 [32] S.E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
 [33] B. Xiao, H. Wu, and Y. Wei. Simple baselines for human pose estimation and tracking. In European Conference on Computer Vision (ECCV), 2018.
 [34] S. Xie, Z. Chen, C. Xu, and C. Lu. Environment upgrade reinforcement learning for non-differentiable multi-stage pipelines. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
 [35] Y. Xiu, J. Li, H. Wang, Y. Fang, and C. Lu. Pose flow: Efficient online pose tracking. arXiv preprint arXiv:1802.00977, 2018.
 [36] Q. Yu, X. Chang, Y.Z. Song, T. Xiang, and T. M. Hospedales. The devil is in the middle: Exploiting midlevel representations for crossdomain instance matching. arXiv preprint arXiv:1711.08106, 2017.
 [37] X. Zhu, Y. Jiang, and Z. Luo. Multi-person pose estimation for posetrack with enhanced part affinity fields. In ICCV PoseTrack Workshop, 2017.