Efficient Online Multi-Person 2D Pose Tracking with Recurrent Spatio-Temporal Affinity Fields

11/29/2018 ∙ by Yaadhav Raaj, et al. ∙ Carnegie Mellon University

We present an online approach to efficiently and simultaneously detect and track the 2D pose of multiple people in a video sequence. We build upon Part Affinity Field (PAF) representation designed for static images, and propose an architecture that can encode and predict Spatio-Temporal Affinity Fields (STAF) across a video sequence. In particular, we propose a novel temporal topology cross-linked across limbs which can consistently handle body motions of a wide range of magnitudes. Additionally, we make the overall approach recurrent in nature, where the network ingests STAF heatmaps from previous frames and estimates those for the current frame. Our approach uses only online inference and tracking, and is currently the fastest and the most accurate bottom-up approach that is runtime invariant to the number of people in the scene and accuracy invariant to input frame rate of camera. Running at ∼30 fps on a single GPU at single scale, it achieves highly competitive results on the PoseTrack benchmarks.




1 Introduction

Figure 1: We solve multi-person human pose tracking by encoding change in position and orientation of keypoints or limbs across time as Temporal Affinity Fields (TAFs) in a recurrent fashion. Top: Modeling TAFs (blue arrows) through keypoints works when motion occurs but fails during limited motion making temporal association difficult. Bottom: Cross-linked TAFs across limbs perform consistently for all kinds of motions providing redundancy and smoother encoding for further refinement and prediction.

Multi-person human pose estimation has received considerable attention in the past few years, assisted by deep convolutional learning as well as the COCO [20] and MPII [3] datasets. The recently introduced PoseTrack dataset [16] has provided the community with a large-scale corpus of video data with multiple people in the scenes. In this paper, our aim is to utilize these towards building a truly online and real-time multi-person 2D pose estimator and tracker that is deployable and scalable while achieving high performance and requiring minimal post-processing. Potential uses lie in real-time, closed-loop applications with low latency where execution is in sync with the frame rate of the camera, such as self-driving cars and augmented reality.

The real-time/online nature of such an approach introduces several challenges: i) scenes with multiple people demand handling of occlusion, proximity and contact as well as limb articulation, ii) it should be run-time invariant to the number of people in the scene, and iii) it must be capable of handling challenges induced by video data, such as large camera motion and motion blur across frames. To overcome these challenges, we build upon Part Affinity Fields (PAFs) [5], which represent connections across body keypoints in static images as normalized 2D vector fields with position and orientation. In this work, we propose Temporal Affinity Fields (TAFs) which encode connections between keypoints across frames, including a unique cross-linked limb topology as seen in the bottom row of Figure 1.

In the absence of motion or when there is not enough data from previous frames, TAFs constructed between the same keypoints across frames, e.g., wrist-wrist or elbow-elbow, lose all associative properties (see top row of Fig. 1). In this case, the nullification of magnitude and orientation provides no useful information to discern between the case where a new person appears and the case where an existing person stops moving. This effect is compounded if these two cases occur in proximity. However, the longer limb TAF connections preserve information even in the absence of motion or the appearance of new people, by preventing corruption of valid information with noise as the magnitude of motion becomes small. In the limiting case of zero motion, the TAF effectively collapses to a PAF. From the perspective of a network, a TAF between the same keypoints destroys spatial information about keypoints as motion ceases, whereas a cross-linked TAF across limbs simply learns to propagate the PAF, which is a much simpler task.

Furthermore, we work on videos in a recurrent manner to make the approach real-time: the computation for each frame leverages information from previous frames, thereby reducing overall computation. Whereas single-image pose estimation methods use multiple stages to refine heatmaps [5, 24], we exploit the redundant information in video frames and divert the resources towards efficient computation of both poses and tracking across multiple frames. Thus, the multi-stage computation over images is divided over multiple frames in a video. Overall, we call this Recurrent Spatio-Temporal Affinity Fields (STAF) and it achieves highly competitive results on the PoseTrack benchmarks: [64.6% mAP, 58.4% MOTA] on single scale at 30 FPS, and [71.5% mAP, 61.3% MOTA] on multiple scales at 7 FPS on the PoseTrack 2017 validation set using one GTX 1080 Ti. Our approach currently ranks second place for accuracy and third place for tracking on the 2017 challenge [1]. Note that our tracking approach is truly online on a per-frame basis, with no post-hoc improvements made to tracks.

The rest of the paper is organized as follows. In Sec. 2, we discuss related work and situate the paper in the literature. In Sec. 3 we present details of our approach, training procedure as well as tracking and inference algorithm. Finally we present results and ablation experiments in Sec. 4 and conclude the paper in Sec. 5.

2 Related Work

Early methods for human pose estimation localized keypoints or body parts of individuals but did not consider multiple people simultaneously [4, 28, 38, 19, 34]. Hence, these methods were not adept at localizing keypoints of highly articulated or interacting people. Person detection was used followed by single-person keypoint detection [29, 10, 33, 15]. With deep learning, human detection methods such as Mask-RCNN [9, 12] were employed to directly predict multiple human bounding boxes through ROI-pooling followed by pose estimation per person [30]. However, these methods suffered when people were in close proximity as bounding boxes got grouped together. Furthermore, these top-down methods required more computation as the number of people increased in the image, making them inadequate for real-time pose estimation and tracking.

The bottom-up Part Affinity Fields (PAF) method [5] produced a spatial encoding of pair-wise body part connections in image space, followed by greedy bipartite graph matching for inference permitting consistent speed irrespective of the number of people. Person Lab [25] built upon these ideas to incorporate redundant connections on people with a less greedy inference approach getting highly competitive results on the COCO [21] and MPII [3] datasets. These methods work on single images and do not incorporate any keypoint tracking or past information.

Many offline methods have been used to enforce temporal consistency of pose estimation in videos [14, 16, 36]. These require solving spatio-temporal graphs or incorporating data from future frames making them inadequate for an online approach. Alternatively, Song et al. and Pfister et al. [27, 32] demonstrate how optical flow fields could be predicted per keypoint by formulating the input to be multi-framed. LSTM Pose Machines [23] built upon previous work demonstrating use of single stage/frame for video sequences. However, these networks did not model spatial relationship between keypoints and were evaluated on the single person Penn Action [35] and JHMDB [17] datasets.

A different line of work explored maintaining temporal graphs in neural networks for handling multiple people [8, 7]. Girdhar et al. [8] demonstrated that a 3D extension of Mask-RCNN, called person tubes, can connect people across time. However, this required applying grouped convolutions over a stack of frames, reducing speed, and did not achieve better results for tracking than the Hungarian Algorithm baseline. Joint Flow [7] used the concept of a Temporal Flow Field which connected keypoints across two frames. However, it did not use a recurrent structure and explicitly required a pair of images as input, increasing run-time significantly. The flow representation also suffered from ambiguity when subjects moved slowly or were stationary, and required special handling of such cases during tracking.

Top-down pose and tracking methods [36, 34, 6, 26, 12] have dominated the detection and tracking tasks [36] [37] in PoseTrack but their speed suffered due to explicit human detection and follow-up keypoint detection for each person. Moreover, modeling long-term spatio-temporal graphs for tracking in an offline manner hurt real-time applications. None of these works are able to report any significant run-time to performance measures as they cannot run in real time. In this work, we demonstrate this problem can be solved in a simple elegant single-stage network that incorporates recurrence by using the previous pose heatmaps to predict both keypoints and their spatio-temporal associations. We call this Recurrent Spatio-Temporal Affinity Fields (STAF) which not only represents the prediction of Spatial (PAFs) and Temporal (TAFs) Affinity Fields but also how they are refined through past information.

Figure 2: Left: Training architecture for one of our models which ingests video sequences in a recurrent manner across time while generating keypoints and connections across keypoints in same frame as Part Affinity Fields (PAFs), and connections across keypoints in time as Temporal Affinity Fields (TAFs). Together, we call this Recurrent Spatio-Temporal Affinity Fields (STAFs). Each module ingests outputs from other modules in both previous and current frames (shown with arrows) and refines it. Center: During inference, our network operates on a single video frame at each time step. Right: During inference, we use the predicted heatmaps to track and detect people. Keypoints (red) are extracted first, then associated into poses and tracklets using PAFs (green), TAFs (blue) and previous tracklets.

3 Proposed Approach

Our approach aims to solve the problems of keypoint estimation and tracking simultaneously in videos. We employ Recurrent Convolutional Neural Networks which we construct from four essential building blocks. Let $P^t$ represent the pose of a person in a particular frame or time $t$, consisting of keypoints $K^t$. The Part Affinity Fields (PAFs) $L^t$ are synthesized from keypoints in each frame. For tracking keypoints across frames of a video, we propose Temporal Affinity Fields (TAFs), given by $R^t$, which capture the recurrence and connect the keypoints across frames. Together, they are referred to as Spatio-Temporal Affinity Fields (STAFs). These blocks are visualized in Fig. 2 where each block is shown with a different color: the raw convolutional features from the VGG backbone [31] are shown in amber, PAFs in green, keypoints in red and TAFs in blue.

Thus, the outputs of the VGG backbone, PAFs, keypoints and TAFs are given by $V^t$, $L^t$, $K^t$ and $R^t$, respectively, and computed through CNNs by $\psi_V$, $\psi_L$, $\psi_K$ and $\psi_R$, respectively. The keypoint heatmaps are constructed from ground truth by placing a Gaussian kernel at the location of the annotated keypoint, whereas the PAFs and TAFs are constructed from ground truth between pairs of keypoints $k$ and $k'$ for each person:

$$L^{*,t}_{(k \to k')} = \Omega\big(K^{*,t}_{k},\, K^{*,t}_{k'}\big), \qquad R^{*,t}_{(k \to k')} = \Omega\big(K^{*,t-1}_{k},\, K^{*,t}_{k'}\big),$$

where $*$ denotes the ground truth and the function $\Omega(\cdot)$ places a directional unit vector at every pixel within a pre-defined radius of the line connecting the two keypoints.
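The ground-truth construction described above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function names `gaussian_heatmap` and `affinity_field`, the `sigma` and `radius` values, and the example coordinates are assumptions, not values taken from the paper.

```python
import numpy as np

def gaussian_heatmap(shape, center, sigma=2.0):
    """Keypoint target: a Gaussian kernel centred on the annotated (x, y) location."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    d2 = (xs - center[0]) ** 2 + (ys - center[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def affinity_field(shape, start, end, radius=1.5):
    """Unit vector start->end at every pixel within `radius` of the segment.

    Returns an array of shape (2, H, W) with the x and y components.
    """
    h, w = shape
    start = np.asarray(start, dtype=float)
    end = np.asarray(end, dtype=float)
    field = np.zeros((2, h, w))
    seg = end - start
    length = np.linalg.norm(seg)
    if length < 1e-6:
        return field  # coincident endpoints: no direction can be encoded
    unit = seg / length
    ys, xs = np.mgrid[0:h, 0:w]
    px, py = xs - start[0], ys - start[1]
    # Projection of each pixel onto the segment, clamped to its endpoints.
    along = np.clip(px * unit[0] + py * unit[1], 0.0, length)
    dist = np.hypot(px - along * unit[0], py - along * unit[1])
    mask = dist <= radius
    field[0][mask] = unit[0]
    field[1][mask] = unit[1]
    return field

# PAF: both keypoints in the current frame.
# Cross-linked TAF: first keypoint from the previous frame, second from the current frame.
elbow_prev, elbow_cur, wrist_cur = (10, 20), (12, 20), (30, 20)
paf = affinity_field((46, 46), elbow_cur, wrist_cur)
taf = affinity_field((46, 46), elbow_prev, wrist_cur)
# A same-keypoint TAF with zero motion carries no direction at all.
still = affinity_field((46, 46), elbow_cur, elbow_cur)
```

Note how the zero-motion, same-keypoint case yields an all-zero field, which is the loss of associative information that motivates the cross-linked limb topology.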

3.1 Video Models for Pose Estimation and Tracking

Next, we present the three models comprising the four blocks capable of estimating keypoints and STAFs. The input to each network consists of a set of consecutive frames of a video. Each block in each network consists of five 7 × 7 and two 1 × 1 convolution layers. Each 7 × 7 layer is replaceable with the concatenation of three 3 × 3 convolution layers providing the same receptive field. The first stage has its own set of weights, distinct from those applied to subsequent frames, as it cannot incorporate any previous data; it also has a lower depth, which was found to improve results (see Sec. 4). The VGG features are computed for each frame. For frame $I^t$ at time $t$ of the video, they are computed as $V^t = \psi_V(I^t)$.

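Model I: Given the VGG features $V^{t-1}$ and $V^t$ of the previous and current frames, the following equations describe the first model:

$$\begin{aligned}
L^t &= \psi_L^q\big(V^t,\, \psi_L^{q-1}(\cdot)\big),\\
K^t &= \psi_K^q\big(V^t,\, L^t,\, \psi_K^{q-1}(\cdot)\big),\\
R^t &= \psi_R\big(V^{t-1},\, V^t,\, L^{t-1},\, L^t,\, R^{t-1}\big),
\end{aligned}$$

where $\psi^q$ means $q$ recursive applications of $\psi$. In our experiments, we found that performance plateaus at $q = 5$. In Model I, the PAFs $L^t$ are obtained by recursive application of $\psi_L$ on the concatenated input from VGG features and PAFs from the previous stage. Similarly, the keypoints $K^t$ depend on VGG features, keypoints from the previous stage and PAFs from the current stage. Finally, the TAFs $R^t$ depend on VGG features and PAFs from both the previous and current frames, as well as TAFs from the previous frame. This model produces good results but is the slowest due to its recursive stages.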

Model II: Unlike Model I with multiple applications of CNNs for PAFs and keypoints, Model II computes the PAFs and keypoints in a single pass as visualized in Fig. 2:

$$\begin{aligned}
L^t &= \psi_L\big(V^t,\, L^{t-1}\big),\\
K^t &= \psi_K\big(V^t,\, L^t,\, K^{t-1}\big),\\
R^t &= \psi_R\big(V^{t-1},\, V^t,\, L^{t-1},\, L^t,\, R^{t-1}\big).
\end{aligned}$$

Replacing five stages with a single stage is expected to drop performance. Therefore, the multi-stage computation of PAFs and keypoints in Model II is replaced by feeding in the PAFs and keypoints output for the previous frame, $L^{t-1}$ and $K^{t-1}$. This boosts speed significantly without significant loss in performance as it takes advantage of the redundant information in videos, i.e., the PAFs and keypoints from the previous frame are a reliable guide to the location of PAFs and keypoints in the current frame.

Model III: Finally, the third model attempts to estimate Part and Temporal Affinity Fields through a single CNN:

$$\begin{aligned}
[L^t, R^t] &= \psi_{[L,R]}\big(V^{t-1},\, V^t,\, [L^{t-1}, R^{t-1}]\big),\\
K^t &= \psi_K\big(V^t,\, L^t,\, K^{t-1}\big),
\end{aligned}$$

where $\psi_{[L,R]}$ implies simultaneous computation of Part and Temporal Affinity Fields through a single CNN. For Model III, the channels corresponding to the PAF are then passed for keypoint estimation along with VGG features from the current frame and keypoints from the previous frame. As Model III consists of only three blocks, it has the fastest inference; however, it proved to be the most difficult to train.
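The data flow of Model II described above can be sketched as a simple per-frame loop. The callables `psi_v`, `psi_l`, `psi_k`, `psi_r` and `first_stage` are hypothetical stand-ins for the trained CNN blocks (here replaced by toy arithmetic so the sketch runs), and the loop structure is an assumption about how the recurrence would be driven at inference time, not the authors' implementation.

```python
import numpy as np

def run_model_ii(frames, psi_v, psi_l, psi_k, psi_r, first_stage):
    """One stage per frame; each block ingests outputs of the previous frame."""
    outputs = []
    prev = None  # (features, PAFs, keypoints, TAFs) of the previous frame
    for frame in frames:
        v = psi_v(frame)
        if prev is None:
            # The first frame has no history, so a separate first stage is used.
            l, k, r = first_stage(v)
        else:
            v_prev, l_prev, k_prev, r_prev = prev
            l = psi_l(v, l_prev)                     # PAFs: features + previous PAFs
            k = psi_k(v, l, k_prev)                  # keypoints: features + PAFs + previous keypoints
            r = psi_r(v_prev, v, l_prev, l, r_prev)  # TAFs: both frames' features and PAFs + previous TAFs
        outputs.append((l, k, r))
        prev = (v, l, k, r)
    return outputs

# Toy stand-ins (assumptions for illustration only): simple averages instead of CNNs.
frames = [np.full((2, 2), float(i)) for i in range(3)]
outs = run_model_ii(
    frames,
    psi_v=lambda f: f,
    psi_l=lambda v, lp: 0.5 * (v + lp),
    psi_k=lambda v, l, kp: (v + l + kp) / 3.0,
    psi_r=lambda vp, v, lp, l, rp: (vp + v + lp + l + rp) / 5.0,
    first_stage=lambda v: (v, v, v),
)
```

Because each frame triggers exactly one pass through each block, the cost per frame stays constant regardless of how long the video is.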

3.2 Topology of Spatio-Temporal Affinity Fields

Figure 3: This figure illustrates the three possible topology variations for Spatio-Temporal Affinity Fields including the new cross-linked limb topology (b). The red and green color indicates previous frame and current frame, respectively. Keypoints, PAFs and TAFs are represented by solid circles, straight lines and arrows, respectively.

For our body model, we define 21 body parts or keypoints, which is the union of body parts in the COCO and MPII pose datasets. They include ears, nose and eyes from COCO, and head and neck from MPII. Next, there are several possible ways to associate and track the keypoints and STAFs across frames, as illustrated in Figure 3. In this figure, solid circles represent keypoints while straight lines and arrows stand for PAFs and TAFs, respectively. Figure 3(a) consists of TAFs between the same keypoints as well as PAFs. For this topology, the number of TAFs and PAFs is 21 and 48, respectively. The TAFs capture temporal connections directly across keypoints, similar to [7].

On the other hand, Figure 3(b) consists of TAFs between different limbs in a cross-linked manner across frames. The number of PAFs and TAFs is 48 and 96, respectively. We also tested the topology in Figure 3(c), which consists only of 69 TAFs, covering both keypoints and limbs. This does not model any spatial links within frames across keypoints.

3.3 Model Training

During training, we unroll each model to handle multiple frames at once. Each model is first pre-trained in Image Mode where we present a single image or frame at each time instant to the model. This implies multiple applications of PAF and keypoint stages to the same frame. We train with COCO, MPII and PoseTrack datasets with a batch distribution of , and , respectively, corresponding dataset sizes where each batch consists of images or frames from one dataset exclusively. For masking out un-annotated keypoints, we use the head bounding boxes available in the MPII and PoseTrack datasets, and the location of annotated keypoints for batches from the COCO dataset. The net takes in 368 × 368 images and has scaling, rotation and translation augmentations. Heatmaps are computed with an L2 loss with a stride of 8, resulting in 46 × 46 dimensional heatmaps. In topology 3(b), we initialize the TAF with the PAF, and with zeros for 3(a). We train the net for a maximum of k iterations.

Next, we proceed with training in Video Mode, where we expose the network to video sequences. For static image datasets including COCO and MPII, we augment data with video sequences whose length equals the number of times the network is unrolled, by synthesizing motion with scaling, rotation and translation. We train COCO, MPII and PoseTrack in Video Mode with a batch distribution of , and , respectively. Moreover, we also use skip-frame augmentation for the video-based PoseTrack dataset where some of the randomly selected sequences skip up to frames. We lock the weights of the VGG module in Video Mode. For Model I, we only trained the TAFs block when training on videos. For Model II, we trained keypoints, PAFs and TAFs for epochs, then locked all modules except TAFs. In Model III, both STAFs and keypoints remained unlocked throughout k iterations.
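As a rough illustration of the two augmentations mentioned above, the sketch below synthesizes motion for the keypoints of a static image and sub-samples a sequence to skip frames. The function names, the magnitude limits (`max_shift`, `max_rot_deg`, `max_scale`) and the image centre are assumptions chosen for the example; in practice the same transform would also be applied to the image pixels.

```python
import numpy as np

def synth_motion_sequence(keypoints, num_frames, rng, max_shift=4.0,
                          max_rot_deg=3.0, max_scale=0.03, center=(184.0, 184.0)):
    """Turn static (N, 2) keypoints into a pseudo-video by accumulating
    small random scale, rotation and translation steps frame after frame."""
    seq = [np.asarray(keypoints, dtype=float)]
    c = np.asarray(center, dtype=float)
    for _ in range(num_frames - 1):
        angle = np.deg2rad(rng.uniform(-max_rot_deg, max_rot_deg))
        scale = 1.0 + rng.uniform(-max_scale, max_scale)
        shift = rng.uniform(-max_shift, max_shift, size=2)
        rot = np.array([[np.cos(angle), -np.sin(angle)],
                        [np.sin(angle), np.cos(angle)]])
        # Rotate and scale about the image centre, then translate.
        seq.append((seq[-1] - c) @ (scale * rot).T + c + shift)
    return np.stack(seq)  # shape (num_frames, N, 2)

def skip_frames(sequence, skip):
    """Skip-frame augmentation: keep every (skip + 1)-th frame."""
    return sequence[:: skip + 1]

kps = np.array([[100.0, 120.0], [110.0, 150.0], [125.0, 180.0]])
seq = synth_motion_sequence(kps, num_frames=4, rng=np.random.default_rng(0))
```

The sequence length would be set to the number of times the network is unrolled, so that image datasets and video datasets produce batches of the same temporal extent.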

3.4 Inference and Tracking

The method described so far predicts heatmaps of keypoints and STAFs at every frame. Next, we present the framework to perform pose inference and tracking across frames given the predicted heatmaps. Let the inferred poses at time $t$ be given by $P^{t,n}$, where the second superscript $n$ indexes over people in each frame. Each pose at a particular time consists of up to $K$ keypoints that become part of a pose post inference, $K$ being the number of keypoints in the body model.

The detection and tracking procedure begins with localization of keypoints at time $t$. The inferred keypoints are obtained by rescaling the heatmaps to the original image resolution followed by non-maximal suppression. Then, we infer PAF and TAF weights between all pairs of keypoints defined by the given topology: for a PAF both keypoints lie in the current frame, whereas for a TAF the first keypoint comes from the previous frame and the second from the current frame. The weighting function samples points between the two keypoints and computes the dot product between the mean vector of the sampled points and the directional vector from the first to the second keypoint.
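The weighting function described above can be sketched as follows. The name `affinity_score`, the number of samples, and the nearest-pixel lookup are assumptions made for this illustration; the field is assumed to be stored as a `(2, H, W)` array of x and y components.

```python
import numpy as np

def affinity_score(field, start, end, num_samples=10):
    """Sample the vector field along start->end, average the samples, and
    take the dot product with the unit direction from start to end."""
    start = np.asarray(start, dtype=float)
    end = np.asarray(end, dtype=float)
    seg = end - start
    length = np.linalg.norm(seg)
    if length < 1e-6:
        return 0.0  # no direction to compare against
    unit = seg / length
    ts = np.linspace(0.0, 1.0, num_samples)
    pts = start[None, :] + ts[:, None] * seg[None, :]
    # Nearest-pixel lookup, clamped to the field boundaries.
    xs = np.clip(np.rint(pts[:, 0]).astype(int), 0, field.shape[2] - 1)
    ys = np.clip(np.rint(pts[:, 1]).astype(int), 0, field.shape[1] - 1)
    mean_vec = field[:, ys, xs].mean(axis=1)
    return float(mean_vec @ unit)

# Toy field: unit vectors pointing in +x along row y = 20.
field = np.zeros((2, 46, 46))
field[0, 20, :] = 1.0
aligned = affinity_score(field, (10, 20), (30, 20))
reversed_ = affinity_score(field, (30, 20), (10, 20))
crossing = affinity_score(field, (10, 5), (10, 35))
```

A candidate connection aligned with the field scores close to 1, one running against it scores close to -1, and one crossing it scores close to 0, which is what allows the scores to be sorted and consumed greedily.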

Both the inferred PAF and TAF weights are sorted by their scores before inferring the complete poses and associating them across frames with unique ids. We perform this in a bottom-up style where we utilize poses and inferred PAFs from the previous frame to determine the update, addition or deletion of tracklets. Going through each PAF in the sorted list, (i) we initialize a new pose if both keypoints in the PAF are unassigned, (ii) add to an existing pose if one of the keypoints is assigned, (iii) update the score of the PAF in the pose if both are assigned to the same pose, and (iv) merge two poses if the keypoints belong to different poses with opposing keypoints unassigned. Finally, we assign to each pose in the current frame the most frequent id among its keypoints from the previous frame. For cases where we have ambiguous PAFs, i.e., multiple equally likely possibilities as seen in Figure 4, we use transitivity that reweighs PAFs with TAFs to disambiguate between them, using a biasing weight. In Figure 4(a), a keypoint (an elbow) is under consideration with wrists B and E as two possibilities. We select the candidate backed by the strongest TAF, i.e., the wrist whose reweighted connection has the higher weight.

Figure 4: (a) Ambiguity when selecting between two wrist locations B and E is resolved by reweighing PAFs through TAFs. (b)-(d): With transitivity, incorrect PAFs containing ankles (c) are resolved with past pose (b) resulting in (d).
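The four assembly rules and the id assignment described above can be sketched in plain Python. This is a simplified illustration, not the authors' algorithm: the data layout (keypoints as `(part_name, index)` tuples, connections as `(score, kp_a, kp_b)`), the function names, the reading of rule (iv) as "merge only if the two poses share no part type", and the omission of TAF-based reweighting are all assumptions.

```python
from collections import Counter

def assemble_poses(connections):
    """Greedily group keypoints into poses from scored PAF connections."""
    poses = []   # each pose: {"keypoints": set, "score": float}, or None once merged away
    owner = {}   # keypoint -> index of the pose that owns it
    for score, a, b in sorted(connections, key=lambda c: c[0], reverse=True):
        pa, pb = owner.get(a), owner.get(b)
        if pa is None and pb is None:            # (i) start a new pose
            poses.append({"keypoints": {a, b}, "score": score})
            owner[a] = owner[b] = len(poses) - 1
        elif pa is None or pb is None:           # (ii) extend an existing pose
            idx, new = (pb, a) if pa is None else (pa, b)
            if new[0] in {k[0] for k in poses[idx]["keypoints"]}:
                continue                         # that body part is already filled
            poses[idx]["keypoints"].add(new)
            poses[idx]["score"] += score
            owner[new] = idx
        elif pa == pb:                           # (iii) same pose: update its score
            poses[pa]["score"] += score
        else:                                    # (iv) merge if no part type is shared
            types_a = {k[0] for k in poses[pa]["keypoints"]}
            types_b = {k[0] for k in poses[pb]["keypoints"]}
            if types_a.isdisjoint(types_b):
                poses[pa]["keypoints"] |= poses[pb]["keypoints"]
                poses[pa]["score"] += poses[pb]["score"] + score
                for kp in poses[pb]["keypoints"]:
                    owner[kp] = pa
                poses[pb] = None
    return [p for p in poses if p is not None]

def assign_track_id(linked_prev_ids, fresh_id):
    """Most frequent id among previous-frame keypoints linked to this pose,
    or a fresh id when nothing links back (a newly appeared person)."""
    if not linked_prev_ids:
        return fresh_id
    return Counter(linked_prev_ids).most_common(1)[0][0]

two_people = assemble_poses([
    (0.9, ("elbow", 0), ("wrist", 0)),
    (0.8, ("shoulder", 0), ("elbow", 0)),
    (0.7, ("elbow", 1), ("wrist", 1)),
    (0.2, ("elbow", 0), ("wrist", 1)),   # weak cross-person link, rejected
])
merged = assemble_poses([
    (0.9, ("elbow", 0), ("wrist", 0)),
    (0.8, ("hip", 0), ("knee", 0)),
    (0.5, ("elbow", 0), ("hip", 0)),     # joins the two partial poses
])
```

Because every connection is visited once in sorted order, the cost depends on the number of candidate connections rather than on running a per-person detector.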

4 Experiments

In this section, we present results of our experiments. Input images to networks are resized at Wx368 maintaining aspect ratio for single scale (SS); and Wx736, Wx368 and Wx184 for multiple scales (MS). The heatmaps for multiple scales are re-sized back to Wx736 and merged through averaging. This is followed by inference and tracking.
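The multi-scale merging step described above amounts to resizing each scale's heatmap to a common resolution and averaging. The sketch below uses nearest-neighbour resizing purely to stay dependency-free; the function names and the toy array sizes are assumptions, and a real pipeline would more likely use bilinear or bicubic interpolation.

```python
import numpy as np

def resize_nearest(heatmap, out_h, out_w):
    """Nearest-neighbour resize of a 2D heatmap to (out_h, out_w)."""
    h, w = heatmap.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return heatmap[rows[:, None], cols[None, :]]

def merge_scales(heatmaps, out_h, out_w):
    """Resize every per-scale heatmap to a common size and average them."""
    return np.mean([resize_nearest(hm, out_h, out_w) for hm in heatmaps], axis=0)

# Toy heatmaps standing in for the outputs at three input scales.
per_scale = [np.full((92, 92), 1.0), np.full((46, 46), 2.0), np.full((23, 23), 3.0)]
merged_map = merge_scales(per_scale, 92, 92)
```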

4.1 Ablation Study

We conducted a series of ablation studies to determine the construction of our network architecture:

Filter Sizes: We studied the effect of filter size in each of the modules. As discussed in Sec. 3, each block either consists of five 7 × 7 layers followed by two 1 × 1 layers [5], or each 7 × 7 layer is replaced with three 3 × 3 layers in the alternate experiment. The results are shown in Table 1. We run single frame inference on Model I and find the 3 × 3 filter size to be more accurate than 7 × 7, with significant boosts in average precision of knee and ankle keypoints. It is also faster, while requiring more memory.

| Method | Hea | Sho | Elb | Wri | Hip | Kne | Ank | mAP | fps |
|---|---|---|---|---|---|---|---|---|---|
| Model I - 3x3 | 75.7 | 73.9 | 67.8 | 56.3 | 66.8 | 62.3 | 56.9 | 66.3 | 14 |
| Model I - 7x7 | 76.0 | 73.3 | 66.4 | 54.0 | 63.4 | 59.2 | 52.2 | 64.3 | 10 |

Table 1: This table shows results for experiments with the two filter sizes on PoseTrack 2017 validation set.

Video Mode / Depth of First Stage: Next, we report results when training in Image Mode (Im) using single images, and when we continue training beyond images while exposing the network to videos and augmentation with synthetic motion in Video Mode (Vid). During testing, the network is run recurrently on video sequences with one frame per stage. Model II is deployed for these experiments. We find that by exposing the network to video sequences for iterations, we were able to boost the mAP as seen in Table 2 and Fig. 5. We also find that if we use the same depth, i.e., number of channels, for the first frame as for the other frames (128-128), the network was not able to generalize well to recurrent execution (56.6 mAP) when trained with Image Mode. When reducing the depth for the first frame to one-half, i.e., (64-128), we found that the generalization to videos was better (62.6 mAP). When trained with Video Mode, mAP increased further to 64.1. We reason that the 64-depth modules produced relatively vague outputs, which gave sufficient room for the subsequent modules in the following frames to process and refine the heatmaps, yielding a boost in performance. Furthermore, this also highlights the importance of incorporating shot change detection and running the first stage at each shot change.

| Method | Hea | Sho | Elb | Wri | Hip | Kne | Ank | mAP | fps |
|---|---|---|---|---|---|---|---|---|---|
| Im - 7x7 - 128-128 | 74.6 | 69.6 | 55.5 | 40.2 | 56.4 | 47.2 | 44.0 | 56.6 | 27 |
| Vid - 7x7 - 128-128 | 76.2 | 71.6 | 64.5 | 51.9 | 62.6 | 59.3 | 52.5 | 63.6 | 27 |
| Im - 7x7 - 64-128 | 73.5 | 72.2 | 63.8 | 52.1 | 62.7 | 57.3 | 51.1 | 62.6 | 27 |
| Vid - 7x7 - 64-128 | 75.8 | 73.4 | 65.5 | 53.8 | 64.2 | 58.4 | 51.4 | 64.1 | 27 |
| Im - 3x3 - 64-128 | 73.5 | 72.5 | 65.0 | 52.7 | 63.7 | 57.7 | 53.2 | 63.4 | 35 |
| Vid - 3x3 - 64-128 | 75.4 | 73.2 | 67.4 | 55.0 | 63.9 | 58.4 | 53.5 | 64.6 | 35 |

Table 2: This table shows single-scale performance using Model II before and after training with videos, filter sizes, as well as different depths for first stage.

Figure 5: Improvement in quality of heatmaps before (a,c) and after (b,d) the network is exposed to videos and synthetic motion augmentation. We observe better peaks and less noise across both PAF and keypoint heatmaps.

Effect of Camera Frame Rate on mAP: For these experiments, we studied how the frame rate of the camera and number of stages affect the accuracy of pose estimation. With a high frame rate, the apparent motion between frames is smooth, while relatively abrupt with a low frame-rate camera. Therefore, the heatmaps from previous frames would not be as useful at low frame rates. We tested this hypothesis considering Model I (five stages on the same modules without ingesting previous heatmaps), and Model II (different number of stages with each ingesting heatmaps from previous frame). We also evaluate the influence of training with Image and Video modes in Figure 6.

Fig. 6(a) shows results on a subset of ten sequences where the human subjects comprised at least 30% of the frame height in the PoseTrack 2017 validation set. Fig. 6(b) presents results on the entire validation set. The original videos were assumed to run at the film-standard 24 Hz, hence we ran experiments by varying frame rates at 24, 12 and 6 Hz through sub-sampling. The ground truth has been annotated at 6 Hz. As expected, accuracy increases with video frame rate and number of stages. When Model II was trained in Image Mode, we observed small increments in accuracy until, at four stages, it peaks at the same level as Model I. Upon training with Video Mode, it surpasses this accuracy, peaking earlier at two stages.
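The frame-rate sub-sampling used in this experiment can be expressed as a strided slice. The helper name `subsample` is an assumption; the sketch only covers target rates that divide the source rate evenly, as is the case for 24, 12 and 6 Hz.

```python
def subsample(frames, source_hz=24, target_hz=6):
    """Keep every (source_hz // target_hz)-th frame of a sequence."""
    if target_hz <= 0 or source_hz % target_hz != 0:
        raise ValueError("target_hz must be a positive divisor of source_hz")
    return frames[:: source_hz // target_hz]

one_second = list(range(24))           # frame indices of one second at 24 Hz
at_12 = subsample(one_second, 24, 12)  # every 2nd frame
at_6 = subsample(one_second, 24, 6)    # every 4th frame
```

With aligned sub-sampling, every frame kept at 6 Hz is also present at 12 Hz and 24 Hz, so the same annotated frames can be evaluated at all three rates.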

When considering the entire validation set, the approach is still able to reap the benefits of more stages and training in Video Mode as can be seen in Fig. 6(b). However, it was barely able to reach the accuracy of the much slower Model I. For the validation set, the accuracy was harmed when including sequences with smaller apparent size of humans. These sequences usually were more crowded as well and passing in the previous heatmaps seemed to hurt the performance. The body parts of small-sized humans only occupied a few pixels in the heatmap and the normalized direction vectors were inconsistent and random across frames.

Figure 6: These graphs show mAP curves as a function of frame rates of camera, i.e., the rate at which an original 24Hz video is input to the method. The flat black line shows the performance of five-stage Model I, while ‘*’ in the legend indicates training using Image Mode only.

Influence of Topology Type in Tracking: Next, we report the ablation experiments on tracking performance evaluated using the Multiple Object Tracking Accuracy (MOTA) metric in Table 3. We evaluate results using Topology A and B from Fig. 3(a) and 3(b), respectively, both with Models I and II, and found an improvement in tracking using limb TAFs in Topology B versus keypoint TAFs in Topology A. As highlighted in Fig. 1, Topology A does not have associative properties when a keypoint has minimal motion or when a new person appears. We enforced a proximity assumption across time, namely that keypoints less than 2 pixels apart should be associated, and adjusted it according to scale (similar to [7]). This still resulted in false positives, since it is difficult to disambiguate between a newly detected person and some nearby stationary person. Furthermore, where the motion of a person tended to be small, Topology A resulted in very jittery and noisy vectors, causing more reliance on pixel distances. This was further exacerbated by recurrence, where accumulation of noisy vectors from previous frame heatmaps deteriorated the associative ability of Temporal Affinity Fields.

Topology B solves all of these problems elegantly. The longer cross-linked limb TAF connections preserve information even in the absence of motion or appearance of new people since the TAF effectively collapses to a PAF in such cases. This allows us to avoid association heuristics and makes the problem of new person identification trivial. With this representation, recurrence was observably beneficial due to true and consistent representation irrespective of magnitude of motion. As a side-advantage, this also allowed us to warm-start the TAF input with PAF providing more reliable initialization for tracking in the first frame.

For Model III, training beyond 5000 iterations gradually begins to harm the accuracy of the pose estimation, resulting in reduced tracking performance as well. This is primarily due to the disparity in the amount of diverse data between the COCO/MPII and PoseTrack datasets. For Model II, if we train the keypoint and PAF modules, lock their weights afterwards, and then train only the TAF, this results in better performance with a significant boost in speed as well. Model I, with five stages for keypoints and PAFs and a single recurrent stage for TAF, outperformed the other models, but this comes at the expense of speed. Furthermore, we observe that an increase in mAP ends up sub-linearly increasing the MOTA as well.

| Method | Wrist-AP | Ankles-AP | mAP | MOTA | fps |
|---|---|---|---|---|---|
| Model I-A | 56.2 | 56.4 | 66.0 | 58.5 | 14 |
| Model I-B | 56.3 | 56.9 | 66.3 | 59.4 | 13 |
| Model II-A | 54.9 | 53.0 | 64.4 | 57.4 | 28 |
| Model II-B | 55.0 | 53.5 | 64.6 | 58.4 | 27 |
| Model III-B | 51.9 | 49.5 | 61.6 | 57.8 | 30 |

Table 3: This table shows pose estimation and tracking performance for combinations of model types and topologies.
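For readers unfamiliar with the MOTA column, the sketch below computes it using the standard CLEAR MOT definition, which penalizes misses, false positives and identity switches relative to the number of ground-truth objects. The function name and the example counts are illustrative assumptions; how the benchmark aggregates the metric across keypoints and sequences is not shown here.

```python
def mota(false_negatives, false_positives, id_switches, num_ground_truth):
    """Multiple Object Tracking Accuracy: 1 - (FN + FP + IDSW) / GT."""
    if num_ground_truth <= 0:
        raise ValueError("num_ground_truth must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_ground_truth

# Hypothetical counts: 10 misses, 5 false positives, 5 identity switches over 100 targets.
example = mota(10, 5, 5, 100)
```

Since detection errors (misses and false positives) enter the numerator directly, better pose estimation tends to raise MOTA as well, and the score can become negative when errors outnumber ground-truth targets.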

Effect of Video Rate and Number of People on Tracking: Finally, we performed a study on how the frame rate of the camera affects tracking accuracy, since a lower frame rate would require longer associations in pixel space.

We ran Lucas-Kanade (LK) as a baseline tracker by replacing the TAF module in Model I with LK (21 × 21 window and pyramid levels). Initially, we observe that there is a roughly improvement in MOTA as seen in Fig. 7(a). However, on closer inspection we note that around 20% of the sequences have significant articulation and camera movement, where TAF outperformed LK: LK was not able to match keypoints across large displacements, whereas TAF found matches due to its stronger descriptive power. TAF was able to maintain tracking accuracy even with low frame-rate cameras, but with LK the MOTA drops off significantly (see Fig. 7(a)). Furthermore, Fig. 7(b) suggests that our approach is nearly run-time invariant to the number of people in the frame, making it suitable for crowded scenes.

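Figure 7: (a) This graph shows MOTA as a function of video frame rate for Temporal Affinity Fields (TAF) and Lucas-Kanade (LK) tracker. The performance of TAF is virtually invariant to frame rate or alternatively to the amount of motion between frames. (b) Our approach is effectively run-time invariant to the number of people in the scene.

| Set | Method | Wrist-AP | Ankles-AP | mAP | MOTA | fps |
|---|---|---|---|---|---|---|
| PoseTrack 2017 Validation | Detect-and-track [8] | 51.7 | 49.8 | 60.6 | 55.2 | 1.2 |
| PoseTrack 2017 Validation | FlowTrack - 152 [36] | 72.4 | 67.1 | 76.7 | 65.4 | - |
| PoseTrack 2017 Validation | FlowTrack - 50 [36] | 66.0 | 61.7 | 72.4 | 62.9 | - |
| PoseTrack 2017 Validation | MDPN - 152 [11] | 77.5 | 71.4 | 80.7 | 66.0 | - |
| PoseTrack 2017 Validation | PoseFlow [37] | 61.1 | 61.3 | 66.5 | 58.3 | 10* |
| PoseTrack 2017 Validation | JointFlow [7] | - | - | 69.3 | 59.8 | 0.2 |
| PoseTrack 2017 Validation | Model II-B (SS) | 55.0 | 53.5 | 64.6 | 58.4 | 27 |
| PoseTrack 2017 Validation | Model I-B (SS) | 56.8 | 56.8 | 66.3 | 59.4 | 13 |
| PoseTrack 2017 Validation | Model II-B (MS) | 62.9 | 60.9 | 71.5 | 61.3 | 7 |
| PoseTrack 2017 Validation | Model I-B (MS) | 65.0 | 62.7 | 72.6 | 62.7 | 2 |
| PoseTrack 2017 Testing | Detect-and-track [8] | - | - | 59.6 | 51.8 | 1.2 |
| PoseTrack 2017 Testing | FlowTrack - 152 [36] | 70.7 | 64.9 | 73.9 | 57.6 | - |
| PoseTrack 2017 Testing | FlowTrack - 50 [36] | 65.1 | 60.3 | 70.0 | 56.4 | - |
| PoseTrack 2017 Testing | PoseTrack [2] | 54.3 | 49.2 | 59.4 | 48.4 | - |
| PoseTrack 2017 Testing | BUTD [18] | 52.9 | 42.6 | 59.1 | 50.6 | - |
| PoseTrack 2017 Testing | PoseFlow [37] | 59.0 | 57.9 | 63.0 | 51.0 | 10* |
| PoseTrack 2017 Testing | JointFlow [7] | 53.1 | 50.4 | 63.3 | 53.1 | 0.2 |
| PoseTrack 2017 Testing | Model II-B (MS) | 62.8 | 59.5 | 69.6 | 52.4 | 7 |
| PoseTrack 2017 Testing | Model I-B (MS) | 65.0 | 60.7 | 70.3 | 53.8 | 2 |
| PoseTrack 2018 Validation | Model II-B (SS) | 56.2 | 54.2 | 63.7 | 58.4 | 27 |
| PoseTrack 2018 Validation | Model I-B (SS) | 58.3 | 56.7 | 64.9 | 59.6 | 13 |
| PoseTrack 2018 Validation | Model II-B (MS) | 62.7 | 60.6 | 69.9 | 59.8 | 7 |
| PoseTrack 2018 Validation | Model I-B (MS) | 64.7 | 62.0 | 70.4 | 60.9 | 3 |

Table 4: This table shows comparison on the PoseTrack dataset. For our approach, we report results with Models I / II and Topology B. ‘SS’ and ‘MS’ refer to single and multiple scales, respectively. The last column shows the speed in frames per second (* excludes pose inference time). FlowTrack is a top-down approach that uses ResNet with 152 and 50 layers; whereas JointFlow, PoseFlow and our approach are bottom-up.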

4.2 Comparison

We present results on the PoseTrack dataset in Table 4 for the 2017 validation set (top), 2017 test set (middle) and 2018 validation set (bottom). FlowTrack, JointFlow and PoseFlow are included for comparison in this table. FlowTrack is a top-down approach, which means human detection is performed first, followed by pose estimation. For this reason, it is significantly slower than bottom-up approaches such as ours. Model II-B with single scale is competitive with other bottom-up approaches while running 2.7 times as fast as the fastest of them (27 fps versus 10 fps for PoseFlow, whose figure excludes pose inference time). Multi-scale (MS) processing boosts performance by 6% and 1.5% for mAP and MOTA, respectively, at the cost of speed. We are also able to achieve competitive results on the PoseTrack 2018 validation set while maintaining the best speeds amongst all reported results. Note that the PoseTrack 2018 test set had not been released to the public at the time of submission of this paper. Figure 8 shows some qualitative results.

Figure 8: Three example cases of tracking at 30 FPS on multiple targets. Top/Middle: Observe that tracking continues to function despite large motion displacements and occlusions. Bottom: A failure case where abrupt scene change causes ghosting, where previous scene person appears in new frame. Note that small targets also take several stages to fully appear, hence a warm start will be required during shot change.

5 Conclusion

In this paper, we first outlined why recurrent Spatio-Temporal Affinity Fields (STAFs) are the right approach for detection and tracking of articulated human pose in videos, especially for real-time reactive systems. We demonstrated this by showing that leveraging the previous frame data within a recurrent structure and training on video sequences yields results as good as a multi-stage network, albeit at much lower computation cost. We also demonstrated the stability of tracking accuracy at reduced frame rates for the TAF formulation, due to its ability to correlate keypoints over large pixel distances. This implies that our method can be deployed on low-power embedded systems that may not be able to run large networks at high frame rates, yet are able to maintain reasonable accuracy. We also demonstrated that our new cross-linked limb temporal topology is able to generalize better than previous approaches due to its strong associative power, with the PAF being a special case of the TAF. We are also able to operate at the same consistent speed irrespective of the number of people, due to the bottom-up formulation. Our method is currently the most efficient and the best bottom-up fully online detection and tracking approach for articulated human poses. For future work, we plan to embed a re-identification module to handle cases of people leaving the camera view for a long time and reappearing later. Furthermore, detecting shot changes and triggering a warm start at every shot change has the potential to boost pose estimation and tracking performance.

Notes: More details on algorithms are in the supplementary document. SMPL [22] used for figures. The runtime code for this will be released into OpenPose [13] in the near future.