LightTrack: A Generic Framework for Online Top-Down Human Pose Tracking
In this paper, we propose a novel, effective, light-weight framework, called LightTrack, for online human pose tracking. The proposed framework is designed to be generic for top-down pose tracking and is faster than existing online and offline methods. Single-person Pose Tracking (SPT) and Visual Object Tracking (VOT) are incorporated into one unified functioning entity, easily implemented by a replaceable single-person pose estimation module. Our framework unifies single-person pose tracking with multi-person identity association and sheds first light upon bridging keypoint tracking with object tracking. We also propose a Siamese Graph Convolution Network (SGCN) for human pose matching as a Re-ID module in our pose tracking system. In contrast to other Re-ID modules, we use a graphical representation of human joints for matching. The skeleton-based representation effectively captures human pose similarity and is computationally inexpensive. It is robust to sudden camera shift that introduces human drifting. To the best of our knowledge, this is the first paper to propose an online human pose tracking framework in a top-down fashion. The proposed framework is general enough to fit other pose estimators and candidate matching mechanisms. Our method outperforms other online methods while maintaining a much higher frame rate, and is very competitive with our offline state-of-the-art. We make the code publicly available at: https://github.com/Guanghan/lighttrack.
Pose tracking is the task of estimating multi-person human poses in videos and assigning unique instance IDs for each keypoint across frames. Accurate estimation of human keypoint trajectories is useful for human action recognition, human interaction understanding, motion capture, animation, etc. Recently, the publicly available PoseTrack dataset [18, 3] and MPII Video Pose dataset have pushed research on human motion analysis one step closer to real-world scenarios. Two PoseTrack challenges have been held. However, most existing methods are offline and hence lack the potential to be real-time. More emphasis has been put on the Multi-Object Tracking Accuracy (MOTA) criterion than on the Frames Per Second (FPS) criterion. Existing offline methods divide the tasks of human detection, candidate pose estimation, and identity association into sequential stages. In this procedure, multi-person poses are first estimated across the frames of a video. Based on the pose estimation results, the pose tracking outputs are then computed by solving an optimization problem. This requires the poses of future frames to be pre-computed, or at least those of the frames within some range.
In this paper, we propose a novel, effective, light-weight framework for pose tracking. It is designed to be generic, top-down (i.e., pose estimation is performed after candidates are detected), and truly online. The proposed framework unifies single-person pose tracking with multi-person identity association. It sheds first light on bridging keypoint tracking with object tracking. To the best of our knowledge, this is the first paper to propose an online pose tracking framework in a top-down fashion. The proposed framework is general enough to fit other pose estimators and candidate matching mechanisms. Thus, if an individual component is improved in the future, our framework will become faster and/or more accurate.
In contrast to Visual Object Tracking (VOT) methods, in which the visual features are implicitly represented by kernels or CNN feature maps, we track each human pose by recursively updating the bounding box and its corresponding pose in an explicit manner. The bounding box region of a target is inferred from the explicit features, i.e., the human keypoints. Human keypoints can be considered as a series of special visual features. The advantages of using pose as explicit features include: (1) The explicit features are human-related and interpretable, and have a very strong and stable relationship with the bounding box position. Human pose enforces a direct constraint on the bounding box region. (2) The task of pose estimation and tracking requires human keypoints to be predicted in the first place. Re-using the predicted keypoints to track the ROI is therefore efficient and almost free. This mechanism makes online tracking possible. (3) It naturally keeps the identity of the candidates, which greatly alleviates the burden of data association in the system. Even when data association is necessary, we can re-use the pose features for skeleton-based pose matching. Single Pose Tracking (SPT) and Single Visual Object Tracking (VOT) are thus incorporated into one unified functioning entity, easily implemented by a replaceable single-person human pose estimation module.
Our contributions are three-fold: (1) We propose a general online pose tracking framework that is suitable for top-down approaches to human pose estimation. Both the human pose estimator and the Re-ID module are replaceable. In contrast to Multi-Object Tracking (MOT) frameworks, our framework is specially designed for the task of pose tracking. To the best of our knowledge, this is the first paper to propose an online human pose tracking system in a top-down fashion. (2) We propose a Siamese Graph Convolution Network (SGCN) for human pose matching as a Re-ID module in our pose tracking system. Different from existing Re-ID modules, we use a graphical representation of human joints for matching. The skeleton-based representation effectively captures human pose similarity and is computationally inexpensive. It is robust to sudden camera shift that introduces human drifting. (3) We conduct extensive experiments with various settings and ablation studies. Our proposed online pose tracking approach outperforms existing online methods and is competitive with offline state-of-the-art methods while running at much higher frame rates. We make the code publicly available to facilitate future research.
Human Pose Estimation (HPE) has seen rapid progress with the emergence of CNN-based methods [34, 31, 39, 21]. The most widely used datasets, e.g., MPII and LSP, are saturated with methods that achieve 90% and higher accuracy. Multi-person human pose estimation is more realistic and challenging, and has received increasing attention with the hosting of the COCO keypoints challenges since 2017. Existing methods can be classified into top-down and bottom-up approaches. The top-down approaches [14, 32, 15] rely on a detection module to obtain human candidates and then apply single-person pose estimation to locate human keypoints. The bottom-up methods [6, 35, 30] detect human keypoints from all potential candidates and then assemble these keypoints into human limbs for each individual based on various data association techniques. The advantage of bottom-up approaches is their excellent trade-off between estimation accuracy and computational cost, because the cost is nearly invariant to the number of human candidates in the image. In contrast, the advantage of top-down approaches is their capability to decompose the task into multiple comparatively easier tasks, i.e., object detection and single-person pose estimation. The object detector is good at detecting hard (usually small) candidates, so the pose estimator performs better with a focused regression space. Pose tracking is a new topic that was primarily introduced by the PoseTrack dataset [18, 3] and the MPII Video Pose dataset. The task is to estimate human keypoints and assign unique IDs to each keypoint at the instance level across frames in videos. A typical top-down but offline method transforms pose tracking into a minimum cost multi-cut problem with a graph partitioning formulation.
Earlier works in object detection regress visual features into bounding box coordinates. HPE, on the other hand, usually regresses visual features into heatmaps, each channel representing a human joint. Recently, research in HPE has inspired many works on object detection [40, 22, 28]. These works predict heatmaps for a set of special keypoints to infer detection results (bounding boxes). Based on this motivation, we propose to predict human keypoints to infer bounding box regions. Human keypoints are a special set of keypoints to represent detection of the human class only.
MOT aims to estimate trajectories of multiple objects by finding target locations while maintaining their identities across frames. Offline methods use both past and future frames to generate trajectories, while online methods only exploit information that is available up to the current frame. An online MOT pipeline was presented that applies a single object tracker to keep tracking each target, given the target detections in each frame. The target state is set as tracked until the tracking result becomes unreliable. The target is then regarded as lost, and data association is performed to compute the similarity between the tracklet and the detections. Our proposed online pose tracking framework also tracks each target (with its corresponding keypoints) individually while keeping their identities, and performs data association when a target is lost. However, our framework is distinct in several aspects: (a) detections are generated by the object detector only at key frames and are therefore not necessarily provided at each frame; they can be provided sparsely; (b) the single object tracker is actually a pose estimator that predicts keypoints based on an enlarged region.
Recent work has studied how to effectively model dynamic skeletons with a specially tailored graph convolution operation. The graph convolution operation turns human skeletons into a spatio-temporal representation of human actions. Inspired by this work, we propose to employ a GCN to encode the spatial relationships among human joints into a latent representation of human pose. The representation aims to robustly encode the pose in a way that is invariant to human location or view angle. We measure the similarity of such encodings to match human poses.
We propose a novel top-down pose tracking framework. It has been shown that human pose can be employed for better inference of human locations. We observe that, in a top-down approach, accurate human locations also ease the estimation of human poses. We further study the relationships between these two levels of information: (1) Coarse person location can be distilled into body keypoints by a single-person pose estimator. (2) The position of human joints can be straightforwardly used to indicate rough locations of human candidates. (3) Thus, recurrently estimating one from the other is a feasible strategy for Single-person Pose Tracking (SPT).
However, it is not a good idea to merely treat the Multi-target Pose Tracking (MPT) problem as a repeated SPT problem for multiple individuals, because certain constraints need to be met: e.g., in a given frame, two different IDs should not belong to the same person, nor should two candidates share the same identity. A better way is to track multiple individuals simultaneously and preserve/update their identities with an additional Re-ID module. The Re-ID module is essential because it is usually hard to maintain correct identities throughout a video: individual poses are unlikely to be tracked reliably across all frames. For instance, identities have to be updated in the following scenarios: (1) some people disappear from the camera view or get occluded; (2) new candidates come in or previous candidates re-appear; (3) people walk across each other (two identities may merge into one if not treated carefully); (4) tracking fails due to fast camera shifting or zooming.
In our method, we first treat each human candidate separately such that their corresponding identity is kept across the frames. In this way, we circumvent the time-consuming offline optimization procedure. In case the tracked candidate is lost due to occlusion or camera shift, we then call the detection module to revive candidates and associate them to the tracked targets from the previous frame via pose matching. In this way, we accomplish multi-target pose tracking with an SPT module and a pose matching module.
Specifically, the bounding box of the person in the upcoming frame is inferred from the joints estimated by the pose module in the current frame. We find the minimum and maximum joint coordinates and enlarge the resulting ROI by 20% on each side. The enlarged bounding box is treated as the localized region for this person in the next frame. If the average confidence score $\bar{s}$ of the estimated joints is lower than the threshold $\tau_s$, the target is considered lost, since the joints are unlikely to appear within the bounding box region. The state of the target is defined as:
$$\text{state} = \begin{cases} \text{tracked}, & \text{if } \bar{s} > \tau_s, \\ \text{lost}, & \text{otherwise}. \end{cases} \quad (1)$$
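The box propagation and lost-target test described above can be illustrated with a minimal Python sketch. The function names, the `(x, y, confidence)` keypoint layout, and the reading of "20% on each side" as padding by 20% of the box width and height are assumptions made for illustration; they are not taken from the released code.

```python
from typing import List, Tuple

Keypoint = Tuple[float, float, float]    # (x, y, confidence), assumed layout
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def next_frame_box(keypoints: List[Keypoint], enlarge: float = 0.2) -> Box:
    """Tight box around the joints, padded by `enlarge` of its size on each side."""
    xs = [x for x, _, _ in keypoints]
    ys = [y for _, y, _ in keypoints]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    pad_x = enlarge * (x_max - x_min)
    pad_y = enlarge * (y_max - y_min)
    return (x_min - pad_x, y_min - pad_y, x_max + pad_x, y_max + pad_y)


def target_state(keypoints: List[Keypoint], score_threshold: float) -> str:
    """'tracked' if the mean joint confidence exceeds the threshold, else 'lost'."""
    mean_score = sum(c for _, _, c in keypoints) / len(keypoints)
    return "tracked" if mean_score > score_threshold else "lost"
```

In a tracking loop, `next_frame_box` would supply the crop for the pose estimator on the next frame, and a `"lost"` state would trigger re-detection.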
If the target is lost, we have two modes: (1) Fixed Keyframe Interval (FKI) mode. The target is neglected until the next scheduled keyframe, where the detection module re-generates the candidates and then associates their IDs with the tracking history. (2) Adaptive Keyframe Interval (AKI) mode. The missing target is revived immediately by candidate detection and identity association. The advantage of FKI mode is that the frame rate of pose tracking is stable due to the fixed interval of keyframes. The advantage of AKI mode is that the average frame rate can be higher for non-complex videos. In our experiments, we combine the two by taking keyframes at fixed intervals while also calling the detection module as soon as a target is lost before the arrival of the next scheduled keyframe. The tracking accuracy is higher because a lost target is handled immediately.
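The combined keyframe policy can be sketched as a small scheduling function. The name `detection_frames` and its arguments are hypothetical; the sketch assumes frame indices start at 0, that frame 0 is a keyframe, and that the frames on which a target is lost are known in advance, which is only true in a simulation.

```python
from typing import Iterable, List


def detection_frames(num_frames: int, keyframe_interval: int,
                     lost_frames: Iterable[int]) -> List[int]:
    """Frames on which the detector runs under the combined FKI + AKI policy."""
    lost = set(lost_frames)
    return [
        t for t in range(num_frames)
        # Scheduled keyframe (FKI) or a target was lost on this frame (AKI).
        if t % keyframe_interval == 0 or t in lost
    ]
```

With a keyframe interval of 3 and a target lost on frame 4, the detector would run on frames 0, 3, 4 and 6 of an 8-frame clip; on all other frames only the pose estimator is called.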
For identity association, we propose to consider two complementary sources of information: spatial consistency and pose consistency. We first rely on spatial consistency, i.e., if two bounding boxes from the current and the previous frames are adjacent, or their Intersection Over Union (IOU) is above a certain threshold, we consider them to belong to the same target. Specifically, we set the matching flag $m(t_k, d_k)$ to $1$ if the maximum IOU overlap ratio $o(t_k, D_k)$ between the tracked target $t_k$ and the corresponding detection $d_k \in D_k$ for key-frame $k$ is higher than the threshold $\tau_o$. Otherwise, $m(t_k, d_k)$ is set to $0$:
$$m(t_k, d_k) = \begin{cases} 1, & \text{if } o(t_k, D_k) > \tau_o, \\ 0, & \text{otherwise}. \end{cases} \quad (2)$$
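As an illustration of the spatial-consistency test, the following sketch computes the IOU of axis-aligned boxes and the resulting matching flag. The function names and the `(x_min, y_min, x_max, y_max)` box layout are assumptions for this example.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def matching_flag(track_box: Box, detections: Sequence[Box],
                  iou_threshold: float) -> int:
    """1 if the best-overlapping detection exceeds the threshold, else 0."""
    best = max((iou(track_box, d) for d in detections), default=0.0)
    return 1 if best > iou_threshold else 0
```

A flag of 0 means spatial consistency alone cannot associate the target, which is the case where pose matching would be consulted.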
The above criterion is based on the assumption that the tracked target from the previous frame and the actual location of the target in the current frame overlap significantly, which is true in most cases. However, this assumption is not always reliable, especially when the camera shifts swiftly. In such cases, we need to match the new observation to the tracked candidates. In Re-ID problems, this is usually accomplished by a visual feature classifier. However, visually similar candidates with different identities may confuse such classifiers. Extracting visual features can also be computationally expensive in an online tracking system. Therefore, we design a Graph Convolution Network (GCN) to leverage the graphical representation of the human joints. We observe that in two adjacent frames, the location of a person may drift away due to a sudden camera shift, but the human pose stays almost the same, as people usually cannot move that fast, as illustrated in Fig. 2. Consequently, the graph representation of human skeletons can be a strong cue for candidate matching, which we refer to as pose matching in the following text.
Given the sequences of body joints in the form of 2D coordinates, we construct a spatial graph with the joints as graph nodes and the connectivities in the human body structure as graph edges. The input to our graph convolutional network is the joint coordinate vectors on the graph nodes. This is analogous to image-based CNNs, where the input is formed by pixel intensity vectors residing on the 2D image grid. Multiple graph convolutions are performed on the input to generate a feature representation vector as a conceptual summary of the human pose. It inherently encodes the spatial relationships among the human joints. The input to the siamese networks is therefore a pair of inputs to the GCN network. The distance between the two output features represents how similar two poses are to each other. Two poses are called a match if they are conceptually similar. The network is illustrated in Fig. 3. The siamese network consists of GCN layers and convolutional layer using contrastive loss. We take normalized keypoint coordinates as input; the output is a dimensional feature vector. The network is optimized with contrastive loss because we want it to generate feature representations that are close enough for positive pairs and at least a minimum distance apart for negative pairs. We employ the margin contrastive loss:
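$$\mathcal{L}(p_j, p_k, y_{jk}) = \frac{1}{2}\, y_{jk}\, D^2 + \frac{1}{2}\,(1 - y_{jk}) \max(0,\ \epsilon - D^2), \quad (3)$$
where $D = \lVert f(p_j) - f(p_k) \rVert_2$ is the Euclidean distance of two $\ell_2$-norm normalized latent representations, $y_{jk} \in \{0, 1\}$ indicates whether $p_j$ and $p_k$ are the same pose or not, and $\epsilon$ is the minimum distance margin that pairs depicting different poses should satisfy.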
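A minimal sketch of a margin contrastive loss for one pair of embeddings follows. It assumes the common form in which positive pairs are penalized by half their squared distance and negative pairs by half the amount their squared distance falls short of the margin; the function names and the plain-list feature format are illustrative, and a training implementation would use a tensor library instead.

```python
import math
from typing import Sequence


def l2_normalize(v: Sequence[float]) -> list:
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def contrastive_loss(feat_a: Sequence[float], feat_b: Sequence[float],
                     same_pose: bool, margin: float) -> float:
    """Margin contrastive loss on L2-normalized embeddings of one pair."""
    a, b = l2_normalize(feat_a), l2_normalize(feat_b)
    squared_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    if same_pose:
        return 0.5 * squared_dist                     # pull positives together
    return 0.5 * max(0.0, margin - squared_dist)      # push negatives past the margin
```

Because the embeddings are normalized, the squared distance lies between 0 and 4, which bounds the useful range of the margin.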
Graph Convolution for Skeleton:
For standard 2D convolution on natural images, the output feature maps can have the same size as the input feature maps with stride 1 and appropriate padding. Similarly, the graph convolution operation is designed to output graphs with the same number of nodes. The dimensionality of the attributes of these nodes, which is analogous to the number of feature map channels in standard convolution, may change after the graph convolution operation.
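The standard convolution operation is defined as follows: given a convolution operator with a kernel size of $K \times K$ and an input feature map $f_{in}$ with $c$ channels, the output value of a single channel at the spatial location $\mathbf{x}$ can be written as:
$$f_{out}(\mathbf{x}) = \sum_{h=1}^{K} \sum_{w=1}^{K} f_{in}(\mathbf{p}(\mathbf{x}, h, w)) \cdot \mathbf{w}(h, w), \quad (4)$$
where the sampling function $\mathbf{p}: Z^2 \times Z^2 \to Z^2$ enumerates the neighbors of location $\mathbf{x}$. The weight function $\mathbf{w}: Z^2 \to \mathbb{R}^c$ provides a weight vector in $c$-dimensional real space for computing the inner product with the sampled input feature vectors of dimension $c$.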
The convolution operation on graphs is defined by extending the above formulation to the case where the input feature map resides on a spatial graph $V$, i.e., the feature map $f_{in}: V \to \mathbb{R}^c$ has a vector on each node of the graph. The next step of the extension is to re-define the sampling function $\mathbf{p}$ and the weight function $\mathbf{w}$. We follow a previously proposed method: for each node, only its adjacent nodes are sampled. The neighbor set for node $v_i$ is $B(v_i) = \{v_j \mid d(v_j, v_i) \le 1\}$, where $d(v_j, v_i)$ is the length of the shortest path between the two nodes, so the set contains $v_i$ itself and its directly connected joints. The sampling function $\mathbf{p}: B(v_i) \to V$ can be written as $\mathbf{p}(v_i, v_j) = v_j$. In this way, the number of adjacent nodes is not fixed, nor is the weighting order. In order to have a fixed number of samples and a fixed order of weighting them, we label the neighbor nodes around the root node with a fixed number of partitions, and then weight these nodes based on their partition class. The specific partitioning method is illustrated in Fig. 4.
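Therefore, Eq. (4) for graph convolution is re-written as:
$$f_{out}(v_i) = \sum_{v_j \in B(v_i)} \frac{1}{Z_i(v_j)}\, f_{in}(\mathbf{p}(v_i, v_j)) \cdot \mathbf{w}(v_i, v_j), \quad (5)$$
where $l_i(v_j)$ denotes the partition label of neighbor $v_j$ with respect to the root node $v_i$, and the normalization term $Z_i(v_j) = |\{v_k \mid l_i(v_k) = l_i(v_j)\}|$, the cardinality of the partition subset that contains $v_j$, balances the contributions of different subsets to the output. According to the partition method mentioned above, we have:
$$l_i(v_j) = \begin{cases} 0, & \text{if } r_j = r_i, \\ 1, & \text{if } r_j < r_i, \\ 2, & \text{if } r_j > r_i, \end{cases} \quad (6)$$
where $r_i$ is the average distance from the gravity center to joint $i$ over all frames in the training set.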
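To make the partitioned graph convolution concrete, here is a small pure-Python sketch of a single layer. It assumes three partition classes (same distance to the gravity center as the root, closer, farther), one weight matrix per class, and normalization by the size of each neighbor's partition subset. The names `partition_label` and `graph_conv`, the list-based data layout, and the tolerance used to compare distances are assumptions for illustration; learnable edge importance weighting is omitted.

```python
from typing import List, Sequence


def partition_label(r_root: float, r_neighbor: float, tol: float = 1e-9) -> int:
    """0: same distance to the gravity center as the root, 1: closer, 2: farther."""
    if abs(r_neighbor - r_root) <= tol:
        return 0
    return 1 if r_neighbor < r_root else 2


def graph_conv(features: Sequence[Sequence[float]],
               neighbors: Sequence[Sequence[int]],
               radius: Sequence[float],
               weights: Sequence[Sequence[Sequence[float]]]) -> List[List[float]]:
    """One graph convolution layer.

    features[j]  : input vector (length C_in) on node j
    neighbors[i] : indices in the neighbor set of node i (including i itself)
    radius[j]    : distance from the gravity center to joint j
    weights[l]   : C_out x C_in matrix shared by all neighbors with label l
    """
    c_out = len(weights[0])
    output = []
    for i, nbrs in enumerate(neighbors):
        labels = [partition_label(radius[i], radius[j]) for j in nbrs]
        acc = [0.0] * c_out
        for j, label in zip(nbrs, labels):
            z = labels.count(label)  # size of the subset that contains node j
            for o in range(c_out):
                dot = sum(w * x for w, x in zip(weights[label][o], features[j]))
                acc[o] += dot / z
        output.append(acc)
    return output
```

The output has one vector per input node, so the number of nodes is preserved while the feature dimension changes from C_in to C_out.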
In this section, we present quantitative results of our experiments. Some qualitative results are shown in Fig. 5.
PoseTrack  is a large-scale benchmark for human pose estimation and articulated tracking in videos. It provides publicly available training and validation sets as well as an evaluation server for benchmarking on a held-out test set. The benchmark is a basis for the challenge competitions at ICCV’17  and ECCV’18  workshops. The dataset consisted of over frames for the ICCV’17 challenge and is extended to twice as many frames for the ECCV’18 challenge. It now includes training videos, validation videos and testing videos. For held-out test set, at most four submissions per task can be made for the same approach. Evaluation on validation set has no submission limit. Therefore, ablation studies in Section 4.4 are performed on the validation set. Since PoseTrack’18 test set is not open yet, we compare our results with other approaches in Sec. 4.5 on PoseTrack’17 test set.
The evaluation covers both pose estimation accuracy and pose tracking accuracy. Pose estimation accuracy is evaluated using the standard mAP metric, whereas pose tracking is evaluated according to the CLEAR MOT metrics, which are the standard for evaluating multi-target tracking.
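For readers unfamiliar with the tracking metric, the general CLEAR MOT definition of MOTA is one minus the ratio of accumulated errors (misses, false positives, identity switches) to the number of ground-truth objects. The sketch below illustrates that general formula only; the function name and per-frame list inputs are assumptions, and the benchmark's own evaluation code, which applies the metric to keypoints, should be used for reported numbers.

```python
from typing import Sequence


def mota(false_negatives: Sequence[int], false_positives: Sequence[int],
         id_switches: Sequence[int], num_ground_truth: Sequence[int]) -> float:
    """MOTA from per-frame error counts and per-frame ground-truth counts."""
    total_gt = sum(num_ground_truth)
    if total_gt == 0:
        raise ValueError("MOTA is undefined without ground-truth objects")
    errors = sum(false_negatives) + sum(false_positives) + sum(id_switches)
    return 1.0 - errors / total_gt
```

Note that MOTA can be negative when the number of errors exceeds the number of ground-truth objects.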
We adopt state-of-the-art key-frame object detectors trained on the ImageNet and COCO datasets. Specifically, we use pre-trained models from deformable ConvNets. We conduct experiments on the validation sets to choose the object detector with the better recall rate. For the object detectors, we compare the deformable convolution versions of the R-FCN network and of the FPN network, both with a ResNet101 backbone. The FPN feature extractor is attached to the Fast R-CNN head for detection. We compare the detection results with the ground truth based on the precision and recall rates on the PoseTrack’17 validation set. In order to eliminate redundant candidates, we drop candidates with lower likelihood. As shown in Table 2, the precision and recall of the detectors are given for various drop thresholds. Since the FPN network performs better, we choose it as our human candidate detector. During training, we infer the ground truth bounding boxes of candidates from the annotated keypoints, because the bounding box positions are not provided in the annotations of the PoseTrack’17 dataset. Specifically, we locate a bounding box from the minimum and maximum coordinates of the keypoints, and then enlarge this box by 20% both horizontally and vertically.
with slight modifications. We first train the networks on the merged dataset of PoseTrack’17 and COCO for 260 epochs. Then we finetune the network solely on PoseTrack’17 for 40 epochs in order to mitigate the inaccurate regression on head and neck. For COCO, bottom-head and top-head positions are not given. We infer these keypoints by interpolation on the annotated keypoints. We find that finetuning on the PoseTrack dataset refines the prediction of the head keypoints. During finetuning, we use the technique of online hard keypoint mining, only focusing on losses from the hardest keypoints out of the total keypoints. Pose inference is performed online with a single thread.
For the pose matching module, we train a siamese graph convolutional network with GCN layers and convolutional layer using contrastive loss. We take normalized keypoint coordinates as input; the output is a dimensional feature vector. Following prior work, we use spatial configuration partitioning as the sampling method for graph convolution and use learnable edge importance weighting. To train the siamese network, we generate training data from the PoseTrack dataset. Specifically, we extract people with the same IDs within adjacent frames as positive pairs, and extract people with different IDs within the same frame and across frames as negative pairs. Hard negative pairs only include spatially overlapped poses. The numbers of collected pairs are listed in Table 1. We train the model with batch size of for a total of epochs with SGD optimizer. Initial learning rate is set to and is decayed by at epochs of . Weight decay is .
|Hard Negative Pairs||25064||7020|
|Other Negative Pairs||241450||91228|
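The pair collection procedure described for the pose matching module can be sketched as follows. The function names and data layout (one dictionary per frame, mapping person ID to a bounding box) are hypothetical, and two choices are assumptions: cross-frame negatives are drawn from adjacent frames only, and "spatially overlapped" is interpreted as the bounding boxes having a non-empty intersection.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
Sample = Tuple[int, int]                 # (frame index, person ID)
Pair = Tuple[Sample, Sample]


def boxes_overlap(a: Box, b: Box) -> bool:
    """True if the two boxes share a region of positive area."""
    return min(a[2], b[2]) > max(a[0], b[0]) and min(a[3], b[3]) > max(a[1], b[1])


def make_pairs(frames: List[Dict[int, Box]]):
    """Split annotated people into positive, hard negative and other negative pairs."""
    positives: List[Pair] = []
    hard_negatives: List[Pair] = []
    other_negatives: List[Pair] = []

    def add_negative(sa: Sample, box_a: Box, sb: Sample, box_b: Box) -> None:
        bucket = hard_negatives if boxes_overlap(box_a, box_b) else other_negatives
        bucket.append((sa, sb))

    for t, people in enumerate(frames):
        ids = sorted(people)
        # Different IDs within the same frame.
        for n, a in enumerate(ids):
            for b in ids[n + 1:]:
                add_negative((t, a), people[a], (t, b), people[b])
        # Same or different IDs across adjacent frames.
        if t + 1 < len(frames):
            nxt = frames[t + 1]
            for a in ids:
                for b in sorted(nxt):
                    if a == b:
                        positives.append(((t, a), (t + 1, b)))
                    else:
                        add_negative((t, a), people[a], (t + 1, b), nxt[b])
    return positives, hard_negatives, other_negatives
```

Each returned pair only identifies the two samples; the corresponding keypoints would be looked up and normalized before being fed to the siamese network.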
We conducted a series of ablation studies to analyze the contribution of each component on the overall performance.
|-||Method / Thresh||0.1||0.2||0.3||0.4||0.5|
|-||Estimation (mAP)||Tracking (MOTA)|
Detectors: We experimented with several detectors and decided to use Deformable ConvNets with ResNet101 as the backbone, Feature Pyramid Networks (FPN) for feature extraction, and the Fast R-CNN scheme as the detection head. As shown in Table 2, this detector performs better than Deformable R-FCN with the same backbone. It is no surprise that the better detector results in better performance on both pose estimation and pose tracking, as shown in Table 3.
Offline vs. Online: We studied the effect of keyframe intervals on our online method and compared it with the offline method. For a fair comparison, we use an identical human candidate detector and pose estimator for both methods. For the offline method, we pre-compute human candidate detections and estimate the pose for each candidate, then adopt a flow-based pose tracker, where pose flows are built by associating poses that indicate the same person across frames. For the online method, we perform truly online pose tracking. Since human candidate detection is performed only at key frames, the online performance varies with the keyframe interval. In Table 4, we show the performance of the offline method compared with the online method at various keyframe intervals. The offline method performs better than the online methods. However, we can see the great potential of online methods when the detections (DET) at keyframes are more accurate, the upper limit of which is achieved with ground truth (GT) detections. As expected, more frequent keyframes improve performance. Note that the online methods here only use spatial consistency for data association at key frames. We report ablation experiments on the pose matching module in the following text.
|-||Estimation (mAP)||Tracking (MOTA)|
GCN vs. Spatial Consistency (SC): Next, we report results when pose matching is performed during the data association stage, compared with only employing spatial consistency. Table 5 shows that the tracking performance increases with GCN-based pose matching. However, in some situations, different people may have near-duplicate poses, as shown in Fig. 6. To mitigate such ambiguities, spatial consistency is considered prior to pose similarity.
GCN vs. Euclidean Distance (ED): We studied whether the GCN outperforms a naive pose matching scheme. With the same normalization on the keypoints, using ED as the dissimilarity metric for pose matching yields 85% accuracy on validation pairs generated from the PoseTrack dataset, while the GCN yields 92% accuracy. We validate on positive pairs and hard negative pairs.
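A naive Euclidean-distance baseline of this kind can be sketched as below. The normalization shown (shifting each pose to the origin of its own bounding box and scaling by the box width and height) is an assumption, since the exact normalization is not specified here; the function names and the decision threshold are likewise illustrative.

```python
import math
from typing import List, Sequence, Tuple

Joint = Tuple[float, float]  # (x, y)


def normalize_pose(joints: Sequence[Joint]) -> List[Joint]:
    """Map joints into the unit square spanned by the pose's own bounding box."""
    xs = [x for x, _ in joints]
    ys = [y for _, y in joints]
    x_min, y_min = min(xs), min(ys)
    width = (max(xs) - x_min) or 1.0   # avoid division by zero for degenerate poses
    height = (max(ys) - y_min) or 1.0
    return [((x - x_min) / width, (y - y_min) / height) for x, y in joints]


def pose_distance(a: Sequence[Joint], b: Sequence[Joint]) -> float:
    """Euclidean distance between two normalized poses with matching joint order."""
    na, nb = normalize_pose(a), normalize_pose(b)
    return math.sqrt(sum((xa - xb) ** 2 + (ya - yb) ** 2
                         for (xa, ya), (xb, yb) in zip(na, nb)))


def is_match(a: Sequence[Joint], b: Sequence[Joint], threshold: float) -> bool:
    """Declare a match when the normalized distance is below the threshold."""
    return pose_distance(a, b) < threshold
```

This normalization makes the distance invariant to translation and scale, so a pose that merely drifts within the image still matches itself, but it carries no notion of skeleton structure, which is what the GCN embedding adds.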
Since PoseTrack’18 test set is not open yet, we compare our methods with other approaches, both online and offline, on PoseTrack’17 test set. For fair comparison, we only use PoseTrack’17 training set and COCO train+val set to train the pose estimators. No auxiliary data is used. We performed ablation studies on validation sets with CPN-101  as the pose estimator. During testing, in addition to CPN-101, we conduct experiments using MSRA-152 .
|PoseTrack 2017 Test Set|
|PoseTrack, CVPR’18 ||54.3||49.2||59.4||48.4||-|
|BUTD, ICCV’17 ||52.9||42.6||59.1||50.6||-|
|Detect-and-track, CVPR’18 ||-||-||59.6||51.8||-|
|Flowtrack-152, ECCV’18 ||71.5||65.7||74.6||57.8||-|
|Ours-CPN101 (offline)||68.0 / 59.7||62.6 / 56.3||70.7 / 63.9||55.1||-|
|Ours-MSRA152 (offline)||68.9 / 61.8||63.2 / 58.4||71.5 / 65.7||57.0||-|
|Ours-manifold (offline)||- / 64.6||- / 58.4||- / 66.7||58.0||-|
|PoseFlow, BMVC’18 ||59.0||57.9||63.0||51.0||10*|
|JointFlow, Arxiv’18 ||53.1||50.4||63.3||53.1||0.2|
|Ours-CPN101-LightTrack-3F||61.2||57.6||63.8||52.3||47* / 0.8|
|Ours-MSRA152-LightTrack-3F||63.8||59.1||66.5||55.1||48* / 0.7|
|PoseTrack 2018 Validation Set|
|Ours-CPN101 (offline)||72.6 / 63.9||68.9 / 62.6||76.4 / 69.7||62.4||-|
|Ours-MSRA152 (offline)||73.6 / 65.6||70.5 / 64.9||77.3 / 71.2||64.9||-|
|Ours-YoloMD-LightTrack-2F||62.9 / 56.2||57.8 / 53.3||70.4 / 66.0||55.7||59* / 1.9|
|Ours-CPN101-LightTrack-2F||72.4 / 66.3||69.1 / 64.2||76.0 / 70.3||61.3||47* / 0.8|
|Ours-MSRA152-LightTrack-2F||73.3 / 66.4||70.9 / 66.1||77.2 / 72.4||64.6||48* / 0.7|
Accuracy: As shown in Table 6, our method LightTrack outperforms other online methods while maintaining a much higher frame rate, and is very competitive with offline state-of-the-art methods. For our offline approach, we use the same detector and pose estimator as LightTrack, except that we replace LightTrack with the official release of PoseFlow for performance comparison. Although the PoseFlow algorithm is conceptually online, its processing is performed in multiple stages and requires keypoint matching between frames to be pre-computed, which is computationally expensive. In contrast, LightTrack is processed truly online.
Speed: Testing on a single Tesla P40 GPU, pose matching costs an average of ms for each pair. Since pose matching only occurs at key-frames, its frequency of occurrence depends on the number of candidates and the length of keyframe intervals. Therefore, we test the average processing time on the PoseTrack’18 validation set, which consists of videos with a total of frames. It takes the online algorithm CPN101-LightTrack seconds to process, of which secs are used for pose estimation. The frame rate of the whole system is fps. The framework runs at around fps excluding pose inference time. In total, persons are encountered. An average of people are tracked for each frame. It takes CPN101 ms to process each human candidate, including ms pose inference and ms for pre-processing and post-processing. There is potential room for the actual frame rate and tracking performance to improve with other choices of pose estimators and parallel inference optimization. We see an improved performance with MSRA152-LightTrack but a slightly slower frame rate due to its ms inference time.
Accuracy: Since the components in our framework are easily replaceable and extendable, methods employing this framework can potentially become faster, more accurate, or possibly both. Note that the pose estimator mentioned in section 4.3 can be replaced by a more accurate  or a much faster counterpart. The performance boost in the general object detector, or methods that focus on detecting people (e.g., using auxiliary dataset ), should also improve the pose tracking performance. Ablation study in section 4.4 has shown that better detection increases the MOTA score, regardless of which detectors to use.
Speed: The pose estimation network can be prioritized for speed while sacrificing some accuracy. For instance, we use YOLOv3 and MobileNetv1-deconv (YoloMD) as detector and pose estimator, respectively. It achieves an average of FPS with mAP and MOTA score % on PoseTrack’18 validation set. Aside from network structure design, a faster network could also refine heatmaps from previous frame(s). Recently, refinement-based networks [29, 11] have drawn enormous attention.
Flexibility: The advantage of our top-down approach in pose tracking is that we can conveniently track specific targets and do not necessarily track all candidates. It can be achieved simply by choosing the target(s) at the first frame and providing target locations at key-frames. As a side effect, this further reduces computational complexity. If the target has specific visual appearance, the framework can be conveniently extended to ensure only the target can be matched at key-frames and tracked at remaining frames.
In this paper, we propose an effective and generic light-weight framework for online human pose tracking. We also provide a baseline employing this framework, and propose a siamese graph convolution network for human pose matching as a Re-ID module in our pose tracking system. The skeleton-based representation effectively captures human pose similarity and is computationally inexpensive. Our method significantly outperforms other online methods, and is very competitive with offline state-of-the-art methods while running at a much higher frame rate. We believe the proposed framework is well suited to wide adoption owing to its performance, generality, and extensibility.