A simple baseline for one-shot multi-object tracking
There has been remarkable progress on object detection and re-identification in recent years, which are the core components of multi-object tracking. However, little attention has been paid to accomplishing the two tasks in a single network to improve the inference speed. The initial attempts along this path ended up with degraded results, mainly because the re-identification branch is not appropriately learned. In this work, we study the essential reasons behind the failure, and accordingly present a simple baseline to address the problems. It remarkably outperforms the state-of-the-art methods on the public datasets at 30 fps. We hope this baseline could inspire and help evaluate new ideas in this field. The code is available at <https://github.com/ifzhang/FairMOT>, and the pre-trained models will be released.
Multi-Object Tracking (MOT) has been a longstanding goal in computer vision [3, 37, 6, 40]. The goal is to estimate the trajectories of multiple objects of interest in videos. The successful resolution of the task can benefit many applications such as action recognition, public security, sports video analysis, elderly care, and human-computer interaction. The state-of-the-art methods [23, 46, 11, 3, 37, 6, 40] usually address the problem with two separate models: the detection model first localizes the objects of interest by bounding boxes in the images, and then the association model extracts Re-identification (Re-ID) features for each bounding box and links it to one of the existing tracks according to certain metrics defined on the features. There has been remarkable progress on object detection [27, 12, 44, 26] and Re-ID [43, 6] in recent years, which in turn boosts the tracking performance. However, those methods cannot perform inference at video rate because the two networks do not share features.
With the maturity of multi-task learning, the one-shot methods which jointly detect objects and learn Re-ID features have begun to attract more attention [35, 33]. Since most features are shared between the two tasks, they have the potential to notably reduce the inference time. However, the accuracy of the one-shot methods usually drops remarkably compared to the two-step ones. In particular, the number of ID switches increases considerably, as will be shown in the experimental section. The results show that combining the two tasks is not trivial and should be treated carefully. Instead of using bags of tricks to improve the tracking accuracy, we study the reasons behind the failure, and present a simple yet effective baseline. Three factors which are critical to the tracking results are identified.
The current one-shot trackers [35, 33] are all based on anchors since they are modified from object detectors [26, 12]. However, the anchors are not suitable for learning Re-ID features for two reasons. First, multiple anchors, which correspond to different image patches, may be responsible for estimating the identity of the same object. This causes severe ambiguities for the network. See Figure 1 for illustration. In addition, the feature map is usually down-sampled by times to balance the accuracy and speed. This is acceptable for detection but is too coarse for ReID because the object centers may not align with the features extracted at coarse anchor locations for predicting the object’s identity. We solve the problem by treating the MOT problem as a pixel-wise keypoint (object center) estimation and identity classification problem on top of a high-resolution feature map.
Multi-layer feature aggregation is particularly important for MOT because the Re-ID features need to leverage both low-level and high-level features to accommodate small and large objects. We observe in our experiments that this helps reduce identity switches for the one-shot methods due to the improved ability to handle scale variations. Note that the improvement is less significant for the two-step methods because objects will have similar scales after the cropping and resizing operations.
The previous Re-ID methods usually learn high-dimensional features, and have achieved promising results on their benchmarks. However, we find that lower-dimensional features are actually better for MOT because it has fewer training images than Re-ID (we cannot use the Re-ID datasets because they only provide cropped person images). Learning low-dimensional features helps reduce the risk of over-fitting to small data, and improves the tracking robustness.
We present a simple baseline which jointly considers the above three factors. Note that we do not claim algorithmic novelty over the previous works. Instead, our contributions lie in first identifying the challenges behind the one-shot trackers, and then putting together a number of techniques and concepts that are developed in different areas of computer vision to address the challenges which are overlooked in the previous MOT works.
The overview of our approach is shown in Figure 2. We first adopt an anchor-free object detection approach to estimate the object centers [44, 17, 45, 9] on a high-resolution feature map. The elimination of anchors alleviates the ambiguity problem and the use of a high-resolution feature map enables the Re-ID features to be better aligned with the object centers. Then we add a parallel branch for estimating the pixel-wise Re-ID features which are used to predict the objects’ identities. In particular, we learn low-dimensional Re-ID features which not only reduce the computation time but also improve the robustness of feature matching. We equip the backbone network with the Deep Layer Aggregation operator to fuse features from multiple layers in order to deal with objects of different scales.
We evaluate our approach on the MOT Challenge benchmark via the evaluation server. It ranks first among all online trackers on the 2DMOT15, MOT16, MOT17 and MOT20 datasets. In fact, it also outperforms the offline trackers on the 2DMOT15, MOT17 and MOT20 datasets (MOT20 is the newest dataset and no previous works have reported results on it). In spite of the strong results, the approach is very simple and runs at 30 FPS. We hope it could be used as a strong baseline in this field. The code as well as the pre-trained models will be released.
In this section, we briefly review the related works on MOT by classifying them into the two-step and one-shot methods, respectively. We discuss the pros and cons of the methods and compare them to our approach.
Two-step MOT methods first apply a detector to localize all objects of interest in the images by a number of boxes. Then in a separate step, they crop the images according to the boxes and feed them to the identity embedding network to extract Re-ID features, and link the boxes to form multiple tracks. The works usually follow a standard practice for box linking which first computes a cost matrix according to the Re-ID features and Intersection over Union (IoU) of the bounding boxes, and then uses the Kalman Filter and Hungarian algorithm to accomplish the linking task. A small number of works such as [23, 46, 11] use more complicated association strategies such as group models and RNNs. The advantage of the two-step methods is that they can use the most suitable model for each task without making compromises. In addition, they can crop the image patches according to the detected bounding boxes and resize them to the same size before predicting Re-ID features. This helps to handle the scale variations of objects. As a result, these approaches have achieved the best performance on the public datasets. However, they are usually very slow because both object detection and Re-ID feature embedding require a lot of computation and nothing is shared between them. So it is hard to achieve video rate inference, which is required in many applications.
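The IoU part of the linking cost mentioned above can be sketched in a few lines. This is a minimal illustration, not the code of any particular tracker: the `(x1, y1, x2, y2)` box format and the function names `iou` and `iou_cost_matrix` are assumptions made for the example.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def iou_cost_matrix(tracks, detections):
    """cost[i][j] = 1 - IoU(track i, detection j); lower means a better match."""
    return [[1.0 - iou(t, d) for d in detections] for t in tracks]
```

A matrix of this form (usually combined with a Re-ID feature distance) is what an assignment solver such as the Hungarian algorithm would consume.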
With the maturity of multi-task learning in deep learning, one-shot MOT has begun to attract more research attention. The core idea is to simultaneously accomplish object detection and identity embedding (Re-ID features) in a single network in order to reduce the inference time by sharing most of the computation. For example, Track-RCNN adds a Re-ID head on top of Mask-RCNN and regresses a bounding box and a Re-ID feature for each proposal. JDE is built on top of the YOLOv3 framework and achieves near video rate inference. However, the tracking accuracy of the one-shot methods is usually lower than that of the two-step methods. We find this is because the learned Re-ID features are not optimal, which leads to a large number of ID switches. We investigate the reasons in depth and find that the identity embedding features extracted at anchors are not aligned with the object centers, which causes severe ambiguities. To address the problem, we propose to use anchor-free approaches for both object detection and identity embedding, which significantly improves the tracking accuracy on all benchmarks.
In this section, we present the details for the backbone network, the object detection branch and the Re-ID feature embedding branch, respectively.
We adopt ResNet-34 as our backbone in order to strike a good balance between accuracy and speed. To accommodate objects of different scales, a variant of Deep Layer Aggregation (DLA) is applied to the backbone as shown in Figure 2. Different from the original DLA, it has more skip connections between the low-level and high-level features, which is similar to the Feature Pyramid Network (FPN). In addition, all convolution layers in the up-sampling module are replaced by deformable convolution layers such that they can dynamically adapt the receptive field according to the object scales and poses. These modifications also help alleviate the alignment issue. The resulting model is named DLA-34. Denote the size of the input image as $H_{image} \times W_{image}$; then the output feature map has the shape $C \times H \times W$, where $H = H_{image}/4$ and $W = W_{image}/4$.
Following anchor-free center-based detectors, we treat object detection as a center-based bounding box regression task on a high-resolution feature map. In particular, three parallel regression heads are appended to the backbone network to estimate heatmaps, object center offsets and bounding box sizes, respectively. Each head is implemented by applying a convolution to the output feature maps of the backbone network, followed by a convolutional layer which generates the final targets.
This head is responsible for estimating the locations of the object centers. The heatmap-based representation, which is the de facto standard for the landmark point estimation task, is adopted here. In particular, the dimension of the heatmap is $1 \times H \times W$, where $H \times W$ is the resolution of the output feature map. The response at a location in the heatmap is expected to be one if it coincides with the ground-truth object center. The response decays exponentially as the distance between the location in the heatmap and the object center increases.
This head is responsible for localizing the objects more precisely. Recall that the stride of the feature map is four which will introduce non-negligible quantization errors. Note that the benefits for object detection performance may be marginal. But it is critical for tracking because the Re-ID features should be extracted according to the accurate object centers. We find in our experiments that the careful alignment of the ReID features with object centers is critical for the performance.
This head is responsible for estimating the height and width of the target bounding box at each location. This head is not directly related to the Re-ID features, but the localization precision will impact the evaluation of the object detection performance.
The goal of the identity embedding branch is to generate features that can distinguish different objects. Ideally, the distance between different objects should be larger than that between the same object. To achieve the goal, we apply a convolution layer with kernels on top of the backbone features to extract identity embedding features for each location. The resulting feature map is . The Re-ID feature of an object at is extracted from the feature map.
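For each GT box $b^i = (x_1^i, y_1^i, x_2^i, y_2^i)$ in the image, we compute the object center $(c_x^i, c_y^i)$ as $c_x^i = \frac{x_1^i + x_2^i}{2}$ and $c_y^i = \frac{y_1^i + y_2^i}{2}$, respectively. Then its location on the feature map is obtained by dividing by the stride: $(\tilde{c}_x^i, \tilde{c}_y^i) = (\lfloor c_x^i / 4 \rfloor, \lfloor c_y^i / 4 \rfloor)$. Then the heatmap response at the location $(x, y)$ is computed as $M_{xy} = \sum_{i=1}^{N} \exp\left(-\frac{(x - \tilde{c}_x^i)^2 + (y - \tilde{c}_y^i)^2}{2\sigma_c^2}\right)$, where $N$ represents the number of objects in the image and $\sigma_c$ represents the standard deviation. The loss function is defined as pixel-wise logistic regression with focal loss [20]:
$$L_{heatmap} = -\frac{1}{N} \sum_{xy} \begin{cases} (1 - \hat{M}_{xy})^{\alpha} \log(\hat{M}_{xy}), & \text{if } M_{xy} = 1, \\ (1 - M_{xy})^{\beta} (\hat{M}_{xy})^{\alpha} \log(1 - \hat{M}_{xy}), & \text{otherwise,} \end{cases}$$
where $\hat{M}$ is the estimated heatmap, and $\alpha$ and $\beta$ are the parameters.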
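The heatmap target (a Gaussian bump around each object center) and the focal-style loss described above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the released training code: the values `alpha=2.0` and `beta=4.0` are common defaults and are not given in the text, the summed response is clipped to 1 so that centers are exactly 1 when objects overlap, and the loss is normalized by the number of positive locations as a stand-in for the number of objects.

```python
import math


def gaussian_heatmap(centers, height, width, sigma):
    """Render the target heatmap for object centers given as (x, y) on the feature map."""
    heatmap = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = sum(
                math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
                for cx, cy in centers
            )
            # Assumption: clip the summed response so that it never exceeds 1.
            heatmap[y][x] = min(total, 1.0)
    return heatmap


def heatmap_focal_loss(pred, target, alpha=2.0, beta=4.0, eps=1e-12):
    """Pixel-wise focal loss between a predicted and a target heatmap."""
    num_pos = 0
    loss = 0.0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p = min(max(p, eps), 1.0 - eps)  # keep log() finite
            if t == 1.0:
                num_pos += 1
                loss -= (1.0 - p) ** alpha * math.log(p)
            else:
                # Negatives close to a center are down-weighted by (1 - t)^beta.
                loss -= (1.0 - t) ** beta * p ** alpha * math.log(1.0 - p)
    return loss / max(num_pos, 1)
```

A prediction with a sharp peak at the center and zeros elsewhere yields a loss close to zero, while a flat prediction of 0.5 everywhere is penalized at both the center and the background.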
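We denote the outputs of the size and offset heads as $\hat{S} \in \mathbb{R}^{W \times H \times 2}$ and $\hat{O} \in \mathbb{R}^{W \times H \times 2}$, respectively. For each GT box $b^i = (x_1^i, y_1^i, x_2^i, y_2^i)$ in the image, with center $(c_x^i, c_y^i)$, we can compute its size as $s^i = (x_2^i - x_1^i, y_2^i - y_1^i)$. Similarly, the GT offset can be computed as $o^i = \left(\frac{c_x^i}{4}, \frac{c_y^i}{4}\right) - \left(\lfloor \frac{c_x^i}{4} \rfloor, \lfloor \frac{c_y^i}{4} \rfloor\right)$. Denote the estimated size and offset at the corresponding location as $\hat{s}^i$ and $\hat{o}^i$, respectively. Then we enforce $\ell_1$ losses for the two heads:
$$L_{box} = \sum_{i=1}^{N} \left( \| o^i - \hat{o}^i \|_1 + \| s^i - \hat{s}^i \|_1 \right).$$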
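As a worked example of the size and offset targets, the sketch below computes them for one ground-truth box and evaluates an L1 loss against hypothetical predictions. The stride of four comes from the text; the helper names `box_targets` and `l1_box_loss`, and the choice to express the size in image pixels, are assumptions of this illustration.

```python
import math

STRIDE = 4  # feature-map stride stated in the text


def box_targets(box, stride=STRIDE):
    """Return (feature-map cell, size target, offset target) for a box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    fx, fy = cx / stride, cy / stride          # center in feature-map units
    cell = (math.floor(fx), math.floor(fy))    # quantized center location
    size = (x2 - x1, y2 - y1)
    offset = (fx - cell[0], fy - cell[1])      # sub-cell quantization error
    return cell, size, offset


def l1_box_loss(targets, predictions):
    """Sum of L1 errors over (size, offset) pairs, one pair per object."""
    loss = 0.0
    for (size, offset), (pred_size, pred_offset) in zip(targets, predictions):
        loss += sum(abs(a - b) for a, b in zip(offset, pred_offset))
        loss += sum(abs(a - b) for a, b in zip(size, pred_size))
    return loss
```

For the box `(10, 20, 37, 50)` the center is `(23.5, 35.0)`, which lands in cell `(5, 8)` with offset `(0.875, 0.75)`; without the offset head, the decoded center would be off by up to almost one full cell, i.e. close to four pixels.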
We treat object identity embedding as a classification task. In particular, all object instances of the same identity in the training set are treated as one class. For each GT box $b^i$ in the image, we obtain the object center on the heatmap $(\tilde{c}_x^i, \tilde{c}_y^i)$. We extract an identity feature vector $E_{x^i, y^i}$ at the location and learn to map it to a class distribution vector $p(k)$. Denote the one-hot representation of the GT class label as $L^i(k)$. Then we compute the softmax loss as:
$$L_{identity} = -\sum_{i=1}^{N} \sum_{k=1}^{K} L^i(k) \log(p(k)),$$
where $K$ is the number of classes.
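The identity classification loss can be illustrated with a small, dependency-free sketch. It assumes that some classifier (hypothetical here, for example a linear layer on top of the identity feature) has already produced one vector of logits per object; because the label is one-hot, the double sum reduces to the negative log-probability of the ground-truth class.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def identity_loss(logits_per_object, labels):
    """Softmax cross-entropy summed over objects; labels are class indices."""
    loss = 0.0
    for logits, label in zip(logits_per_object, labels):
        probs = softmax(logits)
        # With a one-hot label only the ground-truth class contributes.
        loss -= math.log(probs[label])
    return loss
```

With uniform logits over four identities, each object contributes `log(4)` to the loss, which is the value a network with no identity information would incur.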
In this section, we explain the inference of our model and how we perform box tracking with the detection results and identity embeddings.
The network takes an image of size as input which is the same as the previous work JDE . On top of the predicted heatmap, we perform non-maximum suppression (NMS) based on the heatmap scores to extract the peak keypoints. We keep the locations of the keypoints whose heatmap scores are larger than a threshold. Then, we compute the corresponding bounding boxes based on the estimated offsets and box sizes. We also extract the identity embeddings at the estimated object centers.
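The peak extraction and box decoding steps can be sketched as below. This is a simplified illustration: the 3x3 neighbourhood used for the local-maximum test, the function names, and the convention that the predicted size is expressed in image pixels are assumptions, not details confirmed by the text.

```python
def extract_peaks(heatmap, threshold):
    """Return (x, y, score) for local maxima whose score exceeds the threshold."""
    height, width = len(heatmap), len(heatmap[0])
    peaks = []
    for y in range(height):
        for x in range(width):
            score = heatmap[y][x]
            if score <= threshold:
                continue
            # 3x3 neighbourhood (clipped at the borders), including the pixel itself.
            neighbours = [
                heatmap[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            if score >= max(neighbours):
                peaks.append((x, y, score))
    return peaks


def decode_box(x, y, offset, size, stride=4):
    """Turn a peak location plus predicted offset and size into (x1, y1, x2, y2)."""
    cx = (x + offset[0]) * stride
    cy = (y + offset[1]) * stride
    w, h = size
    return (cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0)
```

The identity embedding for each detection would then be read from the Re-ID feature map at the same `(x, y)` peak location.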
We use the standard online tracking algorithm to achieve box linking. We initialize a number of tracklets based on the estimated boxes in the first frame. In the subsequent frames, we link the boxes to the existing tracklets according to their distances measured by Re-ID features and IoUs. We also use a Kalman Filter to predict the locations of the tracklets in the current frame. If a predicted location is too far from the linked detection, we set the corresponding cost to infinity, which effectively prevents linking detections that would imply a large motion. We update the appearance features of the trackers in each time step to handle appearance variations as in [4, 14].
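The linking step can be sketched as follows. Several simplifications are made and should be read as assumptions: the appearance cost is a cosine distance between embeddings, the motion gate uses the Euclidean distance between box centers instead of a Kalman-filter-based distance, and a greedy lowest-cost-first matching stands in for the Hungarian algorithm. The dictionary keys and thresholds are hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm > 0 else 1.0


def center_distance(box_a, box_b):
    """Euclidean distance between the centers of two (x1, y1, x2, y2) boxes."""
    ax, ay = (box_a[0] + box_a[2]) / 2.0, (box_a[1] + box_a[3]) / 2.0
    bx, by = (box_b[0] + box_b[2]) / 2.0, (box_b[1] + box_b[3]) / 2.0
    return math.hypot(ax - bx, ay - by)


def link(tracks, detections, max_motion, max_cost):
    """Return (track index, detection index) pairs chosen greedily by cost."""
    candidates = []
    for t, track in enumerate(tracks):
        for d, det in enumerate(detections):
            cost = cosine_distance(track["embedding"], det["embedding"])
            # Gate: a detection too far from the predicted location cannot match.
            if center_distance(track["predicted_box"], det["box"]) > max_motion:
                cost = math.inf
            if cost <= max_cost:
                candidates.append((cost, t, d))
    candidates.sort()
    matches, used_tracks, used_dets = [], set(), set()
    for cost, t, d in candidates:
        if t not in used_tracks and d not in used_dets:
            matches.append((t, d))
            used_tracks.add(t)
            used_dets.add(d)
    return matches
```

Detections left unmatched would typically start new tracklets, and matched tracklets would have their stored appearance feature refreshed with the new embedding.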
Following the previous works such as , we compose a large training dataset by combining the training images from six public datasets for human detection and search. In particular, the ETH  and the CityPerson  datasets only provide bounding box annotations so we only train the detection branch on them. The CalTech , MOT17 , CUHK-SYSU  and PRW  datasets provide both bounding box and identity annotations on which we train both of the detection and identity embedding branches. Since some videos in the ETH dataset also appear in the testing set of the MOT16 dataset, we remove them from the training dataset for fair comparison. In some ablative experiments, we propose to train our model on a smaller dataset to save the computation cost which will be described clearly. We extensively evaluate a variety of factors of our approach on the testing sets of four benchmarks: 2DMOT15, MOT16, MOT17 and the recently released MOT20. As in , We use Average Precision (AP) for evaluating the detection performance, and True Positive Rate (TPR) at a false accept rate of for evaluating the Re-ID features. We use the CLEAR metric  and IDF1  to evaluate the tracking accuracy.
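For reference, the two headline tracking scores can be written as short functions. The definitions follow the commonly used forms of MOTA (the main accuracy score of the CLEAR metrics) and IDF1; the argument names are chosen for this sketch and the counts in the usage note are made up for illustration.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """MOTA = 1 - (FN + FP + IDSW) / GT, with counts summed over all frames."""
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


def idf1(idtp, idfp, idfn):
    """IDF1 = 2 * IDTP / (2 * IDTP + IDFP + IDFN)."""
    return 2.0 * idtp / (2.0 * idtp + idfp + idfn)
```

For example, 10 misses, 5 false positives and 5 identity switches over 100 ground-truth boxes give a MOTA of 0.8. MOTA weights every ID switch the same as a single missed detection, which is why large changes in ID switches can leave MOTA almost unchanged while IDF1 moves noticeably.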
We use a variant of DLA-34 proposed in  as our default backbone. The model parameters pre-trained on the COCO detection dataset  are used to initialize our model. We train our model with the Adam optimizer for epochs with a starting learning rate of . The learning rate decays to and , at and epochs, respectively. The batch size is set to be . We use standard data augmentation techniques including rotation, scaling and color jittering. The input image is resized to and the feature map resolution is . The training takes about 30 hours on two RTX 2080 GPUs.
The previous one-shot trackers are based on anchors which suffer from the mis-alignment problem as described in the previous sections. In this section, we numerically validate the argument by constructing an anchor-based baseline on top of our approach by replacing the detection branch with the anchor-based method used in . We keep the rest of the factors the same for the two approaches for fair comparison. Note that the models in this section are trained on the large training dataset because the anchor-based method obtains very bad results when we use small datasets for training. The results are shown in Table 1.
We can see that the anchor-based method obtains consistently lower MOTA scores than our anchor-free method for different strides. For example, when the stride is , the anchor-free method achieves a significantly better TPR score than the anchor-based baseline ( vs. ) meaning that the Re-ID features of the anchor-free method have clear advantages. The main reason is that the mis-alignment between the anchors and object centers causes severe ambiguities to the learning of the network. It is noteworthy that increasing the feature map resolution for the anchor-based method even degrades the MOTA score. This is because there will be more unaligned positive anchors when we use high-resolution feature maps which makes the network training even more difficult. We do not show the results for the stride of because the significantly increased number of anchors exceed the memory capacity of our GPUs.
In contrast, our anchor-free approach suffers less from the mis-alignment issue and achieves notably better MOTA score than the anchor-based one. In particular, the number of ID switches decreases significantly from to for the stride of four. More importantly, our approach benefits a lot when we decrease the stride from to . Further decreasing the stride to begins to degrade the results because the introduction of lower-level features makes the representation less robust to appearance variations. We also visualize the Re-ID features learned by different models in Figure 3. We can see that the features of different identities are mixed for the anchor-based approach, especially when the stride is . In contrast, they are well separated for our anchor-free approach.
This section evaluates the impact of multi-layer feature aggregation in the backbone networks. In particular, we evaluate a number of backbones such as the vanilla ResNet , Feature Pyramid Network (FPN) , High-Resolution Network (HRNet)  and DLA-34 . The remaining factors of the approaches are controlled to be the same for fair comparison. The stride of the final feature map is for all methods in this experiment. In particular, We add three up-sampling operations for the vanilla ResNet to obtain the stride-4 feature map. We split the training subset of the 2DMOT15 dataset into 5 training videos and 6 validation videos following the practice of the previous work . The large scale training dataset is not used here in order to reduce the computation cost.
The results are shown in Table 2. We can see that DLA-34, which is built on top of the ResNet-34, achieves a notably better MOTA score than the vanilla ResNet-34. In particular, TPR increases significantly from to which in turn decreases the number of ID switches (IDs) from to . The experimental results suggest that the discriminative ability of the Re-ID features improves due to the multi-layer feature fusion. By comparing the results of ResNet-34 and ResNet-50, we can see that using a larger network also improves the overall MOTA score. However, if we look into the detailed metrics, we find that the improvement is mainly from the enhanced detection results measured by AP. However, the Re-ID features barely benefit from the larger network. For example, TPR only improves from to . In contrast, the number is for DLA-34. The results demonstrate that multi-layer fusion has clear advantages over using deeper networks in terms of improving the identity embeddings. We also compare to other multi-layer fusion methods such as HRNet  and FPN . Both approaches achieve better MOTA scores than ResNet-34. The improvement not only comes from the enhanced detection results, but also from the improved discriminative ability of the Re-ID features. For example, TPR increases from to for HRNet. The DLA-34 model has additional gains over FPN and HRNet. We find that the deformable convolution in DLA-34 is the main reasons for the gap because it can alleviate the mis-alignment issue caused by down-sampling for small objects. As shown in Table 3, we can see that DLA-34 mainly outperforms HRNet on small and middle sized objects. We visualize the Re-ID features of all persons in the testing set in Figure 4 by t-SNE . We can see that the features learned by the vanilla ResNet-34 are not discriminative since the features of different identities are mostly mixed together. This will cause a large number of ID switches in the linking stage. 
The Re-ID features learned by HRNet become better except that the pink and green points are largely confused. In addition, the Re-ID features of DLA-34 are more discriminative than the two baseline methods.
The previous works usually learn dimensional features without ablation study. However, we find in our experiments that the feature dimension actually plays an important role. In general, to avoid over-fitting, training high-dimensional Re-ID features requires a large number of training images which is not available for the one-shot tracking problem. The previous two-step approaches suffer less from the problem because they could leverage the abundant Re-ID datasets which provide cropped person images. The one-shot methods including ours cannot use them because it requires original uncropped images. One solution is to reduce its dependence on data by reducing the dimensionality of Re-ID features. We evaluate multiple choices of dimensionality in Table 4. We can see that TPR consistently improves when the dimension decreases from to which demonstrates the advantages of using low-dimensional features. Further reducing the dimensionality to begins to decrease TPR because the representative ability of the Re-ID features suffers. Although the changes for MOTA score are very marginal, the number of ID switches actually decreases significantly from to . This actually plays a critical role in improving the user experience. The inference speed is also slightly improved by reducing the dimensionality of the Re-ID features. It is noteworthy that the argument of using lower-dimensional Re-ID features only holds when we have access to a small number of training data. The gap caused by the feature dimensionality will become smaller when the number of training data increases.
We compare our approach to the state-of-the-art methods including both the one-shot methods and the two-step methods.
There are only two published works, i.e. JDE  and TrackRCNN , that jointly perform object detection and identity feature embedding. In particular, TrackRCNN requires additional segmentation annotations and reports results using a different metric for the segmentation task. So in this work, we only compare to JDE.
For fair comparison, we use the same data for training and testing as in . Specifically, we use 2DMOT15-train and MOT16-test for validation. The CLEAR metric  and IDF1  are used to measure the performance. The results are shown in Table 5. We can see that our approach remarkably outperforms JDE  on both datasets. In particular, the number of ID switches reduces from to which is big improvement in terms of improving the user experience. The results validate the effectiveness of the anchor-free approach over the previous anchor-based one. The inference speed is near video rate for the both methods with ours being faster.
We compare our approach to the state-of-the-art online trackers including the two-step methods on the MOT Challenge dataset in Table 6. (An online tracker only uses the information before the current frame for tracking; an offline tracker could use the whole video.) Since we do not use the public detection results, the “private detector” protocol is adopted. We report results on the testing sets of the 2DMOT15, MOT16, MOT17 and MOT20 datasets, respectively. We finetune our model for 10 epochs on each of the datasets before testing. All of the results are obtained on the MOT challenge evaluation server. Our approach ranks first among all online trackers on the four datasets. In fact, it also achieves the highest MOTA score among all online and offline trackers on the 2DMOT15 and MOT17 datasets. This is a very strong result considering that our approach is very simple. In addition, our approach achieves video rate inference. In contrast, most high-performance trackers such as [11, 40] are usually slower than ours.
We present a simple baseline for one-shot multiple object tracking. We start by studying why the previous one-shot methods fail to achieve results comparable to the two-step methods. We find that the use of anchors in object detection and identity embedding is the main reason for the degraded results. In particular, multiple nearby anchors, which correspond to different parts of an object, may be responsible for estimating the same identity, which causes ambiguities for network training. We present a simple anchor-free approach which outperforms the previous state-of-the-art methods on several benchmark datasets at 30 fps. We hope it could inspire and help evaluate new ideas in this field.
2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 2544–2550.
Online multi-object tracking with convolutional neural networks. In 2017 IEEE International Conference on Image Processing (ICIP), pp. 645–649.
Journal of Machine Learning Research 9 (Nov), pp. 2579–2605.
Multi-object tracking using online metric learning with long short-term memory. In 2018 25th IEEE International Conference on Image Processing (ICIP), pp. 788–792.
Online multi-target tracking with tensor-based high-order graph matching. In 2018 24th International Conference on Pattern Recognition (ICPR), pp. 1809–1814.