Multi-object tracking (MOT) aims to track all objects of the categories of interest in a video sequence [18, 26]. It is crucial in applications like video surveillance and autonomous driving, where multiple pedestrians and vehicles need to be tracked simultaneously [6, 25, 3]. In recent years, tracking-by-detection [9, 18, 23, 6, 1, 5] has become the predominant paradigm of MOT. This approach first detects objects in each frame, then extracts discriminative features to quantify the similarities between targets, and finally performs data association to assign detections to their most likely trajectories. During this process, several influential parameters need to be set manually, such as the threshold determining whether to establish associations. To find the optimal parameters, an evaluation procedure is needed to measure the tracking performance. However, existing evaluation metrics, such as the event-based CLEAR MOT measures or identity-based measures, all require ground truth annotations, limiting the optimization to training data. Since the optimized parameters could be sub-optimal in test scenes, a self evaluation metric that enables parameter optimization without ground truth is urgently needed.
To evaluate the accuracy and stability of a tracker without ground truth, we design a self quality evaluation metric that jointly considers the quantity, length, and feature distance information of the trajectory hypotheses. Our method can assess the quality of trajectories owing to the distinctive forms of the distance distributions, as shown in Figure 1. The intra distance denotes the feature distance between any two detection boxes in the same trajectory, and all such pairs constitute the intra distance distribution. Similarly, the inter distance denotes the feature distance between any two detection boxes from different trajectories. Intuitively, when a trajectory contains different targets, the distance distribution spreads out, and we demonstrate that it generally exhibits multiple peaks.
The proposed metric enables automatic parameter adaptation to accommodate different scenes. Designing a tracking algorithm that performs well across varied video scenes is hard, whereas tuning the parameters of an existing tracking algorithm can achieve comparably strong performance with less effort. To the best of our knowledge, there is no previous work in this area. We believe that our approach is instructive and provides new ideas for future research.
In summary, our contributions are as follows: (1) We show that feature distance distributions can reflect the quality of trajectory hypotheses; (2) We propose a self quality evaluation metric based on a two-class Gaussian mixture model, which can largely fulfill the need for self evaluation; (3) We test the effectiveness of our method on various data sets and note its drawbacks. A future prospect of using distributions to estimate erroneous frames is discussed at the end.
2 Related Work
2.1 MOT algorithms
In the tracking-by-detection paradigm, trackers first detect objects in each frame and then associate detections over time to form trajectories for targeted individuals [9, 26, 12]. Online methods [14, 24, 4, 23] only use previous and current frames and are thus suitable for real-time applications. One straightforward implementation is simple online and real-time tracking (SORT), which predicts the new locations of bounding boxes using a Kalman filter, followed by a data association procedure that uses intersection-over-union (IOU) to calculate the cost matrix. Although SORT achieves favourable speed and accuracy simultaneously, it suffers from frequent identity switches because it relies only on short-term motion information. Deep SORT, on the other hand, introduces object re-identification (REID) as appearance information to handle long-term occlusions, leading to a more robust and effective algorithm. Due to the rapid development of deep neural networks (DNNs), REID features with powerful discriminative capability have been popularized in MOT algorithms [6, 22, 26, 25]. In addition, the frame-by-frame association problem is often seen as bipartite graph matching solved by the Hungarian algorithm.
By contrast, offline methods [27, 9, 2] have access to the whole sequence and can perform global optimization on data association. These batch methods generally formulate MOT as a network flow problem [27, 15]. K-shortest paths (KSP), successive shortest-path (SSP), and dynamic programming (DP) can be used to find the optimal solution. Offline methods enable correction of early errors in online methods and often show better performance, but are not applicable to time-critical applications.
In this paper, we focus on a simple, efficient, and easy-to-implement tracking framework. We use REID features to calculate the cost between current object detections and existing tracklets, minimize the total cost with the Hungarian algorithm, and employ operations like interpolation and merging to correct previous results. Among all the parameters that need to be set, the REID threshold and the merging threshold are the two most dominant, governing the establishment of associations and the merging of tracklets, respectively.
2.2 Evaluation metrics
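Quantitative evaluation of tracking performance is challenging due to the complexity of the multi-target tracking task. A large number of metrics have been proposed [10, 17, 20, 13], including two commonly used families serving different purposes. One of them is the CLEAR MOT metrics [3, 11], which contain multiple object tracking accuracy (MOTA) and multiple object tracking precision (MOTP):
$$\mathrm{MOTA} = 1 - \frac{\sum_t \left(\mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t\right)}{\sum_t \mathrm{GT}_t}, \qquad \mathrm{MOTP} = \frac{\sum_{t,i} d_{t,i}}{\sum_t c_t},$$
where $c_t$ denotes the number of matched targets in frame $t$, and $d_{t,i}$ denotes the matching distance of target $i$ in frame $t$; $\mathrm{FN}_t$, $\mathrm{FP}_t$, $\mathrm{IDSW}_t$, and $\mathrm{GT}_t$ are the numbers of false negatives, false positives, identity switches, and ground truth objects in frame $t$. Compared to MOTP, which is mainly influenced by the localization accuracy of detections, MOTA sums various sources of errors, including false negatives, false positives, and identity switches, providing a better overall performance measure.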
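As a minimal sketch of how the CLEAR MOT scores are typically computed from per-frame counts, the snippet below follows the standard definitions (MOTA as one minus the ratio of accumulated errors to ground truth objects, MOTP as the mean matching distance over all matches). The function names and the input layout are illustrative assumptions, not part of any official evaluation toolkit.

```python
from typing import Sequence


def mota(fn: Sequence[int], fp: Sequence[int],
         idsw: Sequence[int], gt: Sequence[int]) -> float:
    """MOTA from per-frame false negatives, false positives,
    identity switches, and ground truth object counts."""
    total_gt = sum(gt)
    if total_gt == 0:
        raise ValueError("MOTA is undefined without ground truth objects")
    return 1.0 - (sum(fn) + sum(fp) + sum(idsw)) / total_gt


def motp(distances_per_frame: Sequence[Sequence[float]]) -> float:
    """MOTP as the mean matching distance over all matched pairs.
    distances_per_frame[t] lists the distance of every match in frame t."""
    n_matches = sum(len(frame) for frame in distances_per_frame)
    if n_matches == 0:
        raise ValueError("MOTP is undefined without matches")
    return sum(sum(frame) for frame in distances_per_frame) / n_matches


if __name__ == "__main__":
    # Toy two-frame sequence with five ground truth objects per frame.
    print(mota(fn=[1, 0], fp=[0, 1], idsw=[0, 1], gt=[5, 5]))
    print(motp([[0.2, 0.4], [0.3]]))
```

Note that a single identity switch costs MOTA the same as a single missed detection, regardless of how long the wrong identity persists afterwards.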
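The other is the ID metrics, which contain identification precision (IDP), identification recall (IDR), and the corresponding F1 score (IDF1):
$$\mathrm{IDP} = \frac{\mathrm{IDTP}}{\mathrm{IDTP} + \mathrm{IDFP}}, \qquad \mathrm{IDR} = \frac{\mathrm{IDTP}}{\mathrm{IDTP} + \mathrm{IDFN}}, \qquad \mathrm{IDF1} = \frac{2\,\mathrm{IDTP}}{2\,\mathrm{IDTP} + \mathrm{IDFP} + \mathrm{IDFN}},$$
where IDTP, IDFP, and IDFN are calculated by the truth-to-result match, i.e., bipartite graph matching between true trajectories and hypothetical trajectories. Afterwards, each hypothesis is assigned to a unique target. All frames of hypotheses with small overlap are counted as false positives, and the corresponding frames of the ground truth are counted as false negatives.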
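For illustration, the identity scores can be computed from the three frame counts produced by the truth-to-result match. The sketch below assumes those counts are already available; the function name `id_scores` is hypothetical.

```python
from typing import Tuple


def id_scores(idtp: int, idfp: int, idfn: int) -> Tuple[float, float, float]:
    """Return (IDP, IDR, IDF1) from identity true positives,
    identity false positives, and identity false negatives."""
    if idtp + idfp == 0 or idtp + idfn == 0:
        raise ValueError("scores are undefined for empty hypotheses or ground truth")
    idp = idtp / (idtp + idfp)
    idr = idtp / (idtp + idfn)
    idf1 = 2 * idtp / (2 * idtp + idfp + idfn)
    return idp, idr, idf1


if __name__ == "__main__":
    # 80 correctly identified frames, 20 wrongly identified, 40 missed.
    print(id_scores(80, 20, 40))
```

Because the counts are frame-based, a trajectory that keeps the wrong identity for many frames is penalized in proportion to its duration.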
Compared to MOTA, IDF1 better measures the consistency of ID matching. A simple example to illustrate its effectiveness is presented in Figure 2. In this paper, we focus on the performance of identification, and thus use IDF1 as the reference for our self-evaluation metric.
3 Self Quality Evaluation
We design a novel self quality evaluation metric that measures tracking performance without ground truth annotations and thereby enables parameter optimization for better tracking performance in real deployments. This metric should be positively correlated with IDF1, which generally measures the tracking performance the best. The guiding design criteria are provided below, in which we highlight some distinctive features an ideal tracker should possess. Both theoretical analysis and practical verification show that high-quality trajectories present single peaks in the feature distance distribution, while low-quality trajectories present multiple peaks.
3.1 Design criteria
To have a better understanding of the proposed metric, we first explain that an ideal MOT tracker should meet the following criteria. It should be able to: (1) track all targets continuously from appearing to leaving the tracking area; (2) track each target consistently, that is, each target should be assigned one and only one track ID over time; (3) locate the position of each target as accurately as possible.
As mentioned in Section 2.2, (3) quantifies the detection performance in the tracking-by-detection paradigm, and thus it is not our main focus. For the design of self evaluation metrics, (1) implies that the number and length of trajectories should be appropriate. (2) leads to the assumption that, for an outstanding tracker, REID features are as similar as possible if they come from the same trajectory, and as different as possible otherwise. This can be characterized by the intra and inter distances of trajectories. We define the distance between two features $\mathbf{f}_i$ and $\mathbf{f}_j$ as their Euclidean distance:
$$d(\mathbf{f}_i, \mathbf{f}_j) = \lVert \mathbf{f}_i - \mathbf{f}_j \rVert_2.$$
Based on the above considerations, our self evaluation metric should take the quantity, length, and feature distance information into account comprehensively. Since establishing relationship between the identification quality and the absolute values of distance is hard, distance distribution analysis is considered to be a more reasonable solution.
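The intra and inter distance sets described in this section can be collected as sketched below. This is a minimal illustration that assumes each trajectory is stored as a list of REID feature vectors; the function names are hypothetical.

```python
import math
from itertools import combinations, product
from typing import List, Sequence

Feature = Sequence[float]


def intra_distances(trajectory: List[Feature]) -> List[float]:
    """Euclidean distance between every pair of features in one trajectory."""
    return [math.dist(a, b) for a, b in combinations(trajectory, 2)]


def inter_distances(traj_a: List[Feature], traj_b: List[Feature]) -> List[float]:
    """Euclidean distance between every feature of traj_a and every feature of traj_b."""
    return [math.dist(a, b) for a, b in product(traj_a, traj_b)]


if __name__ == "__main__":
    # Toy 2-D features; real REID features have far more dimensions.
    traj_a = [(0.0, 0.0), (3.0, 4.0), (0.0, 0.0)]
    traj_b = [(6.0, 8.0)]
    print(intra_distances(traj_a))
    print(inter_distances(traj_a, traj_b))
```

A trajectory of length $n$ yields $n(n-1)/2$ intra distances, so long trajectories may need subsampling in practice.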
3.2 Distance distribution analysis
We demonstrate in theory that the intra distance of the same target and the inter distance of different targets obey chi distribution.
For object representation, it is common that low-quality inputs will lead to uncertain estimations, causing the computed REID features to fluctuate around the ideal value. We follow the assumptions in 
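, modeling the distribution of features as a multivariate Gaussian distribution:
$$\mathbf{f} \sim \mathcal{N}\left(\boldsymbol{\mu}, \operatorname{diag}(\boldsymbol{\sigma}^2)\right),$$
where $\mathbf{f}$ is an $N$-dimensional feature vector, and $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$ represent the ideal value and the uncertainty along each dimension, respectively. Each dimension obeys an independent Gaussian distribution.
We measure the Euclidean distance between a pair of features $(\mathbf{f}_i, \mathbf{f}_j)$:
$$d(\mathbf{f}_i, \mathbf{f}_j) = \sqrt{\sum_{k=1}^{N} \left(f_{i,k} - f_{j,k}\right)^2}.$$
According to the nature of independent Gaussian random variables, we have $f_{i,k} - f_{j,k} \sim \mathcal{N}\left(\mu_{i,k} - \mu_{j,k},\ \sigma_{i,k}^2 + \sigma_{j,k}^2\right)$. If $(\mathbf{f}_i, \mathbf{f}_j)$ comes from the same target, then $\mu_{i,k} - \mu_{j,k} = 0$ and $\sigma_{i,k}^2 + \sigma_{j,k}^2 = 2\sigma_k^2$. Thus, the feature distance after standardization obeys a chi distribution with $N$ degrees of freedom:
$$\sqrt{\sum_{k=1}^{N} \frac{\left(f_{i,k} - f_{j,k}\right)^2}{2\sigma_k^2}} \sim \chi_N,$$
and if $(\mathbf{f}_i, \mathbf{f}_j)$ comes from different targets:
$$\sqrt{\sum_{k=1}^{N} \frac{\left(f_{i,k} - f_{j,k} - (\mu_{i,k} - \mu_{j,k})\right)^2}{\sigma_{i,k}^2 + \sigma_{j,k}^2}} \sim \chi_N.$$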
Therefore, the intra and inter distance distributions of ideal trajectory hypotheses present single peaks. Next we consider a low-quality trajectory containing an identity switch between targets A and B. For ease of analysis, we assume that each target and each feature dimension has the same variance. Then the distance between a feature of A and a feature of B obeys a non-central chi distribution with a positive noncentrality parameter, while the distance within each target obeys the central chi distribution as proved above. The overall distance distribution is therefore a mixture of central and non-central chi distributions, and thus shows a bimodal form. It can be inferred that low-quality trajectories with wrong identification would present multiple peaks in the intra and inter distance distributions.
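The argument above can be checked numerically with a small simulation. The sketch below assumes independent Gaussian feature dimensions with a common standard deviation, draws pairs of features, and divides their Euclidean distance by $\sqrt{2}\sigma$. The dimension, variance, mean offset, and function names are illustrative assumptions rather than values from our experiments.

```python
import math
import random
from typing import List, Sequence


def chi_mean(n: int) -> float:
    """Theoretical mean of a chi distribution with n degrees of freedom."""
    return math.sqrt(2.0) * math.exp(math.lgamma((n + 1) / 2) - math.lgamma(n / 2))


def sample_feature(rng: random.Random, mu: Sequence[float], sigma: float) -> List[float]:
    """Draw one feature with independent Gaussian noise on every dimension."""
    return [rng.gauss(m, sigma) for m in mu]


def simulate(mu_a: Sequence[float], mu_b: Sequence[float], sigma: float,
             pairs: int, seed: int = 0) -> List[float]:
    """Standardized distances between features drawn around mu_a and mu_b."""
    rng = random.Random(seed)
    scale = math.sqrt(2.0) * sigma
    return [math.dist(sample_feature(rng, mu_a, sigma),
                      sample_feature(rng, mu_b, sigma)) / scale
            for _ in range(pairs)]


if __name__ == "__main__":
    n_dim, sigma = 16, 0.5
    target_a = [0.0] * n_dim
    target_b = [1.0] * n_dim  # a second target with a shifted ideal feature
    same = simulate(target_a, target_a, sigma, pairs=4000, seed=1)
    cross = simulate(target_a, target_b, sigma, pairs=4000, seed=2)
    print("same target mean :", sum(same) / len(same))
    print("chi mean (theory):", chi_mean(n_dim))
    print("cross target mean:", sum(cross) / len(cross))
```

Pooling `same` and `cross` into one histogram gives two separated modes, which is the bimodal shape expected for a trajectory that contains an identity switch.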
3.3 Practical verification
We practically verify the above conclusions by visualizing the intra and inter distance distributions of several different tracking cases in Figure 3. The results show that high-quality trajectories, such as the one labeled ID 0, which consistently tracks a person moving forward while remaining separate from the one labeled ID 1, present single peaks. In contrast, low-quality trajectories, such as the one with ID 9 that contains an identity switch and the overlapping ones with ID 3 and ID 220, present multiple peaks.
To quantify the validity of our Gaussian assumption in Section 3.2, we use the descriptor provided in 
to perform a normality test on the ground truth of the MOT16 train set and find that 74% of the trajectories can be approximated as Gaussian distributions at a significance level of 0.1. Under low-density scenarios like MOT16-05, the percentage rises to 88%. Considering that counterexamples may occur in practice, such as two similarly dressed people, we have also tested the performance of the descriptor on classifying unique person IDs in MOT16’s detection boxes. When the precision is set to 0.95, the recall and mAP reach 0.94 and 0.98 respectively. Therefore, we consider that the counterexamples make up only a small portion.
However, due to non-ideal factors, the final distances do not fully obey the theoretical chi distribution. We take ID 0 for example. Although a similar overall shape is shown, the hypothesis test has an extremely low p-value of 0, indicating a statistically significant difference. This may have two reasons: (1) Bias is introduced when using sample statistics to replace the true mean and variance for standardization; (2) Features extracted by the REID model are not independent in each dimension. The second reason is very common, since deep neural networks tend to cause strong correlations between multiple dimensions.
It is encouraging that the trajectories of different qualities still retain the distinctive single or multiple peaks. The more frames with wrong identification, the more obvious the two peaks, and the larger interval between them. In practice we found that fitting a two-class Gaussian distribution and setting a threshold for the mean difference can qualitatively detect those low-quality trajectories which significantly affect tracking performance. According to the visualization results, we also found that the false alarm trajectory is usually short in length, large in variance, and may interfere with the inter distances to produce multiple peaks. These trajectories for which no real target exists are also categorized as low-quality trajectories.
Based on the above criteria and distance distribution analyses, we propose a novel self quality evaluation metric SQE, which can be expressed as:
The specific explanation is detailed below. The evaluation process is summarized in Algorithm 1 and mainly divided into four steps:
(1) For a trajectory with short length and large standard deviation, we mark it as false alarm and accumulate.
(2) For the rest trajectories we utilize a two-class Gaussian mixture model to fit the intra distances, and judge whether it is a low-quality trajectory according to the mean difference. If it exceeds a certain threshold, we assert that this trajectory contains more than one target and accumulate a difference error, denoted by .
(3) Similarly, the inter distances of each two non-false alarm trajectories are also fitted. They are considered to match the same target with a large mean difference, and the similarity error is denoted by .
(4) Other internal characteristics like the number and mean length of trajectories are also embedded.
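Steps (1) and (2) above can be sketched as follows. This is a simplified illustration under stated assumptions, not the exact implementation: the two-class Gaussian mixture is fitted with a plain one-dimensional EM loop, the thresholds `min_len`, `max_std`, and `max_gap` are hypothetical placeholders, and the inter distance check of step (3) is omitted.

```python
import math
from statistics import pstdev
from typing import Dict, List, Sequence, Tuple


def fit_two_gaussians(xs: Sequence[float], iters: int = 50) -> Tuple[float, float]:
    """Fit a 1-D two-component Gaussian mixture by EM; return the sorted means."""
    lo, hi = min(xs), max(xs)
    mu = [lo + 0.25 * (hi - lo), lo + 0.75 * (hi - lo)]  # deterministic start
    var = [max(pstdev(xs) ** 2, 1e-6)] * 2
    weight = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each distance.
        resp = []
        for x in xs:
            p = [weight[k] * math.exp(-(x - mu[k]) ** 2 / (2 * var[k]))
                 / math.sqrt(2 * math.pi * var[k]) for k in range(2)]
            total = p[0] + p[1]
            resp.append((0.5, 0.5) if total == 0.0 else (p[0] / total, p[1] / total))
        # M-step: update weight, mean, and variance of each component.
        for k in range(2):
            nk = max(sum(r[k] for r in resp), 1e-12)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk, 1e-6)
            weight[k] = nk / len(xs)
    return min(mu), max(mu)


def classify(intra: Sequence[float], length: int,
             min_len: int, max_std: float, max_gap: float) -> str:
    """Label one trajectory as 'false_alarm', 'mixed', or 'clean'."""
    if len(intra) < 2:
        raise ValueError("need at least two intra distances")
    if length < min_len and pstdev(intra) > max_std:
        return "false_alarm"  # step (1): short and widely spread
    low, high = fit_two_gaussians(intra)
    return "mixed" if high - low > max_gap else "clean"  # step (2)


def count_errors(trajectories: List[Tuple[int, Sequence[float]]],
                 min_len: int, max_std: float, max_gap: float) -> Dict[str, int]:
    """Accumulate the number of trajectories per label."""
    counts = {"false_alarm": 0, "mixed": 0, "clean": 0}
    for length, intra in trajectories:
        counts[classify(intra, length, min_len, max_std, max_gap)] += 1
    return counts
```

The deterministic initialization is chosen here only to make the sketch reproducible; a library mixture model with random restarts would behave similarly but introduces the fitting randomness discussed later in the paper.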
When the REID threshold is set too strict, there are so many detection boxes being excluded that and are both small; when
remains almost constant, the two variables have opposite trend, and extreme situations including excessively fragmented or concatenated trajectories will lead to imbalance between them. To downgrade these poor tracking results, we employ the form of harmonic mean, and setto accommodate moving speed and density of tracking objects. For pedestrian tracking task on street videos, the magnitude of and is approximately equivalent, and thus could be set to concisely.
Based on this rough constraint form, a correction item is added to the denominator. We have demonstrated that the accumulated , and can reflect the number of low-quality trajectories. Therefore, their sum is expected to be small, and meanwhile the value of is large. The correction item actually plays a key role within the range of moderate values of and . is used to adjust the ratio between , and sum of errors.
Parameters in SQE are not difficult to set. is comparable to the video’s frame rate. With a high-precision ReID model, randomly selecting false alarms and ID switch examples from reference videos is adequate to observe and , so as to set and accordingly. Additionally, when the tracker and task (vehicle/pedestrian) are given, and could be set empirically.
Implementation details. We assess our self evaluation method mainly on the MOT16 Challenge data sets , which contain 14 video sequences (7 for training, 7 for testing) taken by both static and moving cameras from different angles in different scenes. We focus our study on pedestrian tracking and make use of the person ReID model provided by . All the experiments are completed with the same parameter setting: , , , and takes 2 and 10 for the REID threshold and the merging threshold, respectively. The REID threshold varies from 0.3 to 1.6, beyond which remains invariant. Similarly, the merging threshold varies from 0.5 to 1.5. The parameter optimization process is based on grid search. The rest of this section demonstrates the accuracy, universality, and effectiveness of our self quality evaluation metric SQE.
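The grid search over the two thresholds can be sketched as below. The step size of 0.1 and the callable `score_fn`, which stands for running the tracker with the given thresholds and returning its self evaluation score, are assumptions made for illustration.

```python
from itertools import product
from typing import Callable, List, Tuple


def frange(start: float, stop: float, step: float) -> List[float]:
    """Inclusive list of evenly spaced values, rounded to avoid float drift."""
    n = int(round((stop - start) / step))
    return [round(start + i * step, 10) for i in range(n + 1)]


def grid_search(score_fn: Callable[[float, float], float],
                reid_grid: List[float],
                merge_grid: List[float]) -> Tuple[Tuple[float, float], float]:
    """Return the (REID threshold, merging threshold) pair with the highest score."""
    best = max(product(reid_grid, merge_grid), key=lambda p: score_fn(*p))
    return best, score_fn(*best)


if __name__ == "__main__":
    # Stand-in score with a single optimum; a real score_fn would run the tracker.
    def toy_score(reid: float, merge: float) -> float:
        return -(reid - 0.8) ** 2 - (merge - 1.0) ** 2

    print(grid_search(toy_score, frange(0.3, 1.6, 0.1), frange(0.5, 1.5, 0.1)))
```

The same routine applies when `score_fn` returns a supervised score on annotated videos, which is how baseline parameters would be obtained.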
denotes that the score for parameters is only calculated after the parameters are determined by , but not used to tune the parameters.
Comparison with supervised metrics. To demonstrate the effectiveness of our self evaluation metric in evaluating tracking performance, we compare its score with existing commonly used supervised metrics on the MOT16-02 training video, and visualize SQE and IDF1 in Figure 4. We found that as the REID threshold ascends, both SQE and IDF1 increase at first and decrease afterwards, and reach the highest value at 0.8 with relatively high , , and . These two items present a very similar trend, which indicates that our designed metric largely achieves the desired positive correlation with IDF1, which generally measures the performance of identification the best.
MOT16-02 video records a complex scene with a large number of people walking around a large square. We further analyse the result on MOT16-09 video, a simpler street scene with low density and the least number of tracks from a low angle, in Figure 5. The favourable similarity illustrates that our self evaluation method can be generalized to different viewpoints and scenarios. The detailed results on other videos are provided in the supplementary material. We summarize the optimal REID threshold determined by and in Table 1, with corresponding evaluation scores under these parameters. Our self evaluation method can approximately quantify tracking performance, specifically, 85% of the optimal parameter differences do not exceed 0.25, and 85% of the corresponding differences do not exceed 3.
Generalization to other tracking algorithms.
To illustrate the robustness and universality of our method, other tracking algorithms should also be tested as a supplementary experiment. We choose Deep SORT, which is one of the most widely recognized open source MOT algorithms in recent years. The REID threshold corresponds to the matching cosine threshold in Deep SORT. This algorithm replaces our interpolation logic with IOU matching, causing the features during occlusion periods to produce a small interference peak in the intra distance distribution; therefore, we remove the feature information of these frames when performing self evaluation. As shown in Figure 6, a strong correlation between SQE and IDF1 is observed, demonstrating the success of our method on other trackers.
Generalization to other parameters. We further test the universality of our method on other parameters. Besides the REID threshold, the merging threshold is another dominant factor affecting final tracking performance. Similarly, we visualize the comparison of SQE and IDF1 for both complex and simple scenes in Figures 7 and 8. The results still maintain positive correlations. Table 2 shows a high accuracy, with 5 out of 7 videos having an optimal parameter difference below 0.1, and almost all the corresponding differences do not exceed 3.
Practical testing. Our ultimate goal is to find the optimal parameters in realistic scenes where ground truth is unavailable. Additionally, in reality the training data is relatively small in scale compared to the unknown test environment. To test our method in a pragmatic manner, we regard the first 4 training videos as our test set and the last 3 training videos as our training set. Conventionally, the parameters are tuned on the training set and remain constant during testing. In our simulation, we refer to these parameters as the baseline parameters. Conversely, our metric can guide the self-optimization of parameters without ground truth. Thus, it is employed directly to tune the 4 testing videos individually.
In reality we can first acquire baseline parameters as a reference on small-scale training data, then conduct self evaluation to further optimize the parameters in a relatively small range. The procedure of computing the customized parameters is as follows: (1) Find the baseline parameters; (2) For each testing video, fix one parameter at its baseline value and tune the other according to SQE, then do the same with the roles swapped; (3) Combine the two tuned values to form the customized parameters.
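One possible reading of this procedure is sketched below: each parameter is tuned on its own grid while the other is held at its baseline value, and the two tuned values are then combined. The callable `score_fn`, standing for the self evaluation score of the tracker on one test video, and the candidate grids are hypothetical.

```python
from typing import Callable, Sequence, Tuple


def customize(score_fn: Callable[[float, float], float],
              baseline: Tuple[float, float],
              reid_grid: Sequence[float],
              merge_grid: Sequence[float]) -> Tuple[float, float]:
    """Tune each threshold separately around the baseline, then combine."""
    base_reid, base_merge = baseline
    # Tune the REID threshold with the merging threshold fixed at baseline.
    best_reid = max(reid_grid, key=lambda r: score_fn(r, base_merge))
    # Tune the merging threshold with the REID threshold fixed at baseline.
    best_merge = max(merge_grid, key=lambda m: score_fn(base_reid, m))
    return best_reid, best_merge


if __name__ == "__main__":
    # Stand-in score; in practice this would run the tracker on one test video.
    def toy_score(reid: float, merge: float) -> float:
        return -(reid - 0.9) ** 2 - (merge - 1.2) ** 2

    print(customize(toy_score, baseline=(0.8, 1.0),
                    reid_grid=[0.7, 0.8, 0.9, 1.0],
                    merge_grid=[0.9, 1.0, 1.1, 1.2, 1.3]))
```

Restricting both grids to a small neighbourhood of the baseline keeps the number of tracker runs per video linear in the grid size rather than quadratic.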
Our method is considered effective if the tracker using the customized parameters outperforms the tracker using constant training-set-tuned parameters. The result is shown in Table 3, where gt denotes the true optimal parameters on each video. To be rigorous, we use the best parameters found by grid search on the 3 assumed training videos as the baseline. It is apparent that the parameters tuned by SQE achieve considerable improvement compared to the baseline, and the results are much closer to the true optimum, showing the effectiveness of our method when implemented in a practical manner.
we use the 5 videos with the most pedestrians in KITTI train set.
To further illustrate the performance of self-optimization using SQE, we experiment on the MOT16 test set and the KITTI train set. The baseline parameters are the best parameters found by grid search on the MOT16 training set, which already outperform empirical parameters by 5.8%. This setup is based on the updated submission policy of KITTI (http://www.cvlibs.net/datasets/kitti/eval_tracking.php), and we believe it can simulate pedestrian tracking in reality, where test scenes vary greatly compared to annotated videos. As shown in Table 4, the parameter self-optimization enabled by SQE elevates the performance of the tracker on these data sets.
Drawbacks and prospects. The above experiments reflect the effectiveness of our proposed metric, while there are still some drawbacks worth noting. Firstly, due to the randomness during model fitting, and possess several units of uncertainty, resulting in insufficient sensitivity to small changes in . Secondly, current metrics lack physical consistency explanation. is calculated by , and , while our method simply records the number of low-quality trajectories. A more precise idea is to estimate and relying on the quantity information. Assume that for a trajectory where an identity switch occurs, target A appears frames while target B appears frames. The total length is and the number of distances in the class with larger values is . Then A and B satisfy the following conditions:
which can be easily solved. We can make estimations by:
Such processing for the intra distance distribution can accurately estimate the number of erroneous frames. Furthermore, the inter distance distribution can help refine the estimations. For example, if there is another trajectory that also tracks A, we only keep the longer one as according to the calculation rule of . However, more detailed considerations are needed for global precise estimations. In addition, categorizing low-quality trajectories and estimating erroneous frames may also be conducive to tracker’s post-processing so as to improve tracking performance. Finally, the adjustable parameters and need to be defined more strictly. We plan to investigate these downsides in the future.
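The frame-count estimate discussed above can be illustrated under one explicit assumption: if target A occupies $a$ frames and target B occupies $b$ frames of a trajectory of total length $l$, then $a + b = l$, and the class of larger distances contains exactly the $m = a \cdot b$ cross-target pairs. The symbols and the function name below are our own illustrative choices.

```python
import math
from typing import Tuple


def split_lengths(total_len: int, n_large: int) -> Tuple[float, float]:
    """Solve a + b = total_len and a * b = n_large; return (longer, shorter)."""
    disc = total_len ** 2 - 4 * n_large
    if disc < 0:
        raise ValueError("no real solution: too many large distances for this length")
    root = math.sqrt(disc)
    return (total_len + root) / 2, (total_len - root) / 2


if __name__ == "__main__":
    # A 50-frame trajectory whose larger-distance class holds 525 pairs.
    print(split_lengths(50, 525))
```

In practice the count of large distances comes from a fitted mixture, so the solution is generally not an integer and would need rounding before being used as a frame count.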
In this paper, we propose a self quality evaluation metric to enable parameter optimization in the test environment and in realistic scenes where ground truth is unavailable. This new perspective can bypass the difficulty of designing an algorithm that performs well in various scenes. We demonstrate that trajectories of different qualities exhibit single or multiple peaks in the feature distance distribution, inspiring us to use a two-class Gaussian mixture model to estimate identification errors. Experiments mainly on the MOT16 Challenge data sets demonstrate the effectiveness of our method in both correlating with existing metrics and enabling parameter self-optimization to achieve better tracking performance. Finally, the drawbacks and prospects for future work are summarized. We believe that our work is instructive for further MOT research.
This research was supported by National Key R&D Program of China (No. 2017YFA0700800).
-  (2012) Discrete-continuous optimization for multi-target tracking. In , pp. 1926–1933. Cited by: §1.
-  (2011) Multiple object tracking using k-shortest paths optimization. IEEE transactions on pattern analysis and machine intelligence 33 (9), pp. 1806–1819. Cited by: §2.1.
-  (2008) Evaluating multiple object tracking performance: the clear mot metrics. Journal on Image and Video Processing 2008, pp. 1. Cited by: §1, §2.2.
-  (2016) Simple online and realtime tracking. In 2016 IEEE International Conference on Image Processing (ICIP), pp. 3464–3468. Cited by: §2.1.
-  (2010) Online multiperson tracking-by-detection from a single, uncalibrated camera. IEEE transactions on pattern analysis and machine intelligence 33 (9), pp. 1820–1833. Cited by: §1.
-  (2019) Multi-object tracking with multiple cues and switcher-aware classification. arXiv preprint arXiv:1901.06129. Cited by: §1, §2.1.
-  (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.
-  (1955) The hungarian method for the assignment problem. Naval research logistics quarterly 2 (1-2), pp. 83–97. Cited by: §2.1.
-  (2015) Followme: efficient online min-cost flow tracking with bounded memory and computation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4364–4372. Cited by: §1, §2.1, §2.1.
-  (2009) Learning to associate: hybridboosted multi-target tracker for crowded scene. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2953–2960. Cited by: §2.2.
-  (2016) MOT16: a benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831. Cited by: §2.2, §4.
-  (2017) Online multi-target tracking using recurrent neural networks. In Thirty-First AAAI Conference on Artificial Intelligence, Cited by: §2.1.
-  (2007) ETISEO, performance evaluation for video surveillance systems. In 2007 IEEE Conference on Advanced Video and Signal Based Surveillance, pp. 476–481. Cited by: §2.2.
-  (2004) Markov chain monte carlo data association for general multiple-target tracking problems. In 2004 43rd IEEE Conference on Decision and Control (CDC)(IEEE Cat. No. 04CH37601), Vol. 1, pp. 735–742. Cited by: §2.1.
-  (2011) Globally-optimal greedy algorithms for tracking a variable number of objects. In CVPR 2011, pp. 1201–1208. Cited by: §2.1.
-  (2016) Performance measures and a data set for multi-target, multi-camera tracking. In European Conference on Computer Vision, pp. 17–35. Cited by: §1, §2.2.
-  (2008) A consistent metric for performance evaluation of multi-object filters. IEEE transactions on signal processing 56 (8), pp. 3447–3457. Cited by: §2.2.
-  (2017) Deep network flow for multi-object tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6951–6960. Cited by: §1.
-  (2019-10) Probabilistic face embeddings. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §3.2.
-  (2005) Evaluating multi-object tracking. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05)-Workshops, pp. 36–36. Cited by: §2.2.
-  (2018) Beyond part models: person retrieval with refined part pooling (and a strong convolutional baseline). In Proceedings of the European Conference on Computer Vision (ECCV), pp. 480–496. Cited by: §3.3, §4.
-  (2017) Multiple people tracking by lifted multicut and person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3539–3548. Cited by: §2.1.
-  (2017) Simple online and realtime tracking with a deep association metric. In 2017 IEEE International Conference on Image Processing (ICIP), pp. 3645–3649. Cited by: §1, §2.1.
-  (2015) Learning to track: online multi-object tracking by decision making. In Proceedings of the IEEE international conference on computer vision, pp. 4705–4713. Cited by: §2.1.
-  (2019) Online multiple pedestrian tracking using deep temporal appearance matching association. arXiv preprint arXiv:1907.00831. Cited by: §1, §2.1.
-  (2019) Frame-wise motion and appearance for real-time multiple object tracking. arXiv preprint arXiv:1905.02292. Cited by: §1, §2.1.
-  (2008) Global data association for multi-object tracking using network flows. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. Cited by: §2.1.