Multi-Object Tracking using Poisson Multi-Bernoulli Mixture Filtering for Autonomous Vehicles

03/13/2021 ∙ by Su Pang, et al.

The ability of an autonomous vehicle to perform 3D tracking is essential for safe planning and navigation in cluttered environments. The main challenges for multi-object tracking (MOT) in autonomous driving applications reside in the inherent uncertainties regarding the number of objects, when and where the objects may appear and disappear, and uncertainties regarding objects' states. Random finite set (RFS) based approaches can naturally model these uncertainties accurately and elegantly, and they have been widely used in radar-based tracking applications. In this work, we develop an RFS-based MOT framework for 3D LiDAR data. In particular, we propose a Poisson multi-Bernoulli mixture (PMBM) filter to solve the amodal MOT problem for autonomous driving applications. To the best of our knowledge, this represents a first attempt at employing an RFS-based approach in conjunction with 3D LiDAR data for MOT applications, with comprehensive validation using challenging datasets made available by industry leaders. The superior experimental results of our PMBM tracker on the public Waymo and Argoverse datasets clearly illustrate that an RFS-based tracker outperforms many state-of-the-art deep learning-based and Kalman filter-based methods, and consequently, these results indicate great potential for further exploration of RFS-based frameworks for 3D MOT applications.




1 Introduction

Figure 1:

Overview of the proposed PMBM tracker pipeline. For each frame, many 3D detections are generated by a neural-network-based 3D detector, as the red bounding boxes in the top two figures illustrate. Our PMBM tracker successfully tracks targets and filters out false positives. In the lower two figures, different bounding box colors correspond to different unique tracked IDs. The green pentagon in the center of each figure represents the ego vehicle pose. All figures are in top-down view. Best viewed in color.

Multiple object tracking (MOT) is a critical module for enabling an autonomous vehicle to achieve robust perception of, and consequently safe maneuvering within, its surrounding environment. The main challenges for MOT in autonomous driving applications are threefold: (1) uncertainty in the number of objects; (2) uncertainty regarding when and where the objects may appear and disappear; (3) uncertainty in objects' states. Traditional filtering-based methods, such as Kalman filtering [7, 18, 3], perform well in state update and estimation but can hardly model the unknown number of objects and the so-called birth and death phenomena of objects. Meanwhile, the emergence of random finite set (RFS) [15, 11, 16] based approaches has opened the door for developing theoretically sound Bayesian frameworks that naturally model all the aforementioned uncertainties accurately and elegantly.

RFS-based MOT algorithms have been shown to be very effective for radar-based MOT applications [17, 13]. In particular, Poisson multi-Bernoulli mixture (PMBM) filtering has shown superior tracking performance and favourable computational cost [19] when compared to other RFS-based approaches. Consequently, in this work, we propose a PMBM filter to solve the amodal MOT problem for autonomous driving applications (Fig 1). Applying RFS-based trackers to 3D LiDAR data and/or to 2D/3D amodal detections (bounding boxes) has not been well explored. Existing works in this area either under-perform state-of-the-art trackers or have been tested only on small datasets that do not reflect broad and truly challenging scenarios [8, 10, 6]. We believe that RFS-based methods could provide a robust and highly effective solution for these emerging detection modalities.

The contributions of our paper are as follows: (1) We propose a PMBM filter to solve the amodal MOT problem for autonomous driving applications. To the best of our knowledge, this represents a first attempt at employing an RFS-based approach in conjunction with 3D LiDAR data and neural-network-based detectors. (2) We demonstrate that our PMBM tracker is low-complexity and can run at an average rate of 20 Hz on a standard desktop. (3) We validate and test the performance of our PMBM tracker using two extensive open datasets provided by two industry leaders, Waymo [14] and Argoverse [2]. These datasets, which together comprise more than 80,000 diverse testing frames, clearly demonstrate that our tracker outperforms many state-of-the-art methods under realistic driving conditions. It is worth noting that our PMBM tracker ranks No. 2 in average MOTA and No. 1 in vehicle MOTA among all the entries that use organizer-provided detections on the Argoverse dataset.

Figure 2: PMBM tracker system architecture. There are four primary components: (1) PMBM Predictions; (2) Data Association; (3) PMBM Update; and (4) Reduction. PPP represents Poisson point process, and MBM stands for multi-Bernoulli mixture. The detail for each component is provided in Section 2.

2 Approach

The PMBM filter is a relatively new approach to the MOT problem. It tracks objects within a Bayesian framework, and it models undetected objects and detected objects using two distinct probability distributions. The high-level architecture of the proposed PMBM MOT tracker is shown in Fig 2. As shown in the figure, the tracker consists of four primary components: (1) PMBM Predictions; (2) Data Association; (3) PMBM Update; and (4) Reduction.

2.1 Object State

The object state used in this work is defined as $\mathbf{x} = [x, y, v_x, v_y]^T$, where $x$ and $y$ represent the 2D location of the object, and $v_x$ and $v_y$ are the velocities along the $x$ and $y$ directions, respectively. The reasons we define the state in this compressed way are as follows: first, the $z$ value does not change dramatically across consecutive frames; second, the dimensions of the objects are already precise from a neural-network-based detector, so it is not necessary to incorporate all 3D information; third, reducing the state dimension inherently enables the tracking system to operate at a lower computational cost for real-time performance. Under this work, our PMBM tracker is designed as a point-based tracker. The center points of the objects, initially output from the detector, are tracked using unique tracked IDs, while the bounding box dimensions (height, width, length) are directly extracted from the detection measurements.
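The compressed point-object state above can be sketched as a small container class. This is an illustrative sketch only (the class and field names are ours, not the authors' code), assuming a NumPy-style state vector with the box dimensions carried alongside rather than filtered:

```python
import numpy as np

class TrackedObject:
    """Point-object state [x, y, v_x, v_y]^T; box dimensions come from the detector."""
    def __init__(self, x, y, vx=0.0, vy=0.0, dims=(0.0, 0.0, 0.0), track_id=None):
        self.state = np.array([x, y, vx, vy], dtype=float)  # filtered quantities
        self.dims = dims          # (height, width, length), taken directly from detections
        self.track_id = track_id  # unique tracked ID

    @property
    def position(self):
        return self.state[:2]

obj = TrackedObject(x=12.0, y=-3.5, vx=4.0, vy=0.1, dims=(1.6, 1.9, 4.5), track_id=7)
```

Keeping the dimensions out of the filtered state is what keeps the prediction/update matrices 4-dimensional and cheap.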

2.2 Detected and Undetected Objects

Under the PMBM model, the set of objects $X_k$ at time step $k$ is the union of detected objects $X_k^d$ and undetected objects $X_k^u$. Detected objects are objects that have been detected at least once; undetected objects are objects that have not yet been detected. Note that we do not explicitly track the undetected objects, which would be impossible under a tracking-by-detection framework; instead, we maintain a representation of their possible existence. For example, consider an autonomous car in a scenario where a large truck blocks part of the view. It is possible that some objects are located in the occluded area behind the truck, and hence these objects are inherently undetected.

2.3 Data Association Hypotheses

For each timestamp, there are multiple hypotheses for data association. In our measurement-driven framework, each measurement is associated with either a newly detected target, a previously detected target, or a false positive detection. We form different global association hypotheses from possible combinations of the single target hypotheses (STHs). Gating is used to reduce the total number of hypotheses, keeping only reasonable ones. Murty's algorithm [12], an extension of the Hungarian algorithm [9], is used to generate the $k$ best global hypotheses instead of only one.
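The single best global hypothesis of this assignment step can be illustrated with the Hungarian algorithm as implemented in SciPy. This is a hedged sketch with made-up costs and gate; Murty's algorithm would enumerate the k best solutions of the same cost matrix, whereas SciPy returns only the single best one:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Rows: predicted objects; columns: measurements.
# Entries: association costs, e.g. negative log-likelihoods (illustrative values).
cost = np.array([
    [1.0, 4.0, 9.0],   # object 0 vs measurements 0..2
    [3.0, 2.0, 8.0],   # object 1 vs measurements 0..2
])

GATE = 5.0                                   # gating threshold (illustrative)
gated = np.where(cost < GATE, cost, 1e6)     # forbid implausible pairings
rows, cols = linear_sum_assignment(gated)    # Hungarian algorithm: best hypothesis
pairs = [(r, c) for r, c in zip(rows, cols) if gated[r, c] < 1e6]
# pairs -> [(0, 0), (1, 1)]; measurement 2 is left unassociated (new target or clutter)
```

Unassociated measurements then feed the "detected for the first time" update, consistent with the measurement-driven framework described above.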

2.4 PMBM Density

Under the PMBM model, we use a Poisson RFS, also known as a Poisson point process (PPP), to represent undetected objects, and a multi-Bernoulli mixture (MBM) RFS to represent detected objects. The PMBM density is defined as a convolution of a PPP density for undetected objects and an MBM density for detected objects:

$$f(X_k) = \sum_{X_k^u \uplus X_k^d = X_k} f_P(X_k^u)\, f_{MBM}(X_k^d)$$

where $X_k$ represents all the objects in the surveillance area, $X_k^u \uplus X_k^d$ is the disjoint union of undetected objects $X_k^u$ and detected objects $X_k^d$, and $f_P$ and $f_{MBM}$ are the Poisson point process density and the multi-Bernoulli mixture density, respectively.

2.5 PMBM Prediction

A crucial aspect of the PMBM filter is its conjugacy property, which was proved in [5]. The notion of conjugacy is quite critical for robust and accurate Bayesian-based MOT. In summary, the conjugacy of the PMBM filter implies that if the prior is in PMBM form, then the distribution after the Bayesian prediction and update steps will be of the same form. The prediction stage of a PMBM filter can therefore be written as:

$$f_{k|k-1}(X_k) = \int f(X_k \mid X_{k-1})\, f_{k-1|k-1}(X_{k-1})\, \delta X_{k-1}$$

where $f(X_k \mid X_{k-1})$ represents the transition density. A constant velocity model is used as the motion model in this work for simplicity. Under the PMBM filter, undetected and detected objects can be predicted independently. We define $P_S$ as the probability of survival, which models the probability that an object survives from one time step to the next. For undetected objects, the predicted parameters consist of the predicted parameters from the previous timestamp and the PPP birth parameters. The weight of each undetected object is scaled by $P_S$ in the prediction step. For detected objects, which are modeled as multi-Bernoulli mixture RFSs, each multi-Bernoulli (MB) process can also be predicted independently of the other MB processes. The probability of existence for each MB-modeled object is decreased by a factor $P_S$ in order to account for the higher uncertainty of existence within the prediction stage.
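Under the constant-velocity motion model, the prediction step for a single component reduces to a standard Kalman prediction plus scaling by the survival probability. The sketch below uses illustrative values for the time step, the process noise Q, and P_S; none of these are the paper's tuned parameters:

```python
import numpy as np

dt, P_S = 0.1, 0.99  # time step and survival probability (illustrative)

# Constant-velocity transition for [x, y, v_x, v_y]
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]], dtype=float)
Q = 0.01 * np.eye(4)  # process noise (illustrative)

def predict(mean, cov, weight_or_existence):
    """Kalman prediction; PPP weights and MB existence probs are both scaled by P_S."""
    return F @ mean, F @ cov @ F.T + Q, P_S * weight_or_existence

mean_p, cov_p, r_p = predict(np.array([10.0, 5.0, 2.0, 0.0]), np.eye(4), 1.0)
# mean_p -> [10.2, 5.0, 2.0, 0.0]: position advanced by velocity * dt
```

Because the prediction is the same linear operation for both PPP and MB components, undetected and detected objects can indeed be predicted independently, as stated above.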

2.6 PMBM Update

By adding information from the measurement model $g(Z_k \mid X_k)$, the PMBM density can then be updated via the Bayes rule:

$$f_{k|k}(X_k) = \frac{g(Z_k \mid X_k)\, f_{k|k-1}(X_k)}{\int g(Z_k \mid X)\, f_{k|k-1}(X)\, \delta X}$$

In the update step, undetected objects that do not have any measurement associated with them remain undetected. The Bayesian update thus does not change the states or variances of the Poisson distributions, since no new information is added. Here we define $P_D$ as the probability of detection, which models the probability that an object is detected. For undetected objects without an associated measurement, the weight is thus decreased by a factor $(1 - P_D)$ to account for the decreased probability of existence. For detected objects, the predicted state is updated by weighting in the information contained in the measurement.

There are two different types of updates for detected objects: objects detected for the first time, and detected objects from the previous timestamp. Our tracker is a measurement-driven framework: an object must be connected to a measurement in order to be classified as detected for the first time. All the undetected PPP intensity components and corresponding gated measurements are considered to generate the fused distribution. Note that the detections provided by a neural network always have confidence scores attached to them. This confidence score is an invaluable indicator of the object's probability of existence. So, unlike a standard PMBM filter, we incorporate the detection confidence score into the update step for objects detected for the first time, obtaining a new Bernoulli process for each first-time detected object. As for detected objects from the previous timestamp: if there are measurements associated with them, then for each hypothesis a standard Kalman filter is used to update the state vector, and the updated probability of existence is set to 1, because one cannot associate a measurement with an object that does not exist. If there is no measurement associated with an object that was detected in a previous frame, then we keep the object's predicted state unchanged, and we decrease its probability of existence and weight by a factor related to $P_D$. Note that $P_D$ here is related to the associated detection confidence scores in the past frames. Hence, unlike other standard Kalman-filter-based trackers, the survival time of detected objects without measurements varies based on the tracking status from the previous time period.
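The two update cases for a previously detected object can be sketched as follows. The associated case is a standard Kalman update with the existence probability set to 1; for the missed case we show the standard Bernoulli missed-detection update from the PMBM literature. H, R, and the fixed P_D are illustrative assumptions (the paper instead ties the detection probability to past confidence scores):

```python
import numpy as np

P_D = 0.9                        # probability of detection (illustrative constant)
H = np.array([[1.0, 0, 0, 0],    # we observe only the 2D position [x, y]
              [0, 1.0, 0, 0]])
R = 0.1 * np.eye(2)              # measurement noise (illustrative)

def update_with_measurement(mean, cov, z):
    """Associated case: standard Kalman update; existence set to 1."""
    S = H @ cov @ H.T + R                      # innovation covariance
    K = cov @ H.T @ np.linalg.inv(S)           # Kalman gain
    mean_u = mean + K @ (z - H @ mean)
    cov_u = (np.eye(4) - K @ H) @ cov
    return mean_u, cov_u, 1.0                  # r = 1: an associated object must exist

def update_missed(mean, cov, r):
    """Missed case: state unchanged; existence discounted for non-detection."""
    r_u = r * (1 - P_D) / (1 - r * P_D)        # standard Bernoulli missed-detection update
    return mean, cov, r_u
```

Note how the missed update leaves the mean and covariance untouched, exactly as described above for objects without an associated measurement.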

| Method | Split | Class | MOTA (%) | MOTP | #False Positives | #Misses | #IDS | #FRAG |
|---|---|---|---|---|---|---|---|---|
| Argoverse Baseline [2] | validation | Vehicle | 63.15 | 0.37 | 17385 | 19811 | 122 | 323 |
| Our PMBM Tracker | validation | Vehicle | 68.70 | 0.375 | 10644 | 20782 | 271 | 1116 |
| Argoverse Baseline [2] | validation | Pedestrian | 41.70 | 0.257 | 3783 | 14411 | 113 | 123 |
| Our PMBM Tracker | validation | Pedestrian | 45.50 | 0.249 | 4935 | 11653 | 201 | 903 |
| Argoverse Baseline [2] | test | Vehicle | 65.90 | 0.34 | 15693 | 23594 | 200 | 393 |
| Our PMBM Tracker | test | Vehicle | 71.67 | 0.34 | 8278 | 24165 | 362 | 1217 |
| Argoverse Baseline [2] | test | Pedestrian | 48.31 | 0.37 | 4933 | 25780 | 424 | 387 |
| Our PMBM Tracker | test | Pedestrian | 48.56 | 0.4 | 5924 | 24278 | 783 | 1750 |

Table 1: Quantitative comparison of 3D MOT evaluation results on Argoverse dataset
| Method | Split | Class | MOTA (Primary) (%) | MOTP | False Positives (%) | Misses (%) |
|---|---|---|---|---|---|---|
| Waymo Baseline [14] | test | All | 25.92 | 0.263 | 13.98 | 64.55 |
| Argoverse Baseline [2] | test | All | 29.14 | 0.270 | 17.14 | 53.47 |
| Probabilistic KF [3] | test | All | 36.57 | 0.270 | 8.32 | 54.02 |
| Our PMBM Tracker | test | All | 38.51 | 0.270 | 7.74 | 52.86 |

Table 2: Quantitative comparison of 3D MOT evaluation results on Waymo dataset

2.7 Reduction

The assignment problem in MOT is theoretically NP-hard [4], and hence reducing the number of hypotheses is necessary for decreasing the computational complexity and maintaining real-time performance. Five reduction techniques are used in this work: pruning, capping, gating, recycling and merging. Pruning removes objects and global hypotheses with low weights. Capping sets an upper bound on the number of global hypotheses and detected objects. Gating limits the search distance for data association, where Mahalanobis distance is used instead of Euclidean distance. Recycling is applied to detected objects with a low probability of existence: instead of discarding these objects, we recycle them by moving them from the detected object set to the undetected object set. Finally, since there may be non-unique global hypotheses, merging combines identical global hypotheses into one.
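Gating with the Mahalanobis distance can be sketched as below. The innovation covariance S, the example measurements, and the chi-square threshold are illustrative assumptions; the point is that the gate accounts for the predicted uncertainty in each direction, unlike a plain Euclidean radius:

```python
import numpy as np

def mahalanobis_gate(z, z_pred, S, gate=9.21):
    """Accept measurement z only if its squared Mahalanobis distance to the
    predicted measurement z_pred, under innovation covariance S, is inside
    the gate (9.21 ~ chi-square 99% quantile for 2 degrees of freedom)."""
    d = z - z_pred
    d2 = d @ np.linalg.solve(S, d)  # squared Mahalanobis distance
    return d2 < gate

# Illustrative predicted measurement and innovation covariance
S = np.array([[2.0, 0.0],
              [0.0, 0.5]])
z_pred = np.array([10.0, 5.0])

print(mahalanobis_gate(np.array([10.5, 5.2]), z_pred, S))  # True: inside the gate
print(mahalanobis_gate(np.array([14.0, 9.0]), z_pred, S))  # False: outside the gate
```

Only measurement/object pairs that pass this test enter the assignment problem, which is what keeps the hypothesis count tractable.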

3 Experiment

3.1 Settings

Dataset. We evaluate our method on two popular open datasets provided by two industry leaders: Waymo [14] and Argoverse [2]. The Waymo 3D tracking dataset contains 800 training segments, 202 validation segments and 150 testing segments. Each segment includes 200 frames covering 20 seconds. There are 40400 frames in the validation set and 30000 frames in the testing set. Three classes are evaluated: vehicle, pedestrian and cyclist.

The Argoverse 3D tracking dataset consists of 113 segments (scenes), each 15-30 seconds long. The data is divided into three sets: 65 segments for training (~13000 frames), 24 segments for validation (~5000 frames) and 24 segments for testing (~4200 frames). Since our method is not learning-based, we do not need the training set, but the validation set is used for tuning the parameters. The vehicle and pedestrian classes are evaluated. The field of view of both datasets is 360 degrees.

Note that for both Waymo and Argoverse datasets, ground truth labels are only available for training and validation sets. For fairness of evaluation of testing samples, one needs to submit the tracking results to the relevant server, and the evaluation results are subsequently published on the corresponding leaderboard.

Detections. We use precomputed 3D object detections provided by the dataset organizer.

Evaluation Metric. Standard evaluation metrics [1] for MOT are used. The details of the metrics for each dataset can be found in [14, 2].

3.2 Experimental Results

The results for the Argoverse dataset and the Waymo dataset are shown in Table 1 and Table 2. As shown in these two tables, our method outperforms other state-of-the-art trackers significantly. At the time of submission (May 2020), among all the entries that use organizer-provided detections, our PMBM tracker ranked No. 3 in averaged ranking (primary metric), No. 2 in averaged MOTA for all classes, No. 1 in vehicle MOTA and No. 5 in pedestrian MOTA on the Argoverse 3D Tracking Leaderboard, and No. 5 on the Waymo 3D tracking leaderboard. For tracking-by-detection approaches, the quality of the input detections is inherently of paramount importance; nevertheless, our performance remains very competitive compared to other entries that use better, self-generated detections.

It is worth noting that our tracker performs better on vehicles than on pedestrians. We believe this is because the distances between pedestrians are much smaller than those among vehicles. Since our tracker is a point-based method, it cannot exploit object dimensions, and consequently, small distances between hypotheses can cause hypotheses that should remain separate to be merged, leading to erroneous data associations. This is why, as shown in Table 1, our tracker has higher false positives, ID switches and fragmentations for the pedestrian class. For vehicles, by contrast, the detection measurements are naturally farther apart, since two vehicles cannot overlap in the way pedestrians can crowd together.

4 Conclusion

In this paper, we propose a PMBM filter to solve the 3D amodal MOT problem with 3D LiDAR data for autonomous driving applications. Our framework can naturally model the uncertainties in the MOT problem. This represents a first attempt at employing an RFS-based approach in conjunction with 3D LiDAR data and neural-network-based detectors, with comprehensive testing on large-scale datasets. The experimental results on the Waymo and Argoverse datasets demonstrate that our approach outperforms previous state-of-the-art methods by a large margin. Finally, we hope that our results motivate future research on RFS-based trackers for self-driving applications.

Acknowledgement: This work has been supported in part by the Semiconductor Research Corporation (SRC) and by Amazon Robotics under the Amazon Research Award (ARA) program.


  • [1] K. Bernardin and R. Stiefelhagen (2008) Evaluating multiple object tracking performance: the clear mot metrics. EURASIP Journal on Image and Video Processing 2008, pp. 1–10. Cited by: §3.1.
  • [2] M. Chang, J. Lambert, P. Sangkloy, J. Singh, S. Bak, A. Hartnett, D. Wang, P. Carr, S. Lucey, D. Ramanan, et al. (2019) Argoverse: 3d tracking and forecasting with rich maps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8748–8757. Cited by: §1, Table 1, Table 2, §3.1, §3.1.
  • [3] H. Chiu, A. Prioletti, J. Li, and J. Bohg (2020) Probabilistic 3d multi-object tracking for autonomous driving. arXiv preprint arXiv:2001.05673. Cited by: §1, Table 2.
  • [4] P. Emami, P. M. Pardalos, L. Elefteriadou, and S. Ranka (2018) Machine learning methods for solving assignment problems in multi-target tracking. arXiv preprint arXiv:1802.06897. Cited by: §2.7.
  • [5] Á. F. García-Fernández, J. L. Williams, K. Granström, and L. Svensson (2018) Poisson multi-bernoulli mixture filter: direct derivation and implementation. IEEE Transactions on Aerospace and Electronic Systems 54 (4), pp. 1883–1901. Cited by: §2.5.
  • [6] K. Granström, S. Renter, M. Fatemi, and L. Svensson (2017) Pedestrian tracking using velodyne data—stochastic optimization for extended object tracking. In 2017 ieee intelligent vehicles symposium (iv), pp. 39–46. Cited by: §1.
  • [7] R. E. Kalman (1960) A new approach to linear filtering and prediction problems. Cited by: §1.
  • [8] B. Kalyan, K. Lee, S. Wijesoma, D. Moratuwage, and N. M. Patrikalakis (2010) A random finite set based detection and tracking using 3d lidar in dynamic environments. In 2010 IEEE International Conference on Systems, Man and Cybernetics, pp. 2288–2292. Cited by: §1.
  • [9] H. W. Kuhn (1955) The hungarian method for the assignment problem. Naval research logistics quarterly 2 (1-2), pp. 83–97. Cited by: §2.3.
  • [10] K. W. Lee, B. Kalyan, S. Wijesoma, M. Adams, F. S. Hover, and N. M. Patrikalakis (2010) Tracking random finite objects using 3d-lidar in marine environments. In Proceedings of the 2010 ACM Symposium on Applied Computing, pp. 1282–1287. Cited by: §1.
  • [11] R. P. Mahler (2007) Statistical multisource-multitarget information fusion. Vol. 685, Artech House Norwood, MA. Cited by: §1.
  • [12] K. G. Murty (1968) An algorithm for ranking all the assignments in order of increasing cost. Operations Research 16 (3), pp. 682–687. Cited by: §2.3.
  • [13] F. Papi, B. Vo, M. Bocquel, and B. Vo (2013) Multi-target track-before-detect using labeled random finite set. In 2013 International Conference on Control, Automation and Information Sciences (ICCAIS), pp. 116–121. Cited by: §1.
  • [14] P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine, et al. (2020) Scalability in perception for autonomous driving: waymo open dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2446–2454. Cited by: §1, Table 2, §3.1, §3.1.
  • [15] B. Vo, S. Singh, and A. Doucet (2005) Sequential monte carlo methods for multitarget filtering with random finite sets. IEEE Transactions on Aerospace and electronic systems 41 (4), pp. 1224–1245. Cited by: §1.
  • [16] B. Vo, B. Vo, and A. Cantoni (2008) Bayesian filtering with random finite set observations. IEEE Transactions on signal processing 56 (4), pp. 1313–1326. Cited by: §1.
  • [17] B. Vo and B. Vo (2011) A random finite set conjugate prior and application to multi-target tracking. In 2011 Seventh International Conference on Intelligent Sensors, Sensor Networks and Information Processing, pp. 431–436. Cited by: §1.
  • [18] X. Weng, J. Wang, D. Held, and K. Kitani (2020) 3D multi-object tracking: a baseline and new evaluation metrics. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Cited by: §1.
  • [19] Y. Xia, K. Granström, L. Svensson, and Á. F. García-Fernández (2017) Performance evaluation of multi-bernoulli conjugate priors for multi-target filtering. In 2017 20th International Conference on Information Fusion (Fusion), pp. 1–8. Cited by: §1.