Vehicular Multi-object Tracking with Persistent Detector Failures

07/25/2019 ∙ by Michael Motro, et al. ∙ 0

Autonomous vehicles often perceive the environment by feeding sensor data to a learned detector algorithm, then feeding detections to a multi-object tracker that models object motions over time. Probabilistic models of multi-object trackers typically assume that errors in the detector algorithm occur randomly over time. We instead assume that undetected objects and false detections will persist in certain conditions, and modify the tracking framework to account for them. The modifications are tested with a novel lidar-based vehicle detector, and shown to enable real-time detection and tracking without specialized computing hardware.




I Introduction

Interactive robots such as self-driving cars require accurate methods to locate relevant objects such as other traffic participants. They also must predict other participants’ actions or understand their role in the environment. The increasingly complex environments traversed by robots have demanded new paradigms of perception. For instance, self-driving vehicles in urban settings may need to detect several types of stationary and moving objects within tens of meters in all directions. Camera and laser-based perception of urban settings is typically performed by learned algorithms that directly transform raw data into object estimates. The quality of such object detection is proportional to the resolution of the sensor and to the computational resources available.

Given imperfect information about present objects at each time, multi-object tracking (MOT) maintains an estimate of all present relevant objects and infers motion or other information that can be deduced from viewing an object over time. Trackers are often built around a probabilistic model that includes known characteristics of object motion and sensor behavior. The same approach can be applied to MOT by propagating a joint distribution across the states of all objects. The probabilistic approach to MOT can also draw conclusions about objects that have gone undetected due to obstructed view or poor sensing conditions, as it specifies a probability that each tracked object is actually present.

However, several aspects of robotics challenge a practical probabilistic implementation. Practical probabilistic MOT requires principled approximations of this distribution, preferably ones that still keep single-object tracking algorithms and updates as their building block. These approximations are generally easiest to update when all objects move independently and can be sensed independently - meaning that one object does not affect how well a sensor detects another object. However, motion and sensing have clear object dependencies in robotics: locations are dependent in that two objects cannot occupy the same space and will typically avoid collision, and objects that block the line of sight between a sensor and another object will prevent or alter the detection of the latter. The position of the sensor itself may be uncertain if the robot is moving in an unknown environment, or if its sensors are moveable or damaged. Finally, tracking is typically formulated with the assumption that errors in object detection and localization occur randomly and independently over time. When detection is being performed by complex but powerful approaches such as machine learning algorithms, errors are more likely to be consistent functions of the sensor and environment.

The inaccuracies of certain kinds of detector may be frequent or persistent enough to limit the benefits of tracking. In other words, tracking adheres to the ‘garbage in garbage out’ [1] principle. The opposite is also true: for highly accurate detectors on limited tasks, simple tracking techniques have been repeatedly shown to outperform more complex trackers built around weaker detectors [2, 3, 4]. Thus a common approach is to put the majority of research effort and implementation cost into high-quality detectors, despite the cost in sensor and computational resources.

The properties of standard vehicular object detectors have been considered in multi-object tracking. For instance, line-of-sight occlusion [5] and localization [6] can be accounted for with straightforward modifications to tracking algorithms. We address the final discrepancy between the tracking model and detector reality: the temporal persistence of detection errors. We show that failure to address this discrepancy may be the reason that complex trackers are outperformed by simple ones. We then derive altered forms of the multi-object tracker for missed detections and false detections separately. The resulting tracker is applied to vehicle tracking using rotational lidar data. It is shown to slightly improve on a simple tracker when a state-of-the-art detector is used, but to enable large improvements when a simpler but more lightweight detector is used. In particular, a classic computer vision method combined with corrected tracking achieves by far the highest reported performance on the Kitti dataset for real-time 3D detection or tracking without a graphics processing unit.

II Vehicular Object Detection

Three types of sensors are widely used in autonomous vehicle perception. Radar is often processed in two stages. The low-level stage performs frequency-based processing of received waves and returns points in space expected to be filled by an object. The high-level stage clusters points into distinct objects and false positives [7]. Machine learning for radar detection is less common in the literature, possibly because less unprocessed radar data is publicly available [8]. Lidar returns were traditionally handled as a set of horizontal line segments or curves, or as an unordered set of points in 3D space termed a point cloud. Recent research on lidar perception focuses instead on learned algorithms. Camera information is difficult to apply to most perception tasks with simple handmade rules, and learned algorithms have been utilized for decades. We focus on learned algorithms for the task of object detection.

An object detection algorithm takes the input from a sensor or sensors of a multi-object environment and outputs a set of objects. Depending on the environment and sensor, this task may be highly separable - that is, the sensory input that provides information about one object is largely irrelevant to the other objects. Additionally, highly similar objects are often unlikely to exist - for instance, solid 3D objects cannot exist in overlapping areas. These characteristics motivate the classification-suppression approach to object detection. A classification model is trained to provide a score indicating the likelihood that a specific region in the environment, termed an anchor, contains an object. During the classification step, this model is applied to many overlapping anchors. Then, in the non-maximum suppression step, only the highest-scoring of overlapping anchors is maintained as a detected object.
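The suppression step can be illustrated with a minimal sketch. It assumes axis-aligned 2D boxes and a greedy, score-ordered suppression rule; the box format, the function names, and the 0.3 overlap threshold are illustrative assumptions rather than details of any detector discussed here.

```python
from typing import List, Sequence, Tuple

# Assumed box format for this sketch: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def non_max_suppression(boxes: Sequence[Box],
                        scores: Sequence[float],
                        iou_threshold: float = 0.3) -> List[int]:
    """Return indices of kept anchors, highest score first.

    An anchor is kept only if it does not overlap an already-kept,
    higher-scoring anchor by more than iou_threshold (hypothetical value).
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept
```

With two heavily overlapping anchors and one distant anchor, only the higher-scoring of the overlapping pair and the distant anchor survive.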

False negative errors in object detection occur when the sensor does not provide information about an object, or when the detection algorithm fails to use sensor information to correctly classify an object. False negatives cannot be ‘fixed’ by a tracker, but they can be characterized so that tracked objects are handled correctly while undetected. Object detection algorithm failures are difficult to characterize, especially for complex black box algorithms. However, it is reasonable to assume that inaccuracies in the detector are consistent for highly similar inputs, for instance if the same object is viewed twice in identical surroundings. False positive errors are detections that do not correspond to an actual object. For vehicular applications, it is reasonable to assume that most false positives are stationary objects - or more directly, that most large moving objects should be tracked and detected. False positive detections therefore occur when the stationary features of a location are viewed from a certain perspective.

II-1 Boosted Tree Detectors

Pedestrian detection in images is one of the oldest applications of learned object detectors, with research still ongoing [9]. Until 2015, the most popular approaches were variants of the Viola Jones detector [10]. This method convolutionally generates features across the image, then trains a boosted classifier. Boosted classifiers are a set of simple models, each of which returns a classification score. The sum of each model’s score makes up the total model’s score. A few models can usually accurately classify most inputs, a property termed attentional cascade in computer vision [10]. Hence the model combines high precision power for difficult inputs with high speed for the average input, making it particularly suited to object detection. However, the model was not generally found powerful enough to classify on raw pixel data, so engineered multipixel features were generated first. This feature construction stage was typically the most time-consuming stage of the object detection algorithm, and variants such as [11] focused on improving this stage, including by using GPUs [12].
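The speed benefit of summing weak-model scores can be sketched as follows. This is a simplified running-sum cascade with per-stage rejection thresholds, which is an assumption made for illustration; it is not the exact cascade structure of [10], and the stage type and threshold values are hypothetical.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical stage type: (weak model, threshold below which we stop early).
Stage = Tuple[Callable[[Sequence[float]], float], float]


def cascade_score(x: Sequence[float],
                  stages: Sequence[Stage]) -> Tuple[float, int]:
    """Sum weak-model scores, stopping once the running total is too low.

    Returns the accumulated score and the number of weak models evaluated.
    Easy negatives exit after a few stages; only ambiguous inputs pay for
    the full ensemble.
    """
    total = 0.0
    for n_used, (weak_model, reject_below) in enumerate(stages, start=1):
        total += weak_model(x)
        if total < reject_below:
            return total, n_used  # early rejection
    return total, len(stages)
```

An input rejected by the first stage costs one weak-model evaluation, while an input that passes every threshold is scored by the full sum.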

II-2 Deep Convolutional Neural Network Detection

Deep convolutional neural networks (CNNs) have become the most popular method for image-based object detection and image or lidar-based traffic scene perception. They still follow the classification-suppression approach but have several clear advantages. Their deep structure allows for direct learning of low-level features, which have been shown to outperform similar handmade features. Additionally, pooling or strided convolution layers essentially aggregate information in local regions. As adjacent detections would be suppressed anyway, this decreases the amount of computation without a loss in accuracy. A final advantage of neural networks is their amenability to being trained and run on GPUs. The disadvantages of deep networks are a purported brittleness to structure and hyperparameter choice, and slow inference due to model complexity. For instance, lidar-based object detection requires specialized network implementations, such as sparsified convolutional operations, to operate more than 10 times a second on a GPU [13, 14].

II-3 Lidar Object Detection

While the term lidar applies to any laser-based ranging sensor, we focus on the rotating multi-laser lidars that are commonly used for wide-view or 360-view perception of traffic environments. These sensors return a set of 3D points that follow sparse vertical lines. In early research on autonomous driving, object detection from laser scanners was performed by clustering points into distinct objects and separately determining the identity of each object based on its points. Classifiers based on engineered features [15, 16] or even handmade models [17, 18] can identify vehicles or pedestrians in the immediate vicinity. However, as more ambitious driving capabilities were explored, the goal of perception shifted to include a variety of stationary and moving objects within a broader radius of the car. The first public benchmark of 3D object detection was established on the Kitti dataset [19] in 2017. To date, all of the methods on the public leaderboard are deep CNN object detectors save for [20]. The majority of high-performing methods are lidar-based or lidar-and-image-based CNNs utilizing GPUs. Exceptions include camera-only 3D object detectors [21] and neural networks designed for CPU applications [22].

III Tracking

This section presents the standard probabilistic formulation of multi-object tracking with a single object detector. We focus on the update step of tracking rather than the prediction/propagation step as this step is generally independent of the detector’s properties. Hence a single update step is considered, where a hypothesized distribution over the current set of present objects is updated with a set of detections.

III-A Single-object Tracking

Each tracked object is parameterized with a random state. This includes observable features such as the object’s location, shape, and class, as well as latent features such as the object’s motion. Each detection has a fixed set of values. In the case that a single object is detected, the distribution of the detection given the object’s state is assumed to be known. Given an object distribution and a detection, as well as the assumption that the detection corresponds to this object, the object’s distribution can be updated with a simple application of Bayes’ rule. In practice, the probability distributions assumed in tracking models are Gaussians, point mixtures, or mixtures of Gaussians. For instance, if the current state distribution is a Gaussian, the updated state distribution can be estimated as a Gaussian using a variant of the Kalman filter.
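For the Gaussian case, the Bayes update reduces to the familiar Kalman correction. The sketch below covers only the simplest setting, a scalar state observed directly with additive Gaussian noise; the function and argument names are illustrative assumptions, and a practical tracker would use the vector form with a measurement matrix.

```python
from typing import Tuple


def kalman_update(mean: float, var: float,
                  z: float, meas_var: float) -> Tuple[float, float]:
    """Posterior mean and variance of a scalar Gaussian state.

    Assumes the detection z equals the state plus zero-mean Gaussian
    noise of variance meas_var (direct observation of the state).
    """
    gain = var / (var + meas_var)          # Kalman gain
    new_mean = mean + gain * (z - mean)    # shift toward the detection
    new_var = (1.0 - gain) * var           # uncertainty always shrinks
    return new_mean, new_var
```

When the prior and the measurement are equally uncertain, the posterior mean lies halfway between them and the variance is halved.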

III-B Multi-object Tracking

Probabilistic multi-object tracking estimates probability distributions over sets of estimated object states. Random sets differ from vector-valued random variables in that they have no intrinsic ordering and their cardinality is also random. A distribution on a set can be written in terms of a cardinality distribution and a set of joint distributions (one for each cardinality).


Where all possible one-to-one mappings of indices are summed. Most useful multi-object distributions can be described directly without specifying the cardinality or joint distributions. For instance, the multi-Bernoulli distribution specifies a fixed set of potential objects, each of which has an independent state distribution and an independent probability of existing. The probability of existence of each object is also independent of the object’s state.
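One consequence of the multi-Bernoulli form is that the cardinality distribution never needs to be stored; it can be recovered from the existence probabilities. The sketch below computes it by repeated convolution, assuming only the independence of existence stated above. The function name is illustrative.

```python
from typing import List, Sequence


def cardinality_distribution(existence_probs: Sequence[float]) -> List[float]:
    """P(number of existing objects = n) for a multi-Bernoulli set.

    Each potential object exists independently with its own probability,
    so the count follows a Poisson-binomial distribution, built here one
    potential object at a time.
    """
    dist = [1.0]  # with no potential objects, the set is empty for sure
    for r in existence_probs:
        new = [0.0] * (len(dist) + 1)
        for n, p in enumerate(dist):
            new[n] += p * (1.0 - r)    # this object does not exist
            new[n + 1] += p * r        # this object exists
        dist = new
    return dist
```

For two potential objects that each exist with probability 0.5, the result is 0.25, 0.5, and 0.25 for zero, one, and two objects respectively.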


In this case, the mapping from to is not one-to-one, as some components in the distribution are not matched to realized objects. The null index is used for indices with no match. It is clear that a mixture of an arbitrarily high number of these distributions could form any multi-object probability distribution. This is the intuition behind multiple hypothesis tracking, and [23] derives the tracking update equations with such a mixture. An equivalent but more compact distribution is presented here. This distribution specifies potential objects, each with independent state distribution , and a joint distribution over the existence of objects . The existences of objects are therefore dependent on one another, but are still independent of any object’s state.


Finally, a Poisson process is frequently added to models. Its purpose is to specify regions in which an unknown number of objects may exist. In practice, these are objects that have recently entered the sensor view and have not yet been detected, or objects with too uncertain a position to be tracked accurately. Each Poisson process is described by a rate parameter and a single object state distribution .


III-B1 Standard Multi-object Measurement Model

A multi-object measurement model specifies the probability distribution of a set of detections given a current set of objects . The standard model assumes that each object generates at most one measurement. Additionally, the probability that an object generates a measurement is independent of other objects, and the values of a generated measurement are independent of objects that did not generate it. Finally, some measurements may be erroneous and not correspond to any object. These are termed false positives, and are assumed to be generated by a Poisson process. refers to the probability that an object generates some detection, while is the distribution over measurement given that it was generated by object . and specify the distribution over the set of false measurements.
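The generative side of the standard model can be sketched as a small simulator. It assumes, purely for illustration, 1D object positions, a constant detection probability, Gaussian measurement noise, and false positives spread uniformly over a surveillance interval; none of these specific choices or names come from the model above beyond its independence assumptions.

```python
import math
import random
from typing import List, Sequence, Tuple


def sample_poisson(rate: float, rng: random.Random) -> int:
    """Draw a Poisson-distributed count (Knuth's multiplication method)."""
    limit = math.exp(-rate)
    count = 0
    product = rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def simulate_detections(object_positions: Sequence[float],
                        p_detect: float,
                        noise_std: float,
                        false_rate: float,
                        area: Tuple[float, float],
                        rng: random.Random) -> List[Tuple[float, bool]]:
    """Return (measured position, is_genuine) pairs in random order.

    Each object independently yields at most one noisy measurement;
    false positives follow a Poisson process over the interval `area`.
    """
    detections: List[Tuple[float, bool]] = []
    for pos in object_positions:
        if rng.random() < p_detect:
            detections.append((rng.gauss(pos, noise_std), True))
    for _ in range(sample_poisson(false_rate, rng)):
        detections.append((rng.uniform(area[0], area[1]), False))
    rng.shuffle(detections)  # a detection set carries no ordering
    return detections
```

Note that every timestep in this simulator is independent of the last, which is exactly the property questioned in the remainder of the paper.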


Alternative measurement models consider detectors such that each object has many measurements [24] or handle special cases of interdependence between objects and measurements. The multi-object tracking update step calculates the posterior probability. Given the multiple hypothesis distribution of (4), the joint distribution of objects and detections is


This large expression can be modified to fit the form of the object distribution (4) by combining the assignment terms and into a single joint term . However, mapping true objects to joint terms allows for many mappings that would not occur in the joint distribution. For instance, mapping object 1 to 1,2 and object 2 to 1,3 is impossible, as two different objects are mapped to a single potential object . These constraints can be handled by introducing an indicator function that contains the assumed constraints on data association.


Additionally, the relevant expressions can be separated into likelihood constants and posterior distributions over each .


We collectively refer to these as the multi-object tracking update expressions. Note that they are similar to the measurement likelihood and object update expressions for single-object tracking. The multi-object joint distribution can then be written up to some normalizing constant.


This form is clearly equivalent to the multiple hypothesis distribution of (4). The posterior distribution has one potential object for every matching of prior object and measurement , as well as potential objects corresponding to prior objects that were not detected and to newly detected objects. Two common classes of tracker, multiple hypothesis trackers and PDA/multi-Bernoulli trackers, are specific cases of this distribution and update.

III-B2 Data Association and Multi-Sensor Tracking

As the distributions of each new potential object are simply updated using single-object trackers, the challenge for multiple object tracking is expressing or approximating the updated object existence distribution . There is no closed-form expression that does not grow in complexity exponentially with each update. However, a wide variety of approximations have been developed to achieve accurate association at high speeds [25, 26, 27, 28].
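To give a sense of the combinatorial growth, the sketch below counts the valid associations for a single update under the standard constraints (each object matched to at most one detection and vice versa). It is a counting illustration only, with a hypothetical function name; it does not implement any of the approximations cited above.

```python
from math import comb, factorial


def count_association_hypotheses(n_objects: int, n_detections: int) -> int:
    """Number of one-to-one partial matchings of objects to detections.

    For each match count k: choose k objects, choose k detections, and
    pair them in k! ways. Unmatched objects are missed detections and
    unmatched detections are new objects or false positives.
    """
    return sum(
        comb(n_objects, k) * comb(n_detections, k) * factorial(k)
        for k in range(min(n_objects, n_detections) + 1)
    )
```

Two objects and two detections admit 7 associations, while ten of each already admit more than two hundred million, before considering that hypotheses compound across updates.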

Tracking with multiple detectors can be performed in a sequential fashion by updating with one detector and using the updated distribution as the prior distribution to the next update. If each detector requires complementary information from other detectors to accurately distinguish between objects, or if their detections are correlated (as we are considering), then sequential multi-detector tracking will be inaccurate. Multiple detectors can also update simultaneously by considering joint associations of the tracked objects and all detectors’ detections. However, the data association step becomes substantially more challenging with an increasing number of simultaneously-updated sensors.

IV Tracking with Persistent Detector Failures

As discussed in section III-B1, the standard tracking model assumes that each object generates a single detection with a certain probability at each timestep, and that the probability of generating a detection is independent both between objects and across timesteps. It also assumes that false detections occur independently of present objects and other false detections, usually following a Poisson process.

Neither of these assumptions is realistic for frequent updates from complex object detectors. The reality is that detections at nearby timesteps are highly correlated. However, handling all potentially correlated detectors simultaneously would require a complex multi-detector data association step, and may not be accurately approximable in real time. This section introduces more efficient methods to incorporate failure persistence into the standard tracking model.

IV-A False Negative Persistence through Detectability

As argued in the introduction, the probability of successfully detecting a present object will be correlated across time because that probability is primarily determined by latent features of the object and environment. This suggests that the correlation could be corrected by an augmentation of the object state. Say each object’s state includes a binary feature “detectable”. This feature’s only impact is to alter the probability of detection as such:


The detectability, or the probability that an object is detectable, can then be stored and updated as part of each object’s state distribution. Specifically, the update expressions (9) and (13) are modified to include:


As the “detectable” feature is binary, and its actual causes may be too complex to model, it is reasonable to model its change over time as a discrete Markov chain. The original formulation of independent false negatives can be achieved by setting the Markov transition to stationary - that is, the detectability at each new timestep is the same regardless of the previous timestep.
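A minimal sketch of this idea for a single object follows. It assumes that an undetectable object is never detected and a detectable one is detected with a fixed probability, and it writes the two-state Markov chain as a relaxation toward its steady state; these modeling choices, the function names, and all numeric values used with it are illustrative assumptions, not the exact update expressions of this paper.

```python
from typing import Tuple


def predict_detectability(d: float, steady_state: float,
                          retention: float) -> float:
    """One step of a two-state Markov chain on the 'detectable' flag.

    retention = 0 forgets the past entirely (independent misses);
    retention near 1 makes detection failures persistent.
    """
    return steady_state + retention * (d - steady_state)


def update_on_miss(existence: float, detectability: float,
                   p_detect_if_detectable: float) -> Tuple[float, float]:
    """Bayes update of existence and detectability after a missed detection."""
    p_detect = detectability * p_detect_if_detectable
    evidence = 1.0 - existence * p_detect  # probability of observing a miss
    if evidence <= 0.0:
        raise ValueError("a miss is impossible under these parameters")
    new_existence = existence * (1.0 - p_detect) / evidence
    if p_detect < 1.0:
        new_detectability = (detectability * (1.0 - p_detect_if_detectable)
                             / (1.0 - p_detect))
    else:
        new_detectability = detectability
    return new_existence, new_detectability


def existence_after_misses(existence: float, detectability: float,
                           p_detect_if_detectable: float, steady_state: float,
                           retention: float, n_misses: int) -> float:
    """Existence probability after n consecutive missed detections."""
    for _ in range(n_misses):
        detectability = predict_detectability(detectability, steady_state,
                                              retention)
        existence, detectability = update_on_miss(
            existence, detectability, p_detect_if_detectable)
    return existence
```

With hypothetical values (initial existence 0.99, steady-state detectability 0.95, detection probability 0.99 when detectable), three consecutive misses leave a far higher existence probability when retention is 0.5 than when it is 0, because the first miss already explains the later ones.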

An example from the Kitti dataset is shown in Figure 1, in which a tracked vehicle that has been detected for some time is lost for three timesteps then detected again. This detection failure is likely due to the distance of the car as well as absorption of lidar scans by black objects. Assume that the object was tracked for some time and its probability of existence is (ignoring multi-hypothesis dependencies, survival, etc.). Were detection failures assumed to occur independently with a likelihood of , the probability of object existence by the final timestep would be around . Alternatively, if objects have a steady-state detectability of 0.95 with a transition half-life of a single timestep, the final probability of existence would be 0.875. A higher existence probability increases the likelihood that the detection in the fifth timestep is associated with the tracked object, rather than considered a previously undetected object. The detectability-augmented tracker also assigns a lower probability to detection failure for steadily detected objects - this means that unlikely detections caused by sudden object motion are more likely to be correctly associated.

Fig. 1: Six timesteps from scene 4 of the Kitti tracking dataset, showing a distant leading vehicle that is undetected for three timesteps. Lidar point returns are overlaid in blue, and VoxelJones detections are displayed as 3D box frames.

We are not aware of anyone who has incorporated detection correlation into a probabilistic tracker, despite the simplicity of doing so. Non-probabilistic trackers from the computer vision community often assume that objects may go undetected for a contiguous window of time, and initially create time segments of well-tracked objects called tracklets [29]. Tracklets that are likely to correspond to a single object are then combined, and the object’s position in undetected periods can be imputed. This approach to tracking does not inherently reason about the motion of currently-undetected objects, as may be necessary for vehicular applications.

IV-B Tracking Persistent False Positives

Intuitively, the only way to characterize correlated false positives is to maintain knowledge of previous false positives. This is equivalent to tracking the false positives as well as true objects. We consider each object to potentially belong to one of two classes, “genuine” or “false”. The multi-object tracking model (4) can be altered such that there are two types of Poisson-generated untracked objects.


This alteration will affect the update expression (14) concerning measurements associated with untracked objects.


Note that rather than tracking two new potential objects, one genuine and one false, the multi-object tracking update tracks a single object using a two-part mixture distribution. Equivalently, two separate tracked objects can be kept, one genuine and the other false. These objects have equivalent data association constraints with other objects, as well as the logical constraint that only one of the two exists. The two-object formulation therefore unnecessarily increases the computational complexity of data association, and so is avoided.

The fundamental benefit of tracking false objects is that the false detections at any time frame are less likely to be erroneously associated with true objects. A secondary benefit is the ability to better distinguish between true and false detections by leveraging information across time. The detector only accesses a single instant of information and thus cannot make these distinctions. For 3D object detectors on traffic scenes, we make two simple assumptions: false detections are unlikely to move and unlikely to persist when viewed from different angles. These characteristics are factors of the latent state of the object, rather than measurements. Thus they can be incorporated into the multi-object propagation step, specifically by modifying the existence probability with independent survival terms . These terms differ based on the genuity and latent state of the object, and thus impact the existence distribution for each object as well as the probability of genuity.


When the survival probability of false objects is lower, due to the object moving or being viewable from a different perspective, the overall probability of existence for this object will decrease but the relative probability of genuity will increase.
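This effect can be checked with a short sketch of the propagation step for one tracked object. It assumes the object's class is summarized by a single probability of genuity and that each class has its own survival probability for the current step; the function name and the numbers in the usage note are hypothetical.

```python
from typing import Tuple


def propagate_existence_and_genuity(existence: float, genuity: float,
                                    survive_genuine: float,
                                    survive_false: float) -> Tuple[float, float]:
    """Apply class-dependent survival to a two-class (genuine/false) track.

    Returns the new existence probability and the new probability that
    the object is genuine, given that it still exists.
    """
    survive = genuity * survive_genuine + (1.0 - genuity) * survive_false
    new_existence = existence * survive
    if survive > 0.0:
        new_genuity = genuity * survive_genuine / survive
    else:
        new_genuity = genuity  # object cannot survive; class is moot
    return new_existence, new_genuity
```

For example, with existence 0.8, genuity 0.5, certain survival for genuine objects, and survival 0.5 for false objects, existence drops to about 0.6 while genuity rises to about 0.67.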

Figure 2 shows a segment of the Kitti dataset in which the detector of Section V-B1 generates a false detection near the road. The cause seems to be a combination of objects such as a bicycle, signpost, and barrier poles. The detector’s confidence score is around 0.15, meaning each detection has a relative probability of 0.15 of being legitimate as opposed to false. The actual tracking result is a function of motion model and measurement model parameters, so we instead set a simple hypothetical example: the probability of false detections or undetected objects is 0.01 of the probability of each detection originating from an object in the same spot. If false detections are tracked, this one will be maintained until it disappears a few timesteps later (as the viewing perspective changes). If false detections are not tracked, this set of detections will be considered either a sequence of independent false detections or a detected object. The tracked object corresponding to these detections will have existence probability . Figure 3 shows another segment with a sequence of low-confidence detections, in this case correctly corresponding to a vehicle. In this case, the tracked object is quickly determined to be moving at over 6 m/s and its probability of genuity is increased as a result.

Fig. 2: Six timesteps from scene 0 of the Kitti tracking dataset, showing false vehicle detections to the side of the road. Detections are displayed as 3D box frames.
Fig. 3: Five timesteps from scene 9 of the Kitti tracking dataset, showing low-confidence detections of a moving vehicle. Detections are displayed as 3D box frames.

The concept of tracking false detections has been implicitly performed by some trackers. The vehicle tracker in [4] tracks all likely detections, but only reports tracked objects whose associated detections have high scores. The use of temporal information to distinguish between true and false detections has also been performed in a limited fashion: some visual trackers separate relevant objects from background environment based solely on motion [29]. However, as with persistent missed detections, these techniques were not to our knowledge previously discussed explicitly.

V Vehicle tracking implementation

A hypothesis-oriented multiple-hypothesis tracker is applied to vehicle tracking. Vehicles are considered to occupy 2D rectangles flat along the ground, with height considered unimportant for sensing or tracking. This is often termed a bird’s-eye-view (BEV) representation. Vehicles are parameterized by position, orientation, length and width, and speed in the direction of orientation. The sensing vehicle’s motion is assumed to be known accurately, so objects are tracked in absolute positions rather than positions relative to the vehicle. Turning is not explicitly modeled, but this model is capable of tracking turning vehicles with steady detections in standard traffic settings. Each detection provides position, orientation, and shape estimates in addition to a score corresponding to the expected probability that this is a genuine detection. While persistent false detections and missed detections are handled as discussed, errors in the detections’ estimated values are assumed to be independent across time, objects, and features. In other words, the detector error is assumed to be white noise. This assumption is not necessarily more realistic than that of independent failures, and could be addressed by state augmentation [30]. However, short-term errors in position or shape estimation are not considered as significant as detection failures.

A 2D grid of 3-meter square tiles is maintained by the tracker for several functions. The ground surface of each tile is estimated and used to convert between BEV, 3D, and image-space positions. Line-of-sight sensor occlusion is determined for each tile based on the lidar returns, and used to estimate occlusion probability for tracked objects. An occupancy grid [31] is used to model the expected rate and distribution of untracked vehicles entering the system ( and in section III-B). The occupancy grid is updated by local mixing at every timestep. Finally, the importance of performing detection on each tile at each time can be roughly estimated [32]. Some detectors can operate on a subset of tiles to limit computation.

The tracker is tested on the first 10 scenes from the Kitti tracking training dataset. This subset contains 281 unique vehicles with 8307 total annotations. The discussed tracker is denoted as PDFMHT (Persistent Detector Failures in Multiple Hypothesis Tracking) in the following results.

V-A Tracking on Strong Detector

The tracker is applied to detections from Point-RCNN [33], a deep CNN lidar-based vehicle detector. The same detections were utilized by the tracker in [4], so results from that method are included for comparison. Note that the public benchmark for Kitti only includes tracking in image space, and few methods have stated their performance in terms of BEV or 3D tracking. This is the only easily comparable work with published 3D tracking results; others such as [21] do not explain the details of their test data or performance metrics enough for reliable comparison, and lack public code for reproduction.

We report several common MOT performance metrics [9] in Table I, calculated as in the MOTChallenge benchmark [9]. Tracked objects are considered to correctly match true objects if the BEV overlap (intersection over union) between the two is greater than 0.5. Note that this is more lax than Kitti’s BEV detection criterion of 0.7 overlap, but better considers tracking quality by rewarding correctly-tracked objects even if their precise location or size is not estimated well. Additionally, object detection performance is reported in Table II, using the standard Kitti metric of average precision with a 0.7 BEV overlap criterion. This is purely an estimate of how valuable tracking is for improving detection, while the MOT metrics in Table I also consider whether objects are tracked consistently over time. As is standard with Kitti, the detection metric is calculated on 3 subsets of vehicles ranked by assumed difficulty of detection. The MOT metrics are computed on all annotated vehicles.

Method              | MOTA | # FN | # FP | # Switch | MT%
PointRCNN+PDFMHT    | 69.4 |  929 | 1607 |        7 | 78
  w/o Detectability | 68.2 |  920 | 1612 |      103 | 77
  w/o Genuity       | 30.9 |  523 | 5085 |      136 | 88
PointRCNN+[4]       | 65.9 | 1141 | 1673 |       22 | 72
TABLE I: PointRCNN+Tracker on Kitti, MOT Performance Metrics

0.7 IoU     | AP Easy | AP Moderate | AP Hard
no tracking | 97      | 94          | 93
PDFMHT      | 90      | 92          | 92
[4]         | 88      | 91          | 91

0.5 IoU     | AP Easy | AP Moderate | AP Hard
no tracking | 99.1    | 96.7        | 96.2
PDFMHT      | 99.2    | 98.9        | 98.9
[4]         | 98.4    | 97.8        | 97.7
TABLE II: PointRCNN+Tracker on Kitti, Detection Performance Metrics

Tracking is not seen to improve detection performance under the standard overlap criterion. We attribute this to the positional requirement being strict enough that synthesizing detections over time is not helpful. On the other hand, the looser overlap requirement shows that tracking improves detection even when the detector is already quite accurate by most standards. These small improvements in performance may be quite valuable for a reliable perception system. The proposed PDFMHT outperforms the simpler tracker of [4] under both overlap criteria. The tracking metrics show that handling persistent missed detections reduces label switches between objects, while handling persistent false detections drastically reduces false positives. Methods such as [4] use score-based cutoffs and so are also resistant to false positives, but PDFMHT still performs better by all metrics.

V-B Tracking on Lightweight Lidar Detector

V-B1 Proposed Detector

We propose a simple Viola-Jones-type object detector for vehicular lidar applications, which we refer to as VoxelJones. One of the challenges for classifying on lidar data is deriving useful features from sets of 3D points. Decision trees make binary splits rather than algebraic transformations, so features are not required to be numeric. We select a simple but intuitive and numerous set of binary features. The box of -3 to 3 meters lengthwise with respect to an object’s center, -2 to 2 meters widthwise, and 0.125 to 2.625 meters above the ground is discretized into 0.125-meter binary voxels (3D cubes). A voxel is positive if at least one lidar point falls within that region of space. Any box-shaped union of voxels may be used as a feature in the classifier. Put another way, each feature is the presence of a lidar detection within some box region of space. There are 30720 voxels within the considered region and over 800 million possible box features. Of course, these features are only calculated if and when used within the classifier.
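The voxel features can be sketched as follows. This is a minimal illustration rather than our implementation: it assumes a voxel edge of 0.125 m, the size consistent with the stated region and the count of 30720 voxels, and the names `voxelize` and `box_feature` are hypothetical. Points are assumed to be already expressed relative to the anchor center and the local ground height.

```python
import math

# Region around a candidate object center, in meters (from the text above).
X_RANGE = (-3.0, 3.0)      # lengthwise
Y_RANGE = (-2.0, 2.0)      # widthwise
Z_RANGE = (0.125, 2.625)   # height above ground
VOXEL = 0.125              # assumed voxel edge length in meters

# Number of voxels along each axis: (48, 32, 20), i.e. 30720 voxels in total.
GRID_SHAPE = tuple(
    round((hi - lo) / VOXEL) for lo, hi in (X_RANGE, Y_RANGE, Z_RANGE)
)


def voxelize(points):
    """Return the set of (i, j, k) voxel indices containing at least one point."""
    occupied = set()
    for x, y, z in points:
        idx = (
            math.floor((x - X_RANGE[0]) / VOXEL),
            math.floor((y - Y_RANGE[0]) / VOXEL),
            math.floor((z - Z_RANGE[0]) / VOXEL),
        )
        # Points outside the considered region are ignored.
        if all(0 <= i < n for i, n in zip(idx, GRID_SHAPE)):
            occupied.add(idx)
    return occupied


def box_feature(occupied, lo, hi):
    """True if any occupied voxel lies in the index box lo <= idx < hi."""
    return any(
        all(l <= i < h for i, l, h in zip(idx, lo, hi)) for idx in occupied
    )


points = [(0.3, -0.5, 1.0), (2.9, 1.9, 0.2), (5.0, 0.0, 1.0)]  # last is outside
occupied = voxelize(points)
print(GRID_SHAPE, len(occupied))
print(box_feature(occupied, lo=(24, 8, 4), hi=(32, 16, 12)))
```

Since each feature is evaluated only when a tree actually splits on it, a sparse set of occupied voxels is enough here and no dense grid has to be built.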

We use second-order gradient boosting to train weighted regression trees as in [34]. The decision trees are of depth 3 and so each contains 7 splits and 8 resulting values. A first classifier of 10 trees operates on anchors spaced every 0.5 meters and radians. The attentional cascade eliminates 99.8% of negative inputs, while erroneously eliminating 14% of positive inputs. Positively classified areas are split into anchors spaced at 0.125 meters and radians, and further classified with 20 more trees. Additional details such as training, preprocessing, and suppression are discussed in Appendix A.
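The two-stage cascade can be sketched as below. This is a simplified illustration under several assumptions: trees are stored in heap order (7 splits, 8 leaf values), a split goes left when its box contains a point, the second stage scores fine anchors with only the 20 additional trees, and a single score threshold of 0 is used for both stages. The toy demo uses 1D anchors and intervals in place of 3D anchors and boxes; all function names are hypothetical.

```python
def eval_tree(splits, leaves, feature):
    """Evaluate a complete depth-3 tree stored in heap order.

    `feature(box)` is True if a lidar point lies in `box`; True branches left.
    """
    node = 0
    for _ in range(3):
        node = 2 * node + (1 if feature(splits[node]) else 2)
    return leaves[node - 7]  # nodes 7..14 are the 8 leaves


def boosted_score(trees, feature):
    """Sum of tree outputs, as in gradient boosting."""
    return sum(eval_tree(splits, leaves, feature) for splits, leaves in trees)


def run_cascade(coarse_anchors, refine, feature_at, stage1, stage2, threshold=0.0):
    """Score coarse anchors first; only surviving ones are refined and rescored."""
    detections = []
    for anchor in coarse_anchors:
        if boosted_score(stage1, feature_at(anchor)) < threshold:
            continue  # eliminated by the first stage
        for fine in refine(anchor):
            score = boosted_score(stage2, feature_at(fine))
            if score >= threshold:
                detections.append((fine, score))
    return detections


# Toy 1D demo: one lidar point, one interval feature reused at every split.
points = [10.1]
box = (-0.3, 0.3)
toy_tree = ([box] * 7, [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -1.0])


def feature_at(anchor):
    def feature(b):
        return any(b[0] <= p - anchor < b[1] for p in points)
    return feature


def refine(anchor):
    return [anchor + 0.125 * k for k in range(-2, 2)]


found = run_cascade([9.5, 10.0, 10.5], refine, feature_at,
                    stage1=[toy_tree] * 10, stage2=[toy_tree] * 20)
print([a for a, _ in found])
```

The benefit of the cascade is that the finer, more numerous anchors are only evaluated in the small fraction of space that survives the first 10 trees.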

In addition to inference speed, decision trees and box features have the advantage of interpretability. Appendix B includes visual interpretations of the algorithm’s parameters and of example input.

V-B2 Detector Performance

The algorithm was trained with the Kitti object detection training dataset, leaving out images corresponding to the first 10 scenes of the Kitti tracking training dataset. Its performance on the Kitti object detection test benchmark is shown in Table III. It is compared to deep network detections from the Kitti public leaderboard. The metric is the same as used in Table II, with 0.7 overlap. Algorithm speed is also reported in terms of average frames processed per second (FPS), along with whether a GPU is used. For most methods, including ours, there is little fluctuation in speed across frames.

AP Easy AP Moderate AP Hard FPS GPU
VoxelJones 65.3 54.5 49.9 6 N
[35] 69.4 62.5 55.9 1 Y
[36] 85.8 76.9 68.5 4 Y
[13] 89.3 86.1 79.8 60 Y
[22] 15.4 10 N
TABLE III: Detector Performance on Kitti BEV Benchmark

Our detection method achieves similar performance to the early deep network [35], and superior performance to the only other public method that can be applied in real-time on a CPU [22]. However, state-of-the-art detectors are notably better. We next discuss how to track while handling detection errors.

V-B3 Tracker Performance

We report MOT metrics and detection metrics in Tables IV and V respectively.

Method              MOTA   # FN   # FP   # Switch   MT%
PDFMHT              53.3   2821   1057          4    34
w/o Detectability   51.5   2844   1099         89    33
w/o Genuity         18.2   1511   5210         71    64
2x Faster           51.0   2885   1175          7    33
TABLE IV: VoxelJones+Tracker Performance on Kitti, Tracking Metrics
0.7 IoU          AP Easy   AP Moderate   AP Hard
Just Detection        50            47        47
PDFMHT                42            40        29
2x Faster             39            38        27

0.5 IoU          AP Easy   AP Moderate   AP Hard
Just Detection        85            78        77
PDFMHT                85            78        77
2x Faster             81            74        73
TABLE V: VoxelJones+Tracker Performance on Kitti, Detection Metrics

The tracker is compared to alternate versions without false positive or false negative correlation. Additionally, in the variant labeled 2x Faster, the tracker is used to subselect a third of the viewed area at every timestep on which to apply the detector. This more than doubles the detector’s speed, with the sublinear improvement explained by costly preprocessing and the fact that the selected regions are less likely to be removed in the attentional cascade.

The tracker’s performance is greatly improved by adding persistent false object reasoning. This is reasonable, as the object detector used provides many false positives. Detectability slightly improves performance, as with the Point-RCNN detector. The detection performance of VoxelJones is not improved by tracking. However, it is worth noting that the tracker consistently maintained tracks (with very few label switches) without sacrificing detection performance. Additionally, the tracker can be used to increase the detector’s speed with only a minor performance penalty. The subselected VoxelJones+tracker pair runs at over 10 updates per second.

VI Conclusion

We present modifications to the standard probabilistic multi-object tracking formulation that more realistically capture some common properties of robotic object detectors. Objects that are persistently undetected for a short time can be handled by maintaining an estimate of object detectability. False detections that persist in a certain region can be handled by tracking potentially false objects and using a combination of detector confidence and latent features to distinguish between genuine and false tracked objects. This formulation was implemented in a 3D lidar-based vehicle tracker and tested on a public dataset. Tracking is shown to be valuable for further improving the detection performance of state-of-the-art deep networks, and for enabling real-time detection with a classic detection algorithm without a graphics processing unit.

Appendix A Object Detector Implementation

Most lidar points in a given timestep will correspond to the ground. Identifying the shape of the ground eliminates one dimension of uncertainty from object detection and also reduces the number of relevant lidar points. A common ground model assumes the whole environment is on a flat plane [37, 38, 39]. This model is inaccurate in many real locations, so we instead adopt a similar approach to [40] and partition the visible environment into three by three meter tiles. A ground plane is fit to each tile with RANSAC. Tiles without enough ground points to reliably determine a ground plane are assumed to match the nearest fitted ground plane. This procedure takes around 100 ms to find ground for all tiles. However, unlike tracked objects, a ground tile is certain once it has been measured. When perception is being performed in real time on a moving vehicle, only a small number of previously unseen tiles need to be fit at each timestep.
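A minimal sketch of the tiled ground fit follows. It assumes each plane is parameterized as z = a·x + b·y + c and that "nearest" means nearest in tile index. The iteration count, inlier tolerance, and minimum inlier count are illustrative values, not necessarily those of the implementation evaluated here, and the function names are hypothetical.

```python
import math
import random
from collections import defaultdict

TILE_SIZE = 3.0  # meters, as in the text


def fit_plane(p1, p2, p3):
    """Plane z = a*x + b*y + c through three points, or None if degenerate."""
    (x1, y1, z1), (x2, y2, z2), (x3, y3, z3) = p1, p2, p3
    det = (x2 - x1) * (y3 - y1) - (x3 - x1) * (y2 - y1)
    if abs(det) < 1e-9:  # points are (nearly) collinear in the xy-plane
        return None
    a = ((z2 - z1) * (y3 - y1) - (z3 - z1) * (y2 - y1)) / det
    b = ((x2 - x1) * (z3 - z1) - (x3 - x1) * (z2 - z1)) / det
    return a, b, z1 - a * x1 - b * y1


def ransac_plane(points, iterations=100, tolerance=0.1, min_inliers=10, seed=0):
    """Return the sampled plane with the most inliers, or None if unreliable."""
    if len(points) < 3:
        return None
    rng = random.Random(seed)
    best, best_count = None, 0
    for _ in range(iterations):
        plane = fit_plane(*rng.sample(points, 3))
        if plane is None:
            continue
        a, b, c = plane
        count = sum(abs(z - (a * x + b * y + c)) < tolerance for x, y, z in points)
        if count > best_count:
            best, best_count = plane, count
    return best if best_count >= min_inliers else None


def fit_ground_tiles(points):
    """Fit one plane per tile; unreliable tiles copy the nearest fitted tile."""
    tiles = defaultdict(list)
    for x, y, z in points:
        tiles[(math.floor(x / TILE_SIZE), math.floor(y / TILE_SIZE))].append((x, y, z))
    fitted = {tile: ransac_plane(pts) for tile, pts in tiles.items()}
    good = {tile: plane for tile, plane in fitted.items() if plane is not None}
    if not good:
        return {}
    for tile, plane in fitted.items():
        if plane is None:
            nearest = min(good, key=lambda g: (g[0] - tile[0]) ** 2 + (g[1] - tile[1]) ** 2)
            fitted[tile] = good[nearest]
    return fitted


def ground(x, y):
    return 0.05 * x - 0.02 * y - 1.7  # synthetic sloped ground for the demo


demo = [(0.4 * i + 0.2, 0.4 * j + 0.2, ground(0.4 * i + 0.2, 0.4 * j + 0.2))
        for i in range(7) for j in range(7)]             # ground points, tile (0, 0)
demo += [(1.0 + 0.1 * k, 1.5, -0.5) for k in range(6)]   # points on an object
demo += [(3.5, 1.0, ground(3.5, 1.0)), (4.0, 2.0, ground(4.0, 2.0))]  # sparse tile
planes = fit_ground_tiles(demo)
print(planes[(0, 0)], planes[(1, 0)] == planes[(0, 0)])
```

In a real-time setting the returned dictionary could be kept across timesteps so that only previously unseen tiles are fit.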

The visibility of the environment is stored in a grid of the same shape. Visibility is used to determine the likelihood of successful detection in a grid tile, and is the product of two factors: laser presence and occlusion. Visibility is considered directly proportional to the proportion of the tile in which lasers would fall (which is much lower for distant tiles than nearby tiles). Occlusion naturally extends to other parameters as well. Because lidar data is typically returned in order of the device’s rotation, the visibility grid can be constructed quickly.

Out of overlapping anchors that are classified positively, only the highest-scoring one is considered valid (suppression). Our detector is applied at many locations and orientations, meaning that many pairs of detections are checked for overlap. We store a table of which pairs of nearby anchors overlap. The position and angle resolution of anchors is considered sufficient to accurately characterize vehicle position and orientation, but an additional regression model would be needed to determine vehicle shape. We simply use the median vehicle dimensions of 4m length, 1.76m width, 1.7m height for each detection. A regression model to determine dimension and a simple fitting method based on lidar detections of the object boundary were tested, but neither improved the performance.
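Suppression with a precomputed overlap table can be sketched as follows. This assumes the common greedy variant, in which anchors are visited in order of decreasing score and a suppressed anchor does not suppress others. The table layout (a dictionary from anchor id to the set of overlapping anchor ids) and the function name are illustrative choices.

```python
def suppress(scores, overlaps):
    """Greedy suppression using a lookup table instead of geometric overlap tests.

    scores:   dict mapping anchor id -> classifier score (positives only)
    overlaps: dict mapping anchor id -> set of anchor ids that overlap it
    """
    kept = []
    removed = set()
    for anchor in sorted(scores, key=scores.get, reverse=True):
        if anchor in removed:
            continue  # already suppressed by a higher-scoring neighbor
        kept.append(anchor)
        removed.update(overlaps.get(anchor, ()))
    return kept


# Toy table: "a" overlaps "b", and "b" overlaps "c", but "a" and "c" are disjoint.
overlaps = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
scores = {"a": 0.9, "b": 0.8, "c": 0.7}
print(suppress(scores, overlaps))  # prints ['a', 'c']
```

Because anchors lie on a fixed grid of positions and angles, the table depends only on relative offsets and can be built once before any data is processed.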

Appendix B Lidar Object Detector Interpretability

This section covers two ways in which the VoxelJones object detector can be interpreted visually.

B-A Visualizing the model

As all features for the detector are simply point checks within 3D boxes, they can be visualized as 3D shapes. Figure 4 visualizes all feature boxes from the first tree of our trained model from three perspectives. For each colored box, a split to the left is taken if there is a lidar point within this box. The final scores for this tree are denoted as positive or negative. The first box (red) checks a broad section of where a car would be. Splits to the left check high regions, presumably to negatively score tall and large objects like buildings and trees. If no lidar detections lie in the first box, the next splits search for detections in nearby areas.

Fig. 4: Visualization of box splits for the first tree of the trained VoxelJones object detector.

B-B Visualizing inputs

The concept of feature importance has a clear mathematical formulation for boosted trees, based on the change in score directly caused by splits on a single feature [41]. In our case, the feature space is too broad for this metric to be easily interpretable, but a similar approach can be taken for an input set of lidar points. The points that lie inside a feature box are considered to have importance correlated to the score of that tree, divided by the number of similar points. Points or small sets of points that uniquely contributed to positive or negative classification can thus be located. Figure 5 shows an example of a correctly detected vehicle and a correctly rejected sign, with input point importances displayed by color. A single lidar point in the vehicle example hits the roof of the far side of the car, and is considered the most valuable single point. The points that were most significant to negative classification of the sign example lay on the thin pole, and curiously on some patches of ground.
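One way to read the point-importance rule above is sketched below. It is an interpretation, not our exact procedure: it assumes that only the boxes on the path actually taken through each tree are credited, and that the tree's output is shared equally among the points inside each such box. Trees use the same heap-ordered layout of 7 splits and 8 leaves, and the demo uses 1D intervals in place of 3D boxes.

```python
def point_importance(points, trees, in_box):
    """Distribute each tree's score over the points that triggered its splits."""
    importance = [0.0] * len(points)
    for splits, leaves in trees:
        node, hits = 0, []
        for _ in range(3):
            inside = [i for i, p in enumerate(points) if in_box(p, splits[node])]
            if inside:
                hits.append(inside)
            node = 2 * node + (1 if inside else 2)  # left if the box is occupied
        score = leaves[node - 7]
        for inside in hits:
            for i in inside:
                # A point that is alone in its box receives the full score.
                importance[i] += score / len(inside)
    return importance


def in_interval(p, box):
    return box[0] <= p < box[1]


# One toy tree; the path visits heap nodes 0, 1 and 3, then ends at leaf 1.
empty = (0.0, 0.0)
splits = [(0.0, 1.0), (4.0, 6.0), empty, (10.0, 11.0), empty, empty, empty]
leaves = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print(point_importance([0.1, 0.2, 5.0], [(splits, leaves)], in_interval))
# prints [0.5, 0.5, 1.0]
```

The isolated point at 5.0 receives twice the importance of either clustered point, matching the intuition that a single point can be the most valuable one.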

Fig. 5: Visualizations of object detector input importance. (a) point cloud and (b) image of a correctly detected vehicle, (c) point cloud and (d) image of a correctly ignored sign. Lighter point colors correspond to positive impact on likelihood of being an object, darker colors negative.


The authors would like to thank Qualcomm Research for supporting this work as part of the project “Robust and Efficient Multi-Object Tracking for Automotive Applications.”


  • [1] M. Quinion, “Garbage in garbage out,” 2005. [Online]. Available:
  • [2] E. Bochinski, V. Eiselein, and T. Sikora, “High-speed tracking-by-detection without using image information,” in 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS).   IEEE, 2017, pp. 1–6.
  • [3] G. Gündüz and T. Acarman, “A lightweight online multiple object vehicle tracking method,” in 2018 IEEE Intelligent Vehicles Symposium (IV).   IEEE, 2018, pp. 427–432.
  • [4] X. Weng and K. Kitani, “A Baseline for 3D Multi-Object Tracking,” arXiv:1907.03961, 2019. [Online]. Available:
  • [5] M. Motro and J. Ghosh, “Measurement-wise occlusion in multi-object tracking,” in 2018 21st International Conference on Information Fusion (FUSION).   IEEE, 2018, pp. 2384–2391.
  • [6] F. Moosmann and C. Stiller, “Joint self-localization and tracking of generic objects in 3d range data,” in 2013 IEEE International Conference on Robotics and Automation.   IEEE, 2013, pp. 1146–1152.
  • [7] A. Scheel and K. Dietmayer, “Tracking multiple vehicles using a variational radar model,” arXiv preprint arXiv:1711.03799, 2017.
  • [8] Y. Kang, H. Yin, and C. Berger, “Test your self-driving algorithm: An overview of publicly available driving datasets and virtual testing environments,” IEEE Transactions on Intelligent Vehicles, vol. 4, no. 2, pp. 171–185, 2019.
  • [9] A. Milan, L. Leal-Taixé, I. D. Reid, S. Roth, and K. Schindler, “MOT16: A benchmark for multi-object tracking,” CoRR, vol. abs/1603.00831, 2016. [Online]. Available:
  • [10] Y.-Q. Wang, “An analysis of the viola-jones face detection algorithm,” Image Processing On Line, vol. 4, pp. 128–148, 2014.
  • [11] P. Dollár, R. Appel, S. Belongie, and P. Perona, “Fast feature pyramids for object detection,” IEEE transactions on pattern analysis and machine intelligence, vol. 36, no. 8, pp. 1532–1545, 2014.
  • [12] A. D. Costea, R. Varga, and S. Nedevschi, “Fast boosting based detection using scale invariant multimodal multiresolution filtered features,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul. 2017, pp. 993–1002.
  • [13] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom, “PointPillars: Fast encoders for object detection from point clouds,” Dec. 2018.
  • [14] Y. Yan, Y. Mao, and B. Li, “SECOND: Sparsely embedded convolutional detection,” Sensors, vol. 18, no. 10, Oct. 2018.
  • [15] M. Kusenbach, M. Himmelsbach, and H. Wuensche, “A new geometric 3D LiDAR feature for model creation and classification of moving objects,” in 2016 IEEE Intelligent Vehicles Symposium (IV), Jun. 2016, pp. 272–278.
  • [16] J. Behley, V. Steinhage, and A. B. Cremers, “Laser-based segment classification using a mixture of bag-of-words,” in 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, Nov. 2013, pp. 4195–4200.
  • [17] A. Kampker, M. Sefati, A. S. A. Rachman, K. Kreisköther, and P. Campoy, “Towards multi-object detection and tracking in urban scenario under uncertainties.” in VEHITS, 2018, pp. 156–167.
  • [18] A. Asvadi, C. Premebida, P. Peixoto, and U. Nunes, “3D lidar-based static and moving obstacle detection in driving environments: An approach based on voxels and multi-region ground planes,” Rob. Auton. Syst., vol. 83, pp. 299–311, Sep. 2016.
  • [19] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The kitti dataset,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013.
  • [20] L. Plotkin, “Pydriver: Entwicklung eines frameworks für räumliche detektion und klassifikation von objekten in fahrzeugumgebung,” Bachelor’s thesis, Karlsruhe Institute of Technology, Karlsruhe, Germany, 2015.
  • [21] S. Scheidegger, J. Benjaminsson, E. Rosenberg, A. Krishnan, and K. Granström, “Mono-camera 3d multi-object tracking using deep learning detections and pmbm filtering,” in 2018 IEEE Intelligent Vehicles Symposium (IV).   IEEE, 2018, pp. 433–440.
  • [22] K. Minemura, H. Liau, A. Monrroy, and S. Kato, “Lmnet: Real-time multiclass object detection on cpu using 3d lidar,” in 2018 3rd Asia-Pacific Conference on Intelligent Robot Systems (ACIRS).   IEEE, 2018, pp. 28–34.
  • [23] Á. F. García-Fernández, J. L. Williams, K. Granstrom, and L. Svensson, “Poisson multi-bernoulli mixture filter: direct derivation and implementation,” IEEE Transactions on Aerospace and Electronic Systems, 2018.
  • [24] C. Adam, R. Schubert, and G. Wanielik, “Radar-based extended object tracking under clutter using generalized probabilistic data association,” in 16th International IEEE Conference on Intelligent Transportation Systems (ITSC 2013), Oct. 2013, pp. 1408–1415.
  • [25] M. Motro and J. Ghosh, “Scaling data association for hypothesis-oriented mht,” in 2019 22nd International Conference on Information Fusion (FUSION).   IEEE, 2019.
  • [26] P. Lenz, A. Geiger, and R. Urtasun, “FollowMe: Efficient online min-cost flow tracking with bounded memory and computation,” in Proceedings of the IEEE International Conference on Computer Vision., 2015, pp. 4364–4372.
  • [27] S. S. Blackman, “Multiple hypothesis tracking for multiple target tracking,” IEEE Aerospace and Electronic Systems Magazine, vol. 19, no. 1, pp. 5–18, 2004.
  • [28] K. Date and R. Nagi, “Tracking multiple maneuvering targets using integer programming and spline interpolation,” in 2018 21st International Conference on Information Fusion (FUSION).   IEEE, 2018, pp. 1293–1300.
  • [29] Z. Wu, N. I. Hristov, T. H. Kunz, and M. Betke, “Tracking-reconstruction or reconstruction-tracking? comparison of two multiple hypothesis tracking approaches to interpret 3d object motion from several camera views,” in 2009 Workshop on Motion and Video Computing (WMVC).   IEEE, 2009, pp. 1–8.
  • [30] B. Friedland, “Treatment of bias in recursive filtering,” IEEE Transactions on Automatic Control, vol. 14, no. 4, pp. 359–367, 1969.
  • [31] M. Schreier, V. Willert, and J. Adamy, “Compact representation of dynamic driving environments for adas by parametric free space and dynamic object maps,” IEEE Transactions on Intelligent Transportation Systems, vol. 17, no. 2, pp. 367–384, 2015.
  • [32] J. L. Williams, “Information theoretic sensor management,” Ph.D. dissertation, Massachusetts Institute of Technology, 2007.
  • [33] S. Shi, X. Wang, and H. Li, “Pointrcnn: 3d object proposal generation and detection from point cloud,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 770–779.
  • [34] T. Chen and C. Guestrin, “Xgboost: A scalable tree boosting system,” in Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining.   ACM, 2016, pp. 785–794.
  • [35] B. Li, “3d fully convolutional network for vehicle detection in point cloud,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2017, pp. 1513–1518.
  • [36] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, “Multi-view 3d object detection network for autonomous driving,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1907–1915.
  • [37] Apollo, “3d obstacle perception.” [Online]. Available:
  • [38] M. Dimitrievski, P. Veelaert, and W. Philips, “Semantically aware multilateral filter for depth upsampling in automotive LiDAR point clouds,” in 2017 IEEE Intelligent Vehicles Symposium (IV), Jun. 2017, pp. 1058–1063.
  • [39] J. Ku, M. Mozifian, J. Lee, A. Harakeh, and S. Waslander, “Joint 3D proposal generation and object detection from view aggregation,” Dec. 2017.
  • [40] D. Zermas, I. Izzat, and N. Papanikolopoulos, “Fast segmentation of 3D point clouds: A paradigm on LiDAR data for autonomous vehicle applications,” in 2017 IEEE International Conference on Robotics and Automation (ICRA), May 2017, pp. 5067–5073.
  • [41] J. Friedman, T. Hastie, and R. Tibshirani, The elements of statistical learning.   Springer series in statistics New York, 2001, vol. 1, no. 10.