## I Introduction

3D multi-object tracking is essential for autonomous driving. It estimates the location, orientation, and scale of all the traffic participants in the environment over time. By taking temporal information into account, a tracking module can filter outliers from frame-based object detection and be more robust to partial or full occlusions. The resulting trajectories may then be used to infer motion patterns and driving behaviours of each traffic participant to improve motion forecasting. This enables safe decision-making in autonomous driving.

Current state-of-the-art methods in 3D multi-object tracking [yin2020center, chiu2020probabilistic] follow the tracking-by-detection paradigm. These methods first use a 3D object detector to estimate the bounding box location and orientation of each object in each frame. Then they use either the center distance or the Mahalanobis distance [mahalanobis1936distance] as the data association metric between detections and existing tracks. However, these metrics only evaluate the distance between objects and the differences of bounding box size and orientation, while ignoring each object’s geometric and appearance features. Therefore, data association performance highly depends on the accuracy of motion prediction. For objects that are hard to predict precisely, e.g. pedestrians, motorcycles, or cars making sharp turns, the Euclidean distance between the prediction and the correct detection can be high, so they may not be matched correctly. Other works [liang2020pnpnet, weng2020gnn3dmot] attempt to improve data association by learning an association metric from the track trajectories and the detection features. However, these methods are still unable to outperform the aforementioned simple method based on center distances [yin2020center]. These results indicate that building a neural network for effective data association is challenging.

We propose to learn how to weigh the Mahalanobis distance [mahalanobis1936distance] against a distance based on geometric and appearance features when comparing a track and a detection for data association. These features are extracted from 3D LiDAR point clouds and 2D camera images. Different from [liang2020pnpnet, weng2020gnn3dmot], we use the learned metric within a standard Kalman Filter [kalman1960filter], which has proven effective for multi-object tracking [chiu2020probabilistic]. Additionally, a Kalman Filter offers interpretability and explicit uncertainty estimates that can be used for downstream decision-making.

In addition to data association, track life cycle management is another important component of online tracking systems. It determines when to initialize and terminate each track, a decision that significantly affects the number of false positives and identity switches. However, track life cycle management has not attracted much attention from the research community. Prior works either initialize a new track for every unmatched detection [yin2020center], or create temporary tracks and convert them into full tracks after enough consecutive matches [weng2019ab3dmot, chiu2020probabilistic, shenoi2020jrmot, weng2020gnn3dmot, liang2020pnpnet].

We propose to learn whether to initialize a new track from an unmatched detection based on its geometric and appearance features. This approach helps our tracking method avoid initializing new tracks for potential false-positives.

In summary, we propose a probabilistic, multi-modal, multi-object tracking system consisting of different trainable modules (Distance Combination, Track Initialization and Feature Fusion) to provide robust and data-driven tracking results. We evaluate our approach on the NuScenes [caesar2019nuscenes] dataset. When using only 3D LiDAR point clouds as input, our method already outperforms the current state-of-the-art method [yin2020center]. By effectively fusing 2D and 3D input, we can increase performance gains even further. In addition, qualitative results reveal a significant decrease in the number of false positive tracks which is important for decision-making.

## II Related Work

### II-A 3D Object Detection

Most 3D multi-object tracking systems [yin2020center, chiu2020probabilistic, weng2019ab3dmot, liang2020pnpnet, shenoi2020jrmot, weng2020gnn3dmot, zhou2020tracking] perform tracking on the detection bounding boxes provided by 3D object detectors. Therefore, the choice of 3D object detector is important for the overall performance of each tracking system. 3D object detection can be applied to camera images [chen2016monocular, brazil2019m3d], LiDAR point clouds [zhu2019megvii, zhou2018voxelnet, yan2018second, yan2018pixor, lang2019pointpillar, shi2019pointrcnn], or their combination [liang2019multi, vora2020pointpainting, qi2018frustum]. Monocular 3D object detection models are unlikely to be on par with models that utilize LiDAR or depth information. Therefore, 3D multi-object tracking algorithms [zhou2020tracking, hu2019joint] relying on monocular 3D object detectors are usually unable to outperform tracking methods relying on LiDAR- or depth-based object detectors.

In our proposed tracking system, we use the CenterPoint 3D object detector [yin2020center]. It quantizes LiDAR point clouds and generates the feature map using PointNet [qi2019pointnet, qi2017pointnetplusplus]. The feature map is then fed to a key point detector for locating centers of objects and regressing the size and orientation of the bounding boxes. This detector is one of the top performers in the NuScenes Detection Challenge [caesar2019nuscenes].

### II-B 3D Multi-Object Tracking

Most 3D multi-object tracking algorithms adopt the tracking-by-detection framework: they take 3D object detection results as input to the tracking method. In the data association step, different distance metrics are used to find the matched track-detection pairs. For example, AB3DMOT [weng2019ab3dmot] uses the 3D intersection-over-union (3D IOU) as an extension of the 2D IOU used in 2D tracking algorithms [bewley2016simple]. ProbabilisticTracking [chiu2020probabilistic] uses the Mahalanobis distance, which takes the uncertainty of the tracked state into account. CenterPoint [yin2020center] uses the object center distance as the data association metric. Their tracking algorithm achieves competitive results by proposing a better 3D object detector than the one used in AB3DMOT [weng2019ab3dmot] and ProbabilisticTracking [chiu2020probabilistic]. CenterPoint [yin2020center] is currently one of the leading methods in the NuScenes Tracking Challenge [caesar2019nuscenes].

Several other 3D tracking methods proposed to combine the tracker trajectory with object geometric and appearance features. GNN3DMOT [weng2020gnn3dmot] uses a graph neural network and 2D-3D multi-feature learning for data association. PnPNet [liang2020pnpnet] presents an end-to-end trainable model to jointly solve detection, tracking, and prediction tasks. However, these tracking methods are unable to outperform the aforementioned much simpler CenterPoint [yin2020center] tracking algorithm on the NuScenes [caesar2019nuscenes] dataset.

Different from related work, we propose a probabilistic multi-modal multi-object tracking system that learns to combine the Mahalanobis distance [mahalanobis1936distance] and a deep feature distance for data association. We also propose a data-driven approach for track life cycle management.

## III Method

In this section, we introduce our proposed online 3D multi-object tracking method. A flowchart of our algorithm is shown in Figure 1. Building upon ProbabilisticTracking [chiu2020probabilistic] as the baseline, our algorithm takes both LiDAR point clouds and camera images as input and conducts object tracking through Kalman Filters. Our tracking algorithm features three trainable components that robustify data association and track life cycle management: the Feature Fusion Module merges the LiDAR and image features to generate the fused deep features; the Distance Combination Module learns to combine the deep feature distance with the Mahalanobis distance as the final metric for data association; and the Track Initialization Module learns to decide whether to initialize a new track for each unmatched detection based on the fused 2D and 3D deep features. In the following sections, we describe each of the core components of our proposed tracking model.

### III-A Kalman Filters

We build upon prior work on ProbabilisticTracking [chiu2020probabilistic] and use Kalman Filters [kalman1960filter] for object state estimation. Each object’s state is represented by 11 variables:

$$\mathbf{s} = (x,\, y,\, z,\, a,\, l,\, w,\, h,\, d_x,\, d_y,\, d_z,\, d_a) \tag{1}$$

where $(x, y, z)$ is the center position of the object’s 3D bounding box, $a$ is the angle between the object’s facing direction and the x-axis, $(l, w, h)$ represent the length, width, and height of the bounding box, and $(d_x, d_y, d_z, d_a)$ represent the difference of $(x, y, z, a)$ between the current frame and the previous frame.

We model the dynamics of the moving objects using a linear motion model, and assume constant linear and angular velocity as well as constant object dimensions, i.e. they do not change during the prediction step. Following the standard Kalman Filter formulation, we define the prediction step as:

$$\hat{\boldsymbol{\mu}}_{t+1} = \mathbf{A}\,\boldsymbol{\mu}_t \tag{2}$$

$$\hat{\boldsymbol{\Sigma}}_{t+1} = \mathbf{A}\,\boldsymbol{\Sigma}_t\,\mathbf{A}^\top + \mathbf{Q} \tag{3}$$

where $\boldsymbol{\mu}_t$ is the estimated mean of the true state at time $t$, and $\hat{\boldsymbol{\mu}}_{t+1}$ is the predicted state mean at time $t+1$. The matrix $\mathbf{A}$ is the state transition matrix of the process model. The matrix $\boldsymbol{\Sigma}_t$ is the state covariance at time $t$, and $\hat{\boldsymbol{\Sigma}}_{t+1}$ is the predicted state covariance at time $t+1$. The matrix $\mathbf{Q}$ is the process model noise covariance.
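The prediction step can be sketched in a few lines of NumPy. The sparsity pattern of $\mathbf{A}$ below is an illustrative constant-velocity choice consistent with the state layout of Equation 1, not necessarily the paper's exact matrix:

```python
import numpy as np

STATE_DIM = 11  # (x, y, z, a, l, w, h, dx, dy, dz, da), see Equation 1

def make_transition_matrix():
    """Constant-velocity process model A: position and heading are advanced
    by their per-frame differences; box dimensions and velocities stay fixed."""
    A = np.eye(STATE_DIM)
    A[0, 7] = 1.0   # x += dx
    A[1, 8] = 1.0   # y += dy
    A[2, 9] = 1.0   # z += dz
    A[3, 10] = 1.0  # a += da
    return A

def kf_predict(mu, sigma, A, Q):
    """Kalman prediction step (Equations 2 and 3)."""
    mu_pred = A @ mu
    sigma_pred = A @ sigma @ A.T + Q
    return mu_pred, sigma_pred
```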

We use CenterPoint [yin2020center]’s 3D object detector to provide the observations to our Kalman Filter. The per frame 3D object detection results consist of a set of 3D bounding boxes, with each box being represented by 9 variables:

$$\mathbf{o} = (x,\, y,\, z,\, a,\, l,\, w,\, h,\, d_x,\, d_y) \tag{4}$$

where $(x, y, z)$, $a$, and $(l, w, h)$ are the bounding box’s center position, orientation, and scale, similar to the definitions in Equation 1. The remaining two variables $(d_x, d_y)$ represent the difference of $(x, y)$ between the current frame and the previous frame. These two values can be derived by multiplying the detector’s estimated center velocity by the time duration between two consecutive frames. We use a linear observation model $\mathbf{H}$ with additive Gaussian noise that has zero mean and noise covariance $\mathbf{R}$. Using this observation model and the predicted object state $\hat{\boldsymbol{\mu}}_{t+1}$, we can predict the next measurement $\hat{\mathbf{o}}_{t+1}$ and the innovation covariance $\mathbf{S}_{t+1}$ that represents the uncertainty of the predicted object detection:

$$\hat{\mathbf{o}}_{t+1} = \mathbf{H}\,\hat{\boldsymbol{\mu}}_{t+1} \tag{5}$$

$$\mathbf{S}_{t+1} = \mathbf{H}\,\hat{\boldsymbol{\Sigma}}_{t+1}\,\mathbf{H}^\top + \mathbf{R} \tag{6}$$

The noise covariance matrices Q and R of the process model and the observation model are estimated from the statistics of the training set data, as proposed in [chiu2020probabilistic]. We refer readers to [chiu2020probabilistic] for more details.
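Assuming the detector observes the first nine state variables (one plausible choice of $\mathbf{H}$ given the state and observation layouts above; the paper does not spell out its exact form), the measurement prediction of Equations 5 and 6 looks like:

```python
import numpy as np

OBS_DIM, STATE_DIM = 9, 11

def make_observation_matrix():
    """Linear observation model H: the detector observes the first 9 state
    variables (x, y, z, a, l, w, h, dx, dy); dz and da are unobserved."""
    return np.eye(OBS_DIM, STATE_DIM)

def predict_measurement(mu_pred, sigma_pred, H, R):
    """Predicted detection and innovation covariance (Equations 5 and 6)."""
    o_hat = H @ mu_pred
    S = H @ sigma_pred @ H.T + R
    return o_hat, S
```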

### III-B Fusion of 2D and 3D features

This module is designed to fuse the features from 2D camera images and 3D LiDAR point clouds for each detection in the key frames. The fused features are used as input to the Distance Combination Module and the Track Initialization Module. For each detection, we first map its 2D position from the world coordinate system to the corresponding location $(x', y')$ in the 3D object detector’s intermediate feature map coordinate system. From this intermediate feature map, we extract a LiDAR point cloud feature. Instead of only extracting the single feature vector located at $(x', y')$, we extract all the feature vectors inside a region centered at $(x', y')$ in order to utilize more of the context information provided by the object detector. We then project the 3D detection bounding box onto the camera image plane and extract the corresponding 2D image feature from a COCO [lin2014microsoft] pre-trained Mask R-CNN [he2017mask]. For each projected 2D bounding box, we extract a feature vector from its RoIAlign features and concatenate it with a one-hot vector indicating to which camera plane (out of 6 in the sensor sweep) the object projects. Finally, we combine the two feature vectors through a multi-layer perceptron (MLP) and a reshape operation:

$$\mathbf{F} = g\big(\mathbf{F}_{\text{LiDAR}},\, \mathbf{F}_{\text{image}}\big) \tag{7}$$

where $\mathbf{F}$ is the fused feature of the detections, and $g$ denotes the MLP and the reshape operation depicted in Figure 1(b).
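A minimal sketch of this fusion step, with randomly initialized weights standing in for the learned MLP parameters and a hypothetical hidden size of 64 (both are illustrative assumptions, not the paper's configuration):

```python
import numpy as np

def fuse_features(f_lidar, f_image, cam_index, n_cams=6, rng=None):
    """Sketch of the Feature Fusion Module (Equation 7): concatenate the
    flattened LiDAR region feature, the image RoIAlign feature, and a one-hot
    camera indicator, then apply a small MLP layer with ReLU."""
    rng = rng or np.random.default_rng(0)
    one_hot = np.zeros(n_cams)
    one_hot[cam_index] = 1.0  # which of the 6 camera planes the box projects to
    x = np.concatenate([f_lidar.ravel(), f_image, one_hot])
    W = rng.standard_normal((64, x.size)) * 0.1  # placeholder for learned weights
    return np.maximum(0.0, W @ x)                # one ReLU MLP layer
```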

### III-C Distance Combination Module

This module provides a learned distance metric for data association between a set of tracks and new detections. The metric combines information from state estimates as well as appearance and geometry features. Specifically, we design a linear combination of the Mahalanobis and deep feature distance:

$$\mathbf{D} = \mathbf{A}_1 \odot \mathbf{D}_M + \mathbf{A}_2 \odot \mathbf{D}_{\text{feat}} \tag{8}$$

where $\mathbf{D}_M$ denotes the Mahalanobis distance matrix, in which each element contains the distance between a detection and a track’s predicted state; $\mathbf{D}_{\text{feat}}$ denotes the feature distance matrix, whose elements measure the feature dissimilarity between each detection and each track; and $\mathbf{A}_1$ and $\mathbf{A}_2$ are coefficient matrices. Each element of $\mathbf{D}_M$ is computed by:

$$\mathbf{D}_M(i, j) = \sqrt{\big(\mathbf{o}_i - \mathbf{H}\hat{\boldsymbol{\mu}}_j\big)^\top \mathbf{S}_j^{-1} \big(\mathbf{o}_i - \mathbf{H}\hat{\boldsymbol{\mu}}_j\big)} \tag{9}$$

where $\mathbf{o}_i$ is the $i$-th detection, defined in Equation 4, $\mathbf{H}$ is the linear observation model, $\hat{\boldsymbol{\mu}}_j$ is the $j$-th track’s predicted state mean, and $\mathbf{S}_j$ is its innovation covariance matrix as defined in Equation 6.
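The pairwise Mahalanobis distance of Equation 9 can be computed directly (here with a single innovation covariance shared across tracks for brevity; the full method keeps one per track):

```python
import numpy as np

def mahalanobis_matrix(detections, track_means, H, S):
    """Pairwise Mahalanobis distances (Equation 9) between M detections and
    N predicted tracks. detections: (M, obs_dim); track_means: (N, state_dim);
    H: observation model; S: (obs_dim, obs_dim) innovation covariance."""
    S_inv = np.linalg.inv(S)
    preds = track_means @ H.T                  # (N, obs_dim) predicted detections
    D = np.empty((len(detections), len(preds)))
    for i, o in enumerate(detections):
        diff = o - preds                       # (N, obs_dim) residuals
        D[i] = np.sqrt(np.einsum('nj,jk,nk->n', diff, S_inv, diff))
    return D
```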

We employ a neural network to estimate the deep feature distance $\mathbf{D}_{\text{feat}}$ and the coefficient matrices $\mathbf{A}_1$ and $\mathbf{A}_2$, given the fused features of the detections and tracks.

#### III-C1 Deep Feature Distance

The network learns a distance map $\mathbf{D}_{\text{feat}}$ from the fused features of the detections $\mathbf{F}_{\text{det}}$ and tracks $\mathbf{F}_{\text{track}}$:

$$\mathbf{D}_{\text{feat}} = h_{\text{feat}}\big(\mathbf{F}_{\text{det}},\, \mathbf{F}_{\text{track}}\big) \tag{10}$$

where $h_{\text{feat}}$ denotes the convolutional operators in Figure 1(c). We supervise the feature distance learning by treating it as a binary classification problem: if a track and a detection match to the same ground-truth object, that track-detection pair is treated as a positive training sample; otherwise it is treated as a negative one. We train the network with the binary cross-entropy loss:

$$L_{\text{feat}} = -\sum_{i,j} \Big[ K_{ij} \log \hat{p}_{ij} + \big(1 - K_{ij}\big) \log\big(1 - \hat{p}_{ij}\big) \Big] \tag{11}$$

where $K$ is the matching indicator matrix, in which $K_{ij} = 1$ indicates a matched feature pair coming from the same object and $K_{ij} = 0$ indicates an unmatched feature pair, and $\hat{p}_{ij}$ is the match probability derived from $\mathbf{D}_{\text{feat}}(i, j)$. Since there is no ground-truth annotation for each track-detection pair, we treat a pair as matched if the tracking box in the previous frame and the detection box in the current frame match to the same nearby ground-truth object. Otherwise, we treat the pair as unmatched.
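This supervision can be sketched as follows. Mapping distances to match probabilities with a sigmoid on the negated distance is one plausible choice (so that a small distance yields a probability near 1); the paper's exact head may differ:

```python
import numpy as np

def feature_distance_bce(D_feat, K, eps=1e-7):
    """Binary cross-entropy supervision of the feature distance map
    (Equation 11). K is the matching indicator matrix: K[i, j] = 1 iff
    detection i and track j come from the same ground-truth object."""
    p = 1.0 / (1.0 + np.exp(D_feat))   # sigma(-D): small distance -> p near 1
    p = np.clip(p, eps, 1.0 - eps)     # numerical safety for the logs
    return -np.mean(K * np.log(p) + (1 - K) * np.log(1 - p))
```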

#### III-C2 Combining Coefficients

We learn the coefficient matrices $\mathbf{A}_1$ and $\mathbf{A}_2$ so that they can adjust the final distance $\mathbf{D}$ based on how important each deep feature distance is:

$$(\mathbf{A}_1,\, \mathbf{A}_2) = h_{\text{coef}}\big(\mathbf{F}_{\text{det}},\, \mathbf{F}_{\text{track}}\big) \tag{12}$$

where $h_{\text{coef}}$ denotes the convolutional operators in Figure 1(c).

Inspired by PnPNet [liang2020pnpnet], we train this module with a combination of max-margin and contrastive losses. For a pair of a positive sample $p$ and a negative sample $n$, we define its max-margin loss as follows:

$$L_{\text{margin}}(p, n) = \max\big(0,\; m + \mathbf{D}_p - \mathbf{D}_n\big) \tag{13}$$

where $m$ is a constant margin, $\mathbf{D}_p$ is the combined distance of the positive sample, and $\mathbf{D}_n$ is the combined distance of the negative sample, as found in the distance matrix $\mathbf{D}$ in Equation 8. The overall contrastive loss is given as follows:

$$L_{\text{contrast}} = \sum_{p \in \text{Pos}} \sum_{n \in \text{Neg}} L_{\text{margin}}(p, n) \tag{14}$$

where Pos denotes the set of positive track-detection pairs and Neg denotes the set of negative track-detection pairs. This loss function design encourages the neural network to generate a distance for every positive track-detection sample that is smaller than the distance of any negative sample, by adjusting the elements of $\mathbf{A}_1$ and $\mathbf{A}_2$. To also use the learned combined distance to reject unmatched outliers at inference time, we define two other max-margin losses for the positive sample set and the negative sample set as follows:

$$L_{\text{pos}} = \sum_{p \in \text{Pos}} \max\big(0,\; \mathbf{D}_p - T + m_2\big) \tag{15}$$

$$L_{\text{neg}} = \sum_{n \in \text{Neg}} \max\big(0,\; T + m_3 - \mathbf{D}_n\big) \tag{16}$$

where $m_2$ and $m_3$ denote constant margins and $T$ is the constant threshold used to reject unmatched outliers at inference time. This loss function design encourages the neural network to generate a distance smaller than the threshold $T$ for any positive sample, and a distance larger than $T$ for any negative sample.
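The three margin terms can be sketched together; the margin and threshold values below are illustrative placeholders, not the paper's settings:

```python
import numpy as np

def combined_distance_losses(D_pos, D_neg, m=2.0, m2=1.0, m3=1.0, T=5.0):
    """Max-margin losses on combined distances (Equations 13-16).
    D_pos / D_neg: 1-D arrays of combined distances for matched / unmatched
    track-detection pairs."""
    # Contrastive term: every positive should beat every negative by margin m.
    pairwise = np.maximum(0.0, m + D_pos[:, None] - D_neg[None, :])
    l_contrast = pairwise.sum()
    # Positives should fall below the inference threshold T by margin m2 ...
    l_pos = np.maximum(0.0, D_pos - T + m2).sum()
    # ... and negatives should lie above T by margin m3.
    l_neg = np.maximum(0.0, T + m3 - D_neg).sum()
    return l_contrast + l_pos + l_neg
```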

The overall training loss of this neural network is defined as follows:

$$L = L_{\text{contrast}} + L_{\text{pos}} + L_{\text{neg}} \tag{17}$$

At test time, once we calculate the combined distance, we conduct data association using the greedy matching algorithm from ProbabilisticTracking [chiu2020probabilistic].
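A minimal sketch of this greedy association, following the description in [chiu2020probabilistic]: repeatedly take the globally smallest remaining distance, rejecting any pair at or above the threshold:

```python
import numpy as np

def greedy_match(D, threshold):
    """Greedy data association on a combined distance matrix D with shape
    (num_detections, num_tracks). Returns a list of (det_idx, track_idx)."""
    if D.size == 0:
        return []
    D = D.astype(float).copy()
    matches = []
    while True:
        i, j = np.unravel_index(np.argmin(D), D.shape)
        if D[i, j] >= threshold:   # remaining pairs are all outliers
            break
        matches.append((int(i), int(j)))
        D[i, :] = np.inf           # each detection matched at most once
        D[:, j] = np.inf           # each track matched at most once
    return matches
```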

In our implementation, we choose the threshold $T$ to be the same value used in [chiu2020probabilistic], set the margin $m$ to roughly half of $T$, and set the margins $m_2$ and $m_3$ to half of $m$.

### III-D Track Initialization Module

Track life cycle management is another important component of multi-object tracking systems. Most prior works either always initialize a new track for every unmatched detection [yin2020center], or create a temporary track and then wait for a constant number of consecutive matches before converting the temporary track to a full track [chiu2020probabilistic, weng2019ab3dmot, shenoi2020jrmot]. AB3DMOT [weng2019ab3dmot] also shows that the choice of such a constant threshold value affects the overall tracking accuracy.

Different from these prior heuristic approaches, we treat the track initialization task as a simple binary classification problem and solve it with a data-driven approach. We propose the Track Initialization Module, which takes the fused feature $\mathbf{F}_i$ of each unmatched detection as input and generates an output confidence score $P_i$ on whether we should initialize a new track:

$$P_i = h_{\text{init}}\big(\mathbf{F}_i\big) \tag{18}$$

where $h_{\text{init}}$ denotes the convolutional operators depicted in Figure 1(d). We train $h_{\text{init}}$ as a binary classifier using the cross-entropy loss:

$$L_{\text{init}} = -\sum_i \Big[ y_i \log P_i + \big(1 - y_i\big) \log\big(1 - P_i\big) \Big] \tag{19}$$

where $y_i = 1$ if there is a ground-truth object close to detection $i$, and $y_i = 0$ otherwise. At inference time, we initialize a new track for an unmatched detection $i$ if $P_i$ is larger than a fixed threshold. This Track Initialization Module helps our proposed tracking system reduce the number of false-positive tracks, as depicted in Figure 2.
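Label generation and the inference decision can be sketched as below; the 2 m matching radius and 0.5 confidence threshold are assumed values for illustration:

```python
import numpy as np

def init_labels(det_centers, gt_centers, radius=2.0):
    """Training labels for the Track Initialization Module (Equation 19):
    y_i = 1 iff some ground-truth center lies within `radius` meters of
    detection i."""
    labels = np.zeros(len(det_centers), dtype=int)
    for i, c in enumerate(det_centers):
        if len(gt_centers) and np.min(np.linalg.norm(gt_centers - c, axis=1)) < radius:
            labels[i] = 1
    return labels

def should_initialize(p, threshold=0.5):
    """At inference, start a new track only if the confidence clears the threshold."""
    return p > threshold
```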

## IV Experimental Results

| Tracking method | Modalities | Overall | car |
|---|---|---|---|
| GNN3DMOT [weng2020gnn3dmot]* | 2D + 3D | 29.84 | - |
| PnPNet [liang2020pnpnet] | 2D + 3D | - | 81.5 |
| Our proposed method | 2D + 3D | 68.7 | 84.3 |

### IV-A Dataset

We evaluate our method on the NuScenes dataset [caesar2019nuscenes]. This dataset contains 1000 driving sequences. Each sequence has a length of roughly 20 seconds and contains keyframes sampled at 2Hz. For each keyframe, the dataset provides 3D bounding boxes, instance ids, and semantic class annotations. The dataset also provides the LiDAR sweeps and camera images captured by the full sensor suite.

### IV-B Evaluation Metrics

To evaluate our algorithm's performance, we use the Average Multi-Object Tracking Accuracy (AMOTA), which is also the main evaluation metric used in the NuScenes Tracking Challenge [caesar2019nuscenes]. AMOTA is the average of the tracking accuracy at different recall thresholds, defined as follows:

$$\text{AMOTA} = \frac{1}{n - 1} \sum_{r \in \{\frac{1}{n-1},\, \frac{2}{n-1},\, \dots,\, 1\}} \text{MOTAR}_r \tag{20}$$

where $n$ is the number of sample points and $r$ is the sampled recall threshold. MOTAR is the Recall-Normalized Multi-Object Tracking Accuracy, defined as follows:

$$\text{MOTAR}_r = \max\left(0,\; 1 - \frac{\text{IDS}_r + \text{FP}_r + \text{FN}_r - (1 - r)\,P}{r \cdot P}\right) \tag{21}$$

where $P$ is the number of ground-truth positives, and $\text{IDS}_r$, $\text{FP}_r$, and $\text{FN}_r$ are the numbers of identity switches, false positives, and false negatives at recall $r$.
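The two metrics can be computed as below (the recall sampling follows Equation 20; the official devkit implementation differs in details such as the minimum recall):

```python
import numpy as np

def motar(ids, fp, fn, p, r):
    """Recall-normalized MOTA at recall r (Equation 21).
    p: number of ground-truth positives."""
    return max(0.0, 1.0 - (ids + fp + fn - (1.0 - r) * p) / (r * p))

def amota(errors_at_recall, p, n=40):
    """Average of MOTAR over n-1 recall samples r in {1/(n-1), ..., 1}
    (Equation 20). errors_at_recall: function r -> (ids, fp, fn) counting
    errors at that operating point."""
    recalls = [k / (n - 1) for k in range(1, n)]
    return float(np.mean([motar(*errors_at_recall(r), p, r) for r in recalls]))
```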

### IV-C Quantitative Results

We report our results on the NuScenes validation set in Table I. Our proposed tracking method uses CenterPoint [yin2020center]’s 3D object detection results at each frame as the input to our Kalman Filters. We combine the Mahalanobis distance [mahalanobis1936distance] and the trainable LiDAR point cloud and camera image feature distances as the final metric for data association. For a fair comparison with state-of-the-art methods [yin2020center, chiu2020probabilistic, weng2019ab3dmot], we also include our tracking method’s quantitative results when using only LiDAR as input. From Table I, we can see that the quality of input detections is critical to the final tracking performance. CenterPoint [yin2020center] provides better 3D object detection results than MEGVII [zhu2019megvii] on the NuScenes Detection Challenge [caesar2019nuscenes]. Therefore, methods using CenterPoint [yin2020center]’s object detection perform much better than methods using MEGVII [zhu2019megvii].

As can be seen in the last two rows of Table I, when using exactly the same LiDAR-only 3D input, our tracking method outperforms CenterPoint [yin2020center] and ProbabilisticTracking [chiu2020probabilistic]. We conclude that our model is able to use the 3D LiDAR point cloud data to learn fine-grained geometric features of objects, and that it successfully learns effective combination weightings of the geometric feature distance and the Mahalanobis distance. Moreover, by fusing features from both LiDAR and image data, our proposed tracking method further improves the overall AMOTA, resulting in a 2.8 point performance gain over the previous state-of-the-art tracking method CenterPoint [yin2020center]. This performance gain shows that our model is able to learn how to effectively fuse the 3D LiDAR point cloud and 2D camera image input together to achieve better overall tracking accuracy.

We also compare our model with other multi-modal tracking models [weng2020gnn3dmot, liang2020pnpnet], as shown in Table II.

### IV-D Ablation Study

In this section, we provide an ablation analysis of the different trainable modules to better understand their contribution to the overall system performance: the Distance Combination Module, the Track Initialization Module, and the Feature Fusion Module. We report our results in Table III. We note that both the Distance Combination Module and the Track Initialization Module yield consistent improvements over the baseline, with the highest numbers achieved when both modules are enabled. Additionally, we record a consistent increase in performance when fusing 2D and 3D features, allowing us to conclude that our model successfully learns how to leverage both appearance and geometry features for object tracking.

| Tracking method | Modalities | Overall | bicycle | bus | car | motorcycle | pedestrian | trailer | truck |
|---|---|---|---|---|---|---|---|---|---|
| Distance Combination Module only | 3D | 67.1 | 46.3 | 81.9 | 84.2 | 63.8 | 74.9 | 53.5 | 65.4 |
| Track Initialization Module only | 3D | 66.2 | 45.1 | 78.4 | 84.2 | 66.6 | 75.1 | 52.7 | 61.2 |
| Our proposed method | 3D | 67.7 | 47.0 | 81.9 | 84.2 | 66.8 | 75.2 | 53.5 | 65.4 |
| Distance Combination Module only | 2D + 3D | 67.6 | 46.5 | 82.0 | 84.3 | 65.4 | 76.3 | 53.1 | 65.4 |
| Track Initialization Module only | 2D + 3D | 67.4 | 48.6 | 80.4 | 81.6 | 68.4 | 75.3 | 53.3 | 64.5 |
| Our proposed method | 2D + 3D | 68.7 | 49.0 | 82.0 | 84.3 | 70.2 | 76.6 | 53.4 | 65.4 |

### IV-E Qualitative Results

As indicated in Table I, our method improves over the state-of-the-art by 2.8 points in overall AMOTA. We note a much more significant improvement on specific classes (e.g. more than 10 points on the motorcycle class). We show a Bird’s Eye View (BEV) visualization of one sample sequence of motorcycles in Figure 2. We plot the bounding boxes of motorcycles from every frame of the same driving sequence on BEV images, with different colors representing different tracking ids, and compare our results with CenterPoint [yin2020center].

From Figure 2, we can see that our tracking results have significantly fewer false-positive bounding boxes compared with CenterPoint [yin2020center]’s results. CenterPoint [yin2020center] relies on the center Euclidean distance, and any unmatched detection box is always initialized as a new track. Conversely, our Track Initialization Module is designed to decide whether to initialize a new track based on the fusion of the 3D LiDAR and 2D image features. Additionally, our proposed method uses the Kalman Filter to refine the bounding box locations, orientations, and scales based on the past observation of the objects, while CenterPoint [yin2020center] directly uses the potentially noisy detection bounding boxes as the tracking results, without utilizing the past observations.

While quantitatively we record an 11.0 point increase in AMOTA on the motorcycle class compared to CenterPoint [yin2020center], qualitatively this translates to a significant reduction in the number of false-positive tracks, which are not penalized much by the AMOTA metric but which can be crucial for decision-making. The main reason behind this qualitative and quantitative discrepancy is that most of the false-positive tracks are composed of false-positive detection boxes with low confidence scores. AMOTA samples tracks in order from higher to lower confidence scores, so a large number of false-positive tracks with low confidence scores does not affect AMOTA much. For more details on the calculation of AMOTA, please refer to the NuScenes devkit [caesar2019nuscenes].

Figure 3 shows our tracking visualization results for motorcycles projected onto camera images. Panels (a) and (b) are two consecutive frames from one sequence; panels (c) and (d) are from another sequence. The white boxes represent the detections. The colored boxes indicate the tracking results, with different colors indicating different tracking ids. Our model accurately tracks the motorcycles in both sequences. In the first sequence, our Distance Combination Module predicts a larger positive feature-distance coefficient for the tracked motorcycle, indicating a more reliable feature distance, as the corresponding objects are large and clearly captured in the 2D images. In the second sequence, the module predicts a smaller coefficient for objects that are small and blurred in the 2D images. Additionally, our Track Initialization Module correctly decides not to initialize new tracks for the false-positive detections.

## V Conclusion

In this paper, we propose an online probabilistic, multi-modal, multi-object tracking algorithm for autonomous driving. Our model learns to fuse 2D camera image and 3D LiDAR point cloud features. These fused features are then used to learn effective weightings for combining the deep feature distance with the Mahalanobis distance for better data association. Our model also learns to manage track life cycles in a data-driven manner. We evaluate our proposed method on the NuScenes [caesar2019nuscenes] dataset. Our method outperforms the current state-of-the-art baseline both quantitatively and qualitatively.

For future work, we are looking to include information from additional modalities, such as map data. Furthermore, we will continue evaluating our overall approach with new object detectors, as we found that improved object detection quality also improves tracking quality. There is also potential for learning better per-category motion models that could further improve data association. Finally, we may leverage differentiable filtering frameworks to fine-tune the motion and observation models end-to-end with the algorithmic prior of recursive filtering.

## VI Acknowledgement

Toyota Research Institute ("TRI") provided funds to assist the authors with their research but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.